Which NLP & Text Analytics Companies Got Funded in 2015?

Who got funded, and for what?

Good sector research — tracking those questions and also the demand-side market opportunity — is the foundation for smart-money investment. And it’s good research that guides savvy product developers and marketers who have to square technical possibility with competitive market reality. Someone has to be willing to pay for what you’re building and selling, whether that someone is an investor who’s anticipating future profits or (even better!) a customer who likes and needs your work.

Visualization of Wikipedia editing activity by robot user Pearle. Image by Fernanda B. Viégas.

So it pays to follow the money, to track investment and M&A activity, and see what insights and trends you can infer.

In natural language processing (NLP) and text and sentiment analysis — as in other tech spaces — investment funds technology productization, company building, and market entry and expansion. Academic and government research funding is important, but it’s angel, VC, institutional, and acquisition money that’s the best indicator of growth directions. It’s that money — investment in emerging text analytics players — that I’ve tracked for 2015, just as in previous years.

Here is my compilation, with overlapping/adjacent fields, including search and semantic applications, part of the mix. I include brief elaborations and links to sources. Where you place your bets is up to you.

Follow the Money: 2015 Investments in NLP and Text and Sentiment Analysis

I’ll preface my list by offering industry and application focal points, a set unsurprisingly headed by analysis of online and social media, followed, equally unsurprisingly, by funding for broadly usable tech tools, split out in two forms as second and third on my focal-point list. My categorization of 2015 funding and acquisition events is as follows:

  • five are in social/media analysis (Sysomos, Signal, NetBase, Viralheat, Strossle)
  • four deliver text/sentiment analysis as-a-service, via an API, or a toolkit (Bitext, Sentisis, Cortical.io, MonkeyLearn)
  • four are broadly multi-functional (Dataminr, Linguasys, Treparel, Sensai)
  • four involve AI for speech or conversational interfaces (Wit.ai, Api.ai, DigitalGenius, Semantic Machines)
  • four are in media & publishing (Automated Insights, Appinions, Wibbitz, TEMIS)
  • three are in HR (Textio, Jobandtalent, Textkernel)
  • three are in life sciences (Health Fidelity, TEMIS, NarrativeDx)
  • three break out as having a customer experience/consumer insights focus (Medallia, NewBrandAnalytics, OdinText)
  • one specializes in legal information (Lex Machina)
  • one involves office tools (Equivio)
  • one focuses on personality profiling (Receptiviti)
  • one focuses on financial news (Clueda), and
  • two I’ll break out as early-stage startups (Semantile, Ingen.io)

That’s a broad distribution, and yes, I’m double-counting TEMIS. It’s also a distribution that doesn’t include any home runs. If you’re looking for a billion-dollar IPO, don’t pin your hopes on text analytics, although per my May 2015 VentureBeat article, Where are the text analytics unicorns?, a few solution-focused, text-analytics-reliant companies are on break-out track (namely Medallia and Clarabridge; Sprinklr too). Plus NLP is at the core of just about everything Google and Baidu do and much of Facebook’s, IBM’s, Amazon’s, and Microsoft’s businesses.

Review my funding list and assess for yourself:

  1. “Facebook Acquires Wit.ai To Help Its Developers With Speech Recognition And Voice Interfaces.” (January 5, 2015)

  2. Health Fidelity took a $19.26 million investment from UPMC, the University of Pittsburgh Medical Center. (January 9, 2015) The company applies “groundbreaking NLP and analytics technology,” based on the MedLEE system licensed from Columbia University, as part of a “risk adjustment solution.”

  3. “Microsoft acquires text analysis startup Equivio, plans to integrate machine learning tech into Office 365.” (January 20, 2015)

  4. “Inveready leads a $900,000 Seed Round for Bitext,” which is a Spanish text/sentiment analysis specialist. (January 29, 2015)

  5. “Microsoft, Amazon vets raise $1.5M to help recruiters optimize job postings with Textio” (February 2, 2015) followed by “Textio Raises $8 Million in Series A Financing Led by Emergence Capital.” (December 16, 2015)

  6. “Newly Independent Sysomos Re-emerges to Transform Social Intelligence.” (February 10, 2015) A key distinction for Sysomos is that the company’s social-media platform was built around its own text analytics.

  7. In the natural language/content generation world: “Vista Acquires Automated Insights, The Startup Behind The AP’s ‘Robot’ News Writing,” for a reported $80 million in cash. (February 12, 2015)

  8. With a media-analysis focus, applying NLP, UK company Signal took a $1.4 million seed round (October 9, 2015), following £1.2m in seed funding in September 2014 (reported as “Signal Raises $1.85 Million Seed Round” in March 2015).

  9. Strong social media analysis involves text analytics: “Dataminr Raises $130 Million in Growth Capital from Leading Financial Industry Investors.” (March 17, 2015)

  10. “NetBase Lands an Additional $24M in Series E Funding” (March 13, 2015) followed by “NetBase Secures $9M in Additional Series E Funding.” (April 29, 2015) My reading is that NetBase, which is built around sophisticated NLP and knowledgebase technology, is aiming for an IPO as soon as market conditions are right.

  11. “¿Qué Es? Sentisis’ NLP SaaS Platform For Spanish Pulls In $1.3M.” (March 18, 2015) Sentisis is an Argentine start-up that is working to build a customer base in Latin America and North America.

  12. “Cision Acquires Viralheat to Provide the Industry’s Most Comprehensive Social Suite.” Cision is a PR/media measurement conglomerate. (March 23, 2015)

  13. “Sensai Raises $900K To Help Data Scientists Query Unstructured Data.” (Reported as $1.3 million seed funding by CrunchBase.) (March 31, 2015) Co-founder Monica Anderson has a long background in machine language understanding.

  14. “Content Marketing Platform ScribbleLive Acquires Appinions.” (April 15, 2015) Appinions started life in 2007 as Jodange, commercializing opinion-mining work by Cornell University computational linguist Prof. Claire Cardie.

    Treparel’s KMX Patent Analytics interface

  15. “Research and Analytics Firm Evalueserve acquires Treparel,” a Netherlands text analytics provider. (May 4, 2015)

  16. “MonkeyLearn is a machine learning platform on the cloud that allows software companies and developers to easily extract actionable data from text,” a spin-off of development shop Tryolabs. The company took $250 thousand in seed funding, adding to $300 thousand in 2014 seed funding. (May 27, 2015)

  17. Clueda AG, a German financial news analysis specialist, sold itself in May 2015 to Baader Bank AG. Clueda was founded in 2012, applying cognitive methods to real-time text analysis.

  18. “Expert System has signed an agreement to acquire French text analytics company TEMIS for €12m EV (€4m cash, €1m debt acquired and €7m in shares)” (May 6, 2015) followed by “Expert System Announces €5 million bond issue with 4% fixed annual interest rate, maturing in 2024.” (July 31, 2015) The combined Italian-French company is actively expanding in North America.

  19. Israeli start-up Wibbitz, which applies NLP to the task of generating video from text, took an $8 million Series B round. (May 21, 2015)

  20. For its platform, which applies “linguistic analysis and machine learning,” “Jobandtalent Secures a $25M Boost!” (May 25, 2015)

  21. Social customer experience provider Sprinklr acquired NewBrand Analytics in June. See CEO Ragy Thomas’s “Introducing NewBrand: Why the Industry’s Leading Location-Specific Text Analytics Means Better Customer Experience.” (June 2, 2015)

  22. NarrativeDx took $650 thousand in seed funding in June, followed by $718 thousand in debt financing in July, according to CrunchBase. (June 8, 2015) (I suspect the $800 thousand seed funding listed is a duplication.) The start-up applies NLP for sentiment analysis to patient surveys, social media, and other material.

  23. Early-stage Ingen.io, which is “building a ‘cognition cloud’ with a knowledge graph representation of the world at the center,” took €200 thousand in seed funding (June 8, 2015) and later in the year won a €100 thousand grant from ODInE – The Open Data Incubator.

  24. “Artificial intelligence startup DigitalGenius raises $3M to help companies automate customer service.” (June 22, 2015) TechCrunch’s report puts it this way: “DigitalGenius is building an automated customer service platform driven by artificial intelligence (AI) and natural language processing (NLP).”

  25. Market research specialist OdinText received Connecticut Innovations financing via a convertible note, amount not disclosed but likely in the $0.5 to $1 million range. (July 1, 2015)

  26. “Medallia announces unicorn status with $150 million funding round.” (July 21, 2015) Medallia is a customer experience solution provider with strong own-built text analytics capabilities.

  27. “CareerBuilder Joins Forces with Textkernel to Deliver Powerful, Multilingual Semantic Search and Matching Technologies to Recruiters,” acquiring a majority stake. (July 21, 2015)

  28. “Aspect Software Announces Acquisition of the Technology Assets of LinguaSys, a Leading Provider of Natural Language Understanding (NLU) and Interactive Text Response (ITR) Technology.” (August 11, 2015) LinguaSys deploys the Carabao language toolkit acquired from Melbourne, Australia’s Digital Sonata along with the services of DS’s Vadim Berman.

  29. “Api.ai Raises $3M For Its Siri-Like Conversational UI, Makes Developer Usage Free.” (August 19, 2015)

  30. Receptiviti, which aims to help you “understand the people behind their language data,” took $690 thousand in a venture round. (August 20, 2015) The company commercializes Univ. of Texas Prof. James Pennebaker’s Linguistic Inquiry and Word Count (LIWC) software.

  31. “Sprinkle Acquires Saplo and Changes Name,” to Strossle. (August 31, 2015) Text analytics provider Saplo was founded in 2008. The name change from Sprinkle to Strossle? Must be a Swedish thing.

  32. “Cortical.io: USD 1.8 million for brain-inspired algorithm made in Austria.” (October 30, 2015) The company’s aim is to deliver real-time NLP via the Retina engine.

  33. Lex Machina applies proprietary NLP and machine learning to extract information from legal materials. The company was acquired by LexisNexis for an undisclosed amount, possibly in the $30-35 million range. (November 23, 2015)

  34. “Artificial intelligence startup Semantic Machines raises $12.3 million.” Semantic Machines is building out “conversational AI” technology. (December 23, 2015)

  35. Semantile, “a stealth-mode enterprise semantic relevance engine company,” took a seed funding round late in the year. (December 28, 2015)

I’ll add that my count is that, of these thirty-five, twenty-three are US-based, nine have their homes in Europe, and one each is Argentine, Canadian, and Israeli.


But that’s not all. Many companies apply third-party text analytics as a core part of their product lines; Tracx, Coveo, and Ontotext, listed below, are among them. By contrast, Clarabridge and IBM have proprietary text analytics and made buys that expand platform capabilities, in social analytics and search relevance, respectively.

  1. “Tracx raises $18 million in funding.” Tracx’ business is social media analytics and social relationship management; the platform relies on outside text analysis. (February 2, 2015)

  2. IBM acquired the blekko “advanced Web-crawling, categorization, and intelligent filtering technology” and team and added it to the Watson stack. (March 27, 2015) Check out “Data, Data, Everywhere – Now a Better Way to Understand It.”

  3. “Clarabridge Acquires Engagor,” not a text analytics play, but rather a social-analytics buy that expands text analytics/customer experience provider Clarabridge’s portfolio and European presence both. (May 21, 2015)

  4. Precyse, which relies on NLP within a set of health information management solutions, was acquired by Pamplona Capital Management for an undisclosed amount. (June 30, 2015)

  5. “Coveo, a recognized leader in intelligent search, has secured $35 million in Series D financing to fund the company’s aggressive growth,” according to a company press release. While Coveo relies on outside NLP software, we’re in close-enough territory for inclusion here. (November 5, 2015)

  6. Sirma Group Holding, the parent of graph database vendor Ontotext, went public on the Bulgarian Stock Exchange. (November 23, 2015) Ontotext has strong text analytics capabilities, albeit, like Coveo’s, sourced from outside (the GATE project).

A 2016 Start

Finally, a couple of 2016 events worth noting:

  1. French company “Dictanova raises 1.2 million euros to accelerate development.” (January 8, 2016) Dictanova was founded in 2011 to commercialize NLP tech developed at the Univ. of Nantes, for Voice of the Customer applications.

  2. Customer experience text-analytics provider Attensity sold its Europe arm Attensity Europe GmbH, to CEO Thomas Dreikauss and IMCap Partners for an undisclosed amount. (January 13, 2016)

See Also…

That’s it for my run-down of 2015 investment and M&A activity in NLP and text analytics. Thanks for reading, and please do tip me off to any activity I’ve missed. And for additional insights, for the year to come, see my 2016 technology and market directions assessment, Text, Sentiment & Social Analytics in the Year Ahead: 10 Trends.

The Rest of the Qlik Data Narratives Story

I feel I own a small piece of the Narratives for Qlik story, because Narratives makes real certain BI/IT points I’ve promoted for years.


Illustrating Narratives for Qlik

QlikTech is a business intelligence software provider. The story is that an extension from Narrative Science dynamically generates explanatory text that is seamlessly integrated into Qlik Sense data visualizations. Per my quote in the launch press release: “Narratives for Qlik – tech that turns your data into a story – is a groundbreaking BI innovation. The ability to provide a new means of communicating insights is really compelling.”

What’s unwritten — the rest of the story — concerns Qlik’s distinctive architecture. Qlik isn’t just a visual-BI tool, it’s an extensible data-analysis platform. You can build on it, you being software developers, consultants, and everyday users and not just Qlik staff. And you can distribute your extensions via a marketplace, Qlik Market, that’s open to anyone.

Qlik’s leading visual-BI rival, Tableau, has a support community, and TIBCO Spotfire both hosts a community and provides company-built extensions (here’s an interesting one for JavaScript graphics), but neither company supports anything close to the level of open partnership enabled by Qlik’s platform architecture and encouraged by Qlik company culture.

But wait, there’s even more to the story!

Qlik superstar Donald Farmer’s blog article, Off The Charts!, ably makes the case for the Narratives for Qlik ability to provide “a textual interpretation of your visual content.” Donald communicates key advantages provided via added narrative text, but I think I can add a couple that he missed.

Donald writes that “even good visualizations have three important limitations”:

  1. A need for explanation: “The viewer must understand the visual language in use,” but “not everyone shares [a data analyst’s] insightful reading of even a simple bar chart.”
  2. “A narrative gloss, describing [nuanced chart] details, can be a handy addition to the visual overview.”
  3. “Visualizations simply do not tell the whole story in your data. They cannot capture the flow and context of a human conversation which is, in fact, our most fundamental form of collaboration.”

The additional advantages I see — my even more to the story — are captured in two use cases:

  1. BI accessibility to persons with visual impairments. Standards such as the U.S. government’s Section 508 mandate data-delivery in a form that can, for instance, be read via assistive devices. (For a strong, BI-relevant background, read Stephen Few’s blog article Data Visualization and the Blind.) Most often, this form is a data table, containing row- and column-labeled text. But even easier than that, I’d say, would be a narrative rendering of chart contents, which is exactly what Narratives for Qlik delivers.
  2. Presentability! Have you ever tried to explain a complex data graphic to an audience? There’s often a lot of pointing and hand-waving involved. You risk bombing if you improvise an explanation on the fly, if you don’t study and prepare. But that explanation: It’s precisely what Narratives for Qlik will generate for you automatically. Whether you show the narrative during your presentation or use it in your preparation, I imagine Narratives for Qlik will provide much-appreciated presentability help.

(I have seen suggestions that Tableau visualizations are not Section 508 compliant. At least one company response was evasive.)
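The two use cases above hinge on one capability: rendering a chart’s underlying data series as prose. Here is a minimal Python sketch of the idea; the data, function name, and sentence template are all invented for illustration, and a product like Narratives for Qlik does far more (trend detection, context, style).

```python
# Toy narrative generator: turn a labeled data series into a sentence,
# the kind of chart-to-prose rendering discussed above.

def narrate(series, metric):
    """Summarize a {label: value} series in one sentence."""
    best = max(series, key=series.get)    # label with the largest value
    worst = min(series, key=series.get)   # label with the smallest value
    total = sum(series.values())
    return (f"Total {metric} was {total:,.0f}; "
            f"{best} led with {series[best]:,.0f}, "
            f"while {worst} trailed at {series[worst]:,.0f}.")

sales = {"North": 42000, "South": 31000, "East": 27000, "West": 35000}
print(narrate(sales, "sales"))
# Total sales was 135,000; North led with 42,000, while East trailed at 27,000.
```

Even this toy version yields text a screen reader can speak or a presenter can rehearse from, which is the point of both use cases.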

Myself, I have long preached IT accessibility — I got a strong dose of the gospel when I helped build the Census Bureau’s first American FactFinder Web site starting in 1998 — as well as the advantages of open platform architectures and the concept of data storytelling. These three elements build out the rest of the Qlik data narratives story.

Disclosure: I consulted to Qlik in the latter half of 2012, into 2013, regarding strategies for bringing text analytics to the Qlik platform.

Pitch Me and I’ll Want Answers to These 12 Questions

I get pitched a lot. No, I’m not giving out money; the companies that contact me crave a different currency: attention. And if not attention or funding, contacts are after advice, referrals, feedback, endorsements, or help in the form of a conference talk, article, or social posting.

"Babe Ruth Boston pitching" by Frances P. Burke - Francis P. Burke Collection. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Babe_Ruth_Boston_pitching.jpg#/media/File:Babe_Ruth_Boston_pitching.jpg

“Babe Ruth Boston pitching” by Frances P. Burke

I’m not special. Thousands of us — analysts, writers, and other industry intermediaries who listen, interpret, and then advise tech companies and buyers — get pitched. Long live influencer marketing!

Myself, I welcome the contacts; often I initiate them. I and other industry analysts aim both to stay on top of trends and to quickly grasp the reality and potential of the new (or improved) tech, solutions, and companies we encounter.

You need our help, and we need yours. Let’s set up a briefing!

I especially like hearing from start-ups, whether they’re seeking to craft and position a product or are on a quest for early-stage funding. These folks are often unsure of market needs and the competitive landscape. Sometimes they need help valuing, packaging, and pricing their assets and devising a route to market. Other comers, by contrast, are oversure and in evangelism overdrive. They’re not seeking advice; they know what story they want to tell.

In all cases, ability to communicate value proposition and differentiation is key. Best if you know the competitive scene and tech trends, including what research is close to commercialization, and are able to say how you’re different and better and going to succeed. But if you build your pitch around competition and trends alone — “We’re going to disrupt Sector X with our patent-pending deep learning gobbledygook,” or something like that — you’d better have serious star-power and a track record and something to show. If not, your message will fall flat.

So if you brief me — whether your company is early-stage or established or in-between — whether you’re seeking money, advice, exposure, or connections — use the opportunity to explain why and how you’re better. Make sure also to cover the basics.

Key points will be covered, I think — basics, differentiators, and promise — if you address…

12 things I’ll want to know if you brief me (start-up version):

  1. Who are you — you, your team, advisors, and partners?
  2. What are you selling or planning to sell?
  3. What assets do you own — IP, codebase, reputation?
  4. What outside assets do you rely on, whether commercially obtained or free, open source?
  5. What’s your target market? What business problems or technical challenges do you address and for whom?
  6. What’s your route to market? That is, how do customers find and buy from you, or how do you intend that they will?
  7. Who are a few notable customers, if you’re already out in the market, and how are they using your tech?
  8. Who’s the competition?
  9. What makes you special?
  10. How might you fail?
  11. Could you tell me about your development roadmap?
  12. What questions do you have for me?

Keep in mind, I’m rarely looking (only) for straight answers to these or any other questions. Nor is a prospective investor, business partner, or customer. Part of the exercise is to gauge your confidence, competence, planning ability, and, frankly, whether you’re for real. Evasive non-answers are telling, by the way.

But believe me, I know that I’m not the sole judge of reality. No single analyst, writer, investor, or advisor is. If you can create your own reality, cool! We’re all in this game to learn. Myself, if you teach me and convince me, you’ll help me do my own job and you’ll help me help you. I’m sure others like me think the same.

Text, Sentiment & Social Analytics in the Year Ahead: 10 Trends

Text, sentiment, and social analytics help you tune in, at scale, to the voice of the customer, patient, public, and market. The technologies are applied in an array of industries — in healthcare, finance, media, and consumer markets. They distill business insight from online, social, and enterprise data sources.

It’s useful stuff, insight extracted from text, audio, images, and connections.


The analytics state-of-the-art is pretty good although uptake in certain areas — digital analytics and market research are examples — has lagged. But even in areas of strong adoption such as customer experience and social listening and engagement, there’s room for growth, for both technical innovation and more-of-the-same uptake. This market space means opportunity for new entrants and established players alike.

We could examine each analytics species in isolation, but better to look at the full, combined impact area. The technologies and applications overlap. Social analyses that neglect sentiment are incomplete, and to get at online, social, and survey sentiment, you need text analytics.
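To make “you need text analytics” concrete, here is the simplest possible form of sentiment scoring: a toy lexicon lookup in Python. The six-word lexicon is invented for illustration; real systems add negation handling, intensifiers, and machine-learned models.

```python
# Toy lexicon-based sentiment scorer: sum per-word scores over a text.
import re

LEXICON = {"great": 1, "love": 1, "helpful": 1,
           "slow": -1, "broken": -1, "terrible": -1}

def sentiment(text):
    """Tokenize, then sum lexicon scores; unknown words score zero."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

print(sentiment("Love the product, but support is slow and the app is broken"))
# -1: love (+1) is outweighed by slow (-1) and broken (-1)
```

The mixed-sentiment example shows why this is only a starting point: a single score hides that the post praises the product while panning support.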

This article, a look-ahead technology and market assessment, surveys high points for the year to come, with a run-down of —

10 text, sentiment, and social analytics trends to watch for in 2016

  1. Multi-lingual is the rule. While English-only analytics hold-outs remain — and true, it’s better to do one language really well than to cover many poorly — machine learning (ML) and machine translation have facilitated the leap to multi-lingual, the new norm. But if you do need to work across languages, do some digging: Many providers are strong in core languages but weak in others. Choose carefully.
  2. Text analysis gains recognition as a key business-solution capability — for customer experience, market research and consumer insights, digital analytics and media measurement — and providers will increasingly compete on their analytics. Build, buy, or subscribe: all are viable options. While you could call this trend point “quantified qualitative,” what really matters is that text analysis is baked into the business solution.
  3. Machine learning, stats, and language engineering coexist. Tomorrow belongs to deep learning — to recurrent neural networks and the like — but today long-established language-engineering approaches still prevail. I’m referring to taxonomies, parsers, lexical and semantic networks, and syntactic-rule systems. (Two of my consulting clients are commercializing in these areas: eContext, providing taxonomy-based classification infrastructure, and Contextors, implementing a very-high-precision English-language parser.) So we have a market where “a thousand flowers bloom, a hundred schools of thought contend…” and even co-exist. Cases in point: Even crowd-sourcing standard-bearer CrowdFlower is embracing machine learning, and start-up Idibon makes a selling point of combining traditional and new: “you can construct custom taxonomies and tune them with machine learning, rules, and your existing dictionaries/ontologies.”
  4. Image analysis enters the mainstream. Leading-edge providers are already applying the tech to deciphering brand signals in social-posted media — check out Pulsar and Crimson Hexagon — and image analysis ability, via deep learning, was a major selling point in IBM’s 2015 AlchemyAPI acquisition. Indeed, hot ML start-up Metamind pivoted in 2015 from NLP to a focus on image analysis, recognizing the extent of the opportunity.
  5. A break-out for speech analytics, with video to come. The market loves to talk about omni-channel analytics and about the customer journey, involving multiple touchpoints, and of course social and online media are awash in video. The spoken word — and non-textual speech elements including intonation, rapidity, volume, and repetition — carry meaning, accessible via speech analysis and speech-to-text transcription. Look for break-out adoption in 2016, beyond the contact center, by marketers, publishers, and research & insights professionals and as an enabler for high-accuracy conversational interfaces.
  6. Expanded emotion analytics. Advertisers have long understood that emotion drives consumer decisions, but until recently, broad, systematic study of reactions has been beyond our reach. Enter emotion analytics, either a sentiment analysis subcategory or sister category, depending on your perspective. Affective states are extracted from images and video via facial-expression analysis, or from speech or text, with the aim of quantifying emotional reactions to what we see, hear, and read. Providers include Affectiva, Emotient, and Realeyes for video, Beyond Verbal for speech, and Kanjoya for text; adopters in this rapidly expanding market include advertisers, media, marketers, and agencies.
  7. ISO emoji analytics. Given text, image, speech, and video — and Likes — why use emoji? Because they’re compact, easy to use, expressive, and fun! Like #hashtags, they complement and add punch to longer-form content. 💌! That’s why Internet slang is dead (ROTFL!), why Facebook is experimenting with emoji Reactions, and why (more of a good thing) we’re seeing variants like Line stickers. Needed: emoji analytics. The tech is emerging, via start-ups such as Emogi. (Check out Emogi’s illuminating 2015 Emoji Report: 🎯). Although (⚠️) most others don’t go beyond counting and classification to get at emoji semantics — the sort of analysis done by Instagram engineer Thomas Dimson and by the Slovene research organization CLARIN.SI — some of them, for instance SwiftKey, deserve a look. More to come in 2016!
  8. Deeper insights from networks plus content is both a 2016 trend point and most of the title I gave to a 2015 interview with Preriit Souda, a data scientist at market-research firm TNS. Preriit observes, “Networks give structure to the conversation while content mining gives meaning.” Insight comes from understanding messages and connections and how connections are activated. So add a graph database and network visualization tools to your toolkit — there’s good reason Neo4j, D3.js, and Gephi (to name a few open-source options) are doing well, and building on a data-analytics platform such as QlikView is also an option — to be applied in conjunction with text and digital analytics: A to-do item for 2016.
  9. In 2016, you’ll be reading (and interacting with) lots more machine-written content. The technology is called natural language generation (NLG); the ability to compose articles — and e-mail, text messages, summaries, and translations — algorithmically from text, data, rules, and context. NLG is a natural for high-volume, repetitive content — think financial, sports, and weather reporting, and check out providers Arria, Narrative Science, Automated Insights, Data2Content, and Yseop — and also to hold up the machine’s end of your conversation with your favorite virtual assistant — with Siri, Google Now, Cortana, or Amazon Alexa — or with an automated customer-service or other programmed response system. These latter systems fall in the natural-language interaction (NLI) category; Artificial Solutions is worth a look.
  10. Machine translation matures. People have long wished for a Star Trek-style universal translator, but while 1950s researchers purportedly claimed that machine translation would be a solved problem within three or five years, accurate, reliable MT has proved elusive. (The ACM Queue article Natural Language Translation at the Intersection of AI and HCI nicely discusses the machine translation state of the human-computer union.) I wouldn’t say that the end is in sight, but thanks to big data and machine learning, 2016 (or 2017) should be the year that major-language MT is finally good enough for most tasks. That’s an accomplishment!
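Trend 7’s distinction between merely counting emoji and classifying them can be illustrated in a few lines. This Python sketch tallies emoji per coarse sentiment class rather than per glyph; the emoji-to-sentiment mapping and the sample posts are invented for the example, and real emoji semantics is far richer.

```python
# Toy emoji analytics: count emoji across posts, grouped by sentiment class.
from collections import Counter

EMOJI_SENTIMENT = {"😀": "positive", "🎯": "positive", "👍": "positive",
                   "😡": "negative", "😢": "negative"}

def emoji_profile(posts):
    """Tally emoji occurrences by sentiment class over a list of posts."""
    counts = Counter()
    for post in posts:
        for ch in post:                  # each emoji here is one code point
            if ch in EMOJI_SENTIMENT:
                counts[EMOJI_SENTIMENT[ch]] += 1
    return dict(counts)

posts = ["Launch day 😀🎯", "Service is down again 😡", "😀😀 thanks team"]
print(emoji_profile(posts))
# {'positive': 4, 'negative': 1}
```

Note the simplification: multi-code-point emoji (those carrying variation selectors or skin-tone modifiers) need segmentation smarter than per-character iteration, one reason the serious analyses cited above go beyond counting.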

Every one of these trends will affect you, whether directly — if you’re a text, sentiment, or social analytics researcher, solution provider, or user — or indirectly, because analysis of human data is now woven into the technology fabric we rely on every day. The common thread is more data, used more effectively, to create machine intelligence that changes lives.

Time to Bring Knowledge to Knowledge Management: #NewKM

My first reaction, on encountering a recent article, Why the Timing is Right for Knowledge Management Portals, was a sinking feeling — “been there, done that” — a reaction both to the idea of resurrecting the failed portal concept and to the thought that respectable folks still see knowledge as manageable, in this, the Internet era, when facts, opinions, and expertise move at light speed.

There’s little exciting in KM as the industry has long (although serviceably) conceived it. Industry’s idea is that an enterprise can beneficially manage knowledge by a) storing and organizing documents and providing a search function and b) cataloging employee abilities and facilitating collaboration. This approach works for some, but in my view, it delivers half-truths. It ignores the information inside documents. It ignores enterprise-relevant knowledge and expertise that resides outside an organization’s boundaries, out in the wild-and-woolly online and social universe. It largely ignores the social voice of the customer, business partner (and competitor) information, the wisdom of communities of practice and industry authorities, and the like.

KM’s shortcomings aren’t going to be overcome solely, or primarily, by better data hygiene or consistent approaches to applying metadata, or by putting a new face — a reworked portal — on the same old searchable document sets. What’s needed?

A NewKM Need

My view: It’s time to bring knowledge to knowledge management, via:

  1. Analytics, specifically, exhaustive information extraction (and not just searchable documents) and then data mining to identify links and associations;
  2. An end to artificial boundaries, to neglect of extra-mural information;
  3. Purpose-driven, ad-hoc communities and collaboration (and not just rosters of experts); and
  4. Actual facts and connections, as captured in social and knowledge graphs.

The building blocks of the property graph, per Neo4j

(Information extraction is the resolution of entities, pattern-based information such as events, topics, concepts, sentiment, and relationships of interest within source media, whether text, images, audio, or video. IE may involve structural, statistical, and machine learning (ML) methods — that is, machine intelligence or AI. Whatever the method applied, the aim is to discover relevant data wherever it occurs.

(The graphs I’m referring to are network and property graphs, data structures that capture entities of interest — whether people, places, and organizations or products, components, and parts or something else — and their attributes and interconnections, as nodes, annotations, and edges.)
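
To make the structure concrete, here’s a minimal property-graph sketch in Python. The node identifiers, labels, and attributes are invented for illustration; a production system would use a graph database such as Neo4j rather than rolling its own:

```python
# A minimal property graph: nodes carry attribute dicts; edges carry a label
# plus their own attributes. All identifiers and data are invented examples.
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node id -> {attribute: value}
        self.edges = []   # (source id, label, target id, {attribute: value})

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, label, dst, **attrs):
        self.edges.append((src, label, dst, attrs))

    def neighbors(self, node_id, label=None):
        """Targets of edges leaving node_id, optionally filtered by label."""
        return [dst for src, lbl, dst, _ in self.edges
                if src == node_id and (label is None or lbl == label)]

g = PropertyGraph()
g.add_node("acme", kind="organization", sector="manufacturing")
g.add_node("jdoe", kind="person", role="analyst")
g.add_node("widget-9", kind="product")
g.add_edge("jdoe", "WORKS_FOR", "acme", since=2014)
g.add_edge("acme", "MAKES", "widget-9")
```

Because facts live in nodes and edges rather than in documents, queries like “who works for Acme?” become graph traversals rather than keyword searches.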

Seth Earley, whose “Why the Timing…” portal article I cited, is deservedly a recognized information-management authority. His views and mine do align to an extent, judging from a concluding line in his article. Seth observes that “organizations are weighed down with legacy technologies” and that access difficulty stems primarily from “the underlying structures of corporate content and data.” He expects that “knowledge management portals will continue to evolve with machine learning, natural language processing (NLP), and social collaboration integration.” Replace “management portals” in that sentence and you’re golden. It’s knowledge we should focus on! Rewriting: Knowledge discovery will continue to evolve with machine learning (ML), natural language processing (NLP), and social collaboration integration.

As for portals as an access mechanism, they will remain a choke-point, given all the enterprise-relevant information they can’t get at. And while the tired document-centered data structures that sit behind KM portals will become more flexible via ML, NLP, and collaboration, what you get out of them will continue to be records rather than knowledge.

Knowledge — search-retrieved, interrelated facts — as captured in Google’s Knowledge Graph

Some Get It, Somewhat, and Some Don’t

Judging from the agenda of last November’s KMWorld conference (where I spent a day), unlike Seth Earley, the broad KM community largely doesn’t get analytics, openness, networks, or knowledge bases. The KM community seems largely inward-focused, ignorant of the applicability to KM of the machine intelligence innovations discussed above, which, not incidentally, have long since been proven by Google, Facebook, IBM, and a host of providers in the semantic space, as well as by businesses applying them in customer experience, consumer insights, social/media analysis, life sciences, and a spectrum of other initiatives. But fortunately there are KM exceptions. For one —

Safeharbor Knowledge Solutions, which I learned about via a Brainspace blog article, differentiates document management and (true) knowledge management, explaining,

A knowledge base is not just a document repository – it’s a body of knowledge that is continuously evolving. Knowledge consists of answers shared by experts, information hidden away in emails, ideas and feedback found in article comments and community forum discussions. A knowledge base application is designed to capture knowledge as it’s created and make it easy to find.

So that’s my ingredient #4, above.

While Brainspace isn’t positioned as a KM provider — text analytics forms the core of their product line — the blog article I mentioned relates to KM: The Key to Knowledge Management and Innovation is Knowledge Flow, Part 1. Brainspace’s Flow concept extends to external market intelligence and enterprise social networks so let’s award Brainspace half a point on ingredient #2.

Brainspace gets more points for pointing us to another article, Build Better Knowledge Management, by Christian Buckley, a long-time KM industry participant. Buckley writes,

The problem with knowledge management (KM) is… a user experience that fails to align the needs of the complex, non-linear playback mechanisms of the human brain with our systems of record…

To build the next generation KM platform, we need solutions that can:

  • Improve the distribution of knowledge and ideas, quickly and seamlessly
  • Automatically identify patterns and themes within that content
  • Expand upon, refine and convert that knowledge based on those patterns, and in context to our requirements, ultimately making it searchable (i.e. findable)
  • Correlate those patterns and themes, and take appropriate action — with those actions also tracked and measured, as an extension of the ideas
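
Of these capabilities, automatically surfacing patterns and themes is the most tractable with today’s tooling. Here’s a toy sketch (the corpus and stopword list are invented) that surfaces recurring themes by simple term counting; real platforms would apply topic modeling or clustering, but the principle is the same:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "for", "our", "but",
             "this", "that", "report", "remains", "top"}

def themes(docs, top_n=3):
    """Return the most frequent non-stopword terms across a document set."""
    terms = [w for doc in docs
             for w in re.findall(r"[a-z']+", doc.lower())
             if w not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(top_n)]

docs = [
    "Customers praise the onboarding experience but report billing errors.",
    "Billing errors dominate this quarter's support tickets.",
    "Onboarding feedback is positive; billing remains the top complaint.",
]
```

Calling `themes(docs)` on the sample corpus ranks “billing” first — a crude but honest illustration of extracting a pattern from content rather than merely making the documents searchable.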

Brainspace’s Brandon Gadoci writes, however, that “Buckley is speaking wishfully. Most companies have no such platform.” True, and that’s a challenge and an opportunity, to bring knowledge — patterns and themes, refined, findable, supporting action — to knowledge management. What portal is going to, per Christian Buckley, “improve the distribution of knowledge and ideas, quickly and seamlessly”?

But all the same, none of this KM-insider evangelism breaks out of the enterprise-as-knowledge-island KM self-limitation. For that, we should seek truly new elements in…

A NewKM Agenda

The agenda that will advance KM involves analytics and information extraction — bridging boundaries, crossing into the online and social world — and an admission that records and documents are merely containers and that searchability does not constitute knowledge. The tech to support this agenda is out there, freely available and quite capable, flexible, and performant. So a NewKM agenda would have to start with the realization that a closed mindset — records as closed books and needless barriers — hinders knowledge management. Extract knowledge from documents. Structure it for query and analysis and not just retrieval. Work, collaboratively, across corporate boundaries. And find the knowledge in knowledge management.

There’s Truth in Big Data, But Not (Just) What You’d Expect

An article by “Dean of Big Data” Bill Schmarzo, The Mid-market Big Data Call to Action, provides a helpful quick take on the state of big data uptake, contrasting perceived experiences at smaller and larger organizations.


Big Bang Data exhibit at CCCB. Photo by Kippelboy.

Bill presents certain truths that are independent of big data — “It is easier for smaller organizations to drive cross-organizational collaboration and sharing” and “Smaller organizations have a better focus on delivering business results,” for example — yet also illuminating are certain implicit assumptions, points we should be challenging.

In the spirit of friendly discussion, I’ll offer a few data truths I see that are contrary to the article’s premises:

  1. It’s no longer acceptable to equate big data and Hadoop. (Not that I think it ever was.)

    Bill conflates the two in the conclusion he draws from responses to a webinar poll question. He asks, “Where are you in the process of integrating big data with your existing data warehouse environment?” — nothing about Hadoop there — and then concludes, citing poll results, “Over 80% of the attendees still do NOT have any meaningful Hadoop plans.”

    Nowadays, translated into particular technologies, big data means Hadoop, Spark, and Kafka — plus other technologies in their ecosystems of course — or non-Apache software with similar volume, velocity, and variety handling capabilities.

    And yes, I’m with Bill: There are still only 3 defining Vs for big data.

  2. Hadoop and other big data technologies can and do exist outside the data warehouse environment.

    I note that Bill concludes that the 80% who respond to the above question either “in early discussions” or with “no plans/don’t know” are not using Hadoop, as if they couldn’t be using it separately from the existing data warehouse. But I also wonder about those “don’t know” responses. Business users and managers focus on the user interface, whether graphical or a query language. Particularly if you’re using SQL on Hadoop — via Apache Hive or numerous other options, SQL being the traditional data warehousing query language — you may be unaware of your use of Hadoop in your DW environment.

  3. Smaller organizations don’t necessarily face less-significant agility obstacles. It’s not absolute quantity or size that matters, it’s the number and seriousness of obstacles relative to an organization’s size.

    Take the statement, “Smaller organizations have a smaller number of HIPPOs with which to deal.” HIPPO = Highest Paid Person’s Opinion. Not a scientific sampling, but I can tell you that among my consulting clients, in companies with an employee count ranging from a handful to a few hundred, what the CEO says holds absolute sway regardless of the opinion’s technical soundness. The smaller an organization, the greater the immediate impact of any one individual’s opinion, good or bad.

    And “it is easier for small organizations to institute the organizational and cultural change necessary to actually act on the analytic insights.” Not at all. When small organizations institute significant organizational and cultural changes, they may be remaking the whole company. They’re all-in.

    Regardless, I find that small organizations are actually LESS likely than larger ones to act on analytical insights. That’s both because they have less data to work with, so there’s less to fuel analyses, and because they’re much closer to the market and more reliant on insights derived from qualitative observation and direct market interactions.

  4. Data silos are not a big data killer. Done right, they’re simply a waypoint.

    Data silos can be an efficiency booster! When you have an operational task to accomplish, you purpose-design a data store that’s optimal for the task. Certainly, design according to standards that facilitate data integration or, at least, data exchange. Design foresight will ensure data can flow from silos into whatever integrated big data environment you implement. But if you elevate secondary data-use possibilities to the first tier — and especially if you do that in the name of a faddish concept like big data — you create risk and potential delay and performance compromise.
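
On point 2 above — that SQL users may not know what engine sits beneath them — the interface-level argument can be seen in miniature with Python’s DB-API: the analyst’s SQL is identical whatever backend answers the connection. Here the stdlib’s sqlite3 stands in for the engine; with a DB-API driver such as PyHive the same query text would, I’d expect, run against Hive unchanged (the table and data are invented):

```python
import sqlite3

# The analyst writes plain SQL; whether the cursor is backed by a warehouse
# RDBMS or SQL-on-Hadoop (Hive, Impala, ...) is invisible at this level.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)])

query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
totals = dict(conn.execute(query).fetchall())  # {"APAC": 50.0, "EMEA": 200.0}
```

Swap the `connect` call for a Hive connection and nothing downstream of it needs to change — which is exactly why a “don’t know” poll response may hide real Hadoop use.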

Finally, beware of selection bias, of drawing broad conclusions from a narrow sampling of a target population. People who attend a vendor-produced webinar on data management and analysis, if not actively shopping for capabilities they don’t have, likely feel their organizations fall short of the state of the art. So the good wisdom about organizational dynamics, data warehousing, and big data that emerged in Bill’s article is surely a testament to the discernment he has gained via long and deep exposure to the topics, not just insights that jumped out of the numbers. Data is a springboard for insight and not a replacement for informed judgment.

More Advice on Giving Good Speeches

Jonathan Becher closes a blog article, Good Advice on Giving Good Speeches, with an invitation, “Any presentation tips you want to add?”

Well yes, I do have a few to add. Thanks for asking!

“Mr. Stiggins, getting on his legs as well as he could, proceeded to deliver an edifying discourse for the benefit of the company.” (Pickwick Papers)

I’ll recap the ten points Jonathan relays, examples he has drawn from Seymour Schulich’s Get Smarter: Life and Business Lessons. To them, I’ll add ten of my own. The points from Jonathan’s blog are these:

  • Be brief
  • Try to communicate one main idea
  • Create a surprise
  • Use humour
  • Slow it down
  • Use cue cards and look up often
  • Self-praise is no honour
  • Never speak before the main dinner course is served
  • Reuse good material
  • Use positive body language

Adding to these important points, my own guidance, drawn from my experience as a speaker and conference organizer:

  • Rehearse — not necessarily your whole talk, but certainly your opening and close and key, transitional moments.
  • No unplanned digressions! You risk getting lost with no route back to your prepared talk.
  • Brief, spontaneous asides are OK if relevant.
  • No borderline-appropriate jokes or sarcasm. They won’t come across.
  • Modulate your tone, volume, and pace. If you speak softly (but audibly), people will focus. Do speak louder and faster and voice emotion when those pragmatic tools will help you communicate.
  • Smile.
  • If you’re using slides, look at your audience, not at your slides.
  • Do not read your slides! (But you may read brief quotations.) The audience came to hear you. Bullet points, they could read themselves.
  • Prepare for interruptions. Handle questions quickly or divert them until after your talk so that you don’t squander your speaking time.
  • Do not go long. And do not finish early.

Finally, always remember:

A speech is experienced by the audience. As the saying goes, “They may forget what you said, but they will never forget how you made them feel.” Craft their experience and not only your words. Your audience will thank you, and most important, they’ll forgive you (per one of the points Jonathan relayed) for delaying their dinner.

Research and Insights: How Ipsos Loyalty Applies Text Analytics

Text Analytics clocks in as the #4 “emerging methods” priority for market researchers in the 2015 GRIT (Greenbook Research Industry Trends) report. Only mobile surveys, online communities, and social media analytics poll higher… although of course text analytics is key to social media analytics done right, and it’s also central to #5 priority Big Data analytics. GRIT reports text analytics as tied with Big Data analytics for #1 method “under consideration.”

On the customer-insights side, Temkin Group last year reported, “When we examined the behaviors of the few companies with mature VoC programs, we found that they are considerably more likely to use text analytics to tap into unstructured data sources… Temkin Group expects an increase in text analytics usage once companies recognize that these tools enable important capabilities for extracting insights.”

Jean-François Damais, Global Client Solutions at Ipsos Loyalty

Clearly, the market opportunity is huge, as is the market education need.

Who better to learn from than active practitioners — experts such as Jean-François Damais, Deputy Managing Director, Global Client Solutions at Ipsos Loyalty?

Jean-François is co-author, with his colleague Fiona Moss, of a recently released Ipsos Guide to Text Analytics, which seeks to explain the options, benefits, and pitfalls of setting up text analytics, with case studies. And Jean-François will be speaking at the 2015 LT-Accelerate conference, which looks at text, sentiment, and social analytics for consumer, market, research, and media insights, 23-24 November in Brussels. As a conference preview, he has consented to this interview, on —

Research and Insights: How Ipsos Loyalty Applies Text Analytics

Seth Grimes> You’ll be speaking at LT-Accelerate on text analytics in market research. You wrote in your presentation description that text analytics work at Ipsos has grown 70% each of the last two years. What proportion of projects now involve text sources? What sources, and looking for what information and insights?

Jean-François Damais> Virtually all market research projects involve some analysis of text. In the customer experience space in some of our key markets (e.g., US, Canada, UK, France, Germany), I would say that 70-80% of research projects require significant Text Analytics capabilities to extract and report insights from customer verbatims in a timely fashion. However, other markets (e.g., Eastern Europe, LATAM, MENA, APAC) are lagging behind, so the picture is a bit uneven. Generally speaking, text analytics plays a key role in Enterprise Feedback Management, which is about collecting and reporting customer feedback within organisations in real time to drive action, and which is growing at a very rapid pace.

In addition, the use of text analytics to analyse social media user-generated content is increasing significantly. But interestingly, more and more clients now want to leverage text analytics to integrate learnings across even more data sources to get a 360-degree view of customers and potential customers. So on top of survey and social, we quite often analyse internal data held by organisations, such as complaints or compliments data, FAQs, etc., and bring everything back together to create a more holistic story.

Text Analytics can really help when it comes to data integration. Of course, technology is an enabler, but it will not give you all the answers. We believe that analytical expertise is needed to set up and carry out the analysis in the right way, but also to interpret, validate and contextualise text analytics output. This is key.

Text Analytics plays a central role, according to Ipsos

Seth> Despite the impressive expansion of text analytics use at Ipsos, my impression is that research suppliers and clients often don’t understand the technology’s capabilities, and the tool providers haven’t done a great job educating them. Does this match your impression, or are you seeing something different?

Jean-François> I would agree with you on the whole. There are still a lot of misconceptions and half-knowledge in the industry. I do feel that text analytics providers would benefit from being more transparent about the benefits and limitations of their software, and how they can be applied to meet a business need. Currently it feels that everyone is ready to make a lot of promises that are difficult to live up to, and I sometimes feel that this is counterproductive. I am referring to the focus on accuracy levels, level of quality across languages, level of human input needed, how unique or better or one-size-fits-all one’s technology is compared to the rest of the market, etc.

In 2014, we conducted a comprehensive review of many of the text analytics tools currently available and identified pros and cons for each.  Although each of the tools presented us with different strengths, challenges and functionalities, we gained the following learnings:

  • There is no perfect technology. Knowing the strengths and weaknesses of the technology used is key to getting valuable results.
  • There is no miracle “press a button” type solution; even the best tools need some human intervention and analytical expertise.
  • There is no “one size fits all” tool – depending on the type of data or requirements, some tools and technologies might be better suited than others.

My colleague Fiona Moss and I have recently written a POV on how to successfully deploy Text Analytics. The full paper is online.

The benefits of text analytics technology are huge and I do agree that focus should be put on educating users and potential users to make the most out of it.

How did you personally get started with text analytics, and what advice can you offer researchers who are starting with the technology now?

I got started in 2009, when Ipsos Loyalty launched text analytics services for its clients. At the time this capability was very much a niche offering, seen by most organisations as added value, a nice-to-have. But things have come a long way since then, and Text Analytics capabilities now support some of our biggest client engagements and are a key tool in our toolkit.

Here is what I would say to any keen researcher (or client):

  • Know your purpose
  • Manage your organization’s expectations
  • Place the analyst at the heart of the process
  • Choose the right text analytics tool(s) given your objective
  • Learn the strengths and weaknesses of the tool(s) you are using
  • Don’t give up!

Where do you see the greatest opportunities and the biggest challenges, when it comes to text sources and the information they capture, and for that matter, with the range of structured and unstructured sources?

To some extent, what applies to text analytics applies to big data more generally. There has been a significant increase over the last few years in the volume and variety of sources of unstructured data, including feedback from customers, potential customers, employees, members of the public and information systems. There is huge value that quite often lies buried in this data, so the opportunity comes from the ability to extract actionable insights and intelligence. So whilst the potential is huge, there are a number of pitfalls organisations need to avoid. One of the most dangerous is the belief that technology in itself, regardless of how state-of-the-art, is enough to derive good and actionable insights.

Quoting an Ipsos case study you wrote: “Even when data has been matched to a suitable objective, analysis can be a daunting task.” What key best practices do you apply for data selection and insights extraction, from social sources in particular?

The analysis of social media presents significant challenges that go well beyond text analytics. The traditional approach to social media monitoring has been to trawl for everything — the temptation to do so is huge, as we now have access to web-trawling technology which can span the web and return a wealth of data at the “press of a button”. Unfortunately, in most cases this leads to analysis paralysis, as the data collected is huge and mostly irrelevant, with a lot of redundancy. This type of information overkill with no insights is discouraging, time-consuming and costly.

We try to structure our “social intelligence” offer around a few principles designed to address some of these challenges. The first thing is to search for specifics. Mining web data or big data more generally speaking is very different from analysing structured research data coming from structured questionnaires. You just cannot analyse everything, or cross tabulate everything by everything. The vast amount and diverse nature of such data means that we need a different approach and knowing what you are looking for is key. If you want specific answers you need specific questions. It is also about adapting and evolving.  It does take time to test and refine the set up in order to obtain valuable insights and answers. Companies should not underestimate the amount of time it takes to design, analyse and report social media insights.

The text analytics process, according to Ipsos

What’s the proper balance between software-delivered analysis and human judgment, when it comes to study design, data collection, data analysis, and decision making? Are there general rules or do you determine the best approach on a study-by-study basis?

As mentioned above, we firmly believe that analytical expertise is needed to make the most out of text analytics software. However, the amount of human intervention varies according to what type of analysis is required. If it is just about exploring and counting key concepts/patterns in the data, then minimal intervention is needed. If it is about linking different data sources and interpreting insights, then a significant human element is needed.

Technology is very important, but it is a means to an end. It is the knowledge of the data, how to manipulate and interpret the results and how to tailor these to the individual business questions that leads to truly actionable results. This places the analyst at the heart of the process in most of the projects that we run for clients.

Finally, I’ve been working in sentiment analysis and emerging emotion-focused techniques for quite some time, but the market remains somewhat skeptical. What’s your own appraisal of sentiment/emotion technologies, in general or for specific problems?

No technology is perfect, but we can make it extremely useful by knowing how to apply it. Here again I think the realisation comes with experience. We work with clients who tell us that text analytics has brought significant and tangible benefits — both in terms of time/cost savings and additional insights and integration. My view is that, as a whole, the industry should focus a bit more on communicating these tangible benefits and a little bit less on who has the best sentiment engine and the highest level of accuracy.

Thanks Jean-François!

Ipsos Loyalty’s Jean-François Damais will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies — in particular, text, sentiment and social analytics — to a range of business and governmental challenges. Join us there!

Since you’ve read to this point, check out my interviews with three other LT-Accelerate speakers:

An Inside View of Language Technologies at Google

Natural language processing, or NLP, is the machine handling of written and spoken human communications. Methods draw on linguistics and statistics, coupled with machine learning, to model language in the service of automation.

OK, that was a dry definition.

Fact is, NLP is at, or near, the core of just about every information-intensive process out there. NLP powers search, virtual assistants, recommendations, and modern biomedical research, intelligence and investigations, and consumer insights. (I discuss ways it’s applied in my 2013 article, All About Natural Language Processing.)

No organization is more heavily invested in NLP — or investing more heavily — than Google. That’s why a keynote on “Language Technologies at Google,” presented by Google Research’s Enrique Alfonseca, was a natural for the upcoming LT-Accelerate conference, which I co-organize. (LT-Accelerate takes place 23-24 November in Brussels. Join us!)

Enrique Alfonseca of Google Research Zurich

I invited Enrique to respond to questions about his work. First, a short bio —

Enrique Alfonseca manages the Natural Language Understanding (NLU) team at Google Research Zurich, working on information extraction and applications of text summarization. Overall, the Google Research NLU team “guides, builds, and innovates methodologies around semantic analysis and representation, syntactic parsing and realization, morphology and lexicon development. Our work directly impacts Conversational Search in Google Now, the Knowledge Graph, and Google Translate, as well as other Machine Intelligence research.”

Before joining the NLU team, Enrique held different positions in the ads quality and search quality teams working on ads relevance and web search ranking. He launched changes in ads quality (sponsored search) targeting and query expansion leading to significant ads revenue increases. He is also an instructor at the Swiss Federal Institute of Technology (ETH) at Zurich.

Here, then, is —

An Inside View of Language Technologies at Google

Seth Grimes> Your work has included a diverse set of NLP topics. To start, what’s your current research agenda?

Enrique Alfonseca> At the moment my team is working on question answering in Google Search, which allows me and my colleagues to innovate in various different areas where we have experience. In my case, I have worked over the years on information extraction, event extraction, text summarization and information retrieval, and all of these come together for question answering — information retrieval to rank and find relevant passages on the web, information extraction to identify concrete, factual answers for queries, and text summarization to present it to the user in a concise way.

Google Zurich, according to Google

Seth> And topics that colleagues at Google Research in Zurich are working on?

Enrique> The teams at Zurich work in close connection with the teams at other Google offices and the products that we collaborate with, so it is hard to define a boundary between “Google Research in Zurich” and the rest of the company. This said, there are very exciting efforts in which people in Zurich are involved, in various areas of language processing (text analysis, generation, dialogue, etc.), video processing, handwriting recognition and many others.

Do you do only “pure” research or has your agenda, to some extent, been influenced by Google’s product roadmap?

A 2012 paper from Alfred Spector, Peter Norvig and Slav Petrov nicely summarizes our philosophy of research. On the one hand, we believe that research needs to happen, and actually happens, in the product teams. A large proportion of our software engineers have a master’s or a Ph.D. degree and previous experience working on research topics, and they bring this expertise into product development in areas as varied as search quality, ads quality, spam detection, and many others. At the same time, we have a number of longer-term projects working on answers to the problems that Google, as a company, should have solved a few years from now. In most of these, we take complex challenges and subdivide them into smaller problems that one can handle and make progress on quickly, with the aim of having impact in Google products along the way, in a way that moves us closer to the longer-term goals.

To give an example, when we started working on event models from text, we did not have a concrete product in mind yet, although we expected that understanding the meaning of what is reported in news should have concrete applications. After some time working on it, we realised that it was useful for making sure that the information from the Knowledge Graph that is shown in web search is always up-to-date according to the latest news. While we do not yet have models for high-precision, wide-coverage deep understanding of news, the technologies built along the way have already proven useful for our users.

Do you get involved in productizing research innovations? Is there a typical path from research into products at Google?

Yes, we are responsible for bringing to production all the technologies that we develop. If research and production are handled separately, there are at least two common causes of failure.

If the research team is not close to production needs, it is possible that their evals and datasets are not fully representative of the exact needs of the product. This is particularly problematic if a research team is to work on a product that is being constantly improved. Unless they work directly on the product itself, it is likely that the settings under which the research team is working will quickly become obsolete and positive results will not translate into product improvements.

At the same time, if the people bringing research innovations to product are not the researchers themselves, it is likely that they will not know enough about the new technologies to be able to make the right decisions — for example, if product needs require you to trade off some accuracy to reduce computation cost.

Your LT-Accelerate presentation, Language Technologies at Google, could occupy both conference days by itself. But you’re planning to focus on information extraction and a couple of other topics. You have written that information extraction has proved to be very hard. You cite challenges that include entity resolution and consistency problems of knowledge bases. Actually, first, what are the definitions of “entity resolution” and “knowledge base”?

We call “entity resolution” the problem of finding, for a given mention of a topic in a text, the entry in the knowledge base that represents that topic. For example, if your knowledge base is Wikipedia, one may refer to this entry in English text as “Barack Obama”, “Barack”, “Obama”, “the president of the US”, etc. At the same time, “Obama” may refer to any other person with the same surname, so there is an ambiguity problem. In the literature, people also refer to this problem by other names, like entity linking or entity disambiguation. Two years ago, some colleagues at Google released a large corpus of entity resolution annotations in a large web corpus, including 11 billion references to Freebase topics, that has already been exploited by researchers worldwide working on information extraction.
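
The ambiguity Enrique describes can be illustrated with a toy resolver that scores candidate knowledge-base entries by overlap between the mention’s context and each candidate’s profile. The KB entries, aliases, and profile terms below are invented for illustration; real systems learn these signals at web scale:

```python
# Toy entity resolution: map a textual mention to a knowledge-base entry by
# scoring overlap between the mention's context and each candidate's profile.
KB = {
    "Barack_Obama": {"aliases": {"barack obama", "obama", "barack"},
                     "profile": {"president", "us", "washington", "politician"}},
    "Obama_Japan":  {"aliases": {"obama"},
                     "profile": {"city", "japan", "fukui", "bay"}},
}

def resolve(mention, context_words):
    """Return the KB id best matching the mention, or None if unknown."""
    mention = mention.lower()
    candidates = [(eid, entry) for eid, entry in KB.items()
                  if mention in entry["aliases"]]
    if not candidates:
        return None
    # Disambiguate by context overlap; ties go to the first candidate listed.
    return max(candidates,
               key=lambda c: len(c[1]["profile"] & set(context_words)))[0]

resolve("Obama", ["the", "president", "spoke", "in", "washington"])  # "Barack_Obama"
resolve("Obama", ["a", "coastal", "city", "in", "japan"])            # "Obama_Japan"
```

The alias lookup captures name matching; the context scoring is a crude stand-in for the statistical disambiguation models a system like Google’s would actually use.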

When we talk about knowledge bases, we refer to structured information about the real world (or imaginary worlds) on which one can ground language analysis of texts, amongst many other applications. These typically contain topics (concepts and entities), attributes, relations, type hierarchies, inference rules… There have been decades of work on knowledge representation and on manual and automatic acquisition of knowledge, but these are far from solved problems.

So ambiguity, name matching, and pronouns and other anaphora are part of the challenge, all sorts of coreference. Overall, what’s the entity-resolution state of the art?

Coreference is indeed a related problem and I think it should be solved jointly with entity resolution.

Depending on the knowledge base and test set used, results vary, but mention-level annotation currently has an accuracy between 80% and 90%. Most of the knowledge bases, such as Wikipedia and Freebase, have been constructed in large part manually, without a concrete application in mind, and issues commonly turn up when one tries to use them for entity disambiguation.

Where do the knowledge-base consistency issues arise? In representation differences, incompatible definitions, capture of temporality, or simply facts that disagree? (It seems to me that human knowledge, in the wild, is inconsistent for all these reasons and more.) And how do inconsistencies affect Google’s performance, from the user’s point of view?

Different degrees of coverage of topics, and different levels of detail in different domains, are common problems. Depending on the application, one may want to tune the resolution system to favour head entities or tail entities, and some entities may be artificially boosted simply because they sit in a denser, more detailed portion of the knowledge base’s network. On top of this, schemas are designed to be ontologically correct but exceptions are common; many knowledge bases have been constructed by merging datasets with different levels of granularity, giving rise to reconciliation problems; and Wikipedia contains many “orphan nodes” that are not explicitly linked to other topics even though clear relationships exist.

Is “curation” part of the answer — along the lines of the approaches applied for IBM Watson and Wolfram Alpha, for instance — or can the challenges be met algorithmically? Who’s doing interesting work on these topics, outside Google, in academia and industry?

There is no doubt that manual curation is part of the answer. At the same time, if we want to cover the very long tail of facts, it would be impractical to enter all that information manually and keep it permanently up to date. Automatically reconciling existing structured sources, such as product databases, books, and sports results, is part of the solution as well. I believe it will eventually be possible to apply information extraction techniques over structured and unstructured sources alike, but that is not without challenges. I mentioned before that the accuracy of entity resolution systems is between 80% and 90%. That means that in any set of automatically extracted facts, at least 10% will be associated with the wrong entity, an error that accumulates on top of any errors from the fact extraction models. Aggregation can help reduce the error rate, but will not be so useful for the long tail.
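The compounding of errors can be made concrete with a back-of-the-envelope calculation. Taking 85% as a mid-range figure for entity resolution accuracy (the 80–90% cited above) and assuming, purely for illustration, 90% accuracy for the fact extraction step, a fact is fully correct only when both steps succeed:

```python
# Back-of-the-envelope error compounding for automatic fact extraction.
# 0.85 is mid-range of the 80-90% resolution accuracy cited in the text;
# 0.90 extraction accuracy is an assumed figure for illustration.
from math import comb

entity_resolution_acc = 0.85
fact_extraction_acc = 0.90   # hypothetical

# A fact is attached to the right entity AND correctly extracted only
# if both independent steps succeed.
joint_acc = entity_resolution_acc * fact_extraction_acc
print(f"joint accuracy: {joint_acc:.3f}")   # 0.765

# Aggregation helps for head entities: if the same fact is extracted
# 5 times independently, a majority vote is right more often...
k, p = 5, joint_acc
majority_correct = sum(
    comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k // 2 + 1, k + 1)
)
print(f"majority-of-{k} accuracy: {majority_correct:.3f}")
# ...but long-tail facts are rarely extracted more than once,
# so aggregation cannot rescue them.
```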

On the bright side, the area is thriving — it is enough to review the latest proceedings of ACL, EMNLP and related conferences to realize that there is fast progress in the area. Semantic parsing of queries to answer factoid questions from Freebase, how to integrate deep learning models in KB representation and reasoning tasks, better combinations of global and local models for entity resolution… are all problems in which important breakthroughs have happened in the last couple of years.

Finally, what’s new and exciting on the NLP horizon?

On the one hand, the industry as a whole is quickly innovating in the personal assistant space: a tool that can interact with humans through natural dialogue, understands their world, their interests and needs, answers their information needs, helps in planning and remembering tasks, and can help control their appliances to make their lives more comfortable. There are still many improvements in NLP and other areas that need to happen to make this long-term vision a reality, but we are already starting to see how it can change our lives.

On the other hand, the relation between language and embodiment will see further progress as development happens in the field of robotics, and we will not just be able to ground our language analyses on virtual knowledge bases, but on physical experiences.

Thanks Enrique!

Google Research’s Enrique Alfonseca will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies — in particular, text, sentiment and social analytics — to a range of business and governmental challenges. Join us there!

Since you’ve read to this point, check out my interviews with two other LT-Accelerate speakers:

Language Use, Customer Personality, and the Customer Journey

A bit of pseudo-wisdom misattributed to statistician W. Edwards Deming says, “You can’t manage what you can’t measure.” I’d put it differently and perhaps more correctly: “You can’t manage, or measure, what you can’t model.”

Models can be formal or practical, exact or imperfect, descriptive, predictive, or prescriptive. Whatever adjectives describe the models you apply, those models should derive from observation, with a strong dose of considered judgment, and aim to produce usable insights. Among the most sought-after insights today: Individuals’ attitudes, emotions, and intents.

Scott Nowson, global innovation lead at Xerox Research Center Europe

Scott Nowson is Global Innovation Lead at Xerox, stationed at the Xerox Research Centre Europe. He holds a Ph.D. in informatics from the University of Edinburgh, works in machine learning for document access and translation, and is interested in “personal language analytics,” “a branch of text mining in which the object of analysis is the author of a document rather than the document itself.”

Scott has consented to my interviewing him about his work as a teaser for his presentation at the upcoming LT-Accelerate conference, which I co-organize and which takes place November 23-24, 2015 in Brussels. His topic is customer modelling, generally of “anything about a person that will enable us to provide a more satisfactory customer experience,” and specifically of —

Language Use, Customer Personality, and the Customer Journey

Seth Grimes> You’re the global lead at Xerox Research for customer modeling. What customers, and what about them are you modeling? What data are you using and what insights are you searching for?

Scott Nowson> Xerox has a very large customer care outsourcing business with 2.5 million customer interactions per day, wherein, among other things, we operate contact centres for our clients. So the starting point for our research work in this area is the end-customer: the person who phones a call centre looking for help with a billing inquiry, or who uses social media or web-chat to try to solve a technical issue.

We’re interested in modelling anything about a person that will enable us to provide a more satisfactory customer experience. This includes, for example, automatically determining their level of expertise so that we can deliver a technical solution in the way that’s easiest — and most comfortable — for them to follow: not overly complex for beginners, nor overly simplified for people with experience. Similarly, we want to understand aspects of a customer’s personality and how we can tailor communication with each person to maximise effectiveness. For example, some personality types require reassurance and encouragement, while others will respond to more assertive language in conversations with no “social filler” (e.g. “how are you?”).

We learn from many sources, including social media — which is common in this field. However, we can also learn about people from direct customer care interactions. We are able, for example, to run our analyses in real-time while a customer is chatting with an agent.

There are “customers” in this sense — individuals at the end of a process or service — across many areas of Xerox’s business: transportation, healthcare administration, HR services, to name just a few. So while customer care is our focus right now, this personalisation — this individualised precision — is important to Xerox at many levels.

Seth> Your LT-Accelerate talk, titled “Language Use, Customer Personality, and the Customer Journey,” concerns multi-lingual technology you’ve been developing. Does your solution apply a single modeling approach across multiple languages, then? Could you please say a bit about the technical foundations?

Scott> There are applications for which only low-level processing is required, where we may use a common, language-agnostic approach, particularly for rapid prototyping. However, much of what we do requires a far deeper understanding of the structure and semantics of the language used. Xerox, and the European Research Centre in particular, has a long history in multi-lingual natural language processing research and technology. This is where we use our linguistic knowledge and experience to develop solutions tuned to specific languages, solutions that can harness each language’s individual affordances. In some languages, the gender of a speaker or writer is morphologically encoded. In Spanish, for example, to say “I am happy” a male would say “Yo estoy contento” whereas a female would say “Yo estoy contenta.” We would overlook this valuable source of information if we merely translated an English model of gender prediction.
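The Spanish example lends itself to a tiny illustrative sketch, not Xerox’s system: a regex heuristic that treats the -o/-a ending of first-person predicative adjectives (“estoy contento” vs. “estoy contenta”) as a gender cue. A real system would use full morphological analysis; this toy pattern and its labels are assumptions for illustration.

```python
# Toy language-specific gender cue for Spanish: first-person predicative
# adjectives often encode speaker gender morphologically ("contento" vs.
# "contenta"). A crude regex heuristic, for illustration only.
import re

# Matches "estoy/soy <word ending in -o or -a>" and captures the ending.
PATTERN = re.compile(r"\b(?:estoy|soy)\s+\w+?([oa])\b", re.IGNORECASE)

def gender_cues(text: str) -> list:
    """Return 'masc'/'fem' cues found in first-person predicates."""
    return ["masc" if ending == "o" else "fem"
            for ending in PATTERN.findall(text.lower())]

print(gender_cues("Yo estoy contenta y soy alta."))     # ['fem', 'fem']
print(gender_cues("Estoy contento con el resultado."))  # ['masc']
```

The point of the example is the one Scott makes: this signal simply does not exist in English surface text, so a translated English model could never exploit it.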

On top of this language-specific feature foundation, the analytics we build can be applied more generally. Having a team that is constantly pushing the boundaries of machine learning algorithms means we always have a wide variety of options when it comes to the actual modelling of customer attributes. We conduct experiments and benchmark each approach, looking for the best combination of features and models for each task in context.

Is this research work or is it (also) deployed for use in the wild?

The model across the Xerox R&D organization is to drive forward research, then use the cutting-edge techniques we create to develop prototype technology. We typically then transfer these to one of the business groups within Xerox, which takes them to market. Our customer modelling work can be applied across many businesses within our Xerox services operations, although, as I mentioned, customer care is our initial focus. We are currently envisioning a single platform that combines our multiple strands of customer-focused research, and we expect to see aspects incorporated into products within the next year. So the advanced customer modelling is currently research, but hopefully running wild soon.

How do you decide which personality characteristics are salient? Does the choice vary by culture, language, data source or context (say a Facebook status update versus an online review), or business purpose?

That’s a good question, and it’s certainly true that not all are salient at any one time. Much of the work on computational personality recognition has dealt with the Big 5: Extraversion, Agreeableness, Neuroticism (as opposed to Emotional Stability), Openness to Experience, and Conscientiousness. This is the most widely accepted model in psychology, and it has its roots in language use, so the relationship with what we do is natural. However, the Big 5 is not the only model: Myers-Briggs types are commonly used in HR, while DiSC is commonly referenced in sales and marketing literature. Which model suits a given situation varies.

We’re currently undertaking a more ethnographically driven program of research to understand which traits would be most suitable in a given situation: which traits (or indeed other attributes) we should adapt to for the greatest impact on the customer experience.

At the same time, our recent research has shown that personality projection through language varies across data sources. We’ve shown, for example, that the language patterns which convey aspects of personality in, say, video blogs are not the same as those in everyday conversation. Similarly, across languages, it’s not possible to simply translate cues. This may work for sentiment — you might lose subtlety, but “happy” is a positive word in just about any language — but just as personalities vary between cultures, so do their linguistic indicators.

How do you measure accuracy and effectiveness, and how are you doing on those fronts?

Studies have traditionally divided personality traits — which are scored on a scale — into classes: high scorers, low scorers, and often a mid-range class. However, recent efforts such as the 2015 PAN author profiling challenge have returned the task to regression: calculating where an individual sits on the scale, that is, determining their trait score. We participated in the PAN challenge and were evaluated on unseen data, alongside 20 other teams, on four different languages. The ranking was based on mean squared error: how close our predictions were to the original values. Our performance varied across the languages of the challenge, from 3rd on Dutch Twitter data to 10th on English, on which the top 12 teams scored similarly, which was encouraging. Since submission we’ve continued to improve our approach, using different combinations of feature sets and learning algorithms to significantly lower our training error.
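The evaluation Scott describes is easy to sketch: traits are predicted on a continuous scale and scored by mean squared error against gold labels. The trait scores below are made-up numbers for illustration, not PAN data.

```python
# Minimal sketch of regression-style personality evaluation:
# trait scores on a continuous scale, ranked by mean squared error.

def mean_squared_error(y_true, y_pred):
    """Average squared difference between gold and predicted trait scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical extraversion scores on a 0-1 scale for five authors.
gold      = [0.2, 0.5, 0.9, 0.4, 0.7]
predicted = [0.3, 0.5, 0.7, 0.5, 0.6]

print(f"MSE: {mean_squared_error(gold, predicted):.4f}")   # MSE: 0.0140
```

Lower is better; a system that predicted every author’s score exactly would score 0, and the class-based framing of earlier studies corresponds to coarsely bucketing these same scales.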

Is there a role for human analysts, given that this is an automatic solution? In data selection, model training, results interpretation? Anywhere?

Our view, on both the research and commercial fronts, is that people will always be key to this work. Data preparation for example — labelled data can be difficult to come by when you consider personality. You can’t ask customers to complete long, complex surveys. One alternative approach to data collection is the use of personality perception — wherein the personality labels are judgments made by third parties based on observation of the original individual. This has been shown to strongly relate to “real” self-reported personality, and can be done at a much greater scale. It also makes sense from a customer care perspective: humans are good at forming impressions, and a good agent will try to understand the person with whom they are talking as much as possible. Thus labelling data with perceived personality is a valid approach.

Of course this labelling need not be done by an expert, per se. The judgements are typically made by completing a standard personality questionnaire, but from the point of view of the person being judged. The only real requirement is cultural: there’s no better judge of the personality of, say, a native French speaker than another French person.

Subsequently, our approach to modelling is largely data-driven. However, there is considerable requirement on human expertise in the use and deployment of such models. How we interpret the information we have about customers – how we can use this to truly understand them – requires human insight. We have researchers from more psychological and behavioural fields with whom we are working closely. This extends naturally to the training of automated systems in such areas.

We will always require human experts — be they in human behaviour, or in hands-on customer care — to help train our systems, to help them learn.

To what extent do you work with non-textual data, with human images, speech, and video and with behavioral models? What really tough challenges are you facing?

Our focus, particularly in the European labs, has been language use in text. For our purposes this matters because text is a relatively impoverished source of data: extra-linguistic information such as speech rate or body language plays a large part in how humans form impressions. However, one of our driving motivations is helping human care agents establish relationships with customers over increasingly popular text-based channels such as web chat. It’s harder to connect with customers there in the same way as on the phone, and our technology can help with this.

However, we are of course looking beyond text. Speech processing is a core part of this, but also other dimensions of social media behaviour, pictures etc. We’re also looking at automatically modelling interests in the same way.

Perhaps our biggest concern in this work is back with our starting point, the customer, and understanding how this work will be perceived and accepted. There is a lot of debate right now around personalization versus privacy, and it’s easy for people to argue “big brother” and the creepiness factor, particularly when you’re modelling at the level of personality. However, studies have shown that people are increasingly comfortable with the notion that their data is being used and in parallel are expecting more personalised services from the brands with which they interact. Our intentions in this space are altruistic — to provide an enriched, personalised customer experience. However, we recognise that it’s not for everyone. Our ethnographic teams I mentioned earlier are also investigating the appropriateness of what we’re doing. By studying human interactions in real situations, in multiple domains and cultures (we have centres around the world) we will understand the when, how, and for whom of personalisation. The bottom line is a seamless quality customer experience, and we don’t want to do anything to ruin that.

Thanks Scott!

Xerox’s Scott Nowson will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher, and solution provider speakers on the application of language technologies — in particular, text, sentiment, and social analytics — to a range of business and governmental challenges. Join us there!