Text Analytics clocks in as the #4 “emerging methods” priority for market researchers in the 2015 GRIT (Greenbook Research Industry Trends) report. Only mobile surveys, online communities, and social media analytics poll higher… although of course text analytics is key to social media analytics done right, and it’s also central to #5 priority Big Data analytics. GRIT reports text analytics as tied with Big Data analytics for #1 method “under consideration.”
On the customer-insights side, Temkin Group last year reported, “When we examined the behaviors of the few companies with mature VoC programs, we found that they are considerably more likely to use text analytics to tap into unstructured data sources… Temkin Group expects an increase in text analytics usage once companies recognize that these tools enable important capabilities for extracting insights.”
Clearly, the market opportunity is huge, as is the market education need.
Who better to learn from than active practitioners, experts such as Jean-François Damais, Deputy Managing Director, Global Client Solutions at Ipsos Loyalty?
Jean-François is co-author, with his colleague Fiona Moss, of a recently released Ipsos Guide to Text Analytics, which seeks to explain the options, benefits, and pitfalls of setting up text analytics, with case studies. And Jean-François will be speaking at the 2015 LT-Accelerate conference, which looks at text, sentiment, and social analytics for consumer, market research, and media insights, 23-24 November in Brussels. As a conference preview, he has consented to this interview, on —
Research and Insights: How Ipsos Loyalty Applies Text Analytics
Seth Grimes> You’ll be speaking at LT-Accelerate on text analytics in market research. You wrote in your presentation description that text analytics work at Ipsos has grown 70% each of the last two years. What proportion of projects now involve text sources? What sources, and looking for what information and insights?
Jean-François Damais> Virtually all market research projects involve some analysis of text. In the customer experience space in some of our key markets (e.g., the US, Canada, the UK, France, Germany), I would say that 70-80% of research projects require significant text analytics capabilities to extract and report insights from customer verbatims in a timely fashion. However, other markets (e.g., Eastern Europe, LATAM, MENA, APAC) are lagging behind, so the picture is a bit uneven. Generally speaking, text analytics plays a key role in Enterprise Feedback Management, which is about collecting and reporting customer feedback within organisations in real time to drive action, and which is growing at a very rapid pace.
In addition, the use of text analytics to analyse user-generated social media content is increasing significantly. Interestingly, more and more clients now want to leverage text analytics to integrate learnings across even more data sources to get a 360-degree view of customers and potential customers. So on top of survey and social data, we quite often analyse internal data held by organisations, such as complaints or compliments data, FAQs, etc., and bring everything back together to create a more holistic story.
Text analytics can really help when it comes to data integration. Of course, technology is an enabler, but it will not give you all the answers. We believe that analytical expertise is needed to set up and carry out the analysis in the right way, but also to interpret, validate and contextualise text analytics results. This is key.
Seth> Despite the impressive expansion of text analytics use at Ipsos, my impression is that research suppliers and clients often don’t understand the technology’s capabilities, and the tool providers haven’t done a great job educating them. Does this match your impression, or are you seeing something different?
Jean-François> I would agree with you on the whole. There are still a lot of misconceptions and half-knowledge in the industry. I do feel that text analytics providers would benefit from being more transparent about the benefits and limitations of their software, and about how it can be applied to meet a business need. Currently it feels as though everyone is ready to make a lot of promises that are difficult to live up to, and I sometimes feel that this is counterproductive. I am referring to the focus on accuracy levels, quality across languages, the level of human input needed, how unique or superior or one-size-fits-all one's technology is compared to the rest of the market, etc.
In 2014, we conducted a comprehensive review of many of the text analytics tools currently available and identified pros and cons for each. Although each of the tools presented us with different strengths, challenges and functionalities, we gained the following learnings:
- There is no perfect technology. Knowing the strengths and weaknesses of the technology used is key to getting valuable results.
- There is no miracle "press a button" solution; even the best tools need some human intervention and analytical expertise.
- There is no “one size fits all” tool – depending on the type of data or requirements some tools and technologies might be better suited than others.
My colleague Fiona Moss and I have recently written a POV on how to successfully deploy Text Analytics. The full paper is online.
The benefits of text analytics technology are huge and I do agree that focus should be put on educating users and potential users to make the most out of it.
How did you personally get started with text analytics, and what advice can you offer researchers who are starting with the technology now?
I got started in 2009, when Ipsos Loyalty launched text analytics services for its clients. At the time this capability was very much a niche offering, seen by most organisations as added value and a nice-to-have. But things have come a long way since then, and text analytics capabilities now support some of our biggest client engagements and are a key tool in our toolkit.
Here is what I would say to any keen researcher (or client):
- Know your purpose
- Manage your organization’s expectations
- Place the analyst at the heart of the process
- Choose the right text analytics tool(s) given your objective
- Learn the strengths and weaknesses of the tool(s) you are using
- Don’t give up!
Where do you see the greatest opportunities and the biggest challenges, when it comes to text sources and the information they capture, and for that matter, with the range of structured and unstructured sources?
To some extent, what applies to text analytics applies to big data more generally. There has been a significant increase over the last few years in the volume and variety of sources of unstructured data, including feedback from customers, potential customers, employees, members of the public and information systems. Huge value quite often lies buried in this data, so the opportunity comes from the ability to extract actionable insights and intelligence. But whilst the potential is huge, there are a number of pitfalls organisations need to avoid. One of the most dangerous is the belief that technology in itself, regardless of how state-of-the-art it is, is enough to derive good and actionable insights.
Quoting an Ipsos case study you wrote: “Even when data has been matched to a suitable objective, analysis can be a daunting task.” What key best practices do you apply for data selection and insights extraction, from social sources in particular?
The analysis of social media presents significant challenges that go well beyond text analytics. The traditional approach to social media monitoring has been to trawl for everything. The temptation to do so is huge, as we now have access to web-trawling technology which can span the web and return a wealth of data at the "press of a button". Unfortunately, in most cases this leads to analysis paralysis, as the data collected is huge and mostly irrelevant, with a lot of redundancy. This kind of information overkill with no insights is discouraging, time consuming and costly.
We try to structure our "social intelligence" offer around a few principles designed to address some of these challenges. The first is to search for specifics. Mining web data, or big data more generally, is very different from analysing structured research data coming from structured questionnaires. You just cannot analyse everything, or cross-tabulate everything by everything. The vast amount and diverse nature of such data mean that we need a different approach, and knowing what you are looking for is key. If you want specific answers, you need specific questions. It is also about adapting and evolving. It does take time to test and refine the setup in order to obtain valuable insights and answers. Companies should not underestimate the amount of time it takes to design, analyse and report social media insights.
What’s the proper balance between software-delivered analysis and human judgment, when it comes to study design, data collection, data analysis, and decision making? Are there general rules or do you determine the best approach on a study-by-study basis?
As mentioned above, we firmly believe that analytical expertise is needed to make the most of text analytics software. However, the amount of human intervention varies according to the type of analysis required. If it is just about exploring and counting key concepts and patterns in the data, then minimal intervention is needed. If it is about linking different data sources and interpreting insights, then a significant human element is needed.
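The "exploring and counting" mode Jean-François describes can be sketched as a simple tally of analyst-defined concepts over customer verbatims. The concept lexicon and verbatims below are invented for illustration; they are not Ipsos's actual categories or tooling:

```python
from collections import Counter

# Illustrative concept lexicon an analyst might define up front.
CONCEPTS = {
    "wait time": ["wait", "queue", "hold"],
    "staff": ["staff", "agent", "advisor"],
    "price": ["price", "cost", "expensive"],
}

def count_concepts(verbatims):
    """Tally how many verbatims mention each concept."""
    counts = Counter()
    for text in verbatims:
        words = text.lower().split()
        for concept, terms in CONCEPTS.items():
            if any(term in words for term in terms):
                counts[concept] += 1
    return counts

verbatims = ["The wait was too long",
             "Friendly staff but expensive",
             "Put on hold twice"]
print(count_concepts(verbatims))
```

A real deployment would add stemming, multilingual lexicons and, as Jean-François stresses, human review of the categories and their output.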
Technology is very important, but it is a means to an end. It is the knowledge of the data, how to manipulate and interpret the results and how to tailor these to the individual business questions that leads to truly actionable results. This places the analyst at the heart of the process in most of the projects that we run for clients.
Finally, I’ve been working in sentiment analysis and emerging emotion-focused techniques for quite some time, but the market remains somewhat skeptical. What’s your own appraisal of sentiment/emotion technologies, in general or for specific problems?
No technology is perfect, but we can make it extremely useful by knowing how to apply it. Here again, I think the realisation comes with experience. We work with clients who tell us that text analytics has brought significant and tangible benefits, both in terms of time and cost savings and in terms of additional insights and integration. My view is that, as a whole, the industry should focus a bit more on communicating these tangible benefits and a little bit less on who has the best sentiment engine and the highest level of accuracy.
Ipsos Loyalty’s Jean-François Damais will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies — in particular, text, sentiment and social analytics — to a range of business and governmental challenges. Join us there!
Since you’ve read to this point, check out my interviews with three other LT-Accelerate speakers:
- An Inside View of Language Technologies at Google, with Enrique Alfonseca of Google Research Europe
- Language Use, Customer Personality, and the Customer Journey, with Scott Nowson, Global Innovation Lead at Xerox Research Centre Europe
- Gain Deeper Insights from Networks PLUS Content, with TNS data scientist Preriit Souda
Natural language processing, or NLP, is the machine handling of written and spoken human communications. Methods draw on linguistics and statistics, coupled with machine learning, to model language in the service of automation.
OK, that was a dry definition.
Fact is, NLP is at, or near, the core of just about every information-intensive process out there. NLP powers search, virtual assistants, recommendations, and modern biomedical research, intelligence and investigations, and consumer insights. (I discuss ways it’s applied in my 2013 article, All About Natural Language Processing.)
No organization is more heavily invested in NLP — or investing more heavily — than Google. That’s why a keynote on “Language Technologies at Google,” presented by Google Research’s Enrique Alfonseca, was a natural for the upcoming LT-Accelerate conference, which I co-organize. (LT-Accelerate takes place 23-24 November in Brussels. Join us!)
I invited Enrique to respond to questions about his work. First, a short bio —
Enrique Alfonseca manages the Natural Language Understanding (NLU) team at Google Research Zurich, working on information extraction and applications of text summarization. Overall, the Google Research NLU team “guides, builds, and innovates methodologies around semantic analysis and representation, syntactic parsing and realization, morphology and lexicon development. Our work directly impacts Conversational Search in Google Now, the Knowledge Graph, and Google Translate, as well as other Machine Intelligence research.”
Before joining the NLU team, Enrique held positions in the ads quality and search quality teams, working on ads relevance and web search ranking. He launched changes in ads quality (sponsored search) targeting and query expansion that led to significant ads revenue increases. He is also an instructor at the Swiss Federal Institute of Technology (ETH) in Zurich.
Here, then, is —
An Inside View of Language Technologies at Google
Seth Grimes> Your work has included a diverse set of NLP topics. To start, what’s your current research agenda?
Enrique Alfonseca> At the moment my team is working on question answering in Google Search, which allows me and my colleagues to innovate in various different areas where we have experience. In my case, I have worked over the years on information extraction, event extraction, text summarization and information retrieval, and all of these come together for question answering — information retrieval to rank and find relevant passages on the web, information extraction to identify concrete, factual answers for queries, and text summarization to present it to the user in a concise way.
Seth> And topics that colleagues at Google Research in Zurich are working on?
Enrique> The teams in Zurich work in a way that is closely connected to the teams at other Google offices and to the products we collaborate with, so it is hard to define a boundary between “Google Research in Zurich” and the rest of the company. That said, there are very exciting efforts in which people in Zurich are involved, in various areas of language processing (text analysis, generation, dialogue, etc.), video processing, handwriting recognition and many others.
Do you do only “pure” research or has your agenda, to some extent, been influenced by Google’s product roadmap?
A 2012 paper by Alfred Spector, Peter Norvig and Slav Petrov nicely summarizes our philosophy of research. On the one hand, we believe that research needs to happen, and actually happens, in the product teams. A large proportion of our software engineers have a master’s or Ph.D. degree and previous experience working on research topics, and they bring this expertise into product development in areas as varied as search quality, ads quality, spam detection, and many others. At the same time, we have a number of longer-term projects working on answers to the problems that Google, as a company, should have solved a few years from now. In most of these, we take complex challenges and subdivide them into smaller problems that one can handle and make progress on quickly, with the aim of having impact in Google products along the way, in a manner that moves us closer to the longer-term goals.
To give an example, when we started working on event models from text, we did not have a concrete product in mind yet, although we expected that understanding the meaning of what is reported in news should have concrete applications. After some time working on it, we realised that it was useful for making sure that the information from the Knowledge Graph shown in web search is always up-to-date according to the latest news. While we do not yet have models for high-precision, wide-coverage deep understanding of news, the technologies built along the way have already proven useful for our users.
Do you get involved in productizing research innovations? Is there a typical path from research into products at Google?
Yes, we are responsible for bringing to production all the technologies that we develop. If research and production are handled separately, there are at least two common causes of failure.
If the research team is not close to production needs, it is possible that their evaluations and datasets are not fully representative of the exact needs of the product. This is particularly problematic if a research team works on a product that is being constantly improved. Unless they work directly on the product itself, it is likely that the settings under which the research team operates will quickly become obsolete, and positive results will not translate into product improvements.
At the same time, if the people bringing research innovations to the product are not the researchers themselves, it is likely that they will not know enough about the new technologies to make the right decisions, for example, when product needs require you to trade off some accuracy to reduce computation cost.
Your LT-Accelerate presentation, Language Technologies at Google, could occupy both conference days by itself. But you’re planning to focus on information extraction and a couple of other topics. You have written that information extraction has proved to be very hard, citing challenges that include entity resolution and consistency problems in knowledge bases. Actually, first, what are the definitions of “entity resolution” and “knowledge base”?
We call “entity resolution” the problem of finding, for a given mention of a topic in a text, the entry in the knowledge base that represents that topic. For example, if your knowledge base is Wikipedia, one may refer to this entry in English text as “Barack Obama”, “Barack”, “Obama”, “the president of the US”, etc. At the same time, “Obama” may refer to any other person with the same surname, so there is an ambiguity problem. In the literature, people also refer to this problem by other names, such as entity linking or entity disambiguation. Two years ago, some colleagues at Google released a large corpus of entity resolution annotations, comprising 11 billion references to Freebase topics in a large web corpus, which has already been exploited by researchers worldwide working on information extraction.
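As a toy illustration of the mention-level problem Enrique defines: look up candidate entries by alias, then disambiguate using context. The tiny knowledge base, aliases, and overlap heuristic below are invented for this sketch; production systems are far more sophisticated:

```python
# Minimal entity resolution sketch: alias lookup plus a crude
# context-overlap score between the mention's surrounding text and
# each candidate entry's description. Entirely illustrative data.
KB = {
    "Barack Obama": {"aliases": {"barack obama", "barack", "obama"},
                     "description": "44th president of the US politician"},
    "Michelle Obama": {"aliases": {"michelle obama", "obama"},
                       "description": "US first lady author lawyer"},
}

def resolve(mention, context):
    """Return the KB entry name best matching `mention` in `context`."""
    candidates = [name for name, entry in KB.items()
                  if mention.lower() in entry["aliases"]]
    if not candidates:
        return None
    ctx = set(context.lower().split())
    # Score candidates by shared words between context and description.
    def score(name):
        return len(ctx & set(KB[name]["description"].lower().split()))
    return max(candidates, key=score)

print(resolve("Obama", "The president spoke to the US Congress"))
```

Real systems replace the overlap score with learned models over mention features, entity popularity, and document-level coherence, which is where the 80-90% accuracy Enrique cites comes from.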
When we talk about knowledge bases, we refer to structured information about the real world (or imaginary worlds) on which one can ground language analysis of texts, amongst many other applications. These typically contain topics (concepts and entities), attributes, relations, type hierarchies, inference rules… There have been decades of work on knowledge representation and on manual and automatic acquisition of knowledge, but these are far from solved problems.
So ambiguity, name matching, and pronouns and other anaphora are part of the challenge, all sorts of coreference. Overall, what’s the entity-resolution state of the art?
Coreference is indeed a related problem and I think it should be solved jointly with entity resolution.
Depending on the knowledge base and test set used, results vary, but mention-level annotation currently has an accuracy between 80% and 90%. Most of the knowledge bases, such as Wikipedia and Freebase, have been constructed in large part manually, without a concrete application in mind, and issues commonly turn up when one tries to use them for entity disambiguation.
Where do the knowledge-base consistency issues arise? In representation differences, incompatible definitions, capture of temporality, or simply facts that disagree? (It seems to me that human knowledge, in the wild, is inconsistent for all these reasons and more.) And how do inconsistencies affect Google’s performance, from the user’s point of view?
Different degrees of coverage of topics, and different levels of detail in different domains, are common problems. Depending on the application, one may want to tune the resolution system to be more biased toward resolving mentions as head entities or as tail entities, and some entities may be artificially boosted simply because they sit in a denser, more detailed portion of the knowledge base’s network. On top of this, schemas are designed to be ontologically correct but exceptions commonly happen; many knowledge bases have been constructed by merging datasets with different levels of granularity, giving rise to reconciliation problems; and Wikipedia contains many “orphan nodes” that are not explicitly linked to other topics even though they are clearly related.
Is “curation” part of the answer — along the lines of the approaches applied for IBM Watson and Wolfram Alpha, for instance — or can the challenges be met algorithmically? Who’s doing interesting work on these topics, outside Google, in academia and industry?
There is no doubt that manual curation is part of the answer. At the same time, if we want to cover the very long tail of facts, it would be impractical to try to enter all that information manually and keep it permanently up-to-date. Automatically reconciling existing structured sources, like product databases, books, sports results, etc., is part of the solution as well. I believe it will eventually be possible to apply information extraction techniques over structured and unstructured sources, but that is not without challenges. I mentioned before that the accuracy of entity resolution systems is between 80% and 90%. That means that for any set of automatically extracted facts, at least 10% of them are going to be associated with the wrong entity — an error that will accumulate on top of any errors from the fact extraction models. Aggregation can help reduce the error rate, but it will not be so useful for the long tail.
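A quick back-of-the-envelope check of the error-accumulation point, under the simplifying assumption (mine, not Enrique's) that entity resolution and fact extraction fail independently:

```python
# If entity resolution is right 90% of the time and the fact
# extraction model is independently right 90% of the time, the
# precision of a fully correct extracted fact is their product.
entity_resolution_acc = 0.90
fact_extraction_acc = 0.90

joint_precision = entity_resolution_acc * fact_extraction_acc
print(round(joint_precision, 2))
```

So even two individually strong components compound to roughly 0.81 joint precision, which is why aggregation over many extractions helps for head entities but not for long-tail facts seen only once.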
On the bright side, the area is thriving — it is enough to review the latest proceedings of ACL, EMNLP and related conferences to realize that fast progress is being made. Semantic parsing of queries to answer factoid questions from Freebase, integrating deep learning models into KB representation and reasoning tasks, better combinations of global and local models for entity resolution… these are all problems in which important breakthroughs have happened in the last couple of years.
Finally, what’s new and exciting on the NLP horizon?
On the one hand, the industry as a whole is quickly innovating in the personal assistant space: a tool that can interact with humans through natural dialogue, understands their world, their interests and needs, answers their information needs, helps in planning and remembering tasks, and can help control their appliances to make their lives more comfortable. There are still many improvements in NLP and other areas that need to happen to make this long-term vision a reality, but we are already starting to see how it can change our lives.
On the other hand, the relation between language and embodiment will see further progress as development happens in the field of robotics, and we will not just be able to ground our language analyses on virtual knowledge bases, but on physical experiences.
Google Research’s Enrique Alfonseca will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher and solution provider speakers on the application of language technologies — in particular, text, sentiment and social analytics — to a range of business and governmental challenges. Join us there!
Since you’ve read to this point, check out my interviews with two other LT-Accelerate speakers:
- Language Use, Customer Personality, and the Customer Journey, with Scott Nowson, Global Innovation Lead at Xerox Research Centre Europe
- Gain Deeper Insights from Networks PLUS Content, with TNS data scientist Preriit Souda
A bit of pseudo-wisdom misattributed to statistician W. Edwards Deming says, “You can’t manage what you can’t measure.” I’d put it differently, and perhaps more correctly: “You can’t manage, or measure, what you can’t model.”
Models can be formal or practical, exact or imperfect, descriptive, predictive, or prescriptive. Whatever adjectives describe the models you apply, those models should derive from observation, with a strong dose of considered judgment, and aim to produce usable insights. Among the most sought-after insights today: Individuals’ attitudes, emotions, and intents.
Scott Nowson is Global Innovation Lead at Xerox, stationed at the Xerox Research Centre Europe. He holds a Ph.D. in informatics from the University of Edinburgh, works in machine learning for document access and translation, and is interested in “personal language analytics,” “a branch of text mining in which the object of analysis is the author of a document rather than the document itself.”
Scott has consented to my interviewing him about his work as a teaser for his presentation at the upcoming LT-Accelerate conference, which I co-organize and which takes place 23-24 November 2015 in Brussels. His topic is customer modelling, generally of “anything about a person that will enable us to provide a more satisfactory customer experience,” and specifically of —
Language Use, Customer Personality, and the Customer Journey
Seth Grimes> You’re the global lead at Xerox Research for customer modeling. What customers, and what about them are you modeling? What data are you using and what insights are you searching for?
Scott Nowson> Xerox has a very large customer care outsourcing business, handling 2.5 million customer interactions per day, in which, among other things, we operate contact centres for our clients. So the starting point for our research work in this area is the end customer: the person who phones a call centre looking for help with a billing inquiry, or who uses social media or web chat to try to solve a technical issue.
We’re interested in modelling anything about a person that will enable us to provide a more satisfactory customer experience. This includes, for example, automatically determining their level of expertise so that we can deliver a technical solution in the way that’s easiest — and most comfortable — for them to follow: not overly complex for beginners, nor overly simplified for people with experience. Similarly, we want to understand aspects of a customer’s personality and how we can tailor communication with each person to maximise effectiveness. For example, some personality types require reassurance and encouragement, while others will respond to more assertive language in conversations with no “social filler” (e.g. “how are you?”).
We learn from many sources, including social media — which is common in this field. However, we can also learn about people from direct customer care interactions. We are able, for example, to run our analyses in real-time while a customer is chatting with an agent.
There are “customers” in this sense — individuals at the end of a process or service — across many areas of Xerox’s business: transportation, healthcare administration, HR services, to name just a few. So while customer care is our focus right now, this personalisation — this individualised precision — is important to Xerox at many levels.
Seth> Your LT-Accelerate talk, titled “Language Use, Customer Personality, and the Customer Journey,” concerns multi-lingual technology you’ve been developing. Does your solution apply a single modeling approach across multiple languages, then? Could you please say a bit about the technical foundations?
Scott> There are applications for which only low-level processing is required, so we may use a common, language-agnostic approach, particularly for rapid prototyping. However, much of what we do requires a much greater understanding of the structure and semantics of the language used. Xerox, and the European research centre in particular, has a long history with multilingual natural language processing research and technology. This is where we use our linguistic knowledge and experience to develop solutions that are tuned to specific languages and can harness their individual affordances. There are languages in which the gender of a speaker or writer is morphologically encoded. In Spanish, for example, to say “I am happy” a male would say “Yo estoy contento” whereas a female would say “Yo estoy contenta.” We would overlook this valuable source of information if we merely translated an English model of gender prediction.
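The Spanish example lends itself to a minimal sketch of a language-specific feature extractor. The stem list and patterns below are invented for illustration and are not Xerox's implementation:

```python
import re

# In Spanish, some first-person predicative adjectives encode the
# speaker's gender in their ending: "estoy contento" (male speaker)
# vs "estoy contenta" (female speaker). Count such cues as features.
# Illustrative stem list only; a real system would use a morphological
# analyser, not a handful of regexes.
GENDERED_STEMS = ["content", "cansad", "emocionad"]

def gender_cues(text):
    """Count masculine vs feminine first-person adjective endings."""
    cues = {"masc": 0, "fem": 0}
    lowered = text.lower()
    for stem in GENDERED_STEMS:
        cues["masc"] += len(re.findall(r"\bestoy\s+" + stem + r"o\b", lowered))
        cues["fem"] += len(re.findall(r"\bestoy\s+" + stem + r"a\b", lowered))
    return cues

print(gender_cues("Yo estoy contenta y estoy cansada"))
```

The point of the sketch is Scott's: such a feature simply does not exist in English text, so a gender model translated from English would never look for it.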
On this language-specific feature foundation, the analytics we build on top can be more generally applied. Having a team that is constantly pushing the boundary of machine learning algorithms means that we always have a wide variety of options to use when it comes to the actual modelling of customer attributes. We will conduct experiments and benchmark each approach looking for the best combination of features and models for each task in context.
Is this research work or is it (also) deployed for use in the wild?
The model across the Xerox R&D organization is to drive research forward and then use the cutting-edge techniques we create to develop prototype technology. We will typically then transfer these to one of the business groups within Xerox, which will take them to market. Our customer modelling work can be applied across many businesses within our Xerox services operations although, as I mentioned, customer care is our initial focus. We are currently envisioning a single platform that combines our multiple strands of customer-focused research, though we expect to see aspects incorporated into products within the next year. So the advanced customer modelling is currently research, but hopefully running wild soon.
How do you decide which personality characteristics are salient? Does the choice vary by culture, language, data source or context (say a Facebook status update versus an online review), or business purpose?
That’s a good question, and it’s certainly true that not all are salient at any one time. Much of the work on computational personality recognition has dealt with the Big 5 — Extraversion, Agreeableness, Neuroticism (as opposed to Emotional Stability), Openness to Experience, and Conscientiousness. This is largely the most well-accepted model in psychology, and it has its roots in language use, so the relationship with what we do is natural. However, the Big 5 is not the only model: Myers-Briggs types are commonly used in HR, while DiSC is commonly referenced in sales and marketing literature. Which of these is used in any given situation varies.
We’re currently undertaking a more ethnographically driven program of research to understand which traits would be most suitable in a given situation: adapting to the traits (or indeed other attributes) that will have the most impact on the customer experience.
At the same time, our recent research has shown that personality projection through language varies across data sources. We’ve shown, for example, that the language patterns which convey aspects of personality in, say, video blogs are not the same as in everyday conversation. Similarly, across languages, it’s not possible simply to translate cues. This may work in sentiment — you might lose subtlety, but “happy” is a positive word in just about any language — but just as personalities vary between cultures, so do their linguistic indicators.
How do you measure accuracy and effectiveness, and how are you doing on those fronts?
Studies have traditionally divided personality traits — which are scored on a scale — into classes: high scorers, low scorers, and often a mid-range class. However, recent efforts such as the 2015 PAN author profiling challenge have returned the task to regression: calculating where an individual sits on the scale, i.e., determining their trait score. We participated in the PAN challenge and were evaluated on unseen data, alongside 20 other teams, in four different languages. The ranking was based on mean-squared error: how close our predictions were to the original values. Our performance varied across the languages of the challenge, from 3rd on Dutch Twitter data to 10th on English – on which the top 12 teams scored similarly, which was encouraging. Since submission we’ve continued to improve our approach, using different combinations of feature sets and learning algorithms to significantly lower our training error rate.
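Mean-squared error, the PAN ranking metric Scott mentions, is straightforward to compute; the trait scores below are invented for illustration, not PAN data:

```python
# Mean-squared error between predicted and self-reported trait scores:
# the regression metric used to rank PAN 2015 author-profiling entries.
def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical extraversion scores for five authors on a -0.5..0.5 scale.
actual    = [0.3, -0.1, 0.0, 0.4, -0.3]
predicted = [0.2,  0.0, 0.1, 0.3, -0.2]

print(round(mse(predicted, actual), 3))  # lower is better
```

Ranking by MSE rewards systems that are close on every author, rather than ones that merely sort authors into the right high/low classes.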
Is there a role for human analysts, given that this is an automatic solution? In data selection, model training, results interpretation? Anywhere?
Our view, on both the research and commercial fronts, is that people will always be key to this work. Data preparation for example — labelled data can be difficult to come by when you consider personality. You can’t ask customers to complete long, complex surveys. One alternative approach to data collection is the use of personality perception — wherein the personality labels are judgments made by third parties based on observation of the original individual. This has been shown to strongly relate to “real” self-reported personality, and can be done at a much greater scale. It also makes sense from a customer care perspective: humans are good at forming impressions, and a good agent will try to understand the person with whom they are talking as much as possible. Thus labelling data with perceived personality is a valid approach.
Of course this labelling need not be done by an expert, per se. Typically the judgements are made by completing a standard personality questionnaire, but answered from the point of view of the person being judged. The only real requirement is cultural: there’s no better judge of the personality of, say, a native French speaker than another French person.
Subsequently, our approach to modelling is largely data-driven. However, there is a considerable need for human expertise in the use and deployment of such models. How we interpret the information we have about customers — how we can use this to truly understand them — requires human insight. We work closely with researchers from psychological and behavioural fields, and this extends naturally to the training of automated systems in such areas.
We will always require human experts — be they in human behaviour, or in hands-on customer care — to help train our systems, to help them learn.
To what extent do you work with non-textual data, with human images, speech, and video and with behavioral models? What really tough challenges are you facing?
Our focus, particularly in the European labs, has been language use in text. For our purposes this is important because text is a relatively impoverished source of data: extra-linguistic information such as speech rate or body language plays a large part in human impression-making. One of our driving motivations is supporting human care agents in establishing relationships with customers on increasingly popular text-based media such as web chat. It’s harder to connect with customers there in the same way as on the phone, and our technology can help.
However, we are of course looking beyond text. Speech processing is a core part of this, but so are other dimensions of social media behaviour, pictures, and so on. We’re also looking at automatically modelling interests in the same way.
Perhaps our biggest concern in this work brings us back to our starting point, the customer, and understanding how this work will be perceived and accepted. There is a lot of debate right now around personalization versus privacy, and it’s easy for people to invoke “big brother” and the creepiness factor, particularly when you’re modelling at the level of personality. However, studies have shown that people are increasingly comfortable with the notion that their data is being used, and in parallel they expect more personalised services from the brands with which they interact. Our intentions in this space are altruistic: to provide an enriched, personalised customer experience. We recognise, however, that it’s not for everyone. The ethnographic teams I mentioned earlier are also investigating the appropriateness of what we’re doing. By studying human interactions in real situations, in multiple domains and cultures (we have centres around the world), we will understand the when, how, and for whom of personalisation. The bottom line is a seamless, quality customer experience, and we don’t want to do anything to ruin that.
Xerox’s Scott Nowson will be speaking at the LT-Accelerate conference, 23-24 November in Brussels. The program features brand, agency, researcher, and solution provider speakers on the application of language technologies — in particular, text, sentiment, and social analytics — to a range of business and governmental challenges. Join us there!
The research & insights industry — that’s market research and consumer insights — is having a hard time coming to grips with social media: chaotic, unreliable, hard to quantify… and yet an incredibly rich source of unscripted conversation. As a researcher (or a research client), how do you make sense of social, particularly when you’re accustomed to methods that allow you to ask direct questions (via surveys) and guide conversations (in focus groups) and observe and measure reactions in controlled settings? We have yet to crack construction of scientific samples of social-platform users, lacking which we can’t report statistically significant findings.
Nonetheless, research & insights professionals are working to modernize methods, to accommodate social insights. TNS data scientist Preriit Souda — 2011 ESOMAR Young Researcher of the Year — is on the front lines of this work.
Preriit graciously submitted to an interview — hard for him to find time, given a grueling schedule — in the run-up to the LT-Accelerate conference, taking place November 23-24 in Brussels. Preriit and other insights, customer experience, media & publishing, and technology leaders will be presenting on applications of language technologies — text, sentiment, and social analytics — to meet everyday business challenges.
Here, then, is Preriit Souda’s explanation of how to obtain —
Deeper Insights from Networks PLUS Content
Seth Grimes> You have remarked that too much of today’s social media analytics relies on antiquated methods, on little more than counting. So you have advocated studying networks and content in order to derive deeper insights. Let’s explore these topics.
To start, could you please describe your social-conversation mapping work, the goals and the techniques you use, the insights gained and how you (and your clients) act on them?
Preriit Souda> Networks give structure to the conversation while content mining gives meaning to that structure.
People talk about structures of conversation styles based on network analysis. I have used networks to better understand conversations on Twitter, Facebook, Tumblr, Twitter + YouTube, Weibo, etc. While these are good analyses, if you look only at a graph, the patterns that form often don’t make sense. Unless you add content mining to understand these structures, you get wrong interpretations. When you use content analysis to guide network analysis, a complete picture emerges.
In addition, clients get excited when seeing the networks (because they look cool), but then they ask why/what/how. To answer, you need content mining. For any significant insight, you need both.
For example, I worked on a campaign analysis. The campaign was handled by a big ad agency and its success was reported in a big advertising magazine. The network graph showed a decent amount of volume, but certain patterns raised questions about the conversations between certain tweeters. We looked at our text-mined data and found that these accounts were artificially inflating the tweet counts and hence the impressions. Using network and text mining together helped us uncover that the actual volumes were much lower than reported.
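The cross-check Preriit describes can be sketched in a few lines. Everything below — tweets, handles, counts — is invented for illustration; real work would use a proper interaction graph plus full content mining rather than simple duplicate detection.

```python
from collections import Counter

# Invented campaign data: three accounts, four tweets.
tweets = [
    ("@a", "Love the new #CampaignX ad!"),
    ("@b", "Love the new #CampaignX ad!"),
    ("@a", "Love the new #CampaignX ad!"),
    ("@c", "Interesting take on #CampaignX, not sure I agree."),
]

# Network/volume view: per-account tweet counts look like healthy buzz...
volume = Counter(user for user, _ in tweets)

# Content view: ...but identical texts repeated across accounts suggest inflation.
duplicates = Counter(text for _, text in tweets)
inflated = {text: n for text, n in duplicates.items() if n > 1}

organic_volume = len(duplicates)  # count each distinct message only once
print(sum(volume.values()), organic_volume)  # reported vs. deduplicated volume
```

The graph alone reports four tweets of buzz; pairing it with the content shows only two distinct messages, one of them repeated across accounts.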
Further, we use text mining to understand sources of negativity or positivity. We use text mining to measure volume of brand imagery and perception changes with time and then use network graphing to see spread.
Seth> Alright, so networks plus content. Any other insight ingredients?
Preriit> Apart from studying networks and content together, using social metadata in combination with them is quite important. Also important, but missing today, is the idea of analysing different social networks differently (because each has a different character) and then merging the findings.
Finally, clients need to use social data in conjunction with other sources of insights — survey, CRM, store data, e-commerce etc. — to get the complete picture. When social is understood in conjunction with all these pieces of the jigsaw puzzle, true impact is realized. Social media analytics needs to up its game to be a part of a larger overall picture.
We need insight-oriented analytics and not simply counting of likes and shares.
You referred to “sources of negativity or positivity.” What role does sentiment analysis play for you and for TNS clients?
I will try to answer this question using a broader term — content analysis — and then delve into opinion mining. (I like calling it tonality analysis.)
Content mining is the most important part of any social media analysis we do. If you convert the unstructured data accurately and insightfully, subsequent analyses will make more sense and be quite robust. If your content mining is crap, all your following analysis is better not done!

The basic pillar of any analysis is data. Unstructured data can’t be used directly; it has to be converted into structured data, and hence your text-mined data becomes the data feeding your models. Nowadays I have seen people in analytical/consulting firms building econometric models based on social data. When I question them on their content mining, I realize that I can’t rely much on their analysis, because the very conversion of unstructured to structured data is faulty.
If you don’t spend time being creative, insightful, comprehensive, and accurate at this stage, I doubt your analysis.
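To illustrate the unstructured-to-structured conversion Preriit emphasizes, here is a deliberately tiny sketch. The topic terms, negative-word list, and verbatim are all hypothetical, and a production system would be far richer (lemmatization, disambiguation, custom resources per sector).

```python
import re

# Hypothetical mining resources: topic lexicons and a negative-word list.
TOPIC_TERMS = {"service": ["service", "staff"], "price": ["price", "fee", "fees"]}
NEGATIVE = {"bad", "slow", "rude"}

def structure(verbatim):
    """Convert one free-text verbatim into a structured row for modelling."""
    tokens = set(re.findall(r"[a-z]+", verbatim.lower()))
    topics = [t for t, terms in TOPIC_TERMS.items() if tokens & set(terms)]
    tone = "negative" if tokens & NEGATIVE else "neutral/positive"
    return {"topics": topics, "tonality": tone}

row = structure("The staff were rude and the fees are bad.")
print(row)  # {'topics': ['service', 'price'], 'tonality': 'negative'}
```

Rows like this, one per verbatim, are the structured data that downstream econometric or tracking models consume; if this step is faulty, everything built on it inherits the fault.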
Coming to your question on sentiment analysis: we look at sentiment as a part of content analysis. In some cases, clients need a simple positive/negative split; in others, clients are more insight focused and need to understand different shades of opinion with respect to different entities (brand, product, services, etc.); and some want to go further, understanding those shades and their perceived linkages with different attributes and imageries.
We create customized opinion mining algorithms for every project, client, and sector because every situation is different. Machines can’t understand the difference between someone speaking about nuclear topics from a political angle vs. a scientific angle vs. an educational angle.
Clients expect insights as robust as those from traditional research methods such as surveys and focus groups. While in a survey or focus group you are explicitly asking people questions, in social you are mining what people say in a natural environment. So we have to understand context, and how what people say can be linked to the explicit questions otherwise answered via a survey. For example, in a survey people are asked questions like “Do you associate Brand X with trustworthiness?” while in social no one will use that lingo. So I have to find out how people refer to such concepts, and then link it up to quantify opinions. For us, opinions are not simply +/- but much more than that. These things make our life difficult but also exciting.
You advocate use of text mining for meaning discovery, to get at explicit, implicit, and contextual meaning in customer conversations. Could you please give an example of each type?
Well, different people use these words in different ways, and some might disagree with my definitions or use different labels, but what I am referring to is as follows.
Explicit meaning: Say, people using the word Barclays and talking about its bad service
Implicit meaning may be broken out as —
- Referential Implicit: People don’t use the word Barclays but share a URL (about Barclays) and express their opinion with respect to Barclays.
- Operational implicit: Saying something after seeing a YouTube video or in reaction to a Facebook post.
- Conversational implicit: Talking to people who have a very high probability of being linked only to the topic you are mining for. They might not use the words you are looking for, but there is a very high probability that they are talking about things of your interest.
- Using images to express: Sharing pictures with minimal words to express their opinion.
Contextual meaning may also be broken out —
- By geography: Certain words mean different things in different geographies, and hence the importance you give to them, in order to understand intensity, varies. Often we also need to tweak our algorithms to take into account the different lingo styles of people from different origins within a given geography.
- By sector: Certain phrases or words mean different things in different subjects and contexts. When interpreting those words or phrases, our algorithms have to properly understand the context.
- By time: The meanings of certain words/phrases change over time or are influenced by ongoing events. So an algorithm that is right at certain times can be wrong at others. For example, when people say positive things about Lufthansa airline staff, that translates to goodwill for the airline. But during adverse times — when negativity is mostly expressed against management or the brand as a whole — staff may be misperceived negatively.
What text analytics techniques should forward-looking researchers master, whether for social or survey research or media analysis?
I think I am using up a lot of your time, so I will try to keep it short. Without going into technical details, I think lexicon-based linguistic techniques are useful alongside machine learning techniques, so someone trying to enter this area should be aware of both and be ready to use both. Nowadays a lot of people have a bias towards machine learning, which is right in some cases, but in others I don’t feel it gives the desired results. So I believe a combined approach should be used.
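One minimal way to sketch the combined approach is a lexicon score used as one feature in a linear model whose weights we pretend were learned from labelled data. The lexicon, weights, and examples below are all invented for illustration.

```python
import math

# Hypothetical lexicon and (pretend-trained) model weights.
LEXICON = {"great": 1, "love": 1, "poor": -1, "hate": -1}
WEIGHTS = {"lexicon_score": 1.5, "exclaim": 0.4, "bias": 0.0}

def features(text):
    """Extract a lexicon-based score plus a simple surface feature."""
    words = text.lower().split()
    return {
        "lexicon_score": sum(LEXICON.get(w.strip("!.,"), 0) for w in words),
        "exclaim": text.count("!"),
    }

def prob_positive(text):
    """Logistic combination of the features — the ML side of the hybrid."""
    f = features(text)
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in f.items())
    return 1 / (1 + math.exp(-z))

print(prob_positive("I love this brand!") > 0.5)    # True
print(prob_positive("poor service, hate it") > 0.5)  # False
```

The point of the hybrid: the lexicon contributes interpretable linguistic knowledge, while the learned weights let data decide how much each signal matters in a given sector.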
What best practices can you share for balancing or tempering automated natural language processing, including sentiment analysis, with human judgment?
Different people look at this problem in different ways. I can talk about certain overarching steps which involve humans at different stages to improve results.
Start with good desk research by the content analyst, followed by inputs from a subject matter expert; at both stages, create and refine your mining resources. Bring in social data and then refine further. Create your model and get it checked by a linguist along with the subject matter expert. Both will give their own perspectives, and sometimes the differences between them can help you refine your model. Test with new data across different times. (Social data is often influenced by events — some known and some unknown.) Monitor your performance until you reach around 70-90% accuracy on agreed model outputs.
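The final monitoring step can be illustrated with a toy agreement check between model outputs and agreed human judgements; the labels here are invented.

```python
# Invented labels: what the model said vs. what the human reviewers agreed.
model_labels  = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "pos"]
expert_labels = ["pos", "neg", "neu", "neg", "neg", "pos", "pos", "neg", "pos", "pos"]

# Fraction of items where model and humans agree.
agreement = sum(m == e for m, e in zip(model_labels, expert_labels)) / len(expert_labels)
print(f"{agreement:.0%}")  # 80% — inside the 70-90% target band
```

Tracking this number on fresh data across different time periods is what catches the event-driven drift Preriit mentions.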
You’ll be speaking at the LT-Accelerate conference, topic “Impact and Insight: Surveys vs. Social Media.” What are the key challenges your presentation will address, and could you hint at key take-aways?
I have been using social media data alongside surveys for almost three years. It’s been a challenging ride and continues to present new challenges.
I will talk about some of the things I have discussed in the questions above: my personal experiences using social to answer client questions, possible solutions that I have found to work nicely in my context, and some of the problems I face. I will try to use examples while protecting client privacy.
People can look at my past work to get a sense of my approaches and challenge me or make suggestions. My talk will be informal, and I would prefer the audience be open in sharing thoughts.
Finally, what’s on your personal agenda to learn next?
Learning Econometric Modeling and sharpening my skills in certain scripting languages.
Again, meet and hear from TNS researcher Preriit Souda — and research/insights leaders from Ipsos, DigitalMR, Deloitte, Xerox, and other organizations — at the LT-Accelerate conference, 23-24 November in Brussels.
LT-Accelerate is a unique event, the only European conference that focuses on business value in text, speech, and social data, taking place this year November 23-24 in Brussels.
LT-Accelerate participants represent brands and agencies, research, consultancies, and solution providers. The conference is designed for learning and sharing, networking, and deal-making.
Please join us! Visit lt-accelerate.com for information and to benefit from the Super Early registration discount through September 15.
Already confirmed speakers include —
– Media & publishing companies Belga News Agency, Wolters Kluwer, and Acceso
– Technology leaders Cisco Systems, Xerox, and Yahoo Research
– Global services firm Deloitte
– Innovative solution providers econob GmbH, Eptica, Gavagai, Ontotext, and Semalytix
LT-Accelerate is an international conference produced by LT-Innovate, the forum for Europe’s language technology industry, and my U.S. consultancy, Alta Plana Corporation. Participating speakers hail from Austria, Belgium, Bulgaria, France, Germany, Ireland, Portugal, Spain, Sweden, the UK, and the United States. Speakers will present in English.
Program information and registration are available online at lt-accelerate.com. Please join us 23-24 November in Brussels!
P.S. We have program space for a few additional brand/agency speakers, and we welcome solution provider exhibitors/sponsors. You or your organization? Contact us!
Customer-strategy maven Paul Greenberg made a thought-provoking remark to me back in 2013. Paul was puzzled —
Why haven’t there been any billion-dollar text analytics startups?
Text analytics is a term for software and business processes that apply natural language processing (NLP) to extract business insights from social, online, and enterprise text sources. The context: Paul and I were in a taxi to the airport following the 2013 Clarabridge Customer Connections conference.
Clarabridge is a text-analytics provider that specializes in customer experience management (CEM). CEM is an extremely beneficial approach to measuring and optimizing business-customer interactions, if you accept research such as Harvard Business Review’s 2014 study, Lessons from the Leading Edge of Customer Experience Management. Witness the outperformance stats reported in the study’s tables. Authorities including “CX Transformist” Bruce Temkin will tell you that CEM is a must-do and that text analytics is essential to CEM (or should that be CXM?) done right. So will Clarabridge and rivals that include Attensity, InMoment, MaritzCX, Medallia, NetBase, newBrandAnalytics, NICE, SAS, Synthesio, and Verint. Each has text analytics capabilities, whether the company’s own or licensed from a third-party provider. Their text analytics extracts brand, product/service, and feature mentions and attributes, as well as customer sentiment, from social postings, survey responses, online reviews, and other “voice of the customer” sources. (A plug: For the latest on sentiment technologies and solutions, join me at my Sentiment Analysis Symposium conference, taking place July 15-16 in New York.)
So why haven’t we seen any software companies — text analytics providers, or companies whose solutions or services are text-analytics reliant — started since 2003 and valued at $1 billion or more?
Gen. Michael Hayden, former CIA and NSA director, keynoted this year’s Basis Technology Human Language Technology Conference. Basis develops natural language processing software that is applied to search, to text analytics across a broad set of industries, and to investigations. That a text technology provider would recruit an intelligence leader as speaker is no mystery: automated text understanding, and insight synthesis across diverse sources, is an essential capability in a big data world. And Hayden’s interest? He now works as a principal at the Chertoff Group, an advisory consultancy that, like all firms of the type (including mine, in data analysis technologies), focuses on understanding and interpreting trends, shaping reactions, and maintaining visibility by communicating its worldview.
Data, insights, and applications were key points in Hayden’s talk. (I’m live-blogging from there now.)
I’ll provide a quick synopsis of six key trend points with a bit of interpretation. The points are Hayden’s — applying to intelligence — and the interpretation is generally mine, offered given broad applicability that I see to a spectrum of information-driven industries. Quotations are as accurate as possible but they’re not guaranteed verbatim.
Emergent points, per Michael Hayden:
1) The paradox of volume versus scarcity. Data is plentiful. Information, insights, are not.
2) State versus non-state players. A truism here: In the old order, adversaries (and assets?) were (primarily) larger, coherent entities. Today, we live and operate, I’d say, in a new world disorder.
3) Classified versus unclassified. Hayden’s point: Intelligence is no longer (primarily) about secrets, about clandestine arts. Open source (information, not software) is ascendant. Hayden channels an intelligence analyst who might ask, “How do I create wisdom with information that need not be stolen?”
4) Strategic versus specific. “Our energy now focuses on targeting — targeted data collection and direct action.” Techniques and technologies now focus on disambiguation, that is, on creating clarity.
5) Humans versus machines. Hayden does not foresee a day (soon?) when a “carbon-based machine” will not be calling the shots, informed by the work of machines.
6) The division of labor between public and private, between “blue and green.” “There’s a lot of true intelligence work going on in the private sector,” Hayden said. And difficulties are “dwarfed by the advantage that the American computing industry gives us.”
Of course, there’s more, or there would be were Hayden free to talk about certain other trend points he alluded to. Interpreting further: the dynamics of the intelligence world cannot be satisfyingly reduced to bullet trend points, whether the quantity is a half dozen or some other number. The same is true for any information-driven industry. Yet data reduction is essential, whether you’re dealing with big data or with decision making from a set of overlapping and potentially conflicting signals. All forms of authoritative guidance are welcome.
Big data is all-encompassing, and that seems to be a problem. The term has been stretched in so many ways that in covering so much, it has come to mean — some say — too little. So we’ve been hearing about “XYZ data” variants. Small data is one of them. Sure, some datasets are small in size, but the “small” qualifier isn’t only or even primarily about size. It’s a reaction to big data that, if you buy advocates’ arguments, describes a distinct species of data that you need to attend to.
Nowadays, all data — big or small — is understood via models, algorithms, and context derived from big data. Our small-data systems now effortlessly scale big. Witness: older Microsoft Excel spreadsheets maxed out at 256 columns and 65,536 rows, while the current grid allows 16,384 columns by 1,048,576 rows: over 17 billion cells. And it’s easy to go bigger, even from within Excel. It’s easy to hook this software survivor of computing’s Bronze Age, the 1980s, into external databases of arbitrary size and to pull data from the unbounded online and social Web.
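The grid arithmetic is easy to verify:

```python
# Columns times rows in the larger Excel grid.
cells = 16_384 * 1_048_576
print(cells)  # 17179869184 — just over 17 billion
```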
So we see —
Small is a matter of choice, rather than a constraint. You don’t need special tools or techniques for small data. Conclusion: The small data category is a myth.
Regardless, do discussions of small data, myth or not, offer value? Is there a different data concept that works better? Or with an obsessive data focus, are we looking at the wrong thing? We can learn from advocates. I’ll choose just a few, and riff on their work.
Delimiting Small Data
Allen Bonde, now a marketing and innovation VP at OpenText, defines small data as both “a design philosophy” and “the technology, processes, and use cases for turning big data into alerts, apps, and dashboards for business users within corporate environments.” That latter definition reminds me of “data reduction,” a term for the sort of data analysis done a few ages ago. And of course, per Bonde, “small data” describes “the literal size of our data sets as well.”
I’m quoting from Bonde’s December 2013 guest entry in the estimable Paul Greenberg’s ZDNet column, an article titled 10 Reasons 2014 will be the Year of Small Data. (Was it?) Bonde writes, “Small data connects people with timely, meaningful insights (derived from big data and/or ‘local’ sources), organized and packaged — often visually — to be accessible, understandable, and actionable for everyday tasks.”
So (some) small data is a focused, topical derivation of big data. That is, small data is Mini-Me.
Other small data accumulates from local sources. Presumably, we’re talking the set of records, profiles, reference information, and content generated by an isolated business process. Each of those small datasets is meaningful in a particular context, for a particular purpose.
So small data is a big data subset or a focused data collection. Whatever its origin, small data isn’t a market category. There are no special small-data techniques, tools, or systems. That’s a good thing, because data users need room to grow by adding to or repurposing their data. Small data collections that have value tend not to stay small.
Encapsulating: Smart Data
Tom Anderson builds on a start-small notion in his 2013 article Forget Big Data, Think Mid Data. Tom offers the guidance that you should consider cost in creating a data environment sized to maximize ROI: his mid-data concept starts with small data and incrementally adds affordable elements that will pay off. Tom used another term, smart data, when I interviewed him in May 2013, to capture the concept of (my words) maximum return on data.
Return isn’t something baked into the data itself. Return on data depends on your knowledge and judgment in collecting the right data and in preparing and using it well.
This thought is captured in an essay, “Why Smart Data Is So Much More Important Than Big Data,” by Scott Fasser, director of Digital Innovation for HackerAgency. His argument? “I’ll take quality data over quantity of data any day. Understanding where the data is coming from, how it’s stored, and what it tells you will help tremendously in how you use it to narrow down to the bits that allow smarter business decisions based on the data.”
“Allow” is a key word here. Smarter business decisions aren’t guaranteed, no matter how well described, accessible, and usable your datasets are. You can make a stupid business decision based on smart data.
Of course, smart data can be big and big data can be smart, contrary to the implication of Scott Fasser’s essay title. I used smart in a similar way in naming my 2010 Smart Content Conference, which focused on varieties of big data that are decidedly not traditional, or small, data. That event was about enhancing the business value of content — text, images, audio, and video — via analytics including application of natural language processing to extract information, and generate rich metadata, from enterprise content and online and social media.
(I decided to focus my ongoing conference organizing elsewhere, however. The Sentiment Analysis Symposium applies the same technology set but targets discovery of business value in attitudes, opinion, and emotion in diverse unstructured media and structured data. The 8th go-around will take place July 15-16, 2015 in New York.)
But data is just data — whether originating in media (text, images, audio, and video) or as structured tracking, transactional, and operational data — whether facts or feelings. And data, in itself, isn’t enough.
Extending: All Data
I’ll wrap up by quoting an insightful analysis, The Parable of Google Flu: Traps in Big Data Analysis, by academic authors David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani, writing in Science magazine. As it happens, I’ve quoted Harvard University professor Gary King before, in my 4 Vs For Big Data Analytics: “Big Data isn’t about the data. It’s about analytics.”
King and colleagues write, in their Parable paper, “Big data offer enormous possibilities for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables… Instead of focusing on a ‘big data revolution,’ perhaps it is time we were focused on an ‘all data revolution,’ where we recognize that the critical change in the world has been innovative analytics, using data from all traditional and new sources, and providing a deeper, clearer understanding of our world.”
The myth of small data is that it’s interesting beyond very limited circumstances. It isn’t. Could we please not talk about it any more?
The sense of smart data is that it allows for better business decisions, although positive outcomes are not guaranteed.
The end game is analysis that exploits all data — both producing and consuming smart data — to support decision making, to measure outcomes, and to help you improve processes and create the critical, meaningful change we seek.
Cognitive is a next computing paradigm, responding to demand for always-on, hyper-aware data technologies that scale from device form to the enterprise.
Cognitive computing is an approach rather than a specific capability. Cognitive mimics human perception, synthesis, and reasoning capabilities by applying human-like machine-learning methods to discern, assess, and exploit patterns in everyday data. It’s a natural for automating text, speech, and image processing and dynamic human-machine interactions.
IBM is big on cognitive. The company’s recent AlchemyAPI acquisition is only the latest of many moves in the space. This particular acquisition adds market-proven text and image processing, backed by deep learning, a form of machine learning that resolves features at varying scales, to the IBM Watson technology stack. But IBM is by no means the only company applying machine learning to natural language understanding, and it’s not the only company operating under the cognitive computing banner.
Digital Reasoning is an innovator in natural language processing and, more broadly, in cognitive computing. The company’s tag line:
We build software that understands human communication — in many languages, across many domains, and at enormous scale. We help people see the world more clearly so they can make a positive difference for humanity.
Tim Estes founded Digital Reasoning in 2000, focusing first on military/intelligence applications and, in recent years, on financial markets and clinical medicine. Insight in these domains requires synthesis of facts from disparate sources. Context is key.
The company sees its capabilities mix as providing a distinctive interpretive edge in a complex world, as will become clear as you read Tim’s responses in an interview I conducted in March, to provide material for my recent Text Analytics 2015 state-of-the-industry article. Digital Reasoning has, in the past, identified as a text analytics company. Maybe not so much any more.
Call the interview —
Digital Reasoning Goes Cognitive: CEO Tim Estes on Text, Knowledge, and Technology
Seth Grimes: Let’s start with a field that Digital Reasoning has long supported, text analytics. What was new and interesting in 2014?
Tim Estes: Text analytics is dead, long live the knowledge graph.
Seth: Interesting statement, both parts. How’s text analytics dead?
Tim: I say this partially in jest: Text analytics has never been needed more. The fact is, the process of turning text into structured data is now commoditized and fragmented. As a component business, it’s no longer interesting, with the exits of Attensity and Inxight, and the lack of pure plays.
I don’t think the folks at Attensity are aware they’ve exited, but in any case, what’s the unmet need and how is it being met, via knowledge graphs and other technologies?
What is replacing text analytics is a platform need, the peer of the relational database, to go from human signals and language into a knowledge graph. The question leading enterprises are asking, especially financial institutions, is how do we go from the unstructured data on our big data infrastructure to a knowledge representation that can supply the apps we need? That's true for enterprises whether [they've implemented] an on-premise model (running on the Hadoop stacks required by large banks and companies, with internal notes and knowledge) or a cloud model with an API.
You’re starting to get a mature set of services, where you can put data in the cloud and get back certain other metadata. But they’re all incomplete solutions because they try to annotate data, creating data on more data — and a human can’t use that. A human needs prioritized knowledge and information — information that’s linked by context across everything that occurs. So unless that data can be turned into a system of knowledge, the data is of limited utility, and all the hard work is left back on the client.
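To make the distinction concrete — this is my own miniature sketch, not Digital Reasoning's technology — the difference between annotation and knowledge is linkage. The triples below are hypothetical output from some upstream entity and relation extraction step; the point is that once facts become graph edges, facts from different documents connect through shared nodes and can answer questions no single document contains.

```python
from collections import defaultdict

# Hypothetical triples an NLP pipeline might extract from separate documents.
# A real system would derive these with entity and relation extraction.
triples = [
    ("Acme Corp", "acquired", "Widget Inc"),
    ("Widget Inc", "based_in", "Austin"),
    ("Acme Corp", "led_by", "Jane Doe"),
]

# A knowledge graph links entities via typed relations.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def neighbors(entity):
    """Return every (relation, entity) pair linked to the given node."""
    return graph[entity]

# A cross-document question: where is the company Acme Corp acquired based?
acquired = [o for r, o in neighbors("Acme Corp") if r == "acquired"]
locations = [o for a in acquired for r, o in neighbors(a) if r == "based_in"]
print(locations)  # ['Austin']
```

Flat per-document annotations would have left that join "back on the client," in Tim's phrase; the graph performs it in one traversal.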
Building a system of knowledge isn’t easy!
The government tried this approach, spending billions of dollars across various projects doing it that way, and got very little to show for it. We feel we’re the Darwinian outcome of billions of dollars of government IT projects.
Now companies are choosing between having their own knowledge graphs and trusting a third-party knowledge graph provider, like Facebook or Google. Apple has no knowledge graph, and because you can’t process your data with it, it doesn’t offer a real solution and is behind the market leaders. Amazon has the biggest platform, but it also has no knowledge graph and no ability to process your data as a service, so it too has a huge hole. Microsoft has the tech and is moving ahead quickly, but the leader is Google, with Facebook a fast follower.
That’s them. What about us, the folks who are going to read this interview?
On the enterprise side, with on-premise systems, there are very few good options to go from text to a knowledge graph. Not just tagging and flagging. And NLP (natural language processing) is not enough. NLP is a prerequisite.
You have to get to the hard problem of connecting data, lifting out what’s important. You want to get data today and ask questions tomorrow, and get the answers fast. You want to move beyond getting information about the patterns your NLP detected in whatever passed through it today. The former involves static lessons learned, baked into code and models. The latter provides a growing, vibrant base of knowledge that can be leveraged as human creativity desires.
So an evolution from static to dynamic, from baseline NLP to…
I think we’ll look back at 2014, and say, “That was an amazing year because 2014 was when text analytics became commoditized at a certain level, and you had to do much more to become valuable to the enterprise. We saw a distinct move from text analytics to cognitive computing.” It’s like selling tires versus selling cars.
Part-way solutions to something more complete?
It’s not that people don’t expect to pay for text analytics. It’s just that there are plenty of open source options that provide mediocre answers for cheap. But the mediocre solutions won’t do the hard stuff like find deal language in emails, much less find deal language among millions of emails among tens of billions of relationships, that can be queried in real time on demand and ranked by relevance and then supplied in a push fashion to an interface. The latter is a solution that provides a knowledge graph while the former is a tool. And there’s no longer much business in supplying tools. We’ve seen competitors, who don’t have this solution capability, look to fill gaps by using open source tools, and that shows us that text analytics is seen as a commodity. As an analogy, the transistor is commoditized but the integrated circuit is not. Cognitive computing is analogous to the integrated circuit.
What should we expect from industry in 2015?
Data accessibility. Value via applications. Getting smart via analytics.
The enterprise data hub is interactive, and is more than a place to store data. What we’ve seen in the next wave of IT, especially for the enterprise, is how important it is to make data easily accessible for analytic processing.
But data access alone doesn’t begin to deliver value. What’s going on now, going back to mid-2013, is that companies haven’t been realizing the value in their big data. Over the next year, you’re going to see the emergence of really interesting applications that get at value. Given that a lot of that data is human language, unstructured data, there’s going to be various applications that use it.
You’re going to have siloed applications. Go after a use case and build analytic processing for it or start dashboarding human language to track popularity, positive or negative sentiment — things that are relatively easy to track. You’re going to have more of these applications designed to help organizations because they need software that can understand X about human language so they can tell Y to the end user. What businesses need are applications built backwards from the users’ needs.
But something’s missing. Picture a sandwich. Infrastructure and the software that computes and stores information are the bottom slice and workflow tools and process management are the top slice. What’s missing is the meat — the brains. Right now, there’s a problem for global enterprises: You have different analytics inside every tool. You end up with lots of different data warehouses that can’t talk to each other, silo upon silo upon silo — and none of them can learn from another. If you have a middle layer, one that is essentially unified, you have use cases that can get smarter because they can learn from the shared data.
You mentioned unstructured data in passing…
We will see more ready utilization of unstructured data inside applications. But there will be very few good options this year for a platform that can turn text into knowledge. They will be inhibited by two factors: 1) the rules or models are static and hard to change; 2) the ontology of the data, and how much energy it takes to fit your data into it. In short: static processing and mapping to ontologies.
Those problems are both alleviated by cognitive computing. Our approach builds the model from the data — there’s no ontology. That said, if you have one, you can apply it to our approach and technology as structured data.
So that’s one element of what you’re doing at Digital Reasoning, modeling direct from data, ontology optional. What else?
We’re able to expose more varieties of global relationships from the data. We aim for it to be simple to teach the computer something new. Any user — with the press of a button and a few examples — can teach the system to start detecting new patterns. That should be pretty disruptive. And we expect to move the needle in ways people might not expect, bringing high quality out of language processing — near human-level processing of text into people, places, things, and relationships. We expect our cloud offering to become much more mature.
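Teaching a system from a few examples can take many forms; here is a deliberately tiny illustration of the idea (my sketch, with invented "deal language" examples, not Digital Reasoning's method). Given a couple of sample phrases, it generalizes a token template — positions where the examples disagree become wildcards — and then flags new text matching the learned pattern.

```python
def learn_template(examples):
    """Derive a token template from example phrases: positions where
    the examples disagree become wildcards ('*')."""
    token_lists = [ex.split() for ex in examples]
    assert len({len(t) for t in token_lists}) == 1, "examples must align"
    template = []
    for position in zip(*token_lists):
        template.append(position[0] if len(set(position)) == 1 else "*")
    return template

def matches(template, text):
    """Check a new phrase against the learned template."""
    tokens = text.split()
    if len(tokens) != len(template):
        return False
    return all(t == "*" or t == tok for t, tok in zip(template, tokens))

# Hypothetical user-supplied examples of a pattern worth detecting.
examples = ["buy 100 shares of IBM", "buy 250 shares of AAPL"]
template = learn_template(examples)   # ['buy', '*', 'shares', 'of', '*']
print(matches(template, "buy 75 shares of MSFT"))   # True
print(matches(template, "sell 75 shares of MSFT"))  # False
```

A production system would generalize far more robustly (over parses and entities, not raw tokens), but the user experience Tim describes — a few examples in, a new detector out — is the same shape.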
Any other remarks, concerning text analytics?
Microsoft and Google are duking it out. It’s an interesting race, with Microsoft making significant investments that are paying off. Their business model is to create productivity enhancements that make you want to keep paying them for their software. They have the largest investment in technology, so it will be interesting to see what they come up with. Google is, of course, more consumer oriented. Their business model is about getting people to change their minds. Fundamentally different business models, with one leaning towards exploitation and the other leading to more productivity, and analytics is the new productivity.
And think prioritizing and algorithms that work for us —
You might read 100 emails a day but you can’t really think about 100 emails in a day — and that puts enormous stress on our ability to prioritize anything. The counterbalance to being overwhelmed by all this technology — emails, texts, Facebook, Twitter, LinkedIn, apps, etc. — available everywhere (on your phone, at work, at home, in your car or on a plane) — is to have technology help us prioritize because there is no more time. Analytics can help you address those emails. We’re being pushed around by algorithms to connect people on Facebook but we’re not able to savor or develop friendships. There’s a lack of control and quality because we’re overwhelmed and don’t have enough time to concentrate.
That’s the problem statement. Now, it’s about time that algorithms work for us, push the data around for us. There’s a big change in front of us.
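What "algorithms that work for us" might mean for those 100 daily emails can be sketched in a few lines — a toy of my own devising, with hypothetical cue words and weights where a real system would learn priorities per user from behavior:

```python
# Hypothetical urgency cues; a real system would learn these per user.
CUES = {"urgent": 3.0, "deadline": 2.5, "asap": 2.5, "invoice": 1.5, "fyi": -1.0}

def priority(message):
    """Score a message by summing the weights of cue words it contains."""
    return sum(CUES.get(word, 0.0) for word in message.lower().split())

inbox = [
    "fyi quarterly newsletter attached",
    "urgent deadline moved to friday",
    "invoice 4471 ready for review",
]

# Surface the most pressing message first instead of newest-first.
ranked = sorted(inbox, key=priority, reverse=True)
print(ranked[0])  # "urgent deadline moved to friday"
```

Trivial as it is, it inverts the usual relationship: the algorithm pushes the data around for us rather than the reverse.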
I agree! While change is a constant, the combination of opportunity, talent, technology, and need are moving us faster than ever.
Thanks, Tim, for the interview.
Disclosure: Digital Reasoning was one of eight sponsors of my study and report, Text Analytics 2014: User Perspectives on Solutions and Providers. While Digital Reasoning’s John Liu will be speaking, on Financial Markets and Trading Strategies, at the July 2015 Sentiment Analysis Symposium in New York, that is not a paid opportunity.
For more on cognitive computing: Judith Hurwitz, Marcia Kaufman, and Adrian Bowles have a book just out, Cognitive Computing and Big Data Analytics, and I have arranged for consultant Sue Feldman of Synthexis to present a Cognitive Computing workshop at the July Sentiment Analysis Symposium.
Commercial text analytics worldwide is dominated by US, UK, and Canadian companies, despite the presence of many exceptional academic and research centers in Europe and Asia. Market successes correlate not only with English-language capabilities, but also with minimal government interference in business development. I’m referring to two sorts of interference. The tech sector in eurozone countries is often over-reliant on governmental research funding, and it is hampered by business and employment rules that discourage investment and growth. Where these inhibitors are less a factor — for text analytics, notably in Singapore, Scandinavia, and Israel — commercialized text analytics thrives.
Eurozone entrepreneurs — such as Spain’s DAEDALUS — aim similarly to grow via a commercial-markets focus and by bridging quickly, even while still small, to the Anglophone market, particularly to the US. (The euro’s current weakness supports this latter choice.)
This point emerges from a quick Q&A I recently did with José Carlos González. José founded DAEDALUS in 1998 as a spin-out of work at the Universidad Politécnica de Madrid, where he is a professor, and other academic research. I interviewed him for last year’s Text Analytics 2014 story. This year’s Q&A, below, was in support of my Text Analytics 2015 report on technology and market developments.
My interview with José Carlos González —
Q1) What was new and interesting, for your company and industry as a whole, and for the market, in 2014?
Through the course of 2014, we have seen a burst of interest in text analytics solutions from very different industries. Niche opportunities have appeared everywhere, giving birth to a cohort of new players (startups) with integration abilities working on top of ad-hoc or general-purpose (open or inexpensive) text analytics tools.
Consolidated players, which have been delivering text analytics solutions for years (lately in the form of APIs), face the “breadth vs depth” dilemma. The challenge of developing, marketing, and selling vertical solutions for specific industries has led some companies to focus on niche markets quite successfully.
Q2) And technology angles? What approaches have advanced and what has been over-hyped?
The capability of companies to adapt general-purpose semantic models to a particular industry or company quickly and inexpensively was essential in 2014 to speeding up the adoption of text analytics solutions.
Deep learning and general-purpose artificial intelligence approaches show slow progress beyond the research arena.
Q3) What should we expect from your company and from the industry in 2015?
Voice of the Customer (VoC) analytics — and in general, all the movement around customer experience — will continue being the most important driver for the text analytics market.
The challenge for the years to come will be to provide high-value, actionable insights to our clients. These insights should be integrated with CRM systems and treated alongside structured information, in order to fully exploit the value of the client data in companies’ hands. Privacy concerns and the difficulty of linking social identities with real persons or companies will still be a barrier to more exploitable results.
Q4) Any other remarks, concerning text analytics?
Regarding the European scene, the situation in 2015 is worse than ever. The Digital Single Market, one of the 10 priorities of the new European Commission, seems a kind of chimera — wished for but elusive — for companies providing digital products or services.
The new Value Added Tax (VAT) regulation, in force since January 2015, compels companies to charge VAT in the country of the buyer instead of the seller, to obtain various pieces of evidence of customer nationality, and to store a large amount of data for years. These regulations, intended to prevent internet giants from avoiding VAT, are in fact going to make compliance so difficult that the only way to sell e-products will be via large platforms. Thus, small European digital companies suffer an additional burden and higher business expenses, while the monopoly of US online platforms is reinforced. The road to hell is paved with good intentions!
I thank José for this interview and will close with a disclosure and an invitation: DAEDALUS, in the guise of the company’s MeaningCloud on-demand Web service (API), is a sponsor of the 2015 Sentiment Analysis Symposium, which I organize, taking place July 15-16, 2015 in New York. If you’re concerned with the business value of opinion, emotion, and intent, in social and enterprise text, join us at the symposium!
And finally, an extra: Video of DAEDALUS’s Antonio Matarranz, presenting on Voice of the Customer in the Financial Services Industry at the 2014 Sentiment Analysis Symposium.