I’m sure there isn’t a single data scientist who hasn’t heard of natural language processing (NLP). If you develop data analysis applications but haven’t used it, I’ll give you a pass, for the simple reason that numeric data is far easier to analyze, with huge business benefits, than data sourced from text. Parsing and identifying features in text, speech, and other “unstructured” sources is no small task. Why go to the bother if the insights you need (answers to your most pressing business questions) are readily found via numbers-centered analytics? Well, that avoidance strategy may have worked until now, but the demands of social, consumer, and enterprise data work have brought those days to an end.
To gain complete customer, social, business, and research insights, you need to extract insights from text.
Illustrating the need
The surprise outcome of the recent U.S. election illustrates the point. National polls were off by around two percentage points, “right in line with the historical average” according to Nate Silver of FiveThirtyEight, but “state polls had considerably more problems.” Simply put, voters didn’t vote the way they, and the models, said they would. Yet certain analyses were able to detect a gap between poll responses and voters’ actual intent. Which analyses? Of social media postings and free-text survey “verbatims.” The election outcome hinged on themes, moods, and passions undetectable by conventional polling, apparent only to those who looked beyond the numbers.
A similar effect occurs in a spectrum of business domains. Numbers alone — derived from customer interactions, transactions, social shares, tracking, and the like — are incapable of surfacing the motivations, intent, and root causes behind observed actions and behaviors. To explain the numbers, and to capture mentions, relationships, events, and attitudes communicated in online, social, and enterprise text, you need NLP, applied at scale.
The Trusted Analytics way
Trusted Analytics Platform (TAP) is an open source platform that provides an NLP-capable framework accessible to data scientists, developers, and analysts. TAP enables data preparation, analysis, and model building, and provides the APIs, microservice-provisioned data stores, processing pipelines, and machine learning capabilities needed for integrated text-data analysis. The how-to is for another article; it’s the what that I’ll focus on in the next few paragraphs. The what includes:
- Topic extraction. “Documents are mixtures of topics, where a topic is a probability distribution over words,” per cognitive scientists Mark Steyvers and Tom Griffiths, and TAP-available Latent Dirichlet Allocation (LDA) is the most commonly used topic-modeling method.
- Classification. A variety of methods are commonly applied for text classification, among them TAP-available Naive Bayes, Logistic Regression, Random Forests, and Support Vector Machine algorithms.
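TAP exposes LDA and these classifiers through its own APIs; as a rough, library-agnostic sketch of the same two techniques, here is the workflow in scikit-learn, where the tiny corpus, labels, and topic count are invented for illustration:

```python
# Topic extraction (LDA) and classification (Naive Bayes) on a toy corpus.
# scikit-learn stands in here for TAP's own machine learning APIs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the economy and jobs dominated the debate",
    "voters worried about jobs and wages",
    "the candidate promised tax cuts for the economy",
    "healthcare costs and insurance premiums keep rising",
    "patients struggle with insurance and healthcare access",
]

# Bag-of-words counts: LDA models each document as a mixture of topics,
# each topic a probability distribution over words.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Classification: Naive Bayes trained on the same counts with toy labels.
labels = ["economy", "economy", "economy", "health", "health"]
clf = MultinomialNB().fit(counts, labels)
print(doc_topics.shape)        # (5 documents, 2 topics)
print(clf.predict(counts[:1])) # predicted label for the first document
```

The same pattern scales up: swap the toy corpus for a document store and the bag-of-words step for whatever featurization your pipeline provides.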
Other common NLP tasks — entity extraction and entity resolution; term, theme, and concept extraction; attribute and relation extraction; and sentiment analysis — are naturals for TAP’s distributed analytical processing framework. (Entities are identifiable persons, places, companies, products, and the like. Resolution assigns a unique identifier to multiple mentions that refer to a single entity in a given context: “Barack Obama,” “Mr. Obama,” and “the president,” appearing in one article, illustrate resolvable “coreference.” “The economy” and “presidential candidate” are examples of a concept and a term, respectively.)
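In miniature, entity resolution means collapsing those varied mentions onto one canonical ID. The alias table and ID scheme below are invented for illustration; a real resolver infers coreference from context rather than from a fixed lookup:

```python
# Toy sketch of entity resolution: map multiple surface mentions of the
# same real-world entity onto a single canonical ID.
ALIASES = {
    "barack obama": "ENT:obama",
    "mr. obama": "ENT:obama",
    "the president": "ENT:obama",  # resolvable only in this article's context
}

def resolve(mention: str) -> str:
    """Return a canonical entity ID; unknown mentions get their own ID."""
    return ALIASES.get(mention.lower(), f"ENT:{mention.lower()}")

mentions = ["Barack Obama", "Mr. Obama", "the president"]
ids = {resolve(m) for m in mentions}
print(ids)  # all three mentions collapse to the single ID {'ENT:obama'}
```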
These tasks can equally be performed as preprocessing steps, with resulting data loaded for TAP analysis. But if you’re a developer, won’t you consider helping build them directly into the framework?
Analysis generates insight
Wherever the NLP happens, analysis is the step that generates insight. NLP-extracted elements become features for link and association analysis, clustering, relevancy ranking, and other techniques that uncover and exploit data relationships.
So for instance, quantify positive and negative attitudes toward a candidate, obtained via state-level surveys or social-media analyses. Weight by emotion intensity and by factors such as recency, that is, closeness to actual vote casting. Study trends, changes over time, in both volume and the relative position of the candidates. Then predict. The models are admittedly not trivial to build, but the analytical lift – the improvement over numbers-only methods – makes the effort worthwhile.
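The weighting step above can be sketched in a few lines. Every number here is invented, and the exponential half-life decay is one assumed way to model recency among several:

```python
# Aggregate per-response sentiment for a candidate, weighted by emotion
# intensity and by recency (days before the vote). Illustrative data only.
import math

# (candidate, sentiment in [-1, 1], intensity in [0, 1], days before vote)
responses = [
    ("A",  0.8, 0.9,  2),
    ("A", -0.3, 0.4, 30),
    ("B",  0.5, 0.7,  5),
    ("B",  0.6, 0.2, 60),
]

HALF_LIFE_DAYS = 14  # assumed decay: a response's weight halves every 2 weeks

def weighted_score(candidate: str) -> float:
    """Intensity- and recency-weighted mean sentiment for one candidate."""
    num = den = 0.0
    for cand, sentiment, intensity, days in responses:
        if cand != candidate:
            continue
        weight = intensity * math.exp(-math.log(2) * days / HALF_LIFE_DAYS)
        num += weight * sentiment
        den += weight
    return num / den if den else 0.0

for cand in ("A", "B"):
    print(cand, round(weighted_score(cand), 3))
```

Recomputing these scores over successive time windows gives the trend lines the paragraph describes, which then feed the predictive model.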
The outcome: Natural language processing will help you extract insights from text. Techniques are proven, and tools (open source and commercial, some within TAP) are available for the spectrum of tasks. The payoff is model completeness, predictive accuracy, and business return. If you haven’t yet pursued text analytics, now’s the time, and if your experience allows, share what you know with the TAP open source community. All will benefit.
The Trusted Analytics Platform project sponsored this article.