Always mistrust sentiment accuracy claims

Sentiment analysis plays a key role in social intelligence (a generalization of social-media analytics) and in customer-experience programs, but the disparity in tool performance is wide. It’s natural that users will look for accuracy figures, and that solution providers — the ones that lay claim to better performance — will use accuracy as a differentiator. The competition is suspect, for reasons I outlined in Social Media Sentiment: Competing on Accuracy. Per that article, there’s no standard yardstick for sentiment-analysis accuracy measurement. But that’s a technical point. Worth exploring further:

  • Providers, using human raters as a yardstick, don’t play by the same rules.
  • It’s a fallacy that humans are the ultimate accuracy arbiters anyway. Can a machine in no way judge better (as opposed to faster or more exhaustively) than a person?
  • This focus on accuracy distracts users from the real goal, which is not 95% analysis accuracy but support for the most effective possible business decision making.

To explore —

Human benchmarks

We benchmark machine performance, on purely quantitative tasks, against natural measures: luminous intensity against a model of the sensitivity of the human eye (the candela), and mechanical-engine output against the power of draft horses (horsepower). But just as a spectrometer measures light of wavelengths unseeable by humans and quantifies visible-wavelength measurements in a way humans never could, and a Saturn V rocket will (or could) take you places an animal could never go unassisted, I believe that sentiment and other human-language analysis technologies, when carefully applied, can deliver super-human accuracy. I believe it is no longer true that “The right goal is for the technology to be as good as people,” as Philip Resnik, a University of Maryland linguistics professor and lead scientist at social-media agency Converseon, puts it.

As Professors Claire Cardie and John Wilkerson explain, “The gold standard of text annotation research is usually work performed by human coders… In other words, the assessment is not whether the system accurately classifies events, but the extent to which the system agrees with humans where those classifications are concerned.”

“Agrees with humans”

Note the statement, “the assessment is not whether the system accurately classifies events, but the extent to which the system agrees with humans where those classifications are concerned.”

And consider a company, Metavana, that competes on accuracy, with claims of 95-96% performance on combined topic extraction and sentiment analysis. Metavana President Michael Tupanjanin says the company measures accuracy “the old fashioned way.” According to Tupanjanin, “We literally will take — we recently did about 3,000 quotes that we actually rated, and we sat down with a bunch of high school kids and actually had them go through sentence by sentence by sentence and see, how would you score this sentence?” I praise Metavana’s openness, but this approach is backwards, as we shall see. It assesses whether humans agree with the machine, not whether the machine agrees with humans, per established methods.
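
In the established direction, human coders independently annotate a sample drawn from the full corpus, and the machine is then scored against those labels, ideally with a chance-corrected agreement measure reported alongside raw accuracy. Here is a minimal Python sketch of that scoring step, my own illustration with made-up labels rather than Metavana’s (or anyone’s) actual evaluation code:

    from collections import Counter

    def agreement(gold, system):
        """Score a system against human gold-standard labels: raw accuracy plus
        Cohen's kappa (agreement corrected for chance), computed over ALL items
        in the human-annotated sample, not just those the system chose to tag."""
        n = len(gold)
        observed = sum(g == s for g, s in zip(gold, system)) / n
        # Chance agreement: probability both sides assign the same label if each
        # labeled items independently at its observed label rates.
        gold_freq, sys_freq = Counter(gold), Counter(system)
        expected = sum((gold_freq[c] / n) * (sys_freq[c] / n) for c in gold_freq)
        return observed, (observed - expected) / (1 - expected)

    # Made-up labels for a ten-item sample, purely for illustration.
    human   = ["pos", "neg", "neu", "pos", "neu", "neg", "pos", "neu", "pos", "neg"]
    machine = ["pos", "neg", "neu", "neu", "neu", "pos", "pos", "neu", "pos", "neg"]
    acc, kappa = agreement(human, machine)
    print(f"accuracy vs. humans: {acc:.0%}, Cohen's kappa: {kappa:.2f}")

Asking raters to score sentences the engine has already tagged reverses that: the machine picks the sample and the candidate labels, and the humans merely confirm.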

According to Erick Watson, the company’s director of product management, the software identifies entities and topics and then further mines sources for sentiment expressions. In the automotive sector, says Watson, the engine identifies expressions “such as ‘fuel efficient’ or ‘poor service quality’ and automatically determines which of these sentiment expressions is associated with [a] brand.” Sounds reasonable, but then Watson wrote me, “Expressions that contain no sentiment-bearing keywords are classified as neutral (e.g. ‘I purchased a Honda yesterday.’)”
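
A toy sketch of that style of lexicon-driven classification makes the consequence plain; this is my guess at the general shape of such an approach, not Metavana’s actual engine:

    # Toy lexicon-driven scorer in the style Watson describes (an assumed
    # illustration, not Metavana's engine). Expressions that match no
    # sentiment-bearing keyword fall through to "neutral" by default.
    POSITIVE = {"fuel efficient", "love", "reliable"}
    NEGATIVE = {"poor service quality", "hate", "lemon"}

    def classify(expression: str) -> str:
        text = expression.lower()
        if any(kw in text for kw in POSITIVE):
            return "positive"
        if any(kw in text for kw in NEGATIVE):
            return "negative"
        return "neutral"  # no sentiment-bearing keywords found

    print(classify("The Civic is remarkably fuel efficient"))  # positive
    print(classify("Poor service quality at my dealer"))       # negative
    print(classify("I purchased a Honda yesterday."))          # neutral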

I ran a Twitter poll on Watson’s ‘I purchased a Honda yesterday.’ Of 22 respondents, 45% rated it neutral and 55% rated it positive. Humans may see sentiment in an expression that contains no sentiment-bearing keywords! Metavana’s summary dismissal of such expressions, coupled with an accuracy-measurement method that restricts evaluation to machine-tagged expressions (the ones the company doesn’t dismiss), inflates the company’s accuracy results.
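
To see how that restriction flatters the headline number, run the arithmetic on a hypothetical corpus; the figures below are mine, chosen only to illustrate the mechanism:

    # Hypothetical figures (mine, not Metavana's) showing how scoring only
    # machine-tagged expressions inflates reported accuracy.
    tagged, tagged_correct = 800, 720        # engine assigned sentiment; humans agree on 720
    dismissed, dismissed_correct = 200, 110  # auto-neutral items; humans call only 110 neutral

    reported = tagged_correct / tagged                                     # 90%
    overall = (tagged_correct + dismissed_correct) / (tagged + dismissed)  # 83%
    print(f"reported (tagged-only) accuracy: {reported:.0%}")
    print(f"accuracy over all expressions:   {overall:.0%}")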

There’s more to the accuracy appraisal.

Beyond humans

As I said above, I believe that sentiment and other human-language analysis technologies, when carefully applied, can deliver super-human accuracy. True, we’re years from autonomous agents that can navigate a world of sensory (data) inputs and uncertain information in order to flexibly carry out arbitrary tasks, which is what humans do. But arguably, we can design a system that can, or soon will be able to, conduct any given task — whether driving a car or competing at Jeopardy — better than a human ever could.

A first attempt at automating a process typically involves mimicking human methods, but an intelligent system may reason in ways humans don’t. In analyzing language, in particular, machines look for nuance that may emerge only when statistical analyses are applied to very large data sets. That’s the Unreasonable Effectiveness of Data when, per Google’s Peter Norvig, “the hopeless suddenly becomes effective, and computer models sometimes meet or exceed human performance.” That’s not to say that machines won’t fail, badly, in certain circumstances. It is to say that overall, in the (large) aggregate, computers can and will outperform humans both on routine tasks and by making connections — finding patterns and discovering information — that a human never would.

Think of this insight as an extension of the Mythical Man-Month corollary, that “Nine women can’t make a baby in one month.” A machine can’t make a baby at all, but one can accelerate protons to near light speed so that their collisions carry enough energy to generate unseeable, but inferable, particles, namely Higgs bosons. Machines can already throw together (fuse) text-extracted and otherwise-collected information to establish links and associations that a human (or nine hundred) would never perceive.

Philip Resnik’s attitude, the established attitude, that “the right goal is for the technology to be as good as people,” is only a starting point. We seek to create machines that are better than humans, and we should measure their performance accordingly.

The accuracy distraction

My final (but central) point is this: The accuracy quest-for-the-best is a distraction.

Social intelligence providers often claim accuracy that beats the competition’s. (Lexalytics and OpenAmplify should be pleased that they’re the benchmarks new entrant Group of Men chose to compare itself to.) Providers boast of filtering the firehose. They claim to enable customers to transform into social enterprises, as if presenting or plugging into a widget-filled social-analytics dashboard, with simplistic +/- sentiment ratings, were the key to better business operations and decision making. Plainly stated —

The market seeks ability to improve business processes, to facilitate business tasks. Accuracy should be good enough to matter, but more important, analytical outputs should be useful and usable, aligned to business goals (positive/negative sentiment ratings often aren’t) and consumable within line-of-business applications.

I’m interested in how your technology and solutions made money for your customers, or helped them operate more efficiently and effectively, or, for that matter, saved lives or improved government services. The number that counts is demonstrated ROI.


Disclosure: Earlier this year, Converseon engaged me for a small amount of paid consulting and was a paying sponsor of my November 2011 Sentiment Analysis Symposium.

And a plug: Check out the upcoming Sentiment Analysis Symposium, slated for October 30, 2012 in San Francisco, preceded by a half-day Practical Sentiment Analysis tutorial, to be taught by Diana Maynard of the University of Sheffield, UK.
