The big-data analysis process reduces to three elements: Collection, Synthesis, and Insight. We gather relevant data, harmonize and link it, and use analysis findings situationally. In the online/social/sensor era, “relevant” may reflect enormous data volume. “Harmonize” responds to variety, and situational applications must often accommodate high-velocity data. Context and latency considerations complicate matters. Latency refers to the acceptable lag in data collection, analysis, and reporting; low latency is crucial in online, mobile, and enterprise interactions. And context means metadata, good old data about data, which can boost analysis accuracy (and also aid in proper data governance).
This article is about the roles of metadata and connection in the big-data story.
Human Data: Facts, Feelings, and Intent
My particular interest is “human data,” communicated in intentionally expressive sources such as text, video, and social likes and shares, and in implicit expressions of sentiment. Implicit: We infer sentiment signals from behavior tracks (transaction records, click-/tap-streams, and geolocation) and social-network links and interactions.
Human data, from devices, online and social platforms, and enterprise transactional and operational systems, captures what Fernando Lucini characterizes as “the electronic essence of people.” Lucini is CTO of HP Autonomy. He is one of four industry authorities I interviewed as story sources. Lucini observes, “we interact with many systems, we communicate, we create,” yet analytics providers “don’t connect the dots in a way that’s truly useful, for each of us to be better served by information.”
The others I interviewed — IBM analytics strategist Marie Wallace, AlchemyAPI founder and CEO Elliot Turner, and Prof. Stephen Pulman of the University of Oxford and start-up TheySay — have similar points of view. (IBM and TheySay sponsored my recent New York Sentiment Analysis Symposium. AlchemyAPI is sponsoring my upcoming market study, “Text Analytics 2014: User Perspectives on Solutions and Providers,” as is Digital Reasoning, mentioned later in this article.)
According to Marie Wallace, “the biggest piece of missing information isn’t the content itself, but the metadata that connects various pieces of content into a cohesive story.” What sort of metadata?
Stephen Pulman refers to properties of the message (for example, whether it’s humorous, sincere, or likely fake) and of the author, such as sex, age, and maybe also influence and ideology, which “tell us how we should treat the content of the message, as well as being interesting in themselves.”
As if expanding on Pulman’s thought, Marie Wallace asks, “if I don’t know the individual and the background behind her current communication, how can I really decide what her mood or intent is, and most importantly take effective action?”
Elliot Turner is particularly interested in intent mining, applied, for example, in efforts to predict an individual’s purchasing behavior. Turner says, “success will combine elements like a person’s interests, relationships, geography — and ultimately his identity, purchase history and privacy preferences — so that applications can plot where a person is in his ‘buyer’s journey’ and provide the best offers at the best times.”
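To make the interviewees’ points concrete, here is a minimal sketch of what such a metadata record might look like: message-level properties of the kind Pulman names, author attributes, and the links that, in Wallace’s phrase, connect pieces of content into a cohesive story. Every field name here is illustrative, not any vendor’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AuthorMetadata:
    # Author properties Pulman mentions; in practice these are often
    # inferred, so each is optional and carries uncertainty.
    sex: Optional[str] = None
    age_range: Optional[str] = None
    influence_score: Optional[float] = None

@dataclass
class MessageMetadata:
    message_id: str
    author: AuthorMetadata
    # Message-level properties: how should we treat this content?
    is_humorous: Optional[bool] = None
    is_sincere: Optional[bool] = None
    likely_fake: Optional[bool] = None
    # Connections to other content, the piece Wallace calls missing.
    replies_to: Optional[str] = None
    thread_ids: List[str] = field(default_factory=list)

# A downstream analyzer can use these signals situationally, e.g.
# discounting sentiment extracted from a likely-fake message.
msg = MessageMetadata(
    "m1",
    AuthorMetadata(age_range="25-34"),
    likely_fake=False,
    replies_to="m0",
)
```

The point of the structure is the last two fields: content analysis alone yields isolated observations, while the connection fields let separate messages be read as one story.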
Natural Language Processing
Natural language processing (NLP) (and parsing and interpretation for formal languages) is a route to mining the information content of text and speech, complemented by techniques that extract interesting information from sound, images, and video. (Of course, network, geospatial, and temporal data come into play: a matter for another article.) Recognizing that NLP includes both language understanding and language generation, two parts of a conversation — think about, but also beyond, “question answering” systems such as Apple Siri — I asked my interviewees, How well are we doing with NLP?, and also about our ability to mine affective states, that is, mood, emotion, attitudes, and intent.
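At its simplest, mining affective states from text can be done with a sentiment lexicon. The sketch below is an assumption-laden toy, a tiny hand-built lexicon and one-word negation handling, not any interviewee’s production system, but it illustrates the basic mechanism that richer NLP builds on.

```python
import re

# Toy lexicon: word -> polarity. Real lexicons hold thousands of entries.
LEXICON = {"love": 1, "great": 1, "happy": 1, "hate": -1, "awful": -1, "sad": -1}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text: str) -> int:
    """Sum lexicon polarities over tokens, flipping the word after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
        negate = False  # negation applies only to the next word
    return score

print(sentiment_score("I love this, it's great"))   # positive
print(sentiment_score("not happy, in fact awful"))  # negative
```

Pulman’s caveat applies directly: an approach like this degrades quickly on tweets and other casual language, where slang and misspellings fall outside the lexicon.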
Stephen Pulman sees “steady progress on parsing and semantic-role labeling, etc., for well-behaved text” — by “well-behaved,” he means (relatively) grammatical, correctly spelled, and slang-free — but “performance goes down pretty steeply for texts like tweets or other more casual forms of language use.”
Elliot Turner observes, “a never-ending challenge to understanding text is staying current with emerging slang and phrases,” and Marie Wallace believes, “if we look to lower quality content (like social media), with inherently ambiguous analysis (like sentiment, opinion, or intent), then it’s still a bit of a crapshoot.”
Turner says “the trend is easy to spot: The interactive question-answering capabilities made famous by IBM’s Watson will become commonplace, offered at a fraction of today’s costs and made available as easy-to-integrate Web services… We will see search and retrieval transform to become dialog-based and be highly aware of an ongoing context. Machines will stay ‘in conversation’ and not treat each search as a unique event.”
In conversational context, Fernando Lucini sees a problem of understanding how information elements link to other pieces of information: “It’s how the information connects that’s critical,” and understanding depends on our ability to tap into the right connections. He sees progress in analytical capabilities being driven swiftly by increasing demand, applying “all sorts of techniques, from unsupervised to supervised [machine learning], from statistical to linguistic and anything in between.”
One particular technique, unsupervised learning, which AlchemyAPI CEO Turner describes as “enabl[ing] machines to discover new words without human-curated training sets,” is often seen as materially advancing language-understanding capabilities, but according to Autonomy CTO Lucini, the real aim is a business one, “making sure that any piece of information fulfills its maximum potential… Businesses need to have a clear view how [information availability] translates to value.”
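A crude sketch of what “discovering new words without human-curated training sets” can mean at its simplest: flag tokens that recur in a corpus but are absent from a known vocabulary. This frequency heuristic is my illustration, not AlchemyAPI’s method; real systems use techniques such as learned word embeddings.

```python
import re
from collections import Counter

# Assumed seed vocabulary; in practice this would be a large dictionary.
KNOWN_VOCAB = {"the", "movie", "was", "so", "good", "that", "a", "party", "in"}

def candidate_new_words(corpus, known=KNOWN_VOCAB, min_count=2):
    """Return out-of-vocabulary tokens seen at least min_count times."""
    counts = Counter(
        tok for doc in corpus for tok in re.findall(r"[a-z]+", doc.lower())
    )
    return sorted(w for w, c in counts.items() if c >= min_count and w not in known)

docs = ["The movie was so fetch", "That party was fetch, totes fetch"]
print(candidate_new_words(docs))  # ['fetch']
```

Even this toy shows the unsupervised idea: no human labeled “fetch” as slang; its repeated use alone surfaces it as a candidate for the lexicon.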
While Marie Wallace says, “we’ve only just scratched the surface in terms of the insights that can be derived from these new advanced learning techniques,” Prof. Pulman notes, “there is usually a long way to get from a neat research finding to an improved or novel product, and the things that researchers value are often less important than speed, robustness and scalability in a practical setting.” (Pulman gave a quite interesting talk, Deep Learning for Natural Language Processing, at the March 6, 2014 Sentiment Analysis Symposium.)
I see mobile computing as opening up a world of opportunity, exploitable in conjunction with advances on a variety of technical and business fronts. Which? I asked my interviewees. The responses bring us back to this article’s starting point, metadata, context, and connection.
Marie Wallace says “Mobile is the mother lode of contextual metadata that will allow us to provide the type of situational insights the contextual enterprise requires.” Add longer-established sources to the picture, and “there is a significant opportunity to be realized in providing integration and analysis (at scale) of social and business data… Once we combine interactional information with the business action, we can derive insights that will truly transform the social business.”
This combination, which I referred to as “synthesis,” is at the core of advanced big-data analytics, the key to solutions from providers that include, in addition to IBM and HP Autonomy, companies such as Digital Reasoning and Palantir.
IBMer Wallace adds, “privacy, ethics, and governance frameworks are going to be increasingly important.”
According to Fernando Lucini, mobile is great for HP Autonomy because it means “more use of information — in our case, human information.” He sees opportunity in three areas: 1) supporting “better and more real-time decisions [that] connect consumer and product,” 2) information governance, because “securing or protecting information, as well as evaluating the risk in information and then being able to act suitably and in accordance with regulation and law, is a considerable integration and synthesis challenge,” and 3) provision of self-service, cloud tools.
Stephen Pulman similarly starts with a technical observation and then draws a lesson about business practices: “One thing we have learned at TheySay is that a combination of text analysis like sentiment along with other, often numerical, data gives insights that you would not get from either in isolation, particularly in the financial services or brand management domains. Finding the right partners with relevant domain expertise is key to unlocking this potential.”
Finally, Elliot Turner discusses the opportunity created by his company’s variety of technology, providing elements such as text analysis and classification and computer vision, via cloud services: “Large public companies are exploring how to incorporate modern text analysis capabilities into their established product lines,” while “innovative startups are charging at full speed with business plans aimed squarely at disrupting… business intelligence, customer support, advertising and publishing, sales and marketing automation, enterprise search, and many other markets.”
So we learn that the opportunity found via big-data analysis takes the form of situational insights, relying on integration and analysis of social and business data. It involves connection, governance, and easier access to strong tools, guided by domain expertise. The aims are expanded capabilities for some, innovation and disruption for the upstarts. Achieve these aims, and you just might have the clear view cited by one of my interviewees — provided via savvy, integrated, all-data analyses — of the path to value.
Read the full Q&A texts by clicking or tapping on the names of interviewees Fernando Lucini (HP Autonomy), Marie Wallace (IBM), Elliot Turner (AlchemyAPI), and Stephen Pulman (University of Oxford and TheySay). Also check out my recent article, Text Analytics 2014.