Data Fusion: From Big Data to Even Bigger

Data fusion is hard. The task is to link data of diverse types, from disparate sources, in support of unified search, query, and analysis. Picture Total Information Awareness, designed not for counterterrorism (although military/intelligence remains a leading usage) but instead for marketing, competitive intelligence, customer experience, online commerce, capital markets, and research initiatives. These business applications are more limited in scope, but integration ability is nonetheless becoming a critical contributor to competitiveness.

We’re talking ‘systems thinking’ and the #3 Big Data ‘V’, Variety. Technical approaches range from loosely coupled ‘ecosystems’ (which you’ll hear much about in 2013) to dedicated interfaces and engines, discussed below. The effort is worthwhile: the notion is that the whole is worth more than the sum of its parts. So we seek to handle numbers, text, location, time, image, audio, video, and machine data — from enterprise, online, social, sensor, and server sources — not in isolation, each via a siloed application, but rather as a linked ensemble. Text, transactions, and geolocation are the key components. And the solutions?

Interfaces and Engines

BI, text analytics, semantic computing, and Semantic Web technologies, with overlapping and complementary methods, each help us respond to the variety-fusion need. Let’s explore the technologies and methods, starting with a statement of admiration for all the solution providers that have risen to the challenge.

Providers typically offer search or unified-information dashboards and BI interfaces. The best of them rely on innovative visualization that helps users make sense of complex, dynamic information. Products are applied most frequently in military/intelligence, life sciences and pharma, financial services, and other complex-data domains. The hard stuff, however, happens behind the scenes. The back-end work — identifying sources, compiling and harmonizing data, creating suitable query methods and tools — may involve momentous effort. IBM Watson, designed to play a game but with much grander targets (starting with health care) in sight, reportedly involved 80-100 person-years of effort, on top of untold additional, non-dedicated company resources.

General approaches apply semantic-computing techniques that map data items into common-ground namespaces and controlled vocabularies or master data. They discover linkages that bridge those diverse types and sources. The sum is sense-making, beyond conventional numbers-focused BI and also beyond text analytics, which I have studied for a decade now and define as business intelligence on and from text. Sense-making, further, is topical and situational, intended to provide not a rigid, one-size-fits-all analytical answer but instead one that responds to the needs of the searcher-analyst-user.

BI Evolves

Business Intelligence — the practice and the tools — has evolved to bring information from ‘unstructured’ sources (primarily text) into analysis environments.

Text Analytics

Text analytics is a key Big Data technology, but scalability is a hurdle; so is getting at all the information to be found in sources while accounting for the context in which it appeared and for the user’s needs. Further, no single source or type of data gives a complete picture.

First — All text analytics is semantic, whether linguistic, machine learning, or statistical techniques are involved. “Semantics” is usually defined as meaning. A statistical count of words or terms and their distribution can help you detect the key topics of a document or corpus. Similarly, statistical methods (co-occurrence, similarity measures, clustering) classify content in various ways, another form of ascribing meaning to text. Lexicons, rule sets, and taxonomies, which are typically associated with linguistic NLP and with various types of machine learning, rely on and also generate meaning. This said — any business (or research or government) challenge that involves text whose volume, flow, or nature (e.g., use of inaccessible technical or human language) means you can’t make sufficient sense of it by simply reading is an appropriate candidate for the application of text analytics.
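To make the statistical side concrete, here is a minimal sketch of term weighting plus clustering, assuming scikit-learn is available; the sample documents are invented for illustration.

```python
# Minimal sketch of statistical text analytics: TF-IDF term weighting plus
# k-means clustering to group documents by topic. Assumes scikit-learn is
# installed; the sample documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The central bank raised interest rates to curb inflation.",
    "Bond yields climbed after the rate decision.",
    "The new phone's camera and battery life impressed reviewers.",
    "Smartphone sales rose on strong demand for the latest handset.",
]

# Turn raw text into a term-frequency / inverse-document-frequency matrix.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Cluster the documents; similar term distributions end up grouped together.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for doc, label in zip(docs, km.labels_):
    print(label, doc)
```

Counting and clustering of this sort ascribes a rough kind of meaning to text without any lexicon or grammar, which is the sense in which even purely statistical methods are semantic.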

There are dozens of text analytics solution providers: enterprise giants IBM, HP, and SAP; a constant stream of start-ups; application-focused players such as Attensity, Clarabridge, Daedalus, Expert System, Kana, Lexalytics, Linguamatics, OpenText, and SRA; as-a-service providers AlchemyAPI, OpenAmplify, and Pingar; open-source toolkits including GATE, OpenNLP, and Python NLTK; social-intelligence providers Converseon, NetBase, and Sysomos; and search companies such as Attivio, Exalead, and Sinequa. They all do one form or another of (semantic) text analytics. Cambridge Semantics does tout Anzo’s unified information access capabilities. In doing so, the company positions Anzo to compete with Attivio, Coveo, Oracle Endeca (which relies on text-analytics technology from Lexalytics), and Sinequa, all of which have UIA plays.

The Semantic Web

A writer for a Big Data-analytics platform recently interviewed me for an article profiling a semantic technology player, one that applies Semantic Web technologies to text analytics. The thing is, Semantic Web technologies are rarely used for text analytics. The core technologies — RDF, SPARQL, OWL, URIs — just don’t play much of a role. Rather, the situation is that both text analytics and the Semantic Web call on certain information-extraction and structuring technologies, for different purposes. The Semantic Web may use natural-language processing (NLP), focusing on entity and metadata extraction, to populate information stores. Text analytics uses NLP much more broadly, to extract entities and metadata (title, author, publication date, etc.) and also topics, facts, events, relationships, sentiment, and opinions, as part of business intelligence, data mining, and automated text-processing initiatives.
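To illustrate the narrower, entity-and-metadata slice of NLP described above (not any particular vendor’s pipeline), here is a hedged sketch that assumes spaCy and its small English model are installed; the example sentence is invented.

```python
# Sketch of entity extraction, the narrow slice of NLP the Semantic Web side
# typically relies on to populate information stores. Assumes spaCy and its
# small English model (en_core_web_sm) are installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp. opened a research lab in Boston last year.")

# Entity extraction: names of organizations, places, dates, and so on.
for ent in doc.ents:
    print(ent.text, ent.label_)  # prints entity text and type, e.g. ORG, GPE, DATE

# Broader text analytics would go on to extract topics, facts, relationships,
# sentiment, and opinions; those require additional models or rules not shown here.
```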

The Semantic Web, however, is wholly incapable of addressing these hurdles (scalability, completeness of extraction, context) on its own. The Semantic Web just wasn’t designed for comprehensive analytics. Considering analytics and semantics and enterprise (and personal) information needs: sense-making is the future of Big Data.

Do Semantic Web technologies help in BI and text analytics initiatives, as a means of integrating structured and unstructured data?

Definitely, integration of database-sourced data from transactional and operational systems, and text-sourced data from social, online, and enterprise text, requires semantic capabilities: uniform identifiers, shared vocabularies/master data, shared classifications and taxonomies, and sometimes the ability to do fuzzy matching. The data, whether database- or text-sourced, may be held in an RDF store or a relational or NoSQL format, and it may be SPARQL or SQL queryable or free-text searchable. The key is semantic technologies, not adherence to limited-scope Semantic Web technologies nor SQL database orthodoxy nor any other data-management dogma. True semantic computing bridges all these islands.
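As a rough sketch of what those capabilities can look like in practice, using a toy master-data vocabulary and invented identifiers, standard-library fuzzy matching is enough to reconcile a text-sourced mention with a database-sourced record:

```python
# Sketch of the semantic capabilities described above: a shared master-data
# vocabulary with uniform identifiers, plus fuzzy matching to reconcile a
# text-sourced mention with a database-sourced record. Names and identifiers
# are invented for illustration; only the standard library is used.
from difflib import get_close_matches

# Master data, as it might come from a transactional/operational database.
master = {
    "International Business Machines": "urn:company:0001",
    "Hewlett-Packard": "urn:company:0002",
    "SAP SE": "urn:company:0003",
}

def resolve(mention):
    """Map a free-text mention to a master-data identifier, fuzzily."""
    match = get_close_matches(mention, master.keys(), n=1, cutoff=0.6)
    return master[match[0]] if match else None

# A mention extracted from social or enterprise text.
print(resolve("Hewlett Packard"))   # -> urn:company:0002
```

The point is the shared identifiers and vocabulary, not the particular store or query language that holds them.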

Semantic Technologies

What are the implementation and management issues with these semantic technologies?

It’s hard to get machines to understand human communications (in a way that’s relevant to the user, given his or her particular task), and even harder to join data that originate in disparate sources and to detect signals that aren’t even apparent until you fuse those sources. So the biggest challenge is to find technology that fits your business problem and to apply it right. Management-wise: staffing, and the design of a system that gets the best out of machines with the necessary human guidance, are hard.

Cambridge Semantics argues that Semantic Web technologies were designed to be Web scale. You say that the core technologies can’t address the hurdles facing text analytics, specifically scalability, getting at all of the information in sources, accounting for context, etc. Can you say a little more about why the core Semantic Web technologies fall short of that task?

Definitely, Semantic Web technologies were designed to be Web scale. That doesn’t mean they can do everything, however. You’re not going to use an RDF store and SPARQL to manage and query sensor data or server log files, or to run an online store, or for a BI OLAP engine. RDF and SPARQL are good for “wide” data but not so good for aggregation and slice-and-dice (dimensional) analysis of large, homogeneous datasets. So there are many modern-day computing applications that Semantic Web technologies aren’t well suited for. Getting at all the information in sources: right now, Semantic Web-ers focus on entity extraction — on finding names of people, companies, places, chemical compounds, etc. — with little regard for features such as sentiment (mood, tone, emotion, opinion, intent) or even for the factual content of text sources. Information extraction, for Semantic Web-ers, means parsing the names out of “Barack Obama is president of the United States” while ignoring the relationship text that connects those names. Instead, SW-ers will resolve Barack Obama and the United States to URIs (uniform resource identifiers) and be able to use the relationship only if someone has captured it in an ontology.
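To make the Obama example concrete, here is a minimal sketch using rdflib (assumed installed; the DBpedia-style URIs and the presidentOf predicate are chosen for illustration): resolving the entities to URIs is easy, but the connecting relationship becomes queryable only once someone records it as a triple.

```python
# Sketch of the point above, using rdflib (assumed installed): the two
# entities resolve to URIs readily, but the relationship that connects them
# is only usable if it is explicitly recorded as a triple. The presidentOf
# predicate and example.org namespace are invented for illustration.
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/")
g = Graph()

obama = URIRef("http://dbpedia.org/resource/Barack_Obama")
usa = URIRef("http://dbpedia.org/resource/United_States")

# Without this triple, the two URIs sit in the store unconnected.
g.add((obama, EX.presidentOf, usa))

# Only now can SPARQL answer "who is president of what?"
q = """
SELECT ?person ?country
WHERE { ?person <http://example.org/presidentOf> ?country . }
"""
for person, country in g.query(q):
    print(person, country)
```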

If you were to build a competitive intelligence application that pulls information from Web news feeds and matches it to internal data, what technologies would you use?

To build the competitive intelligence app you describe, I’d pull information as you suggest, but far more exhaustively than the SW-ers do. I would resolve the entities *and the extracted topics and concepts* to URIs, controlled vocabularies, and taxonomies that the internal data systems also use, in order to establish the linkage. What we’re describing is *semanticizing* information without the Semantic Web.
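A rough sketch of that pipeline, not a production design: pull a news feed, extract entities, and resolve them against the controlled vocabulary the internal systems already use. It assumes feedparser and spaCy are installed; the feed URL and the vocabulary are placeholders.

```python
# Rough sketch of the competitive-intelligence pipeline described above:
# pull Web news items, extract entities, and resolve them to the controlled
# vocabulary / identifiers that internal systems also use. Assumes feedparser
# and spaCy (en_core_web_sm) are installed; the feed URL and the vocabulary
# entries are placeholders, not a real configuration.
import feedparser
import spacy

FEED_URL = "https://example.com/industry-news.rss"   # placeholder feed
INTERNAL_VOCAB = {                                    # placeholder master data
    "Acme Corp": "crm:account/ACME",
    "Globex": "crm:account/GLOBEX",
}

nlp = spacy.load("en_core_web_sm")

for entry in feedparser.parse(FEED_URL).entries:
    doc = nlp(entry.get("title", "") + ". " + entry.get("summary", ""))
    for ent in doc.ents:
        if ent.label_ == "ORG" and ent.text in INTERNAL_VOCAB:
            # Linkage established: news item <-> internal account record.
            print(INTERNAL_VOCAB[ent.text], "<-", entry.get("link", ""))
```

In a fuller version, topics and concepts would be resolved the same way, via taxonomies and fuzzy matching rather than an exact dictionary lookup.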
