InformationWeek recently published my article, Semantic Web Business: Going Nowhere Slowly. I wrote, “The SemWeb dream centers on sharing linked data via the W3C’s Resource Description Framework protocol,” which elicited a tweet, “#RDF (Resource Description Framework) isn’t a protocol. As its name implies: it’s a Framework.”
Uh, yeah, I could’ve written “via the W3C’s Resource Description Framework framework.” That would’ve been like writing “ATM machine” or (double fault) “sandwich with au jus sauce,” a redundant repetition. (ATM stands for Automated Teller Machine, and the French au jus means “with [its own] juice.”) I’m not into redundancy, so I say lay off.
But is my characterization of RDF as a “protocol” wrong? Obviously I don’t think so. To me, a protocol is a specification or mechanism meant to guide collaborative work. (I made that definition up on the spot.) But to others, what is RDF anyway? I did some quick research using the world’s favorite research tool, Google, and learned that RDF is…
- An official recommendation. (Wikipedia disambiguation page: “Resource Description Framework, an official W3C Recommendation for Semantic Web data models.”)
- A family of specifications. (Wikipedia: “The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications.”)
- A (single) specification. (Mozilla.org: “RDF is a W3C specification, and is their recommended technology for (meta)data interchange on the Web.”)
- A technology. (See above, and Kingsley Idehen: “RDF is a very misunderstood and poorly promoted technology (long story!)”)
- A standard data model. (W3C: “RDF is a standard model for data interchange on the Web.”)
- An encoding standard. (rdf:about: “This is an introduction to RDF (‘Resource Description Framework’), which is the standard for encoding metadata and other knowledge on the Semantic Web.”)
- [Just] a standard. (Arto Bendiken: “RDF is a W3C standard for modeling and sharing distributed knowledge based on a decentralized open-world assumption.”)
- A method for expressing knowledge. (rdf:about: “RDF is a method for expressing knowledge in a decentralized world and is the foundation of the Semantic Web.”)
- A structure. (O’Reilly: “The Resource Description Framework (RDF) is a structure for describing and interchanging metadata on the Web.”)
- An infrastructure. (Eric Miller: “The Resource Description Framework (RDF) is an infrastructure that enables the encoding, exchange and reuse of structured metadata.”)
Sadly for me, Google turned up only three exact matches on “RDF is a protocol.” I concede. My description of RDF was… unusual. Perhaps I should have written, “the Resource Description Framework is a framework.” I promise that next time, when I mention RDF again, subsequently, down the line, in the future, I’ll be more careful of precedent. Meanwhile, I urge the rest of you, other than my tweeting correspondent, not to get hung up on pointless corrections.
Data fusion is hard. The task is to link data of diverse types, from disparate sources, in support of unified search, query, and analysis. Picture Total Information Awareness, designed not for counterterrorism (although military/intelligence remains a leading usage) but instead for marketing, competitive intelligence, customer experience, online commerce, capital markets, and research initiatives. These business applications are limited in scope, but integration ability is nonetheless becoming a critical contributor to competitiveness.
We’re talking ‘systems thinking’ and the #3 Big Data ‘V’, Variety. Technical approaches range from ‘ecosystems’ (which you’ll hear much about in 2013) to […]. The effort is worthwhile: The notion is that a whole is worth more than the sum of its parts. So we seek to handle numbers, text, location, time, image, audio, video, and machine data — from enterprise, online, social, sensor, and server sources — not in isolation, each via a siloed application, but rather as a linked ensemble. Text, transactions, and geolocation are the key components. And the solutions?
Interfaces and Engines
BI, text analytics, semantic computing, and Semantic Web technologies, with overlapping and complementary methods, each help us respond to the variety-fusion need. Let’s explore the technologies and methods, starting with a statement that I admire all solution providers that have risen to the challenge.
Providers typically offer search or unified-information dashboards and BI interfaces. The best of them rely on innovative visualization that helps users make sense of complex, dynamic information. Products are applied most frequently in military/intelligence, life sciences and pharma, financial services, and other complex-data domains. The hard stuff, however, happens behind the scenes. The back-end work — identifying sources, compiling and harmonizing data, creating suitable query methods and tools — may involve momentous effort. IBM Watson, designed to play a game but with much grander targets (starting with health care) in sight, reportedly involved 80-100 person-years of effort, on top of untold additional non-dedicated company resources.
General approaches apply semantic-computing techniques that map data items into common-ground namespaces and controlled vocabularies or master data. They discover linkages that bridge those diverse types and sources. The sum is sense-making, beyond conventional numbers-focused BI and also beyond text analytics, which I have studied for a decade now and define as business intelligence on and from text. Sense-making, further, is topical and situational, intended to provide not a rigid, one-size-fits-all analytical answer but instead one that responds to the needs of the searcher-analyst-user.
Business Intelligence — the practice and the tools — has evolved to bring information from ‘unstructured’ sources (primarily text) into analysis environments.
Text analytics is a key Big Data technology, but scalability is a hurdle, and so is getting at all the information to be found in sources while accounting for the context in which it appeared and for the user’s needs. Further, no single source or type of data gives a complete picture.
First — All text analytics is semantic, whether linguistic, machine learning, or statistical techniques are involved. “Semantics” is usually defined as meaning. A statistical count of words or terms and their distribution can help you detect the key topics of a document or corpus. Similarly, statistical methods (co-occurrence, similarity measures, clustering) classify content in various ways, another form of ascribing meaning to text. Lexicons, rule sets, and taxonomies, which are typically associated with linguistic NLP and with various types of machine learning, rely on and also generate meaning. This said — Any business (or research or government) challenge that involves text whose volume, flow, or nature (e.g., use of inaccessible technical or human language) is more than you can handle by reading — anything you can’t make sufficient sense of by simply reading it — is an appropriate candidate for application of text analytics.
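To make the word-count point concrete, here is a minimal sketch of topic detection by simple term frequency. The stop-word list, the sample text, and the function name `key_terms` are all assumptions made up for illustration; real text analytics tools use far richer tokenization and weighting.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "with"}

def key_terms(text, top_n=3):
    """Return the top_n most frequent non-stop-word terms with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up sample document.
doc = (
    "The bank raised rates. Rates on savings rose, and the bank "
    "expects rates to keep rising for savings customers."
)
print(key_terms(doc))
```

Even this crude count surfaces "rates", "bank", and "savings" as the dominant terms, which is the sense in which a purely statistical method ascribes meaning to text.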
There are dozens of text analytics solution providers, ranging from enterprise giants IBM, HP, and SAP to a constant stream of start-ups, by way of application-focused players such as Attensity, Clarabridge, Daedalus, Expert System, Kana, Lexalytics, Linguamatics, OpenText, and SRA; as-a-service providers AlchemyAPI, OpenAmplify, and Pingar; open-source toolkits including GATE, OpenNLP, and Python NLTK; social-intelligence providers Converseon, NetBase, and Sysomos; and search companies such as Attivio, Exalead, and Sinequa. They all do one form or another of (semantic) text analytics. Cambridge Semantics does tout Anzo’s unified information access capabilities. In doing so, the company positions Anzo to compete with Attivio, Coveo, Oracle Endeca (which relies on text-analytics technology from Lexalytics), and Sinequa, all of which have UIA plays.
The Semantic Web
A writer for a Big Data-analytics platform recently interviewed me for an article profiling a semantic technology player, one that tackles Semantic Web technologies to […]. Semantic Web technologies are rarely used for text analytics. The core technologies — RDF, SPARQL, OWL, URIs — just don’t play much of a role. Rather, the situation is that both text analytics and the Semantic Web call on certain information-extraction and structuring technologies, for different purposes. The Semantic Web may use natural-language processing (NLP), focusing on entity and metadata extraction, to populate information stores. Text analytics uses NLP, much more broadly, to extract entities and metadata (title, author, publication date, etc.) and also topics, facts, events, relationships, sentiment, and opinions, as part of business intelligence, data mining, and automated text processing initiatives.
The Semantic Web, however, is wholly incapable of addressing these hurdles. The Semantic Web just wasn’t designed for comprehensive analytics. Considering analytics and semantics and enterprise (and personal) information needs: Sense-making is the future of Big Data.
Do Semantic Web technologies help in BI and text analytics initiatives, as a means of integrating structured and unstructured data?
Definitely, integration of database-sourced data from transactional and operational systems, and text-sourced data from social, online, and enterprise text, requires semantic capabilities: uniform identifiers, shared vocabularies/master data, shared classifications and taxonomy, and sometimes ability to do fuzzy matching. The data, whether database- or text-sourced, may be held in an RDF store or a relational or NoSQL format, and it may be SPARQL or SQL queryable or free-text searchable. The key is semantic technologies, not adherence to limited-scope Semantic Web technologies nor SQL database orthodoxy nor any other data-management dogma. True semantic computing bridges all these islands.
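As one illustration of the fuzzy-matching capability mentioned above, here is a minimal sketch that maps a misspelled, text-sourced company name onto a master-data entry. It uses `difflib.get_close_matches` from the Python standard library; the master list, the cutoff value, and the function name `match_name` are assumptions chosen for the example.

```python
import difflib

# Hypothetical master-data list of canonical company names (an assumption).
MASTER_NAMES = ["International Business Machines", "Hewlett-Packard", "SAP"]

def match_name(raw, cutoff=0.6):
    """Return the closest master-data name for a raw string, or None if nothing is close enough."""
    candidates = difflib.get_close_matches(raw, MASTER_NAMES, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

print(match_name("Hewlet Packard"))  # a misspelling as it might appear in text
print(match_name("Zzz"))             # nothing similar in the master list
```

Character-level similarity is only one option. It would not, for instance, connect an abbreviation such as "HP" to its full name, which is where shared vocabularies and identifiers come in.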
What are the implementation and management issues with these semantic technologies?
It’s hard to get machines to understand human communications (in a way that’s relevant to the user, given his or her particular task), and even harder to join data that originate in disparate sources and to detect signals that aren’t even apparent until you fuse those sources. So the biggest challenge is to find technology that fits your business problem and to apply it right. Management-wise: staffing, and design of a system that gets the best out of machines with necessary human guidance, is hard.
Cambridge Semantics argues that the Semantic Web technologies were designed to be web scale. You say that the core technologies can’t address the hurdles facing text analytics, specifically scalability, getting at all of the information in sources, accounting for context, etc. Can you say a little more about why the core Semantic Web technologies fall short of that task?
Definitely, Semantic Web technologies were designed for Web scale. That doesn’t mean they can do everything, however. You’re not going to use an RDF store and SPARQL to manage & query sensor data or server log files, or to run an online store, or for a BI OLAP engine. RDF and SPARQL are good for “wide” data but not so good for aggregation and slice-and-dice (dimensional) analysis of large, homogeneous datasets. So there are many modern-day computing applications that Semantic Web technologies aren’t well suited for. Getting at all the information in sources: Right now, Semantic Web-ers focus on entity extraction — on finding names of people, companies, places, chemical compounds, etc. — with little regard for features such as sentiment (mood, tone, emotion, opinion, intent) or even for the factual content of text sources. Information extraction, for Semantic Web-ers, means parsing the names out of “Barack Obama is president of the United States” while ignoring the relationship text that connects those names. Instead, SW-ers will resolve Barack Obama and the United States to URIs (uniform resource identifiers) and be able to use the relationship only if someone has captured it in an ontology.
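The contrast between entity-only extraction and extraction that also keeps the connecting relationship text can be sketched in a few lines. This is a toy under stated assumptions: the `example.org` URIs are placeholders, the lexicon holds only the two names from the example sentence, and the pattern simply takes whatever text sits between two known names.

```python
import re

# Hypothetical entity lexicon mapping names to placeholder URIs (assumptions).
ENTITY_URIS = {
    "Barack Obama": "http://example.org/id/Barack_Obama",
    "the United States": "http://example.org/id/United_States",
}

def extract_entities(sentence):
    """Entity-only extraction: return URIs for the known names found in the sentence."""
    return [uri for name, uri in ENTITY_URIS.items() if name in sentence]

def extract_relation(sentence):
    """Also keep the text that connects two known names, as (subject, relation, object)."""
    names = "|".join(re.escape(n) for n in ENTITY_URIS)
    match = re.search(rf"({names})\s+(.+?)\s+({names})", sentence)
    if match is None:
        return None
    subj, relation, obj = match.groups()
    return (ENTITY_URIS[subj], relation, ENTITY_URIS[obj])

sentence = "Barack Obama is president of the United States"
print(extract_entities(sentence))
print(extract_relation(sentence))
```

The first function yields two identifiers and nothing about how they relate. The second retains "is president of" straight from the text, without waiting for someone to have captured that relationship in an ontology.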
If you were to build a competitive intelligence application that pulls information from Web news feeds and matches it to internal data, what technologies would you use?
To build the competitive intelligence app you describe, I’d pull information as you describe, but far more exhaustively than the SW-ers do. I would resolve the entities *and the extracted topics and concepts* to URIs, controlled vocabularies, and taxonomies that the internal data systems also use, in order to establish the linkage. What we’re describing is *semanticizing* information without the Semantic Web.
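A minimal sketch of that linkage, assuming a small controlled vocabulary shared by the news pipeline and the internal systems. Every name, ID, and record here is hypothetical, and the substring matching stands in for real entity and topic extraction.

```python
# Hypothetical controlled vocabulary: surface forms -> shared concept IDs.
VOCABULARY = {
    "acme corp": "org:acme",
    "acme": "org:acme",
    "globex": "org:globex",
    "price cut": "topic:pricing",
    "recall": "topic:product-recall",
}

# Hypothetical internal records keyed by the same concept IDs.
INTERNAL_ACCOUNTS = {
    "org:acme": {"segment": "competitor", "owner": "east-region team"},
    "org:globex": {"segment": "partner", "owner": "alliances team"},
}

def resolve(text):
    """Return the sorted concept IDs whose surface forms occur in the text."""
    lowered = text.lower()
    return sorted({cid for form, cid in VOCABULARY.items() if form in lowered})

def link(news_items):
    """Attach matching internal records to each news item via the shared IDs."""
    linked = []
    for item in news_items:
        ids = resolve(item)
        records = {cid: INTERNAL_ACCOUNTS[cid] for cid in ids if cid in INTERNAL_ACCOUNTS}
        linked.append({"text": item, "concepts": ids, "internal": records})
    return linked

# Made-up headlines standing in for a Web news feed.
feed = ["Acme Corp announces price cut on widgets", "Globex issues recall notice"]
for row in link(feed):
    print(row["concepts"], "->", row["internal"])
```

The join works because both sides use the same identifiers, not because either side is stored as RDF. The topic IDs ride along with the organization IDs, so an analyst could filter competitor news by subject.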
A recent blog article of mine (thankfully) gave rise to a number of off-topic comments concerning the meaning of semantic content enrichment. As Marie Wallace of IBM remarked, it’s great to see the term semantic content enrichment generating discussion, although she continued, “I suspect that most people still don’t differentiate it from just text analytics.”
There is a difference. Let’s explore it via the definitions that follow, first of text analytics, then content analytics, and finally content enrichment and where the ensemble takes us. First definition –
Text analytics is a set of software and transformational steps that discover business value in “unstructured” text. (Analytics in general is a process, not just algorithms and software.) The aim is to improve automated text processing, whether for search, classification, data and opinion extraction, business intelligence, or other purposes.
To expand on this definition a bit, to bridge from text to the wider content world:
Text analytics draws on data mining and visualization and also on natural-language processing (NLP). Supplement NLP with technologies that recognize patterns and extract information from images, audio, video, and composites and you have content analytics.
The concept of content enrichment is easy to grasp: Every link in this article — Web links are accomplished via the HTML “a” anchor tag — is a bit of content enrichment. And semantic content enrichment? Marie Wallace puts it this way, focusing on text but with concepts that extend to the broad set of content types:
When I think about semantic enrichment, I see it as transforming a piece of content into a linked data source. In order to do this you do indeed need text analytics for entity and relationship extraction, but you need more than that…. A text analytics engine might recognize that [Marie Wallace] is a person, [Ireland] is a place, and Marie comes from Ireland and annotate the entities/relationships found. However when doing semantic enrichment, I would want to convert those annotations to openly addressable URIs that contribute to the linked data cloud.
URIs are uniform resource identifiers, Semantic Web terminology for IDs, unique within a namespace, that name or locate things. Web URLs (e.g., http://whitehouse.gov/) are a type of URI.
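Here is a minimal sketch of the step Marie Wallace describes: taking annotations a text analytics engine might produce and converting them to URI-based statements, written out in the style of N-Triples. The `example.org` namespace, the `comesFrom` relation name, and the helper functions are assumptions for illustration. A real system would more likely reuse identifiers already published in the linked data cloud than mint its own.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # placeholder namespace (an assumption)

def to_uri(kind, label):
    """Mint a URI for an annotated entity under the placeholder namespace."""
    return f"{BASE}{kind}/{quote(label.replace(' ', '_'))}"

def enrich(annotations, relations):
    """Convert entity and relationship annotations into N-Triples-style lines."""
    uris = {label: to_uri(kind, label) for label, kind in annotations.items()}
    return [f"<{uris[s]}> <{BASE}rel/{p}> <{uris[o]}> ." for s, p, o in relations]

# Annotations as a text analytics engine might report them (hypothetical format).
annotations = {"Marie Wallace": "person", "Ireland": "place"}
relations = [("Marie Wallace", "comesFrom", "Ireland")]

for line in enrich(annotations, relations):
    print(line)
```

The annotations alone say that a span of text is a person or a place. The enriched output names each thing with an address that other data sources can point to.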
Rather than write my own annotation elaboration, I’ll reuse one from the Web site of Ontotext, a semantic-technology developer:
Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc. to a document or to a selected part in a text. It provides additional information (metadata) about an existing piece of data.
Semantic Annotation goes one level deeper:
- It enriches the unstructured or semi-structured data with a context that is further linked to the structured knowledge of a domain.
- It allows results that are not explicitly related to the original search.
The earliest specific semantic content enrichment reference I’ve encountered is in an Ontotext paper, Towards Semantic Web Information Extraction, presented at the 2003 International Semantic Web Conference (ISWC). The paper covers work based on Ontotext’s Knowledge and Information Management (KIM) platform, which in turn relies on GATE, the General Architecture for Text Engineering, an open-source text-analysis framework and toolkit, Apache Lucene, and other technologies. The Ontotext folks have other, related papers posted on the company Web site.
The Ontotext materials help explain the role text/content analytics can and should (but doesn’t often enough) play as a Semantic Web generator. The entities, concepts, events, and other features discerned, via content analytics, in text and rich media not only enable smart content; they can also be loaded to knowledge bases (which I won’t get into here, other than to say that systems such as IBM Watson and Wolfram Alpha use them) and Semantic Web triple stores.
There are other solution providers in the content analytics meets semantic annotation/enrichment game. In addition to IBM and Ontotext, they include HP Autonomy, MarkLogic, OpenText, Temis, and the nascent, open-source IKS project. Other vendors offer enterprise-strength building blocks, for instance, SAS via the various SAS Text Analytics components.
I’m sold on this stuff given the business benefits for content producers and content consumers alike. These technologies — and the interplay between analytics and semantics — are key in making sense of the digital universe.