Smart Content Re-viewed: Text Analytics and Semantic Content Enrichment

A recent blog article of mine (thankfully) gave rise to a number of on-topic comments concerning the meaning of semantic content enrichment. As Marie Wallace of IBM remarked, it’s great to see the term semantic content enrichment generating discussion, although, she continued, “I suspect that most people still don’t differentiate it from just text analytics.”

There is a difference. Let’s explore it via the definitions that follow, first of text analytics, then content analytics, and finally content enrichment and where the ensemble takes us. First definition —

Text analytics is a set of software and transformational steps that discover business value in “unstructured” text. (Analytics in general is a process, not just algorithms and software.) The aim is to improve automated text processing, whether for search, classification, data and opinion extraction, business intelligence, or other purposes.

To expand on this definition a bit, to bridge from text to the wider content world:

Text analytics draws on data mining and visualization and also on natural-language processing (NLP). Supplement NLP with technologies that recognize patterns and extract information from images, audio, video, and composites and you have content analytics.

(I reused here definitions I gave Jen Roberts of Collective Intellect in an interview she blogged.)

The concept of content enrichment is easy to grasp: Every link in this article — Web links are accomplished via the HTML “a” anchor tag — is a bit of content enrichment. And semantic content enrichment? Marie Wallace puts it this way, focusing on text but with concepts that extend to the broad set of content types:

When I think about semantic enrichment, I see it as transforming a piece of content into a linked data source. In order to do this you do indeed need text analytics for entity and relationship extraction, but you need more than that…. A text analytics engine might recognize that [Marie Wallace] is a person, [Ireland] is a place, and Marie comes from Ireland and annotate the entities/relationships found. However when doing semantic enrichment, I would want to convert those annotations to openly addressable URIs that contribute to the linked data cloud.

URIs are uniform resource identifiers, Semantic Web terminology for IDs, unique within a namespace, that name or locate things. Web URLs (e.g., http://whitehouse.gov/) are a type of URI.
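To make the step from annotations to URIs concrete, here is a minimal sketch in Python. It is an illustration only, not how IBM's or anyone else's engine works: the annotation format, the http://example.org/ namespace, and the helper names (to_uri, enrich, to_ntriples) are all assumptions of mine. The only real-world identifier is the standard rdf:type URI.

```python
from urllib.parse import quote

# Hypothetical namespace; example.org is reserved for illustrations.
BASE = "http://example.org/"
# The standard RDF "is a" predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_uri(kind, label):
    """Mint an addressable URI for an entity, class, or property label."""
    return f"{BASE}{kind}/{quote(label.replace(' ', '_'))}"


# Annotations as a text analytics engine might emit them (made-up format).
entities = [
    {"text": "Marie Wallace", "type": "Person"},
    {"text": "Ireland", "type": "Place"},
]
relations = [("Marie Wallace", "comesFrom", "Ireland")]


def enrich(entities, relations):
    """Turn plain annotations into (subject, predicate, object) URI triples."""
    uris = {e["text"]: to_uri("resource", e["text"]) for e in entities}
    triples = []
    for e in entities:
        triples.append((uris[e["text"]], RDF_TYPE, to_uri("class", e["type"])))
    for subj, pred, obj in relations:
        triples.append((uris[subj], to_uri("property", pred), uris[obj]))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = enrich(entities, relations)
print(to_ntriples(triples))
```

The text analytics engine's job ends with the entities and relations lists. The enrichment step is the part that gives each of them a URI that other data sets could link to.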

Rather than write my own annotation elaboration, I’ll reuse one from the Web site of Ontotext, a semantic-technology developer:

Annotation, or tagging, is about attaching names, attributes, comments, descriptions, etc. to a document or to a selected part in a text. It provides additional information (metadata) about an existing piece of data.

Semantic Annotation goes one level deeper:

  • It enriches the unstructured or semi-structured data with a context that is further linked to the structured knowledge of a domain.
  • It allows results that are not explicitly related to the original search.
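The second point, results not explicitly related to the original search, can be sketched with a toy example. Everything here is hypothetical: the ex: identifiers, the three documents, and the one-hop locatedIn lookup are my own stand-ins for a real domain knowledge base and are not taken from Ontotext's KIM.

```python
# Toy domain knowledge as (subject, predicate, object); identifiers are made up.
KNOWLEDGE = {
    ("ex:Dublin", "ex:locatedIn", "ex:Ireland"),
    ("ex:Cork", "ex:locatedIn", "ex:Ireland"),
    ("ex:Sofia", "ex:locatedIn", "ex:Bulgaria"),
}

# Semantic annotations: document id -> entity identifiers found in that document.
ANNOTATIONS = {
    "doc1": {"ex:Dublin"},
    "doc2": {"ex:Sofia"},
    "doc3": {"ex:Ireland"},
}


def related_entities(entity):
    """The entity itself plus everything directly located in it (one hop only)."""
    located = {s for s, p, o in KNOWLEDGE if p == "ex:locatedIn" and o == entity}
    return {entity} | located


def semantic_search(entity):
    """Return ids of documents annotated with the entity or a related entity."""
    wanted = related_entities(entity)
    return sorted(doc for doc, found in ANNOTATIONS.items() if found & wanted)


print(semantic_search("ex:Ireland"))  # ['doc1', 'doc3']
```

Here doc1 is returned for a search on Ireland even though it is annotated only with Dublin. The link comes from the knowledge base, not from the document's text.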

The earliest specific semantic content enrichment reference I’ve encountered is in an Ontotext paper, Towards Semantic Web Information Extraction, presented at the 2003 International Semantic Web Conference (ISWC). The paper covers work based on Ontotext’s Knowledge and Information Management (KIM) platform, which in turn relies on GATE (the General Architecture for Text Engineering, an open-source text-analysis framework and toolkit), Apache Lucene, and other technologies. The Ontotext folks have other, related papers posted on the company Web site.

The Ontotext materials help explain the role text/content analytics can and should (but doesn’t often enough) play as a Semantic Web generator. The entities, concepts, events, and other features discerned, via content analytics, in text and rich media not only enable smart content; they can also be loaded into knowledge bases (which I won’t get into here, other than to say that systems such as IBM Watson and Wolfram Alpha use them) and Semantic Web triple stores.
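For readers who have not met a triple store, the idea can be sketched in a few lines of Python. This is a deliberately tiny in-memory stand-in, assuming made-up ex: identifiers. The class name TinyTripleStore and its methods are my own invention and do not correspond to any real product's API. Production triple stores add persistence, indexing, and a query language such as SPARQL.

```python
from collections import defaultdict


class TinyTripleStore:
    """A toy in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)  # simple index for subject lookups

    def add(self, s, p, o):
        triple = (s, p, o)
        self._triples.add(triple)
        self._by_subject[s].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = self._by_subject.get(s, set()) if s is not None else self._triples
        return sorted(
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        )


store = TinyTripleStore()
# Features discerned by content analytics, loaded as triples (made-up identifiers).
store.add("ex:MarieWallace", "rdf:type", "ex:Person")
store.add("ex:Ireland", "rdf:type", "ex:Place")
store.add("ex:MarieWallace", "ex:comesFrom", "ex:Ireland")

# "Who or what is connected to Ireland?"
print(store.match(o="ex:Ireland"))
```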

There are other solution providers in the content analytics meets semantic annotation/enrichment game. In addition to IBM and Ontotext, they include HP Autonomy, MarkLogic, OpenText, Temis, and the nascent, open-source IKS project. Other vendors offer enterprise-strength building blocks, for instance, SAS via the various SAS Text Analytics components.

I’m sold on this stuff given the business benefits for content producers and content consumers alike. These technologies — and the interplay between analytics and semantics — are key in making sense of the digital universe.

7 thoughts on “Smart Content Re-viewed: Text Analytics and Semantic Content Enrichment”

    1. Irene, yours is a very fair question. The short answer is, my list was of “other solution providers in the content analytics meets semantic annotation/enrichment game,” and I don’t know Xerox as a solution provider. I do know of Xerox (and of PARC, ex-Xerox) as a language-technology research organization. Checking out your Web site now, I do see a solution page, http://www.xrce.xerox.com/Technology-Showroom/Technologies/Cutting-Through-Information-Overload . Perhaps I hadn’t been aware of the listed tools because they are narrowly marketed or available only as part of a consulting engagement. It’s unclear. In any case, I’d welcome learning more. Perhaps we could set up a Xerox briefing? I’m at grimes(at)altaplana.com.

  1. Please see my semantic annotation tool, http://code.google.com/p/autometa/. It is an environment for semi-automatic (or automatic) annotation and meta-annotation of documents for publishing on the Web, using the W3C-recommended RDFa annotation language. It also includes an RDFa extraction tool to provide the user with a view of the annotated triples.

    thanks,
    celso
