The Word on Text Mining

[First published in Intelligent Enterprise, December 10, 2003]

As the minutiae of everyday business and personal life migrate to the Internet, it’s small wonder that text search is likely the Web’s most popular function. Who has time nowadays for Web surfing, for meandering through a network of linked pages until you come to something intriguing? We want the express train, a direct link to content. Yet text-search results sometimes seem like the generic wisdom you get randomly from a Magic 8 Ball: They’re so lacking in contextual relevance that they may answer many questions other than the one you’re asking. Text-search results lack the aptness that would follow from understanding the meaning of search terms, rather than just their presence or absence, and from the ability to assess the relevance of a search hit.

Text mining is poised to fill the void, structuring the information inherent in volumes of free text in ways that enable decidedly more intelligent search. There will still be a role for the thoughtful, manual classification and filtering that made Yahoo a winner from its earliest incarnation, and there will still be advantages to intentional, semantic-Web type efforts to categorize content for identification by automated agents. But just as data mining lets you discover hidden relationships in structured data and apply predictive algorithms, text mining will help identify value that you and the manual classifiers and Resource Description Framework wizards didn’t know existed.

Tired of search results presented as pages of hits? Text-mining software implements innovative display and navigation techniques that graphically represent networks of conceptually interrelated documents. Although plenty of pointless graphics, animation, and other whiz-bangs adorn the Web and office software, text-mining interfaces won’t be all glitz. They already harness hyperbolic (zooming) displays and other approaches that deliver results in a navigable, organized form that reflects the underlying structure of the result sets — approaches that add analytic value.

Text mining will let us move from knowledge management to knowledge analytics.

Wordspace

Everyone is familiar with the problem space: Languages and forms of communication are designed for human rather than machine consumption, but people’s daily lives are increasingly mediated by and reliant on information technology, creating a need for innovative modes of human-computer interaction. People and computers often meet halfway, communicating via simple, structured instruction sets tailored for particular processes like operating an automated teller machine. It isn’t feasible for people to go further by learning the variety of languages used to program more sophisticated transactions; instead we expect computers to understand our native languages.

This problem isn’t trivial because the meaning of words is highly dependent on context and may be obscured by slang, irregular grammar, fractured syntax, spelling errors and variations, and imprecision. Interpreting among languages is also difficult when you’re dealing with degrees of incomparability of syntax (composition), semantics (meaning), and alphabet. Humans can overcome these difficulties because we understand abstraction, context, and linguistic variations and can detect and apply patterns. We’re not so good at speed, volume, consistency, and breadth. By breadth I mean an individual’s ability to work in more than a handful of languages, something only the most exceptional people manage.

The challenge — designing information technology that matches human language comprehension while bringing to bear the advantages of automation — defines the playing field for text mining.

Application Space

The most pressing applications for text mining are first in corporate knowledge analytics (making use of the vast stores of non-numeric information that organizations collect in the course of everyday operations) and second in responding to amorphous security threats. I’ve been skeptical of knowledge management for as long as the field has existed, seeing it as little more than providing a search interface on a document warehouse. On the security side, government research programs like the Department of Defense’s mooted Terrorism Information Awareness (TIA), originally known as Total Information Awareness, proposed to analyze very large volumes of structured and unstructured data to detect patterns and forestall terrorist attacks. Congress has had issues with TIA and similar programs and has eliminated their funding. As a result, the programs will likely “go black”: classified secret, continuing out of the spotlight of public scrutiny, and funded via special appropriations.

Many companies and government agencies already use text mining, albeit for very specialized applications. Because competitiveness and security concerns will only grow in the coming years and text mining extends well-understood search and data mining concepts, the scope and pervasiveness of text-mining applications are bound to grow rapidly.

Techniques and Vendors

Text mining is a two-stage process of categorization and classification. First, you figure out how to describe documents and their contents including the concepts they contain, and then you bin documents into the descriptive categories and map inter-document relationships according to the newly detected concepts. This approach is similar to segmentation and classification through data mining; I see data mining’s clusters as analogous to text-mining-generated concepts. Once you have classified according to categories, you can do something akin to OLAP-style slice-and-dice analysis of multidimensional data sets in order to tease interesting details — anomalous or exceptional information — out of the larger document sets.
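To make the two stages concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not any vendor’s method: the “concepts” are just terms that recur across documents, and the stopword list, sample documents, and function names (derive_concepts, bin_documents, map_relationships) are all invented for the example.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny hand-picked stopword list; real systems use linguistic analysis.
STOPWORDS = {"a", "an", "and", "the", "of", "between", "in", "to"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def derive_concepts(docs, min_docs=2):
    """Stage 1: describe the collection by the terms found in at least min_docs documents."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {term for term, n in doc_freq.items() if n >= min_docs}

def bin_documents(docs, concepts):
    """Stage 2a: bin each document under every concept it mentions."""
    bins = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)) & concepts:
            bins[term].add(doc_id)
    return bins

def map_relationships(bins):
    """Stage 2b: link pairs of documents by the concepts they share."""
    links = defaultdict(set)
    for concept, doc_ids in bins.items():
        for a in doc_ids:
            for b in doc_ids:
                if a < b:
                    links[(a, b)].add(concept)
    return links

# Hypothetical sample documents.
docs = {
    "d1": "Merger talks between the bank and the insurer",
    "d2": "The bank reported a quarterly loss",
    "d3": "Insurer announces merger plan",
}

concepts = derive_concepts(docs)
bins = bin_documents(docs, concepts)
links = map_relationships(bins)
```

Once documents sit in concept bins, the slice-and-dice step amounts to set operations on those bins, for example bins["merger"] & bins["insurer"] to find documents that touch both concepts.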

Barak Pridor, president of text-mining vendor ClearForest, describes text-mining steps as “semantic, statistical, and structural analysis that classifies documents and discovers buried persistent entities, event, facts, and relationships” in a process he calls “intelligent hybrid tagging.” Pridor distinguishes document-level tags (descriptive elements like subject and author) from “inner document tags” that work with families of entity types (that is, with conceptual groupings).

Text-mining offerings are by no means uniform. For example, many implementations such as those from Autonomy derive or import taxonomies (hierarchical knowledge representations that include concept definitions) for use in classifying and relating documents. Autonomy’s director of technology strategy, Ron Kolb, claims, “Autonomy is unique in being mathematically based, using pattern matching and statistical analysis across multiple languages and multiple platforms.” Autonomy uses Bayesian statistics, which assess relevance based on prior probabilities, and Claude Shannon’s information theory to facilitate extracting concepts from document sets. The result is to contain the effect of the vagaries of human languages.
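For readers unfamiliar with the Bayesian idea, the sketch below shows a generic naive Bayes text classifier in Python. It is a textbook technique offered only to illustrate how prior probabilities and word evidence combine into a relevance score; it is not a description of Autonomy’s proprietary engine, and the class name, labels, and training snippets are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Generic multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            # Start from the prior: how common the category is overall.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Hypothetical training data.
model = NaiveBayes()
model.train("merger acquisition shares deal", "finance")
model.train("shares earnings profit", "finance")
model.train("attack threat surveillance", "security")
model.train("threat intelligence surveillance network", "security")

finance_guess = model.classify("profit on the shares deal")
security_guess = model.classify("surveillance threat")
```

The appeal of a purely statistical approach like this is that nothing in it depends on the grammar or vocabulary of a particular language; the cost, as critics point out, is that it knows nothing about meaning beyond word co-occurrence.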

Not everyone agrees that a statistically focused approach to categorization is best. Claude Vogel, CTO of Convera, told me, “You cannot build high-level taxonomies and ontologies that way. You can’t escape the manual librarian-style work.” (Roughly put, an ontology provides meaning for a knowledge domain, while a taxonomy organizes that knowledge.) That doesn’t mean that you need an army of taxonomy builders to work with Convera’s RetrievalWare because, as with Autonomy’s products, you can import XML-expressed taxonomies. Convera also shares with Autonomy the distinction of searching media such as audio, images, and video in addition to text.

Autonomy has focused on its mining engine, offering options such as weighting, supporting a large number of languages, and providing interfaces that integrate its products with BI, CRM, ERP, and other enterprise applications. Inxight Software, by contrast, is a notable vendor that, like ClearForest, has devoted significant resources to developing front ends. Inxight’s Star Tree, for example, lets you explore network maps via hyperbolic visualization where segment details are enlarged or collapsed as you move the focus from one map node to another. Inxight, like Autonomy, provides back-end categorization and taxonomy management software to other companies including ClearForest and SAS.

SAS dominates the high-end data analysis market. Its Text Miner incorporates Inxight technology for linguistic analysis and concept extraction but gives the results a statistical spin that few other vendors can match. According to product manager Manya Mayes, Text Miner and the Enterprise Miner data-mining tools are fully integrated, so textual-analysis results become available as structured data to which a full range of traditional analytic approaches can be applied.
