Unstructured Data and the 80 Percent Rule

It’s a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. The figure is very widely cited by analysts, vendors (including Clarabridge President Justin Langseth), and users alike, all seeking to make the case for text analytics. There are variations; Anant Jhingran of IBM Research, among others, cites an 85% figure. Whether 80 or 85 percent, the claim has clearly taken on a life of its own. It has been repeated many thousands of times. But for all of us who cite these figures: Where did they come from? More to the point, are they true, and are they useful? Let’s explore these questions.

It does seem obvious that a very high proportion of data is unstructured: How much of your workday is spent reading or writing e-mails, reports, or articles and the like, in conversations, or listening to live or recorded audio? And in making the case for tapping unstructured sources, a very important asset in fields ranging from customer experience management to counter-terrorism, it’s helpful to be able to quantify the proportion, to put a number on it.

The earliest really solid treatment of the topic I can find is offered in a 1998 Merrill Lynch report on Enterprise Information Portals. Authors Christopher C. Shilakes and Julie Tylman saw portals as an “emerging concept” that would “broaden market opportunities for the Content Management, Business Intelligence, and Database vendors.” The content-management opportunity derived from the assessment that “unstructured data comprises the vast majority of data found in an organization. Some estimates run as high as 80%.”

The Merrill Lynch authors were not reporting primary research, that is, findings from an actual study. Neither were the Gartner, Butler Group, Yankee Group, Outsell, and other analyst sources I tracked down. The most helpful lead I got was from IDC analyst Sue Feldman, who said, “That widely quoted 80% comes from an IBM study that was done quite a while ago, perhaps 15 years or more.”

The History Behind the Fact

While work in computational linguistics dates back decades – computer scientists working on artificial intelligence have targeted natural language since AI’s earliest days – text mining was only starting to emerge as an academic discipline 15 years ago, in the early ’90s. It’s only in the last 5-10 years that commercial-grade text technologies have emerged in the market, realizing the potential hinted at by Joseph Weizenbaum’s pioneering 1964 Eliza system, which applied basic pattern matching and linguistic rules to cleverly and successfully mimic a psychotherapist in a dialogue with a patient.

Advances in the intervening years notably included the emergence of document-management systems in the early ’90s, more recently repositioned as enterprise content management (ECM) systems. These systems, and also the object-relational database systems that appeared in the mid-’90s, used a repository or DBMS to store and index unstructured data. ECM is more popular than ever before, boosted by compliance best practices and mandates such as the new 2006 e-discovery rules. The “outside the firewall” Web is, however, a primary unstructured information source for just about every enterprise knowledge worker. ECM can’t manage the Web, so it will never on its own support truly comprehensive, content-sourced business intelligence (BI).

The earliest 80%-ish reference I found dates to the pre-text analytics 1990s (and didn’t cite 80% at all). A September 1991 Software magazine article says, “Andy Rehn, vice president of marketing for Data Base Architects, said that as much as 90 percent of the information business uses is non-numerical, freeform data.” In those years, Data Base Architects (as DBA Software) sold document-management software. Similarly, a February 1996 Oracle press release quotes Brett Newbold, then vice president of Oracle’s ConText Group, part of an Oracle object-relational initiative, as saying “ninety percent of digitally stored data is unstructured information, mostly text.”

Those ’90s systems were a start in refocusing BI on text, per IBM researcher Hans Peter Luhn’s original, 1958 definition of business intelligence systems as deriving knowledge from document sources. Contrary to Luhn’s conception, BI grew up on structured data. As Prabhakar Raghavan of Yahoo Research has noted, “the bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.”

Only recently has the value equation changed, courtesy of text mining technology. So while I have no idea where Rehn’s and Newbold’s ’90s figures came from, nor can I find original research leading to the more recent 80%-unstructured number, I do know that the volume of both unstructured information – e-mail, corporate documents, news and blog articles, Web pages, etc. – and data captured in structured form in relational databases – financial transactions, call records, sensor and RFID data streams, and so on – continues to grow dramatically. (I don’t, however, buy IBM’s July 2006 assertion, “just four years from now, the world’s information base will be doubling in size every 11 hours,” which, I note, was offered without source or supporting data.)

Now it’s time to introduce dissenting numbers. Philip Russom of the Data Warehousing Institute conducted a late 2006 unstructured-data study. Responses reported in BI Search and Text Analytics: New Additions to the BI Technology Stack put “structured data in first place at 47%, trailed by unstructured (31%) and semi-structured data (22%).” Philip does explain that even though semi-structured and unstructured data sum to 53%, “far short of the 80-85% mark claimed by other research organizations,… the discrepancy is probably due to the fact that TDWI surveyed data management professionals who deal mostly with structured data and rarely with unstructured data.”

Unstructured Data Does Matter

I also wonder about the information density of unstructured versus structured sources. That is, there’s an awful lot of linguistic chaff – all the narrative involved in human communications – that is of little use for business purposes. That’s why we apply information extraction (IE) techniques to mine unstructured sources for just what is useful and can be analyzed, whether to monitor customer satisfaction or identify potentially fraudulent financial transactions.
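
To make the chaff-versus-signal point concrete, here is a minimal sketch of information extraction in Python. Everything in it is invented for illustration: the customer comment, the order-number and date patterns, and the three-word complaint lexicon are assumptions, and commercial text analytics products rely on far richer linguistic processing than a few regular expressions.

```python
import re

# Hypothetical customer comment, invented for this illustration.
COMMENT = (
    "Hi there, hope you're well. I called on 2008-07-14 about order #48213 "
    "and, honestly, the agent was rude. Anyway, thanks for listening."
)

# Assumed formats: order numbers look like "order #12345", dates like "YYYY-MM-DD".
ORDER_PATTERN = re.compile(r"order #(\d+)")
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

# Toy lexicon of complaint words, not a real sentiment resource.
NEGATIVE_TERMS = {"rude", "slow", "broken"}


def extract_record(text):
    """Pull a few structured fields out of free text and ignore the rest."""
    order = ORDER_PATTERN.search(text)
    date = DATE_PATTERN.search(text)
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "order_id": order.group(1) if order else None,
        "date": date.group(1) if date else None,
        "complaint_terms": sorted(words & NEGATIVE_TERMS),
    }


record = extract_record(COMMENT)
print(record)
```

Of the two dozen or so words in the comment, only three values end up in the record. The greeting and the sign-off are the narrative chaff; the order number, the date, and the complaint term are what can be loaded into a table and analyzed alongside structured data.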

The 80%/85%/90% unstructured figures come from, well, everywhere. “X percent unstructured” is so obvious to all of us working in text analytics that we’ve never dug up the research or even ascertained that it actually exists. This bit of common wisdom falls in the category, “if it didn’t exist, we’d have to invent it,” and perhaps we did. It is qualitatively true in the sense that it reflects our experience.

Are the figures useful? Indisputably. They make concrete – they focus and solidify – the realization that unstructured data matters. The figures reinforce two thoughts. First, it’s now possible to derive information from unstructured sources. Second, doing so, exploiting unstructured sources, primarily by applying text analytics, is an essential component of any comprehensive business-intelligence program, a key element in ensuring enterprise competitiveness.

[Reblogged May 15, 2013. This article was first published as my Clarabridge Bridgepoints newsletter Q3 2008 column, “Experts Corner: Seth Grimes.” If you have working links to replace the broken ones, please let me know.]