How many Vs are enough? Doug Laney used three – Volume, Velocity, and Variety – in defining Big Data back in the ’90s. In recent years, revisionists have blown out the count to a too-many seven or eight. “Embrace and extend” is alive and well, it seems, expanding the market space but also creating confusion.
When a concept resonates, as Big Data has, vendors, pundits, and gurus – the revisionists – spin it for their own ends. Big Data revisionists would elevate Value, Veracity, Variability/Variance, Viability, and Victory (a notion so obscure that I won’t mention it further) to canonical V status. Each of the various new Vs has its champions. Joining them are the contrarians who have given us the “small data” countertrend.
In my opinion, the wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes.
Essence Vs. Interpretation
The original 3 Vs do a fine job capturing essential Big Data attributes, but they do have shortcomings, specifically related to usefulness. As Forrester analyst Mike Gualtieri puts it, the original 3 Vs are not “actionable.” Gualtieri poses three pragmatic questions. The first relates to Big Data capture. The others relate to data processing and use: “Can you cleanse, enrich, and analyze the data?” and “Can you retrieve, search, integrate, and visualize the data?”
As for “small data”: The concept is a misframing of the data challenge. Small data is nothing more or less than a filtered and reduced topical subset of the Big Data mother lode, again the product of analytics. Fortunately, attention to this bit of Big Data backlash seems to have ebbed, which lets us get back to the Big picture.
The 3 Vs and Beyond
The Big picture is that the original 3 Vs work well. I won’t explain them; instead, I will refer you to Big Data 3 Vs: Volume, Variety, Velocity, an infographic posted by Gil Press. You’ll see that the infographic posits Viability – essentially, can the data be analyzed in a way that makes it decision-relevant? – as “the missing V.” The concluding line: “Many data scientists believe that perfecting as few as 5% of the relevant variables will get a business 95% of the same benefit. The trick is identifying that viable 5%, and extracting the most value from it.” Hmm… it seems to me that “the missing V” could equally well have been Value.
Neil Biehn, writing in Wired, sees Viability and Value as distinct missing Vs numbers 4 and 5. Biehn’s take on Viability is similar to Press’s. According to Biehn, “we want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses.” I agree, but note that the selection process is purpose-driven and external to the data. Biehn continues, “the secret is uncovering the latent, hidden relationships among these variables.” Again I agree, and I ask, How do you determine predictive viability, generated by those latent relationships among variables? My answer isn’t mine alone. Prof. Gary King of Harvard University read my mind when he stated, at a conference I attended in June, “Big Data isn’t about the data. It’s about analytics.” Viability isn’t a Big Data property. It’s a quality that you determine via Big Data Analytics.
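To make the “viability is an analytics output” point concrete, here is a minimal sketch: ranking candidate variables by the strength of their relationship to a business outcome. The variable names and figures below are invented for illustration; real selection would use richer data and methods.

```python
# Hypothetical sketch: "viability" as something you compute, not a property
# of the raw data. Rank candidate variables by |Pearson correlation| with a
# business outcome. All names and numbers are illustrative.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

outcome = [10, 12, 15, 11, 18, 20]           # e.g., weekly revenue
candidates = {
    "ad_spend":   [1, 2, 3, 1, 4, 5],
    "page_views": [5, 5, 6, 5, 6, 6],
    "noise":      [9, 2, 7, 8, 1, 5],
}

# The "viable" variables are the ones most predictive of the outcome.
ranked = sorted(candidates,
                key=lambda k: abs(pearson(candidates[k], outcome)),
                reverse=True)
print(ranked)  # → ['ad_spend', 'page_views', 'noise']
```

The point of the sketch: nothing in the stored data marks `ad_spend` as viable; that judgment emerges only when you analyze the data against a purpose-chosen outcome.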
Biehn continues, “We define prescriptive, needle-moving actions and behaviors and start to tap into the fifth V from Big Data: Value.” Again, how do you determine prescriptive value, which Biehn notes is derived from, and hence not an intrinsic quality of, Big Data? Analytics. Analytics verifies not only the accuracy of predictions, but also the effectiveness of outcomes in achieving goals. Analytics ascertains the validity of the methods and the ROI impact of the overall data-centered initiative. ROI quantifies Value, complementing the qualitative measure Validity. Both Vs are external to the data itself.
Compounding the Confusion
Variability and Veracity are similarly analytics-derived qualities that relate more to data uses than to the data itself.
Variability is particularly confusing. Forrester analysts Brian Hopkins and Boris Evelson observed back in 2011 that “many options or variable interpretations confound analysis.” Sure, and you can use a stapler to bang in a nail (I have), but that doesn’t make it any less a stapler.
Hopkins and Evelson wrote, “for example, natural language search requires interpretation of complex and highly variable grammar.” Put aside that grammar doesn’t vary so much; rather, it’s usage that is highly variable, ranging from grammatical to non. Natural-language processing (NLP) techniques, as implemented in search and text-analytics systems, deal with variable usage by modeling language. NLP facilitates entity and information extraction, applied for particular business purposes.
(NLP, text analytics, and their business applications are specializations of mine. An entity is a uniquely identifiable thing or object, for instance the name of a person, place, product, or pattern such as an e-mail address or Social Security Number. Extractable information may include attributes of entities, relationships among entities, and constructs such as events – “Michelle LaVaughn Robinson Obama (born January 17, 1964), an American lawyer and writer, is the wife of the 44th and current President of the United States.” – that we recognize as facts.)
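At its simplest, entity extraction of the kind described above can be pattern-based: well-structured entities such as e-mail addresses and Social Security Numbers can be pulled from free text with regular expressions. The sketch below is illustrative only – production systems layer statistical and linguistic models on top of patterns like these – and the sample text is invented.

```python
# Minimal, pattern-based entity extraction sketch. Real text-analytics
# systems use trained models; these two regexes and the sample text are
# purely illustrative.
import re

PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def extract_entities(text):
    """Return (entity_type, value) pairs found in the text."""
    return [(etype, match)
            for etype, pattern in PATTERNS.items()
            for match in pattern.findall(text)]

text = "Contact jane.doe@example.com; SSN on file: 123-45-6789."
print(extract_entities(text))
# → [('email', 'jane.doe@example.com'), ('ssn', '123-45-6789')]
```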
IBM sees Veracity as a fourth Big Data V. (Like me, they don’t buy others’ advocacy of Variability, Validity, or Value as Big Data essentials.) Regarding Veracity, IBM asks, “How can you act upon information if you don’t trust it?”
Yet facts, whether captured in natural language or in a structured database, are not always true. False or outdated data may nonetheless be useful, as may non-factual, subjective data (feelings and opinions). Consider two statements, one asserting a fact and the other containing one that is no longer true. The second additionally expresses sentiment. Join me in concluding that data may contain Value unlinked from Veracity:
- “The Iraqi regime… possesses and produces chemical and biological weapons.” – George W. Bush, October 7, 2002.
- “I am glad that George Bush is President.” – Daniel Pinchbeck, writing ironically, June 2003.
Veracity does matter. I’ll cite an old Russian proverb, “Trust, but verify.” That is, analyze your data – evaluate it in context, taking into account provenance – in order to understand it and use it appropriately.
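“Trust, but verify” can itself be operationalized as an analytics step: score each record by provenance and freshness before acting on it. The sources, trust weights, and cutoffs below are hypothetical, chosen only to illustrate the idea.

```python
# Hedged sketch of provenance-aware verification. Source names, trust
# weights, dates, and thresholds are all hypothetical.
from datetime import date

TRUSTED_SOURCES = {"internal_crm": 1.0, "partner_feed": 0.7, "web_scrape": 0.3}

def usable(record, today=date(2013, 9, 1), max_age_days=365, min_trust=0.5):
    """A record is usable if it is recent enough and from a trusted source."""
    age_days = (today - record["captured"]).days
    trust = TRUSTED_SOURCES.get(record["source"], 0.0)
    return age_days <= max_age_days and trust >= min_trust

records = [
    {"source": "internal_crm", "captured": date(2013, 6, 1)},   # fresh, trusted
    {"source": "web_scrape",   "captured": date(2013, 8, 1)},   # fresh, untrusted
    {"source": "partner_feed", "captured": date(2011, 1, 1)},   # trusted, stale
]
print([usable(r) for r in records])  # → [True, False, False]
```

Note that the veracity judgment, like viability and value, comes from evaluating the data in context, not from the data alone.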
4 Vs for Big Data Analytics
My aim in writing this article has been to differentiate the essence of Big Data, as defined by Doug Laney’s original-and-still-valid 3 Vs, from derived qualities – the 4 or 5 new Vs proposed by various vendors, pundits, and gurus. The hope is to maintain clarity and stave off market-confusing fragmentation begotten by the wanna-Vs.
On one side of the divide we have data capture and storage; on the other, business-goal oriented filtering, analysis, and presentation. Databases and data streaming technologies answer the Big Data need; for the balance, the smart stuff, you need Big Data Analytics.
Variability, Veracity, Validity, and Value aren’t intrinsic, definitional Big Data properties. They are not absolutes. Rather, they reflect the uses you intend for your data. They relate to your particular business needs.
You discover context-dependent Variability, Veracity, Validity, and Value in your data via analyses that assess and reduce data and present insights in forms that facilitate business decision-making. This function, Big Data Analytics, is the key to understanding Big Data.