Now and then, I find myself thinking about the big principles of Big Data, that is, not about Hadoop vs. relational databases or Mahout vs. Weka but rather about fundamental wisdom that frames our vision of “the new currency.” But maybe “the new oil” better describes data, or perhaps we need a new metaphor to explain data’s value.
Metaphors aren’t factual or provable, but they do illuminate “truths” about topics of interest. They make complex concepts understandable, and so does (I hope!) a set of quotations I’ve collected that relate to and explain basic Big Data principles. I’ll offer 8 truths about Big Data — you’re surely already bought into at least a few — ordered roughly chronologically, and then I’ll take a look ahead, at a “future truth”:
1 “Correlation is not causation.” We hear this over and over (or at least I do). I learned one version of the underlying fallacy when I was in college (studying philosophy): post hoc ergo propter hoc, “after the thing, therefore because of the thing.” You can read a smart take in the O’Reilly Radar blog. In “The Vanishing Cost of Guessing,” Alistair Croll observes, “overwhelming correlation is what big data does best… Parallel computing, advances in algorithms, and the inexorable crawl of Moore’s Law have dramatically reduced how much it costs to analyze a data set,” creating a “data-driven society [that] is both smarter and dumber.” Therefore be smart and respect the difference between correlation and causation. Patterns are not conclusions.
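A quick sketch makes the point concrete. The series names and numbers below are invented for illustration (the classic ice-cream-and-drownings example): two variables that share a common driver, such as the season, can correlate strongly while having no causal link at all, and the correlation largely disappears once the shared trend is removed.

```python
import numpy as np

# Two hypothetical series (names invented for illustration) that both
# follow the same upward trend -- a shared confounder such as summer
# weather -- but have no causal link to each other.
rng = np.random.default_rng(0)
n = 200
trend = np.arange(n, dtype=float)                    # the confounder (e.g. time/season)
ice_cream_sales = 2.0 * trend + rng.normal(0, 10, n)
drownings = 0.5 * trend + rng.normal(0, 10, n)

# The raw correlation looks overwhelming...
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

# ...but it vanishes once the shared trend is removed
# (here, by taking first differences of each series).
r_detrended = np.corrcoef(np.diff(ice_cream_sales), np.diff(drownings))[0, 1]

print(f"raw r = {r_raw:.2f}, detrended r = {r_detrended:.2f}")
```

The pattern in the raw data is real, but the conclusion (“ice cream causes drownings”) would be wrong — exactly the gap between correlation and causation that cheap, large-scale analysis makes easy to ignore.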
2 “Essentially, all models are wrong, but some are useful,” wrote accidental statistician George E.P. Box, in “Empirical model-building and response surfaces,” in 1987. Box developed his thoughts on modeling, which very much apply to Big Data, over the length of his career. See in particular “Science and Statistics,” published in the Journal of the American Statistical Association in December 1976.
3 Big Data knows (almost) all. If you don’t already, it’s time to accept Scott McNealy’s 1999 statement, “You have zero privacy anyway… Get over it.” McNealy was cofounder and CEO of Sun Microsystems, quoted in Wired magazine. Big Data’s growing invasiveness is simply a matter of detail: analysts’ ability to infer sex and sexual orientation (for instance) from social postings and pregnancy from buying patterns; the on-going expansion of vast, commercialized consumer-information stores held by Acxiom and the like; the rise of Palantir and Riot-ous information synthesis; the NSA Prism vacuum cleaner.
4 I covered the truism that “80 percent of business-relevant information originates in unstructured form, primarily text” (but also video, images, and audio) in a 2008 article, although as I wrote then, the 80% factoid dates at least to the early 1990s. This bit of pseudo-data is a “factoid” because it is far too broadly drawn to be precise. So far as I know, it’s not derived from any form of systematic measurement ever performed. Still, per Box, “80% unstructured” is a useful notion, even if not correct. Whatever number works for you, text and content analytics belong in your toolkit.
5 “It’s not information overload. It’s filter failure,” explained Clay Shirky at the September 2008 Web 2.0 Expo in New York. Truisms such as “More data does not imply better insights” (I made that one up) are corollaries of Shirky’s filter observation. But don’t overdo it; avoid what Eli Pariser terms “the filter bubble,” an inability to see beyond what automation makes immediate.
6 “The same meaning can be expressed in many different ways, and the same expression can express many different meanings.” So say Googlers Alon Halevy, Peter Norvig, and Fernando Pereira in their touchstone March 2009 IEEE Intelligent Systems article, “The Unreasonable Effectiveness of Data.” How is data’s unreasonable effectiveness revealed? Via semantic interpretation of “imprecise and ambiguous” natural languages — tackling the scientific problem of interpreting massive, aggregated content by inferring relationships via machine learning.
7 When Harvard Prof. Gary King says “Big Data is not about the data! The value in Big Data [is in] the analytics,” in effect he’s spinning out the Googlers’ (#6) thoughts. Yet I can’t completely agree with King. There is value in the business process of determining data needs and devising a smart approach to collecting and structuring the data for analysis. Analytics helps you discover that value, so my preferred formulation would be, “the value of Big Data is in analytics.” My thinking isn’t original. See, for instance, “Big Data, Analytics, and the Path from Insight to Value” by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins, and Nina Kruschwitz in the MIT Sloan Management Review, winter 2011 (December 2010).
8 Phil Simon says “intuition is as important as ever.” Phil wrote “Too Big to Ignore: The Business Case for Big Data,” published earlier this year. (I contributed material on text analytics and sentiment analysis.) Simon explains, “Big Data has not, at least not yet, replaced intuition; the latter merely complements the former. The relationship between the two is a continuum, not a binary.” Tim Leberecht explores this same point in a June 2013 article, “Why Big Data will never beat business intuition.”
These eight points lead to a future truth, an appraisal that isn’t yet widely understood but will be, I believe:
9 The future of Big Data is synthesis and sensemaking. The missing element from most solutions is the ability to integrate information across sources, in situationally appropriate ways, to generate contextually relevant, usable insights. I’ll pull some defining quotations from an illuminating paper by design strategist Jon Kolko (admittedly applying them out of context). First, Kolko cites cognitive psychologists studying the connections between problem solving and intuition, who “reference sensemaking as a way of understanding connections between people, places and events that are occurring now or have occurred in the past, in order to anticipate future trajectories and act accordingly.” (Kolko’s source is “Making Sense of Sensemaking 1: Alternative Perspectives” (2006) by Gary Klein, Brian Moon, and Robert R. Hoffman in IEEE Intelligent Systems. See also their “Making Sense of Sensemaking 2: A Macrocognitive Model.”)
Kolko sees [design] synthesis as a key element, a “sensemaking process of manipulating, organizing, pruning and filtering data in an effort to produce information and knowledge.” What capabilities are afforded? IBM Fellow Jeff Jonas says “general purpose sensemaking systems will colocate diverse data in the same data space. Such an approach enables massively scalable, real-time, novel discovery over an ever changing observational space.”
Isn’t that our Big Data goal, to advance from pattern detection to actionable conclusions? I hope my 9 truths have helped you understand the path.