(TechWeb’s Intelligent Enterprise published my “Is It Time For NoETL?” article on Wednesday, March 24, 2010. IE was subsequently rolled into InformationWeek and TechWeb abandoned much of its old content, including my article.)
I’ve been bemused by NoSQL, the movement that propounds database-management diversity with the very valid claim that a one-size-fits-all relational approach is a poor match for emerging, demanding data challenges. Didn’t we all know that relational databases, based on tables and joins, aren’t always best? Hadn’t the issue been the lack of usable, reliable, enterprise-worthy alternatives? Similarly, haven’t we long understood that wiring up extract-transform-load (ETL) is laborious (all those adapters and rules and the need for hand-matching), even if necessary given the perceived need to gather, cleanse, and integrate diverse BI data sources? Is that preparatory data work still essential? Or is it now time for a NoETL movement, reflecting a new world of liberated, semantically enriched, analysis-ready, mashable data?
The “SQL” of NoSQL is Structured Query Language, which has been closely associated with relational databases since the ’70s, the early days of the RDBMS. With Oracle’s and IBM’s support, SQL vanquished superior alternatives such as Ingres’s Quel. SQL is an easy target for criticism, on its own and standing in as a proxy for relational systems.
SQL is stateless, and given that limitation, vendors have wrapped it in diverse, incompatible procedural languages to support multi-step data processes. SQL’s set-oriented approach creates a data-handling burden for application programmers, so we have cursors, a row-oriented (record-oriented) retrieval kludge. Correlated subqueries are a usability nightmare, and the checklist demand for ACID compliance (transactional atomicity, consistency, isolation, durability) is simply overhead overkill for analytical applications.
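To make the set-versus-row point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The orders table, its columns, and its values are hypothetical, invented purely for illustration. The same totals are computed twice: once with a single set-oriented statement, and once cursor-style, with the application walking rows and keeping its own running state.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Set-oriented: one declarative statement describes the whole result.
set_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)

# Cursor-style: the application fetches rows one at a time and
# maintains the running totals itself.
cursor_totals = {}
cursor = conn.execute("SELECT region, amount FROM orders")
for region, amount in cursor:
    cursor_totals[region] = cursor_totals.get(region, 0.0) + amount

conn.close()
```

Both dictionaries hold the same totals. The difference is where the work lives: in the database engine for the first, in application code for the second.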
NoSQL is a catch-all term for a grab bag of relational alternatives. NoSQL is a New Testament that seeks to supplant the Codd of Old.
SQL’s deficiencies have been known for years; nonetheless, SQL has served the database community well and supported the creation of immense business value for the many, many millions of RDBMS end users. So have the ETL technologies that feed relational (and other) databases from flat-file, spreadsheet, operational-system, and database sources — technologies, plural. Is ETL still relevant in a world of semantic computing?
Semantic computing relies on meaningful data. That data may be stored in RDBMS tables with an associated metadata repository. It may be modeled with a graph structure, described via RDF (the Resource Description Framework, commonly serialized as XML), and captured in a “triple store” for query via SPARQL. It may be mapped into an ontology, a mechanism for knowledge representation. (“Knowledge” here is a network of relationships, a.k.a. facts, that link entities within a subject-matter domain.)
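For readers who haven’t met a triple store, the idea can be sketched in a few lines of plain Python. This is a toy, not RDF or SPARQL: the identifiers, predicates, and the match helper are all hypothetical. It shows only the core notion that facts are (subject, predicate, object) triples and that a query is a pattern with wildcards, loosely analogous to a single SPARQL triple pattern.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical facts, each a (subject, predicate, object) triple.
TRIPLES = [
    ("customer:42", "name", "Acme Corp"),
    ("customer:42", "placed", "order:7"),
    ("order:7", "total", "250.00"),
    ("customer:43", "placed", "order:8"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# "Which orders did customer:42 place?"
orders_for_42 = [obj for _, _, obj in match(TRIPLES, s="customer:42", p="placed")]

# "Who placed anything at all?"
buyers = sorted({subj for subj, _, _ in match(TRIPLES, p="placed")})
```

A real triple store adds indexing, standard identifiers (URIs), and a full query language, but the relationship-as-fact model is the same.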
Semantic computing involves methods and software designed to mine meaning, relationships, and usages from sources both conventional and unconventional, from structured databases and from the chaos that is the Web. All that good stuff is inferred from whatever definitions, data profiles (i.e., information on the distributions of the values of variables), and context are available.
The payoff is that you have all the ingredients necessary to support dynamic integration, to enable as-you-like-it data mashability.
Dynamic integration: NoETL
A number of tools claim or aim to support dynamic integration. Some are metadata- or semantics-driven; some are essentially visually programmed, without reliance, for the end user or behind the scenes, on the ETL equivalent of SQL. They come from companies such as Expressor, Progress Software, and JackBe, the last an enterprise mashup vendor.
I’ll credit JackBe with prompting me to think much more intently about this stuff than I would have otherwise. I wrote a short paper for them, Nimble Intelligence: Enterprise BI Mashup Best Practices, and presented on the same topic in a JackBe webinar yesterday. (I was paid for this work and for strategy consulting.) The thought is that mashups bring agility to BI, the possibility of integrating the data and application elements you need, when needed, without much or most of the overhead typically associated with conventional BI.
NoETL is an extension of this concept, actually a sort-of retake on Enterprise Information Integration (EII), a once-promising but now neglected notion that one can successfully build and query a unified virtual schema, spanning data sources, without requiring data collection into a single data warehouse or repository. In considering NoETL, let’s recognize the value of traditional ETL and of EII and use them where they fit best. Let’s also understand the promise and power of semantics, and of the diversity of NoSQL-ite data representations, in seeking data integration approaches that enable truly agile BI.
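The virtual-schema idea behind EII can be sketched roughly as follows. This is a toy under stated assumptions, not how any vendor’s product works: the two sources (an in-memory SQLite table standing in for an operational database, and a CSV string standing in for a flat file), their column names, and the customers function are all hypothetical. The point is that rows are pulled from each source and mapped to a common shape at query time, with nothing copied into a warehouse beforehand.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, Union

Row = Dict[str, Union[int, str]]

# Source 1 (hypothetical): an operational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cust (cust_id INTEGER, cust_name TEXT)")
db.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): a flat file with different column names.
CSV_TEXT = "id,customer\n3,Initech\n"


def from_db() -> Iterator[Row]:
    """Map the database's columns onto the unified schema."""
    query = "SELECT cust_id, cust_name FROM cust ORDER BY cust_id"
    for cust_id, cust_name in db.execute(query):
        yield {"id": cust_id, "name": cust_name}


def from_csv() -> Iterator[Row]:
    """Map the flat file's columns onto the unified schema."""
    for row in csv.DictReader(io.StringIO(CSV_TEXT)):
        yield {"id": int(row["id"]), "name": row["customer"]}


def customers() -> Iterator[Row]:
    """Unified virtual view: sources are read only when queried."""
    yield from from_db()
    yield from from_csv()


# A query against the virtual view spans both sources.
names = [c["name"] for c in customers() if c["id"] >= 2]
```

The hand-written mapping functions are exactly the part that semantics-driven tools hope to infer rather than have someone code.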