Earlier in the year I wrote about areas of investment interest for 2010 and included Big Data aggregation, management, and data mining/processing. A couple of weeks ago I attended Gigaom’s Structure 2010, a conference devoted to big data. As Derrick Harris also notes here, there is a growing market of vendors building on open source tools to support big data initiatives. I wanted to summarize my thoughts from the conference, but travel got in the way. When EMC’s acquisition of Greenplum was announced, big data came to the fore again, since Greenplum had built a good implementation of MapReduce into its platform (even though I think EMC has been thinking along more traditional data warehousing lines with this acquisition).
- Only 10-20 companies have a good grasp of big data issues and are innovating in this area. Though many companies in the Fortune 1000 are starting to experiment with Hadoop, today only 10-20% of enterprises need big data solutions. This number could grow as high as 40-50% in 5 years.
- NoSQL databases are emerging as the preferred systems for storing and managing big data sets. The data in these sets is at the terabyte or petabyte scale, semi-structured, and highly distributed. Much of it is of unknown value, so it must be processed quickly to identify the interesting parts worth keeping. NoSQL databases provide efficient and effective storage, management, and processing of such data sets at low cost. However, it is unlikely that these databases will evolve into general-purpose platforms, as relational databases did. It will therefore be important to match the big data problem being solved to the right database. Data analysis and business intelligence are emerging as the best applications for taking full advantage of NoSQL databases. Almost every company that presented at the conference discussed such applications.
- Too many companies have already been created around NoSQL databases and related tools (Cloudera, 10gen with MongoDB, VoltDB, CouchDB, etc.). While user interest in such databases is increasing (many Fortune 1000 companies have started Hadoop evaluation projects), the market won’t be able to sustain all of them. I expect to see significant consolidation in the next 3-5 years.
- Today there is no “LAMP stack” equivalent for big data processing and analytics, and I think there is both an opportunity and a need to create one. As Jim Kobielus at Forrester wrote in his blog, there is not going to be a standalone market for Hadoop. The pioneers of the big data movement, e.g., internet portals, social networks, and ad networks, are creating ad hoc “stacks” using open source tools for data management, data aggregation, and in-memory storage. In most cases they extend or modify these tools heavily. In addition to base Hadoop, tools such as Hive, Cassandra, Scribe, memcached, MapReduce, and Google App Engine are part of such stacks and, consequently, can be the components of a pre-integrated, pre-certified Big Data Platform. Using these tools still requires a programmer’s talents (Pig and Hive QL are efforts to provide a higher-level programming abstraction over Hadoop and Hive, respectively, so that database administrators can interact with the databases). The Big Data Platform must be usable by business analysts as well. Hadoop client-side applications, such as those being developed by Karmasphere and Datameer, are steps in the right direction, but they remain standalone.
- For no reason apparent to me, NoSQL database companies are trying to reinvent the data warehousing and business intelligence infrastructures that have been created over the years. Some of the “reinventions” may be absolutely necessary and will lead to important innovations. However, these companies also appear to be ignoring important aspects of the data management and data analysis technology that has been developed over the years around data warehouses built on relational technology.