Earlier in the year I wrote about areas of investment interest for 2010 and included Big Data aggregation, management, and data mining/processing. A couple of weeks ago I attended Gigaom’s Structure 2010, a conference devoted to big data. As Derrick Harris also notes here, there is a growing market of vendors working around open source tools to support big data initiatives. I wanted to summarize my thoughts from the conference, but travel got in the way. When EMC’s acquisition of Greenplum was announced, big data came to the fore again, since Greenplum had built a good implementation of MapReduce into its platform (even though I think EMC has been thinking along more traditional data warehousing lines with this acquisition).
- Only 10-20 companies have a good grasp of big data issues and are innovating in this area. Though many companies in the Fortune 1000 are starting to experiment with Hadoop, today only 10-20% of enterprises need big data solutions. This number could grow as high as 40-50% in 5 years.
- NoSQL databases are emerging as the preferred systems for storing and managing big data sets. The data in these sets is at the terabyte or petabyte scale, semi-structured, and highly distributed, and much of it is of unknown value, so it must be processed quickly to identify the interesting parts to keep. NoSQL databases provide efficient and effective storage, management, and processing of such data sets at low cost. However, it is unlikely that these databases will evolve into general purpose platforms, as relational databases did. It will therefore be important to match the big data problem being solved to the right database. Data analysis and business intelligence are emerging as the best applications for taking full advantage of NoSQL databases. Almost every company that presented at the conference discussed such applications.
- Too many NoSQL database companies have already been created (Cloudera, 10gen, MongoDB, VoltDB, CouchDB, etc.). While user interest in such databases is increasing (many Fortune 1000 companies have started Hadoop evaluation projects), the market won’t be able to sustain all of them. I expect to see significant consolidation in the next 3-5 years.
- Today there is no “LAMP stack” equivalent for big data processing and analytics, and I think that there is both an opportunity and a need to create one. As Jim Kobielus at Forrester wrote in his blog, there is not going to be a standalone market for Hadoop. The pioneers of the big data movement, e.g., internet portals, social networks, ad networks, etc., are creating ad hoc “stacks” using open source tools for data management, data aggregation, and in-memory storage. In most cases they extend or modify these tools heavily. In addition to base Hadoop, tools such as Hive, Cassandra, Scribe, memcached, MapReduce, and Google App Engine are part of such stacks and, consequently, can be the components of a pre-integrated, pre-certified Big Data Platform. Using these tools still requires a programmer’s talents (Pig and QL are efforts to provide a higher-abstraction programming layer on top of Hadoop and Hive, respectively, so that database administrators can interact with the databases). The Big Data Platform must be usable by business analysts as well. Hadoop client-side applications, such as those being developed by Karmasphere and Datameer, are steps in the right direction, but they remain standalone.
- For no reason apparent to me, NoSQL database companies are trying to reinvent the data warehousing and business intelligence infrastructures that have been created over the years. Some of the “reinventions” may be absolutely necessary and will lead to important innovations. However, these companies also appear to be ignoring important aspects of the data management and data analysis technology that has been developed over the years around data warehouses built using relational technology.
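For readers less familiar with the MapReduce model that comes up repeatedly above, here is a minimal single-process sketch in Python. It is not how Hadoop or Greenplum implement it. The record fields (`page`, `user`) and the function names are hypothetical, chosen only to illustrate the map, shuffle, and reduce steps on semi-structured records where some fields may be missing.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (page, 1) for records that carry a 'page' field; skip the rest."""
    page = record.get("page")  # 'page' is a hypothetical field name
    if page is not None:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Semi-structured input: not every record has every field.
records = [
    {"page": "/home", "user": "a"},
    {"page": "/about"},
    {"user": "b"},
    {"page": "/home"},
]
counts = run_job(records)
print(counts)
```

In a real cluster the map and reduce functions would run in parallel across many machines and the shuffle would move data over the network; the sketch only shows the shape of the computation.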


Hey,
I disagree that there is no LAMP stack for big data processing and analytics. If you divide Hadoop-related projects into collection, storage, processing, and access, you can start to see a stack emerging. Every large Hadoop installation I've seen has a solution in each section of this stack; our product strategy at Cloudera essentially derives from this fact.
You can see evidence of a "stack" emerging if you look at presentations from Facebook, LinkedIn, Twitter, and Yahoo. At Adobe, they call this stack the "Hstack" (see http://hstack.org).
The components of the "Hstack" we've chosen to include in CDH:
* Flume and Sqoop for collection (from log files and RDBMS, respectively)
* HDFS for storage
* MapReduce, Hive, Pig, and Oozie for processing (bit of heterogeneity here)
* HBase and HUE for access
Avro and ZooKeeper are common platform services used by all of these systems (now or in the future).
Of course, I'm fairly biased, having founded Cloudera. I encourage you, though, to take a look at the larger Hadoop installations for evidence of a common stack emerging, even if the different components have different names at each company.
Regards,
Jeff
Posted by: Jeff Hammerbacher | 07/12/2010 at 05:37 PM
We have a different take on the stack. CouchDB is a database with an HTTP interface, opening up the possibility of a vastly simpler 2-layer application architecture. I've written a technical blog post describing our app platform, and why developers are so excited to jettison the complexity they've been dealing with since the dawn of the web.
http://wiki.couchapp.org/page/what-is-couchapp
Posted by: J Chris Anderson | 08/04/2010 at 07:21 AM
Evangele,
Thanks for the nice post. I wanted to comment on a couple of points:
a) My experience is that nearly 100% of F1000 need big data solutions today (i.e. they either have something or they are thinking about it). By 10%-20% you may be referring to demand for Hadoop specifically (?). However, Hadoop is not the only tool for Big Data. I definitely consider Aster Data an enterprise Big Data solution, and our list of customers includes many enterprise F1000.
More generally, Big Data for me is big data size [TBs-PBs] + deep analytics [go beyond plain SQL, e.g. MapReduce, In-Database SAS etc.]
b) The 'NoSQL' term is very misleading, as many of the NoSQL systems are racing to add SQL support and be more "enterprise-friendly". It's better to talk in terms of relational vs non-relational, SQL vs non-SQL, MPP vs SMP etc. In my opinion the future is in hybrid systems that support different interfaces and data types. Aster itself offers SQL & MapReduce tightly integrated in a single system, and we're advancing in that direction.
Thanks,
Tasso Argyros
CTO, Aster Data
Posted by: Targyros | 08/04/2010 at 04:16 PM
Hi,
I think that one important piece in the big data puzzle is search, and search should also be able to handle the same amount of data (if not more). Also, search already provides nice "analytical" aspects like facets, and advanced query capabilities that go beyond simple text-based search. Take a look at elasticsearch (http://www.elasticsearch.com), which tries to solve this problem, and my thoughts on how it works within the nosql world (http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html)
-shay.banon
Posted by: Kimchy | 08/05/2010 at 12:58 AM
I agree with the comment above by Chris Anderson and disagree with your statement "Data analysis and business intelligence are emerging as the best applications for taking full advantage of NoSQL databases." This is simply not true. JSON-based document stores are NOT best suited to Big Data, but do make a lot of sense in some web-based scenarios, where the O/R impedance mismatch has always been problematic.
Posted by: Hugorodgerbrown | 08/05/2010 at 01:08 AM
My comment on the suitability of NoSQL databases for data analysis and BI was based on reports I have read on such projects using these types of databases. Of course, the people working on such projects may ultimately find that these databases are not as applicable to exploratory data analysis and BI as they initially thought.
Posted by: Evangelos | 08/05/2010 at 12:01 PM
@Shay. I agree that search is and will continue to be an important tool for operating on big data. However, I will note that to date it has not been as effective (or as accepted) for analytics, even though the incumbent BI vendors, e.g., Business Objects/SAP, have offered search functionality and tried repeatedly to get users to employ it. Maybe we need analysts who come to analytics with a fresh eye. Much of my recent excitement about the sector comes from seeing a new generation of analysts using new approaches to tackle the analysis of, primarily, semi-structured, internet-based data.
@Tasso. I agree with your first point that 100% of the F1000 have to warehouse, manage, and analyze large data volumes. I also view Aster as a big data solution. However, having worked on DW and BI solutions for F1000 companies for the larger part of my 20-year operating career, I don't think that, with a few exceptions (and that's where my 20% statement comes from), they yet need to work with the data volumes that internet businesses generate today. In fact, most of the case studies being reported by companies like IBM, Teradata, and Netezza about their work with F1000 companies tend to be about data warehouses of 0.8-6TB in size. This is still big data compared to what we were dealing with 5-10 years ago, but still not what companies like Yahoo, eBay, Facebook, Myspace, etc. have to deal with.
I don't know what to say about your second point. Obviously SQL is a very expressive and efficient query language, and over the years corporations have made significant investments around SQL for their BI efforts. I'm sure they don't want to throw away these investments even when they decide to adopt a NoSQL database. I'm not close enough to the topic to know whether for the new types of data being collected and being analyzed, SQL is the best language to use for data analysis queries. I hope other, more knowledgeable, readers of this blog will be able to weigh in on this issue.
Posted by: Evangelos | 08/05/2010 at 12:39 PM