07/08/2010

Listed below are links to weblogs that reference Thoughts on Big Data:

Comments

Hey,

I disagree that there is no LAMP stack for big data processing and analytics. If you divide Hadoop-related projects into collection, storage, processing, and access, you can start to see a stack emerging. Every large Hadoop installation I've seen has a solution in each section of this stack; our product strategy at Cloudera essentially derives from this fact.

You can see evidence of a "stack" emerging if you look at presentations from Facebook, LinkedIn, Twitter, and Yahoo. At Adobe, they call this stack the "Hstack" (see http://hstack.org).

The components of the "Hstack" we've chosen to include in CDH:
* Flume and Sqoop for collection (from log files and RDBMS, respectively)
* HDFS for storage
* MapReduce, Hive, Pig, and Oozie for processing (bit of heterogeneity here)
* HBase and HUE for access

Avro and ZooKeeper are common platform services used by all of these systems (now or in the future).

Of course, I'm fairly biased, having founded Cloudera. I encourage you, though, to take a look at the larger Hadoop installations for evidence of a common stack emerging, even if the different components have different names at each company.

Regards,
Jeff
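
As a rough illustration of the four-part grouping described in the comment above, the components can be written down as a simple mapping. This is only a sketch: the names `CDH_STACK`, `PLATFORM_SERVICES`, and `layer_of` are hypothetical, and the grouping restates the list in the comment rather than any official packaging.

```python
# Hypothetical illustration: the layer-to-component grouping from the comment above.
CDH_STACK = {
    "collection": ["Flume", "Sqoop"],
    "storage": ["HDFS"],
    "processing": ["MapReduce", "Hive", "Pig", "Oozie"],
    "access": ["HBase", "HUE"],
}

# Shared platform services sit outside the four layers.
PLATFORM_SERVICES = ["Avro", "ZooKeeper"]


def layer_of(component):
    """Return the layer a component belongs to, or None if it is not in a layer."""
    for layer, components in CDH_STACK.items():
        if component in components:
            return layer
    return None
```

Looking up a component such as `layer_of("Sqoop")` returns its layer name, while platform services like Avro fall outside the four layers and return `None`.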

We have a different take on the stack. CouchDB is a database with an HTTP interface, opening up the possibility of a vastly simpler 2-layer application architecture. I've written a technical blog post describing our app platform, and why developers are so excited to jettison the complexity they've been dealing with since the dawn of the web.

http://wiki.couchapp.org/page/what-is-couchapp

Evangele,

Thanks for the nice post. I wanted to comment on a couple of points:

a) My experience is that nearly 100% of the F1000 need big data solutions today (i.e., they either have something or are thinking about it). By 10%-20% you may be referring to demand for Hadoop specifically (?). However, Hadoop is not the only tool for Big Data: I definitely consider Aster Data an enterprise Big Data solution, and our list of customers includes many F1000 enterprises.

More generally, Big Data for me is big data size [TBs-PBs] + deep analytics [going beyond plain SQL, e.g. MapReduce, In-Database SAS, etc.].

b) The 'NoSQL' term is very misleading, as many of the NoSQL systems are racing to add SQL support and be more "enterprise-friendly". It's better to talk in terms of relational vs non-relational, SQL vs non-SQL, MPP vs SMP, etc. In my opinion the future is in hybrid systems that support different interfaces and data types. Aster itself offers SQL & MapReduce tightly integrated in a single system, and we're advancing in that direction.

Thanks,
Tasso Argyros
CTO, Aster Data

Hi,

I think that one important piece in the big data puzzle is search, and search should also be able to handle the same (if not a larger) amount of data. Also, search already provides nice "analytical" aspects like facets, and advanced query capabilities that go beyond simple text-based search. Take a look at elasticsearch (http://www.elasticsearch.com), which tries to solve this problem, and my thoughts on how it works within the nosql world (http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html).

-shay.banon
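
For readers unfamiliar with the term, a facet is essentially a count of matching documents grouped by the value of some field. The following is a minimal sketch in plain Python and not elasticsearch's API; the function name `facet_counts`, the sample documents, and the field names are all made up for illustration.

```python
from collections import Counter


def facet_counts(docs, query, field):
    """Count the values of `field` among docs whose text contains `query`."""
    # Naive case-insensitive substring match stands in for a real search engine.
    hits = [d for d in docs if query.lower() in d["text"].lower()]
    return Counter(d[field] for d in hits if field in d)


# Hypothetical sample documents.
docs = [
    {"text": "Hadoop cluster tuning", "tag": "hadoop"},
    {"text": "Hadoop and Hive", "tag": "hive"},
    {"text": "CouchDB replication", "tag": "couchdb"},
    {"text": "Scaling Hadoop", "tag": "hadoop"},
]

tag_facet = facet_counts(docs, "hadoop", "tag")
```

Here `tag_facet` holds the per-tag counts for documents matching "hadoop", which is the kind of summary a search engine can return alongside the hits themselves.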

I agree with the comment above by Chris Anderson and disagree with your statement "Data analysis and business intelligence are emerging as the best applications for taking full advantage of NoSQL databases." This is simply not true. JSON-based document stores are NOT best suited to Big Data, but do make a lot of sense in some web-based scenarios, where the O/R impedance mismatch has always been problematic.

My comment on the suitability of NoSQL databases for data analysis and BI was based on reports I have read on such projects using these types of databases. Of course, the people working on such projects may ultimately find that these databases are not as applicable to exploratory data analysis and BI as they initially thought.

@Shay. I agree that search is and will continue to be an important tool for operating on big data. However, I will note that to date it has not been as effective (or as accepted) for analytics, even though the incumbent BI vendors, e.g., Business Objects/SAP, have offered search functionality and tried repeatedly to get users to employ it. Maybe we need analysts who come to analytics with a fresh eye. Much of my recent excitement about the sector comes from seeing a new generation of analysts using new approaches to tackle the analysis of primarily semi-structured, internet-based data.

@Tasso. I agree with your first point that 100% of the F1000 have to warehouse, manage, and analyze large data volumes. I also view Aster as a big data solution. However, having worked on DW and BI solutions for F1000 companies for the larger part of my 20-year operating career, I don't think that, with a few exceptions (and that's where my 20% statement comes from), they yet need to work with the data volumes that internet businesses generate today. In fact, most of the case studies being reported by companies like IBM, Teradata, and Netezza about their work with F1000 companies tend to be about data warehouses of 0.8-6TB in size. This is still big data compared to what we were dealing with 5-10 years ago, but still not what companies like Yahoo, eBay, Facebook, Myspace, etc. have to deal with.

I don't know what to say about your second point. Obviously SQL is a very expressive and efficient query language, and over the years corporations have made significant investments around SQL for their BI efforts. I'm sure they don't want to throw away these investments even when they decide to adopt a NoSQL database. I'm not close enough to the topic to know whether, for the new types of data being collected and analyzed, SQL is the best language to use for data analysis queries. I hope other, more knowledgeable readers of this blog will be able to weigh in on this issue.

The comments to this entry are closed.

Creative Commons Attribution-ShareAlike 3.0 Unported

Evangelos Simoudis is a Senior Managing Director at Trident Capital where he focuses on investments in SaaS, Internet and Data.
