On a daily basis, Internet publishers (e.g., Yahoo, MTV) and Internet applications such as
e-commerce sites (e.g., Amazon, eBay), social networks (e.g., Facebook,
MySpace, Twitter), and ad networks (e.g., VideoEgg, ValueClick) generate very
large data sets with new types of data. For example, a site like MTV.com
may generate 90TB of raw data per year, which, after being augmented with
demographic and geotagging data, can easily balloon to 700TB. A recent post on
the management of large data states that eBay has a 6.5PB data warehouse and
Facebook a 2.5PB data warehouse. Facebook is capturing 15TB of data
daily. This new, Internet-based data consists of various types of logs,
user-generated content, etc. The size of these data sets dwarfs that of the
corporate data, e.g., sales transactions, collected and stored in the more
“traditional” data warehouses used by the non-Internet members of the Global
2000. These “traditional” data warehouses typically store 600GB-1TB of
data. Most mid-size companies, i.e., companies that do about $400M in
annual sales, operate even smaller data warehouses that rarely cross the
300-400GB level. The existing analytic tools and applications, e.g., Business
Objects or Cognos, were not developed to operate on anything that resembles
these king-size, Internet-based data sets.
During the last couple of years we’ve seen significant innovation in the area of data management, first with the introduction of data warehouse appliances by companies such as Netezza, Datallegro, and Greenplum (which were based on relational database technology), and more recently with the introduction of appliances such as Aster’s and Vertica’s, which use column-based databases. The latter two are starting to be adopted for the management of Internet-based data. We have also seen the development of systems such as Hadoop, which provide a framework that applications can use to work on very large data sets. These products are maturing quickly, and their use is significantly reducing the cost of managing very large data sets.
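To make the MapReduce model that Hadoop exposes a little more concrete, here is a minimal sketch of a Hadoop Streaming-style job written in Python that counts page views per URL in a web server log. The log layout (tab-separated lines with the URL in the second field) and the script itself are illustrative assumptions, not a description of any particular company’s pipeline.

```python
#!/usr/bin/env python
# Minimal MapReduce sketch for Hadoop Streaming: count page views per URL.
# Assumes tab-separated log lines with the requested URL in the second field
# (a hypothetical format chosen only for illustration).
import sys

def mapper():
    # Emit "url<TAB>1" for every log line read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print("%s\t%d" % (fields[1], 1))

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal URLs arrive together.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    # Run as: script.py map   or   script.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```

With Hadoop Streaming, a script like this would be supplied as the job’s mapper and reducer; the framework takes care of splitting the input, sorting the intermediate keys, and distributing the work across the cluster.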
While companies are making good progress on managing these very large data sets, their ability to analyze them effectively and efficiently is lagging. Companies like Google, eBay, and Yahoo are using internally developed frameworks (e.g., Google’s MapReduce) and home-grown routines to analyze the data they generate because, in most cases, the existing analysis products they throw at them either can’t scale to operate on these sets or don’t have the functionality needed to address the questions these analyses must answer. Sample questions that e-marketers (who represent only one of the constituencies that need to analyze this data) are trying to answer include:
- What should my keyword-bidding strategy be (which keywords, what price) for each of Google, Yahoo, and MSN? How should I allocate my budget between SEM and SEO?
- Which ad networks are giving me the best performance? (One way such an analysis might be computed is sketched after this list.)
- Across which channels and at what percentages should I allocate my marketing budget to reduce lead acquisition costs and increase conversion rates?
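Taking the ad-network question as an example, here is a minimal sketch, in plain Python, of how per-network performance might be summarized from a hypothetical campaign log. The file name, the column names (network, impressions, clicks, conversions, spend), and the metrics chosen (click-through rate, conversion rate, cost per conversion) are all assumptions made for illustration; at the data volumes discussed above this aggregation would run as a distributed job rather than on a single machine.

```python
# Sketch: summarize ad-network performance from a hypothetical CSV campaign log.
# Columns (network, impressions, clicks, conversions, spend) are illustrative
# assumptions, not a real ad network's export format.
import csv
from collections import defaultdict

def summarize(path):
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0,
                                  "conversions": 0, "spend": 0.0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = totals[row["network"]]
            t["impressions"] += int(row["impressions"])
            t["clicks"] += int(row["clicks"])
            t["conversions"] += int(row["conversions"])
            t["spend"] += float(row["spend"])

    report = {}
    for network, t in totals.items():
        report[network] = {
            # Click-through rate, conversion rate, and cost per conversion.
            "ctr": t["clicks"] / t["impressions"] if t["impressions"] else 0.0,
            "conv_rate": t["conversions"] / t["clicks"] if t["clicks"] else 0.0,
            "cost_per_conv": t["spend"] / t["conversions"] if t["conversions"] else None,
        }
    return report

if __name__ == "__main__":
    for network, metrics in summarize("ad_network_log.csv").items():
        print(network, metrics)
```

Even a toy summary like this hints at why scale matters: the interesting versions of these questions slice the same raw events by keyword, channel, geography, and time, which is exactly the kind of workload the frameworks described above are built for.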
Over the past few months I’ve been meeting with several startups that are developing new analytic applications to address such questions, and I am particularly excited about the data-analysis innovations they are working on. Because more money is shifting online and the importance of the decisions made using this new data is rapidly increasing, this area will attract strong investor interest and has the potential to produce several winners.