Yahoo! hosted the Hadoop Summit on Tuesday of this week. I posted my rough notes on the conference over the course of the day; this posting summarizes some of what caught my interest and consolidates my notes.
Yahoo! expected 100 attendees and ended up having to change venues to come closer to fitting the more than 400 who wanted to attend. For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start: roughly half the crowd is running Hadoop in production, and around a fifth have clusters of more than 100 nodes. Yahoo! remains the biggest with 2,000 nodes in their cluster.
Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all Yahoo! crawled pages and all the metadata they extract or compute on those pages. There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world: the longest map/reduce job runs for over three days with 100k maps and 10k reduces, reading 300 TB and producing 200 TB.
Another informative talk was given by the Facebook team. Joydeep Sarma and Ashish Thusoo described Hive, the data warehouse at Facebook. I liked this talk because it was 100% customer driven: they implemented what the analysts and programmers inside Facebook needed, and I found their observations credible and interesting. They reported that analysts are used to SQL and found a SQL-like language most productive, but that programmers like to have direct access to map/reduce primitives. As a consequence, they provide both (so do we). The Facebook team reports that roughly 25% of the development team uses Hive, processing 3,500 map/reduce jobs a week.
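To make the analyst/programmer split concrete, here's a toy sketch of the same count-by-group computation expressed both ways. The query syntax in the comment and all function and field names below are illustrative assumptions, not actual Hive or Hadoop APIs:

```python
# A SQL-like query an analyst might write (illustrative syntax, not exact HiveQL):
#   SELECT status, COUNT(1) FROM actions GROUP BY status
#
# The same computation expressed with the map/reduce primitives a programmer
# might reach for directly -- a pure-Python simulation of the model:
from collections import defaultdict

def map_phase(records):
    # Emit a (key, 1) pair for each record's status field.
    for record in records:
        yield (record["status"], 1)

def reduce_phase(pairs):
    # Group the emitted pairs by key, then sum the counts for each key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

records = [{"status": "ok"}, {"status": "fail"}, {"status": "ok"}]
print(reduce_phase(map_phase(records)))  # {'ok': 2, 'fail': 1}
```

The SQL-like form is shorter for routine aggregation, while the raw primitives let a programmer express logic that doesn't fit a query language, which is why offering both surfaces makes sense.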
Google is heavily invested in Hadoop as a teaching vehicle even though it's not used internally; the Google interest in Hadoop is to get graduating students familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop; for example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.
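The model those courses teach is easy to sketch with the canonical word-count example. This is a single-process Python simulation of the map, shuffle, and reduce phases under simplified assumptions, not Hadoop API code:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map: emit (word, 1) for every word in one line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: combine all the values for one key into a final result.
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

The appeal for teaching is that students write only the mapper and reducer; the framework handles distribution, shuffling, and fault tolerance.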
The agenda for the day:
Welcome & Logistics
Ajay Anand, Yahoo!
Doug Cutting / Eric Baldeschwieler, Yahoo!
Chris Olston, Yahoo!
Kevin Beyer, IBM
Michael Isard, Microsoft
Monitoring Hadoop using X-Trace
Andy Konwinski and Matei Zaharia, UC Berkeley
Ben Reed, Yahoo!
Michael Stack, Powerset
HBase at Rapleaf
Bryan Duxbury, Rapleaf
Joydeep Sen Sarma / Ashish Thusoo, Facebook
GrepTheWeb - Hadoop and AWS
Jinesh Varia, Amazon.com
Building Ground Models of Southern California
Steve Schlosser, David O'Hallaron, Intel / CMU
Online search for engineering design content
Mike Haley, Autodesk
Yahoo - Webmap
Arnab Bhattacharjee, Yahoo!
Natural Language Processing
Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
Panel on future directions
Sameer Paranjpye, Sanjay Radia, Owen O'Malley (Yahoo!), Chad Walters (Powerset), Jeff Eastman (Mahout)
My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/
James Hamilton, Windows Live Platform Services Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.