Thursday, March 27, 2008

Yahoo! hosted the Hadoop Summit on Tuesday of this week.  I posted my rough notes on the conference over the course of the day. This posting summarizes some of what caught my interest and consolidates my notes.

 

Yahoo expected 100 attendees and ended up having to change venues to get closer to fitting the more than 400 who wanted to attend.  For me the most striking thing is that Hadoop is now clearly in broad use and at scale. Doug Cutting did a quick survey at the start: roughly half the crowd were running Hadoop in production, and around one fifth have clusters of over 100 nodes. Yahoo remains the biggest with 2,000 nodes in their cluster.

 

Christian Kunz of Yahoo! gave a bit of a window into how Yahoo! is using Hadoop to process their Webmap data store. The Webmap is a structured storage representation of all Yahoo! crawled pages and all the metadata they extract or compute on those pages.  There are over 100 Webmap applications used in managing the Yahoo! indexing engine. Christian talked about why they moved to Hadoop from the legacy system and summarized the magnitude of the workload they are running. These are almost certainly the largest Hadoop jobs in the world. The longest map/reduce job runs for over three days and has 100k maps and 10k reduces. This job reads 300 TB and produces 200 TB.
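To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the average data volume per task. It is only a sketch: it assumes decimal units (1 TB = 1000 GB), an even split of data across tasks, and that the 200 TB of output is written by the reduces. None of those details were stated in the talk.

```python
# Back-of-the-envelope averages for the largest Webmap job described above.
# Assumptions (not from the talk): decimal units (1 TB = 1000 GB), data
# spread evenly across tasks, and all output written by the reduce tasks.

TB_IN_GB = 1000

input_tb = 300          # data read by the job
output_tb = 200         # data produced by the job
map_tasks = 100_000     # "100k maps"
reduce_tasks = 10_000   # "10k reduces"


def average_gb_per_task(total_tb: float, tasks: int) -> float:
    """Average data volume per task in GB, assuming an even split."""
    return total_tb * TB_IN_GB / tasks


avg_map_input_gb = average_gb_per_task(input_tb, map_tasks)
avg_reduce_output_gb = average_gb_per_task(output_tb, reduce_tasks)

print(f"average input per map:     {avg_map_input_gb:.0f} GB")
print(f"average output per reduce: {avg_reduce_output_gb:.0f} GB")
```

Under these assumptions each map reads about 3 GB on average and each reduce writes about 20 GB. Real task sizes will vary with how the data is partitioned.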

 

Another informative talk was given by the Facebook team. They described Hive, the data warehouse at Facebook.  Joydeep Sen Sarma and Ashish Thusoo presented this work. I liked this talk as it was 100% customer driven. They implemented what the analysts and programmers inside Facebook needed and I found their observations credible and interesting.  They reported that analysts are used to SQL and found a SQL-like language most productive but that programmers like to have direct access to map/reduce primitives.  As a consequence, they provide both (so do we).  The Facebook team reports that roughly 25% of the development team uses Hive and that they process 3,500 map/reduce jobs a week.
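The difference between the two styles can be illustrated with a small sketch. The example below is not Hive syntax or the Hadoop API. It is a minimal in-memory imitation of map/reduce in plain Python, and the `VIEWS` records, the `views` table name, and the helper `run_map_reduce` are all hypothetical. The same page-view count an analyst would express as a single SQL-like statement is written out here as explicit map and reduce functions.

```python
from collections import defaultdict

# Hypothetical input records: (user, page) pairs for page views.
VIEWS = [("ann", "/home"), ("bob", "/home"), ("ann", "/about")]

# What an analyst might write in a SQL-like language
# (generic SQL against a hypothetical "views" table):
#   SELECT page, COUNT(*) FROM views GROUP BY page


def map_view(record):
    """Map step: emit (page, 1) for every page view."""
    _user, page = record
    yield page, 1


def reduce_counts(page, counts):
    """Reduce step: sum the emitted counts for one page."""
    return page, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


page_counts = run_map_reduce(VIEWS, map_view, reduce_counts)
print(page_counts)
```

The SQL-like form is shorter for this kind of aggregation, while the explicit functions give a programmer full control over what is emitted and how it is combined, which is consistent with the preference split the Facebook team described.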

 

Google is heavily invested in Hadoop, using it as a teaching vehicle even though it’s not used internally.  The Google interest in Hadoop is to get graduating students more familiar with the map/reduce programming model. Several schools have agreed to teach map/reduce programming using Hadoop. For example, Berkeley, CMU, MIT, Stanford, UW, and UMD all plan courses.

 

The agenda for the day:

Time | Topic | Speaker(s)
8:00-8:55 | Breakfast/Registration |
8:55-9:00 | Welcome & Logistics | Ajay Anand, Yahoo!
9:00-9:30 | Hadoop Overview | Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 | Pig | Chris Olston, Yahoo!
10:00-10:30 | JAQL | Kevin Beyer, IBM
10:30-10:45 | Break |
10:45-11:15 | DryadLINQ | Michael Isard, Microsoft
11:15-11:45 | Monitoring Hadoop using X-Trace | Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 | ZooKeeper | Ben Reed, Yahoo!
12:15-1:15 | Lunch |
1:15-1:45 | HBase | Michael Stack, Powerset
1:45-2:15 | HBase at Rapleaf | Bryan Duxbury, Rapleaf
2:15-2:45 | Hive | Joydeep Sen Sarma / Ashish Thusoo, Facebook
2:45-3:05 | GrepTheWeb - Hadoop on AWS | Jinesh Varia, Amazon.com
3:05-3:20 | Break |
3:20-3:40 | Building Ground Models of Southern California | Steve Schlosser, David O'Hallaron, Intel / CMU
3:40-4:00 | Online search for engineering design content | Mike Haley, Autodesk
4:00-4:20 | Yahoo - Webmap | Arnab Bhattacharjee, Yahoo!
4:20-4:45 | Natural Language Processing | Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:45-5:30 | Panel on future directions | Sameer Paranjpye, Sanjay Radia, Owen O'Malley (Yahoo), Chad Walters (Powerset), Jeff Eastman (Mahout)

My more detailed notes are at: HadoopSummit2008_NotesJamesRH.doc (81.5 KB). Peter Lee’s Hadoop Summit summary is at: http://www.csdhead.cs.cmu.edu/blog/

 

James Hamilton, Windows Live Platform Services
Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh


Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.
