James Hamilton's Blog RSS 2.0
 Wednesday, July 16, 2008

I’ve been collecting scaling stories for some time now and last week I came across the following run down on Fliker scaling: Federation at Flickr: Doing Billions of Queries Per Day by Dathan Vance Pattishall, the Flickr database guy.

 

The Flickr DB Architecture is sharded with a PHP access layer to maintain consistency.  Flickr users are randomly assigned to a shard. Each shard is duplicated in another database that is also serving active shards. Each DB needs to be less than 50% loaded to be able to handle failover.

 

Shards are found via a lookup ring that maps userID or groupID to shardID and photoID to userID.  The DBs are protected by a memcached layer with a 30 minute caching lifetime. Slide 16 says they are maintaining consistency using distributed transactions but I strongly suspect they are actually just running two parallel transactions with application management rather than 2pc.

 

Maintenance is done by bringing down ½ the DBs and the remaining DBs will handle the load but it appears they have no redundancy (failure protection) during the maintenance periods.

 

They have 12TB of user data in aggregate and they appear to be using MySQL (slide 25 complains about an INNODB bug).

 

Other web site scaling stories:

·         Scaling Linkedin: http://perspectives.mvdirona.com/2008/06/08/ScalingLinkedIn.aspx

·         Scaling Amazon: http://glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

·         Scaling Second Life: http://radar.oreilly.com/archives/2006/04/web_20_and_databases_part_1_se.html

·         Scaling Technorati: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/

·         Scaling Flickr: http://radar.oreilly.com/archives/2006/04/database_war_stories_3_flickr.html

·         Scaling Craigslist: http://radar.oreilly.com/archives/2006/04/database_war_stories_5_craigsl.html

·         Scaling Findory: http://radar.oreilly.com/archives/2006/05/database_war_stories_8_findory_1.html

·         MySpace 2006: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1423&year=All&search=megasite&sortChoice=&stype=

·         MySpace 2007: http://sessions.visitmix.com/upperlayer.asp?event=&session=&id=1521&year=All&search=scale&sortChoice=&stype=

·         Twitter, Flickr, Live Journal, Six Apart, Bloglines, Last.fm, SlideShare, and eBay: http://poorbuthappy.com/ease/archives/2007/04/29/3616/the-top-10-presentation-on-scaling-websites-twitter-flickr-bloglines-vox-and-more

 

                        -jrh

 

Thanks to Kevin Merritt (Blist) and Dave Quick (Microsoft) for sending this my way.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Wednesday, July 16, 2008 5:13:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Thursday, July 10, 2008

What follows is a guest posting from Phil Bernstein on the Google Megastore presentation by Jonas Karlsson, Philip Zeyliger at SIGMOD 2008:

 

Megastore is a transactional indexed record manager built by Google on top of BigTable. It is rumored to be the store behind Google AppEngine but this was not confirmed (or denied) at the talk. [JRH: I certainly recognize many similarities between the Google IO talk on the AppEngine store (see Under the Covers of the App Engine Datastore in Rough notes from Selected Sessions at Google IO Day 1) and Phil’s notes below].

 

·         A transaction is allowed to read and write data in an entity group.

·         The term “entity group” refers to a set of records, possibly in different BigTable instances. Therefore, different entities in an entity group might not be collocated on the same machine. The entities in an entity group share a common prefix of their primary key.  So in effect, an entity group is a hierarchically-linked set of entities.

·         A per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table.

·         To commit a transaction, its updates are stored in the log and replicated to the other copies of the log. Then they’re copied into the database copy of the entity group.

·         They commit to replicas before acking the caller and use Paxos to deal with replica failures. So it’s an ACID transaction.

·         Optimistic concurrency control is used. Few details were provided, but I assume it’s the same as what they describe for Google Apps.

·         Schemas are supported.

·         They offer vertical partitioning to cluster columns that are frequently accessed together.

·         They don’t support joins except across hierarchical paths within entity groups. I.e., if you want to do arbitrary joins, then you write an application program and there’s no consistency guarantee between the data in different entity groups.

·         Big Table does not support indexes. It simply sorts the rows by primary key. Megastore supports indexes on top. They were vague about the details. It sounds like the index is a binary table with a column that contains the compound key as a slash-separated string and a column containing the primary key of the entity group.

·         Referential integrity between the components of an entity group is not supported.

·         Many-to-many relationships are not supported, though they said they can store the inverse of a functional relationship.  It sounded like a materialized view that is incrementally updated asynchronously.

·         It has been in use by apps for about a year.

 

The follow are the notes that I typed while listening to the talk. For the most part, it’s just what was written on the slides and is incomplete. I don’t think it adds much to my summary above.

 

TITLE: Megastore – Scalable Data System for User-facing Apps

SPEAKERS: Jonas Karlsson, Philip Zeyliger (Google)

 

User-facing apps have TP-system-like requirements

Ø  data updated by a few users

Ø  largely reads, small updates

Ø  lots of work scaling the data layer

Ø  users expect consistency

 

Megastore – scale, RAD, consistency, performance

 

Scale

Ø  start with Big Table for storage

Ø  add db technologies that scale: indices, schemas

Ø  offer transparent replication and failover between data centers

Ø  support many, frequent reads

Ø  writes may be more costly, because they’re less frequent

Ø  with a correct schema, the app should scale naturally

 

Rapid App Development

Ø  hide operational aspects of scalability from app code

Ø  Flexible declarative schemas (MDL) – looks like SQL

o   indices, rich data types, consistency and partitioning

Ø  offer transactions

 

Entity Group consistency is supported

Ø  it’s a logical partition – e.g., blogs, posts, comments, which are all keyed by the owner of the blog

Ø  all entities in the group have the same high-order key component

 

Transactions

Ø  roll forward transaction log per entity group, no rollbacks

o   pb: It’s unclear to me whether they pool the log across all entity groups in a partition. If not, then they don’t benefit from group commit.

Ø  A transaction over an entity group is ACID , but not across entity groups

Ø  optimistic concurrency control

Ø  updates are available only after commit

Ø  api: newTransaction, read/write, commit (pb: couldn’t type fast enough for the details)

Ø  non-blocking, consistent reads (pb: does a transaction see its previous writes?)

Ø  cross-entity group operations have looser consistency

 

Performance

Ø  schemas declare their physical data locality

Ø  optimized to minimize seeks, RPCs, bandwidth, and storage

Ø  several ways of declaring physical locality

o   entity groups

o   shared primary key prefixes (collocating tables in Big Table)

o   locality groups – i.e., attribute partitioning

Ø  simply cost-transparent API primitives imply

o   only add scalable features

o   cost of writes is linear to data/indices

o   avoid scalability surprises

 

Avoiding joins

Ø  hierarchical primary keys

Ø  repeated fields (I guess just like Big Table)

Ø  store hierarchical data  in a column (it’s unclear to me whether this is the whole entity group or only part of it)

Ø  Syntax looks like SQL: Create Table (with primary key), Create Index (on particular columns), ..

 

Replication HA

Ø  uses Paxos-based algorithm, per entity group

Ø  it was more complicated than they expected

Ø  writes need a majority of replicas to be up in order to commit

Ø  most reads are local, consistency ensured

Ø  replication is consistent and synchronous

Ø  automatic query failover: individual table servers may fail

 

Usage

Ø  has been used in production for a year

Ø  used by several internal tools and projects

Ø  several large and user-visible apps (they wouldn’t say which ones, except we know them)

Ø  used for rapidly implementing new features in older projects

Ø  many other projects are migrating to it

 

Technical lessons

Ø  declarative schemas are good

Ø  cost-transparent APIs good (SQL is not cost-transparent)

Ø  avoid joins with hierarchical data and indices (if you want a join that isn’t on a hierarchical path, then write a program, e.g., Sawzall

Ø  avoid scalability surprises

 

Social

Ø  schema reviews helpful

Ø  consistency is necessary

Ø  need a mindset of scalability and performance

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Thursday, July 10, 2008 1:47:45 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, June 29, 2008

Title: Needle in a Haystack: Efficient Storage of Billions of Photos

Speaker: Jason Sobel, Manager of the Facebook, Infrastructure Group)

Slides: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv

 

An excellent talk that I really enjoyed.  I used to lead a much smaller service that also used a lot of NetApp storage and I recognized many of the problems Jason mentioned.  Throughout the introductory part of the talk I found myself thinking they need to move to a cheap, directly attached blob store. And that’s essentially the topic of remainder of the talk.  Jason presented Haystack, the Facebook solution to the problem of a filesystem not working terribly well for their high volume blob storage needs.

 

The same thing happened when he talked through the Facebook usage of Content Delivery Networks (CDNs).  The CDN stores the data once in a geo-distributed cache, Facebook stores it again in their distributed cache (Memcached) and then again the database tier.  Later in the talk Jason, made exactly this observation and observed the new design will allow them to use the CDNs less and as they get a broader geo-diverse data center footprint, they may move to being their own CDN. I 100% agree.

 

My rough notes below with some of what I found most interesting.

 

Overall Facebook facts:

·         #6 site on the internet

·         500 total employees

o   200 in engineering

o   25 in Infrastructure Engineering

·         One of the largest MySQL installations in the world

·         Big user and contributor to Memcached

·         More than 10k servers in production

·         6,000 logical databases in production

 

Photo Storage and Management at Facebook:

·         Photo facts:

o   6.5B photos in total

§  4 to 5 sizes of each picture is materialized (30B files)

§  475k images/second

·         Mostly served via CDN (Akamai & Limelight)

·         200k profile photos/second

§  100m uploads/week

o   Stored on netapp filers

·         First level caching via CDN (Akamai & Limelight)

o   99.8% hit rate for profiles

o   92% hit rate for remainder

·         Second level caching for profile pictures only via Cachr (non-profile goes directly against file handle cache)

o   Based upon a modified version of evhttp using memcached as a “backing” store

o   Since cachr is independent from memcachd, cachr failure doesn’t lose state

o   1 TB of cache over 40 servers

o   Delivers microsecond response

o   Redundancy so no loss of cache contents on server failure

·         Photo Servers

o   Non-profile requests go directly against the photo-servers

o   Only profile requests that miss the cachr cache.

·         File Handle Cache (FHC)

o   Based upon lighttpd and uses memcached as backing store

o   Reduces metadata workload on NetApp servers

o   Issue: filename to inode lookup is a serious scaling issue: 1) drives many I/Os or 2) wastes too much memory with a very large metadata caceh

§  They have extended the Linux kernel to allow NFS file opens via inode number rather than filename to avoid the NetApp scaling issue.

§  The inode numbers are stored in the FHC

§  This technique offloads the NetApp servers dramatically.

§  Note that files are write only.  Mods write a new file and delete the old ones so the handles will fail and a new metadata lookup will be driven.

·         Issues with this architecture:

o   Netapp storage overwhelmed by metadata (3 disk I/Os to read a single photo).

§  The original design required 15 I/Os for a single picture (due to deeper directory hierarchy I’m guessing)

§  Tracking last access time, last modified etc. has no value to Facebook.  They really only need a blob store but they are using a filesystem at additional expense

o   Heavy reliance on CDNs and caches such that netapp is basically almost pure backup

§  92% of non-profile and 99.8% of profile pictures are stored in CDN

§  Many of the rest are almost all stored in caching layers

·         Solution: Haystacks

o   Haystacks are a user level abstraction where lots of data is stored in a single file

o   Store an independent index vastly more efficient than the file store

o   1M of metadata/1G of data

§  Order of magnitude better on average than standard NetApp metadata

o   1 disk seek for all reads with any workload

o   Most likely store in XFS

o   Expect each haystack to be about 10G (with an index)

o   Speaker equates a Haystack to be a lot like a LUN and could be implemented on a LUN.  The actual implementation is via NFS onto NetApp as photos were previously stored

o   Net of what’s happening:

§  Haystack always hits on the metadata

o   Plan to replace NetApp

§  Haystack is a win over NetApp but we’ll likely run over XFS (originally done by Silicon Grapics)

§  Want more control of the cache behavior

o   Each Haystack Format:

§  Version number,

§  Magic number,

§  Length,

§  Data,

§  Checksum

o   Index format

§  Version,

§  Photo key,

§  Photo size,

§  Start,

§  Length.

o   Not planning to delete photos at all since delete rate is VERY low so it the resource that would be recovered are not worth the work to recover them in the Facebook usage.  Deletion just removes the entry from the index which makes the data unavailable but they don’t bother to actually remove it from the Haystack bulk storage system.

o   Q:Why not store the index in a RDBMS?  Feels that it’ll drive too many I/Os and have the problems they are trying to avoid (I’m not completely convinced but do understand that simplicity and being in control has value).

·         They still plan to use the CDN but they are hoping to reduce their dependence on CDN. They are considering becoming their own CDN (Facebook is absolutely large enough to be able to do this cost effectively today).

·         They are considering using to SSDs in the future.

·         Not interested in hosting with Google or Amazon. Compute is already close to the data and they are working to get both closer to users but don’t see a need/use for GAE or AWS at the Facebook scale.

·         The Facebook default is to use databases.  Photos are the largest exception but most data is stored in DBs. Few actions use transactions and joins though.

·         Almost all data is cached twice: once in memcached and then again in the DBs.

·         Random bits:

o   Canada: 1 out of 3 Canadians use Facebook.

o   Q:What is the strategy in China?  A:“not to do what Google did” :-)

o   Looking at de-duping and other commonality exploiting systems for client to server communications and storage (great idea although not clearly a big win for photos).

o   90% Indians access internet via a mobile device.  Facebook very focused on mobile and international.

 

Overall, an excellent talk by Jason.

 

--jrh

 

Sent my way by Hitesh Kanwathirtha  of the Windows Live Experience team, Mitch Wyle of Engineering Excellence, and Dave Quick and Alex Mallet both in Windows Live Cloud Storage group. It was originally Slashdotted at: http://developers.slashdot.org/article.pl?no_d2=1&sid=08/06/25/148203. The presentation is posted at: http://beta.flowgram.com/f/p.html#2qi3k8eicrfgkv.

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Sunday, June 29, 2008 8:04:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Services
 Friday, June 27, 2008

Alex Mallet and Viraj Mody of the Windows Live Mesh team took great notes at the Structure ’08 (Put Cloud Computing to Work) conference (appended below).

 

Some pre-reading information was made available to all attendees as well: Refresh the Net: Why the Internet needs a Makeover?

 

Overall

-          Interesting mix of attendees from companies in all areas of “cloud computing”

-          The quality of the presentations and panels was somewhat uneven

-          Talks were not very technical

-          Amazon is the clear leader in mindshare; MS isn’t even on the board

-          Lots of speculation about how software-as-a-service, platform-as-a-service, everything-as-a-service is going to play out: who the users will be, how to make money, whether there will be cloud computing standards etc

 

5 min Nick Carr video [author of “The Big Switch”]

-          Drew symbolic link between BillG retiring and Structure ’08, the first “cloud computing” conference, being the same week, marking the shift of computing from the desktop to the datacenter

-          Generic pontificating on the coming “age of cloud computing”

 

“The Platform Revolution: a look into disruptive technologies”, Jonathan Yarmis, research analyst

-          Enterprises always lag behind consumers in adoption of new technology, and IT is powerless to stop users from adopting new technology

-          4 big tech trends: social networks, mobility, cloud computing, alternative business models [eg ad-supported]

-          Tech trends mutually reinforcing: mobility leads to more social networking applications, being able to access data/apps in the cloud leads to more mobility

-          Mobile is platform for next-gen and emerging markets: 1.4 billion devices per year, 20% device growth per year, average device lifetime 21 months; opens up market to new users and new uses

-          Claim: “single converged device will never exist”; cloud computing enables independence of device and location

-          Stream computing: rate of data creation is growing at 50-500% per year, and it’s becoming increasingly important to be able to quickly process the data, determine what’s interesting and discard the rest

-          “Economic value of peer relationships hasn’t been realized yet” – Facebook Beacon was a good idea, but poorly realized

 

“Virtualization and Cloud Computing”, with Mendel Rosenblum, VMWare founder 

-          Virtualization can/should be used to decouple software from hardware even in the datacenter

-          Virtualization is cloud-computing enabler: can decide whether to run your own machines, or use somebody else’s, without having to rewrite everything

-          Coming “age of multicore” makes virtualization even more important/useful

-          Smart software that figures out how to distribute VMs over physical hardware isn’t a commodity yet

-          VMWare is working on merging the various virtualization layers: machine, storage, networks [eg VLANs]

-          HW support for virtualization is mostly being done for server-class machines [?]

-          Rosenblum doesn’t think moving workloads from the datacenter to edge machines, to take advantage of spare cycles, will ever take off – it’s just too much of a pain to try to harness those spare cycles

<