Wednesday, June 03, 2009

Don MacAskill gave one of his usual excellent talks at MySQL Conf 09. My rough notes follow.

 

Speaker: Don MacAskill

Video at: http://mysqlconf.blip.tv/file/2037101

·         SmugMug:

o   Bootstrapped in ’02 and still operating without external funding

o   Profitable and without debt

o   Top 400 website

o   Doubling yearly

·         SmugMug Challenge:

o   Users get unlimited storage & bandwidth            

o   Photos up to 48Mpix (more than 500m)

o   Video up to 1920x1080p

·         300+ four core hosts (mostly diskless)

o   Mostly AMD but really excited by Intel Nehalem [JRH: so am I]

·         5 datacenters (3 in Silicon Valley, 1 in Seattle, and 1 in Virginia) [JRH: corrected from 4 to 5 -- thanks Modesto Alexandre]

·         Only 2 ops guys

·         Lots of AWS use (Simple Storage Service, Elastic Compute Cloud, etc.)

·         Service deployment model: servers automatically load their config from a central role database. On reboot, the configured role is loaded.  Role change is a DB update followed by a reboot. [JRH: very nice]

·         Binary data all stored in Amazon S3 (PB of data at this point)

·         Akamai for content distribution network

·         Structured data

o   MySQL (InnoDB mostly)

o   Scaled up and out using cheap multi-core CPUs with lots of memory

o   4+ cores, 64GB memory, >2TB storage

·         Heavy use of MemcacheD (over 1TB of memory)

o   Over 96% hit rate and fall back to MySQL for cold data access

o   Been using it since first released 4 to 5 years back

·         Compute:

o   Amazon EC2 for photo and video processing and encoding

o   Depend upon EC2 for scaling up to high traffic times and, more importantly, being able to scale down to low traffic times such as the middle of the night (SmugMug is predominantly a North American service at this point). They run 10s of cores during scale down periods and 100s if not 1000s of cores during scale up periods

§  Totally autonomous scaling up and down using SkyNet (written by SmugMug)

·         Web Servers:

o   Diskless with PXE boot

·         MySQL:

o   Most important technology in use at SmugMug

o   Super dependent on replication for performance, reliability, and high availability

o   No data loss in over 7 years

o   No joins or other 4.x+ features

§  Like the Drizzle project (http://en.wikipedia.org/wiki/Drizzle_(database_server)) since it re-focuses MySQL on the core they actually use – lean and mean.

o   Vertically partitioned. They have looked at sharding several times but have always managed to find a way to avoid it so far

·         InnoDB

o   Running 1.0.3+ patches (Percona XtraDB) in production (great for concurrency bound issues)

§  Great relationship with Percona (“Crazy concentration of talent under 1 roof”) who does MySQL support

·         MySQL Details:

o   Data integrity is number 1 issue

o   Next most important is write latency since scaling reads is relatively easy.

o   Replication kept at less than 1sec behind

o   Big RAM (64GB+) to keep indexes in memory

o   Previously had many concurrency issues (better now).

·         MySQL Usage:

o   Not very relational. Mostly a key-value store

o   Very denormalized

o   No  joins or complex selects

o   96% MemcacheD hit rate to cool MySQL

·         MySQL Issues:

o   Need a better filesystem:

§  They use the CentOS linux distro

§  MySQL is storage intensive (IOPS & capacity)

§  Ext3 is broken and sucks. Fsck sucks as well

§  Ext4 is also old and busted

§  Want good volume management

§  Ext3 serialized writes to a given file

§  Love ZFS

·         Transactional, copy-on-write, end-to-end data integrity, on the fly corruption detection and repair, integrated volume management, snapshots and clones supported, and open source software

·         Unfortunately ZFS doesn’t run on Linux and SmugMug is a Linux shop

o   Replication:

§  Unknown state on crash

§  Did *.info get written at commit or 2 months out of date (in one instance)?

·         Transactional replication to the rescue

§  Bringing up TB+ slaves is slow

§  Backups using LVM/ZFS a pain

§  Single thread for replication can fall behind

§  Transactional replication patches from Google are GREAT and solve these issues

·         InnoDB only

·         Taking these patches to production next week.

·         Sun Sushi Toro aka S7410

o   NAS box with a few twists:

§  2x quad-core Opteron with 64GB RAM

§  100GB Readzilla SSD

§  2x 18GB Writezilla SSD (20k write IOPS)

§  22x 1TB 7200 RPM HDD

§  Clustered for HA

§  SSD performance with HDD economy

§  Toro supports ZFS on Linux

§  Can access using: NFS, iSCSI, CIFS, HTTP, FTP, etc.

§  Supports compression (1.5 compression ratio on their workload)

§  Cost: $80k ($142k clustered) – nobody pays list price though

§  SmugMug has 5 of these devices

§  5 different MySQL workloads hosted on a single shared cluster

§  Backups are a breeze (great snapshot support with roll back)

·         Rollback can selectively skip operations

·         Investigating 10GigE and actively testing

o   Intel NICs with Arista switches at less than $500/port

o   Using copper twinax SFP+

·         Expect 100% SSD in the future (not for bulk data)

·         Excited about Drizzle (scaled down MySQL)

·         Request from Oracle:

o   MySQL is a crown jewel – take care of it

o   GPL ZFS (lots of applause)
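The MemcacheD notes above describe a read path that checks the cache first and falls back to MySQL for cold data. A minimal sketch of that cache-aside pattern follows. The class name, the helper function, and the plain dicts standing in for MemcacheD and MySQL are all hypothetical; nothing here is SmugMug's actual code.

```python
class CacheAsideStore:
    """Read path that tries the cache first and falls back to the database."""

    def __init__(self, database):
        self.database = database   # hypothetical stand-in for MySQL (a dict)
        self.cache = {}            # hypothetical stand-in for MemcacheD
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cold data: read from the database and warm the cache for next time.
        self.misses += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def database_read_fraction(hit_rate):
    """Fraction of reads that still reach the database at a given hit rate."""
    return 1.0 - hit_rate
```

At the 96% hit rate quoted in the talk, `database_read_fraction(0.96)` is about 0.04, so only about 4% of reads reach MySQL.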

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Wednesday, June 03, 2009 6:57:34 AM (Pacific Standard Time, UTC-08:00)
Services
 Thursday, May 21, 2009

Cloud services provide excellent value but it’s easy to underestimate the challenge of getting large quantities of data to the cloud. When moving very large quantities of data, even the fastest networks are surprisingly slow.  And many companies have incredibly slow internet connections. Back in 1996, Minix author and networking expert Andrew Tanenbaum said “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway”.  For large data transfers, it’s faster (and often cheaper) to write to local media and ship the media via courier.
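To put rough numbers on how slow a network transfer can be, here is a small back-of-the-envelope sketch. The function name, the 80% sustained link utilization default, and the example data size and link speed are illustrative assumptions, not figures from AWS.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_terabytes, link_megabits_per_second, link_utilization=0.8):
    """Days needed to move the data over a link at a sustained utilization.

    link_utilization is an assumed fraction of the rated link speed that the
    transfer actually achieves.
    """
    data_bits = data_terabytes * 1e12 * 8          # decimal terabytes to bits
    effective_bits_per_second = (
        link_megabits_per_second * 1e6 * link_utilization
    )
    return data_bits / effective_bits_per_second / SECONDS_PER_DAY
```

Under these assumptions, `transfer_days(10, 100)` comes to roughly 11.6 days for 10 TB over a 100 Mbps link, which is the kind of transfer where shipping media starts to look attractive.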

 

This morning the beta release of Amazon Web Services Import/Export was announced. This service essentially implements sneakernet, allowing the efficient transfer of very large quantities of data into or out of the AWS Simple Storage Service. This initial beta release only supports import but the announcement reports that “the service will be expanded to include export in the coming months”.

 

To use the service, the data is copied to a portable storage device formatted using the NTFS, FAT, ext2, or ext3 file system. The manifest that describes the data load job is digitally signed using the sending user’s AWS secret access key, and the device is shipped to Amazon for loading.  Load charges are:

Device Handling

·         $80.00 per storage device handled.

Data Loading Time

·         $2.49 per data-loading-hour. Partial data-loading-hours are billed as full hours.

Amazon S3 Charges

·         Standard Amazon S3 Request and Storage pricing applies.

·         Data transferred between AWS Import/Export and Amazon S3 is free of charge (i.e. $0.00 per GB).

In addition to allowing much faster data ingestion, AWS Import/Export reduces networking costs since there is no charge for the transfer of data between the Import/Export service and S3.  A calculator is provided to compare estimated electronic transfer costs vs import/export costs.  It’s a clear win for larger data sets.
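The load charges listed above can be turned into a quick estimate. This is a sketch only: the function name is made up, it assumes partial hours are rounded up per device, and it excludes the standard S3 request and storage charges. The loading time per device depends on device throughput, which the announcement text quoted here does not give, so the caller supplies it.

```python
import math

DEVICE_HANDLING_FEE = 80.00     # dollars per storage device handled
LOADING_RATE_PER_HOUR = 2.49    # dollars per data-loading-hour


def import_job_cost(loading_hours_per_device):
    """Estimated Import/Export charge for a list of per-device loading times.

    Partial data-loading-hours are billed as full hours; rounding up per
    device (rather than per job) is an assumption of this sketch.
    """
    devices = len(loading_hours_per_device)
    billed_hours = sum(math.ceil(hours) for hours in loading_hours_per_device)
    total = devices * DEVICE_HANDLING_FEE + billed_hours * LOADING_RATE_PER_HOUR
    return round(total, 2)
```

For example, one device that takes 5.2 hours to load is billed for 6 hours, so `import_job_cost([5.2])` gives 94.94.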

 

                                                --jrh

 


 

Thursday, May 21, 2009 5:49:27 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, May 20, 2009

From an interesting article in Data Center Knowledge Who has the Most Web Servers:

The article goes on to speculate on server counts at the companies that do not publicly disclose them but are likely over 50k.  Google is likely around a million, Microsoft is over 200k, and "Amazon says very little about its data center operations".

 

                                --jrh

 


 

Wednesday, May 20, 2009 4:45:41 AM (Pacific Standard Time, UTC-08:00)
Services
 Monday, May 18, 2009

Earlier this morning Amazon Web Services announced the public beta of Amazon CloudWatch, Auto Scaling, and Elastic Load Balancing.  Amazon CloudWatch is a web service for monitoring AWS resources. Auto Scaling automatically grows and shrinks Elastic Compute Cloud resources based upon demand.  Elastic Load Balancing distributes workload over a fleet of EC2 servers.

  • Amazon CloudWatch – Amazon CloudWatch is a web service that provides monitoring for AWS cloud resources, starting with Amazon EC2. It provides you with visibility into resource utilization, operational performance, and overall demand patterns—including metrics such as CPU utilization, disk reads and writes, and network traffic. To use Amazon CloudWatch, simply select the Amazon EC2 instances that you’d like to monitor; within minutes, Amazon CloudWatch will begin aggregating and storing monitoring data that can be accessed using web service APIs or Command Line Tools. See Amazon CloudWatch for more details.
  • Auto Scaling – Auto Scaling allows you to automatically scale your Amazon EC2 capacity up or down according to conditions you define. With Auto Scaling, you can ensure that the number of Amazon EC2 instances you’re using scales up seamlessly during demand spikes to maintain performance, and scales down automatically during demand lulls to minimize costs. Auto Scaling is particularly well suited for applications that experience hourly, daily, or weekly variability in usage. Auto Scaling is enabled by Amazon CloudWatch and available at no additional charge beyond Amazon CloudWatch fees. See Auto Scaling for more details.
  • Elastic Load Balancing – Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances. It enables you to achieve even greater fault tolerance in your applications, seamlessly providing the amount of load balancing capacity needed in response to incoming application traffic. Elastic Load Balancing detects unhealthy instances within a pool and automatically reroutes traffic to healthy instances until the unhealthy instances have been restored. You can enable Elastic Load Balancing within a single Availability Zone or across multiple zones for even more consistent application performance. Amazon CloudWatch can be used to capture a specific Elastic Load Balancer’s operational metrics, such as request count and request latency, at no additional cost beyond Elastic Load Balancing fees. See Elastic Load Balancing for more details.
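To illustrate the idea behind scaling "according to conditions you define", here is a minimal threshold-based sketch. It is not the Auto Scaling API: the function name, the CPU thresholds, and the instance limits are all hypothetical, and a real policy would typically also include cooldown periods and larger step sizes.

```python
def desired_instance_count(current, avg_cpu_percent,
                           scale_up_at=70.0, scale_down_at=30.0,
                           minimum=2, maximum=20):
    """Return the next fleet size for a simple CPU-threshold policy.

    Adds one instance when average CPU is above scale_up_at, removes one
    when it is below scale_down_at, and clamps the result to the
    [minimum, maximum] range. All thresholds are illustrative.
    """
    if avg_cpu_percent > scale_up_at:
        target = current + 1
    elif avg_cpu_percent < scale_down_at:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))
```

A monitoring loop would feed this function an aggregated metric, such as the CPU utilization CloudWatch reports, and then launch or terminate instances to reach the returned count.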


 

Monday, May 18, 2009 5:16:10 AM (Pacific Standard Time, UTC-08:00)
Services
 Saturday, May 16, 2009

A couple of weeks back, a mini-book by Luiz André Barroso and Urs Hölzle of the Google infrastructure team was released. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines is just over 100 pages long but is an excellent introduction to very high scale computing and the issues important at scale.

 

From the Abstract:

As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today’s WSCs on a single board.

 

Some of the points I found particularly interesting:

·         Networking:

o   Commodity switches in each rack provide a fraction of their bi-section bandwidth for interrack communication through a handful of uplinks to the more costly cluster-level switches. For example, a rack with 40 servers, each with a 1-Gbps port, might have between four and eight 1-Gbps uplinks to the cluster-level switch, corresponding to an oversubscription factor between 5 and 10 for communication across racks. In such a network, programmers must be aware of the relatively scarce cluster-level bandwidth resources and try to exploit rack-level networking locality, complicating software development and possibly impacting resource utilization. Alternatively, one can remove some of the cluster-level networking bottlenecks by spending more money on the interconnect fabric.

·         Server Power Usage:

·         Buy vs Build:

Traditional IT infrastructure makes heavy use of third-party software components such as databases and system management software, and concentrates on creating software that is specific to the particular business where it adds direct value to the product offering, for example, as business logic on top of application servers and database engines. Large-scale Internet services providers such as Google usually take a different approach in which both application-specific logic and much of the cluster-level infrastructure software is written in-house. Platform-level software does make use of third-party components, but these tend to be open-source code that can be modified inhouse as needed. As a result, more of the entire software stack is under the control of the service developer.

 

This approach adds significant software development and maintenance work but can provide important benefits in flexibility and cost efficiency. Flexibility is important when critical functionality or performance bugs must be addressed, allowing a quick turn-around time for bug fixes at all levels. It is also extremely advantageous when facing complex system problems because it provides several options for addressing them. For example, an unwanted networking behavior might be very difficult to address at the application level but relatively simple to solve at the RPC library level, or the other way around.
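Returning to the networking excerpt above, the oversubscription factor it quotes is simply the ratio of aggregate server bandwidth in a rack to aggregate uplink bandwidth. The sketch below reproduces the 40-server example; the function names are my own.

```python
def oversubscription_factor(servers_per_rack, server_port_gbps,
                            uplinks, uplink_gbps):
    """Ratio of total server port bandwidth to total rack uplink bandwidth."""
    return (servers_per_rack * server_port_gbps) / (uplinks * uplink_gbps)


def cross_rack_gbps_per_server(servers_per_rack, uplinks, uplink_gbps):
    """Uplink bandwidth available per server if all talk across racks at once."""
    return (uplinks * uplink_gbps) / servers_per_rack
```

With 40 servers on 1-Gbps ports, four uplinks give `oversubscription_factor(40, 1, 4, 1)` of 10 and eight uplinks give 5, matching the range in the excerpt. In the four-uplink case each server gets only about 0.1 Gbps of cross-rack bandwidth when all are sending at once.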

 

The full paper: http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006

 


 

Saturday, May 16, 2009 9:30:04 AM (Pacific Standard Time, UTC-08:00)
Services
 Sunday, May 03, 2009

Chris Dagdigian of BioTeam presented the keynote at this year’s Bio-IT World Conference. I found this presentation interesting for at least two reasons: 1) it’s a very broad and well reasoned look at many of the issues in computational science and, 2) an innovative example of cloud computing is presented  where BioTeam and Pfizer implement protein docking using Amazon AWS.

 

The presentation is posted at: http://blog.bioteam.net/wp-content/uploads/2009/04/bioitworld-2009-keynote-cdagdigian.pdf and I summarize some of what caught my interest below:

·         Argues that virtualization is “still the lowest hanging fruit in most shops” yielding big gains for operators, users, the environment, and budgets

·         Storage:

o   Storage still cheap and getting cheaper but operational costs largely unchanged

o   Data Triage needed: volume of data production is outpacing declining fully burdened cost of storage (including operational costs)

o   Lessons learned from a data loss event (10+TB lost)

§  Double disk failure on RAID5 volume holding SAN FS metadata with significant operational errors

§  Need more redundancy than RAID5

§  Need SNMP and email error reporting

§  Need storage subsystems to actively scrub, verify, and correct  errors

o   Concludes the storage discussion by pointing out that cloud services offer excellent fully burdened storage costs

·         Utility Computing

o   It is expensive to design for peak demand in-house

o   Pay-as-you-go can be compelling for some workloads

o   Explained why he “drank the Amazon EC2 Kool-Aid”: saw it, used it, solved actual customer problems with it. As an example, Chris looked at a protein docking project done by Pfizer & BioTeam.

·         Protein Docking project architecture:

o   Borrows heavily from Rightscale Grid Edition

o   Inbound and outbound in Amazon SQS

o   Job specification in JSON

o   Data stored in Amazon S3

o   Job provenance and metadata stored in SimpleDB

o   Worker instances dynamically spawned in EC2 where structures are scored

o   All results stored in S3 (EC2 <-> S3 bandwidth free)

o   Download the top ranked docked complexes

o   Launch post-processing EC2 instances to score, rank, filter, and cluster results into S3 (bring the computation to the data)

·         Doesn’t want to belittle the security concerns but detects a whiff of hypocrisy in the air

o   Is your staff really concerned or just protecting their turf

o   It is funny to see people demanding security measures they don’t practice internally across their own infrastructure

·         Next-Gen & utility storage

o   Primary analysis onsite; data moved to remote utility storage service after passing QC tests

o   Data would rarely (if ever) move back

o   Need to reprocess or rerun?

§  Spin up cloud servers to re-analyze in situ

§  Terabyte data transit not required
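The protein docking architecture described above (JSON job specifications on an inbound queue, workers that score structures, results written to object storage, completion messages on an outbound queue) can be sketched in a few lines. This is a toy, single-process illustration: `queue.Queue` and a dict are hypothetical stand-ins for Amazon SQS and S3, the function names and job fields are invented, and the SimpleDB provenance step is omitted.

```python
import json
import queue


def run_docking_jobs(job_specs, score_structure):
    """Push JSON job specs through an inbound queue and collect results."""
    inbound = queue.Queue()    # stand-in for the inbound SQS queue
    outbound = queue.Queue()   # stand-in for the outbound SQS queue
    object_store = {}          # stand-in for S3

    for spec in job_specs:
        inbound.put(json.dumps(spec))   # job specification in JSON

    # One in-process loop stands in for dynamically spawned EC2 workers.
    while not inbound.empty():
        job = json.loads(inbound.get())
        result_key = f"results/{job['job_id']}"
        object_store[result_key] = score_structure(job["structure"])
        outbound.put(json.dumps({"job_id": job["job_id"],
                                 "result_key": result_key}))

    completed = []
    while not outbound.empty():
        completed.append(json.loads(outbound.get()))
    return completed, object_store


def top_ranked(object_store, count):
    """Return the keys of the highest-scoring results."""
    return sorted(object_store, key=object_store.get, reverse=True)[:count]
```

The design point the talk makes survives even in this toy form: results stay in the object store next to the compute, and only the top-ranked entries need to be downloaded.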

 


 

Sunday, May 03, 2009 8:58:20 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, April 29, 2009

In the Randy Katz on High Scale Data Centers posting, the article brought up Google Dalles.  The article reported that Dalles used air side economization but I’ve not seen the large intakes or louvers I would expect from a facility of that scale.

 

Cary Roberts, ex-TellMe Networks and all around smart guy, produced a picture of Google Dalles that clearly shows air side economization (Thanks Cary).

 


 

Wednesday, April 29, 2009 1:46:27 PM (Pacific Standard Time, UTC-08:00)
Services
 Monday, April 20, 2009

I’m always interested in research on cloud service efficiency, and last week, at the Uptime Institute IT Symposium in New York City, management consultancy McKinsey published a report entitled Clearing the air on Cloud Computing. McKinsey is a well respected professional services company that describes itself as “a management consulting firm advising leading companies on organization, technology, and operations”.  Over the first 22 years of my career in server-side computing at Microsoft and IBM, I’ve met McKinsey consultants frequently, although they were typically working on management issues and organizational design rather than technology. This particular report focuses more on technology, where the authors investigate the economics of very high scale data centers and cloud computing. This has been my prime area of interest for the last 5 years, and my first observation is the authors are taking on an incredibly tough challenge.

 

Gaining a complete inventory of the costs of internal IT is very difficult. The costs hide everywhere.  Some are in central IT teams, some are in central procurement groups, some with the legal and contract teams, and some in departmental teams doing IT work although not part of corporate IT. It’s incredibly difficult to get a full, accurate, and unassailable inventory into the costs of internal IT. Further complicating the equation, internal IT is often also responsible for mission-critical tasks that have nothing to do with comparing internal IT with cloud services offerings. Internal IT is often responsible for internal telco and for writing many of the applications that actually run the business.  Basically, it’s very hard to first find all the comparable internal IT costs and, even with a complete inventory of IT costs, it’s then even harder to separate out mission-critical tasks that internal IT teams own that have nothing to do with whether the applications are cloud or internally hosted. I’m arguing that this report’s intent, of comparing costs in a generally applicable way, across all industries, is probably not possible to do accurately and may not be a good idea.

 

In the report, the authors conclude that current cloud computing offerings “are not cost-effective compared to large enterprise data centers.”  They argue that cloud offerings are most attractive for small and medium sized enterprises. The former is a pretty strong statement, and contradicts most of what I’ve learned about high scale service, so it’s definitely worth digging deeper.

 

It’s not clear that a credible detailed accounting of all comparable IT costs that generalizes across all industries can be produced. Each company is different and these costs are both incredibly hard to find and entangled with many other mission-critical tasks the internal IT team owns that have nothing to do with whether they are internally hosted or utilizing the cloud. From all the work I’ve done around high scale services, it’s inarguably true that some internal IT tasks are very leveraged. These tasks form the core competency of the business and are usually at least developed internally if not hosted internally.  In what follows, I’ll argue that non-differentiated services -- services that need to be good but aren’t the company’s competitive advantage -- are much more economically hosted in very high-scale cloud computing environments. The hosting decision should be driven by company strategy and a decision to concentrate investment capital where it has the most impact. The savings available using a shared cloud for non-differentiated services are dramatic, and are available for all companies, from the smallest startup to the largest enterprise. I’ll look at some of these advantages below.

 

In this report the authors conclude that cloud computing offerings make sense for small and medium enterprises but “are not cost-effective compared to large enterprise data centers.” The authors argue there are economies of scale that make sense for small and medium sized businesses, but the cost advantages break down at the very large. Essentially they are arguing that big companies already have all the economies of scale available to internet-scale services. On the face of it, this appears unlikely. And, upon further digging, we’ll see it’s simply incorrect across many dimensions.

 

Let’s think about economies of scale.  Large power plants produce lower cost power than small regional plants.  Very large retail store chains spend huge amounts on optimizing all aspects of their businesses from supply chain optimization through customer understanding and, as a consequence, can offer lower prices. There are exceptions to be sure but, generally, we see a pretty sharp trend towards economies of scale across a wide range of businesses.  There will always be big, dumb, poorly run players and there will always be nimble but small innovators.  The one constant is that those who understand how to grow large, get the economies of scale, and yet still stay nimble often deliver very high quality products at much lower cost to the customer.

 

Perhaps the economies of scale don’t apply to the services world?  Looking at services such as payroll and internal security, we see that almost no companies choose to do their own internally.  These services clearly need to be done well, but they are not differentiated.  It’s hard to be so good at payroll that it yields a competitive advantage, unless your company is actually specializing in payroll. Internal operations such as payroll and security are often sublet to very large services companies that focus on them. ADP, for example, has been successful at providing a very high scale service that makes sense for even the biggest companies. I actually think it’s a good thing that the companies I’ve worked for over the last twenty years didn’t do their own payroll and instead focused their investment capital on technology opportunities that grow the business and help customers. It’s the right answer.

 

We find another example in enterprise software.  When I started my career, nearly all large companies developed their own internal IT applications. At the time, most industry experts speculated that none of the big companies would ever move to packaged ERP systems. But, the economies of scale of the large ERP development shops are substantial and, today, very few companies develop their own ERP or CRM systems.  The big companies like SAP can afford to invest in the software base at rates even the largest enterprise couldn’t afford. Fifteen years ago SAP had 4,200 engineers working on their ERP system. Even the largest enterprise could never economically justify spending a fraction of that.  Large central investments at scale typically make better economic sense unless the system in question is one of a company’s core strategic assets.

 

I’ve argued that smart, big players willing to invest deeply in innovating at scale can produce huge cost advantages and we’ve gone through examples from power generation, through retail sales, payroll, security, and even internal IT software. The authors of the McKinsey study are essentially arguing that, although all major companies have chosen to enjoy the large economies of scale offered by packaged software products over internal development, this same trend won’t extend to cloud hosted solutions. Let’s look closely at the economics to see if this conclusion is credible.

 

In the enterprise, most studies report that the cost of people dominates the cost of servers and data center infrastructure.  In the cloud services world, we see a very different trend.  Here we find that the costs of servers dominate, followed by mechanical systems, and then power distribution (see the Cost of Power in Large Data Centers). As an example, looking at all aspects of operational costs in a mid-sized service I led years ago, the human administrative costs were under 10% of the overall operational costs.  I’ve seen very large, extremely well run services where the people costs have been driven below 4%. Given that people costs dominate many enterprise deployments, how do high-scale cloud services get these costs so low? There are many factors contributing but the most important two are 1) cloud services run at very high scale and can afford to invest more in automation, amortizing that investment across a much larger server population, and 2) services teams can specialize, focusing on doing one thing and doing it very well. This kind of specialization yields efficiency gains, but it is only affordable at multi-tenant scale. The core argument here is that the number 1 cost in the enterprise is people whereas, in high scale services, these costs have been amortized down to sub-10%. Arguing there are no economies at cloud scale is the complete opposite of my experience and observations.

 

Page 25 of the study shows a “disguised client example” where the example company had 1,704 people working in IT before the move to cloud services and still required 1,448 after the move. I’m very skeptical that any company with 1,704 people working in IT – clearly a large company – would move to cloud computing in one, single discrete step.  It’s close to impossible and would be foolhardy.  Consequently, I suspect the data either represents a partial move to the cloud or is only a paper exercise. If the former, the data is incomplete and, if the latter, the data is speculative.  The story is clouded further by including in the headcount inventory desktop support, real estate, telecommunications and many other responsibilities that wouldn’t be impacted by the move to cloud services. Adding extraneous costs in large numbers dilutes the savings realized by this disguised customer. Overall, this slide doesn’t appear informative.

 

We’ve shown that at very high scale the dominant costs are server hardware and data center infrastructure. Very high scale services hire server designers and have an entire team focused on the acquisition of some of the most efficient server designs in the world.  Google goes so far as to design custom servers (see Jeff Dean on Google Infrastructure) something very hard to economically do at less than internet-scale.  I’ve personally done joint design work with Rackable Systems in producing servers optimized for cloud services workloads (Microslice Servers). When servers are the dominant cost and you are running at 10^5 to 10^6 servers scale, considerable effort can and should be spent on obtaining the most cost effective servers possible for the workload. This is hard to do economically at lower scale.

 

We’ve shown that people costs are largely automated out of very high scale services and that the server hardware is either custom, jointly developed, or specifically targeted to the workload.  What about data center infrastructure?  The Uptime Institute reports that the average data center Power Usage Effectiveness is 2.0 (smaller is better). What this number means is that for every 1W of power that goes to a server in an enterprise data center, a matching watt is lost to power distribution and cooling overhead. Microsoft reports that its newer designs are achieving a PUE of 1.22 (Out of the box paradox…). All high scale services are well under 1.7 and most, including Amazon, are under 1.5. High scale services can invest much more in infrastructure innovations by spreading this large investment out over a large number of data centers. As a consequence, these internet-scale services are a factor of 2 more efficient than the average enterprise. This is good for the environment and, with power being such a substantial part of the cost of high-scale computing, it substantially reduces costs as well.
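Power Usage Effectiveness is total facility power divided by the power delivered to the IT equipment, so the overhead per server watt is simply PUE minus one. A short sketch of that arithmetic follows; the function names and the 1,000 kW example load are illustrative.

```python
def overhead_watts_per_server_watt(pue):
    """Watts lost to power distribution and cooling per watt delivered to servers."""
    return pue - 1.0


def facility_power_kw(it_load_kw, pue):
    """Total facility power needed to deliver a given IT load."""
    return it_load_kw * pue
```

At a PUE of 2.0 the overhead is 1.0 W per server watt, and at 1.22 it is 0.22 W. For an illustrative 1,000 kW of server load, the facility draw falls from 2,000 kW to 1,220 kW.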

 

Utilization is the factor that many in the industry hate talking about because the industry-wide story is so poor.  The McKinsey report says that enterprise server utilization is actually down around 10%, which is approximately consistent with what I’ve seen working with enterprise customers over the years. The implication is the servers and the facilities that house them are only 10% used.  This sounds like the beginning of an incredibly strong argument for cloud services but the authors take a different path and argue it would be easy to increase enterprise utilization far higher than 10%. With an aggressive application of virtualization and related technologies, they feel utilizations as high as 35% are possible.  That conclusion is possibly correct, but it’s worth spending a minute on this point. At 35% efficiency, a full 2/3 is still wasted, which seems unfortunate, unnecessary, and hard on the environment.  Improving from 10% to 35% will require time, new software, new training, etc. but it may be possible.  What’s missing in this observation is that 1) cloud services can invest more in these efficiency innovations and they are already substantially down that path, 2) large user populations allow a greater investment in infrastructure efficiency at a higher rate, and 3) not all workloads have correlated peaks, so larger, heterogeneous populations offer substantially larger optimization possibilities than most enterprises can achieve alone (see: resource consumption shaping).
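The utilization numbers above translate directly into server counts. The sketch below assumes, unrealistically, that work can be consolidated perfectly onto fewer machines; the function names and the 1,000-server fleet are illustrative.

```python
import math


def idle_fraction(utilization):
    """Fraction of server capacity that goes unused."""
    return 1.0 - utilization


def servers_needed(current_servers, current_utilization, target_utilization):
    """Servers required to carry the same work at a higher utilization.

    Assumes the workload consolidates perfectly, which real workloads
    with correlated peaks will not.
    """
    work = current_servers * current_utilization
    return math.ceil(work / target_utilization)
```

Under that assumption, a 1,000-server fleet at 10% utilization would need `servers_needed(1000, 0.10, 0.35)`, or 286 servers, at 35%, and `idle_fraction(0.35)` shows 65% of capacity still sitting idle.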

 

In the discussion above, we focused on the costs “below” the software (data center infrastructure and servers) and found a substantial and sustainable competitive advantage in high scale deployments.  Looking at people costs, we see the same advantage again.  On the software side, the cost picture ranges from lower in the cloud to equal, but it isn’t higher. There doesn’t seem to be a dimension that supports the claim of this report. I just can’t find the data to support the claim that enterprises shouldn’t consider cloud service deployments. Looking at the slides in the McKinsey presentation that make the cost argument in detail, the graphs on slides 22, 23, and 24 just don’t make sense to me. I’ve spent considerable time on the data but just can’t get it to line up with the AWS price sheet or any other measure of reality.  The limitation might be mine but it seems others are having trouble matching this data to reality as well.

 

My conclusion: any company not fully understanding cloud computing economics and not having cloud computing as a tool to deploy where it makes sense is giving up a very valuable competitive edge. No matter how large the IT group, if I led the team, I would be experimenting with cloud computing and deploying where it makes sense.  I would want my team to know it well and to be deploying to the cloud when the work done is not differentiated or when the capital is better leveraged elsewhere.

 

IT is complex and a single glib answer is almost always wrong.  My recommendation is to start testing and learning about cloud services, to take a closer look at your current IT costs, and to compare the advantages of using a cloud service offering with both internal hosting and mixed hosting models.

 

                                                                --jrh

 

James Hamilton, Amazon Web Services

1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
james@amazon.com  

H:mvdirona.com | W:mvdirona.com/jrh/work  | blog:http://perspectives.mvdirona.com

 

Monday, April 20, 2009 4:53:21 PM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, April 14, 2009

My notes from an older talk given by Ryan Barrett on the Google App Engine Data Store at Google IO last year (5/28/2008). Ryan is a co-founder of the App Engine team.

 

·         App Engine Data Store is built on BigTable.

o   Scalable structured storage

o   Not a sharded database

o   Not an RDBMS (MySQL, Oracle, etc.)

o   Not a Distributed Hash Table (DHT)

o   It IS a sharded sorted array

·         Supported operations:

o   Read

o   Write

o   Delete

o   Single row transactions (optimistic concurrency control).

o   Scans:

1.       Prefix scan

2.       Range scan

·          Primary object: Entity

o   Stored in entity table

o   Each row has a name and the row name is fully qualified: /root/parent/entity/child

o   Each entity has a parent or is a root entity and may have child entities

o   Primary key is the fully qualified name and this can’t change

o   An entity can’t be reparented (it can be deleted and created with a different parent)

·         Queries:

o   Queries can be filtered on kind and Ryan says kind “is like a table” (kind can be parent, child, grandparent, …)

o   Queries can be filtered on ancestor

o   Query language is GQL (presumably Google Query Language) which is a small subset of SQL

o   All queries must be expressible as range or prefix scans (no sort, orderby, or other unbounded size operations supported)

·         Secondary index implementation:

o   Indexes are also implemented as BigTable tables

o   Kind Index:

·         Contents: (kind, key)

o   Single property index:

·         Contents: (kind, name, value)

·         Two copies of this index maintained: 1) ascending, and 2) descending

o   Composite indexes:

·         Contents: (kind, value, value)

·         Supports multi-property indexes

·         Built on programmer request but not on use (query returns an error if the required index doesn’t exist)

·         Programmer can specify what composite indexes are needed in index.yaml

·         SDK creates composite index specs automatically in index.yaml as queries are run

·         Entity group

o   Supports multi-entity update

·         Defined by root entity (all entities under a root are an entity group)

·         All journaling and transactions done at root

·         Text and Blobs:

o   Not indexed. All other properties are
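The notes above describe the store as a sorted array where every query is a prefix or range scan, with indexes kept in tables of the same kind. A minimal single-node sketch of that idea follows. Everything in it is an assumption made for illustration: the `SortedTable` class, the tuple encoding of keys, the sample entities, and the choice to append the entity key to each index row. It is not how BigTable or App Engine actually encodes or shards data, and it keeps only the ascending copy of the index.

```python
import bisect


class SortedTable:
    """Toy in-memory stand-in for one sorted table; keys are tuples."""

    def __init__(self):
        self._keys = []   # kept sorted at all times
        self._rows = {}   # key -> value

    def write(self, key, value=None):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def range_scan(self, start, end):
        """Keys with start <= key < end, in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return self._keys[lo:hi]

    def prefix_scan(self, prefix):
        """Keys whose leading components equal prefix, in sorted order."""
        lo = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[lo:]:
            if key[:len(prefix)] != prefix:
                break  # matching keys are contiguous, so we can stop here
            out.append(key)
        return out


# Entity table: the key is the fully qualified name, root first.
entities = SortedTable()
for path in [("alice",), ("alice", "photo1"), ("alice", "photo1", "comment1"),
             ("alice", "photo2"), ("bob",), ("bob", "photo1")]:
    entities.write(path, {"note": "hypothetical entity"})

# Ancestor filter: a prefix scan returns the root and everything under it.
alice_group = entities.prefix_scan(("alice",))

# Single property index rows: (kind, property name, value, entity key).
age_index = SortedTable()
for row in [("Person", "age", 31, ("alice",)), ("Person", "age", 25, ("bob",)),
            ("Person", "age", 31, ("carol",)), ("Pet", "age", 4, ("rex",))]:
    age_index.write(row)

# Equality filter is a prefix scan; inequality filter is a range scan.
aged_31 = [row[3] for row in age_index.prefix_scan(("Person", "age", 31))]
under_30 = [row[3] for row in
            age_index.range_scan(("Person", "age", 0), ("Person", "age", 30))]

print(alice_group)
print(aged_31, under_30)
```

Because both scans start from a binary search and then read a contiguous run of keys, the work is proportional to the number of rows returned rather than to the size of the table, which is one way to read the restriction that all queries must be expressible as range or prefix scans.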

 


 

Tuesday, April 14, 2009 5:28:35 AM (Pacific Standard Time, UTC-08:00)
Services
 Thursday, April 09, 2009

Last week I attended the Data Center Efficiency Summit hosted by Google. You’ll find four postings on various aspects of the summit at: http://perspectives.mvdirona.com/2009/04/05/DataCenterEfficiencySummitPosting4.aspx.

 

Two of the most interesting videos:

·         Modular Data Center Tour: http://www.youtube.com/watch?v=zRwPSFpLX8I&feature=channel

·         Data Center Water Treatment Plant: http://www.youtube.com/watch?v=nPjZvFuUKN8&feature=channel

 

A Cnet article with links to all the videos: http://news.cnet.com/8301-1001_3-10215392-92.html?tag=newsEditorsPicksArea.0.

 

The presentation I did on Data Center Efficiency Best Practices is up at: http://www.youtube.com/watch?v=m03vdyCuWS0

 

                                                                --jrh

 


 

Thursday, April 09, 2009 7:18:35 AM (Pacific Standard Time, UTC-08:00)
Services

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

All Content © 2009, James Hamilton