High Performance Computing (HPC) is defined by Wikipedia as:
High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. Today, computer systems approaching the teraflops-region are counted as HPC-computers. The term is most commonly associated with computing used for scientific research or computational science. A related term, high-performance technical computing (HPTC), generally refers to the engineering applications of cluster-based computing (such as computational fluid dynamics and the building and testing of virtual prototypes). Recently, HPC has come to be applied to business uses of cluster-based supercomputers, such as data warehouses, line-of-business (LOB) applications, and transaction processing.
Predictably, I use the broadest definition of HPC, including data intensive computing and all forms of computational science. It still includes the old stalwart applications of weather modeling and weapons research, but the broader definition takes HPC from a niche market to being a big part of the future of server-side computing. Multi-thousand node clusters, operating at teraflop rates and running simulations over massive data sets, are how petroleum exploration is done, it’s how advanced financial instruments are (partly) understood, it’s how brick and mortar retailers do shelf space layout and optimize their logistics chains, it’s how automobile manufacturers design safer cars through crash simulation, it’s how semiconductor designs are simulated, it’s how aircraft engines are engineered to be more fuel efficient, and it’s how credit card companies measure fraud risk. Today, at the core of any well run business is a massive data store; they all have that. The measure of a truly advanced company is the depth of analysis, simulation, and modeling run against this data store. HPC workloads are incredibly important today and the market segment is growing very quickly, driven by the plunging cost of computing and the business value of understanding large data sets deeply.

High Performance Computing is one of those important workloads that many argue can’t move to the cloud. Interestingly, HPC has had a long history of supposedly not being able to make a transition and then, subsequently, making that transition faster than even the most optimistic would have guessed possible. In the early days of HPC, most of the workloads were run on supercomputers. These are purpose built, scale-up servers made famous by Control Data Corporation and later by Cray Research with the Cray 1 broadly covered in the popular press. At that time, many argued that slow processors and poor performing interconnects would prevent computational clusters from ever being relevant for these workloads. Today more than ¾ of the fastest HPC systems in the world are based upon commodity compute clusters.
The HPC community uses the Top-500 list as the tracking mechanism for the fastest systems in the world. The goal of the Top-500 is to provide a scale and performance metric for a given HPC system. Like all benchmarks, it is a good thing in that it removes some of the manufacturer hype but benchmarks always fail to fully characterize all workloads. They abstract performance to a single or small set of metrics which is useful but this summary data can’t faithfully represent all possible workloads. Nonetheless, in many communities including HPC and Relational Database Management Systems, benchmarks have become quite important. The HPC world uses the Top-500 list which depends upon LINPACK as the benchmark.
Looking at the most recent Top-500 list published in June 2010, we see that Intel processors now dominate the list with 81.6% of the entries. It is very clear that the HPC move to commodity clusters has happened. The move that “couldn’t happen” is near complete and the vast majority of very high scale HPC systems are now based upon commodity processors.
What about HPC in the cloud, the next “it can’t happen” for HPC? In many respects, HPC workloads are a natural for the cloud in that they are incredibly high scale and consume vast machine resources. Some HPC workloads are incredibly spiky with mammoth clusters being needed for only short periods of time. For example semiconductor design simulation workloads are incredibly computationally intensive and need to be run at high-scale but only during some phases of the design cycle. Having more resources to throw at the problem can get a design completed more quickly and possibly allow just one more verification run to potentially save millions by avoiding a design flaw. Using cloud resources, this massive fleet of servers can change size over the course of the project or be freed up when they are no longer productively needed. Cloud computing is ideal for these workloads.
Other HPC uses tend to be more steady state and yet these workloads still gain real economic advantage from the economies of extreme scale available in the cloud. See Cloud Computing Economies of Scale (talk, video) for more detail.
When I dig deeper into “steady state HPC workloads”, I often learn they are steady state as an existing constraint rather than by the fundamental nature of the work. Is there ever value in running one more simulation or one more modeling run a day? If someone on the team got a good idea or had a new approach to the problem, would it be worth being able to test that theory on real data without interrupting the production runs? More resources, if not accompanied by additional capital expense or long term utilization commitment, are often valuable even for what we typically call steady state workloads. For example, I’m guessing BP, as it battles the Gulf of Mexico oil spill, is running more oil well simulations and tidal flow analysis jobs than originally called for in their 2010 server capacity plan.
No workload is flat and unchanging. It’s just a product of a highly constrained model that can’t adapt quickly to changing workload quantities. It’s a model from the past.
There is no question there is value in being able to run HPC workloads in the cloud. What makes many folks view HPC as non-cloud hostable is that these workloads need high performance, direct access to the underlying server hardware without the overhead of the virtualization common in most cloud computing offerings, and many of these applications need very high bandwidth, low latency networking. A big step towards this goal was made earlier today when Amazon Web Services announced the EC2 Cluster Compute Instance type.
The cc1.4xlarge instance specification:
· 23GB of 1333MHz DDR3 Registered ECC
· 64GB/s main memory bandwidth
· 2 x Intel Xeon X5570 (quad-core Nehalem)
· 2 x 845GB 7200RPM HDDs
· 10Gbps Ethernet Network Interface
It’s this last point that I’m particularly excited about. The difference between just a bunch of servers in the cloud and a high performance cluster is the network. Bringing 10GigE direct to the host isn’t that common in the cloud but it’s not particularly remarkable. What is more noteworthy is it is a full bisection bandwidth network within the cluster. It is common industry practice to statistically multiplex network traffic over an expensive network core with far less than full bisection bandwidth. Essentially, a gamble is made that not all servers in the cluster will transmit at full interface speed at the same time. For many workloads this actually is a good bet and one that can be safely made. For HPC workloads and other data intensive applications like Hadoop, it’s a poor assumption and leads to vast wasted compute resources waiting on a poor performing network.
Why provide less than full bisection bandwidth? Basically, it’s a cost problem. Because networking gear is still building on a mainframe design point, it’s incredibly expensive. As a consequence, these precious resources need to be very carefully managed and over-subscription levels of 60 to 1 or even over 100 to 1 are common. See Datacenter Networks are in my Way for more on this theme.
For me, the most interesting aspect of the newly announced Cluster Compute instance type is not the instance at all. It’s the network. These servers are on a full bisection bandwidth cluster network. All hosts in a cluster can communicate with other nodes in the cluster at the full capacity of the 10Gbps fabric at the same time without blocking. Clearly not all can communicate with a single member of the fleet at the same time but the network can support all members of the cluster communicating at full bandwidth in unison. It’s a sweet network and it’s the network that makes this a truly interesting HPC solution.
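To make the oversubscription gamble concrete, here is a minimal sketch of the arithmetic. It uses a deliberately simplified model in which the whole core is oversubscribed by a single ratio; the function names are mine, and pairing a 60:1 ratio with 10Gbps hosts is purely illustrative rather than a description of any particular network.

```python
def worst_case_per_host_gbps(nic_gbps: float, oversubscription: float) -> float:
    """Bandwidth each host gets if every host sends across the core at once."""
    return nic_gbps / oversubscription


def bisection_bandwidth_gbps(hosts: int, nic_gbps: float,
                             oversubscription: float = 1.0) -> float:
    """Aggregate bandwidth between two equal halves of the cluster.

    With oversubscription == 1.0 this is full bisection bandwidth:
    half the hosts can send to the other half at full NIC speed.
    """
    return (hosts / 2) * nic_gbps / oversubscription


if __name__ == "__main__":
    # 880 hosts with 10Gbps NICs, the sub-cluster size used later in the post.
    print(bisection_bandwidth_gbps(880, 10))        # full bisection
    print(bisection_bandwidth_gbps(880, 10, 60))    # hypothetical 60:1 core
    print(worst_case_per_host_gbps(10, 60))         # per-host share at 60:1
```

Under this model a 60:1 core leaves each host with well under 1Gbps when everyone transmits at once, which is exactly the situation HPC and Hadoop-style workloads tend to create.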
Each Cluster Compute Instance is $1.60 per instance hour. It’s now possible to access millions of dollars of servers connected by a high performance, full bisection bandwidth network inexpensively. An hour with a 1,000 node high performance cluster for $1,600. Amazing.
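The pricing arithmetic is simple enough to sketch. Only the $1.60 per instance hour figure comes from the announcement; the function name and the node counts passed to it are just illustrative inputs.

```python
PRICE_PER_INSTANCE_HOUR = 1.60  # USD per instance hour, from the post


def cluster_cost(nodes: int, hours: float,
                 price: float = PRICE_PER_INSTANCE_HOUR) -> float:
    """On-demand cost in USD of running `nodes` instances for `hours` hours."""
    return nodes * hours * price


if __name__ == "__main__":
    print(f"1,000 nodes for 1 hour: ${cluster_cost(1000, 1):,.2f}")
    print(f"880 nodes for 1 hour:   ${cluster_cost(880, 1):,.2f}")
```

Because there is no capital outlay, the same function answers the "one more verification run" question: the marginal cost of an extra run is just nodes times hours times price.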
As a test of the instance type and network prior to going into beta Matt Klein, one of the HPC team engineers, cranked up LINPACK using an 880 server sub-cluster. It’s a good test in that it stresses the network and yields a comparative performance metric. I’m not sure what Matt expected when he started the run but the result he got just about knocked me off my chair when he sent it to me last Sunday. Matt’s experiment yielded a booming 41.82 TFlop Top-500 run.
For those of you as excited as I am and interested in the details from the Top-500 LINPACK run:
This is phenomenal performance for a pay-as-you-go EC2 instance. But what makes it much more impressive is that result would place the EC2 Cluster Compute instance at #146 on the Top-500. It also appears to scale well which is to say bigger numbers look feasible if more nodes were allocated to LINPACK testing. As fun as that would be, it is time to turn all these servers over to customers so we won’t get another run but it was fun.
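As a rough sanity check on the LINPACK number, the sketch below compares the measured result with a theoretical peak. The node count, core count, and 41.82 TFlop result are from the post; the 2.93GHz clock and 4 double-precision flops per cycle per core are my assumptions about the X5570 and are not stated above.

```python
NODES = 880            # from the post
RMAX_TFLOPS = 41.82    # measured LINPACK result, from the post
CORES_PER_NODE = 8     # 2 x quad-core X5570, from the instance spec
CLOCK_GHZ = 2.93       # assumption: X5570 base clock
FLOPS_PER_CYCLE = 4    # assumption: double-precision flops/cycle/core


def peak_tflops(nodes: int, cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFlops: nodes x cores x GHz x flops per cycle."""
    return nodes * cores * ghz * flops_per_cycle / 1000.0


rpeak = peak_tflops(NODES, CORES_PER_NODE, CLOCK_GHZ, FLOPS_PER_CYCLE)
efficiency = RMAX_TFLOPS / rpeak               # fraction of peak achieved
per_node_gflops = RMAX_TFLOPS * 1000 / NODES   # measured GFlops per node

if __name__ == "__main__":
    print(f"theoretical peak: {rpeak:.2f} TFlops")
    print(f"efficiency:       {efficiency:.1%}")
    print(f"per node:         {per_node_gflops:.1f} GFlops")
```

Under those assumptions the run lands at roughly half of theoretical peak.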
You can now have one of the biggest supercomputers in the world for your own private use for $1.60 per instance per hour. I love what’s possible these days.
Welcome to the cloud, HPC!
--jrh
James Hamilton
e: jrh@mvdirona.com
w: http://www.mvdirona.com
b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
I didn’t attend the Hadoop Summit this year or last but was at the inaugural event back in 2008 and it was excellent. This year, the Hadoop Summit 2010 was held June 29, again in Santa Clara. The agenda for the 2010 event is at: Hadoop Summit 2010 Agenda. Since I wasn’t able to be there, Adam Gray of the Amazon AWS team was kind enough to pass on his notes and let me use them here:
Key Takeaways
· Yahoo and Facebook operate the world’s largest Hadoop clusters, 4,000/2,300 nodes with 70/40 petabytes respectively. They run full cluster replicas to assure availability and data durability.
· Yahoo released Hadoop security features with Kerberos integration, which is most useful for long running multitenant Hadoop clusters.
· Cloudera released a paid enterprise version of Hadoop with cluster management tools and several DB connectors and announced support for Hadoop security.
· Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.
· Many Hadoop users use the service in conjunction with NoSQL DBs like HBase or Cassandra.
Keynotes
Yahoo had the opening keynote with talks by Blake Irving, Chief Products Officer, Shelton Shugar, SVP of Cloud Computing, and Eric Baldeschwieler, VP of Hadoop. They talked about Yahoo’s scale, including 38k Hadoop servers, 70 PB of storage, and more than 1 million monthly jobs, with half of those jobs written in Apache Pig. Further, their agility is improving significantly despite this massive scale: within 7 minutes of a homepage click they have a completely reconfigured preference model for that user and an updated homepage. This would not be possible without Hadoop. Yahoo believes that Hadoop is ready for enterprise use at massive scale and that their use case proves it. Further, a recent study found that 50% of enterprise companies are strongly considering Hadoop, with the most commonly cited reason being agility. Initiatives over the last year include: further investment and improvement in Hadoop 0.20, integration of Hadoop with Kerberos, and the Oozie workflow engine.
Next, Peter Sirota gave a keynote for Amazon Elastic MapReduce that focused on how the service makes combining the massive scalability of MapReduce with the web-scale infrastructure of AWS more accessible, particularly to enterprise customers. He also announced several new features including expanding and shrinking the cluster size of running job flows, support for spot instances, and premium support for Elastic MapReduce. Further, he discussed Elastic MapReduce’s involvement in the ecosystem including integration with Karmasphere and Datameer. Finally, Scott Capiello, Senior Director of Products at Microstrategy, came on stage to discuss their integration with Elastic MapReduce.
Cloudera followed with talks by Doug Cutting, the creator of Hadoop, and Charles Zedlewski, Senior Director of Product Management. They announced Cloudera Enterprise, a version of their software that includes production support and additional management tools. These tools include improved data integration and authorization management that leverages Hadoop’s security updates. They also demoed a web UI for using these management tools.
The final keynote was given by Mike Schroepfer, VP of Engineering at Facebook. He talked about Facebook’s scale with 36 PB of uncompressed storage, 2,250 machines with 23k processors, and 80-90 TB growth per day. Their biggest challenge is in getting all that data into Hadoop clusters. Once the data is there, 95% of their jobs are Hive-based. In order to ensure reliability they replicate critical clusters in their entirety. As far as traffic, the average user spends more time on Facebook than the next 6 web pages combined. In order to improve user experience Facebook is continually improving the response time of their Hadoop jobs. Currently updates can occur within 5 minutes; however, they see this eventually moving below 1 minute. As this is often an acceptable wait time for changes to occur on a webpage, this will open up a whole new class of applications.
Discussion Tracks
After lunch the conference broke into three distinct discussion tracks: Developers, Applications, and Research. These tracks had several interesting talks including one by Jay Kreps, Principal Engineer at LinkedIn, who discussed LinkedIn’s data applications and infrastructure. He believes that their business data is ideally suited for Hadoop due to its massive scale but relatively static nature. This supports large amounts of computation being done offline. Further, he talked about their use of machine learning to predict relationships between users. This requires scoring 120 billion relationships each day using 82 Hadoop jobs. Lastly, he talked about LinkedIn’s in-house developed workflow management tool, Azkaban, an alternative to Oozie.
Eric Sammer, Solutions Architect at Cloudera, discussed some best practices for dealing with massive data in Hadoop. Particularly, he discussed the value of using workflows for complex jobs, incremental merges to reduce data transfer, and the use of Sqoop (SQL to Hadoop) for bulk relational database imports and exports. Yahoo’s Amit Phadke discussed using Hadoop to optimize online content. His recommendations included leveraging Pig to abstract out the details of MapReduce for complex jobs and taking advantage of the parallelism of HBase for storage. There was also significant interest in the challenges of using Hadoop for graph algorithms including a talk that was so full that they were unable to let additional people in.
Elastic MapReduce Customer Panel
The final session was a customer panel of current Amazon Elastic MapReduce users chaired by Deepak Singh. Participants included: Netflix, Razorfish, Coldlight Solutions, and Spiral Genetics. Highlights include:
· Razorfish discussed a case study in which a combination of Elastic MapReduce and Cascading allowed them to take a customer to market in half the time with a 500% return on ad spend. They discussed how using EMR has given them much better visibility into their costs, allowing them to pass this transparency on to customers.
· Netflix discussed their use of Elastic MapReduce to set up a Hive-based data warehousing infrastructure. They keep a persistent cluster with data backups in S3 to ensure durability. They also reduce the amount of data transfer through pre-aggregation and preprocessing of data.
· Spiral Genetics talked about how they had to leverage AWS to reduce capital expenditures. By using Amazon Elastic MapReduce they were able to set up a running job in 3 hours. They are also excited to see spot instance integration.
· Coldlight Solutions said that buying $0.5M in infrastructure wasn’t even an option when they started. Now it is, but they would rather focus on their strength, machine learning, and Amazon Elastic MapReduce allows them to do this.
I did a talk at Velocity 2010 last week. The slides are posted at Datacenter Infrastructure Innovation and the video is available at Velocity 2010 Keynote. Urs Hölzle, Google Senior VP of Infrastructure, also did a Velocity keynote. It was an excellent talk and is posted at Urs Holzle at Velocity 2010. Jonathan Heiliger, Facebook VP of Technical Operations, spoke at Velocity as well. A talk summary is up at: Managing Epic Growth in Real Time. Tim O’Reilly did a talk: O’Reilly Radar. Velocity really is a great conference.
Last week I posted two quick notes on Facebook: Facebook Software Use and 60,000 Servers at Facebook. Continuing on that theme, a few other Facebook Data points that I have been collecting of late:
From QCon 2010 in Beijing (April 2010): memcached@facebook:
· How big is Facebook:
o 400m active users
o 60m status updates per day
o 3b photo uploads per month
o 5b pieces of content shared each week
o 50b friend graph edges
§ 130 friends per user on average
o Each user clicks on 9 pieces of content each month
· Thousands of servers in two regions [jrh: 60,000]
· Memcached scale:
o 400m gets/second
o 28m sets/second
o 2T cached items
o Over 200 TB
o Networking scale:
§ Peak rx: 530m pkts/second (60GB/s)
§ Peak tx: 500m pkts/second (120GB/s)
· Each memcached server:
o Rx: 90k pkts/sec (9.7MB/s)
o Tx 94k pkts/sec (19 MB/s)
o 80k gets/second
o 2k sets/s
o 200m items
· Phatty Phatty Multiget
o PHP is single threaded and synchronous, so it needs to get multiple objects in a single request to be efficient and fast
· Cache segregation:
o Different objects have different lifetimes so separate out
· Incast problem:
o The use of multiget increased performance but led to the incast problem
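The multiget point is easier to see with a toy model. The sketch below uses a hypothetical in-memory stand-in for a remote cache (FakeCache, fetch_one_by_one, and fetch_batched are my names, not a real memcached client API) and simply counts network round trips for per-key gets versus one batched request.

```python
class FakeCache:
    """In-memory stand-in for a remote cache that counts round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        # One request/response per key.
        self.round_trips += 1
        return self._data.get(key)

    def get_multi(self, keys):
        # One request/response for the whole batch; missing keys are omitted.
        self.round_trips += 1
        return {k: self._data[k] for k in keys if k in self._data}


def fetch_one_by_one(cache, keys):
    """A synchronous, single-threaded caller pays one round trip per key."""
    return {k: cache.get(k) for k in keys}


def fetch_batched(cache, keys):
    """The same caller pays a single round trip with multiget."""
    return cache.get_multi(keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(100)]
    data = {k: i for i, k in enumerate(keys)}
    a, b = FakeCache(data), FakeCache(data)
    fetch_one_by_one(a, keys)
    fetch_batched(b, keys)
    print(a.round_trips, b.round_trips)
```

The flip side is the incast problem noted in the list: when one batched request fans out to many cache servers, all of the responses converge on the requesting host at nearly the same moment.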
The talk is full of good data and worth a read.
From Hadoopblog, Facebook has the world’s Largest Hadoop Cluster:
- 21 PB of storage in a single HDFS cluster
- 2000 machines
- 12 TB per machine (a few machines have 24 TB each)
- 1200 machines with 8 cores each + 800 machines with 16 cores each
- 32 GB of RAM per machine
- 15 map-reduce tasks per machine
The Yahoo Hadoop cluster is reported to be twice the node count of the Facebook cluster at 4,000 nodes: Scaling Hadoop to 4000 nodes at Yahoo!. But, it does have less disk:
· 4000 nodes
· 2 quad core Xeons @ 2.5GHz per node
· 4x1TB SATA disks per node
· 8GB RAM per node
· 1 gigabit ethernet on each node
· 40 nodes per rack
· 4 gigabit ethernet uplinks from each rack to the core
· Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
· Sun Java JDK 1.6.0_05-b13
· Over 30,000 cores with nearly 16PB of raw disk!
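The headline numbers can be re-derived from the per-node figures. This is only a back-of-the-envelope sketch: all inputs come from the two lists above, except that I am assuming "4 gigabit ethernet uplinks" means four 1Gbps links per rack and using decimal units (1 PB = 1,000 TB).

```python
def rack_oversubscription(nodes_per_rack: int, nic_gbps: float,
                          uplinks: int, uplink_gbps: float) -> float:
    """Ratio of host bandwidth in a rack to the rack's uplink bandwidth."""
    return (nodes_per_rack * nic_gbps) / (uplinks * uplink_gbps)


# Yahoo cluster: 4000 nodes, 2 quad-core CPUs and 4 x 1TB disks per node.
yahoo_cores = 4000 * 2 * 4
yahoo_raw_pb = 4000 * 4 * 1 / 1000
# Assumption: four 1Gbps uplinks per rack of 40 nodes with 1Gbps NICs.
yahoo_rack_ratio = rack_oversubscription(40, 1, 4, 1)

# Facebook cluster: 1200 machines with 8 cores plus 800 with 16 cores.
facebook_cores = 1200 * 8 + 800 * 16

if __name__ == "__main__":
    print(f"Yahoo cores: {yahoo_cores}, raw disk: {yahoo_raw_pb} PB")
    print(f"Yahoo rack oversubscription: {yahoo_rack_ratio}:1")
    print(f"Facebook cores: {facebook_cores}")
```

The Yahoo figures line up with "over 30,000 cores" and "nearly 16PB of raw disk". If the uplink assumption holds, each rack is oversubscribed at the first hop before any traffic reaches the core.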
http://www.bluedavy.com/iarch/facebook/facebook_architecture.pdf
http://www.qconbeijing.com/download/marc-facebook.pdf
This morning I came across Exploring the software behind Facebook, the World’s Largest Site. The article doesn’t introduce new data not previously reported but it’s a good summary of the software used by Facebook and the current scale of the social networking site:
· 570 billion page views monthly
· 3 billion photo uploads monthly
· 1.2 million photos served per second
· 30k servers
The latter metric, the 30k servers number, is pretty old (Facebook has 30,000 servers). I would expect the number to be closer to 50k now based only upon external usage growth.
The article was vague on memcached usage saying only “Terrabytes”. I’m pretty interested in memcached and Facebook is, by far, the largest user, so I periodically check their growth rate. They now have 28 terabytes of memcached data behind 800 servers. See Scaling memcached at Facebook for more detail.
The mammoth memcached fleet at Facebook has had me wondering for years: how close is the cache to the entire data store? If you factor out photos and other large objects, how big is the entire remaining user database? Today the design is memcached insulating the fleet of database servers. What is the aggregate memory size of the memcached and database fleet? Would it be cheaper to store the entire database 2-way redundant in memory, with changes logged to support recovery in the event of a two-server loss?
Facebook is very close to, if not already able to, store the entire data store minus large objects in memory, and within a factor of two of being able to store it in memory twice and have memcached be the primary copy, completely omitting the database tier. It would be a fun project.
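A quick sketch of the sizing question, using only the 28 TB and 800 server figures above. The size of the database minus large objects is not known, so it is left as a parameter; the function names are mine and the calculation uses decimal units.

```python
import math


def gb_per_server(cache_tb: float, servers: int) -> float:
    """Average cache memory per server in GB (1 TB = 1000 GB)."""
    return cache_tb * 1000 / servers


def servers_needed(data_tb: float, copies: int, gb_each: float) -> int:
    """Servers required to hold `copies` full in-memory copies of the data."""
    return math.ceil(data_tb * copies * 1000 / gb_each)


if __name__ == "__main__":
    density = gb_per_server(28, 800)
    print(f"current density: {density} GB per memcached server")
    # Hypothetical: if the non-blob database were the same size as the
    # cache, two redundant in-memory copies would need this many servers.
    print(servers_needed(28, 2, density))
```

At the current density, every additional full copy of a 28 TB data set costs another 800 servers, which is the number to weigh against the size of the database tier it might replace.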
One of the most important attributes needed in a cloud solution is what I call cloud data freedom. Having the ability to move data out of the cloud quickly, efficiently, cheaply, and without restriction is a mandatory prerequisite in my opinion to trusting a cloud. In fact, you need the ability to move the data both ways. Moving in cheaply, efficiently, and quickly is often required just to get the work done. And the ability to move out cheaply, efficiently, quickly, and without restriction is the only way to avoid lock-in. Data movement freedom is the most important attribute of an open cloud and a required prerequisite to avoiding provider lock in.
The issue came up in the comments on this post: Netflix on AWS where Jan Miczaika asked:
James, as long as Amazon is growing constantly and has a great culture of smart people it will work out fine. Should the going ever get tough (what I of course don't hope) these principles may be thrown overboard. It would not be the first time companies sacrifice long-term values for short-term profits.
This is all very hypothetical. Still, for strategic long-term planning, I believe it should be taken into account.
And I responded:
Jan, it is inarguably true that there have been instances of previously good companies making incredibly short sighted decisions. It has happened before and it could happen again.
The point I'm making is not that there exists any company that is incapable of damn dumb decisions. My point is that the cloud computing model is a huge win economically. I agree with you that no company is assured to be great, customer focused, and thinking clearly forever. Disasters can happen. That's why I would never do business with a cloud provider that didn't have great support for export of LARGE amounts of data cost effectively. It's super important that the data not be locked in. I don't care so much about the low level control plane programming model -- I can change how I call those APIs. But it's super important that the data can be moved to another service easily. And, this export service has to be cheap and there is no way I would use the network for very high scale data movements. You have to assume that the data is going to keep growing and so it's physical media export that you want. Recall Andrew Tanenbaum's "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway" (sneakernet).
I'm saying you need to use cloud computing but I'm not saying you should trust one company to be the right answer forever. Don't step in without a good quality export service based upon physical media at a known and reasonable price.
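The station wagon argument is easy to quantify. The sketch below compares the time to push data over a network link with the effective bandwidth of shipping the same data on physical media. The data size, link speed, and shipping time in the example are hypothetical inputs, and the calculation uses decimal units (1 TB = 10^12 bytes).

```python
SECONDS_PER_DAY = 86400


def network_transfer_days(data_tb: float, link_mbps: float,
                          utilization: float = 1.0) -> float:
    """Days needed to move `data_tb` terabytes over a link of `link_mbps`."""
    bits = data_tb * 1e12 * 8
    return bits / (link_mbps * 1e6 * utilization) / SECONDS_PER_DAY


def shipment_effective_mbps(data_tb: float, transit_days: float) -> float:
    """Effective bandwidth in Mbps of shipping `data_tb` in `transit_days`."""
    bits = data_tb * 1e12 * 8
    return bits / (transit_days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # Hypothetical: 100 TB over a dedicated 1Gbps link vs 2-day shipping.
    print(f"{network_transfer_days(100, 1000):.2f} days over the network")
    print(f"{shipment_effective_mbps(100, 2):.0f} Mbps by courier")
```

The effective bandwidth of a shipment grows linearly with the amount of data in the box, while the network link stays fixed, which is why physical export matters more as the data keeps growing.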
This morning AWS Import/Export announced that the service is now out of beta and it now supports a programmatic, web services interface. From the announce letter of earlier today:
AWS Import/Export accelerates moving large amounts of data into and out of AWS using portable storage devices for transport. The service is exiting beta and is now generally available. Also, a new web service interface augments the email-based interface that was available during the service's beta. Once a storage device is loaded with data for import or formatted for an export, the new web service interface makes it easy to initiate shipment to AWS in minutes, or to check import or export status in real-time.
You can use AWS Import/Export for:
· Data Migration - If you have data you need to upload into the AWS cloud for the first time, AWS Import/Export is often much faster than transferring that data via the Internet.
· Content Distribution - Send data you are computing or storing on AWS to your customers on portable storage devices.
· Direct Data Interchange - If you regularly receive content on portable storage devices from your business associates, you can have the data sent directly to AWS for import into your Amazon S3 buckets.
· Offsite Backup - Send full or incremental backups to Amazon S3 for reliable and redundant offsite storage.
· Disaster Recovery - In the event that you need to quickly retrieve a large backup stored in Amazon S3, use AWS Import/Export to transfer the data to a portable storage device and deliver it to your site.
To use AWS Import/Export, you just prepare a portable storage device, and submit a Create Job request with the open source AWS Import/Export command line application, a third party tool, or by programming directly against the web service interface. AWS Import/Export will return a unique identifier for the job, a digital signature for authenticating your device, and an AWS address to which you ship your storage device. After copying the digital signature to your device, ship it along with its interface connectors and power supply to AWS.
You can learn more about AWS Import/Export and get started using the web service at aws.amazon.com/importexport.
Also announced this morning: AWS Management Console for S3 and 3 days ago: Amazon Cloudfront adds HTTPS Support, Lower prices, and Opens an NYC Edge Location. Things are moving pretty quickly right now.
Last month I wrote about Solving World Problems with Economic Incentives. In that post I talked about the power of economic incentives when compared to regulatory body intervention. I’m not really against laws and regulations – the EPA, for example, has done some good work and much of what they do has improved the situation. But 9 times out of 10 good regulation is first blocked and/or watered down by lobby groups, what finally gets enacted is often not fully thought through and brings unintended consequences, it is often overly prescriptive (see Right Problem but Wrong Approach), and regulations are enacted at the speed of government (think continental drift – there is movement but it’s often hard to detect).
If an economic incentive can be carefully crafted such that it’s squarely targeting the desired outcome rather than how to get there, wonderful things can happen. This morning I came across a great example of applying economic incentive to drive a positive outcome rapidly.
First, some background on the base issue. I believe that web site latency has a fundamental impact on customer satisfaction and there is considerable evidence that it drives better economic returns. See The Cost of Latency for more detail on this issue. Essentially I’m arguing that there really is an economic argument to reduce web page latency and astute companies are doing it today. If I’m right that economic incentives are enough, why isn’t the latency problem behind us?
The problem is that economic incentives only drive desired outcomes when there is a general, widely held belief that there is direct correlation between the outcome and the improved economic condition. In the case of web page latency, I’ll claim the evidence from Google’s Steve Souders, Jake Brutlag, and Marissa Mayer, Bing’s Eric Schurman (now Amazon), Dave Artz from AOL, Phil Dixon from Shopzilla and many others is very compelling. But compelling isn’t enough. The reason I still write about it is that it’s not widely believed. Many argue that web page latency isn’t highly correlated with better economic outcomes.
Regardless of your take on this important topic, I urge you to read Steve Souders’ post Velocity and the Bottom Line. By the way, Velocity 2010 is coming up and you should consider making the trip. It’s a good conference, the 2009 event produced some wonderful data, and I expect 2010 to be at least as good. I plan to be down for a day to give a talk.
Returning to web latency. I really believe there is an economic incentive to improve web site latency. But, if this belief is not widely held, it has no impact. I think that is about to change. Google recently announced Google Now Counts Site Speed as a Ranking Factor. Silly amounts of money are spent on search engine optimization. Getting to the top of the ranking is worth big bucks. This economic value of improved ranking is widely held and drives considerable behavior and investment today. It’s a very powerful tool. In fact, it is so valuable that an entire industry has grown up around helping achieve better ranking. Ranking is a very powerful incentive.
What Google has done here is a tiny first step but it’s a very cool first step with lots of potential upside. If ranking is believed to be materially impacted by site performance, we are going to see the entire web speed up. This could be huge if Google keeps taking steps down this path. Steve Souders’ books High Performance Web Sites and Even Faster Web Sites will continue to have a bright future. Content Distribution Networks like Limelight, Akamai, and CloudFront will grow even faster. The very large cloud services providers like Amazon Web Services, with data centers all over the world, will continue to grow quickly. We are going to see accelerated datacenter building in Asia. Lots will change.
If Google continues to move down the path of making web site latency a key factor in site ranking, we are going to see a faster web. More! Faster!
Thanks to Todd Hoff of High Scalability for pointing me towards this one in Web Speed Can Push You Off Googles Search Rankings.
Industry trends come and go. The ones that stay with us and have lasting impact are those that fundamentally change the cost equation. Public clouds clearly pass this test. The potential savings approach 10x and, in cost sensitive industries, those that move to the cloud fastest will have a substantial cost advantage over those that don’t.
And, as much as I like saving money, the much more important game changer is speed of execution. Those companies depending upon public clouds will be noticeably more nimble. Project approval to delivery times fall dramatically when there is no capital expense to be approved. When the financial risk of new projects is small, riskier projects can be tried. The pace of innovation increases. Companies where innovation is tied to the financial approval cycle and the hardware ordering to install lag are at a fundamental disadvantage.
Clouds change companies for the better, clouds drive down costs, and clouds change the competitive landscape in industries. We have started what will be an exciting decade.
Earlier today I ran across a good article by Rodrigo Flores, CTO of newScale. In this article, Rodrigo says:
First, give up the fight: Enable the safe, controlled use of public clouds. There’s plenty of anecdotal and survey data indicating the use of public clouds by developers is large. A newScale informal poll in April found that about 40% of enterprises are using clouds – rogue, uncontrolled, under the covers, maybe. But they are using public clouds.
The move to the cloud is happening now. He also predicts:
IT operations groups are going to be increasingly evaluated against the service and customer satisfaction levels provided by public clouds. One day soon, the CFO may walk into the data center and ask, “What is the cost per hour for internal infrastructure, how do IT operations costs compare to public clouds, and which service levels do IT operations provide?” That day will happen this year.
This is a super important point. It was previously nearly impossible to know what it would cost to bring an application up and host it for its operational life. There was no credible alternative to hosting the application internally. Now, with care and some work, a comparison is possible and I expect that comparison to be made many times this year. This comparison won’t always be made accurately but the question will be asked and every company now has access to the data to be able to credibly make the comparison.
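To make the CFO's "cost per hour" question concrete, here is a rough sketch of the arithmetic. Every figure and field name below is a hypothetical placeholder rather than data from the article or from any provider's price list, and a real comparison would need to include networking, facilities, software licensing, and more.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class InternalServer:
    """Hypothetical cost inputs for one internally hosted server."""
    purchase_price: float            # capital cost, dollars
    amortization_years: float        # period the purchase is written off over
    annual_power_and_cooling: float  # dollars per year
    annual_ops_share: float          # share of operations staff cost, dollars per year
    utilization: float               # fraction of hours doing useful work (0 to 1)


def internal_cost_per_useful_hour(server: InternalServer) -> float:
    """Fully burdened annual cost divided by the hours actually used."""
    annual_cost = (
        server.purchase_price / server.amortization_years
        + server.annual_power_and_cooling
        + server.annual_ops_share
    )
    return annual_cost / (HOURS_PER_YEAR * server.utilization)


# Made-up example numbers, for illustration only.
server = InternalServer(
    purchase_price=3000.0,
    amortization_years=3.0,
    annual_power_and_cooling=400.0,
    annual_ops_share=600.0,
    utilization=0.25,
)
cloud_price_per_hour = 0.50  # hypothetical on-demand price, paid only while running

internal = internal_cost_per_useful_hour(server)
print(f"internal: ${internal:.2f} per useful hour")
print(f"cloud:    ${cloud_price_per_hour:.2f} per hour")
```

The point of dividing by utilization is that internally owned servers cost money whether or not they are busy, while a pay-for-what-you-use cloud price applies only to hours actually consumed.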
I particularly like his point that self service is much better than “good service”. Folks really don’t want to waste time calling service personnel no matter how well trained those folks are. Customers just want to get their jobs done with as little friction as possible. Fewer phone calls are good.
Think like an ATM: Embrace self-service immediately. Bank tellers may be lovely people, but most consumers prefer ATMs for standard transactions. The same applies to clouds. The ability by the customer to get his or her own resources without an onerous process is critical.
Self service is cheaper, faster, and less frustrating for all involved. I’ve seen considerable confusion on this point. Many people tell me that customers want to be called on by sales representatives and they want the human interaction from the customer service team. To me, it just sounds like living in the past. These are old, slow, and inefficient models.
Public clouds are the new world order. Read the full article at: The Competitive Threat of Public Clouds.
I did a talk at the USENIX Tech conference last year, Where does the Power Go in High Scale Data Centers. After the talk I got into a more detailed discussion with many folks from Netflix and Canada’s Research in Motion, the maker of the BlackBerry. The discussion ended up in a long lunch over a big table with folks from both teams. The common theme of the discussion was, predictably, given the companies and folks involved, innovation in high scale services and how to deal with incredible growth rates. Both RIM and Netflix are very successful and, until you have experienced and attempted to manage internet growth rates, you really just don’t know. I'm impressed with what they are doing. Growth brings super interesting problems and I learned from both and really enjoyed spending time with them.
I recently came across an interesting talk by Santosh Rau, the Netflix Cloud Infrastructure Engineering Manager. The fact that Netflix actually has a Cloud Infrastructure engineering manager is what caught my attention. Netflix continues to innovate quickly and is moving fast with cloud computing.
My notes from Rau’s talk:
· Details on Netflix
o More than 10m subscribers
o Over 100,000 DVD titles
o 50 distribution centers
o Over 12,000 instant watch titles
· Why is Netflix going to the cloud
o Elastic infrastructure
o Pay for what you use
o Simple to deploy and maintain
o Leverage datacenter geo-diversity
o Leverage application services (queuing, persistence, security, etc.)
· Why did Netflix choose Amazon Web Services
o Massive scale
o More mature services
o Thriving, active developer community of over 400,000 developers with excellent support
· Netflix goals for move to the cloud:
o Improved availability
o Operational simplicity
o Architect to exploit the characteristics of the cloud
· Services in cloud:
o Streaming control service: stream movie content to customers
§ Architecture: Three Netflix services running in EC2 (replication, queueing, and streaming) with inter-service communication via SQS and persistent state in SimpleDB.
§ Good cloud workload: usage can vary greatly, there is value in having regional data centers, and streaming content from locations near users gives a better customer experience.
o Encoding Service: encodes movies in the formats required by the diverse set of supported devices.
§ Good cloud workload: it’s very computationally intense and, as new formats are introduced, massive encoding work needs to be done and there is value in doing it quickly (more servers for less time).
o AWS Services used by Netflix
§ Elastic Compute Cloud
§ Elastic Block Store
§ Simple Queue Service
§ SimpleDB
§ Simple Storage Service
§ Elastic Load Balancing
§ Elastic MapReduce
o Developer Challenges:
§ Reliability and capacity
§ Persistence strategy
· Oracle on EC2 over EBS vs MySQL vs SimpleDB
· SimpleDB: Highly available replicating across zones
· Eventually consistent (now supports full consistency; I love eventual consistency but…)
§ Data encryption and key management
§ Data replication and consistency
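The "more servers for less time" point in the Encoding Service notes above comes down to simple arithmetic: when capacity is paid for by the server-hour, spreading a fixed amount of work over more servers shortens the wall-clock time without materially changing the bill. The sketch below is an illustration only. The job size, hourly rate, whole-hour billing rule, and perfectly parallel workload are all assumptions, not figures from the talk.

```python
import math


def encoding_plan(total_server_hours: float, servers: int, hourly_rate: float):
    """Return (wall_clock_hours, cost) for a perfectly parallel batch job.

    Assumes each server is billed in whole hours (an assumption made for
    this sketch), so partially used hours are rounded up per server.
    """
    wall_clock_hours = total_server_hours / servers
    billed_server_hours = servers * math.ceil(wall_clock_hours)
    return wall_clock_hours, billed_server_hours * hourly_rate


# Hypothetical job: 1,000 server-hours of encoding at $0.50 per server-hour.
for servers in (10, 100, 1000):
    hours, cost = encoding_plan(1000, servers, 0.50)
    print(f"{servers:>5} servers: {hours:>6.1f} hours wall clock, ${cost:.2f}")
```

Under these assumptions the cost is the same in all three cases while elapsed time drops from 100 hours to 1 hour. With server counts that do not divide the work evenly, rounding up to whole hours adds a small overhead.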
Predictably, the talk ended with “Netflix is hiring” but, in this case, it is actually worth mentioning. They are doing very interesting work and moving lightning fast. RIM is hiring as well: http://www.rim.com/careers/index.shtml.
The slides for the talk are at: slideshare.
--jrh