James Hamilton's Blog
 Monday, November 17, 2008

Two weeks ago I posted the notes I took from Tony Hoare’s “The Science of Programming” talk at the Computing in the 21st Century Conference in Beijing.  

 

Here are the slides from the original talk: Tony Hoare Science of Programming (199 KB).

Here are my notes from two weeks back: Tony Hoare on The Science of Programming.

 

                                --jrh

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

Monday, November 17, 2008 8:37:52 AM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, November 15, 2008

Last week, IBM honored database giant Pat Selinger by creating a Ph.D. Fellowship in her name.  I worked closely with Pat for many years at IBM, and much of what I learned about database management systems I learned from Pat during those years.  She was one of the original members of the IBM System R team and is probably best known as the inventor of the cost-based optimizer. Access Path Selection in a Relational Database Management System is a paper from that period that I particularly enjoyed.

 

From the IBM press release:

 

Pat Selinger IBM Ph.D. Fellowship: awarded to an exceptional female Ph.D. student worldwide with special focus on database design and management

 

Pat Selinger IBM Ph.D. Fellowship
Dr. Pat Selinger was a leading member of the IBM Research team that produced the world's first relational database system and established the basic architecture for the highly successful IBM DB2 database product family. Her innovative work on cost-based query optimization for relational databases has been adopted by nearly all relational database vendors and is now taught in virtually every university database course. In 1994, Dr. Selinger was named an IBM Fellow -- an honor accorded only to the top 50 technical experts in IBM -- and in 2004, she was inducted into the Women in Technology International Hall of Fame.

 

An ACM Queue interview with Pat: A conversation with Pat Selinger.

 

It’s great to see IBM actively supporting engineering education, particularly encouraging female engineers, and recognizing Pat Selinger’s contribution to the commercial and academic database community.

 

                                                --jrh

 


 

Saturday, November 15, 2008 11:10:14 AM (Pacific Standard Time, UTC-08:00)
Ramblings
 Wednesday, November 12, 2008

Intel Fellow and Director of Storage Architecture Knut Grimsrud presented at WinHEC 2008 last week and it caught my interest for several reasons: 1) he talked about Intel findings with their new SSD which looks like an extremely interesting price/performer, 2) they have found interesting power savings in their SSD experiments beyond the easy to predict reduction in power consumption of SSDs over HDDs, and 3) Knut presented a list of useful SSD usage do’s and don’ts.

 

Starting from the best practices:

·         DO queue requests to SSD as deeply as possible

  • SSD has massive internal parallelism and generally is underutilized. Parallelism will further increase over time.
  • Performance scales well with queue depth

·         DON’T withhold requests in order to “optimize” or “aggregate” them

  • Traditional schemes geared towards reducing HDD latencies do not apply. Time lost in withholding requests difficult to make up.

·         DO worry about software/driver overheads & latencies

  • At 100K IOPS how does your SW stack measure up?

·         DON’T use storage “backpressure” to pace activity

  • IO completion time (or rate) is not a useful pacing mechanism and attempting to use that as throttle can result in tasks generating more activity than desired
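A minimal sketch of the first two points, assuming a POSIX system (os.pread is not available on Windows): every read is handed to a thread pool immediately rather than being held back for batching. The name read_blocks, the queue depth of 32, and the 4 KB block size are illustrative assumptions, not figures from Knut's talk, and a thread pool only approximates a true asynchronous I/O queue.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # illustrative block size, not a value from the talk


def read_blocks(path, offsets, queue_depth=32):
    """Issue all reads up front so up to queue_depth are in flight together."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # Submit every request immediately; nothing is withheld or aggregated.
            futures = [pool.submit(os.pread, fd, BLOCK_SIZE, off) for off in offsets]
            # Results come back in the same order as the requested offsets.
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```

Raising queue_depth keeps more requests outstanding at the device, which is the behavior the first DO asks for.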

 

Common HDD optimizations you should avoid:

·         Block/page sizes, alignments and boundaries

  • Intel® SSD is insensitive to whether host writes have any relationship to internal NAND boundaries or granularities
  • Expect other high-performing SSDs to also handle this
  • Internal NAND structures constantly changing anyway, so chasing this will be a losing proposition

·         Write transfer sizes & write “globbing”

  • No need to accumulate writes in order to create large writes
  • Temporarily logging writes sequentially and later re-locating to final destination unhelpful to Intel SSD (and is detrimental to longevity)

·         Software “helping” by making near-term assumptions about SSD internals will become a long-term hindrance

  • Any SW assistance must have longevity

 

On the power savings point, Knut laid out an interesting argument for power savings from SSDs over HDDs beyond the standard device power difference.  That device-level difference is real of course but, on a laptop where an HDD typically draws around 2.5W when active, these often-cited savings are relatively small. However, Knut reported an additional measurable saving. Because SSDs are considerably faster than HDDs, the speculative page fetching done by Windows Superfetch is not needed.  And, because Superfetch is sometimes incorrect, the additional I/Os and processing it performs consume more power.  Essentially, with the very high random I/O rates offered by SSDs, Superfetch isn’t needed and, if it is disabled, there will be additional power savings due to reduced I/O and page processing activity.

 

Another potential factor I’ve discussed with Knut is that, in standard laptop operation, the common usage model is one where there are periods of inactivity and short periods of peak workload, typically accompanied by high random I/O rates.  More often than not, laptop performance is bounded by random I/O performance. If SSD usage allows these periods of work to be completed more quickly, the system can quickly return to an idle, low-power state.  We’ve not measured this gain, but it seems intuitive that getting the work done more quickly will leave the system active for shorter periods and in idle states for longer.  Assuming a faster system spends more time in idle states (rather than simply doing more work), we should be able to measure additional power savings indirectly attributable to SSD usage.
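The intuition can be put into a toy energy model. The wattages and durations below are hypothetical numbers chosen only to show the arithmetic; as noted, the real gain has not been measured.

```python
def energy_joules(active_watts, idle_watts, active_seconds, window_seconds):
    """Energy over a fixed window: active for part of it, idle for the rest."""
    if active_seconds > window_seconds:
        raise ValueError("active time cannot exceed the window")
    idle_seconds = window_seconds - active_seconds
    return active_watts * active_seconds + idle_watts * idle_seconds


# Hypothetical laptop: 20 W when busy, 5 W when idle, over a 60 s window.
slow = energy_joules(20.0, 5.0, active_seconds=30.0, window_seconds=60.0)
fast = energy_joules(20.0, 5.0, active_seconds=10.0, window_seconds=60.0)
print(slow, fast)  # 750.0 450.0
```

The model only shows a saving if the time gained is actually spent idle, which is the assumption called out in the paragraph above.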

 

Knut’s slides: Intel’s Solid State Drives. Thanks to Vlad Sadovsky for sending this one my way.

 

                                                --jrh

 


 

Wednesday, November 12, 2008 5:01:17 PM (Pacific Standard Time, UTC-08:00)
Hardware
 Sunday, November 09, 2008

Abolade Gbadegesin, Windows Live Mesh Architect, gave a great talk at the Microsoft Professional Developers Conference on Windows Live Mesh (talk video, talk slides). Live Mesh is a service that supports p2p file sharing amongst your devices, file storage in the cloud, remote access to all your devices (through firewalls and NATs), and web access to those files you choose to store in the cloud. Live Mesh is a good service and worth investigating in its own right, but what makes this talk particularly interesting is that Abolade gets into the architecture of how the system is written and, in many cases, why it is designed that way.

 

I’ve been advocating redundant, partitioned, fail-fast service designs based upon Recovery Oriented Computing for years.  For example, Designing and Deploying Internet Scale Services (paper, slides). Live Mesh is a great example of such a service.  It’s designed with enough redundancy and monitoring that service anomalies are detected and, when one is detected, it auto-recovers by first restarting, then rebooting, and finally re-imaging the failing system.
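The restart, reboot, re-image sequence is an escalation ladder. The sketch below is a generic illustration of that pattern, not Live Mesh code; recover, is_healthy, and apply_action are hypothetical names.

```python
RECOVERY_LADDER = ["restart", "reboot", "reimage"]


def recover(is_healthy, apply_action, ladder=RECOVERY_LADDER):
    """Try each recovery action in order until the health check passes.

    Returns the list of actions taken; raises if the ladder is exhausted.
    """
    taken = []
    if is_healthy():
        return taken
    for action in ladder:
        apply_action(action)
        taken.append(action)
        if is_healthy():
            return taken
    raise RuntimeError("system still unhealthy after " + ", ".join(taken))


# Simulated fault that only a reboot clears (hypothetical).
state = {"healthy": False}


def fake_apply(action):
    if action == "reboot":
        state["healthy"] = True


print(recover(lambda: state["healthy"], fake_apply))  # ['restart', 'reboot']
```

Trying the cheapest action first keeps recovery time short for the common case while still guaranteeing a known-good state at the end of the ladder.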

 

It’s partitioned across multiple data centers and, in each datacenter, across many symmetric commodity servers each of which is a 2 core, 4 disk, 8 GB system. The general design principles are:

·         Commodity hardware

·         Partitioning for scaling out, redundancy for availability

·         Loose coupling across roles

·         Xcopy deployment and configuration

·         Fail-fast, recovery-oriented error handling

·         Self-monitoring and self-healing

 

The scale out strategy is to:

·         Partition by user, device, and Mesh Object

·         Use soft state to minimize I/O load

·         Leverage HTTP 1.1 semantics for caching, change notification, and incremental state transfer

·         Leverage client-side resources for holding state

·         Leverage peer connectivity for content replication
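As an illustration of the first point, a stable hash can map a user, device, or Mesh Object ID to a partition. The talk notes do not say how Live Mesh assigns partitions, so partition_for and the modulo scheme below are assumptions for the sake of the example.

```python
import hashlib


def partition_for(key, partition_count):
    """Map a user, device, or Mesh Object ID to a stable partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % partition_count


print(partition_for("user:alice", 16) == partition_for("user:alice", 16))  # True
```

Plain modulo hashing remaps most keys when partition_count changes, so a production system would more likely use a directory or consistent hashing.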

 

Experiences and lessons learned on availability:

·         Design for loosely coupled dependence on building blocks

·         Diligently validate client/cloud upgrade scenarios

·         Invest in pre-production stress and functional coverage in environments that look like production

·         Design for throttling based on both dynamic thresholds and static bounds
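One way to read the last point: the admission limit moves with observed load (the dynamic threshold) but is always clamped to fixed limits (the static bounds). The function name, the halving and additive steps, and all of the numbers below are hypothetical.

```python
def next_limit(current_limit, observed_latency_ms, target_latency_ms,
               min_limit=10, max_limit=1000):
    """Adjust the admitted request rate, then clamp it to static bounds."""
    if observed_latency_ms > target_latency_ms:
        proposed = current_limit * 0.5  # back off quickly under stress
    else:
        proposed = current_limit + 10   # recover slowly when healthy
    return max(min_limit, min(max_limit, int(proposed)))


print(next_limit(100, observed_latency_ms=500, target_latency_ms=200))  # 50
print(next_limit(995, observed_latency_ms=100, target_latency_ms=200))  # 1000
```

The static bounds protect against a misbehaving feedback signal driving the limit to zero or letting it grow without bound.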

 

Experiences and lessons learned on monitoring:

·         Continuously refine performance counters, logs, and log processing tools

·         Monitor end-user-visible operations (Keynote)

·         Build end-to-end tracing across tiers

·         Self-healing is hard:  Invest in tuning watchdogs and thresholds

 

Experiences and lessons learned on deployment:

·         Deployments every other week, client upgrades every month

·         Major functionality roughly each quarter

·         Took advantage of gradual ramp to learn lessons early

 

--jrh

 

Thanks to Andrew Enfield  for sending this one my way.

 


 

Sunday, November 09, 2008 10:05:38 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, November 05, 2008

Butler Lampson, one of the founding members of Xerox PARC, Turing award winner, and one of the most practical engineering thinkers I know, spoke a couple of days ago at the Computing in the 21st Century Conference in Beijing. My rough notes from Butler’s talk follow.  Overall, Butler argues that “embodiment” is the next big phase of computing after simulation and communications.  Butler defines embodiment as computers interacting directly with the physical world, for example, autonomously driven vehicles.  Butler argues that this class of applications is only possible now due to the rapidly falling price of computing coupled with systems capabilities driven by Moore’s law.

 

He argues that we need to further advance how we deal with uncertainty and dependability to be successful with these applications.  Uncertainty is important since all input has noise, all sensors have faults, and all data is incomplete.  Dependability is important because these systems interact directly with the physical world, and actions in the physical world can have life-critical failure modes.

 

Butler’s recommendation on how to build incredibly complex systems that directly interact with the physical world, and yet have these systems be dependable, is to build them in two tiers.  At the core is a small, simple kernel that doesn’t do a great job of its task but doesn’t hard fail and won’t kill anyone.  He calls this “catastrophe mode”.  For example, an autonomous vehicle may slow down to 10 MPH or just safely stop in catastrophe mode.

 

The software stack is designed in two layers where the top layer is responsible for the complex, real time interaction the system is designed to deliver. The inner or lower layer is catastrophe mode designed to be simple and, as only simple systems can be, correct.  I like the approach.
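A minimal sketch of the two-layer idea, using the autonomous vehicle example. The names safe_kernel, control_step, complex_planner, and is_plausible are hypothetical, and a real system would need far more than an exception handler and a plausibility check.

```python
def safe_kernel(sensors):
    """Catastrophe mode: trivially simple, always returns a safe command."""
    return {"target_speed_mph": 0.0, "mode": "catastrophe"}


def control_step(sensors, complex_planner, is_plausible):
    """Run the complex layer, falling back to the simple kernel on any doubt."""
    try:
        command = complex_planner(sensors)
    except Exception:
        # The complex layer failed outright; do not try to be clever.
        return safe_kernel(sensors)
    if not is_plausible(command, sensors):
        return safe_kernel(sensors)
    return command
```

The kernel never depends on the complex layer, so it stays small enough to review in full.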

 

Butler’s slides: ButlerLampson_China_Microsoft2008 (1.49 MB).

 

                                                                --jrh

 

Title: The Uses of Computers: What's Past is Merely Prologue

Speaker: Butler Lampson

 

Implications of Moore’s Law:

·         Spend hardware to simplify software

·         Hardware enables new applications

·         Pull complexity up into software (if unavoidable)

The uses of computers:

·         1950: Simulation

·         1980: Communications

·         2010: Embodiment (computers interacting directly with the physical world)

Argument: embodiment is now possible and there are some grand challenges that fall into this category:

·         Gave some examples from Jim Gray’s Systems Challenges (Turing award lecture)

·         Butler’s example: Reduce highway traffic deaths to zero

What do we need to learn how to deal with to achieve embodiment in general and zero traffic deaths in particular:

·         Dealing with uncertainty

o   Need good models of what can happen (what is possible)

o   Need boundaries for models (where they don’t apply)

·         Dependability

o   The system meets its spec

o   Measure: probability(failure) x Cost(failure)

o   Hard to model dependability. Recommends using “no catastrophes”

o   Must have a threat model of what can go wrong

o   Recommends producing a simple, small base that will avoid catastrophe. It must be simple. There may be incredibly complex, very highly optimized layers, but a reliable system needs to be able to fall back to the reliable base kernel (less than 50k loc?)
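The dependability measure above, probability(failure) x cost(failure), can be summed over a threat model. The failure modes and numbers below are made up purely to show the calculation.

```python
def expected_failure_cost(failure_modes):
    """Sum probability x cost over (probability, cost) pairs."""
    for probability, _cost in failure_modes:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
    return sum(probability * cost for probability, cost in failure_modes)


# Hypothetical threat model: a frequent cheap fault and a rare catastrophic one.
threat_model = [
    (0.01, 1_000),           # minor fault
    (0.000001, 10_000_000),  # catastrophe
]
print(expected_failure_cost(threat_model))
```

In this made-up model each mode contributes roughly the same expected cost, even though the two outcomes are very different in kind.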

Conclusions for Engineers:

·         Understand Moore’s Law

·         Aim for mass markets

·         Learn how to deal with uncertainty

·         Learn how to avoid catastrophe (avoiding faults is not possible in systems at scale)

 


 

Wednesday, November 05, 2008 1:07:21 PM (Pacific Standard Time, UTC-08:00)
Software
 Tuesday, November 04, 2008

Tony Hoare spoke yesterday at the Computing in the 21st Century Conference in Beijing. Tony is a Turing award winner, Quicksort inventor, author of the influential Communicating Sequential Processes (CSP) formal language, and long-time advocate of program verification and tools to help produce reliable software systems. In his talk he argues that programming should be and can be a science, and that the goal should be correct programs that stay correct through change: zero-defect software.

 

He explains that engineers will accept that there will be defects but the scientist should pursue perfection far beyond that for which there is a commercial need. Tony has spent a big part of his successful career in pursuit of techniques and tools to produce reliable complex systems.

 

Tony ended his talk on a practical engineering note, hoping that we can advance our field to the point that “Software will contain no more errors than other engineering disciplines”.  We’re not there yet.

 

My rough notes from the talk follow.

 

Title: The Science of Programming

Speaker: Tony Hoare

 

The Vision:

·         Computer software contains no more errors

o   Software is the most reliable component of any device that contains it

·         Programmers make no mistakes

o   Programs work the first time they run

o   They run forever after, even after changing

·         Programming is an engineering discipline

o   Respected for its delivered benefits and its foundation on basic science

·         Semantics is the science of programming

o   Explores the meaning of computer programs

o   Operational: correctness of implementation

o   Algebraic: Correctness of optimization

o   Axiomatic

The Insight:

·         Computer programs are mathematical formulae

o   They don’t suffer from rust, wear, decay, fatigue

o   If a correct program is started in a correct state, then it will stay correct

·         Their correctness is a mathematical conjecture

o   To be proved by logic and calculation

o   Checked by the computer itself
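One standard way to write such a conjecture is a Hoare triple from axiomatic semantics. These notes do not record which notation the talk used, so the example below is an illustration rather than a transcription of a slide.

```latex
\{P\}\; C \;\{Q\}
\qquad\text{for example}\qquad
\{x = n\}\;\; x := x + 1 \;\;\{x = n + 1\}
```

Read: if precondition P holds before command C runs, and C terminates, then postcondition Q holds afterwards.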

History of the idea:

·         Aristotle (350bc): Syllogistic logic

·         Euclid (300bc): geometry

·         Leibniz (1700): calculus

·         Boole (1850): laws of thought

·         Frege (1880): predicate logic

·         Russell (1920): Principia

·         Hao Wang (1956): Computer checks

Basic Science:

·         Answers fundamental questions

·         What does it do?

·         How does it work?

·         Why does it work?

·         How do we know?

What does it do?

·         Answered by its behavioral specification

How does it work?

·         Answered by its internal interface contracts

Why does the program work?

·         Answered by programming theory

How do we know?

·         By logical/mathematical proof

Ideals in Basic Science

·         Pursued for the sake of scientific glory far in advance of commercial need

·         Physics: accuracy of measurement

·         Chemistry: purity of materials

·         Computing Science: zero defect programs

Unifying Theory

·         Basic science seeks unifying theories

·         Explains diverse phenomena

·         Supported by evidence

Overall, industry is not heavily using software verification along the lines that Tony wants to see but there are some in use. For example, some tools in use at Microsoft:

·         PREfix and PREfast

·         Static Driver Verifier

·         ESP (locates potential buffer overflows)

The Hope:

·         Software will contain no more errors than other engineering disciplines.

 


 

Tuesday, November 04, 2008 2:23:57 PM (Pacific Standard Time, UTC-08:00)
Software
 Saturday, October 25, 2008

Service monitoring at scale is incredibly hard. I’ve long argued that you should never learn anything about a problem your service is experiencing from a customer.  How could they possibly know first when there is a service outage or issue? And, yet it happens frequently. The reason it happens is most sites don’t have close to an adequate level of instrumentation.  Without this instrumentation, you are flying blind.

 

Systems monitoring data can be used to drive alerts, to compute SLAs, to drive capacity planning, to find latencies, and to understand customer access patterns. Some sites also use it to drive billing, although the latter is probably a mistake.

 

In the rare cases where I’ve come across high quality monitoring systems that actually do fine-grained data collection, the data is often not looked at or is underutilized.  It turns out that fully using and exploiting very large amounts of monitoring data isn’t much easier than collecting it.

 

Returning to the challenge of efficiently collecting fine-grained monitoring data and events from thousands of servers, Facebook made a contribution yesterday by making Scribe available as an open source project: Facebook's Scribe technology now open source.  Scribe is used at Facebook to monitor their more than 10k servers across multiple data centers.  Scribe is a Sourceforge project at: http://sourceforge.net/projects/scribeserver/.

 

Facebook continues to develop interesting and broadly useful software and often contributes it to the community by making it open source. For example, Facebook Releases Cassandra as Open Source.

 

Some excerpts from On Designing and Deploying Internet-Scale Services on why I think auditing, monitoring, and alerting are important:

 

Alerting is an art. There is a tendency to alert on any event that the developer expects they might find interesting and so version-one services often produce reams of useless alerts which never get looked at. To be effective, each alert has to represent a problem. Otherwise, the operations team will learn to ignore them. We don’t know of any magic to get alerting correct other than to interactively tune what conditions drive alerts to ensure that all critical events are alerted and there are not alerts when nothing needs to be done. To get alerting levels correct, two metrics can help and are worth tracking: 1) alerts-to-trouble ticket ratio (with a goal of near one), and 2) number of systems health issues without corresponding alerts (with a goal of near zero).
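The two metrics in the excerpt are simple to compute once alerts, tickets, and health issues are being counted. The function name and the input shape below are assumptions made for illustration; the paper does not prescribe a data model.

```python
def alerting_metrics(alert_count, ticket_count, health_issues):
    """Return (alerts-to-ticket ratio, number of health issues with no alert).

    health_issues is an iterable of booleans, True if that issue raised an alert.
    The ratio is None when there are no tickets to compare against.
    """
    ratio = alert_count / ticket_count if ticket_count else None
    missed = sum(1 for alerted in health_issues if not alerted)
    return ratio, missed


# Hypothetical week: 120 alerts, 40 tickets, one of three issues went unalerted.
print(alerting_metrics(120, 40, [True, True, False]))  # (3.0, 1)
```

A ratio well above one suggests noisy alerts; a non-zero second value means problems are going unnoticed by the alerting system.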