I’ve spent a big part of my life working on structured storage engines, first in DB2 and later in SQL Server. And yet, even though I fully understand the value of fully schematized data, I love full text search and view it as a vital access method for all content wherever it’s stored. There are two drivers of this opinion: 1) I believe, as an industry, we’re about ¼ of the way into a transition from primarily navigational access patterns to personal data to ones based upon full text search, and 2) getting agreement on broad, standardizing schema across diverse user and application populations is very difficult.
On the first point, for most content on the web, full text search is the only practical way to find it. Navigational access is available but it’s just not practical for most content. There is simply too much data and there is no agreement on schema so more structured searches are usually not possible. Basically structured search is often not supported and navigational access doesn’t scale to large bodies of information. Full text search is often the only alternative and it’s the norm when looking for something on the web.
Let’s look at email. Small amounts of email can be managed by placing each piece of email you choose to store in a specific folder so it can be found later navigationally. This works fine but only if we keep only a small portion of the email we get. If we never bothered to throw out email or other documents that we come across, the time required to folderize would be enormous and unaffordable. Folderization just doesn’t scale. When you start to store large amounts of email or just stop (wasting time) aggressively deleting email, then the only practical way to find most content is full text search. As soon as 5 to 10GB of un-folderized and un-categorized personal content is accumulated, it’s the web scenario all over again: search is the only practical alternative. I understand that this scenario is not supported or encouraged by IT or legal organizations at most companies but that is the way I choose to work. There is no technical stumbling block to providing unbounded corporate email stores and the financial ones really don’t stand up to scrutiny. Ironically, most expensive corporate email systems offer only tiny storage quotas while most free, consumer-based services are effectively unbounded. Eventually all companies will wake up to the fact that knowledge workers work more efficiently with all available data. And, when that happens, even corporate email stores will grow beyond the point of practical folderization.
The second issue was the difficulty of standardizing schema across many different stores and many different applications. The entire industry has wanted to do this over the past couple of decades and many projects have attempted to make progress. If they were widely successful, it would be wonderful but they haven’t been. If we had standardized schema, we would have quick and accurate access to all data across all participating applications. But it’s very hard to get all content owners to cooperate or even care. Search engines attempt to get to the same goal but they choose a more practical approach: they use full text search and just chip away at the problem. They work hard on ranking. They infer structure in the content where possible and exploit it where it’s found. Where structure can’t be found, at least there is full text search with reasonably good ranking to fall back upon.
Strong or dominant search engine providers have considerable influence over content owners, and so weak forms of schema standardization become more practical. For example, a dominant search engine provider can offer content owners opportunities to get better search results for their web site if they supply a web site map (standard schema showing all web pages in the site). This is already happening and web administrators are participating because it brings them value. A web site’s ranking in the important search engines is vital and a chance to lift your ranking even slightly is worth a fortune. Folks will work really hard where they have something to gain. So, if adopting common schema can improve ranking, there is a significant chance something positive actually could happen.
The combination of providing full text search over all content, motivating content providers to participate in full or partial schema standardization, and having the search engine infer schema where it’s not supplied feels like a practical approach to richer search. I love full text search and view it as the underpinning to finding all information, structured or not. The most common queries will include both structured and non-structured components, but the common element will be that full schema standardization isn’t required, nor is it required that a user understand schema to be able to find what they need. Over time, I think we will see incremental participation in standardized schemas but this will happen slowly. Full text search with good ranking and relevance, assisted by whatever schema can be found or inferred in the data, will be the underpinning to finding most content over the near term.
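To make the idea of a query with both structured and non-structured components concrete, here is a minimal sketch in Python. It is purely illustrative and not how any real search engine is built: the `TinyIndex` class, its method names, and the sample documents are all hypothetical, and it does no ranking at all. Every document is findable by its text, while structured fields are optional and are only used when a document happens to supply them.

```python
from collections import defaultdict

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lower-case the text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

class TinyIndex:
    """Hypothetical toy index: full text for everything, schema where available."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.fields = {}                  # doc id -> structured fields (may be empty)

    def add(self, doc_id, text, **fields):
        self.fields[doc_id] = fields
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query, **filters):
        terms = tokenize(query)
        if not terms:
            return []
        # Full text part: documents that contain every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Structured part: applied only when the caller supplies filters.
        return sorted(
            d for d in hits
            if all(self.fields[d].get(k) == v for k, v in filters.items())
        )

index = TinyIndex()
index.add(1, "Quarterly budget review notes", sender="alice")
index.add(2, "Budget review moved to Friday")        # no schema supplied
index.add(3, "Lunch on Friday?", sender="alice")

text_only = index.search("budget review")
text_and_schema = index.search("budget review", sender="alice")
print(text_only, text_and_schema)
```

Document 2 carries no structured fields, yet it is still found by the text-only query; the structured filter narrows results only for documents that participate in the schema.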
--jrh
James Hamilton, Windows Live Platform Services Bldg RedW-D/2072, One Microsoft Way, Redmond, Washington, 98052 W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | JamesRH@microsoft.com
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com
Some time back I got a question from the leader of a 5 to 10 person startup on what I look for when hiring a Program Manager. I make no promise that what I look for is typical of what others look for; it almost certainly is not. However, when I’m leading an engineering team and interviewing for a Program Manager role, these are the attributes I look for. My response to the original query is below:
The good news is that you’re the CEO, not me. But, were our roles reversed, I would be asking you why you think you need a PM at this point. A PM is responsible for making things work across groups and teams. Essentially they are the grease that lets a big company ship products that work together and get them delivered through a complicated web of dependencies. Does a single product startup in the pre-beta phase actually need a PM? Given my choice, I would always go with more great developers at this phase of the company’s life and have the developers take more design ownership, spend more time with customers, etc. I love the "many hats" model and it's one of the advantages of a start-up. With a bunch of smart engineers wearing as many hats as needed, you can go with less overhead and fewer fixed roles, and operate more efficiently. The PM role is super important but it’s not the first role I would staff in an early-stage startup.
But, you were asking for what I look for in a PM rather than advice on whether you should look to fill the role at this point in the company’s life. I don't believe in non-technical PMs, so what I look for in a PM is similar to what I look for in a developer. I'm slightly more willing to put up with somewhat rusty code in a PM, but that's not a huge difference. With a developer, I'm more willing to put up with minor skill deficits in certain areas if they are excellent at writing code. For example, a talented developer who isn’t comfortable with public speaking, or is only barely comfortable in group meetings, can be fine. I'll never do anything to screw up team chemistry or bring in a prima donna but, with an excellent developer, I'm more willing to look at greatness around systems building and be OK with some other skills simply not being there as long as their absence doesn't screw up the team chemistry overall. With a PM, those skills need to be there and it just won't work without them.
It's mandatory that PMs not get "stuck in the weeds". They need to be able to look at the big picture and yet, at the same time, understand the details, even if they aren't necessarily writing the code that implements the details. A PM is one of the folks on the team responsible for the product hanging together and having conceptual integrity. They are one of the folks responsible for staying realistic and not letting the project scope grow and release dates slip. They are one of the team members who need to think customer first, to really know who the product is targeting, to keep the project focused on that target, and to get the product shipped.
So, in summary: what I look for in a PM is similar to what I look for in a developer (http://mvdirona.com/jrh/perspectives/2007/11/26/InterviewingWithInsightAtMicrosoft.aspx) but I'll tolerate their coding possibly being a bit rusty. I expect they will have development experience. I'm pretty strongly against hiring a PM straight out of university -- a PM needs experience in a direct engineering role first to gain the experience to be effective in the PM role. I'll expect PMs to put the customer first and understand how a project comes together, keep it focused on the right customer set, not let feature creep set in, and to have the skill, knowledge, and experience to know when a schedule is based upon reality and when it's more of a dream. Essentially I have all the expectations of a PM that I have of a senior developer, except that I need them to have a broad view of how the project comes together as a whole, in addition to knowing many of the details. They must be more customer focused, have a deeper view of the overall project schedules and how the project will come together, be a good communicator, perhaps a less sharp coder, but have excellent design skills. Finally, they must be good at getting a team making decisions, moving on to the next problem, and feeling good about it.
--jrh
I forget what brought it up but some time back Sriram Krishnan forwarded me this article on Mike Burrows and his work through DEC, Microsoft, and Google (The Genius: Mike Burrows' self-effacing journey through Silicon Valley). I enjoyed the read. Mike has done a lot over the years but perhaps his best known works of recent years are Alta Vista at DEC and Chubby at Google.
I first met Mike when he was at Microsoft Research. He and Ted Wobber (also from Digital) came up to Redmond to visit. Back then I led the SQL Server relational engine development team, which included the full text search index support. I was convinced then, and still am today, that relational database engines do a good job of managing structured data but a poor job of the other 90 to 95% of the data in the world that is less structured. It just seems nuts to me that customers industry-wide are spending well over $10B a year on relational database management systems and yet are only able to use these systems effectively to manage a tiny fraction of their data. As an increasing fraction of the structured data in the world is already stored in relational database management systems, industry growth will come from helping customers manage their less structured data.
To be fair, most RDBMSs (including SQL Server) do support full text indexing, but what I’m after is deep support for full text where the index is a standard access method rather than a separate indexing engine on the side and, more importantly, where full statistics are tracked on the full text corpus, allowing the optimizer to make high quality decisions on join orders and techniques that include full text indices.
If you haven’t read Mike’s original Chubby paper, do that: http://labs.google.com/papers/chubby.html. Another paper is at: http://labs.google.com/papers/paxos_made_live.html. Chubby is an interesting combination of name server, lease manager, and mini-distributed file system. It’s not the combination of functionality that I would have thought to bring together in a single system but it’s heavily used and well regarded at Google. Unquestionably a success.
--jrh
The years of Moore’s law growth without regard to power consumption are now over. On the data center side, power isn’t close to the largest cost of running a large service but it is one of the largest controllable costs and it has been in the press frequently of late. On the client side, battery power is the limiting factor.
It is worth understanding which devices consume the most power since most laptops provide some form of user control. Most systems allow LCD backlight dimming, CPU power consumption can be lowered (through a combination of factors including reduced clock speed and voltage), wireless radios can be switched off, and disk activity can be curtailed or eliminated. Where does the power go?
The data below was measured by Mahesri and Vardhan with a ThinkPad R40 as the system under test:
| Device | Standby | Minimum | Maximum |
|---|---|---|---|
| CPU | | 11.3W | 25.5W |
| CD-R/RW, DVD | 0.0W | 2.8W | 5.3W |
| LCD Backlight | | 0.6W | 3.5W |
| Wireless (802.11) | 0.1W | 1.0W | 3.1W |
| HDD (40GB@4,200RPM) | 0.2W | 0.6W | 2.8W |
| LCD | | 0.9W | 1.0W |
Data from: http://www.crhc.uiuc.edu/~mahesri/classes/project_report_cs497yyz.pdf.
The dominant consumer by a significant factor is the CPU. This power consumption is, of course, very load dependent particularly in multi-core systems where the spread between minimum and maximum power dissipation is even higher. The second largest consumer is the LCD backlight, which isn’t surprising. Two LCD-related findings that I did find surprising: 1) the LCD without backlight is a very light consumer of power, and 2) there is a perceptible difference in power consumption between mostly black and mostly white backgrounds. The hard disk drive power consumption was notably less than I expected with only 2.8W dissipated during active reading.
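As a rough check on how dominant the CPU is, the sketch below totals the minimum and maximum columns of the table and computes each device's share. The figures are copied from the table; the simplification is mine: it counts only the six listed devices and assumes they all sit at the same extreme at once, which a real workload would rarely do.

```python
# Watts per device from the ThinkPad R40 table: (minimum, maximum).
# The standby column is left out because several cells are blank.
POWER_W = {
    "CPU": (11.3, 25.5),
    "CD-R/RW, DVD": (2.8, 5.3),
    "LCD Backlight": (0.6, 3.5),
    "Wireless (802.11)": (1.0, 3.1),
    "HDD (40GB@4,200RPM)": (0.6, 2.8),
    "LCD": (0.9, 1.0),
}

def shares(column):
    """Return (total watts, {device: fraction of total}) for one column index."""
    total = sum(watts[column] for watts in POWER_W.values())
    return total, {dev: watts[column] / total for dev, watts in POWER_W.items()}

total_min, share_min = shares(0)
total_max, share_max = shares(1)

for label, total, share in (("minimum", total_min, share_min),
                            ("maximum", total_max, share_max)):
    print(f"{label}: {total:.1f}W across listed devices")
    # Largest consumer first.
    for dev, frac in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {dev:<22} {frac:6.1%}")
```

Under these assumptions the CPU accounts for roughly three fifths or more of the listed devices' draw in both columns, with the backlight second at maximum.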
I wrote up more detail in: ClientSidePower6_External.doc (130 KB).
--jrh
My rough notes from the Web 2.0 Keynote by Yahoo! CTO Ari Balogh:
· Yahoo! is making three big bets:
1. Be the starting point for all consumers
2. Be the must buy for advertisers
3. Provide an Open Platform
· Focus of today’s talk is on the latter, open platform.
· Yahoo!’s broad set of assets is well known
· We lead in 7 areas including: Mail, My Front Page and Messenger (the full list was not provided, nor was how Yahoo! was computed to “lead” in these areas)
· 350M unique users/month and 500M users overall
· 20B page views/month
· 250M user minutes per month
· 10B user relationships across properties and this is the real asset
· Yahoo! has been open since 2003
· 25+ APIs
· 200K App IDs (hints at the large number of developers)
· #2 API in the world with Flickr
· 1B UI files served/week
· Y!OS: (Yahoo! Open Strategy)
· Announcing today they are opening all assets at Yahoo! to developers
· Planning to make all experiences at Yahoo “social”
· Provide an open developer platform with hooks for third parties to make experiences more social
· Built into application platform:
· Security: give users control of their data. Where they want to share what with who.
· Application gallery. A common way to <JRH>
· Unify profiles across all of Yahoo (this will take a while) and give developers access to the social graph and the notification engine. Open up developer access to produce events; the platform includes the ranking engine to show users the most relevant events based upon their context (including social graph).
· Making Yahoo! more social:
· Not creating another social network
· Making all of yahoo “social”
· “social” isn’t a destination but rather a dimension of a user experience
· “social” drives relevance, community, and virality
· Showed some examples:
· Email client showing messages most relevant on the basis of social network
· Same basic idea for a “My Yahoo!” page
· When?
· Search Monkey is the first step
· Later this year they will deliver Y!OS and provide more uniform and consistent developer access
· Making Yahoo! more social will take longer with property by property steps being taken over next few years
· Summary:
1. Rewiring Yahoo! from the ground up
2. Open Yahoo! to developers like never before
3. Making Yahoo! more social across Yahoo! properties and to third party developers
The 12 min presentation is at: Ari Balogh Web 2.0 Expo Keynote.
Flash SSDs in laptops have generated considerable excitement over the last year and are in use at both extremes of the laptop market. At the very low end, where only very small storage amounts can be funded, NAND Flash is below the disk price floor. Mechanical disks with all their complexity are very difficult to manufacture for less than $30 each. What this means is that for very small storage quantities, NAND Flash storage can actually be cheaper than mechanical disk drives even though the price per GB for Flash is larger. That’s why the One Laptop Per Child project uses NAND flash for persistent storage. At the high end of the market, NAND flash is considerably more expensive than disk but, for the premium price, offers much higher performance, more resilience to shock and high G handling, and longer battery life.
Recently there have been many reports of high-end SSD laptop performance problems. Digging deeper, this is driven by two factors: 1) gen 1 SSDs produce very good read performance but aren’t particularly good on random write workloads, and 2) performance degradation over time. The first factor can be seen clearly in this performance study using SQLIO: http://blogs.mssqltips.com/blogs/chadboyd/archive/2008/03/16/ssd-and-sql-sqlio-performance.aspx. The poor random write performance issue is very solvable using better Flash wear leveling algorithms, reserving more space (more on this later), and capacitor backed DRAM staging areas. In fact STEC ZeusIOPS is producing great performance numbers today, Fusion IO is reporting great numbers, and many others are coming. The first problem, that of poor random write performance, can be solved and these solutions will migrate down to the commodity drives.
The second problem, the performance degradation issue, is more interesting. There have been many reports of laptop dissatisfaction and very high return rates (Returns, technical problems high with flash-based notebooks). Dell has disputed these claims (Dell: Flash notebooks are working fine), but there are lingering anecdotal complaints of degrading performance. I’ve heard it enough myself that I decided to dig deeper. I chatted off the record with an industry insider on why SSDs appear to degrade over time. Here’s what I learned (released with their permission):
On a pristine NAND SSD made of quality silicon to ensure write amplification remaining at 1 [jrh: write amplification refers to the additional writes that are caused by a single write due to wear leveling and the Flash erase block sizes being considerably larger than the write page size – the goal is to get this as close to 1 as possible where 1 is no write amplification], given a not-so-primitive controller and reasonable over-provisioning (greater than 25%), a sparsely used volume (less than half full at any time) will not start showing perceptible degraded performance for a long time (perhaps as long as 5 years, the projected warranty period to be given to these SSD products).
If any of the above conditions is changed, the write amplification will quickly degrade ranging from 2 to 5, or even higher. That contributes to the early start of perceptible degraded write performance. That is, on a fairly full SSD you’d start having perceptible write performance problems more quickly, and so on.
Inexpensive (cheap?) SSDs made of low-quality silicon will likely have more read errors. Error correction techniques will still guarantee correct information being returned on reads. However, each time a read error is detected, the whole “block” of data will have to be relocated elsewhere on the device. Poorly designed controller firmware will worsen the read delay, due to poorly implemented algorithms and ill-conceived space layout that take longer to search for available space for the relocated data, away from the read error area.
If the read-error-data-relocation happens to collide with the negative conditions that plague the write performance above, you’d start seeing overall degraded performance very quickly.
Chkdsk may have contributed to the forced relocation of the data away from where read errors occurred, hence improving the SSD performance (for a while) until the above collisions happen. Perhaps the same when Defrag is used.
In short, performance degradation over time is unavoidable with SSD devices. It’s a matter of how soon it kicks in and how bad it gets; and it varies across designs.
We expect the enterprise class SSD devices to be as much as 100% over-provisioned (e.g., a 64GB SSD actually holds 128GB of flash silicon).
Summary: there are two factors in play. The first is that SSD random write performance is not great on low end parts, so ensure you understand the random write I/O specification before spending on an SSD. The second one is more insidious in that, in this failure mode, the performance just degrades slowly over time. The best way to avoid this phenomenon is to 2x over-provision. If you buy N bytes of SSD, don’t use more than ½N, and consider either chkdsk or copying the data off, bulk erasing, and sequentially copying back on. We know over-provisioning is effective. The latter techniques are unproven but seem likely to work. I’ll report supporting performance studies or vendor reports when either surface.
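The arithmetic behind these terms is simple enough to sketch. The helper names below are hypothetical, and the over-provisioning formula is my reading of the insider's example (a 64GB SSD holding 128GB of flash described as 100% over-provisioned); vendors may define the percentage differently.

```python
def over_provisioning(raw_gb, exposed_gb):
    """Spare flash as a fraction of the capacity the drive exposes.

    Assumed definition: (raw - exposed) / exposed, chosen so that
    128GB of flash behind a 64GB drive comes out as 1.0 (100%).
    """
    return (raw_gb - exposed_gb) / exposed_gb

def write_amplification(flash_bytes_written, host_bytes_written):
    """Physical flash writes per byte the host asked to write (1.0 is ideal)."""
    return flash_bytes_written / host_bytes_written

def usage_budget(purchased_gb):
    """The rule of thumb above: buy N, use no more than half of N."""
    return purchased_gb / 2

enterprise_op = over_provisioning(raw_gb=128, exposed_gb=64)
degraded_wa = write_amplification(flash_bytes_written=500, host_bytes_written=100)
budget_for_64gb = usage_budget(64)

print(f"over-provisioning: {enterprise_op:.0%}")
print(f"write amplification: {degraded_wa:.1f}")
print(f"usable budget on a 64GB drive: {budget_for_64gb:.0f}GB")
```

A write amplification of 5, the upper end of the 2 to 5 range quoted above, means the flash absorbs five times the writes the host issued, which is why a full, lightly over-provisioned drive slows down sooner.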
--jrh
Earlier today, Amazon AWS announced a reduction in egress charges. The new charges:
· $0.100 per GB - data transfer in
· $0.170 per GB - first 10 TB / month data transfer out
· $0.130 per GB - next 40 TB / month data transfer out
· $0.110 per GB - next 100 TB / month data transfer out
· $0.100 per GB - data transfer out / month over 150 TB
Compared with the old:
· $0.100 per GB - data transfer in
· $0.180 per GB - first 10 TB / month data transfer out
· $0.160 per GB - next 40 TB / month data transfer out
· $0.130 per GB - data transfer out / month over 50 TB
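To see what the change means for a given monthly volume, the tiered rates can be applied marginally, the way a tax bracket works. This is a sketch rather than AWS's billing logic: the function name and the 60 TB example are mine, and it assumes 1 TB = 1,024 GB, which the announcement quoted above does not state.

```python
GB_PER_TB = 1024  # assumption: binary terabytes; change to 1000 if decimal applies

# Transfer-out tiers as (tier size in TB, price per GB); None means "everything above".
NEW_TIERS = [(10, 0.170), (40, 0.130), (100, 0.110), (None, 0.100)]
OLD_TIERS = [(10, 0.180), (40, 0.160), (None, 0.130)]

def transfer_out_cost(gb, tiers):
    """Monthly transfer-out charge in dollars, billing each tier at its own rate."""
    remaining = gb
    cost = 0.0
    for size_tb, price_per_gb in tiers:
        if remaining <= 0:
            break
        # GB billed in this tier: the rest of the volume, capped at the tier size.
        in_tier = remaining if size_tb is None else min(remaining, size_tb * GB_PER_TB)
        cost += in_tier * price_per_gb
        remaining -= in_tier
    return cost

monthly_gb = 60 * GB_PER_TB  # hypothetical service sending 60 TB out per month
new_cost = transfer_out_cost(monthly_gb, NEW_TIERS)
old_cost = transfer_out_cost(monthly_gb, OLD_TIERS)
print(f"old ${old_cost:,.2f}  new ${new_cost:,.2f}  saving {1 - new_cost / old_cost:.1%}")
```

Transfer in is unchanged at $0.100 per GB in both lists, so it is left out of the comparison.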
|