How a PR Miracle can save VMware from itself

(and how one little itsy-bitsy code snafu has resulted in a huge FAIL for the soon-to-be AMD of the virtualization industry)

Sat August 16 19:37:01 EDT 2008

martums

On or about 12 August 2008 at 1238 hours EST AU, VMware Communities user mattjk of Melbourne, Australia, started a thread in the VI3 ESX 3.5 Communities Discussion Forum. This thread was one of the earliest public, high-profile indicators of the failure of ESX & ESXi 3.5 update 2 to power on or VMotion virtual machines, aka guests, if the ESX (or host) server time was on or after 12 August. Affected builds include 103908 & 103909, respectively ESX and ESXi. Without VMotion, features leveraged by Enterprise customers such as High Availability (HA) and Distributed Resource Scheduler (DRS) also failed, potentially crippling some VMware shops.

Within 24 hours, over 500 responses to the original post were added. One by one, as clocks in each successive time zone moved into Tuesday morning, the early adopters of 3.5 update 2 around the globe began to encounter "A General System error occurred: Internal error" in their Virtual Infrastructure (VI) client applications. As the numbers of host failures increased globally, the impacts of this incident had several obvious effects on the vendor:

The First Global Hypervisor Failure

Initial indications suggested that VMware erroneously released their latest ESX & ESXi server product builds with some sort of time-limitation which impacted the ESX host's license management. This release of ESXi was the first time any edition of ESX (in this case, just ESXi) had been made freely available, as announced earlier by Paul Maritz, the new boss, head man, top dog, big cheese. Free, as in (light) beer. (I realize that the rule is never complain about free beer. But this one tastes terrible).

Technical blogs and news sites, including Slashdot and my beloved El Reg, were quick to publish information as it became available. If VMware's KB is available, here are three related documents.

VMware's widely-adopted hypervisor is your newest single point of failure.

Speaking of titanic foul-ups, the new VMware chief, Maritz, is a former Microsoft executive.

Why This Is Important

(Anyone remotely familiar with the growing x86 virtualization trend can skip this first bit). Once upon a time, when fuel was cheap, men were men, and sheep were scared, data centers sucked up electricity and cooling like they were going out of style. Fast forward to today: they've gone out of style. Now, the big, mean, greedy corporations are trying to save money every possible way. Short of using commodity components like Google, (and if you're an ignorant middle management tard that thinks (irony) that's a good idea, learn this: Google's leveraging of commodity hardware is currently the rare exception, not the rule, dummy), the marketing hype surrounding the x86 virtualization industry rings reasonably true: consolidate many under-utilized or legacy x86 servers onto fewer, more powerful, x86 host servers. The guests can be built from scratch or P2V'ed. The increased resource utilization of those fewer host servers results in reduced cooling needs and less electricity consumed, lowering TCO, etc. etc. Less money, less power, more bang for your money. Funny, it's along similar lines to what Gates and Ballmer were preaching with the launch of Windows Server 2003: do more, with less.

Virtualization is decades old. There's heaps more available about the history of x86 virtualization online, just ask Google and Wikipedia.

Shiny

With a significantly smaller installation footprint, smaller memory overhead, no *nix bootstrap, no service console (well, sort-of), ESXi can also be delivered both traditionally and in newer fashions as bootable USB flash media, or similarly yet more significantly, embedded on server motherboards by OEMs. The latter represents a fundamental shift in how the hypervisor with the larger installation footprint, ESX, is and has been previously delivered and implemented.

Version 3.5 update 2 promised some fantastic new features, some of which may have been driving forces for early adoption. These and other attributes (especially being free (as in beer)) make ESXi attractive to a wider market, both new and existing customers. Not only does VMware currently have the majority of the x86 virtualization market, but their market share is steadily growing.

Bells and whistles aside, just how widespread was this little coding snafu?

Impact

The effects of this incident may have been fairly limited from a global IT perspective. Fast-forward five or ten years from now, when x86 virtualization is the standard, where management, budgets, and logic dictate that you virtualize first, and have to jump through considerable hoops to justify and actually get a physical server. The next global hypervisor failure will be a meltdown, potentially impacting hundreds of millions of people in some fashion. So much shit will hit the fan, oh, you get the picture.

Had the time bomb been, say, 90 or 180 days out --Holy Headache, Batman-- this could have been a bloody catastrophe. Exaggeration?

Before we get into the whole "putting-VMware-under-the-proverbial-microscope," just one tiny question: who in their right bloody-effing minds puts two-week old anything into production? It's one thing to drop it onto a sandbox, but production? Seriously. I'm not sure how the old dog-fooding-analogy covers this, so here goes: We (IT) ought to thoroughly kick (the shit out of) the tires on anything before it gets handed over to our customers. </soapbox> That's not news to > 90% of the (four) people that will read this, (hi mom!), so don't take it personally and email me. For those who just have to file a complaint, my preferred email address for complaints is comega@attrition.org. Now where's my shit-eating-grin emoticon?

Was the time bomb known about by VMware prior to 12 August? Yes. According to this article at NetworkWorld, VMware contacted at least one customer on Monday with advance warning. Thanks for the advance notice there, kids. Oh, wait... our shop didn't get any... (Yours truly is wondering, on behalf of paying customers everywhere, where the expletive was my phone call on the 11th?). In fairness, we did receive an email on the 12th, though. After it was on CNN. Kidding?

Why 36+ hours for a fix? (less the change-the-date hack). The initial estimate of noon 13 August Pacific time was later pushed back to 6 PM, leaving mattjk and thousands of others 48 hours from widespread bug-awareness-to-fix, unless you choose to take advantage of the advance patch *crosses fingers*. The full update (media and binary replacement) to ESX/ESXi has been repeatedly pushed back. The fourth email I received regarding this included: "...that we are experiencing a delay in releasing the new version of ESX/ESXi 3.5 Update 2. Our testing of this release is taking longer than anticipated...". NO RUSH, for everyone's sake, get it right. Let the 3.5 Update 2 early adopters roll back their clocks and continue to kick themselves.

How could any competent QA department let something as seemingly basic as recursive date testing slip by? I'm not even going there. Shit happens. Although, as shit-happening-goes, this is, like, sperm-whale-dump-huge.
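For the curious, here is roughly what "date testing" could look like. This is a minimal sketch, not VMware's code: nobody outside the company has seen the real license check, so `BUILD_EXPIRY`, `can_power_on`, `first_failing_date` and the release day are all hypothetical names and values made up for illustration. The only facts borrowed from the incident are the 12 August cutoff and the roughly-two-week-old release.

```python
from datetime import date, timedelta

# Hypothetical expiry baked into a build. The real check is not public;
# only the 12 August 2008 cutoff comes from the incident itself.
BUILD_EXPIRY = date(2008, 8, 12)


def can_power_on(host_today: date) -> bool:
    """Stand-in for a license gate: refuse once the host clock reaches the expiry."""
    return host_today < BUILD_EXPIRY


def first_failing_date(check, start: date, horizon_days: int):
    """Walk the clock forward one day at a time.

    Returns the first date the check rejects, or None if it passes
    for every day in the horizon.
    """
    for offset in range(horizon_days + 1):
        day = start + timedelta(days=offset)
        if not check(day):
            return day
    return None


# Assumed release day, two weeks before the cutoff (an illustration, not a sourced date).
assumed_release_day = date(2008, 7, 29)

# Sweep three years ahead of release; a time bomb shows up as a non-None result.
print(first_failing_date(can_power_on, assumed_release_day, 365 * 3))  # 2008-08-12

# The change-the-date hack in miniature: an earlier host clock passes the gate again.
print(can_power_on(date(2008, 8, 11)))  # True
```

The point of the sweep is that the check takes the date as a parameter instead of reading the system clock, so a test can march it years into the future in a fraction of a second. A gate that calls the clock directly can only be tested by waiting, or by rolling the host clock around by hand.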

Is this cluster funk Diane Greene's baby, albeit unintentionally, having only recently left the company on 8 July? Who knows. Speculation is useless, and the

Numbers Don't Lie

Unless they're from the marketing department. Or a survey -- any survey. Anyway. Here are the numbers:

With their stock and market share at risk, how will VMware recover from this incident?

VMware's PR Department Will Lie Its Ass Off

Already in Australia, the great spin is underway. And an "oops, my bad" issued by Maritz, with a sad, puppy-dog-eyes accompanying photo of the new CEO to match. In the days and weeks following this incident, the caliber of expertise (or amateurism) of the PR campaign executed by VMware will dictate the degree of recovery or damage resulting from this incident. In the history of PR, there are a number of examples of massive incidents followed by stunning PR campaigns that either saved the ship, or resulted in utter defeat. (VMware should hire the PR firm(s) the tobacco industry uses).

Without a doubt, the greatest spin of the summer will be coming from the suddenly-in-the-spotlight VMware PR staffers as they scramble to make lemonade from...

So how can those (hopefully lying sonsofbitches) in VMware's PR make or break the future of VMware and ultimately save us from having to eat more Microsoft second-rate also-rans (read: dog crap), specifically Hyper-V? They'll have to hit home runs on several critical points:

  1. VMware's products and solutions are still world-class, industry-leading, top-of-the-line x86 virtualization nirvana. Workstation. ESX. ESXi. VMware Server. Converter. Lab Manager. (etc.) Features of the VI suite such as VMotion, HA & DRS -- and the unintended benefit, with a high-throughput WAN and offsite SAN replication, making DR (at least in theory), far easier than it ought to be.
  2. This incident will teach the VMware leadership a valuable lesson (... but not the folks in QA or whoever missed this. I hypothesize that somewhere middle management put the squeeze on the testers, and this slipped through accordingly. Middle management bullshit was one of the leading (falling) dominoes that ultimately brought down the STS Challenger). Fuck off and let your people do their jobs with a reasonable amount of time. Don't squeeze them. Don't predict an early release of The Next Big Thing to your customers. Let engineering and the related departments work their magic. For Pete's sake, piss off, pointy-haired asshats.
  3. VMware must continue to innovate, and stay ahead of the curve. Let IBM, Sun/Citrix/Xen, and most importantly Microsoft, eat dust.
  4. VMware's VI3 advanced features need to drop in price. Giving away ESXi is great. Now stop doubling the cost of my bloody servers because your customers (we) want to leverage Enterprise features. When Hyper-V becomes good enough, you'll price yourself right into oblivion.
  5. Stop apologizing in your updates over this and get back to work. Greene's departure was unsettling to say the least. This incident has really shaken our faith, and trust takes time to rebuild. Just get back to it, already.

Obviously, the faith of the user base and customer loyalty will need to recover and remain strong to sustain sales in the immediate future. To an extent, I would compare some of VMware's customers to the proverbial "Mac Zealots." Of course, that's overstated, but still, we are some seriously loyal customers. Some of us have been around since the early days of Workstation. I remember when Workstation lost its first tooth, awwww...

And in all fairness, I'm not bashing Redmond's Hyper-V (much). I just don't want a second-rate virtualization solution. In typical Borg mentality, they assimilated the shop that wrote the book on PC-on-Mac virtualization. Why? Because a) they needed to get in to the market, and b) cheap! You know what else was cheap? QDOS. Quick & Dirty OS -- a synonym for dog shit. QDOS was the predecessor to MS-DOS. MS-DOS, in turn, was the foundation for basically every consumer-edition of Windows, (before XP, or NT 5.1), so that's Windows 1.0, 2.0, 286/386, 3.0, 3.1x, 95, 98 & SE and Me--and did THAT suck (everything without the NT kernel). Who's tired of eating Redmond's dog poo? (Hell, who isn't?)

(Note: Although some editions of Windows NT, e.g. Windows 2000 Professional, could be purchased & used by consumers, (and were, in fact, to an extent), that use represents a minority, and as such is generally outside the scope of this document).

Public Relations Disaster Recovery

Is there truly no such thing as bad press? If VMware's PR response gets this right -- which even those of us outside of the PR industry could appreciate -- this save could be one of the biggest corporate comebacks since Exxon-Valdez. Or the AOL DB SNAFU. Exaggeration? Watch their stock continue to drop.

On the flip side, not to defend the *cough* jackass *cough* early adopters, there are some interesting, perhaps cutting-edge features available in 3.5 update 2. I'll dial back the VMware horn-tooting just a bit, obviously being a huge fan of their products (and a shameless attention whore). Among an array of new features, many deserve the attention they're getting, including:

If you were running this two-week-old release of ESX/ESXi in production, I have one question: why? Were early ESX adopters trying to stay ahead of the service console security patches by upgrading to a newer version? Sure, it might be nice to get a newer patched release up and running, but is it worth the risk?

The First Step is Admitting You Have a Problem

VMware is in a precarious position. This high-profile, customer-crippling snafu comes just weeks after the (wrongful) termination of the CEO, Diane Greene. Greene helped build the company to where it is today, 1.3 *pinky-to-lips* Billion dollars in revenue last year. These incidents, coupled with the seemingly inevitable departure of Greene's husband, VMware's chief scientist, make even the most avid VMware groupie nervous. The Next Big Screw-up could very well be the nail in their coffin, effectively creating a glass ceiling at the #2 position in the virtualization industry, and giving Redmond the window it needs to drive QDOS, oops, Hyper-V to the head of the hypervisor lunch line.

VMware would do well to beware the 800-lb gorilla: The next slip-up could very well have Microsoft eating their lunch. The next thing you know, you're like Citrix, Symantec, or whoever: suddenly your tertiary markets are your primary, and last year's core product is this year's Banyan VINES.

Early Adopters are Idiots

Especially handing over something right out of the box to your customers. (Brilliant! Jackass). There are a ton of reasons why the iPhone, with all the sex-appeal the consumer electronics market can stand, is so far away from enterprise adoption. It's buggy & unstable. It's easily cracked -- even post-enterprise management. It's just not ready for prime time. The Xbox 360, since its release, has been fraught with failures. Redmond had to work furiously to extend the warranties, hide the numbers, and backpedal enough to stem the likely tide of a class-action suit.

Each IT shop is obviously unique in its own respects. If the nature of your work dictates staying on or near the leading edge, then chances are you're no stranger to the occasional show-stopper. While I can completely empathize with that driving force, in retrospect, staying one or two ESX (really VI) revisions behind might be more reasonable. If it's a year old and supported, what's the rush? Yes, some of the 3.5 update 2 features could potentially be very useful, but are any of them must-haves? Server 2008 is available in 3.5 update 1, and worked under different profiles (albeit unsupported) well before that.

Monkey Developers

Steve Ballmer must just be dancing naked around his office, like Robin Williams in that Central Park scene from The Fisher King. Just as giddy as a stoned tard. Seriously, the man responsible for my favorite little web jingle, and the leadership at MS as a whole, must be more than a little pleased at this colossal blunder. First off, because for once it isn't them making the colossal blunder, and second, whatever hurts VMware is good for Hyper-V. Even if Hyper-V is an adolescent, amateur-night, two-drink-minimum, you-must-be-shitting-me-you're-running-what-product.

This incident will strengthen the argument of running multiple hypervisors, (merciful Zeus, how I hate that ridiculous word), from multiple vendors. See Paul Korzeniowski's article here. Not just Redmond, this has to be good news for Citrix, Xen, Sun, Oracle, IBM, all of the other me-too kids on the x86-virtualization-block. Users now hesitant to take an initial plunge with the free ESXi, much less the free VMware Server (the 2.0 beta is great), will undoubtedly consider one of the alternatives. If you're a Microsoft shop, and you're already licensed for Hyper-V as part of Server 2008, why not?

That was rhetorical, dammit.

In Closing

Yes, VMware dropped the ball. Yes, it will compound their temporarily-declining stock price. Furthermore, it's ridiculous that such decline led to the ousting of Diane Greene. The market is so ridiculously volatile, if you aren't cooking the books to keep the analysts at bay, honestly -- you could sneeze -- and see your numbers dip. That said, this ain't no sneeze, senator. This is a genuine cluster-f#ck of enormous-f#cking-proportions.

There's a good summary over at bmighty titled "Business Lessons From The VMWare Bug (And How It Was Handled)". Many good points and worth a read.

In the absence of a widely-adopted FOSS virtualization suite, (hypervisor just sounds so... marketing), VMware is the best contender with sufficient market share to keep the comparative Microsoft product where it belongs: lightly used, and often compared to the likes of *giggle* AOL.

These production-early-adopters and VMware's dodgy PR team will prove what I've suspected all along: the creed of Gregory House, M.D. is spot-on: all people are idiots, and everybody lies.

VMartums

Many thanks to Lyger, the legendary editor-in-chief of my loquacious train wrecks. And to Big J for letting me contribute. Insert your ball-shriveling-disclaimer here.

Copyright 2008 by Martums. Permission is granted to quote, reprint or redistribute provided the text is not altered, and appropriate credit is given.

