Kirk Glerum Shares the Story of How “Watson” Came to Windows XP

Kirk Glerum was a longtime Microsoft employee who joined the company in 1987 as a software design engineer. Glerum spent most of his career at the Redmond software company working on Office. He is well known as the father of codename “Watson,” an error reporting system that helped developers fix bugs and that debuted in Office. Glerum is now semi-retired, and with Windows XP support ending today, he shared the story of how Watson came to Windows:

Microsoft officially, finally, and truly, ends support for Windows XP today.

Windows XP was the most important OS of my Microsoft days. Really, it was the most important product of my Microsoft days, even though I was an Office guy the whole time.

In ’98, while working on Office 2000, I came up with the idea of ‘Watson’, for the forthcoming Office XP. Instead of just having Office apps (Word Excel etc.) crash and burn, they would crash and report. They’d still crash, oh my yes, but us geeks back in Redmond would receive the data that (we believed) would allow us to fix the bugs.

I was full-on Office, thinking of little outside our multi-billion-dollar hemisphere. While giving a proposal on the idea, someone asked me if the Windows XP team was working on anything similar. I said (as recounted to me years later; I’d sort of forgotten) “I don’t know and I don’t care. I work on Office.” Well, as it happened, they had been, but it was a hodgepodge. Their plan, if it even rose to that level, was to collect a huge and random set of data, blit it back to a server farm in Redmond, gently agitate the drives overnight, and have the solutions appear. My scheme was hard-nosed and specific: we’d collect the stack, the globals, the heap, and organize it all around ‘buckets’. Two users crashing in the same spot would be in the same bucket, have the same ID. We would walk the stack – by hand if necessary – and produce an autopsy of the crash. Fix the bug, feed the update to the next guy hitting the same bucket, who – I posited – had probably hit the same bug.

The Windows team, to their credit, abandoned their plan and embraced mine. Windows actually wasn’t the first team outside of Office to do so: we’d already sold Watson to MSN and Internet Explorer. One of the big decisions for Windows was how broad to cast the net. Would they collect data on all application crashes, or just those happening in Microsoft code? I endorsed the latter, and was totally wrong. They correctly saw the benefit of getting it for *everyone*, and we were off to the races.

My job, besides being old and cranky, the Godfather of the thing, was to run the servers. Every crashing process talked to my code, on my servers. Another big decision was how to build those servers. At one point we figured there would be Office Watson servers for the Office crashes, MSN Watson servers for MSN, IE Watson servers for IE, and Windows Watson servers for everything else. That would have necessitated my regularly packaging up my code, and delivering it to my peers on the other teams. It would have meant schedules, and ship dates, and program management, all the horrid stuff I’d managed to leave behind. So out of sheer laziness, I said “oh heck, just send all your crashes to my servers”. One set of servers, under my thumb, mine all mine, for everything. This turned out to be precisely the right solution.

Those decisions meant that instead of my receiving crash data from a few dozen apps, I would have data from a few dozen million apps. I was to get crash data from every Windows application in the world. In most cases, amusingly, without the knowledge of the teams that had written them. The great majority of whom, less amusingly, never lifted a finger to use the data.

I became a Data Cowboy, and had the time of my life. For ten years I rode the bull, doing my level best to process that firehose, sorting and sifting, analyzing, aggregating, directing data to the teams to fix their bugs. And they did! Things crash a whole lot less now, and damn right I claim some credit.

The Glerums will open a bottle of champagne tonight, and toast the greatest of all Operating Systems. To Microsoft Windows XP, and to the team that built it, thank you for the opportunity of a lifetime.

Sincerely,
Kirk Glerum

Of course, Watson evolved over the years to become the Windows Error Reporting system. Internally, this had a huge impact on how Microsoft developed software. In 2002, Steve Ballmer noted that error reports enabled the Windows team to fix 29% of all Windows XP errors with Windows XP SP1. Over half of all Microsoft Office XP errors were fixed with Office XP SP2. Steven Sinofsky was famously a huge fan of ‘telemetry’ data, that is, data collected by the error reporting service. Many of the decisions made for Windows 8 were based on telemetry data instead of the traditional methods of feedback.
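
For readers curious about the ‘buckets’ Glerum describes, here is a minimal sketch of the idea in Python. It is only an illustration, not Microsoft's actual code: the `CrashReport` fields, the choice of key, and the hashing scheme are all assumptions made for this example. The one thing taken from the letter is the rule that two users crashing in the same spot land in the same bucket and share an ID.

```python
from collections import defaultdict
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class CrashReport:
    # All fields are hypothetical; the letter only says crashes
    # "in the same spot" share a bucket.
    app: str
    app_version: str
    module: str
    offset: int  # offset of the faulting instruction inside the module


def bucket_id(report: CrashReport) -> str:
    """Derive a stable ID from where the crash happened (assumed key)."""
    key = f"{report.app}|{report.app_version}|{report.module}|{report.offset:#x}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def group_by_bucket(reports):
    """Group incoming reports so identical crash sites share one bucket."""
    buckets = defaultdict(list)
    for report in reports:
        buckets[bucket_id(report)].append(report)
    return dict(buckets)


def busiest_buckets(buckets):
    """Order buckets by hit count, most-hit first, to prioritize fixes."""
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)


if __name__ == "__main__":
    reports = [
        CrashReport("winword.exe", "10.0", "mso.dll", 0x1A2B),
        CrashReport("winword.exe", "10.0", "mso.dll", 0x1A2B),
        CrashReport("excel.exe", "10.0", "mso.dll", 0x0042),
    ]
    for bid, hits in busiest_buckets(group_by_bucket(reports)):
        print(bid, len(hits))
```

Sorting buckets by hit count reflects the workflow in the letter: fix the bug behind a bucket, then offer the update to the next user who hits that same bucket.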

h/t Steven Sinofsky



About Author

Suril is a scientist, journalist and obsessive Microsoft observer. He holds an advanced degree in Biotechnology with minors in Biochemistry, Microbiology, and Molecular Biology. Send him tips on twitter: http://www.twitter.com/surilamin

  • Bugbog

    Must definitely have bled through to the end result of Windows Phone/WP8, as it crashes the least of all known O.S.’es I’ve used.

    • donzebe

      So true, windows 8 / windows phone are very solid O.S.

  • RichFrantz

    “They’d still crash, oh my yes,” hahhahahaha

  • truff

    Dr. Watson was around wayyyyy before Office 2000 and XP. I assume he means it was his idea to have the data transmitted over the internet in his ordered fashion?

    • http://www.regularspelling.com/ Daniel ‘sRc’ Cheney

      to be fair, old Dr. Watson is way different than Dr. Watson in XP. so sounds like it was yes his idea to collect the data and have it transmitted, and reappropriated the old Dr. Watson to be the core of the new system

  • Vitor Canova

    “…Many of the decisions made for Windows 8 were based off telemetry data instead of the traditional methods of feedback…”

    And because of this they made all those changes that many people complained about. I follow the Build Windows Blog and they documented all the steps they used to make decisions based on that telemetry data.

    It ended up not being the best way to get feedback. Take the way they removed the Start Menu because the majority of people just hit the WinKey and find what they want. Now, after real feedback, the Start Menu is coming back in a new form.