My two cents on Watson
Feb. 16th, 2011 10:16 am
First of all, this is a demonstration, not an experiment. And of course it's not fair; "Jeopardy!" contestants at this level know almost all of the answers, and it's really all about who buzzes in first. A computer will almost always buzz in first. Game over. And I'm surprised by how many people seem to think that the contestants don't get to read the question onscreen as soon as it's revealed.
So to everyone who's whining about this being neither an honest experiment nor a level playing field, my advice is to sit back and relax; enjoy the show.
But it's been interesting to look at Watson's mistakes. Getting "finis" instead of "terminus" is well within the realm of mistakes that a human player might make; so too "chic" instead of "class".
But "Toronto" was interesting for two reasons. First, that was an eminently fair question, as demonstrated by the fact that Ken, Brad, and I (heh) all got it right. Yet it involved too many levels of indirection for Watson's pattern-matching to make any headway. Here's a case where I would have loved to have the top three answers visible. And second, it was a classic example of where the computer's answer was not even in the correct category; this is the kind of mistake that makes smug humans laugh but it's actually far more impressive that Watson's filtering usually works. (This was one of the things that the Nova episode last week went into.) [Edited to add: the principal researcher at IBM explains that because FJ categories in particular are often misleading, Watson downplays the importance of the category title in weighing its responses. The question could just as well have been "In 1897, Boston was the first U.S. city to build one of these" --- the response to which would not have been the name of a U.S. city.]
It was a nice touch that the programmers adopted the "Jeopardy!" conventions of adding a bunch of question marks at the end to indicate uncertainty, just as they did with the "I'm going to have to guess" on the Daily Double where Watson's confidence level was low.
The Daily Double wagers were also interesting. Watson, like a human player, assessed the current scores and estimated its odds of success, and then computed a wager that maximized its expected utility. The only difference is that humans can carry one and a half significant digits in their heads, and Watson didn't have that limitation. So a bet of $947 makes perfect sense, and if it could have gotten more precise by wagering cents and fractional cents I'm sure it would have done so. [Edited to add: IBM explains the betting subsystem]
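To make the "maximized its expected utility" idea concrete, here is a minimal sketch. It is not IBM's betting subsystem: the log-of-score utility, the `best_wager` name, the $5 minimum, and the cap at the current score are all assumptions chosen for illustration, and it ignores the opponents' scores entirely. It only shows why an optimizer that isn't limited to round numbers ends up with bets like $947.

```python
import math

def best_wager(score, confidence, min_wager=5):
    """Return the whole-dollar wager with the highest expected utility.

    Assumptions (illustrative only, not Watson's actual model):
    utility is log(final score + 1), and the wager is capped at the
    current score.
    """
    if score < min_wager:
        raise ValueError("score must be at least the minimum wager")

    def expected_utility(wager):
        win = math.log(score + wager + 1)    # utility if the response is right
        lose = math.log(score - wager + 1)   # utility if the response is wrong
        return confidence * win + (1 - confidence) * lose

    # Brute force over every legal whole-dollar wager.
    return max(range(min_wager, score + 1), key=expected_utility)

if __name__ == "__main__":
    print(best_wager(5400, 0.68))
```

Under these assumptions the optimum sits near (2 * confidence - 1) * (score + 1), which is rarely a round number. A coin-flip confidence yields the minimum bet, and certainty yields a true Daily Double.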
One point where I was impressed was when Watson correctly pronounced "Jean Valjean." I was also glad to see they got the Roman numerals bug fixed so that "Henry VIII" was said correctly; I do wonder whether they put in the exception for "Malcolm X". (I also wonder if my GPS would correctly handle "Houston St.", but that's another matter.)
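For the curious, the Roman-numeral problem can be sketched in a few lines. This is a toy, not whatever text-to-speech front end Watson uses: the `speakable` function, the `EXCEPTIONS` set, and the ordinal table are hypothetical names of my own, and the parser only understands I, V, and X.

```python
import re

# Names where a trailing capital letter is NOT a regnal number (assumed list).
EXCEPTIONS = {"Malcolm X"}

ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth",
            "Seventh", "Eighth", "Ninth", "Tenth", "Eleventh", "Twelfth",
            "Thirteenth", "Fourteenth", "Fifteenth", "Sixteenth"]

ROMAN = {"I": 1, "V": 5, "X": 10}

def roman_to_int(numeral):
    """Naive subtractive-notation parser; does not reject malformed input."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller digit before a larger one is subtracted (IV, IX).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total

def speakable(text):
    """Expand 'Name VIII' to 'Name the Eighth', skipping known exceptions."""
    def expand(match):
        if match.group(0) in EXCEPTIONS:
            return match.group(0)
        number = roman_to_int(match.group(2))
        if 1 <= number <= len(ORDINALS):
            return f"{match.group(1)} the {ORDINALS[number - 1]}"
        return match.group(0)  # outside the table: leave unchanged
    return re.sub(r"\b([A-Z][a-z]+) ([IVX]+)\b", expand, text)

if __name__ == "__main__":
    print(speakable("Henry VIII"))
    print(speakable("Malcolm X"))
```

The exception list is the whole point: without it, the same rule that fixes "Henry VIII" turns "Malcolm X" into "Malcolm the Tenth".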
I also was amused that Watson's preferred answer for "reinstate" was "reinstate 2" --- this, of course, is because the clue's "To bring back someone to his original function or position" matches "reinstate 2 : to restore to a proper condition : replace in an original or equivalent state" in the MW unabridged (or the equivalent in whatever dictionary they're using).
Overall, IBM got their money's worth. They've clearly demonstrated that a bank of POWER7-based servers running their very sophisticated software can effectively data mine a large data set that is not very structured, and can quickly assign confidence values to its results.
What I'd love to see is a similar demonstration -- roughly based on the "Jeopardy!" format, except that all three contestants respond to every clue, and perhaps with each clue being scored with a daily-double-like wager -- with Watson versus equivalent systems from Google and Microsoft.
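The scoring for that format is simple enough to sketch. Everything here is hypothetical (the `Response` record, the `score_clue` function, the rule that a wrong response costs the full wager); it's just one way the "everybody answers, everybody wagers" idea could be tallied, with the buzzer taken out of the picture.

```python
from dataclasses import dataclass

@dataclass
class Response:
    contestant: str
    wager: int      # chosen before the response is judged, Daily Double style
    correct: bool

def score_clue(scores, responses):
    """Return new running totals after every contestant answers one clue."""
    updated = dict(scores)
    for r in responses:
        delta = r.wager if r.correct else -r.wager
        updated[r.contestant] = updated.get(r.contestant, 0) + delta
    return updated

if __name__ == "__main__":
    totals = {"Watson": 1000, "Google": 1000, "Microsoft": 1000}
    totals = score_clue(totals, [
        Response("Watson", 947, True),
        Response("Google", 500, False),
        Response("Microsoft", 300, True),
    ])
    print(totals)
```

Because each system is scored on every clue, the result would measure the quality of the answers and the calibration of the confidence estimates rather than reaction time.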
(no subject)
Date: 2011-02-18 10:55 am (UTC)
[edit: Oh, I should explain my appearance here. I found this page when I was Googling to find out why Watson had "reinstate 2" as its top answer on one of the questions, and this post answered it well! So thank you
(no subject)
Date: 2011-02-18 02:40 pm (UTC)
But I disagree with your first point. Watson was given an unambiguous signal of when to buzz in, and could respond with a fixed and small delay to it. Human players don't wait for the light to go on; they are listening to the pace of Alex reading the question (or so Ken Jennings explained on NPR last night, and I've heard this from friends who have been on the show) and anticipating when the light will come on --- which is why sometimes humans buzz in early and are locked out, which Watson will never do, and sometimes humans buzz in late and miss their chance to answer.
So Watson *does* have an advantage, and pressed it, and won because of it. Which is fine with me --- if Watson had been unable to come up with the correct response in a matter of seconds, it would not have been able to make use of its advantage in buzzing in, so it doesn't diminish IBM's achievement to note this.