While I was demoing Live Search at the Web 2.0 Expo, people continually asked me the same questions: “What makes Live different?” or “Show me some features that will make me want to switch from my search engine” or the extremely confrontational “Why do you think you’re better than Google?”
My first instinct was to dive in and show people the coolest features in Live Search (e.g., demoing Virtual Earth with an Xbox controller) or to let them play around with their own queries.
However, given my experience working for several startup search engines, I’ve come to realize that it's extremely difficult to convince someone that you’re better than another engine with words, features, or a few carefully chosen queries.
So, after a while, I started my demos with a caveat about the nature of a search engine: I implored my audience to try out Live Search for a week because, in the words of the immortal LeVar Burton of Reading Rainbow, “you don’t have to take my word for it.”
Is this a cop-out? Why is demoing search so hard?
Search “Features”
When showing off a new version of Microsoft Word or Typepad or Yahoo Messenger, a good product marketing person will not just demonstrate features, but analyze their audience and demonstrate benefits that help users accomplish specific tasks. (This is just product marketing basics.)
A search engine, by contrast, has an extremely simple interface: you type in some words and hope that the engine will cough up pointers to helpful Web sites or give you a direct answer. The inner workings of a search engine, i.e., how those results were produced, are completely opaque to the user. Hundreds of features are used to rank results so that the right Web sites and answers show up on a page when you type in some string of words. Those features don't surface as demonstrable chunks that can be easily summarized or understood.
Common mistakes when evaluating a search solution
Which brings me to the biggest mistake people make: judging a search engine by typing in a few queries and analyzing the results. There are many interrelated reasons that this methodology fails:
- A few good/bad results don’t mean that all results will be good/bad – even if you try out five searches and all are good, how do you know if your sixth is also going to be good? That is, since you don’t know what is going on under the hood, you can’t make any predictions about the quality of future results.
- It’s hard to select a representative cross-section of queries – people usually try out a few navigational queries, a vanity query, and a few queries that are either damn-near impossible or extremely obscure. None of these sets represents an accurate cross-section of your monthly query log.
- What you think is “good” may not be good for the majority of users – for navigational queries (e.g. “CNN.com”) the top result is clear. For more complicated queries, the top results are rarely obvious.
- Queries are out of context – we had this problem at SideStep all of the time. During usability studies, users who were simply evaluating the look-and-feel of the product and scanning for cheap flights without any end-goal never gave feedback as good as users who were actually trying to buy a flight for a real trip. A search engine should help you complete tasks, not just give you a pretty page or have links that look useful.
- People tend to focus on the first result – some queries just require one result. But many queries should be judged by the diversity of interesting results.
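To put a rough number on the first point above, here is a minimal sketch of how little five good results tell you. It assumes each query is an independent pass/fail trial (a big simplification) and uses a Wilson score interval, which is my choice of technique for illustration; the function name and the 95% z-value are assumptions, not anything a search team necessarily uses.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence).

    Assumes each query is an independent good/bad trial, which real
    queries are not; this is only meant to show the order of magnitude.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Five searches, all judged "good".
low, high = wilson_interval(5, 5)
print(f"5 of 5 good: true rate plausibly between {low:.2f} and {high:.2f}")

# The same perfect record over five thousand queries.
low_big, high_big = wilson_interval(5000, 5000)
print(f"5000 of 5000 good: between {low_big:.4f} and {high_big:.4f}")
```

Under these assumptions, a perfect five-for-five is still consistent with an engine that gets only a little over half of its queries right. The interval only tightens once the number of judged queries gets large.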
There are probably countless other mistakes that are made during solo evaluations of search. Therefore, search engines big and small realize that problems of ranking and relevance – the core of any search project – are solved only by lots and lots and lots of data from lots and lots and lots of people. To solve this data problem, we need to collect data from real users. For example, we run many thousands of queries past human judges and look at mountains of click data from the production site. After applying advanced statistical techniques to this data, we get the information we need to create algorithms that take your few (misspelled) words and turn them into a useful page of results.
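As a toy illustration of what "data from real users" buys you, here is a minimal sketch of drawing an evaluation set from a query log in proportion to how often each query actually occurs, rather than hand-picking favorites. The log format (a plain list of query strings), the function name, and the toy data are all hypothetical; this is not a description of how any production engine samples queries for its judges.

```python
import random
from collections import Counter


def sample_eval_queries(query_log, k, seed=0):
    """Draw k queries with probability proportional to their frequency
    in the log (sampling with replacement).

    query_log is assumed to be a list of raw query strings, one entry
    per search issued; that format is hypothetical.
    """
    counts = Counter(query_log)
    queries = list(counts)
    weights = [counts[q] for q in queries]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.choices(queries, weights=weights, k=k)


# Made-up log: a common navigational query, a real task, and a vanity search.
toy_log = (
    ["cnn.com"] * 50
    + ["cheap flights sfo to jfk"] * 30
    + ["my own name"] * 1
)

sample = sample_eval_queries(toy_log, k=1000)
print(Counter(sample).most_common())
```

The vanity query still shows up, but only about as often as it does in real traffic, so it no longer dominates your impression of the engine.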
As one of my colleagues at Powerset always likes to remind me: this is rocket science.