Search is hard. Really hard. As a veteran of three search engines, some might call me jaded. However, this post isn’t just an attempt at apologetics. I’ve learned something from those three experiences that I’d like to share with you.
If you’re looking to start a profitable business, it’s reasonable to wonder: can a startup ever pop up and beat Google? As entrepreneurs, we always think that there’s a way to innovate our way out of a problem. I’m going to argue that no amount of innovation can feasibly compete with Google or Bing. Though marketing concerns like switching costs are certainly a problem for a search startup, I’m going to focus on an even deeper problem: the mighty greenback. When it comes down to it, building a search engine is an incredibly expensive proposition.
In order to make some calculations, I created an equation that has two major components: hardware and people. I’ll give some explanations for each of the components and, in the process, show you how complicated and expensive a search engine is to build.
- Storage/Crawling (SC) – Two years ago, Google estimated the Web to be over 1 trillion documents and it’s growing every day. A search engine needs to crawl and store the entire Web and keep frequently changing pages up-to-date. Plus, your index is bigger than just Web pages. You’ll be storing all sorts of metadata for the page: anchor text, outbound links, and any kind of interesting data you’ve extracted or created.
- Relevance (R) – At a bare minimum, your search engine will have to have results as good as Bing’s or Google’s. To achieve this, you’ll need servers to run relevance experiments, servers to store vast amounts of click data, independent judges, and a big team to deal with all of the irrelevant/spammy sites out there. And don’t think you’re done if you’ve just created 10 blue links. Users have come to expect lots of other services: image search, stock quotes, weather answers, and news search, just to name a few. You’re either going to have to license content or build verticals of your own. Both are expensive propositions.
- Runtime (RT) – When you do a search on Bing, you get a list of just 10 relevant results in fractions of a second from a set of possibly billions of Web pages. How is that possible? The short answer is: lots, and lots, and lots of computers to calculate your results. Search engines typically use some kind of divide-and-conquer methodology, which means that whenever you issue a search, hundreds, or possibly thousands, of computers are involved in bringing you back results.
- People (P) – It’s unlikely, if not impossible, that you’re going to build a search engine with your buddy from college in a garage. Worse, you need really expensive employees: PhDs in computer science, machine learning experts, statisticians, search engine veterans, infrastructure geniuses, etc. I estimate that it takes a bare minimum of 250 people to build a search engine. Yikes!
- Time (t) – Even with the smartest people, it’s going to take time to get everything running properly. I estimate at least two years to get all of the components working together.
- Johnson Coefficient (Ĵ) – No equation is complete without an eponymous component. However, I didn’t just stick in Ĵ because I love my last name. The Johnson coefficient is a real concern for innovative search engines. It’s all well and good if you’ve created a search engine that is equivalent to Bing or Google, but usually you want to make something better. Unfortunately, better usually translates into more data and/or more processing . . . which translates into more machines. Thus, the Johnson coefficient is the “tax” that innovative companies will pay for their innovation.
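To make the divide-and-conquer idea from the Runtime item concrete, here is a minimal sketch of a scatter-gather search in Python. It is a toy under stated assumptions, not how Bing or Google actually work: the `SHARDS` data, the term-counting `score` function, and all of the function names are hypothetical, and threads stand in for what would really be separate machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy shards: each maps a document ID to its text.
# In a real engine each shard would live on its own machine.
SHARDS = [
    {"d1": "cheap flights to paris", "d2": "paris hotels"},
    {"d3": "flights and hotels in paris france", "d4": "weather in rome"},
    {"d5": "paris paris paris travel guide"},
]


def score(query, text):
    """Toy relevance (an assumption): count occurrences of query terms."""
    words = text.split()
    return sum(words.count(term) for term in query.split())


def search_shard(shard, query, k):
    """Scatter step: a shard scores its own documents and returns its top k."""
    hits = ((score(query, text), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in hits if hit[0] > 0))


def search(query, k=10):
    """Gather step: merge the per-shard results into a global top k."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), SHARDS)
        merged = [hit for part in partials for hit in part]
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]


print(search("paris flights", k=2))
```

Each shard only ever returns its own best `k` hits, so the merge step stays cheap no matter how large the shards are. The expensive part is that every query touches every shard, which is why the machine count grows with the size of the index.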
I have two case studies: Powerset and SearchMe.
Powerset can’t be seen as a complete failure, since it was acquired by Microsoft. At Powerset, all of our linguistic processing caused our index to be at least 10-20x bigger than a standard search engine’s. Plus, the linguistic processing and matching we did at runtime was much more expensive than typical search engine retrieval. With about $25M injected into the company and 60 people, we were only able to index 2.5M Wikipedia documents. To have indexed the whole Web would have been much, much more expensive.
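For a rough sense of the gap, the Powerset figures can be set against the trillion-document estimate quoted earlier. This is back-of-the-envelope arithmetic only, using numbers from this post; the variable names are mine, and dividing total funding by documents indexed is a deliberately crude assumption.

```python
# Figures taken from this post.
POWERSET_FUNDING_USD = 25_000_000      # "about $25M injected into the company"
POWERSET_DOCS_INDEXED = 2_500_000      # Wikipedia documents indexed
WEB_DOCS_ESTIMATE = 1_000_000_000_000  # Google's "over 1 trillion" estimate

# Crude ratio: total funding spread over every document indexed.
funding_per_doc = POWERSET_FUNDING_USD / POWERSET_DOCS_INDEXED

# How many times larger the Web is than the Powerset index.
scale_factor = WEB_DOCS_ESTIMATE // POWERSET_DOCS_INDEXED

print(f"Funding per indexed document: ${funding_per_doc:.2f}")
print(f"The Web is roughly {scale_factor:,}x larger than the Powerset index")
```

That funding paid for salaries, research, and everything else, so $10 is not a true per-document cost, and costs would not scale linearly with index size. Still, a 400,000x gap in document count leaves a lot of ground to cover even with generous economies of scale.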
SearchMe was a search engine whose gimmick was to show thumbnails of Web pages in a fancy cover flow interface. Imagine the cost of processing, storing, and serving all of those images. It was no surprise to me when SearchMe shut its doors, since it claimed it needed another $50M in capital (!!!) to survive.
However, I encourage you to Keep Hope Alive! Though I don’t think that another general purpose search engine could compete with Google, there’s a lot of opportunity on the periphery for special purpose search engines and vertical search engines. And heck, if you’re lucky like Powerset, you might get bought by Bing or Google!
The moral of the story is that, next time you hear someone proclaiming how they are going to be the next Bing or Google, smile and nod, and keep your hard-earned money far away from an investment in that ill-fated venture.