So I want to use something off the shelf. I'm using MySQL right now to index about 10 million records, but it needs 1 GB of RAM per million records to keep the index in memory. Anyone know of something better, faster and cheaper in hardware terms?
mm.
I'm busy rewriting our search engine and borrowing heavily from Page and Brin's Stanford paper on the original version of Google. So I hit the problem of phrase search and stopwords. When users perform an individual word search like:
cheap widgets
Then you just go look in your data barrels to see which documents contain 'cheap' and which documents contain 'widgets' and you serve up the intersection of the two data sets ordered by your ranking algo.
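A rough sketch of that intersection step, assuming a toy in-memory index. The `barrel` dict, the document IDs and the `score` function are all made up for illustration; a real barrel would live on disk and the ranking would be far more involved.

```python
# Toy inverted index: word -> set of document IDs containing it.
# The data and the ranking function are hypothetical placeholders.
barrel = {
    "cheap": {1, 2, 5},
    "widgets": {2, 3, 5},
}

def score(doc_id):
    # Stand-in for a real ranking algo: lower doc ID ranks higher.
    return -doc_id

def search(query):
    words = query.lower().split()
    if not words:
        return []
    # Look up each word; a word missing from the index matches nothing.
    doc_sets = [barrel.get(w, set()) for w in words]
    # Keep only the documents that contain every word.
    hits = set.intersection(*doc_sets)
    # Best-scoring documents first.
    return sorted(hits, key=score, reverse=True)

print(search("cheap widgets"))
```

The nice property is that each extra word can only shrink the result set, so rare words make the query cheaper, not more expensive.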
If users search for:
"cheap widgets"
it gets a bit more complex. 'cheap' must occur next to 'widgets' so you ask your first data barrel to find 'cheap' and the next data barrel to find 'widgets' at position (cheaps_position + 1).
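Here is a minimal sketch of the position check, assuming each word's entry stores the word positions per document. The `positions` structure and its contents are invented for the example and are not the actual barrel layout.

```python
# Toy positional index: word -> {doc_id: [positions of the word in that doc]}.
# Hypothetical data, purely for illustration.
positions = {
    "cheap": {1: [0, 7], 2: [3]},
    "widgets": {1: [8], 2: [10], 3: [0]},
}

def phrase_search(phrase):
    words = phrase.lower().split()
    # If any word is not indexed at all, the phrase cannot match.
    if not words or any(w not in positions for w in words):
        return []
    matches = []
    # Candidates come from the first word's entries.
    for doc_id, starts in positions[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in positions[w].get(doc_id, ())
                   for i, w in enumerate(words)):
                matches.append(doc_id)
                break  # one occurrence per document is enough
    return sorted(matches)

print(phrase_search("cheap widgets"))
```

Note that the first word drives the candidate list, so the cost of a phrase search depends heavily on how common that first word is.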
So that's all fine and dandy. The problem is with stopwords. Search engines love stopwords because it means you can drop them from your index. 'the' is too common, so why index it? No one is going to be stupid enough to search on 'the'....
Unless they're doing a phrase search. 'the chronicles of riddick' is a possible search. And it's a real nasty one because so many documents contain 'the'. So you're forced to index stopwords because smart ass users want to do phrase searches. ;)
So what I'm doing is creating a completely separate data barrel for stopwords. Without boring you with my database schema, what's interesting is that, as a result, phrase searches starting with a stopword ('the', 'and', 'or', etc.) will be slow, and searching for individual stopwords will be slow. So stupid users get a slow response. Specific users get a fast response. I like it.
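To give the idea without the schema, here is a loose sketch of how a query could be routed so that the big stopword barrel is only touched when it has to be. The stopword list, the barrel names and the `barrels_needed` helper are assumptions for illustration only, not the real implementation.

```python
# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "or", "of", "but", "if"}

def barrels_needed(query):
    """Return which barrels a query has to read: 'main', 'stopword', or both."""
    needed = set()
    for word in query.lower().split():
        # Stopwords live in their own (large, slow) barrel.
        needed.add("stopword" if word in STOPWORDS else "main")
    return needed

print(barrels_needed("cheap widgets"))              # only the main barrel
print(barrels_needed("the chronicles of riddick"))  # both barrels
```

Queries made only of ordinary words never pay for the stopword barrel, which is where the fast path for specific searches comes from.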
And what's maybe more interesting is that the slowest search you can do on Google is a phrase of stopwords:
"the and or"
"the but if"
Notice the seconds taken to execute the query. I've had it up to over 3 seconds at times.
Which brings me to my question. What is the slowest search possible on Google? What will make the beast work up a sweat?