Forum Moderators: phranque
I've just been working on a site with over 100,000 products, and getting a speedy response seems nigh on impossible. I'm having to do a full text search on a field of length 255-500 or so.
Anyone found any ways to do this quickly and successfully, even if it involves something drastic like a separate dedicated server or something?
is there something else you can do to fake it?
like a cross-reference hint table with indexes? in this approach, every word has an id and every product has an id. it involves three tables: the product table, the word table, and the xref table. you fill the word and xref tables at product load time.
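A minimal sketch of that three-table layout, using SQLite for brevity (the same schema works in MySQL; all table and column names here are illustrative, not from the original post):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE word    (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE xref    (word_id INTEGER, product_id INTEGER);
    CREATE INDEX idx_xref_word ON xref (word_id);
""")

def load_product(pid, name, description):
    """Insert a product and index its description words at load time."""
    db.execute("INSERT INTO product VALUES (?, ?, ?)", (pid, name, description))
    for w in set(description.lower().split()):
        db.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
        (wid,) = db.execute("SELECT id FROM word WHERE word = ?", (w,)).fetchone()
        db.execute("INSERT INTO xref VALUES (?, ?)", (wid, pid))

load_product(1, "widget", "red anodized aluminium widget")
load_product(2, "gasket", "red rubber gasket")

# find product ids whose description contains "red"
rows = db.execute("""
    SELECT product_id FROM xref
    JOIN word ON word.id = xref.word_id
    WHERE word.word = 'red'
""").fetchall()
print(sorted(r[0] for r in rows))  # [1, 2]
```

The point is that the lookup hits the index on `xref.word_id` instead of scanning 255-500 character text fields.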
Basically this means that you go through all your documents and make a collection of distinct words. You can improve this by removing common words (stopwords) like "the", "and", etc. The easiest way to recognize them is to determine how often they are found: if they appear in, say, 80% or more of your documents, don't worry about them. You can also reduce the size by stemming, but that's complicated (do a Google search for Porter's Stemmer).
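That document-frequency test can be sketched in a few lines (the 80% threshold and the sample documents are just for illustration):

```python
from collections import Counter

def find_stopwords(documents, threshold=0.8):
    """Return the words appearing in at least `threshold` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each word once per doc
    cutoff = threshold * len(documents)
    return {word for word, freq in doc_freq.items() if freq >= cutoff}

docs = [
    "the red widget and the blue gasket",
    "the blue widget",
    "the green gasket and the bolt",
    "the brass bolt",
]
print(sorted(find_stopwords(docs)))  # ['the'] -- it appears in all four documents
```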
Now that you have a list of words, go through your documents and keep track of which documents have which words in them. Your SQL table should have 2 fields:
WORD DOCUMENTID
Indexed on WORD. Now, to find all the documents with a given word, you simply run "SELECT * FROM <table> WHERE WORD=<word>;". The time SQL needs for this is minimal. Then intersect the result sets for multiple words and return the set of documents that contain all the words in question.
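One way to sketch that two-column table, again in SQLite (names and sample rows are illustrative): a GROUP BY/HAVING does the intersection in a single statement instead of combining result sets in application code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wordindex (word TEXT, documentid INTEGER)")
db.execute("CREATE INDEX idx_word ON wordindex (word)")

rows = [("red", 1), ("widget", 1), ("red", 2), ("gasket", 2), ("widget", 3)]
db.executemany("INSERT INTO wordindex VALUES (?, ?)", rows)

def docs_with_all(words):
    """Documents containing every word in `words`.
    Assumes no duplicate (word, documentid) rows."""
    placeholders = ",".join("?" for _ in words)
    q = f"""SELECT documentid FROM wordindex
            WHERE word IN ({placeholders})
            GROUP BY documentid
            HAVING COUNT(*) = ?"""
    return sorted(r[0] for r in db.execute(q, (*words, len(words))))

print(docs_with_all(["red", "widget"]))  # [1] -- only doc 1 has both words
```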
Of course, then comes the question of how to rank them - that's a science in itself and an unanswerable question.
Simply use MySQL 4's fulltext index capability. It has a configurable stopword list and automatically ignores words that appear in more than 50% of the rows. The newer version even supports blind query expansion, but that is still in beta.
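A sketch of what that looks like against a MySQL 4.x table (table and column names are assumptions, not from the original post):

```sql
-- Build the fulltext index once:
ALTER TABLE product ADD FULLTEXT (description);

-- Natural-language search, ranked by relevance:
SELECT id, name, MATCH (description) AGAINST ('red widget') AS score
FROM product
WHERE MATCH (description) AGAINST ('red widget')
ORDER BY score DESC;

-- Blind query expansion (MySQL 4.1.1+):
SELECT id FROM product
WHERE MATCH (description) AGAINST ('widget' WITH QUERY EXPANSION);
```

The same `MATCH ... AGAINST` expression in the select list and the WHERE clause is only evaluated once, and the score it returns is the relevance ranking mentioned further down the thread.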
I run a site that is a meta-search. We are inserting over 2000 records for each search into a table that has a fulltext index and over 300,000 records, then immediately querying that same table. We are getting very fast response times for the entire process. Here's the trick:
Get your own dedicated Linux server. Set up MySQL with the fulltext index on the column that contains the chunks of text. Then create a ramdisk and place the MYI index file that contains the fulltext index onto the ramdisk using a symbolic link. For a table your size, the index shouldn't be larger than 100 megs, and with one gigabyte of memory you can easily do this.
This forces the fulltext index to stay in memory giving you a speed increase on inserts and queries.
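Roughly, the setup looks like this (paths and sizes are assumptions; tmpfs stands in for a classic ramdisk, and mysqld must be stopped while the file is moved):

```sh
# create a 128 MB ram-backed filesystem
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=128m tmpfs /mnt/ramdisk

# move the MyISAM index file onto it and symlink it back
/etc/init.d/mysql stop
mv /var/lib/mysql/mydb/product.MYI /mnt/ramdisk/
ln -s /mnt/ramdisk/product.MYI /var/lib/mysql/mydb/product.MYI
/etc/init.d/mysql start
```

Keep in mind the contents of the ramdisk vanish on reboot, so you need to rebuild the index afterwards (e.g. with REPAIR TABLE or by restoring a copy of the .MYI file).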
Regarding the 'unanswerable question' of ranking your results. MySQL will rank by relevance for you. Just read the docs at [mysql.com...]
You also need to tweak your MySQL configuration. Most importantly, check the key_buffer parameter and make sure it's optimal for the amount of RAM you have. RAM is the key: the more you have, the larger the Linux filesystem cache, the more indexes you can shove in that brand new ramdisk of yours, and the faster you'll be.
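For example, a my.cnf fragment along these lines (the values are illustrative; size key_buffer to your RAM and the total size of your .MYI files):

```ini
[mysqld]
key_buffer      = 256M   # MyISAM index cache
ft_min_word_len = 3      # index shorter words than the default of 4
```

Note that changing ft_min_word_len requires rebuilding existing fulltext indexes before it takes effect.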
Of course if you're stuck on SQL Server, then you're basically S.O.L.
Mark.
ps: Sticky mail me if you need help setting it up.