Forum Moderators: phranque
I've just been working on a site with over 100,000 products, and getting a speedy response seems nigh on impossible. I'm having to do a full text search on a field of length 255-500 or so.
Anyone found any ways to do this quickly and successfully, even if it involves something drastic like a separate dedicated server or something?
is there something else you can do to fake it?
like a cross-reference hint table with indexes? in this approach, every word has an id and every product has an id. it involves three tables: the product table, the word table, and the xref table. you fill the word and xref tables at product load time.
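A minimal sketch of that three-table layout, using SQLite for brevity (the same schema works in MySQL; all table and column names here are illustrative, not from the original post):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE word    (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE xref    (word_id INTEGER, product_id INTEGER);
    CREATE INDEX idx_xref_word ON xref (word_id);
""")

def load_product(pid, name, description):
    """Insert a product and index its description words at load time."""
    db.execute("INSERT INTO product VALUES (?, ?, ?)", (pid, name, description))
    for w in set(description.lower().split()):
        db.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
        (wid,) = db.execute("SELECT id FROM word WHERE word = ?", (w,)).fetchone()
        db.execute("INSERT INTO xref VALUES (?, ?)", (wid, pid))

load_product(1, "widget", "red anodized aluminium widget")
load_product(2, "gasket", "red rubber gasket")

# find product ids whose description contains "red"
rows = db.execute("""
    SELECT product_id FROM xref
    JOIN word ON word.id = xref.word_id
    WHERE word.word = 'red'
""").fetchall()
print(sorted(r[0] for r in rows))  # [1, 2]
```

The point is that the lookup hits the index on `xref.word_id` instead of scanning 255-500 character text fields.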
Basically this means that you go through all your documents and make a collection of distinct words. You can improve this by removing common words (stopwords) like "the", "and", etc. The easiest way to recognize them is to determine how often they are found: if they appear in, say, 80% or more of your documents, don't worry about them. You can also reduce the size by stemming, but that's complicated (do a Google search for Porter's Stemmer).
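That document-frequency test can be sketched in a few lines (the 80% threshold and the sample documents are just for illustration):

```python
from collections import Counter

def find_stopwords(documents, threshold=0.8):
    """Return the words appearing in at least `threshold` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each word once per doc
    cutoff = threshold * len(documents)
    return {word for word, freq in doc_freq.items() if freq >= cutoff}

docs = [
    "the red widget and the blue gasket",
    "the blue widget",
    "the green gasket and the bolt",
    "the brass bolt",
]
print(sorted(find_stopwords(docs)))  # ['the'] -- it appears in all four documents
```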
Now that you have a list of words, go through your documents and keep track of which documents have which words in them. Your SQL table should have 2 fields:
WORD DOCUMENTID
Indexed on WORD. Now, to find all the documents with a given word, you simply run "SELECT * FROM <table> WHERE WORD=<word>;". The time SQL needs for this is minimal. Then intersect the result sets for multiple words and return the set of documents that contain all the words in question.
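One way to sketch that two-column table, again in SQLite (names and sample rows are illustrative): a GROUP BY/HAVING does the intersection in a single statement instead of combining result sets in application code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wordindex (word TEXT, documentid INTEGER)")
db.execute("CREATE INDEX idx_word ON wordindex (word)")

rows = [("red", 1), ("widget", 1), ("red", 2), ("gasket", 2), ("widget", 3)]
db.executemany("INSERT INTO wordindex VALUES (?, ?)", rows)

def docs_with_all(words):
    """Documents containing every word in `words`.
    Assumes no duplicate (word, documentid) rows."""
    placeholders = ",".join("?" for _ in words)
    q = f"""SELECT documentid FROM wordindex
            WHERE word IN ({placeholders})
            GROUP BY documentid
            HAVING COUNT(*) = ?"""
    return sorted(r[0] for r in db.execute(q, (*words, len(words))))

print(docs_with_all(["red", "widget"]))  # [1] -- only doc 1 has both words
```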
Of course, then comes the question of how to rank them - that's a science in itself and an unanswerable question.
Simply use MySQL 4's fulltext index capability. It has a configurable stopword list and automatically ignores words that appear in more than 50% of the rows. The newer version even supports blind query expansion, but that is still in beta.
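A sketch of what that looks like against a MySQL 4.x table (table and column names are assumptions, not from the original post):

```sql
-- Build the fulltext index once:
ALTER TABLE product ADD FULLTEXT (description);

-- Natural-language search, ranked by relevance:
SELECT id, name, MATCH (description) AGAINST ('red widget') AS score
FROM product
WHERE MATCH (description) AGAINST ('red widget')
ORDER BY score DESC;

-- Blind query expansion (MySQL 4.1.1+):
SELECT id FROM product
WHERE MATCH (description) AGAINST ('widget' WITH QUERY EXPANSION);
```

The same `MATCH ... AGAINST` expression in the select list and the WHERE clause is only evaluated once, and the score it returns is the relevance ranking mentioned further down the thread.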
I run a site that is a meta-search. We are inserting over 2000 records for each search into a table that has a fulltext index and over 300,000 records, then immediately querying that same table. We are getting very fast response times for the entire process. Here's the trick:
Get your own dedicated Linux server. Set up MySQL with the fulltext index on the column that contains the chunks of text. Then create a ramdisk and place the MYI index file that contains the fulltext index onto the ramdisk using a symbolic link. For a table your size, the index shouldn't be larger than 100 megs, and with one gigabyte of memory you can easily do this.
This forces the fulltext index to stay in memory giving you a speed increase on inserts and queries.
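Roughly, the setup looks like this (paths and sizes are assumptions; tmpfs stands in for a classic ramdisk, and mysqld must be stopped while the file is moved):

```sh
# create a 128 MB ram-backed filesystem
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=128m tmpfs /mnt/ramdisk

# move the MyISAM index file onto it and symlink it back
/etc/init.d/mysql stop
mv /var/lib/mysql/mydb/product.MYI /mnt/ramdisk/
ln -s /mnt/ramdisk/product.MYI /var/lib/mysql/mydb/product.MYI
/etc/init.d/mysql start
```

Keep in mind the contents of the ramdisk vanish on reboot, so you need to rebuild the index afterwards (e.g. with REPAIR TABLE or by restoring a copy of the .MYI file).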
Regarding the 'unanswerable question' of ranking your results. MySQL will rank by relevance for you. Just read the docs at [mysql.com...]
You also need to tweak your MySQL configuration. Most importantly, check the key_buffer parameter and make sure it's optimal for the amount of RAM you have. RAM is the key: the more you have, the larger the Linux filesystem cache, the more indexes you can shove in that brand new ramdisk of yours, and the faster you'll be.
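For example, a my.cnf fragment along these lines (the values are illustrative; size key_buffer to your RAM and the total size of your .MYI files):

```ini
[mysqld]
key_buffer      = 256M   # MyISAM index cache
ft_min_word_len = 3      # index shorter words than the default of 4
```

Note that changing ft_min_word_len requires rebuilding existing fulltext indexes before it takes effect.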
Of course if you're stuck on SQL Server, then you're basically S.O.L.
Mark.
ps: Sticky mail me if you need help setting it up.