My first inclination was to have a seperate table for each word length, i.e. have 31 tables capable of storing words up to 31 chars long, though this becomes a bit cumbersome when finding the correct ID for batch of 100 words or so (i.e. 31 queries).
At the mo, I have this table
CREATE TABLE word_dictionary (
id mediumint(9) NOT NULL default '0',
hash smallint(5) unsigned NOT NULL default '0',
word varchar(32) NOT NULL default '',
PRIMARY KEY (id),
KEY hash (hash),
KEY hashword (hash,word)
) TYPE=MyISAM;
So if I grab a document from the web, and tokenise it, I want to be able to lookup this table and grab the ID's for each unique word as quick as it can be done. This table could grow up to about a million records, probably more.
The hash column is indexed, and the hash and word column are concatenated into a single key to prevent dupes being inserted.
Hash Column
The hash is 16 bits long
-5 Bits are used to indicate length
-2 Bits are used to indicate 1st two chars of word (only lowercase a-z is stored so 5 bits can be used for each unique char)
-1 Bit to say whether next char is between a-m or n-z
e.g. "webmasterworld"
Length - 14 chars - 01110
1st chars - WE- 10101 00101
next char - 0
HASH = 0111010101001010 = 30026
In practice, a couple of dozen hash keys would be looked up per query, not just the one...so all unique words on a page are queried once against the DB and their associated ID's are gathered
Word Column
- First 2 chars are stripped from word and can be found in hash
- The 'hashword' unique index prevents dupes from being entered on INSERTS
Looking up a word
So say the DB is full of words, and I look up a hash key, it may return more than one result for the hashkey but the query will be quicker searching on the SMALLINT hash key rather than the VARCHAR field of the word.
With all the words that fall under a hash, I'd be using PHP to find the exact match and gather the wordID from there by trundling through an array for the match.
Would any of you say this is one of the better ways to go about this? Tweaking all these bits is a bit of a fine art, I've toyed with a few variations of how to a word dictionary working at its best - so comments,hints and pointers are much welcome.
The slowest part in this whole thing is looking up the ID's in the DB...whichever way you cut it it takes more time than i'd like it to :)
I'm willing to sacrifice space for speed, so here's another suggestion:
1. A huffman tree has been built after looking at the chars used in each word. Each character takes between 4-6 depending on its frequency.
2a. words are assigned a 64bit BIGINT, if the word can be fitted inside 63 bits, the 1st bit is set to 0 and the word is hashed into the 64BIT number
2b. if the word is larger than 63 bits, the first bit is set to 1 a mysql table is searched to match the word. The ID of the corresponding word is then placed in the 63 bit hash.
The idea is that most words will be shorter than 10/11/12 chars and won't need a DB to find its corresponding hash.
Larger words are more sparsely used and (hopefully) will be quick to find within a table that contains fewer records.
Although most words will be smaller than the 8bytes used in the BIGINT to store the hash, I figure it's one of the more quicker ways to turn in a word into a more math friendly number.
What do y'all think, I've heard the saying "theres more than one way to skin a cat" but this turning words to ints and hashing scene really takes the michael!