word dictionary - hashing ids, mysql efficiency - Perl Server Side CGI Scripting forum at WebmasterWorld

A bit of a lead from the thread on optimising queries [webmasterworld.com] and one of Brett's killer posts [webmasterworld.com], I wonder what's the best way to deal with a word dictionary stored in a database. I keep reading about hash functions and would like to find the best way for using hashes to speed up lookups for words and their ID's.

My first inclination was to have a seperate table for each word length, i.e. have 31 tables capable of storing words up to 31 chars long, though this becomes a bit cumbersome when finding the correct ID for batch of 100 words or so (i.e. 31 queries).

At the mo, I have this table

CREATE TABLE word_dictionary (
id mediumint(9) NOT NULL default '0',
hash smallint(5) unsigned NOT NULL default '0',
word varchar(32) NOT NULL default '',
PRIMARY KEY (id),
KEY hash (hash),
KEY hashword (hash,word)
) TYPE=MyISAM;

So if I grab a document from the web, and tokenise it, I want to be able to lookup this table and grab the ID's for each unique word as quick as it can be done. This table could grow up to about a million records, probably more.

The hash column is indexed, and the hash and word column are concatenated into a single key to prevent dupes being inserted.

Hash Column
The hash is 16 bits long
-5 Bits are used to indicate length
-2 Bits are used to indicate 1st two chars of word (only lowercase a-z is stored so 5 bits can be used for each unique char)
-1 Bit to say whether next char is between a-m or n-z

e.g. "webmasterworld"
Length - 14 chars - 01110
1st chars - WE- 10101 00101
next char - 0

HASH = 0111010101001010 = 30026

In practice, a couple of dozen hash keys would be looked up per query, not just the one...so all unique words on a page are queried once against the DB and their associated ID's are gathered

Word Column
- First 2 chars are stripped from word and can be found in hash
- The 'hashword' unique index prevents dupes from being entered on INSERTS

Looking up a word
So say the DB is full of words, and I look up a hash key, it may return more than one result for the hashkey but the query will be quicker searching on the SMALLINT hash key rather than the VARCHAR field of the word.

With all the words that fall under a hash, I'd be using PHP to find the exact match and gather the wordID from there by trundling through an array for the match.

Would any of you say this is one of the better ways to go about this? Tweaking all these bits is a bit of a fine art, I've toyed with a few variations of how to a word dictionary working at its best - so comments,hints and pointers are much welcome.

word dictionary - hashing ids, mysql efficiency

small and fast word lookup

brotherhood of LAN

brotherhood of LAN

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week