Making a spell checker database

         

bobnew32

12:32 am on Oct 11, 2003 (gmt 0)

10+ Year Member



I recently found a .txt file that contains every word in the English language, one word per line.

What I want to do is write a simple PHP script that adds each word to the words database I have created, where the only fields are "id" and "word". There would be about 50,000 rows at the end, but the process should be easy to set up. I just don't know how to do it at all. :(
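A minimal sketch of such an import script follows. It assumes PDO with the SQLite driver (the thread never names the database), and import_words() is a hypothetical helper name. The table has the same two fields, "id" and "word".

```php
<?php
// Sketch: bulk-load a one-word-per-line file into a `words` table.
// Assumptions (not from the thread): PDO + SQLite, helper name import_words().

function import_words(PDO $db, string $path): int
{
    $db->exec('CREATE TABLE IF NOT EXISTS words (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL UNIQUE
    )');

    // INSERT OR IGNORE is SQLite syntax; it skips words already present.
    $insert = $db->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');

    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $lines = 0;
    // One transaction for the whole file instead of one commit per row.
    $db->beginTransaction();
    while (($line = fgets($handle)) !== false) {
        $word = strtolower(trim($line));
        if ($word === '') {
            continue; // skip blank lines
        }
        $insert->execute([$word]);
        $lines++;
    }
    $db->commit();
    fclose($handle);

    return $lines; // number of non-blank lines processed
}

// Usage with a throwaway file and an in-memory database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "apple\nBanana\n\ncherry\napple\n");

$lines = import_words($db, $path);
unlink($path);
```

Reading the file line by line with fgets() keeps memory use flat however long the word list is.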

GaryK

1:17 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What database are you using? Some of them include an import feature that makes it easy to create a table out of a text file or other input source.

ergophobe

2:41 am on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And by the way, once you do get it into a DB, make absolutely sure that you index not only the id column but also the word column; otherwise it could take forever to spell check a document.
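As a rough illustration of that advice, the sketch below adds an index on the word column and then does a lookup by word. SQLite through PDO is an assumption, and the index name idx_words_word and the helper is_known_word() are made up for the example.

```php
<?php
// Sketch only: SQLite through PDO; idx_words_word and is_known_word()
// are hypothetical names.

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL)');
$db->exec("INSERT INTO words (word) VALUES ('spell'), ('check'), ('document')");

// Without an index on `word`, each lookup has to scan the whole table.
$db->exec('CREATE UNIQUE INDEX idx_words_word ON words (word)');

function is_known_word(PDO $db, string $word): bool
{
    $stmt = $db->prepare('SELECT 1 FROM words WHERE word = ? LIMIT 1');
    $stmt->execute([strtolower($word)]);
    return $stmt->fetchColumn() !== false; // false means no row matched
}

$known   = is_known_word($db, 'Spell');
$unknown = is_known_word($db, 'speel');
```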

bobnew32

4:36 pm on Oct 11, 2003 (gmt 0)

10+ Year Member



Kool, I now have a 180,000 word database. Are there any code excerpts out there (not aspell or any of that crap) where people wrote their own spell check code?

The system I want is extremely simple. A $string with the information to be spell checked, and then a for each word loop to check each word against the db. If anyone can help me write the for each word loop that would be extremely helpful.

MonkeeSage

6:02 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



$str = "Thesse isd an inptu strig (that dspiratealy neds spelchekng!)";

One way is...

$word = strtok($str, ' ');
while ($word !== false) { // strict test, so a token like "0" does not end the loop early
    // do whatever with $word here...
    $word = strtok(' ');
}

Another is...

$str_arr = explode (' ', $str); 
foreach ($str_arr as $word) {
// do whatever with $word here...
}

Or yet another is...

$str_arr = explode (' ', $str); 
array_map("check", $str_arr);

function check($word) {
// do whatever with $word here...
}

There are other ways, too, I'm sure ;)

You'll probably need something like the levenshtein(), similar_text(), or soundex() functions (all documented on php.net): soundex() to make a key for each word, and levenshtein() or similar_text() to find the best match against your DB (just thinking out loud).
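To make that idea concrete, here is a small sketch that picks the closest dictionary word by edit distance. The in-memory array stands in for the DB table, and suggest() and its $maxDistance cut-off are hypothetical, not something from this thread.

```php
<?php
// Sketch: choose the dictionary word with the smallest Levenshtein distance.
// Assumptions: a tiny array replaces the DB; suggest() is a made-up helper.

function suggest(string $word, array $dictionary, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;

    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best; // null when nothing is within $maxDistance edits
}

$dictionary = ['these', 'is', 'an', 'input', 'string', 'that', 'needs'];
$str = 'Thesse isd an inptu strig';

$suggestions = [];
foreach (explode(' ', strtolower($str)) as $word) {
    if (!in_array($word, $dictionary, true)) {
        $suggestions[$word] = suggest($word, $dictionary);
    }
}
```

Comparing every word against the whole dictionary like this only works for small lists; with 180,000 rows the candidates would need to be narrowed first, for example with a soundex() key.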

Jordan

jatar_k

6:20 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



The two major hurdles are continuously searching the db for each word and then suggesting a good substitute for misspellings.

Those functions are a good bet for suggestions, MonkeeSage.

Also, if you are doing multiple lines of text you will have to deal with all of the punctuation as well. Simple explodes and the like may not be quite enough. I assume the sentences will have to be rebuilt afterwards as well. You may need multiple splits/explodes and then separate storage for punctuation or something.
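One possible way to handle this is sketched below: split the text into word tokens and non-word tokens, so punctuation and line breaks are kept and the text can be rebuilt exactly. tokenize() is a hypothetical helper, and treating only ASCII letters and apostrophes as word characters is an assumption.

```php
<?php
// Sketch: keep punctuation as its own tokens so the text can be rebuilt.
// tokenize() is a made-up helper name.

function tokenize(string $text): array
{
    // Runs of letters/apostrophes are captured as delimiters; whatever sits
    // between them (spaces, punctuation, newlines) becomes separate tokens.
    return preg_split(
        "/([A-Za-z']+)/",
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$text   = "Thesse isd an inptu strig (that neds spelchekng!)\nSecond line.";
$tokens = tokenize($text);

$words = [];
foreach ($tokens as $position => $token) {
    if (preg_match("/^[A-Za-z']+$/", $token)) {
        $words[$position] = $token; // remember where each word came from
    }
}

// Joining the tokens restores the input, punctuation and line breaks included.
$rebuilt = implode('', $tokens);
```

Because each word keeps its token position, a corrected word can be written back into $tokens at the same index before the text is joined again.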

also thinking out loud. ;)

I have these images of loading the whole English language into an array and running constant checks against it and then the server starting to smoke. :)

MonkeeSage

6:43 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...then the server starting to smoke.

ROFL! ...'the little spell checker that could'...'I think I can, I think I can...I know I can!...*POOF*' heheh ;D

Seriously though, you bring up some good points jatar_k.

To save on queries and comparisons, maybe the words (all cleaned of punctuation and all that) could be stored in a sorted array and then one small chunk of the DB could be grabbed at a time (e.g., all words that begin with the letter A), then all the array words that start with that letter could be processed, &c. But then you'd have to deal with original order vs. sorted order so the reconstruction of the sentence is correct...maybe a multidimensional array to hold the orderings or something...hmmm...
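A related way to cut down on queries, sketched here as an alternative to the letter-by-letter chunks, is to look up each distinct word once in a single IN (...) query and map the results back to the original positions. It assumes SQLite through PDO and PHP 7.4+ (arrow functions); find_unknown_words() is a hypothetical name.

```php
<?php
// Sketch: one query for all distinct words, original positions preserved.
// Assumptions: SQLite via PDO; find_unknown_words() is a made-up helper.

function find_unknown_words(PDO $db, array $words): array
{
    // Lower-case and de-duplicate so a repeated word costs one lookup.
    $unique = array_values(array_unique(array_map('strtolower', $words)));
    if ($unique === []) {
        return [];
    }

    // One placeholder per distinct word.
    $placeholders = implode(',', array_fill(0, count($unique), '?'));
    $stmt = $db->prepare("SELECT word FROM words WHERE word IN ($placeholders)");
    $stmt->execute($unique);
    $known = array_flip($stmt->fetchAll(PDO::FETCH_COLUMN));

    // array_filter keeps the original keys, so word order is not lost.
    return array_filter(
        $words,
        fn(string $w): bool => !isset($known[strtolower($w)])
    );
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE)');
$db->exec("INSERT INTO words (word) VALUES ('the'), ('cat'), ('sat')");

$unknown = find_unknown_words($db, ['The', 'catt', 'sat', 'the', 'catt']);
```

Databases cap the number of placeholders per statement, so a long document would need its word list split into batches.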

I think all I can say is "good luck bobnew32!" (and if your server dies, can I have its stereo?)

Jordan

jatar_k

6:47 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I keep running around with the idea of the spelling, letter by letter

could you split the words?
could you load chunks of the db based on a soundex?
could you try to grab the exact word and if not start running your comparisons?
could you just call dictionary.com and ask if they use open source software?
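The "exact word first, then a soundex chunk" idea from the list above could look roughly like this. It is a sketch under assumptions: SQLite through PDO, a hypothetical sdx column that stores soundex(word), and a made-up helper check_word().

```php
<?php
// Sketch: exact lookup first; only on a miss, load the soundex "chunk".
// Assumptions: SQLite via PDO, hypothetical `sdx` column and check_word().

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE, sdx TEXT)');
$db->exec('CREATE INDEX idx_words_sdx ON words (sdx)');

$insert = $db->prepare('INSERT INTO words (word, sdx) VALUES (?, ?)');
foreach (['string', 'strong', 'spring', 'input', 'impute'] as $w) {
    $insert->execute([$w, soundex($w)]); // key computed once, at import time
}

function check_word(PDO $db, string $word): array
{
    $word = strtolower($word);

    $exact = $db->prepare('SELECT 1 FROM words WHERE word = ?');
    $exact->execute([$word]);
    if ($exact->fetchColumn() !== false) {
        return []; // spelled correctly, nothing to suggest
    }

    // Miss: fetch only the words that share the same soundex key.
    $similar = $db->prepare('SELECT word FROM words WHERE sdx = ? ORDER BY word');
    $similar->execute([soundex($word)]);
    return $similar->fetchAll(PDO::FETCH_COLUMN);
}

$forKnown   = check_word($db, 'string');
$forUnknown = check_word($db, 'inptu');
```

Soundex keys are coarse, so some near misses fall into a different chunk: "strig" and "string", for instance, do not share a key.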

'the little spell checker that could'...'I think I can, I think I can...I know I can!...*POOF*'

LMAO, exactly what I was thinking

GaryK

7:46 pm on Oct 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've had personal experience with writing a spell checker (er, speel chucker) for a website.

The way I handled it was to use regular expressions on the string (a textarea usually) to be spell checked. With regular expressions I was able to pull out all HTML, punctuation, words in all caps, and words with numbers in them and put the leftovers (the words to be checked) into a string. Then I passed that string to a stored procedure in SQL Server which iterated through it and returned an output parameter that was populated with the misspelled words. BTW, I also add the misspelled words to a db table so they can be reviewed for possible inclusion in the main table.

After that it's just a matter of once again using regular expressions to find each misspelled word in your string and apply some sort of formatting to it, like making the font color red. Then I display the message on a preview page, much like the preview page here on WW.
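For readers working in PHP, the two regular-expression steps might be sketched as follows. This is not the SQL Server setup described above: the helper names, the patterns, and the red span markup are all assumptions made for illustration.

```php
<?php
// Sketch: pick out checkable words, then mark the misspelled ones.
// Helper names, patterns, and markup are hypothetical.

function extract_checkable_words(string $html): array
{
    $text = strip_tags($html); // drop HTML tags
    preg_match_all('/[A-Za-z0-9\']+/', $text, $matches);

    return array_values(array_filter($matches[0], function (string $w): bool {
        if (preg_match('/\d/', $w)) {
            return false; // skip words containing digits
        }
        if (strlen($w) > 1 && strtoupper($w) === $w) {
            return false; // skip ALL-CAPS words such as acronyms
        }
        return true;
    }));
}

function highlight(string $text, array $misspelled): string
{
    foreach ($misspelled as $word) {
        // \b keeps "cat" from matching inside "catalog".
        $pattern = '/\b' . preg_quote($word, '/') . '\b/';
        $text = preg_replace($pattern, '<span style="color:red">$0</span>', $text);
    }
    return $text;
}

$input  = '<b>Teh</b> HTML spec has 2nd thoughts, teh end.';
$words  = extract_checkable_words($input);
$marked = highlight('teh end of teh line', ['teh']);
```

The highlight step is deliberately naive: a misspelled word that also appears in the inserted markup (such as "span") would be wrapped a second time, so a real version would need to tokenize first.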

If there are misspelled words the user can open a new window that takes them through the misspelled words one by one and offers alternatives from the db based on soundex values.

This dictionary has been running on my server for a few years now and has caused absolutely no server overload issues. The site I'm referring to hosts around 2,000 unique visitors per day, with about 10,000 uses of the spell checker, so your results may vary.

Proper db indexing, using stored procedures (if you can), and using compiled code instead of interpreted code all make a huge difference in terms of server load.

[Edited for clarity]

BlueSky

8:56 pm on Oct 11, 2003 (gmt 0)

10+ Year Member



I hope you are aware that your 180K dictionary is only a subset of the English language. The total number of words is much higher...at least 3 million. About 200K of these are commonly used. Your list is probably a subset of this. An educated person only has about a 20K-25K vocabulary, but depending on your site's subject matter and people's background, you may find visitors using some words not found within that 180K.

killroy

12:03 am on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In my research and implementation of search engines and spell checkers I found the double metaphone algorithm to be greatly superior to soundex. Also a stemmed index (such as Porter stemmer 2) would help to quickly find possible matches. In my SE I use the DMP for reach and DMPO+stemmer for relevancy. The method is the same for a spell checker (finding alternative possible spellings). Once you have a large set of possible spellings, you can use a slower, more accurate function, such as edit distance, for ranking.
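The two-stage approach (a phonetic key to gather candidates, then edit distance to rank them) can be sketched as below. PHP does not ship Double Metaphone, so the built-in metaphone() is swapped in here; that substitution and both helper names are assumptions. Arrow functions need PHP 7.4+.

```php
<?php
// Sketch: phonetic key narrows the candidates, edit distance ranks them.
// Assumption: built-in metaphone() stands in for Double Metaphone.

function build_phonetic_index(array $dictionary): array
{
    $index = [];
    foreach ($dictionary as $word) {
        $index[metaphone($word)][] = $word; // group words by phonetic key
    }
    return $index;
}

function ranked_suggestions(string $word, array $index): array
{
    $candidates = $index[metaphone($word)] ?? [];

    // Slower, more accurate pass over the short candidate list only.
    usort(
        $candidates,
        fn(string $a, string $b): int =>
            levenshtein($word, $a) <=> levenshtein($word, $b)
    );

    return $candidates;
}

$index = build_phonetic_index(['input', 'inapt', 'inept', 'impute', 'string']);
$suggestions = ranked_suggestions('inpute', $index);
```

In a DB-backed version the phonetic key would be stored in an indexed column, so the first stage becomes a single indexed lookup.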

If you "shop" around you will find many free dictionaries with more than a million English words freely available.

SN

ergophobe

10:08 pm on Oct 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




An educated person only has about a 20K-25K vocabulary

I suspect that is a measure of passive vocabulary (words people understand). I suspect that most people have much smaller active vocabs (words they use) than this. I remember reading that Shakespeare still has the largest vocabulary of any author in print, having used 16K words in his corpus. I think the second best was only something like 12K. I'll have to check on this now, though...

Tom