What I want to do is write a simple PHP script that adds each word to the words table I have created, whose only fields are "id" and "word". There would be about 50,000 rows at the end, but the process should be easy to install. I just don't know how to do it at all. :(
The system I want is extremely simple: a $string holding the text to be spell checked, and then a for-each-word loop that checks each word against the db. If anyone can help me write the for-each-word loop, that would be extremely helpful.
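For the import half of the question, here is a minimal sketch of one way to load a word list into that table. It assumes a plain text file with one word per line and a PDO connection; the file name, credentials and both function names are made up for illustration, and only the "words" table with its "word" column comes from the post above.

```php
<?php
// Sketch only: assumes a table `words` with an auto-increment `id` and a
// `word` column. File name and function names are hypothetical.

/** Trim, lowercase and de-duplicate raw lines from a word list. */
function normalize_word_list(array $lines): array
{
    $words = [];
    foreach ($lines as $line) {
        $word = strtolower(trim($line));
        if ($word !== '') {
            $words[$word] = true; // array keys de-duplicate for us
        }
    }
    // Numeric-looking keys come back as ints, so cast everything to string.
    return array_map('strval', array_keys($words));
}

/** Insert every word with one prepared statement inside a transaction. */
function import_words(PDO $pdo, array $words): int
{
    $stmt = $pdo->prepare('INSERT INTO words (word) VALUES (?)');
    $pdo->beginTransaction();
    foreach ($words as $word) {
        $stmt->execute([$word]);
    }
    $pdo->commit();
    return count($words);
}

// Usage (not run here, it needs a real connection and a word list file):
// $pdo = new PDO('mysql:host=localhost;dbname=spell', 'user', 'pass');
// $lines = file('wordlist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// echo import_words($pdo, normalize_word_list($lines)), " words added\n";
```

Wrapping the inserts in a single transaction is a common way to keep a 50,000-row import reasonably quick, though the gain depends on the database engine.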
One way is...
$word = strtok($str, ' ');
while ($word !== false) {
// do whatever with $word here...
$word = strtok(' ');
}
Another is...
$str_arr = explode (' ', $str);
foreach ($str_arr as $word) {
// do whatever with $word here...
}
Or yet another is...
$str_arr = explode (' ', $str);
array_map("check", $str_arr);
function check($word) {
// do whatever with $word here...
}
There are other ways, too, I'm sure ;)
You'll probably need something like the levenshtein() function [php.net] or the similar_text() function [php.net] to score how close two words are, or the soundex() function [php.net] to make a key for each word, and then find the best match against your DB (just thinking out loud).
Jordan
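To make the suggestion idea concrete, here is a rough sketch that ranks candidate words by edit distance using levenshtein(). The function name suggest_words(), the cutoff of 2 and the sample candidates are assumptions for illustration; in practice the candidates would come from the words table, and similar_text() or soundex() could be swapped in for the scoring.

```php
<?php
// Sketch only: suggest_words() and its parameters are hypothetical names.
// $candidates stands in for rows already fetched from the words table.

function suggest_words(string $word, array $candidates, int $maxDistance = 2): array
{
    $scored = [];
    foreach ($candidates as $candidate) {
        // levenshtein() counts single-character edits between two strings.
        $distance = levenshtein($word, $candidate);
        if ($distance <= $maxDistance) {
            $scored[$candidate] = $distance;
        }
    }
    asort($scored); // smallest edit distance first
    return array_map('strval', array_keys($scored));
}

print_r(suggest_words('speling', ['spelling', 'spoiling', 'apple']));
```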
Those functions are a good bet for suggestions, MonkeeSage.
Also, if you are doing multiple lines of text you will have to deal with all of the punctuation as well. Simple explodes and the like may not be quite enough. I assume the sentences will have to be rebuilt afterwards as well. You may need multiple splits/explodes and then separate storage for punctuation or something.
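One possible way to handle the punctuation and rebuilding problem, sketched under the assumption that a "word" is a run of ASCII letters and apostrophes: preg_split() with PREG_SPLIT_DELIM_CAPTURE keeps both the words and everything between them, so the text can be glued back together afterwards. The helper name is hypothetical.

```php
<?php
// Sketch only: split_keep_punctuation() is a hypothetical helper name.

function split_keep_punctuation(string $text): array
{
    // Runs of letters/apostrophes are captured as words; everything between
    // them (spaces, punctuation, line breaks) is kept as separate tokens.
    return preg_split(
        "/([A-Za-z']+)/",
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$tokens = split_keep_punctuation("Hello, wrold! It's fine.");
foreach ($tokens as $i => $token) {
    if (preg_match("/^[A-Za-z']+$/", $token)) {
        // spell check $token here; overwrite $tokens[$i] if it gets corrected
    }
}
// Joining the tokens gives back the original text, punctuation included.
echo implode('', $tokens), "\n";
```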
also thinking out loud. ;)
I have these images of loading the whole English language into an array and running constant checks against it and then the server starting to smoke. :)
...then the server starting to smoke.
ROFL! ...'the little spell checker that could'...'I think I can, I think I can...I know I can!...*POOF*' heheh ;D
Seriously though, you bring up some good points jatar_k.
To save on queries and comparisons, maybe the words (all cleaned of punctuation and all that) could be stored in a sorted array and then one small chunk of the DB could be grabbed at a time (e.g., all words that begin with the letter A), then all the array words that start with that letter could be processed, &c. But then you'd have to deal with original order vs. sorted order so the reconstruction of the sentence is correct...maybe a multidimensional array to hold the orderings or something...hmmm...
I think all I can say is "good luck bobnew32!" (and if your server dies, can I have its stereo?)
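A rough sketch of that chunking idea, just to show the ordering problem is manageable: group the words by first letter while keeping each word's original position as the array key, do one lookup per letter, then sort the results by position again. The $fetchChunk callback is a hypothetical stand-in for a query along the lines of "all words starting with this letter"; here it is faked with an array.

```php
<?php
// Sketch only: $fetchChunk is a hypothetical callback standing in for a
// query such as "SELECT word FROM words WHERE word LIKE 'a%'".

function find_misspelled(array $words, callable $fetchChunk): array
{
    // Group by first letter but keep each word's original position as key.
    $byLetter = [];
    foreach ($words as $position => $word) {
        $word = strtolower($word);
        if ($word === '') {
            continue;
        }
        $byLetter[$word[0]][$position] = $word;
    }

    $misspelled = [];
    foreach ($byLetter as $letter => $group) {
        // One lookup per letter; array_flip allows fast isset() checks.
        $known = array_flip($fetchChunk((string) $letter));
        foreach ($group as $position => $word) {
            if (!isset($known[$word])) {
                $misspelled[$position] = $word;
            }
        }
    }
    ksort($misspelled); // back to original sentence order
    return $misspelled;
}

// Fake dictionary chunks keyed by first letter, for demonstration only.
$dictionary = ['a' => ['apple', 'and'], 'p' => ['pear']];
$fetch = fn(string $letter): array => $dictionary[$letter] ?? [];
print_r(find_misspelled(['Apple', 'adn', 'pear', 'plumm'], $fetch));
```

The returned array is keyed by each word's position in the input, so the caller can map misspellings back to their place in the sentence.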
Jordan
could you split the words?
could you load chunks of the db based on a soundex?
could you try to grab the exact word and if not start running your comparisons?
could you just call dictionary.com and ask if they use open source software?
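The second and third questions above could look something like this: try an exact match first, and only if that fails compare against the words sharing the same soundex() key. This is a sketch with hypothetical function names, using an in-memory array in place of the words table; with a real database the soundex key would more likely be a stored, indexed column.

```php
<?php
// Sketch only: function names are hypothetical, and $dictionary stands in
// for the words table.

function build_soundex_index(array $dictionary): array
{
    $index = [];
    foreach ($dictionary as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

function check_word(string $word, array $dictionary, array $soundexIndex): array
{
    $word = strtolower($word);
    // Step 1: cheap exact match.
    if (in_array($word, $dictionary, true)) {
        return ['ok' => true, 'suggestions' => []];
    }
    // Step 2: only compare against words sharing the same soundex key.
    $bucket = $soundexIndex[soundex($word)] ?? [];
    usort($bucket, fn($a, $b) => levenshtein($word, $a) <=> levenshtein($word, $b));
    return ['ok' => false, 'suggestions' => $bucket];
}

$dictionary = ['spelling', 'spilling', 'checker'];
$index = build_soundex_index($dictionary);
print_r(check_word('speling', $dictionary, $index));
```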
'the little spell checker that could'...'I think I can, I think I can...I know I can!...*POOF*'
LMAO, exactly what I was thinking
The way I handled it was to use regular expressions on the string (a textarea usually) to be spell checked. With regular expressions I was able to pull out all HTML, punctuation, words in all caps, and words with numbers in them and put the leftovers (the words to be checked) into a string. Then I passed that string to a stored procedure in SQL Server which iterated through it and returned an output parameter that was populated with the misspelled words. BTW, I also add the misspelled words to a db table so they can be reviewed for possible inclusion in the main table.
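For anyone wanting to try the filtering step in PHP, here is a rough sketch of that kind of cleanup. The patterns are only one reading of the description above, not the poster's actual expressions, and the function name is hypothetical.

```php
<?php
// Sketch only: extract_checkable_words() is a hypothetical name, and the
// patterns are one reading of the filtering described above.

function extract_checkable_words(string $text): array
{
    $text = strip_tags($text); // drop HTML tags
    // Turn punctuation into spaces, keeping letters, digits and apostrophes.
    $text = preg_replace("/[^A-Za-z0-9'\s]+/", ' ', $text);
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $words = [];
    foreach ($tokens as $token) {
        if (preg_match('/\d/', $token)) {
            continue; // skip words containing numbers, e.g. "mp3"
        }
        if (preg_match('/^[A-Z]{2,}$/', $token)) {
            continue; // skip all-caps words such as acronyms
        }
        $words[] = strtolower($token);
    }
    return array_values(array_unique($words));
}

print_r(extract_checkable_words('<p>The NASA mp3 <b>playr</b> works, the end.</p>'));
```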
After that it's just a matter of once again using regular expressions to find each misspelled word in your string and apply some sort of formatting to it, like making the font color red. Then I display the message on a preview page, much like the preview page here on WW.
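The highlighting step might be sketched like this in PHP. It assumes the input is plain text (HTML input would need tag-aware handling), and the function name and the inline red style are illustrative choices only.

```php
<?php
// Sketch only: highlight_misspelled() is a hypothetical name. It assumes
// $text is plain text, not HTML.

function highlight_misspelled(string $text, array $misspelled): string
{
    if ($misspelled === []) {
        return htmlspecialchars($text);
    }
    // Build one alternation such as \b(wrold|teh)\b, escaping each word.
    $alternatives = array_map(fn($w) => preg_quote($w, '/'), $misspelled);
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

    // With DELIM_CAPTURE the pieces alternate: text, match, text, match...
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        $safe = htmlspecialchars($part);
        // Odd indexes hold the captured (misspelled) words.
        $parts[$i] = $i % 2 ? '<span style="color:red">' . $safe . '</span>' : $safe;
    }
    return implode('', $parts);
}

echo highlight_misspelled('Hello wrold, teh end', ['wrold', 'teh']), "\n";
```

Escaping each piece separately keeps user text safe to display on a preview page while leaving the added span tags intact.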
If there are misspelled words the user can open a new window that takes them through the misspelled words one by one and offers alternatives from the db based on soundex values.
This dictionary has been running on my server for a few years now and has caused absolutely no server overload issues. The site I'm referring to hosts around 2,000 unique visitors per day, with about 10,000 uses of the spell checker, so your results may vary.
Proper db indexing, using stored procedures (if you can), and using compiled code instead of interpreted code all make a huge difference in terms of server load.
[Edited for clarity]
If you "shop" around you will find many free dictionaries with more than a million English words freely available.
SN
An educated person only has about a 20K-25K vocabulary
I suspect that is a measure of passive vocabulary (words people understand). I suspect that most people have much smaller active vocabs (words they use) than this. I remember reading that Shakespeare still has the largest vocabulary of any author in print, having used 16K words in his corpus. I think the second best was only something like 12K. I'll have to check on this now, though...
Tom