Forum Moderators: coopster
break words into array of unique words
does the word exist in the word table
>> if yes, look for an occurance on the current page. If there is no occurance, make a new one, else update current occurance count for the page.
>> if no, create an new entry for the word, and a new occurance.
Not sure if thats too readable, but basically I'm stuck with 3 queries per word. If I have 500 unique words on the page then it's 1500 queries per published article. The articles will be kind of fluid in nature and need to be edited regularly (which means words and occurances will change each time an article is changed).
What I'm thinking is perhaps I can run some sort of table join between the word and occurance tables and then test together whether the word and occurance exists, or write something that reduces the amount of calls I have to make to the database.
My table structure looks like this:
Words:
------
word_id
word
Occurances:
-----------
Occurance_id
Word_id
Page_id
Does anyone have any advice on reducing the number of queries I have to perform for each word, or building something a little more efficient?
document{word}->#occurrences
To insert the document details into the database:
for each word {
insert into index_table values(page_id, word, #occurrences)
}
To extract documents containing a particular word:
select page_id, #occurrences from index_table where word = $word
You might consider building yourself one more table that contains StopWords like those suggested above. Too, if you're planning on displaying 'Latest Searches' you may want to add some foul language words to your StopWords repository so that these items don't appear on your site.
The search class I've built thus far filters stopwords already, but I might also impliment a bad words filter as suggested. Interesting idea to combine the words into the occurance table. That would solve my multiple query problem but sacrificing a bit of database space...
Doing some estimates it means only a few kb per article extra space in the database (plus a little bit for an index on the column), which is nothing compared to the performance benifit. Great suggestion!