Forum Moderators: coopster

Message Too Old, No Replies

building simple site search engine

Need a couple of tips

         

ironik

5:50 am on Apr 27, 2005 (gmt 0)

10+ Year Member



I'm kind of stuck on this one. The code currently works, but it's horribly inefficient. I'm building a simple search engine that takes words from an article and stores them in the database. Here's the basic process:

break words into array of unique words
does the word exist in the word table
>> if yes, look for an occurance on the current page. If there is no occurance, make a new one, else update current occurance count for the page.
>> if no, create an new entry for the word, and a new occurance.

Not sure if thats too readable, but basically I'm stuck with 3 queries per word. If I have 500 unique words on the page then it's 1500 queries per published article. The articles will be kind of fluid in nature and need to be edited regularly (which means words and occurances will change each time an article is changed).

What I'm thinking is perhaps I can run some sort of table join between the word and occurance tables and then test together whether the word and occurance exists, or write something that reduces the amount of calls I have to make to the database.

My table structure looks like this:

Words:
------
word_id
word

Occurances:
-----------
Occurance_id
Word_id
Page_id

Does anyone have any advice on reducing the number of queries I have to perform for each word, or building something a little more efficient?

anshul

11:56 am on Apr 27, 2005 (gmt 0)

10+ Year Member



Tips:

Use normalized db.
Skip words/phrases like to, of, and, so, it, like ++
Use regular expressions ( PCRE support of PHP ).
Read your files using PHP file-system functions.

arran

12:32 pm on Apr 27, 2005 (gmt 0)

10+ Year Member



I would use an intermediate data structure in your application and also denormalize (eliminate the 'Words' table - you do not need a word_id as each word will be unique). If you are only indexing one document at a time, use the following hash at the application level:

document{word}->#occurrences

To insert the document details into the database:

for each word {
insert into index_table values(page_id, word, #occurrences)
}

To extract documents containing a particular word:

select page_id, #occurrences from index_table where word = $word

Easy_Coder

1:24 pm on Apr 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



-->Skip words/phrases like to, of, and, so, it, like

You might consider building yourself one more table that contains StopWords like those suggested above. Too, if you're planning on displaying 'Latest Searches' you may want to add some foul language words to your StopWords repository so that these items don't appear on your site.

ironik

10:40 pm on Apr 27, 2005 (gmt 0)

10+ Year Member



Thanks for the responses ;)

The search class I've built thus far filters stopwords already, but I might also impliment a bad words filter as suggested. Interesting idea to combine the words into the occurance table. That would solve my multiple query problem but sacrificing a bit of database space...

Doing some estimates it means only a few kb per article extra space in the database (plus a little bit for an index on the column), which is nothing compared to the performance benifit. Great suggestion!

ironik

10:49 pm on Apr 27, 2005 (gmt 0)

10+ Year Member



Thanks also for the suggestion of an intermediate datastructure. I was doing a few more estimates and I can cut down something that would have taken 1500 queries into about 300 - 400. Much more manageable!