homepage Welcome to WebmasterWorld Guest from 54.237.98.229
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Become a Pro Member

Home / Forums Index / Code, Content, and Presentation / Databases
Forum Library, Charter, Moderator: open

Databases Forum

    
Keyword Matching Database
Similar to Google Adwords
whoisgregg

WebmasterWorld Senior Member whoisgregg us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4208828 posted 4:23 pm on Sep 29, 2010 (gmt 0)

I'm working on a feature for an internal site search tool. Marketing folks will be able to upload "ads" to appear alongside searches on our site.

These ads will need to use the same keyword matching types as offered by Google Adwords. For example:

Ad 1 - "custom widgets"
Ad 2 - [red furry widgets]
Ad 3 - widgets
Ad 4 - "furry widgets" -red

So, if, for example a user does a site search for red furry widgets then they should see Ad 2 and Ad 3. If they do a search for blue and green furry widgets, they'd see Ad 3 and Ad 4.

Here's my problem... For the life of me I cannot figure out how to set up these queries against the database! The only way I can think of is to actually loop through every row in the database and test it against the search terms, but there's got to be a better way.

Thoughts?

 

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4208828 posted 11:15 am on Sep 30, 2010 (gmt 0)

You may identify the phrases based on the double quotes and then use the LIKE to run the queries. To exclude rows that contain words like -red the NOT LIKE could be used. REGEXP and NOT REGEXP can be used for more complex searches.

Now the search criteria needs to be parsed, so if it's not a whole phrase, to break down into words and then broaden the search. Simple example

$search_words = array();
$check_array = explode(' ', $search);
for($i=0, $j=count($check_array); $i<$j) {
$pos = strpos($check_array[$i], '-');
if( $pos !== false && !$pos ) continue; // skip exceptions
$search_words[] = "db_col LIKE '%" . $check_array[$i] . "%'";
}

$search_string = 'where ' . implode(' or ', $search_words);
$q = mysql_query("select db_col from products " . $search_string);

This may match parts of the word so you may have to modify the search afterwards using common prefix characters space, hyphens, comas etc
$search_words[] = "db_col LIKE '%-" . $check_array[$i] . "%'";
$search_words[] = "db_col LIKE '% " . $check_array[$i] . "%'";

There are other methods using the FULLTEXT match/against for instance and you may want to check the string manipulation functions your database supports as the code assumes mysql

whoisgregg

WebmasterWorld Senior Member whoisgregg us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4208828 posted 1:24 pm on Sep 30, 2010 (gmt 0)

So given a user-entered search term of `red fuzzy widgets` I would first run this query against the ad table:

SELECT * FROM `ads` WHERE `keywords` LIKE '%red%' OR `keywords` LIKE '%fuzzy%' OR `keywords` LIKE '%widgets%';

Then only check those specific keyword phrases against the user-entered search term? I can see that working for quite a few situations, but I can think of a few situations where it doesn't -- particularly stemming.

For example, if the keyword is "blue widget" and the user entered `blue widgets` then doing a LIKE '%widgets%' would *not* bring up the singular form in the table. :/

I think I'll go do some research into stemming algorithms.

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4208828 posted 2:38 pm on Sep 30, 2010 (gmt 0)

There are certain ways to improve these simple search queries. For instance if we assume the majority of plural words end up in 's' or 'es' then if the first query fails you could initiate a second trimming words ending with these combinations of characters.

You could setup some sensitivity switches so when the query is executed again the chances increase to bring up results removing characters from the end and setting up a threshold to end the operation.
$check_array[$i] = substr($check_array[$i],0,-1);

As each word decreases in length, the chances to retrieve results increase and to compensate for the errors you could utilize functionality perhaps that has to do with pronouncing. In php, there are few like soundex, metaphone, similar_text etc. You could try converting the input words first through the soundex or when you retrieve results that may be inaccurate to filter them through the similar_text. Of course you will need to parse each resulting row again for this and can be intense.

Another thing you may want to do is to precede words in your query with expected separation characters so the results are more accurate.

SELECT * FROM `ads` WHERE `keywords` LIKE 'red%' OR `keywords` LIKE '% red%'......
because you expect 'red' to be at the beginning of a sentence or to be the beginning of a word (eg 'shredding' may also come up when searching for 'red'). For this kind of search you expect the beginning of the phrase to be more accurate than the end.

whoisgregg

WebmasterWorld Senior Member whoisgregg us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4208828 posted 5:21 pm on Sep 30, 2010 (gmt 0)

It seems to me though that this approach is computationally very expensive -- I'm practically comparing every row in the ads table against the search term in multiple ways. With some caching, I may be able to make this work when I have less than 1,000 ads, but how would it be handled with 100,000s of ads?

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4208828 posted 6:26 pm on Sep 30, 2010 (gmt 0)

It doesn't matter even if it's a million. Because you will only need enough records to populate a page the user views right? So that will be couple of dozen records? So you setup an index/limit because when you present the results to the user you deploy pagination.

Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / Databases
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved