Welcome to WebmasterWorld Guest from 54.167.58.159

Forum Moderators: open

Message Too Old, No Replies

MySQL generate Google-search type of query in PHP

     

kajje

5:10 am on Nov 3, 2010 (gmt 0)

5+ Year Member



I have a table fully loaded with information.
I want to perform a (simplified) Google-search formatted query to get my results.


Search queries examples:
Q1) one two three
Q2) "one two" three
Q3) one two -three


Q1) should give a simple query
SELECT * FROM `myTable` WHERE `myField` LIKE '%one%' AND `myField` LIKE '%two%' AND `myField` LIKE '%three%';

Q2) should give
SELECT * FROM `myTable` WHERE `myField` LIKE '%one two%' AND `myField` LIKE '%three%'

Q3) should give
SELECT * FROM `myTable` WHERE `myField` LIKE '%one two%' AND `myField` NOT LIKE '%three%'


Now I feel mostly strong enough to build this query in php though it would not be completely fool-proof.

Any suggestions on how to make the php script simpler? Regular expressions (and if so, can you give me some examples)? How to get every string between 2 double-quotes in an array?

Any other suggestions on how to make more nice Google-query functionality?

Thanks in advance

enigma1

9:41 am on Nov 3, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Here is an example with a regular expression to extract portions of the string containing quotes.
$search = '"one two" three';
$pattern = '/(?>")[^"]*+"/';
preg_match_all($pattern, $search, $match_array);

You could also use
explode('"',$search);
and that will return an array splitting the string among the double quotes and then use a counter to join the segments that appear within the string.

In general you should first filter the input submitted by the user with a pattern that contains characters you expect. That's before breaking down string into segments.

Having all the segments in an array you could feed them into your queries but the search is not too efficient using simply the %segment% patterns.

You see if you search for "one" it will also match words like everyone or someone etc. You need to setup the queries to match expected characters. So if it's to expect the word one to take place somewhere in the mid of a string and you want to search for the exact work you could add a prefix like:
WHERE `myField` LIKE '% one%'
That will try to match something that starts with "one" and is a bit more efficient. Of course it will not work if the sentence begins with one or if there is a different separator like two-one or (one) etc. So the original filtering method is important not only to extract the segments but to replace unwanted characters with separators before feeding it into the queries.

Finally the search process should be repeated if there is a spelling mistake by reducing the number of characters in the segments before trying again up to a number of times. Right to Left character reduction could be applied till you get a match. Limits of characters in each word can improve the search. For instance if someone searches for
one or a three
you could eliminate any word that is less than 3 characters.

Because the queries at the end can be long enough with the various and operators for each segment you could get the db results and then filter them for the excluded words. In other words, the exclusion operator could be processed in php. Of course depends how many results you will present if it's to use pagination for example from the returned rows, you need the NOT inside the query.

rocknbil

3:08 pm on Nov 3, 2010 (gmt 0)

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member



This one is probably "close enough"

Q1) one two three
Q1) should give a simple query
SELECT * FROM `myTable` WHERE `myField` LIKE '%one%' AND `myField` LIKE '%two%' AND `myField` LIKE '%three%';


But it doesn't account for whole words, it will, for example, match on "anyone" and "someone". More below. These are mostly incorrect:

Q2) "one two" three
Q2) should give
SELECT * FROM `myTable` WHERE `myField` LIKE '%one two%' AND `myField` LIKE '%three%'


Though the liklihood is slim, it would match on "someone two". The % is a wildcard character, when first it means "anything before" and last, "anything after." What you'd need here is some form of mySQL regexp [dev.mysql.com]. Something like

select * from `myTable` where `myField` regexp '[[:<:]]one two[[:>:]]' and `myField` like '%three%'

or better,

select * from `myTable` where `myField` regexp '[[:<:]]one two[[:>:]]' and `myField` regexp '[[:<:]]three[[:>:]]'

[[:<:]] and [[:>:]] being word boundaries. You can test this on the command line:

select 'zero one two three four' regexp '[[:<:]]one two[[:>:]]';
--> 1

select 'zero one OOPS two three four' regexp '[[:<:]]one two[[:>:]]';
-- 0

Same with this one.

Q3) one two -three
Q3) should give
SELECT * FROM `myTable` WHERE `myField` LIKE '%one two%' AND `myField` NOT LIKE '%three%'


select * from `myTable` where `myField` regexp '[[:<:]]one two[[:>:]]' and `myField` not regexp '[[:<:]]three[[:>:]]'

There's no sense in extracting too many results then slogging through them in your programming, this makes for excessive use of resources and slow queries. You can do the regexps directly in the selects, see link above for more options. Though they are often more difficult to compose, regexps are more efficient than 'like' anyway.

Demaestro

3:59 pm on Nov 3, 2010 (gmt 0)

WebmasterWorld Senior Member demaestro is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Once you get your query working you will have to work on a sorting algo.

The beauty of Google was always the "relavance" factor. It isn't too difficult to script a light weight one yourself.

Basically you want to "bubble" up the most relevant results to the top.

You can do that by

Q1) one two three:

anything that starts with "one two three" moves all the way up

anything that starts with "one two" moves up

anything that starts with "one" moves up

anything that contains one & two & three moves up

anything that contains one & two moves up

and so on......

topr8

6:07 pm on Nov 3, 2010 (gmt 0)

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



IMO you need to rethink the query completely.

you need to create a fulltext index on 'myfield'

the syntax for searching is then slightly different
see the documentation:
[dev.mysql.com...]

enigma1

10:12 am on Nov 4, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



regexp in general is slower than like. Even if it looks you setup multiple like clauses vs a single regexp.

regexp is good if you need detailed control of a pattern, but for a general word/phrase search, like is better.