The query will be checked against a table of words and phrases, so out-of-the-ordinary characters should ideally be stripped out to give the query a better chance of matching a word or phrase in the table.
After printing myself out a nice ASCII table, I know that I want to remove ASCII decimal characters 33 to 64 (! to @), 91 to 96 ([ to `) and 123 to 126 ({ to ~), where every instance is replaced with a space and all double spaces are then replaced with a single space.
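Instead of doing something like this:
$q = "the query string"; // placeholder for the submitted query
$search = array
(
"'#'",
"'\\$'", // $ is a regex anchor, so it has to be escaped
"'%'",
"'&'",
"'\\''", // the delimiter itself has to be escaped too
// etc
);
// A single string is used as the replacement for every pattern.
$replace = " ";
$q = preg_replace($search, $replace, $q);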
is there a better way of defining the three character ranges mentioned above?
33 to 64, 91 to 96 and 123 to 126
From my limited understanding, I think it's also possible to do this by defining the alphanumeric class [a-zA-Z0-9], where all characters outside this range are stripped?
What would be the best/leanest way to strip all non-alphanumeric characters?
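As a minimal sketch of the negated-class idea raised above (not a method anyone in the thread posted): the hypothetical function clean_query() replaces every run of characters outside [a-zA-Z0-9] with one space, so double spaces collapse in the same pass. It assumes plain ASCII input and that digits should be kept. Note that the range 33 to 64 mentioned above would also strip the digits 0 to 9 (decimal 48 to 57).

```php
<?php
// Hypothetical helper: keep letters and digits, turn everything else into spaces.
function clean_query($q)
{
    // The + quantifier folds each run of unwanted characters (spaces included)
    // into a single space, so no separate double-space pass is needed.
    $q = preg_replace('/[^a-zA-Z0-9]+/', ' ', $q);

    // Drop the space left behind at either end of the string.
    return trim($q);
}

echo clean_query("self-sustainable!!  news"), "\n";
```

The same call with '/[^a-zA-Z]+/' would strip digits as well, if that is what the word table needs.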
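function isAccepted($field)
{
// Value submitted for this field.
$value = $this->_getValue($field);
// Matches any character that is not a letter.
$pattern = '/[^a-zA-Z]/';
if (preg_match($pattern, $value))
{
return false; // an unwanted character was found, so report the error here
}
else
{
return true;
}
}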
Failing that, this seems to remove the "+" from the form if you use one.
// Replace a run of spaces plus the comma that follows it with a single space.
$keywords = preg_replace('/[ ]+,/', ' ', $keywords);
Cheers
This is part of the SE I intend to build from that directory ;)
Thanks for the code, I'm sure it will work fine, but I'm not too sure/keen on producing an error if they post a wrong character.
I guess a few searches, including valid words, would use the -, though my wordid table would not contain the dash. In fact, replacing the dash with a space might help in defining what the word is, i.e. self-sustainable could match "self" and "sustainable" as separate words instead of producing an error.
If a particular wordid contained both "self" and "sustainable" it would receive a bonus and most likely be the category/topic that the person is searching for.
I'm not going to deal with multiple languages and such in this script, but I hope to deal with those awkward searches and try to make them as uniform as possible :)
You will need to replace ¦ with the real ¦
Thanks, I see how you're referencing the decimal format there. I wasn't sure what you meant about the ¦, but I retyped the ¦ in there and see what you mean by it being "real". After I put that line in, though, I get the message
No ending delimiter ''' found, on line 13, which is this line:
$text = preg_replace ($search, $replace, $q);
and the preg_replace does not get done. I have a hunch this is basic regex syntax that I should know.....
/added
You're quick :) I checked out the alternative in the second post and got the same message. I'll read more closely and post if I can get it working without the error.
Have multiple sets of queries.
First a MySQL full-text search, which would pick out your "self-sustainable".
Then, if that was null, perform a second query on the split words using AND.
Then, if that was null, replace the AND with OR.
What do you think, would that be labour intensive? It would also act as a sort of ranking algo, albeit a simple one.
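A rough sketch of that phrase, then AND, then OR fallback, using MySQL boolean-mode search strings. The function names and the $run callback are hypothetical: $run stands in for whatever executes something like MATCH (word) AGAINST (? IN BOOLEAN MODE) and returns the rows. The query is assumed to be already reduced to alphanumeric words.

```php
<?php
// Hypothetical helper: build the three boolean-mode search strings,
// strictest first.
function fallback_searches($query)
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return array();
    }
    return array(
        'phrase' => '"' . implode(' ', $words) . '"', // exact phrase
        'and'    => '+' . implode(' +', $words),      // every word required
        'or'     => implode(' ', $words),             // any word
    );
}

// Try each tier in turn and stop at the first one that returns rows.
// $run is a caller-supplied callable: search string in, array of rows out.
function tiered_search($query, $run)
{
    foreach (fallback_searches($query) as $tier => $against) {
        $rows = $run($against);
        if (!empty($rows)) {
            return array('tier' => $tier, 'rows' => $rows);
        }
    }
    return array('tier' => null, 'rows' => array());
}
```

The tier that produced the rows can double as the simple ranking signal mentioned above, since a phrase hit is stricter than an AND hit, which is stricter than an OR hit.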
Cheers
You will need to replace ¦ with the real ¦
I believe this forum software replaces the real vertical bar character with the one you see here ¦. So you would need to replace it with the real vertical bar to indicate alternation in your regular expression.
I have a hunch this is basic regex syntax that I should know
A backslashed two or three digit octal number matches the character with the specified value.
A backslashed x followed by a one or two digit hexadecimal number matches the character with the specified value.
Andreas
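For illustration, here is one way the three decimal ranges could be written with the hex escapes described above (33 to 64 is \x21 to \x40, 91 to 96 is \x5B to \x60, 123 to 126 is \x7B to \x7E). This is a sketch under those assumptions, not the code from the earlier post, and strip_ranges() is a made-up name. The pattern opens and closes with the same delimiter (/ here); a missing closing delimiter is what triggers the "No ending delimiter" message.

```php
<?php
// Hypothetical helper: blank out the three ASCII ranges from the first post.
function strip_ranges($q)
{
    // One character class covers all three ranges, so no alternation is needed.
    // Note that \x21-\x40 includes the digits 0-9 (\x30-\x39).
    $q = preg_replace('/[\x21-\x40\x5B-\x60\x7B-\x7E]/', ' ', $q);

    // Collapse runs of two or more spaces into one, then trim the ends.
    return trim(preg_replace('/ {2,}/', ' ', $q));
}

echo strip_ranges("self-sustainable"), "\n";
```

If digits should survive, the first range can be split in two: \x21-\x2F and \x3A-\x40.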
You guys are just too good ;)
ukgimp,
what you explained is pretty much what I have written down, though I'd still want to search self and sustainable as two separate words.
If more than a single word is entered as a query, then the query is split word by word and pushed into an array.
The phrase itself is the first thing to be searched for, and if it produces no results, each element of the array is searched for individually.
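The phrase-first, words-second lookup described here might look something like the following sketch. The function name lookup_query() and the shape of $wordTable (word or phrase as key, wordid as value) are assumptions for illustration; the real lookup would run against the wordid table in the database.

```php
<?php
// Hypothetical helper: look up the whole phrase first, then each word.
// $wordTable is an in-memory stand-in for the wordid table.
function lookup_query($query, array $wordTable)
{
    $phrase = trim($query);

    // 1) The phrase itself is the first thing searched for.
    if (isset($wordTable[$phrase])) {
        return array($phrase => $wordTable[$phrase]);
    }

    // 2) No phrase match: split into words and look each one up on its own.
    $matches = array();
    foreach (preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (isset($wordTable[$word])) {
            $matches[$word] = $wordTable[$word];
        }
    }
    return $matches;
}
```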
/sidenote
andreas,ukgimp, you both know about that directory using the wordid list as category names in the directory....this is the same table to be used.
/sidenote
Then the algo would come down to these factors
1) How many words are in the query (weight each word's relevance by 1/total)
2) How many categories contain each of the words
3) If these words are categories in the directory, determine what level of the hierarchy they are in and whether each is a defining category (i.e. it is the last category). Bonuses are then applied to the websites within these categories that contain the words "self" or "sustainable".
I just plan on using categories of a directory as a heavy influence on weighting search engine results.
A search on "news" brings international news for example, because most likely in the directory there will be a category called "news" that is high up in the category structure and thus gives extra relevance to a generic term.
If someone searches for $country $region news, then the words (country, region, news) are first searched for as a whole phrase and compared to the word dictionary. In this case there would be no match, but once the words are cleaned of dashes and such they can be pushed into an array and re-examined for a match.
If there is a category for $country, $region, news, then the elements of the array will match up well with the category
country > region > news
BETTER THAN
news or
region > news or
anothercountry > region > news
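To make the ordering above concrete, here is a small sketch that scores a category path by the share of query words it contains, each word counting for 1/total. The function score_path() and the " > " path format are assumptions for illustration, and the hierarchy-level bonuses and phrase order discussed in this thread are not modeled.

```php
<?php
// Hypothetical helper: fraction of the query words that appear as a
// level in the category path, e.g. "country > region > news".
function score_path(array $queryWords, $categoryPath)
{
    $total = count($queryWords);
    if ($total === 0) {
        return 0;
    }

    // Split the path into its category levels.
    $levels = array_map('trim', explode('>', $categoryPath));

    // Count the query words that match a level exactly.
    $matched = 0;
    foreach ($queryWords as $word) {
        if (in_array($word, $levels, true)) {
            $matched++;
        }
    }
    return $matched / $total;
}

$query = array('country', 'region', 'news');
foreach (array('country > region > news', 'region > news', 'news') as $path) {
    echo $path, ': ', score_path($query, $path), "\n";
}
```

With this scoring, "country > region > news" ranks above the other three paths, while "region > news" and "anothercountry > region > news" tie, so a further factor would be needed to separate those two.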
I'm sure you see where I'm going :) I think that things like search phrase order will also have to be taken into account, and generally anything else that moves!
A punnett square might come in handy ;)
At least with the regex provided, there is more of a chance that a query will match a word or phrase.....and maybe with another layer of script dealing with stemming the searches should appear OK