Strip All Non-Alphanumeric Characters from String - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster

Strip All Non-Alphanumeric Characters from String

brotherhood of LAN

11:56 am on Oct 23, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I am using preg_replace to get rid of unwanted characters that could be part of a search string.

The query will be checked against a table of words and phrases, so unusual characters should ideally be stripped out to give the query a better chance of matching a word or phrase in the table.

After printing myself out a nice ASCII table, I know that I want to remove ASCII decimal characters 33 to 64 (! to @), 91 to 96 ([ to `) and 123 to 126 ({ to ~), where every instance is replaced with a space and all double spaces are then replaced with a single space.

Instead of doing something like this:

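// $q = the query string
$search = array
(
"'#'",
"'\\$'", // escaped: an unescaped $ is the end-of-string anchor
"'%'",
"'&'",
"'\''",
// etc.
);
// A single string replacement is applied to every pattern; a one-element
// array would only supply a space for the first pattern.
$replace = " ";
$q = preg_replace ($search, $replace, $q);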

is there a better way of defining the 3 character ranges mentioned above?
33 to 64, 91 to 96 and 123 to 126

From my limited understanding I'm thinking that it's also possible to do this by defining alphanumeric [a-zA-Z0-9], where all characters outside this range are stripped?

What would be the best/leanest way to strip all non-alphanumeric characters?
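
The "everything outside [a-zA-Z0-9]" idea from the question can be written as a negated character class. A minimal sketch, assuming plain ASCII input (any non-ASCII byte would be replaced too); the function name strip_non_alnum is hypothetical and not from the thread:

```php
<?php
// Hypothetical helper: replace every run of non-alphanumeric ASCII
// characters with a single space, then trim the ends.
function strip_non_alnum(string $q): string
{
    // [^a-zA-Z0-9]+ matches one or more characters outside the three ranges,
    // so a run such as "---" or " & " collapses to one space in a single pass.
    $clean = preg_replace('/[^a-zA-Z0-9]+/', ' ', $q);
    return trim($clean);
}

echo strip_non_alnum('self-sustainable & "green" energy!!'), "\n";
// prints: self sustainable green energy
```

Because spaces are themselves outside the class, runs of spaces collapse in the same pass and no separate double-space replacement is needed.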

ukgimp

12:10 pm on Oct 23, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'm not an expert, BoL, but I have a form validator that could be adapted:

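function isAccepetd($field)
{
$value = $this->_getValue($field);
// The parentheses act as the regex delimiters; the class matches any non-letter.
$pattern = "([^a-zA-Z])";
if(preg_match($pattern, $value))
{
return false; // error: the value contains a character outside a-z/A-Z
}
else
{
return true;
}
}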

Failing that this seems to remove the "+" from the form if you use one.

$keywords = ereg_replace("([ ]+,)"," ",$keywords);

Cheers

brotherhood of LAN

12:30 pm on Oct 23, 2002 (gmt 0)

Hello ukgimp,

This is part of the SE I intend to build from that directory ;)

Thanks for the code, I'm sure it will work fine, but I'm not too keen on producing an error if they post a wrong character.

I guess a few searches, including ones with valid words, would use the -, though my wordid table would not contain the dash. In fact, replacing the dash with a space might help in defining what the word is, i.e. instead of producing an error, self-sustainable could be split into "self" and "sustainable" and matched as separate words.

If a particular wordid contained both "self" and "sustainable" it would receive a bonus and most likely be the category/topic that the person is searching for.

I'm not going to deal with multiple languages and such with this script, but I hope to deal with those awkward searches and try to make them as uniform as possible :)
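
Treating the dash as a word separator, as described above, could look roughly like this. It is only a sketch: the helper name query_words is made up, and lowercasing the query is an assumption about how the wordid table is stored.

```php
<?php
// Hypothetical helper: split a query into lowercase words, treating any run
// of non-alphanumeric characters (dash, space, punctuation) as a separator.
function query_words(string $q): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
    // separators would otherwise produce.
    return preg_split('/[^a-z0-9]+/', strtolower($q), -1, PREG_SPLIT_NO_EMPTY);
}

echo implode(', ', query_words('Self-sustainable')), "\n";
// prints: self, sustainable
```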

andreasfriedrich

12:35 pm on Oct 23, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Did you try

"'[\033-\064]¦[\091-\096]¦[\123-\126]'"

as your pattern?

You will need to replace ¦ with the real |

Andreas

<added>

Sorry, but those values would have to be octal

"'[\041-\100]¦[\133-\140]¦[\173-\176]'"

</added>

[edited by: andreasfriedrich at 12:44 pm (utc) on Oct. 23, 2002]

andreasfriedrich

12:37 pm on Oct 23, 2002 (gmt 0)

"'[\033-\064\091-\096\123-\126]'"

should work as well.

Andreas

<added>

Sorry, but those values would have to be octal

"'[\041-\100\133-\140\173-\176]'"

</added>

[edited by: andreasfriedrich at 12:45 pm (utc) on Oct. 23, 2002]
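
For reference, the octal class above can be tried out as follows. This is a sketch, not the poster's exact code: it uses / delimiters and single quotes so that the regex engine, rather than PHP, reads the octal escapes, and the constant and function names are made up. One thing worth noticing is that decimal 33 to 64 also covers the digits 0-9 (48 to 57), so the second pattern is an assumed variant that splits that range to keep digits.

```php
<?php
// Ranges from the thread in octal: 33-64, 91-96 and 123-126 (decimal).
const THREAD_PATTERN = '/[\041-\100\133-\140\173-\176]+/';

// Assumed variant that keeps digits: 33-47, 58-64, 91-96 and 123-126.
const KEEP_DIGITS_PATTERN = '/[\041-\057\072-\100\133-\140\173-\176]+/';

// Hypothetical helper name, used only for this illustration.
function replace_with_space(string $pattern, string $q): string
{
    // Replace each run of matched characters with one space, then collapse
    // the repeated spaces that result next to spaces already in the input.
    $q = preg_replace($pattern, ' ', $q);
    return trim(preg_replace('/ {2,}/', ' ', $q));
}

echo replace_with_space(THREAD_PATTERN, 'top-10 {news} sites'), "\n";
// prints: top news sites
echo replace_with_space(KEEP_DIGITS_PATTERN, 'top-10 {news} sites'), "\n";
// prints: top 10 news sites
```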

brotherhood of LAN

12:43 pm on Oct 23, 2002 (gmt 0)

You will need to replace ¦ with the real |

Thanks, I see how you're referencing the decimal format there. I wasn't sure what you meant with the ¦, so I retyped the | in there and see what you mean by it being "real", but after I put that line in, I get the message

No ending delimiter ''' found

on line 13, which is this line:
$text = preg_replace ($search, $replace, $q);

and the preg_replace does not get done. I have a hunch this is basic regex syntax that I should know...

/added

You're quick :) I checked out the alternative in the second post and got the same message. I'll read more closely and post if I can get it working without the error.

brotherhood of LAN

12:47 pm on Oct 23, 2002 (gmt 0)

"'[\041-\100\133-\140\173-\176]'"

This one did not produce the error but still did not replace --- or other testers.

ukgimp

12:47 pm on Oct 23, 2002 (gmt 0)

Perhaps there is another option that you could try. I would be interested myself to know if it is feasible (is it, people?)

Have multiple sets of queries.

First, a MySQL full-text search, which would pick out your "self-sustainable".

Then, if that returned nothing, perform a second query on the split words using AND.

Then, if that returned nothing, replace the AND with OR.

What do you think, would that be labour intensive? It would also act as a sort of ranking algo, albeit a simple one.

Cheers
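
The staged fallback described above (whole phrase, then all words with AND, then any word with OR) might be sketched like this. To keep it self-contained it scans an in-memory array with substring checks instead of running MySQL queries, so the function names and the matching rules are assumptions for illustration only.

```php
<?php
// Count how many of the query words occur somewhere in the document text.
function count_hits(array $words, string $text): int
{
    $hits = 0;
    foreach ($words as $w) {
        if (strpos($text, $w) !== false) {
            $hits++;
        }
    }
    return $hits;
}

// Hypothetical stand-in for the three query stages.
function staged_search(string $query, array $documents): array
{
    $phrase = strtolower(trim($query));
    $words = preg_split('/[^a-z0-9]+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    if ($total === 0) {
        return [];
    }

    // Stage thresholds: null = whole phrase, $total = every word (AND),
    // 1 = at least one word (OR).
    foreach ([null, $total, 1] as $needed) {
        $found = [];
        foreach ($documents as $doc) {
            $text = strtolower($doc);
            $ok = ($needed === null)
                ? (strpos($text, $phrase) !== false)
                : (count_hits($words, $text) >= $needed);
            if ($ok) {
                $found[] = $doc;
            }
        }
        if ($found !== []) {
            return $found; // stop at the first stage that returns anything
        }
    }
    return [];
}
```

Because the loop stops at the first stage that finds something, the stage that produced the results doubles as a crude relevance signal.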

ukgimp

12:49 pm on Oct 23, 2002 (gmt 0)

You will need to replace ¦ with the real |

The pipes don't format correctly when copied from here; they have that break in them. They need to be re-inserted.

andreasfriedrich

12:52 pm on Oct 23, 2002 (gmt 0)

You will need to replace ¦ with the real |

I believe this forum software replaces the real vertical bar character with the one you see here ¦. So you would need to replace it with the real vertical bar to indicate alternation in your regular expression.

I have a hunch this is basic regex syntax that I should know

A backslashed two or three digit octal number matches the character with the specified value.

A backslashed x followed by a one or two digit hexadecimal number matches the character with the specified value.

Andreas
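
A quick way to see the two escape forms side by side is shown below. The patterns are single-quoted so that PHP leaves the backslashes alone and the regex engine interprets the escape; in a double-quoted string PHP itself would convert "\041" or "\x21" to "!" before the regex engine sees it. The constant names are made up for the example.

```php
<?php
// Both patterns match the same character, "!" (decimal 33).
const OCTAL_BANG = '/\041/'; // backslash + octal digits: 041 octal = 33
const HEX_BANG   = '/\x21/'; // backslash + x + hex digits: 21 hex = 33

var_dump(preg_match(OCTAL_BANG, 'hi!')); // int(1)
var_dump(preg_match(HEX_BANG, 'hi!'));   // int(1)
var_dump(preg_match(HEX_BANG, 'hi'));    // int(0)
```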

brotherhood of LAN

1:15 pm on Oct 23, 2002 (gmt 0)

ahhh, big cheers ukgimp and andreas, it works fine, I forgot to change the pipes in the second example.

You guys are just too good ;)

ukgimp,

what you explained is pretty much what I have written down, though I'd still want to search self and sustainable as two separate words.

If more than a single word is entered as a query, then the query is split word by word and pushed into an array.

The phrase itself is searched for first, and if it produces no results, each element of the array of single words is searched for individually.

/sidenote
andreas,ukgimp, you both know about that directory using the wordid list as category names in the directory....this is the same table to be used.
/sidenote

Then the algo would come down to these factors
1) How many words are in the query (divide their relevance by 1/total)
2) How many categories contain each of the words
3) If these words are categories in the directory, determine what level in the hierarchy they are in, and whether it is a defining category (i.e. it is the last category). Bonuses are then applied to the websites within these categories that contain the words "self" or "sustainable"

I just plan on using categories of a directory as a heavy influence on weighting search engine results.

A search on "news" brings international news for example, because most likely in the directory there will be a category called "news" that is high up in the category structure and thus gives extra relevance to a generic term.

If someone searches for $country $region news, then the words (country, region, news) are first searched for as a whole phrase and compared to the word dictionary. In this case there would be no match, but once the words are cleaned of dashes and such they can be put into an array and re-examined for a match.

If there is a category for $country, $region, news, then the elements of the array will match up well with the category
country > region > news
BETTER THAN
news or
region > news or
anothercountry > region > news
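
One simple way to get that ordering is to score each category path by how many distinct query words name a category on it. This is only a guess at a possible scoring rule, not the algorithm planned in the thread; the function names are hypothetical, and the sample country and region names used when trying it out (france, normandy, germany) are invented.

```php
<?php
// Hypothetical score: number of distinct query words that appear as a
// category name somewhere on the path.
function path_score(array $queryWords, array $path): int
{
    $queryWords = array_unique(array_map('strtolower', $queryWords));
    $path = array_map('strtolower', $path);
    return count(array_intersect($queryWords, $path));
}

// Sort category paths so the best-matching path comes first.
function rank_paths(array $queryWords, array $paths): array
{
    usort($paths, function (array $a, array $b) use ($queryWords): int {
        return path_score($queryWords, $b) <=> path_score($queryWords, $a);
    });
    return $paths;
}
```

With the query words france, normandy and news, the path france > normandy > news scores 3, while news scores 1 and both normandy > news and germany > normandy > news score 2, which matches the ordering given above.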

I'm sure you see where I'm going :) I think that things like search phrase order will also have to be taken into account, and generally anything else that moves!

A punnett square might come in handy ;)

At least with the regex provided, there is more of a chance that a query will match a word or phrase, and maybe with another layer of script dealing with stemming, the searches should appear OK.