Remove duplicate words

         

Blue_Dog

4:48 pm on Apr 21, 2007 (gmt 0)

10+ Year Member



Hi,

I have some chunks of text where I need to remove duplicate words, e.g. "one one, two three, two" should become "one, two three,".
I could do that using arrays etc., but it's quite slow doing it that way because of the size of the text. Is it possible to achieve the same result using regular expressions and preg_replace?

Many Thanks

eelixduppy

6:40 pm on Apr 21, 2007 (gmt 0)



Welcome to WebmasterWorld!

>> Is it possible to achieve the same result using regular expressions and preg_replace?

I do not think you are going to be able to do this with regular expressions.

I'd say the best way to do something like this is to explode() the string into an array of words, and then use array_unique() to get rid of the duplicates.

Good luck!

eelixduppy

1:58 pm on Apr 22, 2007 (gmt 0)



explode() on its own might not be the best way to go about this, because the punctuation stays attached to the words in the array: "one," and "one" are different strings, so duplicate words won't match unless the same punctuation follows each of them. The following should work, however. :)

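$string = "one one, two three, two";
// Put a space before each punctuation mark so it becomes its own token
$string = preg_replace("/([,.?!])/"," \\1",$string);
// Split on spaces and drop repeated tokens (the first occurrence is kept)
$parts = explode(" ",$string);
$unique = array_unique($parts);
// Rejoin, then reattach punctuation to the preceding word
$unique = implode(" ",$unique);
$unique = preg_replace("/\s([,.?!])/","\\1",$unique);
echo $unique; // prints: one, two three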

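It's not the best solution, but it works as long as the same punctuation mark doesn't appear more than once in the string. array_unique() treats each mark as just another word, so repeated marks get dropped as well.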
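
To make that limitation concrete, here is a small sketch that wraps the snippet above in a function and runs it on two inputs. The function name dedupe_tokens is made up for illustration and is not part of the original post; the body is the same sequence of calls.

```php
<?php
// Hypothetical wrapper around the snippet above; the name is illustrative only.
function dedupe_tokens(string $string): string
{
    // Detach punctuation so each mark becomes its own space-separated token
    $string = preg_replace('/([,.?!])/', ' $1', $string);
    // Drop repeated tokens, keeping the first occurrence of each
    $unique = implode(' ', array_unique(explode(' ', $string)));
    // Reattach punctuation to the preceding word
    return preg_replace('/\s([,.?!])/', '$1', $unique);
}

echo dedupe_tokens("one one, two three, two"), "\n"; // one, two three
echo dedupe_tokens("one, two, three"), "\n";         // one, two three
```

In the second call no word is duplicated, yet the second comma disappears, because it is a repeat of the first comma token.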

Blue_Dog

9:11 pm on Apr 23, 2007 (gmt 0)

10+ Year Member



Thanks, eelixduppy. Your code works like a charm. It's a pity I've spent a couple of hours writing the code using arrays...

eelixduppy

7:02 am on Apr 24, 2007 (gmt 0)



>> It's a pity I've spent a couple of hours writing the code using arrays...

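If your solution works flawlessly then I'd go with it. My code above may work in most cases, but it won't handle text that contains the same punctuation mark more than once (several periods, commas, etc.), since the repeated marks get removed along with the duplicate words. Whatever suits your needs best. :)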
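
For text with repeated punctuation, one possible alternative is to leave the punctuation where it is and remove only the repeated words, using preg_replace_callback(). This is a minimal sketch and was not posted in the thread. The function name remove_duplicate_words is hypothetical, a "word" is assumed to be a run of \w characters, and the comparison is case-sensitive, as it is with array_unique().

```php
<?php
// Hypothetical helper, not from the thread: keeps the first occurrence of
// each word and leaves punctuation untouched.
function remove_duplicate_words(string $text): string
{
    $seen = [];
    // Visit every word; return it the first time, an empty string afterwards
    $result = preg_replace_callback('/\w+/', function (array $m) use (&$seen): string {
        if (isset($seen[$m[0]])) {
            return '';
        }
        $seen[$m[0]] = true;
        return $m[0];
    }, $text);
    // Collapse the extra whitespace left behind by removed words
    $result = preg_replace('/\s+/', ' ', $result);
    // Remove any space that now sits directly before a punctuation mark
    $result = preg_replace('/\s+([,.?!])/', '$1', $result);
    return trim($result);
}

echo remove_duplicate_words("one one, two three, two"), "\n"; // one, two three,
echo remove_duplicate_words("one, two, three"), "\n";         // one, two, three
```

The text is scanned once and each word costs one array lookup, so this should scale reasonably to large inputs, though that is untested here.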