Forum Moderators: coopster


Check text string for duplicates

fastest way?

         

omoutop

8:32 am on Jun 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi all,

I am looking for a way to check a text string for duplicate words, but not all of them. I am interested in finding two identical words that are next to each other (example: hello little one one). What's the fastest way to do that?

omoutop

9:20 am on Jun 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Never mind guys, I made myself a nice function...
I can't be sure it's the fastest way, but it serves my purpose fine (text string with a max size of 100 words).

For anyone that may want it:

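<?php
$str = "The The the Hello Truck Hello The the Fantastic bear";

// Collapse runs of adjacent duplicate words (case-insensitive),
// keeping the last word of each run.
function remove_duplicate($str)
{
    $a = explode(" ", $str);
    $b = count($a);

    $j = $a[0];
    $k = "";

    // Stop at the last index so $a[$i] is never read past the end of the array.
    for ($i = 1; $i < $b; $i++)
    {
        // Emit the previous word only if the current word differs from it.
        if (strtolower($j) != strtolower($a[$i])) { $k .= $j." "; }
        $j = $a[$i];
    }

    // The last word always closes its run, so append it after the loop.
    return $k.$j;
}

echo remove_duplicate($str);
// prints "the Hello Truck Hello the Fantastic bear"
?>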

coopster

5:42 pm on Jun 29, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



How about using a regular expression [php.net] to detect and remove duplicate words?
$str = "The The the Hello Truck Hello The the Fantastic bear"; 
$pattern = "/\b([\w'-]+)(\s+\\1)+/i";
$replacement = "$1";
print preg_replace($pattern, $replacement, $str);

The pattern says to find any word boundary followed by a word (which is one or more "word" characters, apostrophes or hyphens), followed by one or more of [one or more whitespace characters followed by the word subpattern captured earlier]. The /i modifier makes the comparison case-insensitive.

A bit complex but with a bit of study you'll see how it works ... nicely ;)
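
For anyone studying the pattern, here is a small sketch that spells out the same regular expression piece by piece using the /x modifier (which lets you put whitespace and # comments inside the pattern). The detection step with preg_match_all is my own addition for illustration and is not part of the post above.

```php
<?php
// Same pattern as in the post above, spread over several lines with the /x flag.
// Single quotes keep the backslashes literal, so the backreference is a plain \1.
$pattern = '/
    \b              # a word boundary
    ([\w\'-]+)      # group 1: a word made of word characters, apostrophes or hyphens
    (\s+\1)+        # one or more repeats of: whitespace, then the same word again
/ix';

$str = "The The the Hello Truck Hello The the Fantastic bear";

// Detect: list each run of adjacent duplicates without changing the string.
preg_match_all($pattern, $str, $matches);
print_r($matches[0]);

// Remove: keep only the first word of each run.
$cleaned = preg_replace($pattern, '$1', $str);
echo $cleaned, "\n";
```

The /i flag is what lets "The" and "the" count as the same word, because the backreference is then compared case-insensitively.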

siMKin

12:34 am on Jun 30, 2006 (gmt 0)

10+ Year Member



Nice one coopster!

I was interested to see if it could also compete in terms of parsing time, so I did a little test

<?php
function getmicrotime()
{
list($usec, $sec) = explode(" ",microtime());
return ((float)$usec + (float)$sec);
}

function removeDouble($str)
{
$words = explode(" ", $str);

$output = "";

foreach ($words AS $word)
{
if (!isset($oldWord) || strtolower($oldWord)!= strtolower($word))
{
$output .= $word." ";
$oldWord = $word;
}

}
return substr($output, 0, -1);
}
function removeDoublePreg($str)
{
$pattern = "/\b([\w'-]+)(\s+\\1)+/i";
$replacement = "$1";
return preg_replace($pattern, $replacement, $str);
}

$str = "The The the Hello Truck Hello The the Fantastic bear";

$timeStart = getmicrotime();

echo removeDouble($str)." : ".(getmicrotime() - $timeStart)."<br>\n";

$timeStart = getmicrotime();

echo removeDoublePreg($str)." : ".(getmicrotime() - $timeStart)."<br>\n";
?>

The result:

The Hello Truck Hello The Fantastic bear : 0.000283002853394
The Hello Truck Hello The Fantastic bear : 0.0001380443573

Twice as fast, not bad ;-)
Especially if you take into account that it actually works better, because of the word-boundary thing (and it works with any whitespace, not only with spaces)
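
A rough sketch of that difference, under my own assumptions: removeDouble here is a compact restatement of the explode()-based approach, and removeDoublePregStrict is a hypothetical variant that is not from this thread. It also shows one caveat I noticed while testing: the pattern has no check after the repeated word, so a repeat that is only the start of a longer word also matches.

```php
<?php
// Space-only splitting, following the explode()-based approach in this thread.
function removeDouble($str)
{
    $output = array();
    $oldWord = null;
    foreach (explode(" ", $str) as $word) {
        if ($oldWord === null || strtolower($oldWord) != strtolower($word)) {
            $output[] = $word;
            $oldWord = $word;
        }
    }
    return implode(" ", $output);
}

// The regular expression from this thread, unchanged.
function removeDoublePreg($str)
{
    return preg_replace('/\b([\w\'-]+)(\s+\1)+/i', '$1', $str);
}

// Hypothetical stricter variant: the lookahead refuses a repeat that is
// only the beginning of a longer word.
function removeDoublePregStrict($str)
{
    return preg_replace('/\b([\w\'-]+)(?:\s+\1(?![\w\'-]))+/i', '$1', $str);
}

$mixed = "one\tone two  two three";   // a tab and a double space
echo removeDouble($mixed), "\n";      // unchanged: explode(" ") never sees the repeats
echo removeDoublePreg($mixed), "\n";  // one two three

$partial = "this is island hopping";
echo removeDoublePreg($partial), "\n";        // this island hopping
echo removeDoublePregStrict($partial), "\n";  // this is island hopping
```

Whether the stricter variant is worth it depends on your text; for short, clean input the original pattern may be all you need.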

texmex

3:14 am on Jun 30, 2006 (gmt 0)

10+ Year Member



What about when it is grammatically correct to have the duplicate words? Such as:

"That that that man says, is that that that man thinks." ;-)

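One possible way around sentences like that is to only report the repeated words and leave the decision to a person. This is just a sketch under that assumption; findDoubles is a hypothetical helper name, not something posted in this thread.

```php
<?php
// Hypothetical helper: report runs of adjacent duplicate words instead of removing them.
function findDoubles($str)
{
    $pattern = '/\b([\w\'-]+)(?:\s+\1)+/i';

    // PREG_OFFSET_CAPTURE makes each match an array of (text, byte offset).
    preg_match_all($pattern, $str, $matches, PREG_OFFSET_CAPTURE);

    $found = array();
    foreach ($matches[0] as $m) {
        $found[] = array('text' => $m[0], 'offset' => $m[1]);
    }
    return $found;
}

$str = "That that that man says, is that that that man thinks.";
foreach (findDoubles($str) as $hit) {
    echo $hit['offset'], ": ", $hit['text'], "\n";
}
```

Both runs in the example sentence get reported, and the string itself is left untouched.
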
omoutop

6:01 am on Jun 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Coopster, very nice! Impressive. I think I will keep your suggestion (time to practice regular expressions).

Although I see no improvement in my case (small text up to 100 words, function is used once per page), it will do me good to experiment with the newly found knowledge.

Thanks again coopster

adni18

8:08 pm on Jun 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or (on the grammatically correct thing):

"The last few nights he had had had had a noticeable difference."

the_nerd

9:46 pm on Jun 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For once, German seems to be superior to other languages:

Wenn Robben hinter Robben robben, robben Robben Robben nach. (Roughly: when seals crawl behind seals, seals crawl after seals.)

Can also be used with "Fliegen" and "Kugeln"

trillianjedi

8:18 am on Jul 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Coopster - absolutely brilliant. And educational at the same time too!

Is there any way to use similar code to remove excessive punctuation? Like multiple exclamation marks and question marks for example?

TJ

coopster

7:14 pm on Jul 8, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Oh yes. We covered something similar to that one once too, took me a few minutes to find it though ...

Checking string for consecutive alpha numerics [webmasterworld.com]

You'll have to modify it to meet your needs -- let us know if you have any trouble.
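
As a starting point, here is a minimal sketch of the same backreference idea applied to punctuation. This is not the code from the linked thread; both function names are hypothetical, and the set of marks is limited to ! and ? on the assumption that things like "..." should be left alone.

```php
<?php
// Hypothetical helper: collapse a run of the same mark ("!!!" or "???") to one mark.
function collapseRepeatedMarks($str)
{
    // ([!?]) captures one mark; \1+ matches one or more repeats of that same mark.
    return preg_replace('/([!?])\1+/', '$1', $str);
}

// Hypothetical variant: also collapse mixed runs such as "?!?!", keeping the first mark.
function collapseMixedMarks($str)
{
    return preg_replace('/([!?])[!?]+/', '$1', $str);
}

$str = "Really??? No way!!!! What?!?!";
echo collapseRepeatedMarks($str), "\n";  // Really? No way! What?!?!
echo collapseMixedMarks($str), "\n";     // Really? No way! What?
```

Add more characters to the [!?] class if you want other marks treated the same way.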