Another Repeating Text Problem

How to identify this one

         

FourDegreez

3:21 pm on May 8, 2008 (gmt 0)


I asked about repeating characters here: [webmasterworld.com]

There is a tougher challenge I face involving repeating text, and I hope someone has a clever solution to this. How does one detect repeating strings of arbitrary length?

Let's say a user submits the following: "I am bored I am bored I am bored ..." with dozens of repetitions. I need to programmatically detect this.

But I can't think of any efficient way.

darrenG

6:56 pm on May 8, 2008 (gmt 0)


Split the input into words.
Test whether words[1] is repeated.
If words[2] is repeated in the same positions relative to words[1], the phrase has probably been repeated.
Test for further repetitions in the same way.

Just a rough idea, hope that makes sense.
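One way the steps above might look in PHP, as a rough sketch only. The function name findRepeatedPhrase is made up for illustration, and the sketch assumes the whole submission is one phrase repeated back to back, starting at the first word (a trailing partial repeat is tolerated). Each later occurrence of the first word gives a candidate phrase length, and every word is then compared with the word one phrase length earlier.

```php
<?php
// Hypothetical helper (not from the thread): returns the repeated phrase if
// the whole input is one phrase repeated back to back, or null otherwise.
function findRepeatedPhrase(string $text): ?string
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);

    // Each later occurrence of the first word is a candidate phrase length.
    for ($period = 1; $period <= intdiv($count, 2); $period++) {
        if ($words[$period] !== $words[0]) {
            continue;
        }
        // Every word must equal the word one phrase length before it.
        $matches = true;
        for ($i = $period; $i < $count; $i++) {
            if ($words[$i] !== $words[$i - $period]) {
                $matches = false;
                break;
            }
        }
        if ($matches) {
            return implode(' ', array_slice($words, 0, $period));
        }
    }
    return null;
}

echo findRepeatedPhrase("I am bored I am bored I am bored"), "\n";
```

The comparison is case-sensitive and treats punctuation as part of a word, so "bored." and "bored" count as different words.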

Wolf_man

3:42 pm on May 9, 2008 (gmt 0)


This is what I came up with. It's probably not the most efficient, but hopefully it's a start:

    $input = "I am bored I am bored";
    $words = explode(" ", $input);   // explode() instead of split(), which was removed in PHP 7
    $pattern = array_shift($words);  // candidate phrase, grown one word at a time
    $temp = "";                      // longest candidate found at least twice so far
    $no_pattern = false;
    while (sizeof($words) > 0 && !$no_pattern) {
        // preg_quote() stops punctuation in the input from being read as regex syntax
        if (preg_match_all("/" . preg_quote($pattern, "/") . "/", $input, $hold) < 2) {
            $no_pattern = true;
        } else {
            $temp = $pattern;
            $pattern .= ' ' . array_shift($words);
        }
    }
    $pattern = $temp;
    echo "Repeated text is: " . $pattern;

I believe it's a similar idea to what darrenG said.

Also, this assumes that the pattern is at the beginning of the string.
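If the repeated phrase may start anywhere in the submission, a regex back-reference is one possible alternative. The sketch below is only an illustration: the function name findRepeatedRun and the $minRepeats parameter are assumptions, not anything from this thread. It looks for whole words that are immediately followed by copies of themselves. Backtracking can get slow on long inputs, so it is probably best applied to length-limited submissions.

```php
<?php
// Hypothetical helper (not from the thread): finds the first phrase that
// occurs at least $minRepeats times back to back, anywhere in the input.
function findRepeatedRun(string $text, int $minRepeats = 3): ?string
{
    // Collapse whitespace and add a trailing space so every word ends in " ".
    $normalized = preg_replace('/\s+/', ' ', trim($text)) . ' ';
    $extra = max(1, $minRepeats - 1);

    // (?<!\S)       start at a word boundary
    // ((?:\S+ )+?)  lazily capture one or more whole words
    // \1{n,}        the same words must follow at least n more times
    $regex = '/(?<!\S)((?:\S+ )+?)\1{' . $extra . ',}/';

    if (preg_match($regex, $normalized, $m) === 1) {
        return rtrim($m[1]);
    }
    return null; // no run found, or the regex engine gave up
}

echo findRepeatedRun("Hello there. I am bored I am bored I am bored"), "\n";
```

Because the capture is lazy, the shortest repeating unit is reported, and a phrase that appears only twice is ignored with the default $minRepeats of 3.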

[edited by: Wolf_man at 3:49 pm (utc) on May 9, 2008]

henry0

9:17 pm on May 9, 2008 (gmt 0)


It works and it doesn't :)
You need to take care of all the stop words: and, this, that, etc.
Plus, you cannot tell when a repeat is legitimate:
for example, a content page about "the green house" will surely repeat "the green house" a few times.

Furthermore, you cannot know in advance whether a given sentence is made of 3, 10, or any other number of words.
So you need a "pattern mode search",
pretty much like a search engine!
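A rough sketch of how those concerns might be handled by counting phrases of several lengths and flagging only the ones above a threshold. Everything here is an assumption for illustration: the function name frequentPhrases, the 3 to 6 word range, and the threshold of 5 are arbitrary. Starting at three words keeps single stop words from being flagged, and the threshold leaves room for a few legitimate repeats. Words are split on non-word characters, which is only reliable for ASCII text.

```php
<?php
// Hypothetical helper (not from the thread): counts every phrase of
// $minWords..$maxWords words and returns those seen at least $threshold
// times, as phrase => count, most frequent first.
function frequentPhrases(string $text, int $minWords = 3, int $maxWords = 6, int $threshold = 5): array
{
    // Lowercase and split on non-word characters (ASCII-only assumption).
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);

    $counts = [];
    for ($n = $minWords; $n <= $maxWords; $n++) {
        // Slide a window of $n words across the text.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    $frequent = array_filter($counts, function ($c) use ($threshold) {
        return $c >= $threshold;
    });
    arsort($frequent);
    return $frequent;
}

print_r(frequentPhrases(str_repeat("I am bored ", 12)));
```

A page that mentions "the green house" three times stays under the default threshold, while dozens of "I am bored" repetitions do not. The right threshold depends on how long typical submissions are.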