Madlib Spam

         

vabtz

3:29 pm on Aug 24, 2005 (gmt 0)



A competitor of mine is using madlib spam on his site. I thought this was a really ballsy move, since his site is well ranked and a recognized resource in his area.

The thing is, though, after doing some research I found out that the owner is a very well-known industry voice on SEM and SEO.

My question is: does anyone have any quantitative experience with madlib spam?

How long does it last in the SERPs?

My thoughts so far are that the ability to detect it would be inversely proportional to the size of the document and the density of the keywords.

The reason I think this would be true is that the computational cost of performing a one-to-one document comparison would be too high. Assuming this is true, the SEs need quick measures to find stuff like this. I can't think of a quick or computationally cheap method.

medowl

4:37 pm on Aug 26, 2005 (gmt 0)

10+ Year Member



>> My thoughts so far are that the ability to detect it would
>> be inversely proportional to the size of the document and
>> the density of the keywords.

Not necessarily. Because randomness is a part of most madlibbing, madlibs will eventually produce word combinations that are not found in normal text - a longer document might make it easier to spot certain patterns that rarely or never occur in real text, but do when a computer is jumbling words together. And the fact that madlib software uses some randomness doesn't mean that there isn't an algorithm at work, which can also leave a tell-tale signature - one that might be very faint in a small passage, but clear in a longer one.

There are lots of statistical measures that could be used to spot madlibbing (I don't know that they are being used, but they could). No technical measure will be perfect, but more could be done.

Yes, linguistic analysis takes computing power. But not all pages need to be analyzed. If a search engine just tried to identify and penalize pages in the top 10 or top 50 results for the 100,000 most popular search queries, they would do a lot to improve their product.

vabtz

7:13 pm on Aug 26, 2005 (gmt 0)



Oh, interesting. I hadn't considered random text to be a part of madlib.

The competitor that I am referring to is using a template and dropping his keywords into the template.

JayC

10:28 pm on Aug 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> The reason I think this would be true is that the computational cost of performing a one-to-one document comparison would be too high.

In fact, though, a search engine wouldn't have to compare every word on a page with every word on another page in order to attack the duplicate content problem. Just create a "sketch" of every document containing information about a number of samples taken from the content, then compare those much smaller sets of data to determine how fully the two sets intersect.

"Density of keywords" doesn't come into play, and you could compare only documents containing roughly the same amount of content with each other.
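To make the "sketch" idea concrete, here is a minimal illustration in Python. It is only one possible reading of the approach, not a description of what any search engine actually runs: it assumes the samples are overlapping word windows ("shingles") and that the sketch keeps the minimum hash value per seed (the min-hash technique). The function names, the shingle size `k`, and `num_hashes` are all made up for the example.

```python
import hashlib
import re

def shingles(text, k=3):
    """Return the set of overlapping k-word windows found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(text, num_hashes=64, k=3):
    """Small fixed-size summary: the minimum shingle hash for each seed."""
    sh = shingles(text, k)
    if not sh:
        return []
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]

def estimated_similarity(sketch_a, sketch_b):
    """Share of sketch positions that agree; approximates shingle-set overlap."""
    if not sketch_a or len(sketch_a) != len(sketch_b):
        return 0.0
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

page_a = ("We offer the best prices on blue widgets in the whole city and our "
          "friendly staff will help you find exactly what you need today")
page_b = page_a.replace("blue widgets", "red gadgets")
other = ("The quarterly report shows that rainfall in the northern valleys "
         "rose sharply during the spring months")

print(estimated_similarity(sketch(page_a), sketch(page_b)))  # high: same template
print(estimated_similarity(sketch(page_a), sketch(other)))   # near zero
```

Each page is reduced to 64 numbers once, so comparing two pages costs 64 comparisons regardless of how long the pages are.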

vabtz

12:29 am on Aug 27, 2005 (gmt 0)



That's an interesting way to attack that.

For example, the site in question has a forum. So they have literally over 100,000 pages to look at.

So looking at the intersection of words across the whole of the site would be very expensive.

Also, many of the tokens have little or no meaning (it, the, so, if, then), but they are going to have a high co-occurrence rate.

I dunno, I don't see how that would solve the problem in an efficient way.

Also, looking at the incidence of blocks of text would be difficult too.

hrm..

Does anyone know of any links to algorithms that score similarity between sets of tokens?

Any papers on this?

medowl

7:47 pm on Aug 27, 2005 (gmt 0)

10+ Year Member



>> The competitor that I am referring to is using a template and dropping his keywords into the template.

If the software uses a template, that will leave a signature; some templates are more complicated than others, but usually by introducing randomness, which can also be detected. The question is, how bad does someone want to detect that type of activity?
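As a rough illustration of how a template can leave a signature, the Python sketch below lines up the words of two pages and reports how much of the text matches in order, plus the runs that differ (the likely keyword slots). It uses the standard library's `difflib.SequenceMatcher`; the function names and sample pages are invented for the example, and nothing here is claimed to be what a search engine does.

```python
import difflib
import re

def tokens(text):
    """Lowercase word tokens of a page."""
    return re.findall(r"[a-z0-9']+", text.lower())

def template_overlap(text_a, text_b):
    """Fraction of tokens that line up, in order, between two pages (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(None, tokens(text_a), tokens(text_b),
                                      autojunk=False)
    return matcher.ratio()

def variable_slots(text_a, text_b):
    """Token runs that differ between the pages: candidate keyword slots."""
    a, b = tokens(text_a), tokens(text_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    slots = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            slots.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return slots

page_a = "Need a plumber in Denver? Our Denver plumber team is ready to help."
page_b = "Need a dentist in Austin? Our Austin dentist team is ready to help."

print(template_overlap(page_a, page_b))
print(variable_slots(page_a, page_b))
```

Two pages that share most of their word order and differ only in a few short runs are a reasonable hint that keywords were dropped into a common template. Pairwise alignment like this is too slow to run across a whole index, so it would only make sense on pages already flagged by a cheaper check.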

A good link into linguistic analysis software:
[sil.org...]

I worked with this type of software a few years ago: it analyses the frequency of bigrams (word pairs) and trigrams (sets of 3 consecutive words) to get a 'texture measure' of passages. It is possible to compare two texts for similarity, or determine if a particular passage is consistent with a larger reference body (i.e., non-spam English).
[speech.cs.cmu.edu...]
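
A toy version of the bigram/trigram idea, in Python, might look like the following. It is not the software linked above: it simply counts the n-grams in a reference text and reports what fraction of a passage's n-grams never appear there. The name `unseen_rate` and the tiny reference text are assumptions for the example; a real reference body would be far larger, and a real tool would use smoothed probabilities rather than a plain seen/unseen test.

```python
from collections import Counter
import re

def words(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    """Frequency table of n-grams in a reference text."""
    return Counter(ngrams(words(text), n))

def unseen_rate(passage, reference_counts, n):
    """Fraction of the passage's n-grams that never occur in the reference."""
    grams = ngrams(words(passage), n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in reference_counts)
    return unseen / len(grams)

reference = "the cat sat on the mat and the dog sat on the rug"
bigrams = ngram_counts(reference, 2)

print(unseen_rate("the dog sat on the mat", bigrams, 2))  # 0.0
print(unseen_rate("mat the on dog sat cat", bigrams, 2))  # 0.8
```

The jumbled passage uses the same words as the natural one, but four of its five word pairs never occur in the reference, which is the kind of signal a longer passage makes clearer.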

Book heaving Viagra jello amok? Mesothelioma with Sandy and Terry every weekend in the Hamptons? Know what I mean? Wink, Wink. Nudge, Nudge.

vabtz

2:50 am on Aug 28, 2005 (gmt 0)



Wow, thanks for the links and the thoughtful response.

I've got some reading to do now.

:-)