
Search Engines to detect cloaking?

     

benja

11:12 am on May 3, 2006 (gmt 0)

5+ Year Member



It's pretty obvious that SEs have a lot of checks to detect cloaking, such as JavaScript redirects, weird CSS and so on...

Now, the easiest way to detect cloaking is actually to "see" the page: look at the HTML code fetched by Google and compare it to what a browser displays.
So do you think SEs have the technology to, say, grab a screenshot of a page, analyze that screenshot the way text-recognition (OCR) software scans a document, and compare the result to the HTML?
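
I'm picturing something roughly like this. Just a sketch: the tools here (requests, Selenium, pytesseract, BeautifulSoup) are my own picks for illustration, not anything the engines are known to use.

```python
# Rough sketch only: fetch the page as a spider would, render it in a real
# browser, OCR the screenshot, and compare the two sets of words.
# Tool choices are illustrative assumptions, not what any engine runs.
import requests
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "http://example.com/"  # hypothetical page under test

# 1. What the spider sees: raw HTML fetched with a Googlebot-style user agent
spider_html = requests.get(
    URL, headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
).text
spider_words = set(BeautifulSoup(spider_html, "html.parser").get_text().lower().split())

# 2. What a human sees: screenshot of the rendered page, run through OCR
driver = webdriver.Firefox()
driver.get(URL)
driver.save_screenshot("page.png")
driver.quit()
visible_words = set(pytesseract.image_to_string(Image.open("page.png")).lower().split())

# Words served to the spider but never visible on screen are suspicious
suspicious = spider_words - visible_words
print(f"{len(suspicious)} words served to the spider but not visible in the render")
```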

I know it looks pretty complex, but is it really that complex for Google, for example?
They might even use the Google Toolbar to do it.

volatilegx

2:03 pm on May 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> It's pretty obvious that SE have a lot of checks to detect cloaking.

I assume you actually mean spam in this sentence and not cloaking.

> Now, the easiest to detect cloaking is actually to "see" the page.

The best way for a search engine to detect cloaking is to compare the cache from a "known" spider to the cache from an "unknown" spider. By known and unknown, I mean relative to the cloaker.

We suspect that the major search engines are running spiders under browser user agents from IPs not registered to them. The spiders would have to be programmed to act just like browsers, requesting images, sending HTTP_REFERER headers, etc., in order to "fly below the radar".
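
As a rough illustration of that comparison (the user agent strings and headers below are made up for the example, not what any engine really sends), you could fetch the same URL once as a declared spider and once disguised as a browser, then diff the two:

```python
# Sketch of the "known vs. unknown spider" comparison: request the same URL
# once with a declared crawler user agent and once disguised as an ordinary
# browser (browser UA plus a Referer header), then diff the two responses.
# Header values are illustrative only.
import difflib
import requests

URL = "http://example.com/"  # hypothetical page under test

known_spider = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
unknown_spider = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) "
                  "Gecko/20060426 Firefox/1.5.0.3",
    "Referer": "http://www.google.com/search?q=example",
}

declared = requests.get(URL, headers=known_spider).text
disguised = requests.get(URL, headers=unknown_spider).text

# Any difference between the two versions is a cloaking signal
diff = list(difflib.unified_diff(declared.splitlines(), disguised.splitlines(), lineterm=""))
print(f"{len(diff)} diff lines between the declared and disguised fetches")
```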

I also suspect Google uses information collected from its toolbar and accelerator. The major engines may also have deals with Alexa and/or other companies that spider a lot but aren't considered search engines.

I believe they use algorithms that analyze the text content of the page in their comparisons. I doubt they use actual screen shots for their comparisons.
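
Even something as crude as a word-overlap score would go a long way. This is pure guesswork on my part as to what they actually compute, but I imagine something along these lines:

```python
# Guesswork sketch: score how similar two fetched versions of a page are by
# Jaccard overlap of their word sets, instead of a byte-for-byte comparison.
# A low score for the same URL fetched two ways would flag likely cloaking.
from bs4 import BeautifulSoup

def visible_words(html: str) -> set[str]:
    """Lower-cased word set of the page's text content."""
    return set(BeautifulSoup(html, "html.parser").get_text().lower().split())

def similarity(html_a: str, html_b: str) -> float:
    """Jaccard similarity of the two pages' word sets (1.0 = identical)."""
    a, b = visible_words(html_a), visible_words(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# e.g. similarity(declared, disguised) < 0.5, using the two fetches from the
# earlier sketch, would be worth a closer look
```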

steveconnerie

7:15 pm on May 3, 2006 (gmt 0)

5+ Year Member



Benja - I have to agree that it wouldn't be rocket science to run text recognition on a screenshot of a website to detect cloaking, or even spamming as you mention, such as doorway pages and hidden text.

But then again, why can't SEs "understand" what is being displayed when there are open-source browsers such as Mozilla? That is to say, aren't SEs essentially just browsers that index the content they see?
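
With a rendering engine you can ask for exactly the text a browser lays out and compare it to what sits in the raw markup. Only a sketch, with Selenium driving Firefox standing in for whatever embedded engine an SE might build on:

```python
# Only a sketch: Selenium driving Firefox stands in for whatever embedded
# rendering engine a search engine might use. The element's .text property
# returns only text the engine considers displayed, so anything hidden via
# CSS or positioned off-screen falls out of the comparison.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "http://example.com/"  # hypothetical page under test

driver = webdriver.Firefox()
driver.get(URL)
displayed = set(driver.find_element(By.TAG_NAME, "body").text.lower().split())
driver.quit()

raw_html = requests.get(URL).text
in_markup = set(BeautifulSoup(raw_html, "html.parser").get_text().lower().split())

hidden = in_markup - displayed
print(f"{len(hidden)} words are in the markup but never actually rendered")
```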

I think there is something much much deeper at play here ....

 
