|How exactly do search engines detect cloaking?|
It's technically impossible without cheating.
How do search engines check cloaking? The only way this could be done is for the search engines' bots to crawl every page TWICE on a regular basis (instead of once), faking the User-agent header (say, to match MSIE 6) on one of the passes, and then compare the two pages to see if anything differs.
For some reason, I find it very hard to believe that Google etc. would do this. Sounds illegal and fraudulent etc.
Also, random elements, advertisements, and all sorts of other data can legitimately differ between two requests, making it very risky to treat any two versions of the same URL, even fetched within the same minute, as an attempt to "cloak".
But how would they be able to detect cloaking otherwise? FYI, I do cloaking (isn't it obvious?). But I only do it for things that should only be interesting to bots, such as certain META data and misc. other things that browsers never need. I don't see anything bad in this; I'm just trying to save bandwidth by not sending useless info to all clients.
So... can you shed any light on this?
Also, while I'm asking, why didn't HTTP include an "Is-robot" header? It would make things much easier. As it is now, we have to guess whether a request is from a robot or a human based on the User-agent. I currently do this by sniffing for "*bot*" and "*crawl*", but it's far from perfect, of course.
Even if it's not part of HTTP, why don't nice (non-malicious) bots send "Is-bot: true" or something? Would help me and others a lot to save bandwidth.
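For what it's worth, the kind of User-agent sniffing described above could be sketched like this. A minimal example; the substring list is just a guess at what such a check might match, and, as noted, it is far from perfect:

```python
# Hypothetical hint list; real bot user agents vary widely and this
# is nowhere near exhaustive -- exactly the imperfection mentioned above.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")

def looks_like_bot(user_agent: str) -> bool:
    """Guess whether a request comes from a robot, based only on User-agent."""
    ua = user_agent.lower()
    return any(hint in ua for hint in BOT_HINTS)

print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(looks_like_bot("Mozilla/4.0 (compatible; MSIE 6.0)"))        # False
```

Of course, a malicious bot can send any User-agent it likes, which is also why a voluntary "Is-bot: true" header would only ever identify the polite bots.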
|It's technically impossible without cheating. |
There aren't really any rules to the game, other than what the individual webmasters set, so how can they be cheating?
Webmasters can set rules by denying access to anybody they want, cloaking certain content, requiring user registration, etc.
Back to the main discussion, search engines try to discover cloakers by using a number of techniques:
- By visiting from an IP address that is not registered to their company.
- By visiting with a non-spider user agent.
- By comparing caches from different sources they own, e.g. their main spider and their page accelerator.
- By comparing caches from sources they don't own, e.g. their main spider and another company's cache.
- By using an algorithmic process to identify likely candidates, then having human editors verify them.
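To illustrate the double-fetch idea above (and the false-positive problem with ads and rotating content raised earlier in the thread), a checker would likely compare similarity rather than exact equality. A toy sketch, assuming the two fetches (spider UA vs. browser UA) have already been done elsewhere; the 0.8 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def cloaking_suspicion(page_as_spider: str, page_as_browser: str,
                       threshold: float = 0.8) -> bool:
    """Flag a URL as a cloaking *candidate* if the page served to the
    spider UA and the page served to a browser UA differ too much.
    Per the list above, a flag would only queue the page for human review."""
    ratio = SequenceMatcher(None, page_as_spider, page_as_browser).ratio()
    return ratio < threshold
```

Using a fuzzy ratio instead of a byte-for-byte comparison is what keeps rotating ads or a timestamp from tripping the flag on their own.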
|Also, while I'm asking, why didn't HTTP include an "Is-robot" header? |
It might be nice, but it would also be really easy to abuse. Why use the header... what benefit would it bring the owners of bots?
Great list there volatilegx!
How can you feasibly stay ahead of those bulleted points? I mean, there have to be some secrets that are still kept, yes?
That pretty much tells me that if you are cloaking based on the protocol, there are no issues at hand? I mean, of all the larger sites out there, what percentage of them do you think are using some method of cloaking?
My comments about protocol are referring mostly to IP Based Delivery.
> How can you feasibly stay ahead of those bulleted points?
There are some people developing algorithmic methods of detecting spiders/bots that try to hide themselves. incrediBILL comes to mind.
There is no way that I can think of to stop search engines from comparing caches of pages from various sources. That's why cloaking can be risky.
|Webmasters can set rules by denying access to anybody they want |
Indeed. And, in turn, search engines can set rules denying listing to those who try to game the system with cloaking.
It therefore becomes a question of who needs whom more: a webmaster wishing for free natural traffic from (say) Google, or Google having a few pages fewer out of the billions available.
|It therefore becomes a question of who needs whom more: a webmaster wishing for free natural traffic from (say) Google, or Google having a few pages fewer out of the billions available. |
What type of cloaking are we referring to here?
I mean, if I'm doing IP based delivery and controlling the bots while they are crawling my site, I would think I'm helping the search engines and they would actually appreciate the efforts I'm taking to keep certain things out of their indices.
For example, I might have a page that has all sorts of filters for the user to sort, display, etc. Googlebot is so good at indexing it will grab all of those filtered URIs. I sure don't want that to happen and I'm going to serve Googlebot a page that is minus the filters. Is there anything wrong with that?
Or maybe the page is heavy with <iframe> elements and other restrictive technologies. I may not display that to a bot. Is there anything wrong with that?
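The kind of bot-specific delivery described above might look something like this on the server side. Just a sketch, not tied to any particular framework; the spider list and the two pre-rendered page variants are hypothetical:

```python
# Hypothetical spider list; a production setup would more likely use
# IP-based verification than User-agent matching alone.
KNOWN_SPIDER_SUBSTRINGS = ("googlebot", "slurp", "msnbot")

def page_for(user_agent: str, full_page: str, stripped_page: str) -> str:
    """Serve spiders the page without the filter/sort widgets, so they
    don't crawl thousands of filtered URL permutations."""
    ua = user_agent.lower()
    if any(s in ua for s in KNOWN_SPIDER_SUBSTRINGS):
        return stripped_page   # same content, minus the filter links
    return full_page
```

The key point in the argument above is that the *content* stays the same; only crawl-trap navigation is removed for the bot.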
I believe this forum will become more active as we progress through this year and into 2008. The technical aspects of promoting larger scale sites revolve around what this forum is all about. :)
Maybe we should rename this forum to...
IP Based Delivery
That term Cloaking just has a negative connotation to it. Too many Star Trek episodes. :)
I think if this forum were called "Geo-Targeting" it would sound a lot better. IMHO, cloaking is very much a black-hat SEO term, and from a search engine's point of view it is a bad thing to do.
It is okay, however, to do geo-targeting on the basis of IP. Search engines themselves generate different SERPs depending on where the request came from. But that is different from generating different content for the IP/user-agent of the crawler: serving different content (say, a different language) to different countries is one thing, but serving different content to different search engines means you are gaming the system.
Cloaking with intent to deceive users is bad. Other than that, there are many many sites that cloak in various ways: Geo-targeting, server-side page adaptation to client capabilities (e.g. PDAs versus desktops), content negotiation, ephemeral page content removed for SE-cached pages, etc.
The word cloaking has been given a negative connotation by people prone to over-simplification and by the search engines' attempts to cater to those people in their "help" pages.
I have cloaked since before Google existed. I've reported problems to search engines and given them my site's URLs, and many times I have up-front told them as part of the problem description that the sites use the methods listed above. It has never been a problem.
It is cloaking with intent to deceive users that they don't like. If a competitor reports your site, and you are trying to deceive users by serving different content to spiders than to humans, then you're in trouble. But just because you serve different content to a spider than to a human does not ipso facto make it bad.
Google engineer #1: Hey look! -- This Clock thingy at the upper right corner of his page is different than in our cached version!
Google engineer #2: Nuke him! Nuke him! No wait, let's nuke the whole Class C range!
I don't think so...
The name of this forum is "Cloaking." If it were "Cloaking with intent to deceive users," then that would narrow the scope of discussion considerably. Only those truly guilty of cloaking with intent to deceive users need to worry about being caught.
Another trick available to Google (or any other search engine with a toolbar) is to use the hash the toolbar computes every time a user's browser fetches a web page and sends back to Google (to compute PageRank, phishing threat, etc.).
If this hash differs from the hash they have in their internal caches (from the main Googlebot, a "human" bot, etc.), that raises another flag marking the site as black hat.
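A hedged sketch of that hash-comparison idea. The actual fingerprint the toolbar reports isn't public; MD5 over the raw HTML is just a stand-in here:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Stand-in for whatever digest a toolbar might report back."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

def raises_flag(toolbar_html: str, cached_html: str) -> bool:
    """Flag the site if the page real users see (via the toolbar)
    doesn't match the copy in the engine's own cache."""
    return page_fingerprint(toolbar_html) != page_fingerprint(cached_html)

print(raises_flag("<html>users see this</html>",
                  "<html>spider saw that</html>"))   # True
```

Note that an exact-hash scheme would suffer the same false-positive problem discussed at the top of the thread: any rotating ad or clock widget changes the hash.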