Msg#: 3144625 posted 5:57 am on Nov 3, 2006 (gmt 0)
The two most common ways I've seen for people to check for the presence of GoogleBot (be it for the purposes of cloaking or not) are to check the IP address or to check the User-Agent string. The User-Agent string, however, isn't foolproof, since people can trivially spoof it. That leaves the IP address method as the more reliable one. The only thing I'm wondering about is... how does one go about finding what GoogleBot's IP addresses actually are?
You can't just check the User-Agent string and add every IP that presents the right one to a table, because then you'd have the same spoofing problem the User-Agent check has. You could get a list of IPs from some website, but how does that website figure out what GoogleBot's IP addresses are in the first place?
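For what it's worth, Google's own suggested answer to this (posted on their Webmaster Central blog in 2006) is a double DNS check: reverse-resolve the visiting IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve that hostname and make sure it maps back to the same IP. A spoofer can fake the User-Agent but can't fake Google's DNS. Here's a minimal sketch in Python; the function names are mine, not anything official:

```python
import socket

def hostname_is_googlebot(hostname):
    """True if a reverse-DNS hostname belongs to Google's crawler domains."""
    # Googlebot reverse DNS names end in googlebot.com or google.com
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Double DNS check: reverse-resolve the IP, confirm the hostname is
    Google's, then forward-resolve it back and confirm it matches the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]   # reverse lookup (PTR)
    except socket.herror:
        return False
    if not hostname_is_googlebot(hostname):
        return False
    try:
        # forward lookup must return the original IP, or the PTR was spoofed
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Note the forward lookup is the part most people skip: anyone who controls reverse DNS for their own IP block can make it claim to be crawl-x-x-x-x.googlebot.com, but they can't make googlebot.com's DNS answer back with their IP.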
I'm speaking at the Pubcon later this month on the topic of identifying search engine spiders. There are a number of techniques used to do it. Personally, my favorite is the semi-automated approach, where I have a CGI script that logs visits by users that behave in a certain way (like no HTTP_REFERER header, visiting pages rarely visited by humans, several requests in a short time, requests for html files but no images/css) and the script emails me the daily results of its logs. I then run the logs through another CGI script which tells me if any of the IP addresses are already in my lists. Then, I research all of the entries that are not in the lists.
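The footprints described above (no HTTP_REFERER, bursts of requests in a short window, HTML fetched with no images/CSS) could be scored in a batch script something like this. This is just an illustrative sketch of those heuristics, not the actual CGI script from the post; the thresholds and log format are assumptions:

```python
from collections import defaultdict

def find_suspected_spiders(log_entries, window=60, burst=10):
    """Flag IPs whose behavior matches spider footprints.

    log_entries: list of dicts with keys 'ip', 'time' (unix seconds),
    'path', 'referer'. An IP is flagged when it never sends a Referer,
    never fetches images or CSS, and makes `burst` requests within
    `window` seconds (thresholds are arbitrary examples)."""
    by_ip = defaultdict(list)
    for entry in log_entries:
        by_ip[entry["ip"]].append(entry)

    suspects = set()
    for ip, entries in by_ip.items():
        entries.sort(key=lambda e: e["time"])
        # footprint 1: no request from this IP carried a Referer header
        no_referer = all(not e["referer"] for e in entries)
        # footprint 2: fetched pages but never the images/stylesheets in them
        fetched_assets = any(
            e["path"].endswith((".gif", ".jpg", ".png", ".css"))
            for e in entries)
        # footprint 3: `burst` requests inside a `window`-second span
        times = [e["time"] for e in entries]
        bursty = any(times[i + burst - 1] - times[i] <= window
                     for i in range(len(times) - burst + 1))
        if no_referer and not fetched_assets and bursty:
            suspects.add(ip)
    return suspects
```

Flagged IPs would then be checked against your existing lists and researched by hand, as described above.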
Msg#: 3144625 posted 1:34 am on Nov 15, 2006 (gmt 0)
I think it's pretty well known that the SEs (at least G) send unidentified bots to spider sites and then compare the identified spider results vs. the unidentified spider results. So if you get caught by that trap, I guess you're SOL. That, and of course the human quality management team.
Volatilegx, that seems like a clever system you've got set up. Have you identified many sneaky spiders that way?
Msg#: 3144625 posted 9:17 am on Nov 16, 2006 (gmt 0)
Yeah, I have, but if the search engines are really determined to create stealth spiders, there is little we can do to detect them. It is easy for them to create spiders that don't leave normal spider footprints.
Actually, I think it's unnecessary for them to go to that trouble. Google, for instance, could use its Google Accelerator data instead of stealth spiders.