Cloaking Forum

    
detecting presence of GoogleBot
IP addresses vs. User-Agent strings
yawnmoth
5:57 am on Nov 3, 2006 (gmt 0)

The two most common ways I've seen for people to check for the presence of GoogleBot (be it for cloaking purposes or not) are to check either the IP address or the User-Agent string. The User-Agent string, however, isn't foolproof, since people can trivially spoof it. That leaves the IP address method as the more reliable one. The only thing I'm wondering is... how does one go about finding out what GoogleBot's IP addresses are?

You can't just watch for the GoogleBot User-Agent string and add every IP that sends it to a table, because then you'd have the same problem the User-Agent check has. You could get a list of IPs from some website, but how does that website figure out what GoogleBot's IP addresses are?
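To see how trivial the spoofing is, here's a minimal sketch in Python (example.com is just a placeholder URL; the User-Agent value is the one Googlebot publishes):

# Any HTTP client can claim to be Googlebot; the server sees only the header.
import urllib.request

req = urllib.request.Request(
    "http://example.com/",   # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)       # the server logs a "Googlebot" visit regardless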

 

volatilegx
5:26 pm on Nov 3, 2006 (gmt 0)

I maintain lists of IP addresses of search engine spiders at iplists.com. Also, check out the Search Engine Spider Identification Forum [webmasterworld.com].

I'm speaking at Pubcon later this month on the topic of identifying search engine spiders. There are a number of techniques for doing it. Personally, my favorite is the semi-automated approach: I have a CGI script that logs visits by users that behave in a certain way (no HTTP_REFERER header, visits to pages rarely viewed by humans, several requests in a short time, requests for HTML files but no images/CSS), and the script emails me its daily log results. I then run those logs through another CGI script that tells me whether any of the IP addresses are already in my lists. Then I research all the entries that are not in the lists.

Dan
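A rough sketch of the semi-automated filter volatilegx describes might look like the following, in Python rather than a CGI script. The combined log format, file name, asset extensions, and thresholds are all assumptions, and the email step is left out:

import re
from collections import defaultdict

# Apache "combined" format: ip ident user [time] "request" status bytes "referer" "ua"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d+ \S+ "(?P<referer>[^"]*)"'
)

ASSET_EXTS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def suspicious_ips(log_path, known_spiders, min_hits=5):
    """Yield IPs that behave like spiders and aren't in the known list."""
    hits = defaultdict(lambda: {"html": 0, "assets": 0, "no_ref": 0})
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.match(line)
            if not m:
                continue
            rec = hits[m.group("ip")]
            if m.group("path").lower().endswith(ASSET_EXTS):
                rec["assets"] += 1          # fetched an image/css/js file
            else:
                rec["html"] += 1            # fetched a page
            if m.group("referer") in ("", "-"):
                rec["no_ref"] += 1          # sent no HTTP_REFERER
    for ip, rec in hits.items():
        # bot-like: plenty of page fetches, zero asset fetches, never a referer
        if (rec["html"] >= min_hits and rec["assets"] == 0
                and rec["no_ref"] >= rec["html"] and ip not in known_spiders):
            yield ip  # candidate worth researching by hand

# e.g. for ip in suspicious_ips("access.log", {"66.249.66.1"}): print(ip)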

DanA
5:53 pm on Nov 3, 2006 (gmt 0)

If the user-agent string is spoofed, a reverse DNS lookup on the visitor's IP address won't show googlebot.com in the host name.
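That suggests the usual double-check: reverse-resolve the visiting IP, confirm the host name ends in googlebot.com, then forward-resolve that host name and make sure it maps back to the same IP, so a faked reverse (PTR) record fails the round trip. A minimal sketch using Python's standard socket module (the sample IP is one published Googlebot address):

import socket

def is_real_googlebot(ip):
    """Reverse-resolve ip, check the name, then forward-confirm it."""
    try:
        host = socket.gethostbyaddr(ip)[0]          # reverse DNS (PTR)
    except OSError:
        return False                                # no reverse record
    if not host.endswith(".googlebot.com"):
        return False                                # wrong domain
    try:
        forward = socket.gethostbyname_ex(host)[2]  # forward DNS (A)
    except OSError:
        return False
    return ip in forward   # spoofed reverse DNS fails this round trip

# e.g. is_real_googlebot("66.249.66.1") -> True for a genuine crawler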

brizad
1:34 am on Nov 15, 2006 (gmt 0)

I think it's pretty well known that the SEs (at least G) send unidentified bots to spider sites and then compare the identified spider's results against the unidentified spider's results. So if you get caught by that trap, I guess you're SOL. That, and of course the human quality management team.

Volatilegx, that seems like a clever system you've got set up. Have you identified many sneaky spiders that way?

volatilegx
9:17 am on Nov 16, 2006 (gmt 0)

Yeah, I have, but if the search engines are really determined to create stealth spiders, there is little we can do to detect them. It is easy for them to create spiders that don't leave normal spider footprints.

Actually, I think it is unnecessary for them to do so. Google, for instance, could use its Google Accelerator data instead of stealth spiders.
