Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Robot from colo is on behalf of Google?!
Colo ISP claims robot crawling my site is from 3rd party Google contractor

 2:56 am on Nov 12, 2007 (gmt 0)

Our site was being systematically crawled by a robot from a colocation ISP in Washington state (from IP 208.99.195.xx). I called them, and they said it was operated by a company that crawls data for search engines, specifically for Google. I told them that the User-Agent was plain vanilla Mozilla (Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)), but they insisted it was on behalf of Google.

Why would Google be crawling from IPs that are not theirs? Why would they be ignoring our global deny in our robots.txt file? Why would the User-Agent be obfuscated?

Is this a crawl to check for cloaking or the like? We blocked the entire colo with a 403 -- is that a mistake?
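
One way to test a "this is Google" claim is the reverse-then-forward DNS check that Google documents for verifying Googlebot: look up the PTR record for the IP, confirm the hostname sits under a Google crawler domain, then resolve that hostname and confirm it maps back to the same IP. Below is a minimal sketch of that check. The function name, the injectable lookup parameters, and the exact suffix list are assumptions for illustration; check the suffixes against Google's current documentation before relying on them.

```python
import socket

# Assumed suffix list; verify against Google's current crawler documentation.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip passes a reverse-then-forward DNS check.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without touching the network; by default they use the
    standard socket calls (IPv4 only for the forward lookup).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        # No PTR record or lookup failure: cannot be verified.
        return False

    # The PTR hostname must sit under one of the expected domains.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # Forward-confirm: the hostname must resolve back to the same IP,
        # otherwise the PTR record could simply be spoofed.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A request that arrives with a browser User-Agent from an address that fails this check gives you no technical evidence that it is Googlebot, whatever the ISP says on the phone.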



 3:04 am on Nov 12, 2007 (gmt 0)

Very interesting. It would be good to find out who that crawling company is and investigate further. A lot of cloakers would want to know all the other IPs used by that firm.


 3:34 am on Nov 12, 2007 (gmt 0)

To clarify, I didn't speak to the crawling company (which remains unknown) -- only to the colocation ISP. A member of the night staff at the colocation ISP let slip that the IP address was used by such a crawling company. (I suspect that he was not really supposed to tell me that.)


 3:39 am on Nov 12, 2007 (gmt 0)

Another clarification: We ourselves are not cloaking. However, we have a content-heavy site that is frequently ripped off and cloned for a quick contextual ad profit, so we are ruthless in blocking robots that we don't know (and the ISPs and even the entire countries from which those robots originate).
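
For anyone wanting to do the same kind of range blocking in application code rather than in the server config, here is a rough sketch using Python's standard `ipaddress` module. The `/24` network is an assumption that simply covers the 208.99.195.xx addresses quoted in the first post; the colo's real allocation is not known from this thread, and `BLOCKED_NETWORKS` and `status_for` are hypothetical names.

```python
import ipaddress

# Assumption: a /24 covering the 208.99.195.xx addresses quoted above.
# The ISP's actual allocation may be larger or smaller.
BLOCKED_NETWORKS = [ipaddress.ip_network("208.99.195.0/24")]


def status_for(remote_addr):
    """Return the HTTP status to serve for a client address."""
    addr = ipaddress.ip_address(remote_addr)
    # Membership test against each blocked network.
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403  # Forbidden
    return 200
```

Blocking a whole country works the same way in principle, just with a much longer list of networks taken from a geolocation database.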


 4:03 am on Nov 12, 2007 (gmt 0)

After looking up info related to that IP address range, I'm also serving them 403s -- That whole operation looks to be fairly questionable, IMO.

No robots.txt compliance, no crawling.
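
To make "compliance" concrete: a global deny like the one described in the first post tells every user agent to stay out, so a well-behaved crawler should not request any page at all. The sketch below uses Python's standard `urllib.robotparser` to show how a compliant robot would evaluate such a file. The file contents are an assumed typical global deny, not the poster's actual robots.txt, and `may_crawl` and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a typical "global deny" robots.txt.
GLOBAL_DENY = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler would run this check before every request.
allowed = may_crawl(GLOBAL_DENY, "Mozilla/4.0", "http://www.example.com/page.html")
```

With the global deny above, `allowed` comes out `False` for any user agent and any path, which is why continued crawling from that range points to a robot that never consults robots.txt.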



All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved