Robot from colo is on behalf of Google?!

Forum Moderators: goodroi

Colo ISP claims robot crawling my site is from 3rd party Google contractor

yodokame

2:56 am on Nov 12, 2007 (gmt 0)

Our site was being systematically crawled by a robot from a colocation ISP in Washington state (from IP 208.99.195.xx). I called them up, and they said it was a crawler operated by a crawling company that crawls data for search engines, specifically for Google. I told them that the User-Agent was plain vanilla Mozilla (Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)), but they insisted it was on behalf of Google.

Why would Google be crawling from IPs that are not theirs? Why would they be ignoring our global deny in our robots.txt file? Why would the User-Agent be obfuscated?

Is this a crawl to check for cloaking or the like? We blocked the entire colo with a 403 -- is that a mistake?
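One way to test a "we crawl for Google" claim is the reverse-then-forward DNS check: look up the hostname for the IP, confirm it sits under a Google crawler domain, then resolve that hostname and confirm it maps back to the same IP. Below is a minimal Python sketch of that idea. The function name `is_verified_googlebot`, the injectable resolver parameters, and the exact suffix list are assumptions made for illustration, not anything the ISP or Google supplied in this thread.

```python
import socket

# Hostname suffixes assumed to identify Google's crawlers (an assumption;
# check Google's current documentation before relying on this list).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if ip reverse-resolves to a Google crawler hostname
    and that hostname forward-resolves back to the same ip."""
    # The resolvers are injectable so the logic can be tried without a network.
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # PTR record is not under a Google crawler domain
        return ip in forward_lookup(host)  # hostname must map back to the IP
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        return False
```

Called as `is_verified_googlebot(remote_ip)` with the default resolvers, this does two live DNS lookups, so in practice the result would be cached rather than computed on every request. An IP in a third-party colo range would be expected to fail the first step, whatever User-Agent it sends.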

vincevincevince

3:04 am on Nov 12, 2007 (gmt 0)

It is very interesting; it would be good to find out who that crawling company is and investigate further. There are a lot of cloakers who would need to be able to check up on all the other IPs used by the firm.

yodokame

3:34 am on Nov 12, 2007 (gmt 0)

To clarify, I didn't speak to the crawling company (which remains unknown) -- only to the colocation ISP. A member of the night staff at the colocation ISP let slip that the IP address was used by such a crawling company. (I suspect that he was not really supposed to tell me that.)

yodokame

3:39 am on Nov 12, 2007 (gmt 0)

Another clarification: We ourselves are not cloaking. However, we have a content-heavy site that is frequently ripped off and cloned for a quick contextual ad profit, so we are ruthless in blocking robots that we don't know (and the ISPs and even the entire countries from which those robots originate).
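For anyone wanting to do the same kind of range-based blocking in application code rather than in the server config, here is a rough Python sketch. The networks listed are reserved documentation ranges standing in for real ISP blocks, and `BLOCKED_NETWORKS` and `status_for` are hypothetical names; this is not the poster's actual setup.

```python
import ipaddress

# Hypothetical blocklist: reserved documentation ranges used as placeholders
# for the ISP or colo networks a site would actually want to refuse.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def status_for(remote_addr):
    """Return the HTTP status to serve for a client address:
    403 if it falls inside a blocked network, 400 if it does not parse,
    200 otherwise."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return 400  # not a valid IPv4 or IPv6 address
    if any(addr in network for network in BLOCKED_NETWORKS):
        return 403
    return 200
```

With these placeholder ranges, `status_for("192.0.2.77")` gives 403 and `status_for("203.0.113.9")` gives 200. Blocking by whole network is blunt, so it is worth checking logs afterward for legitimate visitors caught in the same range.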

jdMorgan

4:03 am on Nov 12, 2007 (gmt 0)

After looking up info related to that IP address range, I'm also serving them 403s -- That whole operation looks to be fairly questionable, IMO.

No robots.txt compliance, no crawling.
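To illustrate what compliance would look like from the crawler's side, here is a small Python sketch using the standard library's `urllib.robotparser`. The `GLOBAL_DENY` text is an assumed example of a global deny like the one described earlier in the thread, and `may_crawl` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed example of a global deny: every user agent, every path.
GLOBAL_DENY = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network fetch
    return parser.can_fetch(user_agent, url)
```

A well-behaved robot would run a check like `may_crawl(GLOBAL_DENY, "Mozilla/4.0", "http://example.com/page.html")` before each request and stay away when it returns False. Since robots.txt is purely advisory, a robot that skips this step can only be stopped on the server side, for example with a 403.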

Jim