Our site was being systematically crawled by a robot from a colocation ISP in Washington state (IP 208.99.195.xx). I called the ISP, and they said it was a crawler operated by a company that crawls data for search engines, specifically for Google. I pointed out that the User-Agent was a plain-vanilla browser string (Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)), but they insisted it was crawling on behalf of Google.
Why would Google be crawling from IPs that are not theirs? Why would they be ignoring our global deny in our robots.txt file? Why would the User-Agent be obfuscated?
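For what it's worth, genuine Googlebot traffic can be checked without trusting the User-Agent at all: do a reverse-DNS lookup on the IP (the hostname should end in googlebot.com or google.com), then a forward lookup to confirm the hostname resolves back to the same IP. A rough sketch in Python — the function names are mine, not any standard API:

```python
import socket

# Domains Google publishes for its crawler hostnames
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_googlebot(hostname: str) -> bool:
    """Check whether a PTR hostname falls under Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLEBOT_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the domain, then forward-confirm."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]   # reverse lookup (PTR)
    except socket.herror:
        return False
    if not hostname_is_googlebot(hostname):
        return False
    try:
        # Forward lookup must return the original IP, or the PTR is spoofed
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

Any crawler that fails this check is merely claiming to crawl for Google; the 208.99.195.xx addresses in question would fail it immediately.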
Is this a crawl to check for cloaking or the like? We blocked the entire colo with a 403 -- is that a mistake?
To clarify, I didn't speak to the crawling company (which remains unknown) -- only to the colocation ISP. A member of the night staff at the colocation ISP let slip that the IP address was used by such a crawling company. (I suspect that he was not really supposed to tell me that.)
Another clarification: We ourselves are not cloaking. However, we have a content-heavy site that is frequently ripped off and cloned for a quick contextual ad profit, so we are ruthless in blocking robots that we don't know (and the ISPs and even the entire countries from which those robots originate).
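For anyone wanting to apply the same kind of blanket block, a sketch of what our 403 rule might look like as Apache 2.4 config (the /24 below is the colo's netblock from above; this is illustrative, not our exact ruleset):

```apache
# .htaccess or vhost config (Apache 2.4+)
<RequireAll>
    Require all granted
    # Return 403 to the colocation ISP's entire netblock
    Require not ip 208.99.195.0/24
</RequireAll>
```

Blocking at the firewall instead (dropping packets rather than returning 403) is another option, but the 403 at least tells a legitimate operator why they were refused.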