
Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Robot from colo is on behalf of Google?!
Colo ISP claims robot crawling my site is from 3rd party Google contractor
yodokame




msg:3502487
 2:56 am on Nov 12, 2007 (gmt 0)

Our site was being systematically crawled by a robot from a colocation ISP in Washington state (from IP 208.99.195.xx). I called them up, and they said it was operated by a company that crawls data for search engines, specifically for Google. I told them that the User-Agent was plain vanilla Mozilla (Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)), but they insisted it was on behalf of Google.

Why would Google be crawling from IPs that are not theirs? Why would they be ignoring our global deny in our robots.txt file? Why would the User-Agent be obfuscated?

Is this a crawl to check for cloaking or the like? We blocked the entire colo with a 403 -- is that a mistake?
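
One common way to test a "we crawl for Google" claim is a reverse DNS lookup on the IP followed by a forward lookup on the returned hostname. The Python below is only a minimal sketch of that check, not anything from the posters. The function name `is_google_crawler`, the injectable `reverse_lookup` / `forward_lookup` parameters, and the suffix list are assumptions made for illustration; it handles IPv4 only, and the suffixes should be checked against Google's current documentation.

```python
import socket

# Assumed hostname suffixes for Google crawlers; verify against current docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if reverse DNS of `ip` gives a Google hostname
    and a forward lookup of that hostname maps back to the same `ip`.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without network access; by default they use the socket module.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # reverse DNS does not point at a Google domain
        # Forward-confirm: the hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # No PTR record or failed lookup: treat as not verified.
        return False
```

Calling `is_google_crawler("<visitor ip>")` with the defaults performs live DNS lookups. An address at a third-party colo that has no Google hostname in reverse DNS would come back as not verified, whatever the ISP says on the phone.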

 

vincevincevince




msg:3502496
 3:04 am on Nov 12, 2007 (gmt 0)

Very interesting. It would be good to find out who that crawling company is and investigate further. There are a lot of cloakers who would need to be able to check all the other IPs used by the firm.

yodokame




msg:3502499
 3:34 am on Nov 12, 2007 (gmt 0)

To clarify, I didn't speak to the crawling company (which remains unknown) -- only to the colocation ISP. A member of the night staff at the colocation ISP let slip that the IP address was used by such a crawling company. (I suspect that he was not really supposed to tell me that.)

yodokame




msg:3502500
 3:39 am on Nov 12, 2007 (gmt 0)

Another clarification: We ourselves are not cloaking. However, we have a content-heavy site that is frequently ripped off and cloned for a quick contextual ad profit, so we are ruthless in blocking robots that we don't know (and the ISPs and even the entire countries from which those robots originate).
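
Blocking a whole ISP usually comes down to matching the visitor's address against a list of network ranges and answering 403 on a match. Here is a rough sketch of that decision in Python. The name `status_for` and the `BLOCKED_NETWORKS` list are hypothetical, and the ranges are documentation-only addresses rather than the ones discussed in this thread, since the real range is only partly quoted above.

```python
import ipaddress

# Hypothetical block list built from documentation-only ranges (RFC 5737).
# These are placeholders, not the colo range mentioned in the thread.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def status_for(ip):
    """Return the HTTP status to serve: 403 if `ip` falls inside any
    blocked network, otherwise 200."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    return 200
```

In practice this kind of rule normally lives in the web server or firewall configuration rather than in application code. The sketch only illustrates the range match itself.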

jdMorgan




msg:3502505
 4:03 am on Nov 12, 2007 (gmt 0)

After looking up info related to that IP address range, I'm also serving them 403s -- That whole operation looks to be fairly questionable, IMO.

No robots.txt compliance, No crawling.

Jim
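
For reference, a global deny in robots.txt is just a `User-agent: *` record with `Disallow: /`. The sketch below uses Python's standard `urllib.robotparser` to show how a compliant crawler would read such a file. The helper name `allowed` and the sample file contents are assumptions for illustration, not the original poster's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed example of a global deny; the poster's real file is not shown.
GLOBAL_DENY = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, path):
    """Return True if a compliant crawler identifying as `user_agent`
    may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With `GLOBAL_DENY`, `allowed(GLOBAL_DENY, "Mozilla/4.0", "/any/page.html")` evaluates to `False`. robots.txt is only advisory, though: a crawler that never reads it, or ignores it, is not stopped by it, which is why a 403 at the server is the fallback.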


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved