2002
crawl4.googlebot.com - - [06/Mar/2002:05:52:29 -0800] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
crawl4.googlebot.com - - [08/Apr/2002:12:04:55 -0700] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
crawl5.googlebot.com - - [01/May/2002:18:25:27 -0700] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
2006
crawl-66-249-66-52.googlebot.com - - [01/Jun/2006:00:02:01 -0700] "GET /robots.txt HTTP/1.1" 200 9770 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(no compatible; and no +http://www.google.com/bot.html, just Googlebot/2.1 as the entire UA)
66.249.65.65 5/31
66.249.66.101 4/26, 5/02
66.249.66.106 5/10 to 5/24
66.249.66.114 4/27
66.249.66.115 4/15 to 4/25
66.249.66.116 5/03 to 5/04
66.249.66.200 3/08, 3/31, 4/4 to 4/14
66.249.66.243 4/30, 5/01
66.249.66.73 4/28
66.249.66.75 5/05 to 5/09
66.249.66.99 5/25 to 5/30
Kept meaning to ask about them, too.
IP address: 66.249.65.65
Reverse DNS: crawl-66-249-65-65.googlebot.com
IP address: 66.249.66.101
Reverse DNS: crawl-66-249-66-101.googlebot.com
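For anyone who wants to script the reverse DNS lookups above, here is a rough Python sketch of a forward-confirmed check: look up the PTR name, require it to end in .googlebot.com, then resolve that name forward and make sure it maps back to the same IP. The function name and the injectable lookup parameters are my own, for illustration only; only the .googlebot.com suffix comes from the results above.

```python
import socket


def is_googlebot_host(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    reverse_lookup and forward_lookup are hypothetical hooks so the logic
    can be exercised without touching the network; by default they use
    the standard socket resolver calls.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        # No PTR record, or the lookup failed.
        return False

    # The PTR name must sit under googlebot.com, as in the results above.
    if not host.lower().rstrip(".").endswith(".googlebot.com"):
        return False

    try:
        # A PTR record can be set to anything by whoever controls the IP
        # block, so confirm the name resolves back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling is_googlebot_host("66.249.65.65") with the defaults does live DNS lookups, so the answer depends on what the resolver returns at the time.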
FWIW redux, we talked about something kind of related a couple of weeks ago, about how non-googlebot UAs are using G's IPs as proxies [webmasterworld.com]. I now block a slew of G's IPs because Googlebot never came through them, but iffy visitors did.
Also, in addition to the missing 'parts' of the UA Gary reported, note the incorrect capitalization --
GoogleBot/2.1
-- and here's the typical form:
Googlebot/2.1
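The capitalization difference is easy to miss by eye, so a case-sensitive comparison can flag it. A minimal Python sketch, assuming "Googlebot/2.1" is the token to expect (taken from the log samples in this thread); classify_ua and its return labels are hypothetical names:

```python
# Token as it appears in the genuine log samples quoted in this thread.
EXPECTED_TOKEN = "Googlebot/2.1"


def classify_ua(user_agent):
    """Return 'exact', 'case-mismatch', or 'absent' for the Googlebot token."""
    if EXPECTED_TOKEN in user_agent:
        return "exact"
    if EXPECTED_TOKEN.lower() in user_agent.lower():
        # Same letters, different capitalization, e.g. "GoogleBot/2.1".
        return "case-mismatch"
    return "absent"
```

A "case-mismatch" result is only a hint that the UA deserves a closer look; it says nothing by itself about who sent the request.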
Now, might "GoogleBot" be a beta version, or even a brand-new one? I guess. So I'd assess it by the usual checks -- except it fails the Googlebot ID test, ditto the googlebot.com Host test. So, let's see.
Did it ask for robots.txt? Did it heed it?
Absent more info, or confirmation from G one way or another, I'm betting it's a fake, using G's IP(s) as a proxy.
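The "did it ask for robots.txt?" question can be answered from the access log itself. A rough Python sketch, assuming the log uses the combined format shown in the samples at the top of the thread; the regex and the function name are my own and may need adjusting for other log layouts:

```python
import re

# Combined log format, as in the samples quoted earlier in this thread.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def hosts_that_fetched_robots(lines):
    """Return the set of client hosts that sent GET /robots.txt."""
    hosts = set()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("method") == "GET" and m.group("path") == "/robots.txt":
            hosts.add(m.group("host"))
    return hosts
```

Whether the visitor then heeded robots.txt still has to be checked by comparing its later requests against the disallowed paths.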
Then again, there's the new, dedicated AdsBot-Google [webmasterworld.com] for AdWords, so beats me whether "GoogleBot/2.1" is also AdWords-related.
Well, that sure clears things up -- clear as mud, that is :) Sorry!
I don't do AdWords, but I do have AdSense, and the bot for that is Mediapartners-Google/2.1, which then works its way through with the UA "compatible; Googlebot/2.1; +http://www.google.com/bot.html". Coincidence? No idea ;)
All you have to do to verify it's Google is do a WHOIS <ip> and it will typically show "OrgName: Google Inc." if it's really them.
Matter of fact, I bounce everything claiming to be Google not hosted on a Google block of IPs, just send it packing.
Why, you might ask?
Because you'll find Google actually crawling through proxy servers that cloak directories of websites to Google. Google then crawls your site through the proxy, and pages can be hijacked this way because the SEs are stupid.
This is another reason I block proxies too, so when SEs can't crawl through them it's no problem.
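The bounce rule described above can be sketched in a few lines of Python. The 66.249.64.0/19 range is an assumption on my part: it covers the 66.249.65.x and 66.249.66.x addresses listed earlier in the thread, but it should be confirmed with WHOIS before anyone relies on it. should_bounce and GOOGLE_NETS are hypothetical names:

```python
import ipaddress

# Assumption: illustrative block covering the 66.249.65.x / 66.249.66.x
# addresses listed in this thread. Confirm with WHOIS before relying on it.
GOOGLE_NETS = [ipaddress.ip_network("66.249.64.0/19")]


def should_bounce(ip, user_agent, nets=GOOGLE_NETS):
    """True if the UA claims to be Google but the IP is outside the blocks."""
    if "google" not in user_agent.lower():
        # Not claiming to be Google, so this rule does not apply.
        return False
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in nets)
```

Note this only catches fake Google UAs coming from non-Google IPs; it does nothing about a spoofed UA routed through a Google-owned address, which is the case being debated here.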
Yeah, but the point is that we think people are spoofing a user agent and routing through a Google-owned proxy.
If Google permits that, then they get what they deserve. There's not a lot we can do about it unless they cough up a definitive list of crawler IPs.
However, even if they were, most people don't use NOARCHIVE, so the server caches your pages anyway and the content can be had there as well.
I don't do PHP either. Most of my code is compiled into .dll files written in C++ and VB.NET.
I was hoping there was some other way to determine if it was someone using a proxy
or Google provides
[google.com...]
SamSpade.org keeps track of many items, most of them for reasons related to mail spam.