What's the IP, Gary?
Going back a while, all my logs ever show are these kinds of host+string combos in connection with "Googlebot/2.1" (from two different sites' logs):
crawl4.googlebot.com - - [06/Mar/2002:05:52:29 -0800] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
crawl4.googlebot.com - - [08/Apr/2002:12:04:55 -0700] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
crawl5.googlebot.com - - [01/May/2002:18:25:27 -0700] "GET /robots.txt HTTP/1.0" 200 204 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
crawl-66-249-66-52.googlebot.com - - [01/Jun/2006:00:02:01 -0700] "GET /robots.txt HTTP/1.1" 200 9770 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
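For anyone pulling these apart by hand: the fields in entries like the ones above follow the standard Apache combined log format, so a small regex gets you the host and UA. A generic sketch (the regex and function name are mine, not from anyone's actual setup):

```python
import re

# Standard combined log format: host, identity, user, time,
# request, status, bytes, referer, user-agent.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('crawl4.googlebot.com - - [06/Mar/2002:05:52:29 -0800] '
        '"GET /robots.txt HTTP/1.0" 200 204 "-" '
        '"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"')
entry = parse_line(line)
print(entry["host"])   # crawl4.googlebot.com
print(entry["agent"])  # Googlebot/2.1 (+http://www.googlebot.com/bot.html)
```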
I started seeing this one in early March, coming from the IPs on the dates listed below:
(no "compatible;" and no "+http://www.google.com/bot.html", just "Googlebot/2.1" as the entire UA)
18.104.22.168 4/26, 5/02
22.214.171.124 5/10 to 5/24
126.96.36.199 4/15 to 4/25
188.8.131.52 5/03 to 5/04
184.108.40.206 3/08, 3/31, 4/4 to 4/14
220.127.116.11 4/30, 5/01
18.104.22.168 5/05 to 5/09
22.214.171.124 5/25 to 5/30
Kept meaning to ask about them, too.
FWIW, picking a couple at random:
IP address: 126.96.36.199
Reverse DNS: crawl-66-249-65-65.googlebot.com
IP address: 188.8.131.52
Reverse DNS: crawl-66-249-66-101.googlebot.com
FWIW redux, we talked about something kind of related a couple of weeks ago, about how non-googlebot UAs are using G's IPs as proxies [webmasterworld.com]. I now block a slew of G's IPs because Googlebot never came through them, but iffy visitors did.
Dan, the IP address was 184.108.40.206.
Pfui I'll check that thread out. Thanks.
Thanks everyone else.
Thanks Pfui. I knew they all came from Google IPs and were crawlers. My question is the same as GaryK's (I think), why the incomplete UA for Googlebot?
Or do you mean these are not actually visits by a crawler, but someone using G's IP as a proxy?
Whois comes back to Google for that IP #; it's within the NetRange: 220.127.116.11 - 18.104.22.168.
It's not spoofing.
Can't connect directly, so probably a spider, maybe a special one. -Larry
Nancy, yes that's the essence of my question. I just need to know if this is a legitimate crawler from Google.
It wouldn't surprise me at all if someone was spoofing Googlebot through G's IPs. In four years, I've never seen Googlebot come in from other than .googlebot.com and in recent months I've seen all too many NON Google UAs come in through G's IPs, and go where Googlebot is not allowed to go.
Also, in addition to the missing 'parts' of the UA Gary reported, note the incorrect capitalization --
GoogleBot/2.1
-- and here's the typical form:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Now, might "GoogleBot" be a beta version, or even a brand-new one? I guess. So I'd assess it by the usual checks -- except it fails the Googlebot ID test, ditto the googlebot.com Host test. So, let's see.
Did it ask for robots.txt? Did it heed it?
Absent more info, or confirmation from G one way or another, I'm betting it's a fake, using G's IP(s) as a proxy.
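For what it's worth, the "googlebot.com Host test" above can be scripted as a forward-confirmed reverse DNS check: look up the PTR name for the IP, confirm it ends in .googlebot.com, then resolve that name forward and confirm it comes back to the same IP. A minimal sketch (function names are mine, and the lookups of course need network access):

```python
import socket

def host_looks_like_googlebot(hostname):
    """Pure check: is the PTR name under googlebot.com?"""
    return hostname.rstrip(".").endswith(".googlebot.com")

def verify_googlebot(ip):
    """Forward-confirmed reverse DNS: the PTR must be *.googlebot.com
    AND must resolve back to the same IP we started with."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
    except socket.herror:
        return False
    if not host_looks_like_googlebot(hostname):
        return False
    try:
        forward = socket.gethostbyname(hostname)    # forward lookup
    except socket.gaierror:
        return False
    return forward == ip
```

The forward step matters: anyone controlling reverse DNS for their own IPs can make a PTR record say "googlebot.com", but they can't make Google's forward zone point that name back at their address.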
Gary, nancy, do you two 'do' Google ads? Because here's some more info [webmasterworld.com] from a current thread in Google Search News [webmasterworld.com]. The poster describes your same situ, says it's AdWords.
Then again, there's the new, dedicated AdsBot-Google [webmasterworld.com] for AdWords, so beats me whether, or if, "GoogleBot/2.1" is also AdWords-related?
Well, that sure clears things up -- clear as mud, that is:) Sorry!
> Absent more info, or confirmation from G one way or another, I'm betting it's a fake, using G's IP(s) as a proxy.
Seems likely to me, too.
|Gary, nancy, do you two 'do' Google ads? |
I tried AdSense for a few months last year and then stopped using it.
So, in the absence of anything official from Google I guess we're considering this user agent a faker?
The robot with this UA "Googlebot/2.1", coming from 66.249.72.*, visits only the pages I'm advertising in AdWords. It does check robots.txt.
I haven't put up any ads in June, so I have no data on the editors' hand-checking. Prior to June the editor would visit from 66.102.6.*
My post last night evidently got lost in cyberspace ...
|Gary, nancy, do you two 'do' Google ads? |
No, never ads of any kind.
Also, "my" googlebot is Googlebot/2.1 - notice the bot name is in its correct form, without the capital B.
Is there some way we can get Google to confirm if either of these two bots is legitimate?
I haven't seen this one on my sites yet, but would add it's not totally implausible that it's something to do with Adwords.
I don't do AdWords, but I do have AdSense, and the bot for that is Mediapartners-Google/2.1, which then works its way through with the UA "compatible; Googlebot/2.1; +http://www.google.com/bot.html". Coincidence? No idea ;)
Determining if it's Google isn't hard.
All you have to do to verify it's Google is do a WHOIS <ip> and it will typically show "OrgName: Google Inc." if it's really them.
Matter of fact, I bounce everything claiming to be Google not hosted on a Google block of IPs, just send it packing.
Why, you might ask?
Because you'll find Google actually crawling through proxy servers that cloak websites' directories to Google; Google then crawls your site through the proxy, and pages can be hijacked in this manner because the SEs are stupid.
This is another reason I block proxies too: when SEs can't crawl through them, it's no problem.
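The WHOIS check mentioned above is easy to automate: shell out to a whois client (assuming one is installed) and look for the OrgName line. The parsing helper below is the only part shown with a test, since actual whois output varies by registry, and the sample text here is made up for illustration:

```python
import subprocess

def org_name(whois_text):
    """Pull the OrgName: field out of ARIN-style whois output."""
    for line in whois_text.splitlines():
        if line.startswith("OrgName:"):
            return line.split(":", 1)[1].strip()
    return None

def whois_org(ip):
    """Run the whois CLI (must be installed) and return the OrgName."""
    out = subprocess.run(["whois", ip], capture_output=True, text=True)
    return org_name(out.stdout)

# Parsing only, on an invented sample of ARIN-style output:
sample = "NetRange: 66.249.64.0 - 66.249.95.255\nOrgName: Google Inc.\n"
print(org_name(sample))  # Google Inc.
```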
|All you have to do to verify it's Google is do a WHOIS <ip> and it will typically show "OrgName: Google Inc." if it's really them. |
Yeah but the point is that we think people are spoofing a user agent and routing through a google-owned proxy.
Thanks, volatilegx, for that clarification.
Can you explain if there is a way to determine if someone is spoofing through a google-owned proxy?
BTW, I don't do AdSense or AdWords, and I get many visits every day from the Mediapartners bot.
|Yeah but the point is that we think people are spoofing a user agent and routing through a google-owned proxy |
If Google permits that, then they get what they deserve; not a lot we can do about it unless they cough up a definitive list of crawler IPs.
However, even if they did, most people don't use NOARCHIVE, so the engine caches your pages anyway and the content can be had there as well.
Nancy, the way I do it is via Dan's list of IP Addresses. Basically if a user agent claims to be from Google and it's not on Dan's list I serve it a robots.txt file that disallows everything. If it's not really from Google but ignores robots.txt and starts crawling it will quickly fall into a spider trap that lets me know about it so I can investigate further.
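That gate boils down to one decision: if the UA claims to be from Google but the IP isn't on the known-Google list, answer the robots.txt request with a disallow-everything file. A minimal sketch of just that decision (the allowlist prefixes here are placeholders for illustration, not Dan's actual list):

```python
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Placeholder set -- in practice this would be a maintained list of
# verified Google crawler networks, not these two example prefixes.
GOOGLE_PREFIXES = ("66.249.64.", "66.249.65.")

def robots_for(ip, user_agent):
    """Serve a deny-all robots.txt to anything claiming to be
    Googlebot from an IP that isn't on the known-Google list."""
    claims_google = "googlebot" in user_agent.lower()
    from_google_ip = ip.startswith(GOOGLE_PREFIXES)
    if claims_google and not from_google_ip:
        return DISALLOW_ALL
    return ALLOW_ALL
```

A fake bot that ignores the deny-all file and keeps crawling has just proven it doesn't honor robots.txt, which is exactly what the spider trap then catches.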
Thanks GaryK. I know about Dan's list, but I just don't have the time to learn how to create a spider trap (don't know a thing about PHP). I was hoping there was some other way to determine if it was someone using a proxy.
Of course ... that's why you all have spider traps, right? :)
That's part of the reason Nancy. Another reason is to stop abusive spiders from bringing a website to a screeching halt. For example, if I see a user agent taking more pages at a time than anyone could possibly read, or even skim, I will stop them dead in their tracks so my users don't have to suffer from a slow or non-responsive server.
I don't do PHP either. Most of my code is compiled into .dll files written in C++ and VB.NET.
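The "taking more pages than anyone could possibly read" test is just request counting in a time window. A toy sliding-window throttle in Python, with thresholds invented purely for illustration:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 20  # more page fetches in 10s than a human would make

_hits = defaultdict(deque)  # per-IP timestamps of recent requests

def too_fast(ip, now=None):
    """True once an IP exceeds MAX_REQUESTS within WINDOW_SECONDS."""
    now = time.time() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

Call it once per request; when it returns True, serve a 403 (or tarpit the connection) so the rest of the server stays responsive.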
|I was hoping there was some other way to determine if it was someone using a proxy |
or unless Google provides one. Google keeps track of many items, most of which exist for reasons related to mail spam.