Forum Moderators: open
208.109.8.nnn - - [18/Jan/2009:07:20:22 -0700] "GET www.example.html HTTP/1.1" 403 474 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
All the pages showing as 4xx errors in GWT matched up with requests 403'd from this range belonging to Godaddy: 208.109.0.0 - 208.109.255.255
Is there any other documentation of Googlebot crawling from Godaddy? Did I miss something?
[edited by: incrediBILL at 12:25 am (utc) on Jan. 21, 2009]
[edit reason] Obscured IPs [/edit]
I wouldn't allow Google or anyone else to crawl from an IP range with no reverse DNS, no matter the consequences. There's simply too much hassle involved (and a slight risk of opening the sites to scraping).
If it's really Google, then they simply need to 'get with the program' on this and fix the DNS issue.
Same thing for Amazon's Elastic Compute Cloud (EC2)... No rDNS, no content.
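If you want to enforce that policy in code, here's a minimal PHP sketch of the reverse-then-forward DNS check (the function name and the 403 handling are mine, purely illustrative; adapt it to your own blocking/logging setup):

<?php
// Verify a claimed Googlebot: reverse lookup, then forward-confirm.
// Minimal sketch only -- adapt to your own setup.
function is_verified_googlebot($ip)
{
    // gethostbyaddr() returns the input IP (or false) when there is
    // no PTR record -- reject those outright: no rDNS, no content.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // The hostname must end in googlebot.com or google.com.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward-confirm: the name must resolve back to the same IP,
    // otherwise the PTR record could simply be forged.
    return gethostbyname($host) === $ip;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false
        && !is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>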
</curmudgeon mode>
Jim
Regardless, the 4xx errors listed in GWT are for the same pages 403'd in my logs, all requested by the Googlebot UA from the Godaddy range. I'm also seeing the same IP using the Mediapartners-Google UA.
I've now allowed this IP to use the Google UAs. My only alternative is to block it and accrue more errors in GWT?
Wilderness pointed out that the fake one was using two spaces after the semicolon. However, when you paste the UA into the forum here, the forum software strips double spaces, even on the fake one, so you'd need to check your raw logs in something like Notepad. The crawlies will probably read that post and correct it eventually, though.
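If you'd rather flag it programmatically than eyeball raw logs, a quick PHP sketch (assuming the doubled space can sit after either semicolon, which is why the check is loose):

<?php
// Flag UAs claiming Googlebot that contain a doubled space after a
// semicolon -- the tell described above. Sketch only; the genuine UA
// uses single spaces throughout.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false && strpos($ua, ';  ') !== false) {
    // Doubled space after a semicolon: treat as a fake.
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>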
If you have crawl-delay type commands in robots.txt and it doesn't obey them, that's another indicator, as is any other odd behaviour.
However, the point of this thread is that Google Webmaster Tools reports about 50 4xx errors for the very same web pages that were requested by the Googlebot UA coming from the Godaddy range. Some of these were the only requests for that page the entire day.
If this were not an authentic Googlebot request, then how could these 403s be in the GWT report?
I see no other explanation than that it *is* Googlebot and, for some yet-to-be-explained reason, is crawling from a Godaddy IP address.
I appreciate any/all possible hypotheses.
Google's WMT usually doesn't update that fast, so the 403s probably weren't from a crawl made that day; they could've been from weeks earlier.
Telephoned Godaddy's NOC security dept, and they concluded the requests aren't from Godaddy at all but originate from Guatemala. Now I'm really confused as to why these specific 403s are shown in my WMT report.
Did Google open a DC in Guatemala recently?
On the other hand, Google themselves recommend checking reverse-DNS on Googlebot requests, and rejecting requests that don't pass this test.
Jim
I had briefly allowed this range, not knowing what else to do. Then, when I learned it was coming from Guatemala, I removed the IP from the whitelist again, since I hear about botnets, malicious agents, etc. coming from that region.
It came back again as Googlebot on the 20th and was 403'd for 13 web pages. Today (2 days later) these same 13 pages show up in WMT as "HTTP errors" and "Errors for URLs in Sitemaps."
That's all the testing I need to do. This is authentic in my book. So far only coming from: 208.109.8.205
And BTW, this same IP address has also visited twice as Firefox, referred by a Google UK search.
Please keep a very close eye on this, and I suggest some additional digging around. I can't give you a good/exact reason why, but I am getting a 'tingle' that this is the result of a DNS hack or a proxy hijack -- or both...
I don't find that IP range resolving to Guatemala, and I don't see Google using a Godaddy address range without setting up rDNS on it, when they have explicitly recommended to us webmasters that we check rDNS on all gBot requests and block those that don't check out.
Something is really non-copacetic here...
Jim
I also think this is a proxy issue of some sort that Google might not be aware of. keyplyr, is there any way you can get that IP to cough up its HTTP headers the next time it visits?
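Something like this in PHP would do it, for what it's worth (the log path is a placeholder; request headers show up in $_SERVER as HTTP_* keys on any SAPI):

<?php
// Log full request headers whenever the suspect IP visits.
// Sketch only -- adjust the IP and log path to your own setup.
if ($_SERVER['REMOTE_ADDR'] === '208.109.8.205') {
    $lines = array(date('c') . ' ' . $_SERVER['REQUEST_METHOD']
        . ' ' . $_SERVER['REQUEST_URI']);
    foreach ($_SERVER as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            $lines[] = $key . ': ' . $value;
        }
    }
    file_put_contents('/path/to/suspect-headers.log',
        implode("\n", $lines) . "\n\n", FILE_APPEND);
}
?>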
...keyplyr, is there any way you can get that IP to cough up its HTTP headers the next time it visits?
When I say I believe this Googlebot to be authentic, I agree the IP must be a proxy, or a crawl coming from a new DC not fully implemented yet. But whatever the reason, I dare not ban it, since it has seemingly affected WMT and may hurt my site's indexing at some point.
tpl_footer.php, line 43: <?php echo TEXT_YOUR_IP_ADDRESS . ' ' . $_SERVER['REMOTE_ADDR']; ?>
Then again, the domains are hosted on Godaddy; one site resolves to 72.167.232.45, the other to 72.167.232.44 (another coincidence?).
The search on Google for site:oneof2sites +"Your IP Address is: 208.109.8.205" returns only the homepages of those sites.
One of the lines from TRACERT for the .45 address goes like this:
16 84 ms 93 ms 83 ms ip-208-109-112-98.ip.secureserver.net [208.109.112.98]
which is on the same range the crawler is broadcasting from. Another coincidence!
The verdict is: I have too much free time on my hands.
That would explain GoDaddy's ability to trace the requests to Guatemala. It wouldn't explain why correlated crawl errors are showing up in GWT.
On the other hand, if it actually is Google, there are some possible explanations. It could be part of their Safe Browsing initiative, testing whether websites redirect or deliver malware based on the requestor's UA or IP. Or it could be testing for cloaked content: delivering different content to search engines than to regular visitors, based on UA or IP.
Watch your logs. All you need to be sure it's not really Google is one slip-up: ignoring robots.txt, or one malicious request with an RFI or SQL injection attempt. Maybe create a trap in robots.txt, disallowing Googlebot from a page that is non-existent anyway; something like the sketch below.
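For example, where /bot-trap-page.html is a made-up name for a page that doesn't exist and isn't linked from anywhere:

# The trap page below doesn't exist and isn't linked anywhere, so the
# only way to learn the URL is by reading robots.txt. The real Googlebot
# will never request it; anything that does is reading the file and
# deliberately ignoring it.
User-agent: Googlebot
Disallow: /bot-trap-page.html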
[edited by: SteveWh at 4:09 pm (utc) on Jan. 23, 2009]
Google has now dropped 39 web pages out of the index for this site; the same 39 pages I 403'd because the range did not match one that verifies as Google.
Came back again from the Godaddy range, this time with different C and D octets. So far from:
208.109.8.205 on Jan 18,19,20
208.109.31.114 on Jan 25
Still testing.
googlebot -> proxy -> keyplyr -> proxy -> google
(or some such).
Why? Possibly trying to work out some new interception technique?
How? As noted above: poisoned DNS?
keyplyr - have you checked your DNS? From several geographic locations? Is your domain name secure and set up exactly as you would expect? If you click on one of your pages in Google, does it go to your exact domain and IP? Ditto from the cache?
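If it helps, a quick PHP sketch for the first of those checks (the domain and expected IP are placeholders; ideally run it from more than one network):

<?php
// Sanity-check that the domain still resolves to the IP you expect.
// 'example.com' and '203.0.113.10' are placeholders -- use your own.
$expected = '203.0.113.10';
$found = array();
foreach (dns_get_record('example.com', DNS_A) as $record) {
    $found[] = $record['ip'];
}
if (!in_array($expected, $found)) {
    echo 'WARNING: example.com resolves to ' . implode(', ', $found)
        . " -- expected $expected. Possible DNS tampering?\n";
}
?>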
If you check [208.109.8.205...] there is no web page; not even a 404 or 410, just a socket error. At least some of the other IPs in the range return a web page, which is quite common. The implication is that either there is no server, OR IP access has been blocked with no error response (e.g. a 403); in my experience unusual, but possible. (NOTE: I use Sam Spade for such things to avoid trojanned sites!)
If you put the IP 208.109.8.205 into Google (here in the UK) you get two results: this thread and the Zen Cart store mentioned above. The cache appearance of the Zen store suggests it has never been set up. The IP for the store is on another Godaddy range, 72.167.232.86 (noted above), which returns a 403 for HTTP at the IP.
keyplyr - is your site a sales one? With popular products? If so, could this be a way to quickly build a phishing site?
The store domain was registered 5th January - check Google to see the domain name.
The IP 208.109.8.205 is given in Google's cache header in the <base> tag, but it is also at the foot of the page, saying "Your address is 208.109.8.205" - odd, because it isn't, and shouldn't it be the Googlebot IP anyway? (View Source of the cache.) Looking at the page through Sam Spade, it gives my correct IP (but not the domain name?).
This suggests that the web page may have been filtered through 208.109.8.205. Which (unless this is keyplyr's domain) means he's not alone.
These are only ramblings, you understand. I could be totally misleading you. :(
However, I do not believe Google would be sending a bot to your pages from Godaddy and acting adversely when it was rejected. That makes no sense and would be completely unethical.
If this is true, it's very scary, since it will likely happen again, and to other webmasters. It's not illegal to purchase a business IP range and then use it as a proxy. What's puzzling is that the giant Googlebot has crawled through it. I would have thought they had safeguards against this.
I now allow this C and D range, but I feel I am leaving my site open to abuse. At this point, however, I see no other alternative, as Googlebot has crawled from this range on 4 separate days. The damage is now up to 50 web pages that have been dropped from the Google index due to these 403 errors (according to GWT).
As an extra safety measure, I have now put my site on a dedicated server IP just to rule out a few uncertainties.
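For what it's worth, this is roughly how I keep the allowance scoped to just that range; a CIDR-style check in PHP (a sketch only; the /16 is the Godaddy range from this thread, so widen or narrow the mask to taste):

<?php
// True if $ip falls inside $network/$bits, e.g. the Godaddy /16
// discussed in this thread. Sketch only.
function ip_in_range($ip, $network, $bits)
{
    $mask = -1 << (32 - $bits);
    return (ip2long($ip) & $mask) === (ip2long($network) & $mask);
}

// Example: scope the Googlebot-UA allowance to the observed range.
var_dump(ip_in_range('208.109.8.205', '208.109.0.0', 16));  // true
var_dump(ip_in_range('72.167.232.45', '208.109.0.0', 16));  // false
?>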
My own guess is noted above.
Have you checked out any of my (and other people's) suggestions?
The second entry in Google for 208.109.8.205 has now gone. The only one now is this thread. Maybe Google has twigged that it's not an operational site, but that's no help, as I think it was just another scraped site.