Forum Moderators: open
208.109.8.nnn - - [18/Jan/2009:07:20:22 -0700] "GET www.example.html HTTP/1.1" 403 474 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
All the pages showing as 4xx errors in GWT matched up with requests 403'd from this range belonging to Godaddy: 208.109.0.0 - 208.109.255.255
Is there any other documentation of Googlebot crawling from Godaddy? Did I miss something?
[edited by: incrediBILL at 12:25 am (utc) on Jan. 21, 2009]
[edit reason] Obscured IPs [/edit]
I wouldn't allow Google or anyone else to crawl from an IP range with no reverse DNS, no matter the consequences. There's simply too much hassle involved (and a slight risk of opening the sites to scraping).
If it's really Google, then they simply need to 'get with the program' on this and fix the DNS issue.
Same thing for Amazon's Elastic Compute Cloud (EC2)... No rDNS, no content.
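If you want to enforce that policy in code, here's a minimal PHP sketch of the reverse-then-forward DNS check (the function name and the 403 handling are mine, purely illustrative; adapt it to your own blocking/logging setup):

<?php
// Verify a claimed Googlebot: reverse lookup, then forward-confirm.
// Minimal sketch only -- adapt to your own setup.
function is_verified_googlebot($ip)
{
    // gethostbyaddr() returns the input IP (or false) when there is
    // no PTR record -- reject those outright: no rDNS, no content.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // The hostname must end in googlebot.com or google.com.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward-confirm: the name must resolve back to the same IP,
    // otherwise the PTR record could simply be forged.
    return gethostbyname($host) === $ip;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false
        && !is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>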
</curmudgeon mode>
Jim
Regardless, the 4xx errors listed in GWT are for the same pages 403'd in my logs, all requested by the Googlebot UA from the Godaddy range. I'm also seeing the same IP using the Mediapartners-Google UA.
I've now allowed this IP to use the Google UAs. My only alternative is to block it and accrue more errors in GWT?
Wilderness pointed out that the fake one was using two spaces after the semicolon. However, when you paste the UA into the forum here, the forum software strips double spaces, even on the fake one, so you'd need to check your raw logs in something like Notepad. The crawlies will probably read that post and correct it eventually, though.
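If you'd rather flag it programmatically than eyeball raw logs, a quick PHP sketch (assuming the doubled space can sit after either semicolon, which is why the check is loose):

<?php
// Flag UAs claiming Googlebot that contain a doubled space after a
// semicolon -- the tell described above. Sketch only; the genuine UA
// uses single spaces throughout.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false && strpos($ua, ';  ') !== false) {
    // Doubled space after a semicolon: treat as a fake.
    header('HTTP/1.1 403 Forbidden');
    exit;
}
?>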
If you have crawl-delay type commands in robots.txt and it doesn't obey them, that's another indicator, as is any other odd behaviour.
However, the point of this thread is that Google Webmaster Tools reports about 50 4xx errors for the very same web pages that were requested by the Googlebot UA coming from the Godaddy range. Some of these were the only requests for that page the entire day.
If this were not an authentic Googlebot request, then how could these 403s be in the GWT report?
I see no other explanation than that it *is* Googlebot and, for some yet-to-be-explained reason, is crawling from a Godaddy IP address.
I appreciate any/all possible hypotheses.
Google's WMT usually doesn't update that fast, so the 403s probably weren't from a crawl made that day; they could've been from weeks earlier.
Telephoned Godaddy's NOC security dept, and they concluded the requests aren't from Godaddy at all but originate from Guatemala. Now I'm really confused as to why these specific 403s are shown in my WMT report.
Did Google open a DC in Guatemala recently?
On the other hand, Google themselves recommend checking reverse-DNS on Googlebot requests, and rejecting requests that don't pass this test.
Jim
I had briefly allowed this range, not knowing what else to do. Then, when I learned it was coming from Guatemala, I removed the IP from the whitelist again, since I hear about botnets, malicious agents, etc. coming from that region.
It came back again as Googlebot on the 20th and was 403'd for 13 web pages. Today (2 days later) these same 13 pages show up in WMT as "HTTP errors" and "Errors for URLs in Sitemaps."
That's all the testing I need to do. This is authentic in my book. So far only coming from: 208.109.8.205
And BTW, this same IP address has also visited twice as Firefox, referred by a Google UK search.
Please keep a very close eye on this, and I suggest some additional digging around. I can't give you a good/exact reason why, but I am getting a 'tingle' that this is the result of a DNS hack or a proxy hijack -- or both...
I don't find that IP range resolving to Guatemala, and I don't see Google using a Godaddy address range without setting up rDNS on it, when they have explicitly recommended to us webmasters that we check rDNS on all gBot requests and block those that don't check out.
Something is really non-copacetic here...
Jim
I also think this is a proxy issue of some sort that Google might not be aware of. keyplyr, is there any way you can get that IP to cough up its HTTP headers the next time it visits?
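Something like this in PHP would do it, for what it's worth (the log path is a placeholder; request headers show up in $_SERVER as HTTP_* keys on any SAPI):

<?php
// Log full request headers whenever the suspect IP visits.
// Sketch only -- adjust the IP and log path to your own setup.
if ($_SERVER['REMOTE_ADDR'] === '208.109.8.205') {
    $lines = array(date('c') . ' ' . $_SERVER['REQUEST_METHOD']
        . ' ' . $_SERVER['REQUEST_URI']);
    foreach ($_SERVER as $key => $value) {
        if (strpos($key, 'HTTP_') === 0) {
            $lines[] = $key . ': ' . $value;
        }
    }
    file_put_contents('/path/to/suspect-headers.log',
        implode("\n", $lines) . "\n\n", FILE_APPEND);
}
?>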
...keyplyr, is there any way you can get that IP to cough up its HTTP headers the next time it visits?
When I say I believe this Googlebot to be authentic, I agree the IP must be a proxy, or a crawl coming from a new DC not fully implemented yet. But whatever the reason, I dare not ban it, since it has seemingly affected WMT and may hurt my site's indexing at some point.
tpl_footer.php, line 43: <?php echo TEXT_YOUR_IP_ADDRESS . ' ' . $_SERVER['REMOTE_ADDR']; ?>
Then again, the domains are hosted on Godaddy; one site resolves to 72.167.232.45, the other to 72.167.232.44 (another coincidence?).
The search on Google for site:oneof2sites +"Your IP Address is: 208.109.8.205" returns only the homepages of those sites.
One of the lines from TRACERT for the .45 address goes like this:
16 84 ms 93 ms 83 ms ip-208-109-112-98.ip.secureserver.net [208.109.112.98]
which is on the same range the crawler is broadcasting from. Another coincidence!
The verdict is: I have too much free time on my hands.
That would explain GoDaddy's ability to trace the requests to Guatemala. It wouldn't explain why correlated crawl errors are showing up in GWT.
On the other hand, if it actually is Google, there are some possible explanations. It could be part of their Safe Browsing initiative, testing whether websites redirect or deliver malware based on the requestor's UA or IP. Or it could be testing for cloaked content: delivering different content to search engines than to regular visitors, based on UA or IP.
Watch your logs. All you need to be sure it's not really Google is one slip-up: ignoring robots.txt, or one malicious request with an RFI or SQL injection attempt. Maybe create a trap in robots.txt, disallowing Googlebot from a page that is non-existent anyway; something like the sketch below.
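For example, where /bot-trap-page.html is a made-up name for a page that doesn't exist and isn't linked from anywhere:

# The trap page below doesn't exist and isn't linked anywhere, so the
# only way to learn the URL is by reading robots.txt. The real Googlebot
# will never request it; anything that does is reading the file and
# deliberately ignoring it.
User-agent: Googlebot
Disallow: /bot-trap-page.html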
[edited by: SteveWh at 4:09 pm (utc) on Jan. 23, 2009]
Google has now dropped 39 web pages out of the index for this site; the same 39 pages I 403'd because the range did not match one that verifies as Google.
Came back again from the Godaddy range, this time with different C and D octets. So far from:
208.109.8.205 on Jan 18,19,20
208.109.31.114 on Jan 25
Still testing.
googlebot -> proxy -> keyplyr -> proxy -> google
(or some such).
Why? Possibly trying to work out some new interception technique?
How? As noted above: poisoned DNS?
keyplyr - have you checked your DNS? From several geographic locations? Is your domain name secure and set up exactly as you would expect? If you click on one of your pages in Google, does it go to your exact domain and IP? Ditto from the cache?
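If it helps, a quick PHP sketch for the first of those checks (the domain and expected IP are placeholders; ideally run it from more than one network):

<?php
// Sanity-check that the domain still resolves to the IP you expect.
// 'example.com' and '203.0.113.10' are placeholders -- use your own.
$expected = '203.0.113.10';
$found = array();
foreach (dns_get_record('example.com', DNS_A) as $record) {
    $found[] = $record['ip'];
}
if (!in_array($expected, $found)) {
    echo 'WARNING: example.com resolves to ' . implode(', ', $found)
        . " -- expected $expected. Possible DNS tampering?\n";
}
?>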
If you check [208.109.8.205...] there is no web page; not even a 404 or 410, just a socket error. At least some of the other IPs in the range return a web page, which is quite common. The implication is that either there is no server, OR IP access has been blocked with no error response (e.g. a 403); in my experience unusual, but possible. (NOTE: I use Sam Spade for such things to avoid trojanned sites!)
If you put the IP 208.109.8.205 into Google (here in the UK) you get two results: this thread and the Zen Cart store mentioned above. The cache appearance of the Zen store suggests it has never been set up. The IP for the store is on another Godaddy range, 72.167.232.86 (noted above), which returns a 403 for HTTP at the IP.
keyplyr - is your site a sales one? With popular products? If so, could this be a way to quickly build a phishing site?
The store domain was registered 5th January - check Google to see the domain name.
The IP 208.109.8.205 is given in Google's cache header in the <base> tag, but it is also at the foot of the page, saying "Your address is 208.109.8.205" - odd, because it isn't, and shouldn't it be the Googlebot IP anyway? (View Source of the cache.) Looking at the page through Sam Spade, it gives my correct IP (but not the domain name?).
This suggests that the web page may have been filtered through 208.109.8.205. Which (unless this is keyplyr's domain) means he's not alone.
These are only ramblings, you understand. I could be totally misleading you. :(
However, I do not believe Google would be sending a bot to your pages from Godaddy and acting adversely when it was rejected. That makes no sense and would be completely unethical.
If this is true, it's very scary, since it will likely happen again, and to other webmasters. It's not illegal to purchase a business IP range and then use it as a proxy. What's puzzling is that the giant Googlebot has crawled through it. I would have thought they had safeguards against this.
I now allow this C and D range, but I feel I am leaving my site open to abuse. At this point, however, I see no other alternative, as Googlebot has crawled from this range on 4 separate days. The damage is now up to 50 web pages that have been dropped from the Google index due to these 403 errors (according to GWT).
As an extra safety measure, I have now put my site on a dedicated server IP just to rule out a few uncertainties.
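For what it's worth, this is roughly how I keep the allowance scoped to just that range; a CIDR-style check in PHP (a sketch only; the /16 is the Godaddy range from this thread, so widen or narrow the mask to taste):

<?php
// True if $ip falls inside $network/$bits, e.g. the Godaddy /16
// discussed in this thread. Sketch only.
function ip_in_range($ip, $network, $bits)
{
    $mask = -1 << (32 - $bits);
    return (ip2long($ip) & $mask) === (ip2long($network) & $mask);
}

// Example: scope the Googlebot-UA allowance to the observed range.
var_dump(ip_in_range('208.109.8.205', '208.109.0.0', 16));  // true
var_dump(ip_in_range('72.167.232.45', '208.109.0.0', 16));  // false
?>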
My own guess is noted above.
Have you checked out any of my (and other people's) suggestions?
The second entry in Google for 208.109.8.205 has now gone. The only one now is this thread. Maybe Google has twigged that it's not an operational site, but that's no help, as I think it was just another scraped site.