is this Googlebot legit?

"Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)"

Nkona

4:53 pm on Oct 31, 2008 (gmt 0)

We run a site where our outbound links are the most important content. We use a PHP script to redirect our users to the external URL so that the URLs are a bit harder to scrape.

The PHP script that does the redirecting lives in a directory that is blocked by robots.txt.
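To illustrate what "blocked by robots.txt" means for a well-behaved crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The directory name `/out/` and the file name `redirect.php` are hypothetical stand-ins, since the real paths are not given in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "/out/" stands in for the real redirect directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /out/
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A compliant crawler would skip the redirect script...
    print(is_allowed("Googlebot", "/out/redirect.php?id=123"))
    # ...but may still fetch the rest of the site.
    print(is_allowed("Googlebot", "/index.html"))
```

robots.txt is only advisory, so a client that keeps requesting a disallowed path is either ignoring the file or was never a compliant crawler in the first place.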

We've seen a considerable amount of traffic to this PHP script where the user agent is "Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]" and the requests come from an IP that does not resolve to Google.

The suspect traffic usually comes in bursts that are too fast to be a human using a browser.

My first reaction was to block the suspect traffic, but I'm wondering if this bot could be Google contractors looking for cloaking or other QoS issues.

Any ideas?

Thanks

jdMorgan

8:03 pm on Oct 31, 2008 (gmt 0)

Google themselves recommend that you do a reverse-DNS lookup on anything that claims to be Googlebot. So it's a good bet that Google employees and contractors are aware that some Web sites will block access if the rDNS does not resolve back to Google.

You must decide for yourself, but all of my sites enforce this rDNS check as well as many other restrictions, and they do just fine in Google. Several have been 'hand reviewed' with no problems. I see the review requests coming from Google's internal-use IP address ranges, and usually with a popular browser user-agent string like Firefox (although I'll bet at least some will be changing over to the "Google Chrome" browser now).
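For readers who want to try the rDNS check described above, here is a minimal Python sketch, not anyone's production setup. It does a reverse lookup on the client IP, checks that the hostname ends in a Google domain, then does a forward lookup to confirm the hostname maps back to the same IP. The suffix list and the function names are assumptions made for this example, and the lookup functions are passed in as parameters only so the logic can be exercised without network access.

```python
import socket

# Assumed set of hostname suffixes treated as "Google" for this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def hostname_is_google(hostname: str) -> bool:
    """True if the hostname ends in one of the assumed Google suffixes."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES)


def is_verified_googlebot(
    ip: str,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror / socket.gaierror (no PTR record)
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The forward lookup must lead back to the IP that made the request.
    return ip in addresses


if __name__ == "__main__":
    # 203.0.113.5 is a documentation-only address, used here as a placeholder.
    print(is_verified_googlebot("203.0.113.5"))
```

The forward-confirmation step matters because whoever controls an IP block also controls its PTR records, so a reverse lookup alone can be made to return any hostname. DNS lookups are slow, so a real deployment would probably cache results per IP. Note that `gethostbyname_ex` handles IPv4 only.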

It's pretty simple: If your site is clean and makes no attempt at anything dodgy or any violation of the spirit of Google's "Quality Guidelines," then don't worry about site-security measures that you must use to keep out scrapers and bad-bots. Google is well aware of the "bad actors" on the Web, and if you are not in fact a bad actor, then they're quite smart enough to figure that out -- they do have a rather huge amount of data about the Web, you know... :) If you are not trying to out-smart them, then no worries. If you are, then maybe we'll see you soon in the "My site was banned" threads! (I hope not) ;)

The only concern you need have with respect to securing your site is that your solutions are technically correct. For every hour spent coding, spend at least two testing -- check the HTTP response headers, seek and correct duplicate-content problems, and run a very tight, quality-controlled ship.

Jim

Nkona

11:01 pm on Oct 31, 2008 (gmt 0)

Good point on Google being aware that webmasters are looking for this type of thing and would be blocking it.

And, no, you'll not see me in the 'banned' section :). Remember, the area of the site I'm concerned about is supposedly blocked by robots.txt.

GaryK

9:54 pm on Nov 5, 2008 (gmt 0)

I'm not sure if this is relevant or not, but the user-agent that Nkona posted is the same one that Webmaster Tools uses to analyze your robots.txt file. However, in that case the IP address does resolve to Google.

jdMorgan

7:08 pm on Nov 9, 2008 (gmt 0)

Note that the "+" is missing ahead of "http:" in that UA string, but appears to be present in real Googlebot requests, including the ones from the GWT robots.txt checker.
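As a rough illustration of that observation, the Python sketch below flags user-agent strings that claim to be Googlebot but lack the "+" before "http:". The function name is made up for this example, and the check is only a hint: a scraper can copy the genuine string exactly, so it complements the rDNS check and does not replace it.

```python
def looks_like_spoofed_googlebot_ua(user_agent: str) -> bool:
    """True if the UA claims to be Googlebot but has no '+' before 'http:'."""
    if "Googlebot/" not in user_agent:
        # Not claiming to be Googlebot at all, so nothing to flag here.
        return False
    return "+http://" not in user_agent


if __name__ == "__main__":
    # The string posted at the top of this thread (no "+").
    suspect = "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)"
    # The same string with the "+" in place.
    with_plus = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(looks_like_spoofed_googlebot_ua(suspect))    # True
    print(looks_like_spoofed_googlebot_ua(with_plus))  # False
```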

Jim

GaryK

7:31 pm on Nov 9, 2008 (gmt 0)

Oops, I missed that. Thanks, Jim. Yes, what I have on file for GWT does indeed have a "+" before the "http:" in the UA.