Forum Moderators: phranque


How to block fake Googlebots?

Using htaccess to block fake search engine bots.


grandma genie

6:08 pm on Oct 1, 2010 (gmt 0)

10+ Year Member



Hello jd and all,

I believe a fake googlebot visited my site yesterday. Here is the entry from the server log:

209.235.192.nn - - [30/Sep/2010:17:02:40 -0400] "GET / HTTP/1.1" 200 31375 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.235.192.nn - - [30/Sep/2010:17:02:41 -0400] "GET /old/ HTTP/1.1" 404 8747 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.235.192.nn - - [30/Sep/2010:17:02:41 -0400] "GET /forum/ HTTP/1.1" 404 8747 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.235.192.nn - - [30/Sep/2010:17:02:41 -0400] "GET /forums/ HTTP/1.1" 404 8747 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.235.192.nn - - [30/Sep/2010:17:02:41 -0400] "GET /vb/ HTTP/1.1" 404 8747 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
209.235.192.nn - - [30/Sep/2010:17:02:41 -0400] "GET /vbulletin/ HTTP/1.1" 404 8747 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The IP does not belong to Google. How do you block this type of visitor in htaccess? I doubt blocking the IP will do much good; I understand from another entry in the Search Engines forum that this particular fake has visited other sites using a variety of other IPs. - Grandma_genie

jdMorgan

7:10 pm on Oct 1, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> How do you block this type of visitor in htaccess?

You block all user-agents claiming to be Googlebot unless the request's IP address falls within Google's published address ranges.

Jim

grandma genie

11:37 pm on Oct 1, 2010 (gmt 0)

10+ Year Member



Hi jd: Please pardon my ignorance, but the only stuff I can see is what I posted. The user agent here appears to be a legitimate one, but when you check the IP you can see this particular one is from gandalf.volutionmedia.com. Does the Apache server see something I can't see? Is this something only my host can do? I can block the IP, but I assume whoever is doing this is not from that IP or even that host. How do you block someone when you don't know who they are?
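For anyone following along, "checking the IP" here means a reverse-DNS lookup: asking what hostname the address resolves back to. A minimal Python sketch of that check (the helper names are mine, and the suffix list is an assumption about Google's crawler domains, not something from this thread):

```python
import socket

def reverse_lookup(ip):
    """Return the hostname an IP address reverse-resolves to, or None."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return None

def looks_like_google(hostname):
    """True if an rDNS hostname sits under a Google crawler domain."""
    if hostname is None:
        return False
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

# A real Googlebot reverse-resolves to something like
# crawl-66-249-66-1.googlebot.com; the visitor above resolved
# to gandalf.volutionmedia.com instead, which gives it away.
```

Note that rDNS alone can be spoofed by whoever controls the reverse zone, which is why the later posts pair it with a forward-confirmation step.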

jdMorgan

12:30 am on Oct 2, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You don't deny access based on what you don't see, you deny access based on what you do see. Validate the Googlebot requests based on their HTTP headers and known googlebot IP addresses.

Basically, if the UA claims to be googlebot and any of these aspects of the request are wrong, kick the request to the curb with a 403.

Code deleted -- See corrected code below.

This is a simplified version of some code that I use on a few sites; the simplified version itself has not been tested, so hopefully there are no typos...

The drawback is that if you use code like this, you have to keep on the lookout for any changes that Google might make which cause a real Googlebot to get denied because its requests no longer exactly match the defined profile. It's not a problem if you deny the real Googlebot for a day, but don't let it go on for a week!

The only time I've seen the profile change is when the Googlebot user-agent itself has been changed, so this is not a terribly big worry as long as you're checking your logs or stats for 403 errors once every few days.

This code can be optimized, but I posted it in "simplified expanded form" to make it easier to read and to modify if you wish to do so.

There are other headers you could check as well, but the purpose here is to have a "low-tolerance" rule, not a "zero-tolerance" rule...
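The "low-tolerance" approach boils down to a profile check: if the UA claims to be Googlebot, the headers and source IP must all match the expected profile, or the request gets a 403. Here is a hedged Python sketch of that logic; the header values and IP range mirror what a Googlebot of that era typically sent, but they are assumptions that can go stale, exactly as Jim warns:

```python
import re

# Expected request profile for a genuine Googlebot (circa 2010 --
# Google can change any of these, so re-check your logs periodically).
GOOGLEBOT_UA = re.compile(
    r"^Mozilla/5\.0 \(compatible; Googlebot/2\.[01]; "
    r"\+http://www\.google\.com/bot\.html\)$"
)
GOOGLEBOT_IP = re.compile(r"^66\.249\.(6[4-9]|7[0-9]|8[0-46-9]|9[0-5])\.")

def verdict(ip, headers):
    """'allow' for non-Googlebot UAs and profile-matching Googlebots;
    'deny' (i.e. serve a 403) for impostors."""
    ua = headers.get("User-Agent", "")
    if "googlebot" not in ua.lower():
        return "allow"          # not claiming to be Googlebot at all
    profile_ok = (
        GOOGLEBOT_UA.match(ua)
        and headers.get("Accept") == "*/*"
        and headers.get("From") == "googlebot(at)googlebot.com"
        and GOOGLEBOT_IP.match(ip)
    )
    return "allow" if profile_ok else "deny"
```

Any single mismatch, wrong header or wrong network, is enough to refuse the request, which is what makes the rule "low-tolerance" rather than "zero-tolerance": it checks a handful of strong signals, not everything imaginable.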

Jim

[edit] Deleted incorrect code. See new post below. [/edit]

[edited by: jdMorgan at 1:48 pm (utc) on Oct 4, 2010]

grandma genie

2:16 am on Oct 2, 2010 (gmt 0)

10+ Year Member



Thank you, Jim. I check my server logs daily, so I'll keep my eyes peeled for a real Googlebot that gets served a 403. I do block the Google image bot now, because I'd rather not have them index my pictures. I am very impressed with the power of the htaccess file. Wilderness suggested this:

# deny when UA contains Googlebot EXCEPT from IP range.
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule .* - [F]

I need to learn how to read this code better. I think it is, as Spock would say, fascinating.
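For readers puzzling over that snippet: consecutive RewriteCond lines are ANDed together, so the [F] (403 Forbidden) fires only when the IP does NOT start with 66.249. AND the user-agent contains Googlebot. The same test written out in Python, purely for readability:

```python
def wilderness_rule_blocks(ip, user_agent):
    """Mirror of the two ANDed RewriteCond lines above: block only
    when the UA contains 'Googlebot' AND the IP is outside 66.249.x.x."""
    claims_googlebot = "Googlebot" in user_agent
    from_google_range = ip.startswith("66.249.")
    return claims_googlebot and not from_google_range
```

As sublime1 points out below, the weak spot is the single hard-coded range, not the AND logic itself.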

Jeannie

sublime1

1:48 am on Oct 3, 2010 (gmt 0)

10+ Year Member



Jeannie --

Fascinating indeed, but also dangerous!

Unless I am mistaken, the code you have here would send a "Forbidden" (HTTP 403) response to any user-agent containing "Googlebot" unless the request comes from an IP address starting with 66.249.x.x. As Jim's example shows, Google has a number of other IP addresses, meaning you would block their real bot from any address outside 66.249.x.x.

I am guessing this is not what you want.

Jim's code is very carefully crafted to explicitly exclude the currently published Googlebot IP addresses. And I believe him when he says he checks his 403s frequently.

So, to prevent some unwanted traffic from a rogue bot claiming to be Google, you're taking the risk of excluding a real Googlebot.

Is it worth the risk, the performance hit that every request takes to process these rules, and the maintenance burden, only to find one day that your site has been excluded from Google's index?

If your goal is to reduce load on your server, having your site excluded from Google's index is very effective indeed :-). (Sorry, speaking from rather bitter past experience of my own...)

Fighting rogue bots is a losing proposition, in my experience. However, someone who comes up with a good, reliable, safe method for doing it could build quite a good business.

Now that's fascinating.

Just my opinion...

Tom

grandma genie

2:24 am on Oct 3, 2010 (gmt 0)

10+ Year Member



The formula did not work on this entry, and I don't know why:

137.110.222.nnn - - [02/Oct/2010:04:16:52 -0400] "GET / HTTP/1.1" 200 31375 "http://www.google.com/search?hl=en&source=hp&btnG=Google+Search&q=bunny" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I check my logs daily and don't want to block Google. But I just ended up blocking an IP pretending to be an MSNbot. I have to assume that anyone downloading hundreds of pages from a site can't be up to any good, especially when they are hiding behind a fake user-agent. I've been hacked before and want to keep any visitor looking for server vulnerabilities away. The fake Googlebot from 209.235.192.nn had been to a number of sites with different IPs and different host names, so keeping that type of hacker away is very important. I think that is what Jim's code was trying to do.

jdMorgan

2:26 am on Oct 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In practice, the code I posted has been non-problematic. Once I had the list of Googlebot IP ranges, it took no more maintenance, and I never blocked Googlebot for more than a few hours even while developing that list. Googlebot treated the 403s caused by my incomplete IP-address list as errors on my site (because they were) and simply waited a few hours to be allowed back in. No damage was done to the rankings, and those rankings were and are quite high.

If you watch your logs/stats on a daily basis, then this approach is viable for a small site. For a very large site, it might not scale well without a specific "blocked a 'bot" error-logging script, perhaps one that e-mails the Webmaster when it is invoked...

Googlebot is very forgiving of errors, because many if not most Web sites are bursting-full with errors. If you accidentally give Google a 403 or a 500, or even a 404 on your home page for 24 hours, I doubt it will cause any harm -- unless your site is "suspect" for other reasons. Just don't let it go much longer than that...

Jim

jdMorgan

1:46 pm on Oct 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, it was as I feared. Above, I provided a "simplified" version of some googlebot-validation code that I use on some of my servers, but I got in a hurry and posted bad code. Due to the precedence of logical operators in mod_rewrite, that code was logically broken as a result of my attempt to simplify it. It also had several non-googlebot IP addresses mixed in; while those addresses do resolve to Google, they are not used by googlebot itself.

The following rule-set will likely work a lot better...

# Validate Googlebots
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$
RewriteCond %{HTTP:Accept} ^\*/\*$
RewriteCond %{HTTP:Accept-Encoding} ="gzip,deflate"
RewriteCond %{HTTP:Accept-Language} =""
RewriteCond %{HTTP:Accept-Charset} =""
RewriteCond %{HTTP:From} ="googlebot(at)googlebot.com"
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|7[0-9]|8[0-46-9]|9[0-5])\. [OR]
RewriteCond %{REMOTE_ADDR} ^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
# Optional reverse-DNS-lookup replacement for IP-address check lines above
# RewriteCond %{REMOTE_HOST} ^crawl(-([1-9][0-9]?|1[0-9]{2}|2[0-4][0-9]|25[0-5])){4}\.googlebot\.com$
RewriteRule ^ - [S=1]
# Block invalid Googlebots
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^ - [F]
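The two address ranges checked above (66.249.64-95 minus the deliberately excluded 66.249.85, and 216.239.32-63) can be exercised offline to confirm the regexes accept what you expect. A small Python sanity check, with the same patterns transcribed:

```python
import re

# Transcriptions of the two REMOTE_ADDR patterns from the rule-set.
CRAWL_RANGES = [
    re.compile(r"^66\.249\.(6[4-9]|7[0-9]|8[0-46-9]|9[0-5])\."),
    re.compile(r"^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\."),
]

def ip_in_crawl_range(ip):
    """True if the address matches either published Googlebot range."""
    return any(p.match(ip) for p in CRAWL_RANGES)
```

Note that 66.249.85.x falls through on purpose: per Jim's comment, some addresses that resolve to Google are not used by the crawler itself.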

Note that the optional reverse-DNS line will only work on servers which allow the use of reverse-DNS lookups.

Further, once this rDNS lookup is triggered, the format of your access log file will change: it will no longer show IP addresses as the first entry on each line, but will instead show remote hostnames. This can greatly affect your server-administration process and may cause some 'stats' programs to stop reporting server-access summaries correctly. Once your server gets into this mode, it will remain that way until it is restarted.
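The rDNS idea can also be taken one step further into the forward-confirmed check that Google itself documents: reverse-resolve the IP, verify the name is under googlebot.com, then forward-resolve that name and confirm it maps back to the same IP. A minimal Python sketch, with the decision logic split out so it can be tested without network access (the function names are mine):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def name_and_roundtrip_ok(ip, rdns_name, forward_ips):
    """Pure check: the rDNS name sits under a Google domain AND its
    forward A records include the original IP (forward-confirmed rDNS)."""
    if not rdns_name.endswith(GOOGLE_SUFFIXES):
        return False
    return ip in forward_ips

def is_real_googlebot(ip):
    """Network wrapper: reverse-resolve, then forward-confirm.
    Do this offline or with caching -- not on every request."""
    try:
        name, _aliases, _addrs = socket.gethostbyaddr(ip)
        forward = socket.gethostbyname_ex(name)[2]
    except OSError:
        return False
    return name_and_roundtrip_ok(ip, name, forward)
```

The forward-confirmation step is what defeats an attacker who controls his own reverse zone and simply names his machine crawl-x-x-x-x.googlebot.com: his forged name will not resolve back to his IP.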

If you have server-configuration privileges, you can easily change your log-file format so that it shows Remote_Addr instead of Remote_Host as the first entry on each line, regardless of whether rDNS is enabled: change the first token in the logging format from %h to %a. See the Apache mod_log_config documentation.
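For example, applied to the stock "combined" log definition, the %h-to-%a swap looks like this (a sketch; adjust it to whatever LogFormat line your own httpd.conf actually uses):

```apache
# Log the client IP address (%a) instead of the possibly-rDNS'd hostname (%h)
LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
```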

Sorry if my previous post caused any confusion or problems...

Jim