Forum Moderators: phranque
#
# Validate Googlebot user-agent and IP, respond with 403-Forbidden if invalid
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/2\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
Since I can't test from Google's IP, I tried setting a blank referrer and the image bot's user agent and wandered my site. I could reach images fine, but any non-image file returned a 403. When I removed that block from the htaccess, I had full access again.
So I'm a bit lost: why is the image bot getting this error on non-image files when it comes from a valid Google IP? And why could I access the images without a valid Google IP, but not the HTML pages? It's the same pattern, but it makes no sense to me.
Am I looking at the wrong chunk in htaccess?
Mediapartners-Google/2.1 and the regular (real) Google bots are not hitting this problem so far as I can find; it's just the image bot.
First, is there code above this snippet that bypasses it for image files? Something like:
# Bypass remaining rules for image requests
RewriteRule \.(gif|jpe?g|png|bmp)$ - [L]
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
Usually, you can safely delete the bypass rule, but doing so may result in slower server performance. Look to the organization of your rules to see if you can simply move it to a position after the Googlebot checks.
Jim
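To make the ordering point concrete, here is a toy Python model of per-request rule processing. Everything in it is illustrative (the rule list, paths, and status codes are made up, and the Googlebot check is grossly simplified to "UA contains Googlebot"), but it shows why an early [L] bypass rule for image files means image requests never reach the later Googlebot checks, while non-image requests do:

```python
import re

def process(path, ua):
    """Walk the rules in order; [L] stops processing, [F] forbids."""
    rules = [
        # (pattern on the path, condition on the UA, flag)
        (r"\.(gif|jpe?g|png|bmp)$", None, "L"),   # image bypass rule
        (r".*", r"Googlebot|Googelbot", "F"),     # simplified fake-Googlebot block
    ]
    for path_pat, ua_pat, flag in rules:
        if re.search(path_pat, path) and (ua_pat is None or re.search(ua_pat, ua, re.I)):
            if flag == "F":
                return 403    # forbidden
            if flag == "L":
                return 200    # matched the bypass: stop, serve normally
    return 200                # no rule matched: serve normally

print(process("/pics/photo.jpg", "Googlebot-Image/1.0"))  # 200: bypassed before the block
print(process("/page.html", "Googlebot-Image/1.0"))       # 403: reaches the block
```

That is exactly the symptom reported above: images come back 200 while HTML pages come back 403, purely because of where the [L] rule sits.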
# Switch images on these sites to something I prefer to send out
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?annoyance\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]
#
# Forbid images to this
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?bigthief\. [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]
Just so I'm understanding, the code you wrote means "if 'Googlebot' or 'Googelbot' appears anywhere inside the user agent, but the whole string isn't exactly
Mozilla (version#) (compatible; Googlebot(or Googlebot-Image)/2.(version 0-9); +http://www.google.com/bot.html)
and it isn't coming from 66.249.etc.etc., then block it" -- right?
Since Googlebot-Image contains the word Googlebot but not the rest of the expected user-agent string, it got snagged for a 403.. and somehow the hotlink code a few folders deeper (this line: RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]) allowed it access to those files? [L] means "if you hit this rule, skip the rest", right?
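That reading can be checked outside Apache. Here is a small Python sketch: the trigger and full-UA patterns are transcribed from the rules above (Apache's backslash-space escapes become plain spaces in Python), and the two user-agent strings are the ones seen in the logs:

```python
import re

# The substring check that triggers the validation block (case-insensitive).
trigger = re.compile(r"Googlebot", re.I)

# The full user-agent pattern from the original rules.
full_ua = re.compile(
    r"^Mozilla/5\.[0-9]+ \(compatible; Googlebot/2\.[0-9];"
    r" \+http://www\.google\.com/bot\.html\)$"
)

crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
image_bot = "Googlebot-Image/1.0"

for ua in (crawler, image_bot):
    triggered = bool(trigger.search(ua))
    valid = bool(full_ua.match(ua))
    # Blocked when the trigger fires but the full pattern does not match.
    print(ua, "->", "403" if triggered and not valid else "allowed")
```

The image bot's short UA contains "Googlebot", so it trips the trigger, but it can never satisfy the full Mozilla-style pattern, so it gets the 403.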
I removed the anti-hotlink rules from the separate htaccess and put everything inside the main one (since that should better control the order they're read in?), and added your modified code. Everything else I'm leaving as is for now. I'm still not sure this is a good order, since I don't understand how that line let it reach the image folder.
Here's what I have so far (sanitized):
AddHandler application/x-httpd-php .shtml
ErrorDocument 404 /404.shtml
ErrorDocument 403 /403.shtml
Options All -Indexes
#
# 410 Permanent removed
Redirect gone /really/gone.cgi
#
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
#
# Block libwww-perl except from AltaVista, Inktomi, and IA Archiver
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/[0-9] [NC]
RewriteCond %{REMOTE_ADDR} !^209\.73\.(1[6-8][0-9]|19[01])\.
RewriteCond %{REMOTE_ADDR} !^209\.131\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteCond %{REMOTE_ADDR} !^209\.237\.23[2-5]\.
RewriteCond %{REMOTE_ADDR} !^208\.70\.
# 207.241.224.0/20 expanded to a regex range; CIDR notation doesn't work in RewriteCond
RewriteCond %{REMOTE_ADDR} !^207\.241\.2(2[4-9]|3[0-9])\.
RewriteRule .* - [F]
#
# Block Java and Python URLlib except from Google and Yahoo Python
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^64\.233\.172\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F]
#
# Block most random-letter. non-Mozilla user-agents
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
# 15 or more chars with no "/.{};" characters
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC]
# 5 or more consecutive consonants (no vowels)
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F]
#
# Block Fake Googlebots - Validate Googlebot user-agent and IP
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
#
# Block blank referer -AND- user-agent (except for head, favicon and feed requests)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER}<>%{HTTP_USER_AGENT} ^<>$
RewriteRule !\.(ico|rss)$ - [F]
#
# Block a few more bad guys
SetEnvIfNoCase User-Agent "(Some|Bad|Guys)" banned
Order Allow,Deny
Deny from ###.###.###.##
Allow from all
Deny from env=banned
#
#
# Block images from these sites
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?leach\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?hugeleach\. [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]
#
# Switch images on these sites
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?annoyance\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /clipart/bandwidth.jpe [L]
#
# DONE BLOCKS - ONTO REDIRECTS
#
# redirect old pages to new urls
RewriteRule ^old\.htm$ http://www.example.com/new/ [R=301,L]
#
# Remove useless "?junk" query strings, keep the needed one
RewriteCond %{THE_REQUEST} [?]
RewriteCond %{REQUEST_URI} !^/need/string/here\.php$
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]
#
# remove multiple slashes anywhere in url
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# index.shtml and index.php to /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(shtml|php)\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(shtml|php)$ http://www.example.com/$1 [R=301,L]
#
# non www to www
# RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
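One cryptic line in the listing above is the blank-referrer test, which uses mod_rewrite's ability to match against a concatenation of variables: `%{HTTP_REFERER}<>%{HTTP_USER_AGENT}` joins the two values around a literal `<>` separator, so `^<>$` matches only when both are empty. A Python sketch of the same logic (the paths and values here are illustrative, not from the thread):

```python
import re

# Pattern ^<>$ matches only when referrer AND user-agent are both empty.
both_blank = re.compile(r"^<>$")

def blocked(referer, user_agent, method="GET", path="/page.html"):
    """Mirror the blank-referrer/blank-UA block, with its two exemptions."""
    if method == "HEAD":
        return False                        # HEAD requests are exempt
    if re.search(r"\.(ico|rss)$", path):
        return False                        # favicon and feed requests are exempt
    return bool(both_blank.match(referer + "<>" + user_agent))

print(blocked("", ""))                       # True: both blank -> 403
print(blocked("", "Mozilla/5.0"))            # False: UA present
print(blocked("", "", path="/favicon.ico"))  # False: exempt file type
```

The separator only needs to be a string that cannot occur at the boundary of a real referrer and user agent, which `<>` satisfies.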
I know you can't say it will work for sure since all servers are slightly different, but does that seem fairly logical?
The only thing even mildly eye-catching was the ".*\.(jpe?g|gif|bmp|png)$" patterns, which could be replaced with the shorter and entirely-equivalent "\.(jpe?g|gif|bmp|png)$" -- the leading ".*" on an un-anchored pattern is meaningless.
Your analysis of my modified pattern was essentially correct, except that the pattern will now match Googlebot versions 1.0 through 9.9.
So, I don't know. Flush your browser cache and re-test with the modified Gbot pattern, and let us know what the results are.
Jim
I uploaded the changes and tested as best I can that everything still works as I intend. (Flushed cache, used Live HTTP Headers and user agent switcher to test every variation I can think of).
I'll have to wait for image bot to hit again to see how it reacts so I may not know for a day or few.
Looked over the logs again: Image bot got a 403 on an image not in the archive folder, always gets 403s on non-image files, and gets 200s for all archived images (the ones that had been protected by the hotlink code). Really odd; this is why htaccess will never cease to confuse me.
Thanks again for your help.
Mental note:
When using a user agent that is blocked on all my websites.. always, always, always change it back to normal *before* wandering off. I nearly gave myself a heart attack when I got a 403 when I went back. Took me more time than I'd like to admit to realize what happened ;)
66.249.xx.xx - - [25/Jan/2007:09:59:27 -0800] "GET /archived/images/picture.jpg HTTP/1.1" 403 - "-" "Googlebot-Image/1.0"
I've removed that test for fake googlebots for now. For this site, image search traffic is very good. I'll look at the logs again in a few hours to see if it can crawl with the fake-googlebot-test removed.. maybe there's some other rule blocking it?
The IP number really is Google's; I checked that first.
Regular Googlebot is crawling fine.
66.249.xx.xx - - [25/Jan/2007:05:47:38 -0800] "GET /folder/file.shtml HTTP/1.1" 200 11712 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Oh wait... could the rule still be disallowing it because the rest doesn't match? That is, there's no "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" like the other bots leave in the UA; it just calls itself "Googlebot-Image/1.0"?
# Block Fake Googlebots
RewriteCond %{HTTP_USER_AGENT} Googelbot [NC]
RewriteRule .* - [F]
#
# Validate Googlebot IP
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
#
# Validate Googlebot user-agent
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Search bot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Adwords bot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ GoogleBot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Image bot
RewriteCond %{HTTP_USER_AGENT} !^Googlebot-Image/[1-9]\.[0-9]$
# Adsense bot
RewriteCond %{HTTP_USER_AGENT} !^Mediapartners-Google/[1-9]\.[0-9]\ \(\+http://www\.googlebot\.com/bot\.html\)$
# Mobile bot - Note that pattern is not start-anchored
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Mobile/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
RewriteRule .* - [F]
# Valid Googlebot IP address
RewriteCond %{REMOTE_ADDR} ^66\.249\.
# Search bot
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
# Adwords bot
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.[0-9]+\ \(compatible;\ GoogleBot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
# Image bot
RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image/[1-9]\.[0-9]$ [OR]
# Adsense bot
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[1-9]\.[0-9]\ \(\+http://www\.googlebot\.com/bot\.html\)$ [OR]
# Mobile bot - Note that pattern is not start-anchored
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Skip next rule if valid Googlebot request
RewriteRule .* - [S=1]
#
# This rule is skipped by the previous rule if it detects a valid Googlebot request
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteRule .* - [F]
None of the code above has been tested -- typos are likely.
Jim
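One quick way to sanity-check whitelist patterns like these outside Apache is to transcribe them into Python and run them against known user-agent strings. A sketch covering just the search-bot and image-bot patterns from above (Apache's backslash-space escapes become plain spaces in Python syntax):

```python
import re

# Whitelist patterns for the search bot and image bot, transcribed from
# the rules above into Python regex syntax.
valid_patterns = [
    r"^Mozilla/5\.[0-9]+ \(compatible; Googlebot/[1-9]\.[0-9];"
    r" \+http://www\.google\.com/bot\.html\)$",   # search bot
    r"^Googlebot-Image/[1-9]\.[0-9]$",            # image bot
]

def is_valid_google_ua(ua):
    """True if the user agent matches one of the whitelisted forms."""
    return any(re.match(p, ua) for p in valid_patterns)

print(is_valid_google_ua(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_valid_google_ua("Googlebot-Image/1.0"))                                  # True
print(is_valid_google_ua("Googlebot/2.1 (fake scraper)"))                         # False
```

Checking the patterns this way before deploying them in htaccess catches typos like a stray escaped parenthesis without risking a live crawl.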
I just took a peek at my logs, and both Googlebot (real IP) and Googlebot-Image are crawling like madmen, so I'm leaving it alone for a day or so. Give them a bit of time to realize it was a glitch on my end and let them settle down. Once things get quieter I'll try one of the versions you wrote above.
Thanks for taking the time to explain what the code means. I find trying to understand (or even locate) anything in the Apache docs nearly impossible for beginners like me, especially when I don't know the technical terms to search for. So, thank you again.