Googlebot-Image triggering 403 on non-image files


LunaC

6:02 pm on Jan 24, 2007 (gmt 0)

10+ Year Member



I'm using this to block fake Googlebots, and it works well except when Googlebot-Image/1.0 tries to access a non-image file (robots.txt, page.html). The image bot can access images fine. The IP address in the logfile is Google's, so it shouldn't be hitting this rule at all.

#
# Validate Googlebot user-agent and IP, respond with 403-Forbidden if invalid
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/2\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]

Since I can't test from Google's IP, I tried giving a blank referrer and the image bot's UA and wandered my site. I could also get to images fine, but any non-image file returned a 403. When I removed that block from .htaccess I had full access again.

So I'm a bit lost: why is the image bot getting this error on non-image files, with a valid Google IP? And why could I access the images without a valid Google IP, but not HTML pages? It's the same pattern, but it makes no sense to me.

Am I looking at the wrong chunk in htaccess?

Mediapartners-Google/2.1 and the regular (real) Googlebots are not hitting this problem so far as I can find, just the image bot.

jdMorgan

8:35 pm on Jan 24, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are apparently two problems.

First, is there code above this snippet that bypasses it for image files? Something like:


# Bypass remaining rules for image requests
RewriteRule \.(gif|jpe?g|png|bmp)$ - [L]

Second, if there is such a rule and you remove it, then Googlebot/2.x will be allowed to access anything on your site, but the other Googlebots won't be. You'd need to expand your pattern to accept them:

RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]

I have recommended that "bypass code" before, but it must be placed ahead of only your non-image internal rewrite and redirect rules, not ahead of access-control rules like this one. If your rules are well-organized, it's a good way to save some CPU time: images usually make up the majority of requests to a server, and there's no need to run the rules that rewrite or redirect non-image URLs when the request is for an image.

Usually you can safely delete the bypass rule, though doing so may cost some server performance. Look at the organization of your rules to see if you can simply move it to a position after the Googlebot checks.
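
To illustrate the placement (a sketch only -- the section labels are mine, and the redirect is just an example):

# Access-control rules run first, for every request
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
#
# Then bypass the remaining rules for image requests
RewriteRule \.(gif|jpe?g|png|bmp)$ - [L]
#
# Non-image rewrites and redirects follow; image requests never reach them
RewriteRule ^old\.htm$ http://www.example.com/new/ [R=301,L]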

Jim

LunaC

12:58 am on Jan 25, 2007 (gmt 0)

10+ Year Member



In a deep subfolder holding the archived images (the ones it was able to spider) I have this .htaccess -- the other rules are all in /.htaccess. Could this be acting as the bypass rule you mentioned?


# Switch images on these sites to something I prefer to send out
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?annoyance\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]
#
# Forbid images to this
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?bigthief\. [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]

Just so I'm understanding, the code you wrote means: if "Googlebot" or "Googelbot" appears anywhere inside the user agent, but the user agent isn't exactly
Mozilla/(version#) (compatible; Googlebot (or Googlebot-Image)/2.(version 0-9); +http://www.google.com/bot.html)
or isn't coming from 66.249.etc., then block it. Right?

Since "Googlebot-Image" contains the word "Googlebot" but not the rest of the expected user-agent string, it got snagged for a 403... and somehow the hotlink code a few folders deeper (this line: RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]) allowed it access to those files? [L] means "if you hit this rule, skip the rest", right?
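
As far as I can tell from the docs (so treat this as my assumption), a minimal sketch of what [L] actually does in a per-directory (.htaccess) context:

# [L] ends the current pass through the rules, but in .htaccess context
# mod_rewrite then re-submits the rewritten URL internally and runs the
# whole rule set again against it.
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]
# Pass 1: picture.jpg matches and is rewritten to /img/bandwidth.jpe
# Pass 2: /img/bandwidth.jpe does NOT match the pattern ("jpe?g" still
#         requires a trailing "g"), nothing changes, processing stops,
#         and the substitute image is served -- so the ".jpe" extension
#         conveniently avoids an endless rewrite loop.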

I removed the anti-hotlink rules from the separate .htaccess and put everything inside the main one (since that should give better control over the order they're read in?), and added your modified code. Everything else I'm leaving as-is for now. I'm still not sure this is a good order, since I still don't understand how that line allowed access to the image folder.
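
The ordering I'm aiming for, written out as comments (the labels are just my own):

# 1. Access control: bot blocks and validation ([F] rules), so banned
#    requests never reach anything below
# 2. Hotlink protection: forbid or swap images by referrer
# 3. Redirects ([R=301,L]): canonical-URL housekeeping, old pages to new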

Here's what I have so far (sanitized):


AddHandler application/x-httpd-php .shtml
ErrorDocument 404 /404.shtml
ErrorDocument 403 /403.shtml
Options All -Indexes
#
# 410 Gone - permanently removed URLs
Redirect gone /really/gone.cgi
#
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
#
# Block libwww-perl except from AltaVista, Inktomi, and IA Archiver
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/[0-9] [NC]
RewriteCond %{REMOTE_ADDR} !^209\.73\.(1[6-8][0-9]|19[01])\.
RewriteCond %{REMOTE_ADDR} !^209\.131\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteCond %{REMOTE_ADDR} !^209\.237\.23[2-5]\.
RewriteCond %{REMOTE_ADDR} !^208\.70\.
# 207.241.224.0/20 expressed as a regex (CIDR notation won't match here)
RewriteCond %{REMOTE_ADDR} !^207\.241\.2(2[4-9]|3[0-9])\.
RewriteRule .* - [F]
#
# Block Java and Python URLlib except from Google and Yahoo Python
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^64\.233\.172\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F]
#
# Block most random-letter non-Mozilla user-agents
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
# 15 or more characters, all letters, digits, or spaces
RewriteCond %{HTTP_USER_AGENT} ^[a-z0-9\ ]{15,}$ [NC]
# five or more consecutive consonants
RewriteCond %{HTTP_USER_AGENT} [b-df-hj-np-tvwxz]{5,} [NC]
RewriteRule .* - [F]
#
# Block Fake Googlebots - Validate Googlebot user-agent and IP
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
#
# Block blank referer -AND- user-agent (except for head, favicon and feed requests)
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER}<>%{HTTP_USER_AGENT} ^<>$
RewriteRule !\.(ico|rss)$ - [F]
#
# Block a few more bad guys
SetEnvIfNoCase User-Agent "(Some|Bad|Guys)" banned
Order Allow,Deny
Deny from ###.###.###.##
Allow from all
Deny from env=banned
#
#
# Block images from these sites
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?leach\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?hugeleach\. [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]
#
# Switch images on these sites
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?annoyance\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /clipart/bandwidth.jpe [L]
#
# DONE BLOCKS - ONTO REDIRECTS
#
# redirect old pages to new urls
RewriteRule ^old\.htm$ http://www.example.com/new/ [R=301,L]
#
# Remove useless query strings, keep the one that's needed
RewriteCond %{THE_REQUEST} [?]
RewriteCond %{REQUEST_URI} !^/need/string/here\.php$
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]
#
# remove multiple slashes anywhere in url
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# index.shtml and index.php to /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(shtml|php)\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(shtml|php)$ http://www.example.com/$1 [R=301,L]
#
# non www to www
# RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

I know you can't say it will work for sure since all servers are slightly different, but does that seem fairly logical?

jdMorgan

1:17 am on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's nothing in that file itself that would explain the image behaviour; perhaps we both got fooled by your browser cache. In general, the code in a lower-level .htaccess file can't bypass the code in a higher-level file, though it can "un-do" or override the actions of the higher-level file later.
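
One caveat, though, worth checking in your setup: by default, mod_rewrite rules are not inherited per-directory, so a subfolder .htaccess that defines its own RewriteRule directives replaces the parent's rewrite rules for requests into that subfolder, unless inheritance is enabled explicitly. A minimal sketch, assuming a stock Apache configuration:

# /archived/images/.htaccess
RewriteEngine On
# Without the following line, the RewriteRule directives in /.htaccess
# (including the Googlebot checks) do not apply to requests for files
# under this directory -- only the rules below do.
RewriteOptions Inherit
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?annoyance\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ /img/bandwidth.jpe [L]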

Otherwise, the only thing even mildly eye-catching was the ".*\.(jpe?g|gif|bmp|png)$" patterns, which could be replaced with the shorter and entirely-equivalent "\.(jpe?g|gif|bmp|png)$" -- the leading ".*" on an un-anchored pattern is meaningless.

Your analysis of my modified pattern was essentially correct, except that the pattern will now match Googlebot versions 1.0 through 9.9.

So, I don't know. Flush your browser cache and re-test with the modified Gbot pattern, and let us know what the results are.

Jim

LunaC

3:55 am on Jan 25, 2007 (gmt 0)

10+ Year Member



I must have missed flushing my cache for that test (I swear it sticks sometimes until I restart the browser); I can't see how else I could have been allowed to view the images with that UA.

I uploaded the changes and tested as best I can that everything still works as I intend (flushed the cache, used Live HTTP Headers and a user-agent switcher to test every variation I could think of).

I'll have to wait for image bot to hit again to see how it reacts so I may not know for a day or few.

Looking over the logs again, the image bot got a 403 on an image not in the archive folder, always gets 403s on non-image files, and got 200s for all the archived images (the ones that had been protected by the hotlink code). Really odd; this is why .htaccess will never cease to confuse me.

Thanks again for your help.

Mental note:
When using a user agent that is blocked on all my websites... always, always, always change it back to normal *before* wandering off. I nearly gave myself a heart attack when I went back and got a 403. Took me longer than I'd like to admit to realize what had happened ;)

LunaC

6:25 pm on Jan 25, 2007 (gmt 0)

10+ Year Member



Hmm, now it's getting a 403 on every file, and it looks like it went through hundreds on that site :( Hopefully it will realize this is just a glitch and recrawl. Here's a bit from the logs (definitely after the change):

66.249.xx.xx - - [25/Jan/2007:09:59:27 -0800] "GET /archived/images/picture.jpg HTTP/1.1" 403 - "-" "Googlebot-Image/1.0"

I've removed that test for fake Googlebots for now; for this site, image-search traffic is very good. I'll look at the logs again in a few hours to see if it can crawl with the fake-Googlebot test removed... maybe there's some other rule blocking it?

The IP number really is Google's; I checked that first.

Regular Googlebot is crawling fine.
66.249.xx.xx - - [25/Jan/2007:05:47:38 -0800] "GET /folder/file.shtml HTTP/1.1" 200 11712 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Oh wait... could the rule still be disallowing it because the rest doesn't match? That is, there's no "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" like the other bots leave in the UA; it just calls itself "Googlebot-Image/1.0".
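
Tracing it through for that request, I think it goes like this (my own walkthrough, not anything from the logs):

# Request: UA "Googlebot-Image/1.0" from 66.249.xx.xx
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
# -> matches: "Googlebot-Image" contains "Googlebot"
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot(-Image)?/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
# -> the UA doesn't start with "Mozilla", so this negated test succeeds,
#    and because of [OR] the IP check below is never even consulted
RewriteCond %{REMOTE_ADDR} !^66\.249\.
# -> skipped
RewriteRule .* - [F]
# -> 403, even though the IP really is Google's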

jdMorgan

9:08 pm on Jan 25, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes. Your best bet, should you decide to pursue this, would be to add a specific RewriteCond for Googlebot-Image and be done with it. Because of the multiple layers of AND-OR-AND this involves, the rule must be rewritten if the code is to be maintainable:

# Block Fake Googlebots
RewriteCond %{HTTP_USER_AGENT} Googelbot [NC]
RewriteRule .* - [F]
#
# Validate Googlebot IP
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]
#
# Validate Googlebot user-agent
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# Search bot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Adwords bot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.[0-9]+\ \(compatible;\ GoogleBot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Image bot
RewriteCond %{HTTP_USER_AGENT} !^Googlebot-Image/[1-9]\.[0-9]$
# Adsense bot
RewriteCond %{HTTP_USER_AGENT} !^Mediapartners-Google/[1-9]\.[0-9]\ \(\+http://www\.googlebot\.com/bot\.html\)$
# Mobile bot - Note that pattern is not start-anchored
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Mobile/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
RewriteRule .* - [F]

Or you could write it based on what you'll accept, rather than what you will reject:

# Valid Googlebot IP address
RewriteCond %{REMOTE_ADDR} ^66\.249\.
# Search bot
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.[0-9]+\ \(compatible;\ Googlebot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
# Adwords bot
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.[0-9]+\ \(compatible;\ GoogleBot/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$ [OR]
# Image bot
RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image/[1-9]\.[0-9]$ [OR]
# Adsense bot
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[1-9]\.[0-9]\ \(\+http://www\.googlebot\.com/bot\.html\)$ [OR]
# Mobile bot - Note that pattern is not start-anchored
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile/[1-9]\.[0-9];\ \+http://www\.google\.com/bot\.html\)$
# Skip next rule if valid Googlebot request
RewriteRule .* - [S=1]
#
# This rule is skipped by the previous rule if it detects a valid Googlebot request
RewriteCond %{HTTP_USER_AGENT} Googlebot|Googelbot [NC]
RewriteRule .* - [F]

Search companies could make our lives much easier if they would define and enforce a company-wide user-agent string taxonomy, stick to it, and quit releasing specialty 'bot after specialty 'bot on the Web -- I can see no reason why any search company needs more than one or two robots to crawl it. One or two link-crawlers/page-fetchers feeding a database, from which multiple special-purpose back-ends could draw data, would greatly reduce whitelist maintenance and both server and 'net bandwidth, and just seems to make a lot more sense -- to me, anyway. The back-ends themselves could fetch robots.txt to get Disallow/Allow status for individual pages, directories, and page types, and dispatch a media fetcher to GET media files; those seem to be the only "per back-end" fetches needed.

None of the code above has been tested -- typos are likely.

Jim

LunaC

10:03 pm on Jan 25, 2007 (gmt 0)

10+ Year Member



I completely agree; MSN is even worse for seemingly random UAs sometimes. I've seen MS IPs in the logs with a blank UA and referrer; those got a 403 as well... but I feel it was deserved (I didn't see any request for robots.txt), so I'm leaving that as-is.

I just took a peek at my logs, and both Googlebot (real IP) and Googlebot-Image are crawling like madmen, so I'm leaving it alone for a day or so to give them time to realize it was a glitch on my end and settle down. Once things get quieter I'll try one of the versions you wrote above.

Thanks for taking the time to explain what the code means. I find trying to understand (or even locate) any info in the Apache docs nearly impossible for a beginner like me, especially when I don't know the technical terms to search for. So, thank you again.