I'm trying to block spiders, work around an IE5 bug that looks for favicon.ico in all the wrong places, and protect my images from being used on other websites (direct linking to my GIFs/JPGs from other sites).
So here are some snippets:
<snip 1>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
.... (more spiders)
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.mysite.com.* - [F]
</snip 1>
<snip 2>
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]
</snip 2>
<snip 3>
RewriteEngine on
RewriteRule ^(.+)/favicon\.ico$ /favicon.ico [R=permanent] [L]
</snip 3>
All three contain "RewriteEngine on". Can I keep just the first one and add the other rewrites below it?
And if I may add: can anyone see a problem with the syntax? My guess is that within snippet 2 I could also write:
RewriteCond %{HTTP_REFERER} !^http://[^/.]\.mysite.com/.*$ [NC]
Yes, starting the engine once at the top is enough.
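As a rough sketch of what the merged file could look like, assuming all three snippets live in the same .htaccess in the document root. The names are carried over from the snippets above; the escaped dots and the combined flag list on the last rule are my own assumptions about what was intended (flags go in one comma-separated list inside a single pair of brackets).

```apache
# One "RewriteEngine on" covers every rule that follows in this file.
RewriteEngine on

# Snippet 1: refuse a known spider by user agent.
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow
RewriteRule .* - [F]

# Snippet 2: refuse image requests whose referer is neither blank nor this site.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite\.com/ [NC]
RewriteRule \.(gif|jpg)$ - [F]

# Snippet 3: send misplaced favicon requests to the copy in the document root.
RewriteRule ^(.+)/favicon\.ico$ /favicon.ico [R=permanent,L]
```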
In snippet 2, just the one line
RewriteCond %{HTTP_REFERER} !^http://[^/.]\.mysite.com/.*$ [NC]
will block anything without your site as the referer, including requests made
without a referer. This will block many users who come in through a proxy
server (such as users on corporate networks). The inclusion of the !^$ Cond
allows a blank referer field. This is just one of those trade-offs...
Also, decide whether you want to block users viewing your page images in
Google's or Gigabot's cache, or using Google or AltaVista's "Translate this
page" features. All of these will load the page images from your server, but
provide the text from their server. I added lines in my .htaccess to permit
this by http_referer but again, it's your choice.
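A sketch of how such exceptions might be added to the hotlink rule from snippet 2. The two extra hostnames are hypothetical placeholders, not the referers these services actually send; check your own access log for the real values before using anything like this.

```apache
RewriteEngine on
# Allow a blank referer (proxies, direct requests).
RewriteCond %{HTTP_REFERER} !^$
# Allow the site itself.
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mysite\.com/ [NC]
# Allow selected third-party viewers (hypothetical hostnames).
RewriteCond %{HTTP_REFERER} !^http://cache\.search-engine\.example/ [NC]
RewriteCond %{HTTP_REFERER} !^http://translate\.search-engine\.example/ [NC]
# Anything left is an unapproved referer: refuse the image.
RewriteRule \.(gif|jpg)$ - [F]
```

Conditions without an [OR] flag are ANDed, so each additional negated line simply permits one more referer.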
You could also trim the second line in snippet 2 to
RewriteCond %{HTTP_REFERER} !^http://[^/.]\.mysite.com/ [NC]
because a ".*" right before the end anchor means you don't care
what the end of the string contains. The only time you need ".*" next to a
start or end anchor is when you want to create a backreference for use in
the rewrite destination, such as
RewriteRule xyx/(.*) abc/$1
and even here, the end anchor "$" is not needed.
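To put the two cases side by side, here is a small sketch. "xyx" and "abc" are the placeholder names from the line above, badsite.example is a hypothetical host, and the file is assumed to sit in the document root.

```apache
RewriteEngine on
# Backreference wanted: (.*) captures the rest of the path and $1 reuses it.
RewriteRule ^xyx/(.*) abc/$1 [L]

# No backreference wanted: matching the start of the referer is enough,
# so neither a trailing ".*" nor a "$" anchor is required.
RewriteCond %{HTTP_REFERER} ^http://badsite\.example/ [NC]
RewriteRule \.(gif|jpg)$ - [F]
```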
Also, did you know that you can use any name you want for your favorites icon,
as long as the filename and the "link rel" tag on the page agree?
Hope this helps,
Jim
I think I misunderstood your question about snippet 2.
I like your first version better, but I chose to use:
RewriteCond %{HTTP_HOST} !^www.mydomain.com
RewriteRule ^(.*)$ [mydomain.com...] [L,R=permanent]
This gives anyone using a "shorthand" URL for your site an external permanent
redirect to the correct canonical URL. It will save you having to worry about
multiple URLs being picked up by search engines from people and sites using an
"informal" version of your address. When a user comes in on "mydomain.com" the
redirect will cause his browser to re-request the page at "www.mydomain.com"
and the URL in the user's address bar will "correct itself".
This simplifies reading site access logs and having to worry about whether some
search engine will mistakenly ban you for having duplicate content on "two
domains".
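For reference, a minimal sketch of such a canonical-host redirect. www.example.com stands in for the real host name and is an assumption, as is placing the file in the document root. Note that the "!" negation goes in front of the whole pattern, before the "^" anchor.

```apache
RewriteEngine on
# If the requested host is anything other than www.example.com ...
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
# ... redirect permanently (301) to the same path on the canonical host.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
```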
Jim
Thanks for making this clear.
But it seems I screwed it up. :(
Here's my shortened .htaccess. It doesn't do what it's supposed to: it doesn't stop pictures from being used on other sites, and it doesn't redirect users typing in www.mydomain.com to mydomain.com, but it does redirect to the custom error pages. Some other redirects not posted here work too. All URLs have been changed to "mydomain.com"; the real .htaccess contains the project's actual URL, of course.
Here it is:
IndexIgnore .htaccess */.??* *~ *# */HEADER* */README* */_vti*
AddType image/x-icon .ico
RedirectMatch permanent ^/weblog [mydomain.com...]
ErrorDocument 401 /401.shtm
ErrorDocument 404 [mydomain.com...]
ErrorDocument 500 [mydomain.com...]
ErrorDocument 501 [mydomain.com...]
<Files .htaccess>
order allow,deny
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
more spiders added here
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.mydomain.com.* - [F]
RewriteCond %{HTTP_HOST} ^!www.mydomain.com
RewriteRule ^(.*)$ [mydomain.com...] [L,R=permanent]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain.com/.*$ [NC]
RewriteRule \.(gif|jpg)$ - [F]
RewriteRule ^(.+)/favicon\.ico$ /favicon.ico [L,R=permanent]
Hmm... OK, first, you shouldn't have "http://" in the left side of a rule.
Everything in the left side is assumed to be a local path. Combining your
spider list and your iaea.org line, try:
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
... more spiders added here
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule .* - [F,L]
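Spelled out with the elided lines replaced by hypothetical names (ExampleBot and AnotherBot are placeholders, not spiders from the original list), the combined block might look like the sketch below. The referer pattern here escapes the dots and drops the end anchor so that referers with a path also match; that is my assumption about the intent.

```apache
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExampleBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherBot [OR]
# Last condition in the chain: no [OR] here.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org
# Any one match above is enough to return 403 Forbidden.
RewriteRule .* - [F,L]
```

Every condition except the last carries [OR]; without the flag, conditions are ANDed and the rule would only fire when all of them match at once.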
The host redirect should work fine if you escape the "dots", put the "!" in front of the
"^" anchor, and make sure that "www" is either present or absent in both lines.
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [L,R=permanent]
Complex stuff in negated patterns sometimes produces funny results, try:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.mydomain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://mydomain\.com/ [NC]
RewriteRule \.(gif|jpg|jpeg?)$ - [F,L]
Check that you really need the "RewriteBase" directive. I've had that cause
trouble, too.
If you still get 404s, have a look at your error logs and check where
each rule is redirecting to. That will often point to
something simple, like a missing "/".
Rats, now I'm where I can't get to my own .htaccess... If the above doesn't fix
your problem, I'll stickymail you the similar parts of it later tonight if you'd
like to see another example.
Also, let me know if I've misunderstood again - regex is complicated enough
without the added ambiguity of human language!!! :) Now I'm clicking "send" and
fervently hoping I didn't add more typos!
Jim
OK, one more thing... Reviewing my .htaccess, I do not have the trailing
slashes on these RewriteConds:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.mydomain\.com/
RewriteCond %{HTTP_REFERER} !^http://mydomain\.com/
RewriteRule \.(gif|jpg|jpeg?)$ - [F,L]
Mine look like this:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.mydomain\.com
RewriteCond %{HTTP_REFERER} !^http://mydomain\.com
RewriteRule \.(gif|jpg|jpeg?)$ - [F,L]
Also, any of these rulesets that cover multiple domains can be simplified if you
move them below the ruleset that redirects all visitors to your "standard"
domain name. In that case, you only have to check for the one domain name,
since all image references should come from it and not from the "alias" domain.
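A sketch of that ordering, assuming www.example.com as a stand-in for the canonical host name and a .htaccess in the document root:

```apache
RewriteEngine on

# 1. Canonical host first: any other host name is redirected to www.example.com.
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]

# 2. Image protection second: pages are only served from the canonical host,
#    so only that one referer (plus the blank referer) has to be allowed.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com [NC]
RewriteRule \.(gif|jpe?g)$ - [F,L]
```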
Jim