allowing access to specific bots

for protected image directories

         

sssweb

3:13 pm on Dec 17, 2006 (gmt 0)

10+ Year Member



I have the following .htaccess in my image directories to block hotlinking and direct user access (code was auto-generated by my host and works in testing):

AuthUserFile /dev/null
AuthGroupFile /dev/null
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://example.com [NC]
RewriteCond %{HTTP_REFERER} !^http://www.example.com [NC]
RewriteRule /* http://example.com [R,L]

My understanding (PLEASE correct me if I'm wrong) is that this also blocks SEs trying to index my images. Can someone please tweak the above code to block access to all EXCEPT SEs?

Below is a list of the main ones. I know Google has a special image bot (indicated below); if any others do too, please edit the identifier below. Feel free to add other major bots too.

If there's a better solution to this whole issue (a useful link will do), I'm interested in that as well.

RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^yahoo.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^inktomi.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^zyborg.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^webcrawler.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^gigabot.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^scrubby.* [NC,OR]
# these last ones are image indexers I got from a bot DB list
RewriteCond %{HTTP_USER_AGENT} .*ImageScape.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla\s3\.01.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CydralSpider.* [NC,OR]

[edited by: jdMorgan at 7:27 pm (utc) on Dec. 20, 2006]
[edit reason] example.com [/edit]

sssweb

5:47 pm on Dec 17, 2006 (gmt 0)

10+ Year Member



The following still blocks hotlinking & direct access, and looks like it should allow the indicated bots. If anyone sees a problem, please post.

Also, any comments on the bot list are welcome.

AuthUserFile /dev/null
AuthGroupFile /dev/null
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://mysite.com [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mysite.com [NC]

RewriteCond %{HTTP_USER_AGENT} !^Googlebot-Image.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^yahoo.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^msnbot.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^inktomi.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^zyborg.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^webcrawler.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^gigabot.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^scrubby.* [NC]
RewriteCond %{HTTP_USER_AGENT} !.*ImageScape.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla\s3\.01.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^CydralSpider.* [NC]

RewriteRule /* http://example.com [R,L]

[edited by: jdMorgan at 7:26 pm (utc) on Dec. 20, 2006]
[edit reason] example.com [/edit]

mkhines

8:44 pm on Dec 19, 2006 (gmt 0)

10+ Year Member



Hi there,

I only want to block one bot that is ignoring the robots.txt file... its name is -

web2.gold.funnelback.com and IP is 64.72.112.53

How do I edit this code, and where do I put it? Does it go right into the Apache 2 httpd config file?

Here is my attempt at writing the necessary code based on your examples -

RewriteCond %{HTTP_USER_AGENT} ^web2.gold.funnelback.com*

But where does that go? I don't see an htaccess file anywhere on our server.

Thanks for any help you can give!

Megan
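
For the single-bot case asked about above, here is a minimal sketch, assuming mod_rewrite is enabled. It matches on the IP address quoted in the post rather than on the user-agent, because web2.gold.funnelback.com looks like a hostname and the bot's actual user-agent string is not given in the thread. An .htaccess file does not exist until you create one: it is a plain text file placed in the directory you want to protect, and mod_rewrite directives in it are only honored if the server's AllowOverride setting permits FileInfo. The same lines can instead go inside a <Directory> or <VirtualHost> block in the main server config.

```apache
# Sketch only: refuse every request from one address with 403 Forbidden.
RewriteEngine On
# The IP is the one quoted in the post. The dots are escaped and the
# pattern is anchored at both ends so that only this exact address matches.
RewriteCond %{REMOTE_ADDR} ^64\.72\.112\.53$
# "-" means "no substitution"; [F] sends the 403 response.
RewriteRule .* - [F]
```

If the bot later turns up from a different address, this rule will no longer catch it, so blocking by IP is best treated as a stopgap.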

sssweb

4:21 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



See: [webmasterworld.com...]

Other helpful links:

tutorial: www.workingwith.me.uk/articles/scripting/mod_rewrite

cheat sheet for symbol meanings: www.ilovejackdaniels.com/mod_rewrite_cheat_sheet.pdf

various applications: thejackol.com/htaccess-cheatsheet

[edited by: jdMorgan at 7:14 pm (utc) on Dec. 20, 2006]
[edit reason] De-linked [/edit]

jdMorgan

7:25 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
RewriteCond %{HTTP_USER_AGENT} !yahoo [NC]
RewriteCond %{HTTP_USER_AGENT} !^msnbot [NC]
RewriteCond %{HTTP_USER_AGENT} !zyborg [NC]
RewriteCond %{HTTP_USER_AGENT} !webcrawler [NC]
RewriteCond %{HTTP_USER_AGENT} !^Gigabot [NC]
RewriteCond %{HTTP_USER_AGENT} !scrubby [NC]
RewriteCond %{HTTP_USER_AGENT} !ImageScape [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla\ 3\.01 [NC]
RewriteCond %{HTTP_USER_AGENT} !CydralSpider [NC]
RewriteRule .* - [F]

Inktomi was bought out by Yahoo, so I doubt that you still need that one. I removed many of the start-anchors, since few of those user-agents actually start with the indicated string. I also removed the unnecessary ".*" from the end of the patterns -- if you don't use an end-anchor, they're not needed, and will only slow you down.
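
To illustrate the point about anchors, here is a minimal sketch. The user-agent strings in the comments are the forms commonly reported for these crawlers, not something taken from this thread, so check them against your own access logs before relying on them.

```apache
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
# Yahoo's crawler has commonly been reported with a user-agent like
#   Mozilla/5.0 (compatible; Yahoo! Slurp; ...)
# That string starts with "Mozilla", so an anchored test such as
#   RewriteCond %{HTTP_USER_AGENT} !^yahoo [NC]
# would not recognise it, and the crawler would be blocked like any
# other visitor. Without the anchor, "yahoo" may appear anywhere:
RewriteCond %{HTTP_USER_AGENT} !yahoo [NC]
# msnbot has commonly been reported with a user-agent that begins
# "msnbot/", so a start-anchor is safe for that one:
RewriteCond %{HTTP_USER_AGENT} !^msnbot [NC]
# With no end-anchor, "!yahoo" and "!yahoo.*" match exactly the same
# requests; the trailing ".*" only adds work for the regex engine.
RewriteRule .* - [F]
```

Note that comments have to sit on their own lines, since Apache does not allow a comment on the same line as a directive.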

There is little use trying to redirect a request for an included image to an HTML page. It won't work, because browsers just don't know how to do that -- you can't load a page into the spot where an <img src="..."> tag has been used. So, just return a 403-Forbidden as shown.
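
As a sketch of the 403 approach, the rule can also be limited to image requests, so that anything else stored in the same directory is left alone. The extension list here is an assumption for illustration and is not from the thread, and only one of the bot exceptions is repeated to keep the example short.

```apache
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
# "-" leaves the URL untouched and [F] answers with 403 Forbidden, so a
# hotlinked <img> gets a plain refusal instead of a redirect to a page
# it cannot display. [NC] makes the extension match case-insensitive.
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```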

Jim

[edited by: jdMorgan at 7:28 pm (utc) on Dec. 20, 2006]