.htaccess problem

gizmotoy

8:02 pm on Jul 7, 2007 (gmt 0)

10+ Year Member

I'm having some difficulty getting .htaccess working properly on my domain. I host a lot of photography, and recently my images have been hotlinked from myspace/livejournal/etc. and the entire site downloaded with tools like Teleport Pro. Both are a real drain on the server, and I'd like to block them. After reading around I came up with the following, but for some reason it doesn't work: I downloaded Teleport Pro to test it out, and it can still crawl the entire site. Maybe I have an error somewhere that I'm not catching? If someone could point me in the right direction I'd appreciate it.

.htaccess contents:
RewriteEngine On

RewriteCond %{HTTP_REFERER} ^http://(.+\.)?myspace\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?blogspot\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?livejournal\.com/ [NC]
RewriteRule .*\.(jpe?g¦gif¦bmp¦png)$ - [F]

RewriteCond %{HTTP_USER_AGENT} ^.*Backweb.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*gotit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Bandit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Ants.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Buddy.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Crawler.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Grabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Sucker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Downloader.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Siphon.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Collector.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snagger.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Widow.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snake.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Vacuum.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pump.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Reaper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mag-Net.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Memo.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*pcBrowser.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperBot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*leech.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Stripper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Copier.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mirror.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HMView.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HTTrack.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JOC.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Recorder.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*GrabNet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Navroad.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*attach.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Magnet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Surfbot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Whacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FileHound.*$
RewriteRule /* [mydomain.com...] [L,R]

jdMorgan

9:33 pm on Jul 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Does any of this work? -- For example, if you completely flush your browser cache and then visit one of the myspace pages that is hotlinking your images, are the images still visible? (Note that it is critical to flush your cache after any successful image load; otherwise, the image will be served from your cache and your test results will be invalid.)

If so, is there anything in your server error log?

The reason I ask is that the first part of your code should work as long as you have changed the broken pipe "¦" characters to solid pipes -- Posting on this forum modifies those characters.
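
With the pipes restored, the image-blocking rule would read:

RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F]

(The leading ".*" is unnecessary -- more on that below.)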

The second part won't work because it creates an infinite loop: The URL you are redirecting to will match the rule's pattern and get rewritten again and again, until either the server or the client reaches its maximum redirection limit.

Assuming that you are not using a custom 403 error page, I suggest you replace that rule with


RewriteRule !^robots\.txt$ - [F]
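
Putting that together with your user-agent conditions (list shortened here for brevity), the second block would look something like:

RewriteCond %{HTTP_USER_AGENT} Teleport [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [NC]
RewriteRule !^robots\.txt$ - [F]

Unlike [R], the [F] flag implies [L] and answers the request directly with a 403 status. No new request is generated, so there is nothing left to loop on. The robots.txt exclusion is there so that all robots can still fetch robots.txt; the well-behaved ones will then stay out on their own.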

If you do use a custom 403 error page, then try something like this:

RewriteCond $1 !^robots\.txt$
RewriteCond $1 !^path_to_custom_403_error_page.html$
RewriteRule (.*) - [F]
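
For example, if your custom error page were at /errors/403.html (a made-up path -- substitute your own), the second condition would be

RewriteCond $1 !^errors/403\.html$

Note that in a root-directory .htaccess file, the path captured by the rule has no leading slash, and the dots need to be escaped.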

One thing that can be very helpful when you've got basic problems like this is to start simple. Put a rule like:

RewriteEngine on
RewriteRule ^foo\.html$ http://www.WebmasterWorld.com/ [R=301,L]

in your .htaccess file all by itself. Then request "foo.html" from your server. If you land back here at WebmasterWorld, then at least you know that a dirt-simple redirect will work...
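
Once that works, add your real rules back one piece at a time. For example, test the hotlink block with just a single referrer condition:

RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?myspace\.com/ [NC]
RewriteRule \.(jpe?g|gif|bmp|png)$ - [F]

If each step behaves as expected, then whatever breaks next is in the piece you just added.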

Also, it appears that you've edited most of the RewriteCond patterns, adding unnecessary ".*" subpatterns to the beginning and end of each. This slows down your server while changing each pattern in a way that either makes no practical difference or, worst of all, blocks legitimate visitors.

For a quick regular-expressions review:
^match-must-start-with-this-string
match-must-end-with-this-string$
^match-exactly-this-string$
match-must-contain-this-string

It's easy to see that, since ".*" matches anything at all --including an empty string-- the following two patterns are functionally identical:

^.*foo.*$
foo

By adding those extraneous ".*" subpatterns and altering the pattern anchors, you've done two things: slowed down the server and/or made the patterns less specific. In some cases, these less-specific patterns can be dangerous, in that they'll block access by legitimate visitors as well as the unwelcome ones. I suggest reviewing the lists of blocked user-agents in the other threads here, and paying careful attention to the "starts-with" versus "ends-with" versus "exactly-matches" versus "contains" notations. In *most* cases, the authors were careful to properly anchor each pattern for the specific user-agent to be matched.
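
To illustrate with two entries from your own list:

RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teleport [NC]

The first is a "starts-with" match -- the real Wget sends a User-Agent beginning with "Wget/" -- and the second is a "contains" match. Neither needs ".*" at either end.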

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim

gizmotoy

4:14 am on Jul 8, 2007 (gmt 0)

10+ Year Member

Thank you so much for the excellent response. The first part of the rule had been working, but the second part was indeed broken. By doing the tests you recommended and making the changes you suggested, I was able to track down and fix the problem. All seems to be working properly now; I've tested a few of the user-agent blocks using a Firefox plugin to impersonate those agents.

During this I noticed that Teleport Pro's default behavior now seems to be to impersonate IE 5.0 rather than identify itself, so blocking its user-agent probably won't do much good in the future. For now, the tools causing the most problems send a correctly-identified agent, so that's a plus.

Thanks again for the help sorting that out. The resources you mentioned are great, and I should hopefully be able to sort out future problems on my own.

jdMorgan

6:03 pm on Jul 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

In addition to these simple user-agent access controls, check out the PERL and PHP forum libraries. There are two scripts (one in each) that are quite useful: A PERL script that traps bad robots based on robots.txt violations, and a PHP script that traps bad robots based on rate of page access and other behaviours. I commend both to you, with thanks to Key_Master, xlcus, and AlexK for their contributions.
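
To give you the flavor of the robots.txt approach, here is the idea reduced to bare .htaccess terms (the /trap/ path and variable name are made up, and unlike the real scripts this only denies the trap request itself instead of remembering the offender):

# robots.txt contains "Disallow: /trap/" -- compliant robots never fetch it
SetEnvIf Request_URI ^/trap/ bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

The actual scripts go further, recording the offending client so that its subsequent requests are blocked as well.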

Jim

gizmotoy

7:20 pm on Jul 8, 2007 (gmt 0)

10+ Year Member

Thanks for the suggestions, I'll check them out.