
avoiding htaccess redirects


revrob

9:09 am on Aug 31, 2010 (gmt 0)

10+ Year Member


I have some bot traps and .htaccess RewriteCond statements that send certain requests straight to the bot traps and send me an email to alert me.

I have some modified RewriteCond statements that detect certain IP addresses, direct them to a landing page, and send me an email.

All this works fine for me and for others on other IP addresses who have tested it for me.
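
For context, the IP-address rules are along these general lines (a generic sketch only - a documentation placeholder address range and a made-up filename; the email alert comes from the script that the rule rewrites to):

RewriteCond %{REQUEST_URI} !^/landing\.php$
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule ^ /landing.php [L]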

Yesterday I checked my logs and saw a fairly aggressive four-hour visit from an IP range (and ISP) that I am having trouble with.
This visitor (no user agent disclosed) seemed to have read robots.txt on an earlier date and was working through a (selected) list of the locations in the robots.txt Disallow section - there is no way they could have known about these non-existent directories unless they had read robots.txt previously. This included several "non-existent" locations on the site that were subject to .htaccess rewrite/redirect statements. But every time, this visitor just got a 200 response. He didn't trigger the redirect, he didn't fall into any of the bot traps, I never got any emails, and his IP was never added to a "deny from" line in my .htaccess.

Here is a typical log entry - first of all from ME trying it all out this morning using the Firefox browser. The directory /slides/ is listed as a Disallow directory in my robots.txt, and any requests for /slides/ are redirected to a bot trap. I visited, and triggered the trap.

109.152.xx.xx - - [31/Aug/2010:08:17:04 +0200] "GET /slides/IMG_****.html HTTP/1.1" 200 326756 www.mydomain "-" "Mozilla/5.0 (Windows; U; Windows ************Gecko/20100722 Firefox/3.6.8" "-"
109.152.xx.xx - - [31/Aug/2010:08:17:04 +0200] "GET /favicon.ico HTTP/1.1" 403 - www.mydomain "-" "Mozilla/5.0 (Windows****************Gecko/20100722 Firefox/3.6.8" "-"


Yesterday my aggressive crawler did this:
78.145.xx.#*$! - - [30/Aug/2010:18:52:30 +0200] "GET http://www.mydomain/slides/IMG_4555.html HTTP/1.0" 200 645 - "-" "-" "-"
78.145.xx.#*$! - - [30/Aug/2010:18:52:31 +0200] "GET http://www.mydomain/slides/IMG_4556.html HTTP/1.0" 200 645 - "-" "-" "-"
78.145.xx.#*$! - - [30/Aug/2010:18:52:32 +0200] "GET http://www.mydomain/slides/IMG_4557.html HTTP/1.0" 200 645 - "-" "-" "-"

He carried on with this malarkey for about four hours, including repeated attempted trawls through that (non-existent) photo album and other genuine areas of the site. He regularly requested URLs that should have triggered traps, but they didn't.

The things I notice are:
- that the GET request gives the whole URL (whereas when I go there, the GET request leaves out the domain name)
- that he conceals his user agent (so I am suspicious)
- that his visits never fetch anything other than the parent HTML file, and don't download the other images etc. on the pages he requests
- that he is visiting areas that are non-existent and listed under Disallow in robots.txt - so he is up to no good
- that even when he visits non-existent areas of the site, he gets a 200 response and not a 404
- that when he visits a trap, he still gets a 200 and doesn't trigger the trap.

The four hours' worth of logs from his visit contain not a single HTTP 403 code, whereas if I visit and ask for any of the booby-trapped pages and directories I get an immediate ban etc. Both I and someone else have checked in the last few hours - the traps DO work as designed for our visits - but not for his.

Here is the relevant bit from my .htaccess - from the beginning of the file to the end of the rewrite code. I've munged most of the lines except the one I'm referring to.

****************************************
RewriteEngine On

RewriteRule ^$ /index.html [R,NC,L]
#
RewriteCond %{REQUEST_URI} !/trap/*****warning\.php$
RewriteCond %{REQUEST_URI} !/trap/****st\.php$
RewriteCond %{REQUEST_URI} !^/trap/****st\.php$
# should rewrite everything starting with *****
# except the warning.php

RewriteRule ^****/ /trap/****st.php [L]
RewriteRule ^slides/ /trap/****st.php [L]
RewriteRule ^*******.php /trap/****st.php [L]


*********************************************************

I would be grateful for any advice as to how this is being done. I have adapted most of my traps etc from info on this site. I do not speak either fluent php OR htaccess!

Many thanks.

jdMorgan

1:56 pm on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First, try redirecting this client to the correct URL-path if the requested URL-path contains your own domain:

# Redirect if hostname is present in requested URL-path (with several variations) and matches my domain
RewriteCond $2 ^(www\.)?mydomain\.com$ [NC]
RewriteRule ^/?(https?://)?([^.:/]+(\.[^.:/]+)+)\.?(:[0-9]+)?(/.*)?$ http://www.mydomain.com$5 [R=301,L]
# Else return 403 if someone else's domain is in there
RewriteRule ^/?(https?://)?[^.:/]+(\.[^.:/]+)+ - [F]

If that makes this user-agent keep coming back again and again while still prepending the protocol and hostname to the request line, then simply return a 403-Forbidden response when any hostname is present in the requested URL-path.

And if that doesn't work, and you cannot have this guy blocked at the firewall, then return a zero-byte 200 response (empty page) just to keep your bandwidth down.
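
A sketch of that zero-byte option, assuming you create an empty file named /empty.html (the filename is just an example): %{THE_REQUEST} holds the raw request line, so the hostname is still visible there even after any internal rewriting.

# Send any request whose request line contains a full URL to an empty page
RewriteCond %{REQUEST_URI} !^/empty\.html$
RewriteCond %{THE_REQUEST} ^[A-Z]+\ https?:// [NC]
RewriteRule ^ /empty.html [L]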

I should point out that according to the HTTP protocol, it *is* acceptable to include a protocol and the entire hostname in the URL-path, but it is almost never done.

To be clear, your trap rules are set up to detect an HTTP request that looks like this:
GET /trapped-URL-path.html HTTP/1.1
Host: www.mydomain.com

But this user-agent is sending:
GET http://www.mydomain.com/trapped-URL-path.html HTTP/1.1
Host: www.mydomain.com

and so is bypassing your access-control rules because none of the patterns match.

Another set of rules that may be helpful in this case is:

# Ban requests with literal hyphens for either or both user-agent and referrer
RewriteCond %{HTTP_USER_AGENT} ^-$ [OR]
RewriteCond %{HTTP_REFERER} ^-$
RewriteRule ^ /trap-script.php [L]
#
# Block requests with blank user-agent and referrer
RewriteCond %{HTTP_USER_AGENT}%{HTTP_REFERER} ^$
RewriteRule ^ - [F]

But I strongly suggest that you get your original trap rules working again first by trying to detect/redirect the client when it adds the protocol and domain to the requested URL-path.

I do not speak either fluent php OR htaccess!
If you want to be able to handle the Web as it exists today, it's time to start getting fluent (or start putting big money into a "consultant account")...

Jim

revrob

5:01 pm on Aug 31, 2010 (gmt 0)

10+ Year Member



Thanks.
I think I see what you are getting at - although, just to add, I've sort of concluded that what this guy is doing is running some URL list through a telnet session, because the logs show an HTTP/1.0 request rather than HTTP/1.1 - and when I try the telnet method, I get exactly the same lack of user agent in the log entry - namely
"-" "-" "-"

I've added the paragraph you suggested to my .htaccess file, modified for my domain. The site still works! I'm not sure how to "test" whether the hack he is using still works, though, as my attempts at telnet commands only generated error messages from the web server, so I may just have to wait for more of that type of request and see what happens.

As for the hyphens etc in the user agents, yes - I understand that one too.

And with regard to the "big money into a consultant account" - ROFL (small community charity websites)! - ha ha - I AM the consultant and my website budget is a nice round one, once the hosting has been paid for.

But thanks for the help and for responding so promptly. I'm in a bit of a running battle with one of our very big ISPs over here, trying to establish my right to control access to my site against their non-compliant, non-permitted, non-identified covert customer tracking and website spidering, so it's all getting a bit unpleasant.

wildbest

5:38 pm on Aug 31, 2010 (gmt 0)

10+ Year Member



the logs show a HTTP/1.0 request rather than HTTP/1.1

I have a lot of those as well. Are there any SEO risks if I serve [F] to all HTTP/1.0 requests?
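
Something like this is what I have in mind - just an untested sketch; %{THE_REQUEST} is the full request line, which ends with the protocol version:

RewriteCond %{THE_REQUEST} \ HTTP/1\.0$
RewriteRule ^ - [F]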

g1smd

7:27 pm on Aug 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule ^$ /index.html [R,NC,L]


I am not liking the 302 redirect from the root to a named index file.

That is marginally better than a 301 redirect, but really "/" should be the canonical URL and the DirectoryIndex directive should set the filename that is served by that URL request.
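
That is, remove the RewriteRule entirely and use something like this instead (assuming index.html is the file that should be served for "/"):

DirectoryIndex index.html

A request for "/" then serves the index file internally, with no redirect and no /index.html URL ever exposed to visitors.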

revrob

8:38 pm on Aug 31, 2010 (gmt 0)

10+ Year Member



The .htaccess stuff and my traps are from other threads at this site (which you, g1smd, were kind enough to participate in back then):
[webmasterworld.com...]

and the original scripts came from these discussions I think
[webmasterworld.com...]
[webmasterworld.com...]

Could you please explain what should be there instead of /index.html, and what you mean by "the DirectoryIndex directive setting the filename that is served by that URL request"? I'm all ears.

I'll have a better idea of how the current arrangements are working after the normal SEO bots (wanted and unwanted) have spent the night doing their usual visits. At the moment all seems to be well.

Whether yesterday's nasty four-hour crawler will be back, of course, I don't know.

Once again, many thanks.