Forum Moderators: phranque


Blocking a rogue spider


revrob

10:38 am on Dec 6, 2010 (gmt 0)

10+ Year Member


I am trying to block an uncooperative spider via .htaccess.

I have set up a system that seems to work when I visit my own site after spoofing the spider's user agent in my own browser (using Firefox's User Agent Switcher add-on): I get redirected to the landing page, and a PHP script sends me an email to notify me of the visit. But none of this works when the REAL bot visits, with the identical user-agent string.

I've broken all the actual HTML links in this posting.

The two relevant log entries are:

Me, spoofing the useragent of the bot:

123.345.67.123 - - [04/Dec/2010:19:52:32 +0100] "GET / HTTP/1.1" 302 233 mydomain.org.uk "-" "(compatible; various stuff here .NET CLR ; http:/ /www.spiderdomain.co.uk/products/string-string/)" "-"

This gets caught by my REWRITE statements, and redirects appropriately to the required landing page which then runs a script to notify me by email - all as intended.


This is the REAL bot and it escapes all my .htaccess traps - it is NOT redirected to the landing page, and so no script runs.
12.345.67.43 - - [04/Dec/2010:18:52:02 +0100] "GET /robots.txt HTTP/1.0" 302 225 www.mydomain.org.uk "http:/ /www.mydomain.org.uk/robots.txt" "(compatible; various stuff here .NET CLR ; http:/ /www.spiderdomain.co.uk/products/string-string/)" "-"

The only difference I can see between those two log entries is that the real spider asks for www.mydomain.org.uk, while my spoofed request is for mydomain.org.uk without the www.

My PHP scripts are evidently okay, as they all work properly when I spoof the user agent myself and visit the site. So it seems to be the rewrite statements that are failing when the real bot visits.

The relevant .htaccess statements are below, between the rows of asterisks. I've checked for various errors and corrected those I could spot myself.


********************************************
********************************************

# case insensitive test for SPIDER user agent

RewriteCond %{HTTP_USER_AGENT} ^SPIDER.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

RewriteCond %{HTTP_USER_AGENT} .*SPIDER.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

# case insensitive test for SPIDERDOMAIN user agent

RewriteCond %{HTTP_USER_AGENT} ^spiderdomain.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

RewriteCond %{HTTP_USER_AGENT} .*spiderdomain.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

RewriteCond %{HTTP_USER_AGENT} ^virus-alerts.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

RewriteCond %{HTTP_USER_AGENT} .*virus-alerts.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]


# test for SPIDERDOMAIN spyware host IP addresses
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.3$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.5$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.6$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.41$ [ornext]

#####################################################
# the line below is the IP address the bot called from
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.43$ [ornext]
######################################################
RewriteCond %{REMOTE_ADDR} ^12\.34\.181\.134$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.181\.135$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.131$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.132$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.133$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.134$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.252 [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.106
RewriteCond %{REQUEST_URI} !^/landingpage\.php$ [ornext]
RewriteRule .* http:/ /www.mydomain/landingpage.php [last]

**********************************************************
**********************************************************


Is there anything obvious I am missing here that accounts for the system working when I spoof the user agent but NOT working when the real spider visits? (All the IP addresses above are made up, BTW.)

Many thanks.

wilderness

11:22 am on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"(compatible; various stuff here .NET CLR ; http:/ /www.spiderdomain.co.uk/products/string-string/)"

If the above is the actual bot's UA, then your RewriteConds that are anchored with ^ (begins with) are never going to work, because the actual UA begins with (.

If you remove the four sections that wrongly use the leading anchor (begins with) and simply change them to a single line using "contains", in the following manner, it should work:

# IF User Agent contains the word spider (regardless of case)
RewriteCond %{HTTP_USER_AGENT} SPIDER [NC]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]


You may also modify your virus-alerts lines with this same method, or use a combined line (thereby turning your six sections into one).
EX:

# If the UA contains spider, virus, or alerts
RewriteCond %{HTTP_USER_AGENT} (SPIDER|virus|alerts) [NC]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]

jdMorgan

2:18 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You really only need a single rule for this. I'd suggest either:

# attempt to 302-redirect the client to landingpage.php if the
# UA contains spider, virus, or alerts (case insensitive)
RewriteCond %{HTTP_USER_AGENT} (SPIDER|virus|alerts) [NC]
RewriteRule !^landingpage\.php$ http://www.example.org.uk/landingpage.php [R=302,L]

- or simpler/better -

# Deny access if the UA contains either spider, virus, or alerts (case insensitive)
RewriteCond %{HTTP_USER_AGENT} (SPIDER|virus|alerts) [NC]
RewriteRule ^ - [F]

Notice that in the first code snippet, the comment says "attempt to redirect." That is quite accurate, as many, if not most, malicious user-agents will NOT follow a 302 redirect, especially if they "know" the URL that they want to fetch.

I don't fool around with trying to redirect unwelcome requests or attempting to serve them "special" content... My server simply denies these requests with a 403 and gets on with serving legitimate requests.

Be sure to exclude both robots.txt and your custom 403 error page from this rule, either with an explicit RewriteCond exclusion, or with a 'skip rule' ahead of this one. See the thread in our Apache Forum Library on the correct order for RewriteRules to avoid other common errors and the serious problems that result.
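As a rough sketch of that exclusion (the 403 page path here is made up for illustration, not taken from this thread), the 'skip rule' approach might look like:

```apache
# Skip rule: if the request is for robots.txt or the custom 403 page,
# skip the next 1 rule ([S=1]) so those files are always reachable.
# (In per-directory .htaccess context the pattern has no leading slash.)
RewriteRule ^(robots\.txt|errors/403\.html)$ - [S=1]

# Deny access if the UA contains spider, virus, or alerts (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (SPIDER|virus|alerts) [NC]
RewriteRule ^ - [F]

# Serve the custom page for 403 responses
ErrorDocument 403 /errors/403.html
```

The explicit-exclusion alternative would instead add a condition such as `RewriteCond %{REQUEST_URI} !^/(robots\.txt|errors/403\.html)$` above the blocking rule, with no skip rule needed.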

Jim