I am trying to block an uncooperative spider via .htaccess.
I have set up a system that seems to work when I visit my own site with my own browser, having spoofed the spider's user agent with Firefox's User Agent Switcher add-on: I get redirected to the landing page, and a PHP script sends me an email to notify me of the visit. None of this works when the REAL bot visits with the identical user-agent string.
I've deliberately broken all the actual HTML links for this posting.
The two relevant log entries are:
Me, spoofing the user agent of the bot:
123.345.67.123 - - [04/Dec/2010:19:52:32 +0100] "GET / HTTP/1.1" 302 233 mydomain.org.uk "-" "(compatible; various stuff here .NET CLR ; http:/ /www.spiderdomain.co.uk/products/string-string/)" "-"
This gets caught by my rewrite statements and is redirected to the required landing page, which then runs a script to notify me by email, all as intended.
This is the REAL bot, which escapes all my .htaccess traps: it is NOT redirected to the landing page, so no script runs.
12.345.67.43 - - [04/Dec/2010:18:52:02 +0100] "GET /robots.txt HTTP/1.0" 302 225 www.mydomain.org.uk "http:/ /www.mydomain.org.uk/robots.txt" "(compatible; various stuff here .NET CLR ; http:/ /www.spiderdomain.co.uk/products/string-string/)" "-"
The only difference I can see between those two log entries is that the real spider asks for www.mydomain.org.uk, whereas my spoofed request is for mydomain.org.uk without the www.
My PHP scripts are evidently fine, as they all work properly when I spoof the user agent myself and visit the site. So it seems to be the rewrite statements that are failing when the real bot visits.
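One thing that may be worth ruling out before blaming the rewrite statements: both log entries above show status 302, which suggests the rule may have fired for the real bot as well, and that the bot simply did not follow the redirect the way a browser does. If that is the case, an internal rewrite would run the landing-page script on the server without relying on the client. A minimal sketch, assuming mod_rewrite is available in .htaccess and reusing the `spiderdomain` and `landingpage.php` placeholders from this post:

```apache
RewriteEngine On

# Unanchored, case-insensitive match on the (placeholder) bot token.
RewriteCond %{HTTP_USER_AGENT} spiderdomain [NC]
# Skip the landing page itself so the rewritten request does not loop.
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
# A local path as the target is an internal rewrite: the server runs
# landingpage.php for this request itself, with no redirect for the client to follow.
RewriteRule .* /landingpage.php [L]
```

With a local path as the target, the client never sees a redirect; the script's output is returned in place of whatever was requested (robots.txt in the logged case).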
The relevant .htaccess statements are shown below, between the rows of asterisks. I've checked for various errors and corrected those I could spot myself:
********************************************
********************************************
# case insensitive test for SPIDER user agent
RewriteCond %{HTTP_USER_AGENT} ^SPIDER.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
RewriteCond %{HTTP_USER_AGENT} .*SPIDER.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
# case insensitive test for SPIDERDOMAIN user agent
RewriteCond %{HTTP_USER_AGENT} ^spiderdomain.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
RewriteCond %{HTTP_USER_AGENT} .*spiderdomain.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
RewriteCond %{HTTP_USER_AGENT} ^virus-alerts.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
RewriteCond %{HTTP_USER_AGENT} .*virus-alerts.* [nocase]
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http:/ /www.mydomain.org.uk/landingpage.php [L]
# test for SPIDERDOMAIN spyware host IP addresses
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.3$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.5$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.6$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.41$ [ornext]
#####################################################
# the line below is the IP address the bot called from
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.43$ [ornext]
######################################################
RewriteCond %{REMOTE_ADDR} ^12\.34\.181\.134$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.181\.135$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.131$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.132$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.133$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.134$ [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.252 [ornext]
RewriteCond %{REMOTE_ADDR} ^12\.34\.106
RewriteCond %{REQUEST_URI} !^/landingpage\.php$ [ornext]
RewriteRule .* http:/ /www.mydomain/landingpage.php [last]
**********************************************************
**********************************************************
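For comparison, here is how the rules above might be consolidated. This is only a sketch under stated assumptions: mod_rewrite is enabled, the names (SPIDER, spiderdomain, virus-alerts, landingpage.php, www.mydomain.org.uk) and the made-up addresses are the placeholders from this post, and the redirect target is written with the full domain. The `^`-anchored user-agent tests are dropped because the logged user agent begins with `(compatible;`, so they could never match it, while an unanchored pattern matches anywhere in the string. In the address group, the last address condition carries no OR flag, and neither does the landing-page exclusion, so the exclusion is ANDed with the whole group. The two prefix-only patterns also gain a trailing `\.`, a small tightening that is not in the original.

```apache
RewriteEngine On

# User-agent test: one unanchored, case-insensitive pattern.
RewriteCond %{HTTP_USER_AGENT} (SPIDER|spiderdomain|virus-alerts) [NC]
# Never redirect the landing page itself, or the request would loop.
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
# An absolute URL as the target produces an external redirect (302 by default).
RewriteRule .* http://www.mydomain.org.uk/landingpage.php [R=302,L]

# Address test: every address condition except the last carries [OR].
RewriteCond %{REMOTE_ADDR} ^12\.345\.67\.(3|5|6|41|43)$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.34\.181\.13[45]$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.34\.222\.13[1-4]$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.34\.(252|106)\.
# The exclusion applies to the whole OR group above.
RewriteCond %{REQUEST_URI} !^/landingpage\.php$
RewriteRule .* http://www.mydomain.org.uk/landingpage.php [R=302,L]
```

Because both rules issue an external redirect, the landing-page script only runs if the client chooses to follow the 302.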
Is there anything obvious I am missing here that accounts for the system working when I spoof the user agent but NOT working when the real spider visits? (All the IP addresses above are made up, BTW.)
Many thanks.