.htaccess problem: blocking an unnamed bot

         

m3d1a

3:23 am on Jul 15, 2009 (gmt 0)

10+ Year Member



First off, I'm in no way an .htaccess expert, but I've done my share of dabbling and tweaking small scripts here and there.

Recently, I've had a bot requesting single pages repeatedly to boost a user's search rankings within my site.

Looking through the logs, I've found this entry repeated over and over, always 10 to 15 seconds apart:

123.123.123.123 - - [14/Jul/2009:00:30:00 -0700] "GET /example.com/profilenameomitted HTTP/1.1" 301 310 "-" "-"

I've omitted the user's profile name and IP address, but you get the idea. Their bot is apparently named "-", and it's getting quite annoying.

Anyone have a clue on how to block this bot? I've tried a few things with no success. Here's how I'm currently blocking bots:

RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Again, any help would be GREATLY appreciated. Thanks everyone! :)

[edited by: jdMorgan at 3:36 am (utc) on July 15, 2009]
[edit reason] example.com [/edit]

jdMorgan

3:48 am on Jul 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the UA is always blank, the requesting IP address is always the same, and the URL (as amended above) is always the same, then perhaps something like:

RewriteCond %{REMOTE_ADDR}>%{HTTP_USER_AGENT}>%{REQUEST_URI} ^123\.123\.123\.123>>/example.com/profilenameomitted [OR]

This looks for the specific IP address, blank UA, and specific URL-path-prefix. It assumes that the human at this IP address does not need to be granted 'real' access using that combination; if he does try it, he's going to see a 403 response...

BTW, [L] used with [F] is redundant. You can use just [F] instead of [F,L] on your rule.

Oh, and the ">" characters in the RewriteCond string and pattern are just "visual spacers" -- they are matched literally, but their only purpose here is to mark the 'field boundaries.'
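
Put together, a rough sketch of how that condition might slot into the ruleset quoted earlier. It assumes the new line simply goes at the top of the existing [OR] chain (only the last few user-agent lines are shown), that RewriteEngine is already enabled elsewhere in the file, and that [F] is used without [L]:

```apache
# Assumed to be present already; repeated here so the fragment stands alone.
RewriteEngine On

# Specific IP address + blank User-Agent + specific URL-path prefix.
# The ">" characters are literal separators between the three fields.
RewriteCond %{REMOTE_ADDR}>%{HTTP_USER_AGENT}>%{REQUEST_URI} ^123\.123\.123\.123>>/example.com/profilenameomitted [OR]
# Existing user-agent blocks (list abbreviated).
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
# The last condition in the chain carries no [OR].
RewriteCond %{HTTP_USER_AGENT} ^Zeus
# [F] returns 403 Forbidden and stops rewriting on its own, so no [L] is needed.
RewriteRule ^.* - [F]
```

The IP address and path are the placeholders from the log line above, not real values.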

Jim

m3d1a

4:06 am on Jul 15, 2009 (gmt 0)

10+ Year Member



Jim,

Thanks a lot for the info. Here are the problems I foresee becoming bigger issues down the road:

1. I'm pretty sure this will be run by more than one user in the future, and rather than blocking IP(s) from portions of the site, I'd rather just block the bot in question.

2. The next logical step for an offender of this type is to start randomizing their IP on every page load rather than just changing their session name. (I've currently got code in place to stop IPs with the same session name from boosting page rank by simply refreshing the page.)

3. As my user base grows, manually updating my .htaccess file for every offender will become terribly tiresome, and I'm sure it will eventually affect load times with enough offenders, so I would really like to just block whatever bot is causing the problems.

So, any ideas on how to block the bot simply named with a dash? :)

jdMorgan

4:18 am on Jul 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not 'named with a dash' -- It's just not sending any HTTP User-agent header at all.

Unfortunately, this is common even with non-misbehaving users; ISPs that use caching proxies in their networks (e.g. AOL, Earthlink, and many more) will send requests with no HTTP User-agent header as well -- and their users will be unaware of this, too.
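
For illustration only, a condition that matches a missing or empty User-Agent might look like the sketch below. This is not a recommendation: given the caveat above, it would hand a 403 to those proxy users as well. The optional "-" in the pattern is an assumption, added to also catch a client that sends a literal dash as its User-Agent.

```apache
# Sketch only: matches requests whose User-Agent header is absent or empty.
# The "-" in the access log is just a placeholder for the empty value;
# ^-?$ matches the empty string and also a literal "-" (assumed, see above).
RewriteCond %{HTTP_USER_AGENT} ^-?$
# Return 403 Forbidden for any URL requested that way.
RewriteRule ^.* - [F]
```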

So, it sounds like you can't block it except behaviorally. Maybe change your TOS to say that bot-runners will be booted, no warnings, no refunds, accounts forfeit and deleted. Then use your session method to detect too-frequent loads of this 'rank-increasing URL thing' as well as too-frequent page-refreshing.

Jim