e.g.:
XXX.XX.XX.XX ... "qtnbkirkap7p cglgmvrsnjipaxbltcgMlncmvv"
XXX.XX.XX.XX ... "hosf8moifiiipkrm cgntgtdjcfakgkKsvkxu"
XXX.XX.XX.XX ... "agiexyUbpyaypcovntmnotnjUaciaske"
I thought it was a single user for a while because the IP would change each day but stayed within a geographic region:
64.56.XXX.XXX
68.13.XXX.XXX
Now I have a new visitor from 203.222.XXX.XXX doing the same thing.
I've been blocking them - because the whole concept annoys me - using:
# Forbid these addresses when the UA starts with lowercase letters
RewriteCond %{REMOTE_ADDR} ^64\.62\.XXX\.XXX [OR]
RewriteCond %{REMOTE_ADDR} ^68\.13\.XXX\.XXX [OR]
RewriteCond %{REMOTE_ADDR} ^203\.222\.XXX\.XXX
RewriteCond %{HTTP_USER_AGENT} ^[a-z]+
RewriteRule .* - [F,L]
but would really like to know what it is - some kind of anonymizer system? A proxy? ...?
Haven't seen that one!
However, one approach is to make a list of allowable spiders, add a few entries for the allowable variants of 'Mozilla/n.nn (compatible; ' used by Mozilla-based browsers, and then block everything else.
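A minimal sketch of that idea (the UA prefixes below are placeholders; you'd build the real list from your own logs, and note the backslash-escaped spaces so Apache doesn't split the pattern argument):

# Permission-based UA filtering: the conditions AND together, so the
# rule fires only when the UA matches none of the allowed prefixes
RewriteCond %{HTTP_USER_AGENT} !^Googlebot/
RewriteCond %{HTTP_USER_AGENT} !^msnbot [NC]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[0-9]\.[0-9]+\ \(compatible;
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.0\ \(
RewriteRule .* - [F]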
Aside: I've actually done that on one 'test' site. It works well, except when a new and not-yet-perfected spider rolls out (MSNBOT/0.1 being a prime example). The first time it crawled, it ignored the Disallow lines in robots.txt, and then drew 403s on several attempted fetches as a result (based on observation, the bug appears to be fixed in msnbot/0.11).
This is a permission-based approach - nothing is allowed unless specifically allowed. It is not suitable for e-commerce or business sites, but it works OK on hobby sites and development sites.
Just thought I'd toss that out for discussion... Even if your site fits the e-commerce/business description, you might be able to use that approach temporarily and see if that particular exploiter will go away if 403'ed.
You might also want to play with this code, or use some of the vars it mentions:
# Block anonymous proxy requests: a proxy header is present...
RewriteCond %{HTTP:Via} !^$ [OR]
RewriteCond %{HTTP_FORWARDED} !^$ [OR]
RewriteCond %{HTTP:X-Forwarded} !^$
# ...but none of the headers that would reveal the original client
RewriteCond %{HTTP:Client-IP} ^$
RewriteCond %{HTTP:Forwarded-For} ^$
RewriteCond %{HTTP:X-Forwarded-For} ^$
RewriteRule .* - [F]
Jim
Hmm... a regexp test for 32-39 alphanumerics? Perhaps this is better than what you're using now:
RewriteCond %{HTTP_USER_AGENT} ^[a-zA-Z_0-9\ ]+
I'm not sure you can make it more complex than that, e.g. by adding {32,39}. OTOH, another option would be to omit that condition altogether and just ban the IPs, as this offender might as well pick standard Mozilla User-Agent strings instead of random characters.
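If interval quantifiers are accepted (they should be, since mod_rewrite patterns are ordinary regular expressions), the stricter test might look like this - a sketch, assuming the junk UAs really are nothing but 32-39 letters, digits and spaces; the $ anchor is what makes it strict, and the space in the class is backslash-escaped:

RewriteCond %{HTTP_USER_AGENT} ^[a-zA-Z_0-9\ ]{32,39}$
RewriteRule .* - [F]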
Allow-list: actually, I've thought about that for a while. I guess it was the "A close to perfect .htaccess..." threads that made me consider this option.
It's not that hard to do, especially if you can afford to block a few spiders and link-checkers. At the extreme you could allow only ^Mozilla/, but that would mean disallowing e.g. Googlebot and FAST while still allowing Inktomi/Slurp, ZealBot, Voila, Grub, ZyBorg etc. Add to this the User-Agents that use the Mozilla string but are something else (e.g. Mozilla/3.01 (compatible ;)... not sure I got that right), and you will still need separate rules for things like formmail scripts.

Still, it could make life a lot easier sometimes, but I think you need to be a little experienced to follow this path, as you'll have to make some informed decisions. I might want to allow everything on the front page, just so I don't have to set up specific rules for all the link checkers of this world. Of course that would mean I probably wouldn't get good deep links, but that's a tradeoff.
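That front-page exception is simple enough to express. A sketch, with ^Mozilla/ standing in for whichever allow-list you settle on:

# Let anything fetch the front page; apply the UA allow-list elsewhere
RewriteCond %{REQUEST_URI} !^/$
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/
RewriteRule .* - [F]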
/claus
19/Nov - requested all pages from a hypermail archive - probably ripping email addresses, plus various (random) files from the same site - no graphics.
20/Nov - more pages from the same site, and a related site - no graphics.
21/Nov - hit the mail archive again, plus 1-2 pages from five other (unrelated and unlinked) sites (contact details and feedback pages).
...
5/Dec - requested the same page twice from a 'touristic' website
6/Dec - requested the same page twice from a 'film group' website
...
The reason I started blocking it is the behaviour in November, which was definitely some form of targeted spambot. The IP address changed each day and the UA changed on every request. None of those IPs has been used again, so I might start unblocking them. Since then the behaviour has been more sporadic and seemingly benign.
If I were trying to protect valuable content, a white-list approach might be an option. As it is, I'm managing sites for a number of clients and just trying to keep out the most obvious 'no-gooders'. On an average day we see more than 600 distinct user-agents - most are Mozilla variants, but I've recorded 400+ that aren't - and that's a big list ;)
jdMorgan, I'm only using combined-format logs for reference, so I can't see any X-Forwarded info etc. I'm not even sure how to capture that information short of logging it with PHP.
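If you can edit the main server config (LogFormat and CustomLog aren't allowed in .htaccess), one way is a custom log format that appends the proxy headers to the usual combined fields - a sketch, with 'proxytrace' and the log path as placeholders:

# Combined log plus the Via and X-Forwarded-For request headers
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Via}i\" \"%{X-Forwarded-For}i\"" proxytrace
CustomLog logs/proxy_log proxytrace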