Forum Moderators: phranque


Has This Defeated the China [POST] Bots?!


roshaoar

1:10 pm on Jan 21, 2015 (gmt 0)

10+ Year Member



I have a personal hobby site which has a reasonably decent readership because the content, photography related, is pretty helpful for others. The site is basically a few hundred articles about how to do a certain type of photography, and people can add a comment to every article. It's not the busiest site in the world, but it nevertheless has a history of attracting a lot of China [GET]-then-[POST] bots, I guess trying to add spam through the comment form. About 75% come from China, but Ukraine and some other suspects are in there as well.

Their attempts to add a comment always fail because of the approval and CAPTCHA setup. They're annoying nevertheless, because they make so many page requests that they skew a bespoke 'most popular pages' widget I customised. The bots latch onto insignificant pages in waves of hundreds during a day, sometimes thousands, for no apparent reason (as an aside, I suspect this is why the BBC shows weird old stories in its most popular widget as well).

I have tried a lot of different ways to do something about them, but I finally seem to have hit on something that appears to be reducing their interest in my site, to the extent that I'm now only seeing 1% of the visits I did a month ago. I'm not putting this out as a solution for everyone, or the ideal solution, far from it; what I'm interested in is seeing whether this is coincidence and based on something else entirely, or whether it works for others too. I've always thought there were no real ways to stop bots coming to your site, but if this does lessen their interest, then surely it can only be good for people.

My approach is twofold. First, my [GET]/[POST] bots all arrive with no referrer and a Mozilla/4.0 user agent, so I block that combination in .htaccess:

# Chinabots (Moz4) - tied together
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0
RewriteCond %{REQUEST_URI} !.*\.ico
RewriteCond %{HTTP_REFERER} ^$
RewriteRule .* 1.php [L]


The other block I employ is one that looks for non-referrer straightforward POST attempts:

# Otherbots (Moz5) - post but no referrer. Just die.
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_REFERER} ^$
RewriteRule .* 1.php [L]


The effect of both these blocks is to intercept such requests and rewrite them internally to 1.php, with no further rules applied ([L]). But the file 1.php is the interesting part: it produces a completely empty response. All it contains is:

<?php
header("HTTP/1.1 403 Forbidden");
?>
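If you ever want to audit what 1.php is swallowing, one possible variant (the log filename here is purely illustrative, not part of my setup) keeps the body at 0 bytes but notes each blocked hit:

```php
<?php
// Bare 403 with a zero-byte body, exactly as before...
header("HTTP/1.1 403 Forbidden");

// ...but append each blocked request to a flat file for later review.
// "blocked.log" is an illustrative name; put it outside the web root
// if your host allows it.
$ip     = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-';
$method = isset($_SERVER['REQUEST_METHOD']) ? $_SERVER['REQUEST_METHOD'] : '-';
$ua     = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
$line   = date('c') . " $ip $method \"$ua\"\n";
file_put_contents(__DIR__ . '/blocked.log', $line, FILE_APPEND | LOCK_EX);
?>
```

Nothing is echoed, so the response stays empty; only the server-side log grows.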


I think there's something in this; it seems to be working for me. In my logfiles these requests show a 0 byte response, so the bot is completely starved of information and onward routes, and there's no data cost either. I just wonder whether the software behind this sort of bot has a programmatic problem handling a 0 byte response, or whether this disrupts its activity enough for the bot to stop trying. Fingers crossed.

Granted, this would be terrible for a real person, as you're just serving up a blank white page, and usually we like to have a nice fancy error page that suggests some options. But having spent a year with one of the fancy ones with links to everywhere, I've had precisely one user report a problem (a Mozilla/4.0 user accessing bookmarks, which is the downside of this), while the bots kept hammering away, so the blank page is a trade-off I'm willing to make.

For the record, the things that didn't reduce their interest:

- redirecting to fancy 403 page with links to site
- redirecting to fancy 404 page with links to site
- 400 response
- in-page PHP block
- .htaccess IP CIDR span blocks

Last note on CIDR IP span blocks: I gave up when I reached 200. Block one range and they just come from another. This method really was whack-a-mole for me, and if anything it increased the numbers.
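For anyone who hasn't tried this, the sort of thing I was maintaining looked roughly like the fragment below. The addresses shown are RFC 5737 documentation ranges, used here only to illustrate the format, and the syntax is the Apache 2.2 style most shared hosts offered at the time:

```apache
# Illustrative CIDR range blocks (RFC 5737 documentation ranges,
# not real bot sources). Apache 2.2 mod_authz_host syntax.
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24
Deny from 203.0.113.0/24
```

Multiply that by a couple of hundred lines and you can see why upkeep became a chore.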

not2easy

5:32 pm on Jan 21, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The main problem I see is blocking "GET" requests for "anything" with a blank referrer. Legitimate visitors can arrive from legitimate sources without sending a referer header; that's not rare to see. Do you do any log analysis to be sure you are only blocking unwanted "GET" requests, given that a blocked visitor has no way to appeal?

roshaoar

7:49 pm on Jan 21, 2015 (gmt 0)

10+ Year Member



Yes, that's of course the concern. But what I found during three months of watching live logfiles more or less every hour I'm awake ("apache log viewer") is that the Mozilla/4.0 GETs are *only* for the page URLs and none of the associated files. I.e. no css, jpg or js were fetched, just the URL, then a POST attempt straight after. Which pretty much has to be bots, no?

wilderness

8:43 pm on Jan 21, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



blocking "GET" requests for "anything"


not2easy's reply was an oversight, as it disregarded the multiple conditions and your parallel requirement of Mozilla/4.0.

lucy24

9:33 pm on Jan 21, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0

If you leave off the closing anchor, wouldn't this also block a lot of MSIE <10 visits?

You can also do it with the mod_setenvif-plus-mod_authzsomething combo:
BrowserMatch ^Mozilla/4\.0$ keep_out

Or bad_bot or whatever name strikes your fancy. Save mod_rewrite for when nothing else works.
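For completeness, the other half of that combo, assuming the 2.2-era mod_authz_host directives, would be something like:

```apache
# mod_setenvif tags any request whose User-Agent is exactly
# "Mozilla/4.0" (note the closing anchor, so MSIE's longer
# "Mozilla/4.0 (compatible; ...)" strings are not caught)...
BrowserMatch ^Mozilla/4\.0$ keep_out

# ...and mod_authz_host refuses any request carrying the tag.
Order Allow,Deny
Allow from all
Deny from env=keep_out
```

The env-var name is arbitrary; the two directives just have to agree on it.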

I'd suggest constraining the rule more narrowly. How often do robots try to POST a jpg? For comparison purposes here's one of my own rules:

RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} !contact
RewriteRule (^|\.html|/)$ - [F]

The second condition excludes, by name, the only page that actually permits POST. For your site, there's probably a directory containing your user input. The body of the rule constrains it to requests for pages, so conditions don't have to be evaluated on non-page requests. Replace "\.html" with "\.php" if that's what you use.
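So for a PHP site, with a hypothetical comments.php as the one URL that legitimately accepts POST, the adapted rule might read:

```apache
# Refuse POST to anything page-like, except the one script that
# legitimately receives comments ("comments.php" is a placeholder;
# substitute whatever your comment form actually posts to).
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} !comments\.php
RewriteRule (^|\.php|/)$ - [F]
```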

Does the rewrite-to-403 sequence work better than simply serving an [F] in the first place? Interesting.

I gave up when I reached 200 - block one range and they just come from another

Your ranges may be too small. Some IP ranges never, ever have humans-- and some are from countries you don't want to see-- so why not block them wholesale?

:: detour to check ::

It looks like I'm currently blocking about 1400 ranges (250 lines of htaccess) of various sizes. Normally nothing below /24 except in special circumstances. Those get re-checked periodically.

roshaoar

9:38 am on Jan 22, 2015 (gmt 0)

10+ Year Member



Thank you all, especially lucy24. Your advice has been invaluable in getting to what I have now. I've narrowed things based on your observations where I can, but, for example, my host has disabled the mod_setenvif-plus-mod_authzsomething combo. Restricting to php is a good call, though :)

1400 ranges and 250 lines, whew. You must have an amazingly comprehensive collection of ranges :). I can see why, but I found maintaining and updating my ever-expanding ranges just soaked up too much time, and this does seem to be working better - for me at least. Fingers crossed!