|Do you let bad web requests clog up your log files?|
Over the past few months the number of attempts to load scripts on our web server has increased a lot.
It started off with this 1st.cgi a few months ago. It's always IPs from China.
Sep 25 22:52:31 server1 httpd: [error] [client 184.108.40.206] script not found or unable to stat: /var/www/cgi-bin/1st.cgi
Repeat that a couple of times per minute x hundreds of times per day
Lately it's been ip.cgi and again always an IP from China.
Sep 28 15:07:28 server1 httpd: [error] [client 220.127.116.11] script not found or unable to stat: /var/www/cgi-bin/ip.cgi
Repeat that 4 times per second x thousands of times per day
Do I admit defeat and let them take up gigs of space in my log files? The thousands of lines per day also make it much harder to find legitimate errors, since I use a web interface that shows me only the last 100 lines of error_log.
What are my options for blocking these attempts?
A firewall would stop the requests reaching the server. The server software only logs requests that reach it. Many firewalls also keep logs though.
Another option is to detect these requests by IP address range, User-Agent, or Request_URI (or some combination of these) and return a 403-Forbidden response. While most of these 'bots are as dumb as a rock, some of them do detect the server response, and some of those will "go away" if they find that their efforts are in vain.
On the other hand, some will just keep at it. The only way to find out is to test.
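To make the matching logic concrete, here is a minimal Python sketch of the decision described above: check the client IP against a range, the Request_URI against a pattern, and the User-Agent against a pattern, and answer 403 if any of them match. Everything in it is an assumption for illustration: the network range is a reserved documentation range, the "badbot" agent string is made up, and the function name is hypothetical. On a real Apache server you would express the same rules in the server configuration rather than in Python.

```python
import ipaddress
import re

# Hypothetical rules: replace with ranges and patterns seen in your own logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder range
BLOCKED_URI = re.compile(r"^/cgi-bin/(1st|ip)\.cgi$")        # scripts named in this thread
BLOCKED_AGENT = re.compile(r"badbot", re.IGNORECASE)         # made-up agent string


def response_status(client_ip, request_uri, user_agent):
    """Return 403 if any rule matches, else None to let the request proceed."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    if BLOCKED_URI.match(request_uri):
        return 403
    if BLOCKED_AGENT.search(user_agent or ""):
        return 403
    return None
```

Matching on the Request_URI is usually the easiest rule to maintain here, since the script names stay the same while the IPs change.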
Some bad-bots start their sessions with a request for a file that should not exist. The purpose is to discover if the server will return a proper 404 response to this request. If so, the bad-bot launches into a long series of requests trying to find a particular admin script -- the script that is used to configure PHP or your database, for example. In these cases, the initial 'non-existent file test' request can sometimes be detected. If you return a 200-OK response, then the bad-bots know that they won't be able to determine the correct script URL by just trying all common variations of the filename, and again, some of them will give up and go away.
Another option, if you have server config access, is to use custom logging. You can either "drop" these bad requests from the access log, or log them in a separate log file. See [httpd.apache.org...]
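The same "separate the noise" idea can also be applied after the fact, outside the server. The sketch below is not Apache's custom logging; it is a small Python filter, under the assumption that the noise looks like the two error lines quoted earlier in this thread. The pattern and the function name are hypothetical.

```python
import re

# Assumed noise pattern, based on the two error lines quoted earlier in the thread.
NOISE = re.compile(
    r"script not found or unable to stat: /var/www/cgi-bin/(1st|ip)\.cgi"
)


def split_log(lines):
    """Split log lines into (kept, noise) lists, preserving order."""
    kept, noise = [], []
    for line in lines:
        if NOISE.search(line):
            noise.append(line)
        else:
            kept.append(line)
    return kept, noise
```

Feeding the lines of error_log through split_log and viewing only the kept list would let a "last 100 lines" view show real errors again, while the noise list can be written to a separate file or discarded.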
|A firewall would stop the requests reaching the server. |
Webmin had "Linux Firewall" as an option so I enabled it and added a filter to DROP that particular IP and the error_log has been clean since. Thanks!
I understand that part of the issue is that noise is making it more difficult to detect the signal. But I am always wary of solutions that filter based on a specific IP, or range. One IP turns into 2, then 10, then 50, and it becomes hard to maintain (especially if this is a firewall setting, which is buried. If you got hit by a bus, would the person who followed in your footsteps know to look there?)
A while back, I had written a little module for our software that looked for obvious rogue bots based on a pattern (20 hits from the same IP in 1 second, stuff like that) and put the suspicious IPs into one of a couple special files. One file would be "suspicious", another would be "temporarily block" and a third would be a blacklist, which would get updated with the most recent visit.
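A pattern check like "20 hits from the same IP in 1 second" can be sketched as a sliding window per IP. This is not the module described above, only a minimal Python illustration; the class name, the default thresholds, and the idea of passing timestamps in as plain numbers of seconds are all assumptions.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flag an IP once it makes max_hits requests within window_seconds."""

    def __init__(self, max_hits=20, window_seconds=1.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def record(self, ip, timestamp):
        """Record one hit; return True if this IP now looks suspicious."""
        recent = self.hits[ip]
        recent.append(timestamp)
        # Drop hits that have fallen out of the time window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.max_hits
```

An IP for which record returns True could then be written to the "suspicious" file, with promotion to "temporarily block" or the blacklist decided by whatever policy suits the site.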
Using Apache conditional logging (http://httpd.apache.org/docs/current/logs.html) you could just suppress logging of certain IPs. Using rewrites, you could block their requests with an error code.
Or, my personal favorite, for the really nasty ones, you could redirect the request from bot A back to itself, or have it crawl some smarmy site that's stealing your content or something. Mwa ha ha! (I don't recommend this, as it may not be looked upon favorably by Google).
In the end, we found that logs are not the answer -- in one of my companies, we had huge sites and as they grew, logs became ungainly and large. So instead, we looked at the problem in an entirely different way -- mostly by letting Google Analytics do the heavy lifting.