|Should I ban this user agent?|
Just a few days ago, I noticed 16 MB being downloaded in about 20 minutes. The user agent was RPT-HTTPClient/0.3-3.
There wasn't much information about this agent, but I did find something at:
which mentioned the behaviour of the agent to be 'naughty'. Does anyone know what this means? Is the spider more of a web downloader or web grabber, and should it be banned anyway?
If I ban the agent in robots.txt, there is no guarantee that the agent will follow the rules, is there? That is, I cannot force exclusion that way, but maybe in .htaccess?
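To illustrate why robots.txt cannot force exclusion: the check is run by the crawler, not by the server. Below is a minimal Python sketch of what a well-behaved client does, using the standard library's `urllib.robotparser`. The robots.txt contents and the helper name `allowed_by_robots` are assumptions for illustration, and nothing here implies that RPT-HTTPClient actually performs this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one agent to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: RPT-HTTPClient
Disallow: /
"""

def allowed_by_robots(user_agent: str, path: str) -> bool:
    """Return what a compliant crawler would conclude before fetching `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)

if __name__ == "__main__":
    print(allowed_by_robots("RPT-HTTPClient/0.3-3", "/index.html"))
    print(allowed_by_robots("SomeOtherBot/1.0", "/index.html"))
```

A client that never runs a check like this is unaffected by the file, so real enforcement has to happen on the server side, for example in .htaccess.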
Also, another "strange" agent shows up in the web logs as follows:
188.8.131.52 - - [08/Mar/2004:22:06:57 -0500] "GET /robots.txt HTTP/1.0" 200 54 "-" "http://www.almaden.ibm.com/cs/crawler [c01]"
184.108.40.206 - - [08/Mar/2004:22:07:04 -0500] "GET /index.html HTTP/1.0" 404 - "-" "http://www.almaden.ibm.com/cs/crawler [c01]"
220.127.116.11 - - [08/Mar/2004:22:07:15 -0500] "GET /_cmdlogin?login=guest&version=enterprise HTTP/1.0" 404 - "-" "http://www.almaden.ibm.com/cs/crawler [c01]"
18.104.22.168 - - [08/Mar/2004:22:07:26 -0500] "GET /se/ HTTP/1.0" 404 - "-" "http://www.almaden.ibm.com/cs/crawler [c01]"
I did do some searching on this site, and it appears the above IP/site was indicated as something that should be banned. I can use .htaccess to ban IP addresses, but it would make more sense to ban the agents I do not approve of, wouldn't it?
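One way to decide which agents deserve a ban is to total requests and bytes per user agent straight from the access log. The following is a rough Python sketch that assumes the log is in Apache's "combined" format, as the lines above appear to be. The function name `bytes_by_agent` and the regular expression are illustrative assumptions, not part of any existing tool.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def bytes_by_agent(lines):
    """Return {user_agent: [request_count, total_bytes]} for the given log lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in combined format
        size = match.group("size")
        entry = totals[match.group("agent")]
        entry[0] += 1
        entry[1] += 0 if size == "-" else int(size)  # "-" means no body was sent
    return dict(totals)
```

Sorting the result by total bytes would surface heavy downloaders regardless of which IP addresses they come from.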
I could also put an array of banned agents in a PHP file that is executed on every page, but that may place a bit more load on the server and possibly affect response time. I don't know.
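The per-request check described here is cheap if the list is turned into a single pattern once, rather than looped over on every page view. Here is a small sketch of the idea, written in Python as a stand-in for the PHP include. The list contents and the names `BANNED_AGENTS` and `is_banned` are hypothetical, and the entries are just the two agents mentioned in this post.

```python
import re

# Hypothetical ban list; substrings are matched case-insensitively.
BANNED_AGENTS = ["RPT-HTTPClient", "almaden.ibm.com/cs/crawler"]

# Compile once at import time so each request costs a single regex search.
BANNED_PATTERN = re.compile(
    "|".join(re.escape(agent) for agent in BANNED_AGENTS),
    re.IGNORECASE,
)

def is_banned(user_agent: str) -> bool:
    """True if the User-Agent header contains any banned substring."""
    return BANNED_PATTERN.search(user_agent or "") is not None
```

The page code would call `is_banned()` with the request's User-Agent header and answer with a 403 when it returns true. Doing the same match in .htaccess avoids starting the script at all, which is the likely source of any extra load.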
Is there a "definitive" list of user agents that should be banned, please?
[edited by: DaveAtIFG at 4:12 am (utc) on Mar. 16, 2004]
[edit reason] Removed URL [/edit]
The Close to perfect .htaccess ban list [webmasterworld.com] is a pretty comprehensive list of "bad bots."
Thanks for the link to that (longish) thread, loads of information there alright. Btw, aren't we allowed to post URLs that are not 'webmasterworld' ones?
I wish you hadn't asked that! :) We're going way off topic and the answer isn't simple but...
|Btw, aren't we allowed to post URLs that are not 'webmasterworld' ones? |
First, review the TOS [webmasterworld.com], items 13, 20, and 25. Here are a few threads that discuss the issue.
Unfortunately there are no "hard and fast rules" that apply every time... except "Don't post URLs" and that's unrealistic. It basically comes down to a judgment call for each case.