You know, where anyone (who has registered) can submit a bot that they discover. A moderator could then check it out, possibly with the aid of some automated tools:
* perform a reverse lookup
* visit the IP over HTTP on port 80
* search Google for the IP address or the bot's name
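Just to make the first two checks concrete, here is a rough PHP sketch of what the automated part could look like (the IP is a placeholder, and it assumes allow_url_fopen is enabled; none of this is an existing tool):

<?php
// Hypothetical helper for a moderator reviewing a submitted bot IP.
$ip = '192.0.2.10';   // placeholder address from the submission

// 1. Reverse lookup -- gethostbyaddr() returns the IP itself on failure
$host = gethostbyaddr($ip);
echo "Reverse DNS: $host\n";

// 2. Visit the IP over HTTP on port 80 and see if anything answers
$page = @file_get_contents("http://$ip/");
echo ($page === false)
    ? "No response on port 80\n"
    : "Got " . strlen($page) . " bytes from port 80\n";

// 3. The Google search for the IP or agent name would still be done by hand.
?>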
In addition to a database that could be searched online, you could put out various periodically updated files of use to webmasters. For example:
* malicious bots
* known search-engine spiders
If such a system does not exist, do people think there would be a need for one? Is anyone keen to develop and promote it? I could do the database coding (I'm good with PHP/MySQL) and could possibly host it...
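As a very rough sketch of what the back end might store (the table and column names are invented purely for illustration), something like this would cover both the searchable database and the periodic feed files:

<?php
// Illustrative schema only -- names are made up, not an existing project.
$schema = "
CREATE TABLE bot_reports (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    ip          VARCHAR(15)  NOT NULL,      -- dotted-quad IPv4 address
    user_agent  VARCHAR(255) NOT NULL,
    category    ENUM('malicious','search-engine','unknown') DEFAULT 'unknown',
    reported_by INT          NOT NULL,      -- registered member who submitted it
    verified    TINYINT(1)   DEFAULT 0,     -- set once a moderator checks it out
    reported_at DATETIME     NOT NULL
)";

// The periodic files for webmasters would just be dumps of queries like:
//   SELECT ip, user_agent FROM bot_reports
//   WHERE category = 'malicious' AND verified = 1;
?>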
Personally I'm a fan of http://www.psychedelix.com/agents.html
digitaleus's original idea might work, but I think having a person as the absolute moderator introduces a single point of failure where there doesn't need to be one...
A distributed client model works much better because you are not relying on individual users to correctly identify the elements required and you can pre-process to a certain degree.
If you add to this by requiring multiple confirmations from separate sources, alongside a trust model based on past performance, you start to diminish the role of the moderator and move that job toward a more administrative one.
The reason I say this is that nothing sucks more than data not being added because the owner/moderator is away or sick, or simply doesn't want to add the particular user-agent you have found.
A nice automated system side-steps this problem but still allows you to maintain a "pure" feed for those who want absolutes while at the same time providing a "dev" feed for those who want the bleeding edge data.
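A minimal sketch of the promotion rule this implies, assuming invented thresholds and field names (nothing here is from an existing system): everything reported lands in the "dev" feed immediately, and an entry only reaches the "pure" feed once enough distinct, sufficiently trusted sources have confirmed it.

<?php
// Hypothetical promotion rule for moving an entry from "dev" to "pure".
define('MIN_CONFIRMATIONS', 3);    // separate sources required
define('MIN_TRUST', 0.5);          // past-performance score, 0..1

function promote_to_pure(array $confirmations)
{
    // $confirmations: one entry per reporting source,
    // e.g. array('source' => 'memberA', 'trust' => 0.8)
    $trusted = array();
    foreach ($confirmations as $c) {
        if ($c['trust'] >= MIN_TRUST) {
            $trusted[$c['source']] = true;   // count each source only once
        }
    }
    return count($trusted) >= MIN_CONFIRMATIONS;
}

// Example: two trusted sources and one low-trust source -- stays in "dev"
var_dump(promote_to_pure(array(
    array('source' => 'memberA', 'trust' => 0.9),
    array('source' => 'memberB', 'trust' => 0.7),
    array('source' => 'memberC', 'trust' => 0.2),
)));  // bool(false)
?>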
Initially, when I started with .htaccess and monitoring my logs, I used the aforementioned links and a few more to acquire any information possible. There may be more besides the four I used to use: [jafsoft.com...] [psychedelix.com...] [robotstxt.org...] [botspot.com...]
These days I'm quite content with the WHOIS facilities and web searches (most of mine are through Google), along with the data I've accumulated. If only I could get it updated and together in one file (it's too large for Notepad).
I have been considering building a realtime malicious spider database. It would probably act like the realtime blackhole lists for open spam relays. If your website was visited by a spider that disobeyed your robots.txt, clicked through to your bot trap, or otherwise demonstrated itself to be malicious, you could report its IP address and user-agent to the database.
Then, each time you get a hit to your website (or whenever a new IP address/user-agent combination comes along), you could query the database to see whether you should serve the page. Queries could be made through something quick, such as a DNS lookup, like some blackhole lists use.
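A minimal sketch of how that DNS-style query could work, borrowing the convention spam blackhole lists use (the zone name here is made up): reverse the octets of the visitor's IP, prepend them to the list's zone, and check whether an A record exists.

<?php
// Hypothetical DNSBL-style check -- the zone name is invented for illustration.
function spider_is_listed($ip, $zone = 'bots.blacklist.example')
{
    // 192.0.2.10 becomes 10.2.0.192.bots.blacklist.example
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    // An A record for that name means the IP is on the list
    return checkdnsrr($reversed . '.' . $zone, 'A');
}

// You would want to cache the answer locally (e.g. per IP/user-agent pair)
// so the remote lookup isn't repeated on every single page request.
if (spider_is_listed($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>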
I think this is impossible. With email and RBLs it's irrelevant whether the mail is delivered $now or $now+10s, but when accessing a web server it's important that the machine can serve pages as fast as possible. There's no time to look up the UA or IP in a remote database.
That's my point of view; I think this plan is doomed to failure.