Search Engine Spider and User Agent Identification Forum

    
Bot Database
digitaleus
msg:396141
7:15 am on Feb 11, 2003 (gmt 0)

Does anyone know of an online bot database?

You know, where anyone (who has registered) can submit a bot that they discover. A moderator could then check it out, possibly with the aid of some automated tools:
* perform a reverse lookup
* visit the IP address over HTTP on port 80
* search Google for the IP address or the bot's name

In addition to a database that could be searched online, you could put out various periodically updated files of use to webmasters. For example:
* malicious bots
* known search-engine spiders

If such a system does not exist, do people think there would be a need for one? Is anyone keen to develop and promote it? I could do the database coding, I'm good with PHP/MySQL, and could possibly host it...
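
A minimal sketch (not from the thread) of the sort of automated check such a moderator tool might run in PHP: reverse-resolve the submitted IP, forward-confirm the hostname, and build a Google query for a human to follow up. The sample IP and the idea of a submission form are assumptions.

<?php
// Sketch only: automated checks a moderator tool might run on a submitted bot IP.
// The sample IP stands in for a value from a (hypothetical) submission form.
$ip = '66.249.66.1';

// 1. Reverse lookup: IP -> hostname (gethostbyaddr() returns the IP unchanged on failure).
$host = gethostbyaddr($ip);

// 2. Forward-confirm: hostname -> IP again; a match suggests the PTR record is genuine.
$confirmed = ($host !== $ip) && (gethostbyname($host) === $ip);

// 3. Build a Google query a human moderator can follow up on.
$searchUrl = 'http://www.google.com/search?q=' . urlencode($ip . ' ' . $host);

echo "Host: $host\n";
echo "Forward-confirmed: " . ($confirmed ? 'yes' : 'no') . "\n";
echo "Search: $searchUrl\n";
?>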

 

ukgimp
msg:396142
8:44 am on Feb 11, 2003 (gmt 0)

You could try spiderhunter.com.

I have heard it is down at the moment, though.

volatilegx
msg:396143
4:50 pm on Feb 11, 2003 (gmt 0)

Here's a bot database created by one of the members here: http://joseluis.pellicer.org/ua/

I believe it's mostly concerned with User Agents... not sure if IP addresses are tracked.

I keep track of IP numbers of search engine spiders at http://www.iplists.com/

Dreamquick
msg:396144
8:01 pm on Feb 11, 2003 (gmt 0)

Personally I'm a fan of http://www.psychedelix.com/agents.html

digitaleus's original idea might work, but I think having a person as the absolute moderator presents a single point of failure where there doesn't need to be one...

A distributed client model works much better because you are not relying on individual users to correctly identify the required elements, and you can pre-process to a certain degree.

If you add to this by requiring multiple confirmations from separate sources, alongside a trust model based on past performance, you start to diminish the role of the moderator and move that job towards more of an administrative role.

The reason I say this is that nothing sucks more than data not being added because the owner/moderator is away or sick, or perhaps just doesn't want to add the particular user-agent you have found.

A nice automated system side-steps this problem but still allows you to maintain a "pure" feed for those who want absolutes, while at the same time providing a "dev" feed for those who want the bleeding-edge data.
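
A rough sketch (not Dreamquick's design) of the confirmation-plus-trust idea in PHP; the reporter names, trust weights, and threshold are all invented for illustration.

<?php
// Sketch only: decide when a reported bot graduates from the "dev" feed
// to the "pure" feed without a human moderator signing it off.

// Reporter => trust weight earned from past correct reports (0.0 - 1.0).
$reporterTrust = array('siteA' => 0.9, 'siteB' => 0.4, 'siteC' => 0.7);

// Independent reports received for one user-agent/IP combination.
$reports = array('siteA', 'siteB', 'siteC');

$sources = array_unique($reports);
$score = 0.0;
foreach ($sources as $reporter) {
    // Unknown reporters get a token weight rather than none at all.
    $score += isset($reporterTrust[$reporter]) ? $reporterTrust[$reporter] : 0.1;
}

// Require at least two separate sources AND enough combined trust.
$promoted = (count($sources) >= 2) && ($score >= 1.5);

echo $promoted ? "promote to pure feed\n" : "keep in dev feed\n";
?>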

- tony

wilderness
msg:396145
11:21 pm on Feb 11, 2003 (gmt 0)

Initially, when I started with htaccess and monitoring my logs, I used the aforementioned links and a few more to acquire any information possible.
There may be more out there, but these are the four I used to use:
[jafsoft.com...]
[psychedelix.com...]
[robotstxt.org...]
[botspot.com...]

These days I'm quite content with the WHOIS facilities and web searches (most of mine are through Google), along with the data I've accumulated. If only I could get it updated and pulled together in one file (it's too large for Notepad).

thermoman
msg:396146
7:08 am on Feb 15, 2003 (gmt 0)

I think a very good database is [joseluis.pellicer.org...]

There you can find:

Unknown UAs
Indexing UAs
Search Engine UAs
Other UAs
Offline Browser UAs
Validator UAs
Email Collector/Spam UAs

and you can generate config files, such as .htaccess files, for blocking whichever sorts of bots you specify.
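
For what it's worth, a generator along those lines could be as simple as the PHP sketch below; the user-agent list here is illustrative, not taken from that site.

<?php
// Sketch only: emit an .htaccess block that denies a chosen list of user-agents.
$badAgents = array('EmailSiphon', 'EmailWolf', 'WebZIP');

$lines = array();
foreach ($badAgents as $ua) {
    $lines[] = 'SetEnvIfNoCase User-Agent "' . $ua . '" bad_bot';
}
$lines[] = 'Order Allow,Deny';
$lines[] = 'Allow from all';
$lines[] = 'Deny from env=bad_bot';

echo implode("\n", $lines) . "\n";
?>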

Greetings from Germany,
Marcel.

amoore
msg:396147
6:18 pm on Feb 15, 2003 (gmt 0)

I have been considering building a realtime malicious spider database. It would probably act like the realtime blackhole lists for open spam relays. If your website was visited by a spider that disobeyed your robots.txt, or clicked through to your bot trap, or otherwise demonstrated that it was malicious, you could report its IP address and user-agent to the database.

Then, each time you get a hit to your website (or each time a new IP address/user-agent combination comes along), you could query the database to see if you should serve the page. Queries could be made through something quick, such as a DNS lookup, like some blackhole lists use.
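
A quick sketch (not amoore's actual outline) of what such a DNS-style query could look like on the webserver side, assuming a hypothetical list zone called spiders.example.org:

<?php
// Sketch only: DNSBL-style check of the visiting IP against a hypothetical zone.
function spider_listed($ip, $zone = 'spiders.example.org') {
    // Reverse the octets, RBL-style: 1.2.3.4 becomes 4.3.2.1.spiders.example.org
    $query = implode('.', array_reverse(explode('.', $ip))) . '.' . $zone;
    // gethostbyname() returns the query string unchanged when there is no A record.
    return gethostbyname($query) !== $query;
}

if (spider_listed($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.0 403 Forbidden');
    exit('Access denied.');
}
?>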

I've outlined it a bit at [gotany.org...]

I can't figure out a few parts of it, though, such as how to prevent the database from being filled with false reports, and how to agree on how much of what behavior gets a spider banned.

If you have any suggestions or input, I'd love to hear it.

thermoman
msg:396148
2:34 pm on Feb 16, 2003 (gmt 0)

Hi,

I think this is impossible, because with email and RBLs it's irrelevant whether the email is delivered $now or $now+10s. But when accessing a webserver, it's important that the machine can serve webpages as fast as possible. There's no time to look up the UA or IP in a remote database.

That's my point of view - I think this plan is doomed to failure.

greetings,
Marcel.
