Forum Moderators: DixonJones

How do I detect good and bad bots here?

check those names

         

silverbytes

4:18 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bright.net caching robot
Inktomi Slurp
Scooter
Almaden.ibm.com/cs/crawler
Ask Jeeves
DTS Agent
Enterprise Search
Fetch API Request
GaisBot
GigaBot
GirafaBot
GoogleBot
ia_archiver
iaea
Indy
Java1.x.x
libwww-perl
LinkWalker
lwp-trivial
Microsoft URL Control
Mozilla
MSIECrawler
MSNBOT
MSNIA
NetResearchServer
obot
Openfind data gatherer
PHP
Pompos (pompos@iliad.fr)
sitecheck.internetseer.com
Stupid email harvester
WiseNut bot

I'm especially worried about LinkWalker and Fetch API Request, not to mention the Stupid email harvester.

jatar_k

6:07 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Well, you can look through this list:

A Close to perfect .htaccess ban list [webmasterworld.com]

Though banning is a personal choice, there are some bots that are, more often than not, up to no good.

It is important to read the documentation about bots and then observe their behaviour. If they are doing harm to your site then ban them, but make sure that the ban won't affect users trying to get to your site.
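One way to observe that behaviour is to count how often each user agent shows up in your raw access log. Here is a minimal sketch in Python, assuming the log is in Apache's combined format (the user agent is the last quoted field). The function name `count_user_agents` and the sample log lines are made up for illustration; feed it the lines of your own log file instead.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:  # lines that do not fit the format are skipped
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not taken from a real log.
sample = [
    '10.0.0.1 - - [19/Sep/2003:16:18:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '10.0.0.2 - - [19/Sep/2003:16:19:00 +0000] "GET /a.htm HTTP/1.0" 200 734 "-" "LinkWalker"',
    '10.0.0.1 - - [19/Sep/2003:16:20:00 +0000] "GET /b.htm HTTP/1.0" 404 209 "-" "Googlebot/2.1"',
]

# Print the agents from most to least active.
for agent, hits in count_user_agents(sample).most_common():
    print(hits, agent)
```

An agent that requests far more pages than any search engine, or that ignores robots.txt, is a better candidate for a ban than one that merely has an unfamiliar name.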

silverbytes

11:13 pm on Sep 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you!
Related to bots, and Googlebot specifically:
I see in my monthly report that Googlebot accessed
some resources that are not on my server anymore...

How could that be possible?

/mydocumentthatisnotintheserver.htm is listed as a hit, but some other new documents that I really did upload aren't...

Any clue?

mack

11:24 pm on Sep 20, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google does this from time to time because it knows the pages were there the last time it visited. It will download robots.txt (if present) and, if the files are not disallowed, it will then attempt to re-index them. Google has no way of knowing the pages are no longer there until your server returns a 404.
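If you want to see what a given robots.txt permits, here is a rough sketch using Python's standard `urllib.robotparser` module. The helper `is_allowed`, the two robots.txt bodies and the example.com URLs are only illustrative assumptions, and this shows how Python's parser reads the rules, which is not guaranteed to match every crawler's interpretation.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents for illustration.
ALLOW_ALL = "User-agent: *\nDisallow:\n"        # an empty Disallow permits everything
BLOCK_OLD = "User-agent: *\nDisallow: /old/\n"  # keeps all bots out of /old/

print(is_allowed(ALLOW_ALL, "Googlebot", "http://www.example.com/page.htm"))
print(is_allowed(BLOCK_OLD, "Googlebot", "http://www.example.com/old/page.htm"))
print(is_allowed(BLOCK_OLD, "Googlebot", "http://www.example.com/new.htm"))
```

Note that robots.txt only says what a bot may request. It does not tell the bot that a page has been removed; only the 404 from your server does that.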

Hope this is of some help.

Mack.

silverbytes

1:31 am on Sep 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, thanks, but why is it not crawling my new HTML documents? My robots.txt allows all bots in all directories...