Forum Moderators: open
OK - ad rem:
I am constructing a script mechanism which may let me get rid of bad spiders/harvesters (and even some crackers' pre-attack probes) from a website and server. It will consist of independent modules (with an option to cooperate) in PHP, PHP and MySQL, .htaccess, JavaScript and Java (which doesn't mean I am an expert, or even average, in all these languages - my professional field is rather management of online services).
I have actually implemented some of the mechanisms described above on a site I am currently working on. It looks like the scripts - even in such a premature form - save a lot of hours of work and trouble involved in manual log analysis.
I will appreciate your opinions about which robots you consider good ones and which are bad (for positive/negative verification mechanisms).
And are all good robots worth taking care of? (A local/country robot for an unknown search site, in my opinion, isn't worth even the bandwidth it will consume visiting my site.)
In my list of good robots there are:
Google
Altavista
Msn
Yahoo
Bad robots (or/and unwelcome):
ia_archiver
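For the .htaccess module, a block like the following could refuse an unwelcome User-Agent such as ia_archiver. This is only a minimal sketch using mod_rewrite, not the poster's actual implementation; extend the pattern with other agents from your own blocklist:

```apacheconf
# Minimal sketch: deny requests whose User-Agent matches a blocklist.
# Requires mod_rewrite to be enabled on the server.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]
```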
I would appreciate your lists. If you could give me examples from your logs like (note that this is only an example):
Altavista
AltaVista Intranet V2.0 AVS EVAL search@freeit.com
AltaVista Intranet V2.0 Compaq Altavista Eval sveand@altavista.net
AltaVista Intranet V2.0 evreka.com crawler@evreka.com
AltaVista V2.0B crawler@evreka.com
AVSearch-3.0(AltaVista/AVC)
it would be more than welcome.
The second thing I am working on now is a mechanism for clearing logs of spoofed IP addresses, which should help in reading the logs and make statistical methods applied to them more reliable.
(I think I should write about it in Tracking and Logging.)
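One simple heuristic for such log cleaning (a sketch of my own, not the poster's mechanism; the function names and reserved ranges below are my choices) is to drop lines whose client field is a private or reserved address, since those should never appear as public visitors and often indicate spoofing:

```javascript
// Convert a dotted-quad IPv4 address to an integer for range comparison.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + parseInt(octet, 10), 0);
}

// Private/reserved ranges that should not show up as real client IPs.
const RESERVED = [
  ['10.0.0.0', '10.255.255.255'],     // RFC 1918 private
  ['172.16.0.0', '172.31.255.255'],   // RFC 1918 private
  ['192.168.0.0', '192.168.255.255'], // RFC 1918 private
  ['127.0.0.0', '127.255.255.255'],   // loopback
  ['0.0.0.0', '0.255.255.255'],       // "this" network
];

function isReservedIp(ip) {
  const n = ipToInt(ip);
  return RESERVED.some(([lo, hi]) => n >= ipToInt(lo) && n <= ipToInt(hi));
}

// Keep only log lines (common log format: IP is the first field)
// whose client address is a syntactically valid, routable IPv4 address.
function cleanLog(lines) {
  return lines.filter(line => {
    const ip = line.split(' ')[0];
    return /^\d+\.\d+\.\d+\.\d+$/.test(ip) && !isReservedIp(ip);
  });
}
```

This only catches the obvious cases; a spoofed address taken from a routable range would still pass, so it complements rather than replaces statistical checks.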
Thank you.
Sorry for my language mistakes - English isn't my native language.
----------------------------
Don't tell me I am paranoid - I know I am.
Listing User Agents from log lines would have us all listing till eternity, as they are ever changing. It would also make this thread rather large and repetitive with lines which already exist.
Here's a start for you.
http://www.pgts.com.au/pgtsj/pgtsj0208d.html
If you do a search at Google on "User Agents" (and perhaps add "strings"), the results will eventually supply you with a page which has many, many UAs listed. I have one saved someplace and just cannot recall what I saved it under.
BTW, what's good and bad is entirely a decision of each webmaster as to the traffic desired for their site.
EX:
I have most of APNIC and much of RIPE denied access to my sites; HOWEVER, that is not applicable for most webmasters.
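Denying whole registry regions like that can be done in .htaccess with address ranges. A minimal sketch follows; the CIDR blocks here are documentation placeholders, not real APNIC/RIPE allocations - look up the current ranges at the registries before using anything like this:

```apacheconf
# Sketch only: block placeholder ranges (Apache 2.2 syntax).
# Replace with actual allocations from the relevant registry.
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24
Deny from 198.51.100.0/24
```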