Forum Moderators: open


Does everyone use bespoke scripts for spider identification?


scooterdude

2:42 pm on Sep 13, 2011 (gmt 0)

10+ Year Member



Hi all, many moons ago I wrote a script for this, using a database I found in one of the threads here.

I've since abandoned it, but I'm considering restoring it since I now find that spider bandwidth use is an issue.

Or are there any scripts, free or otherwise, like AWStats? AWStats isn't really into spiders though :)

I don't want to block any of the big three, so I need to start identifying these bots.


AWStats is showing me an unknown bot consuming 1 GB of bandwidth in 12 days on a site that currently has low traffic from all sources. I don't see Bing at all, so I wonder whether my AWStats install recognises bingbot at all.
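When a stats package can't name the bot, the raw access log usually can. As a rough sketch (not anyone's actual script from this thread), here is a minimal Python pass over a combined-format Apache log that totals response bytes per user-agent string, so the 1 GB consumer shows up by name. The sample log lines and bot names are made up for illustration.

```python
import re
from collections import defaultdict

# Combined Log Format: ip - - [date] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

def bandwidth_by_agent(lines):
    """Sum response bytes per user-agent string across log lines."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("bytes") != "-":
            totals[m.group("ua")] += int(m.group("bytes"))
    return totals

# Illustrative sample lines; in practice you would read your access_log file.
sample = [
    '203.0.113.5 - - [13/Sep/2011:14:00:00 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '203.0.113.5 - - [13/Sep/2011:14:00:05 +0000] "GET /page HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [13/Sep/2011:14:01:00 +0000] "GET / HTTP/1.1" 200 4096 '
    '"-" "SomeUnknownBot/1.0"',
]

# Biggest consumers first.
for ua, total in sorted(bandwidth_by_agent(sample).items(), key=lambda kv: -kv[1]):
    print(f"{total:>8}  {ua}")
```

Cross-checking the heavy user-agents against their source IPs then tells you whether they are search engines worth keeping or candidates for blocking.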

wilderness

6:50 pm on Sep 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IP?
Raw Log line?

dstiles

8:52 pm on Sep 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AWStats: you probably see msn instead of bing - you need to edit the config files.

Robots.txt will only "control" legit bots that obey robots.txt. Most bad bots will not even read it. So yes, you need some kind of script, either a custom one or (if available) .htaccess. There is a lot about the latter around this forum and (I understand) in the Apache forum.
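For reference, a minimal sketch of the .htaccess approach in Apache 2.2-style syntax. The user-agent substrings here are placeholders, not a vetted blocklist; match substrings against the bad UAs you actually see in your own logs.

```apache
# Flag requests whose User-Agent contains a known-bad substring
# (example names only - substitute strings from your own logs)
SetEnvIfNoCase User-Agent "BadExampleBot"   bad_bot
SetEnvIfNoCase User-Agent "SiteGrabberPro"  bad_bot

# Apache 2.2 access control: allow everyone except flagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Flagged requests get a 403 before they reach your pages, which saves the bandwidth even when the bot ignores robots.txt.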

topr8

9:33 pm on Sep 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... simple answer: there is no freeware available (that works).

I suspect that's partly because anyone could bypass it easily once they saw how it worked.

The primer pinned to the top of this forum is a good starting point; also, recently...

[webmasterworld.com...] and other threads in the library are useful

I use a custom script as well as blocking at the htaccess/config level. I don't stop everything, but I do stop the bulk of it... catching and blocking everything is a worthwhile obsession, but one I don't have time for.

>> many moons ago I had a script I wrote for this

Bad bots are way more sophisticated than they once were.

dstiles

7:20 pm on Sep 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Agreed. You have to monitor a LOT more than just the UA - and even the UAs can sometimes LOOK valid unless you delve deeply.

As I've said before, a good start is to block all server farms, aka datacentres.
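One way to avoid blocking the big three by accident while doing this: both Google and Bing publish a reverse-then-forward DNS check for verifying their crawlers. A rough Python sketch, assuming the hostname suffixes those engines documented (verify them against the engines' own pages before relying on this):

```python
import socket

# Hostname suffixes the major engines publish for crawler verification
# (assumed here; confirm against each engine's documentation).
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def hostname_matches(hostname, bot):
    """True if a reverse-DNS hostname belongs to the claimed crawler."""
    return hostname.endswith(CRAWLER_DOMAINS.get(bot, ()))

def is_verified_crawler(ip, bot):
    """Reverse-resolve the IP, check the hostname suffix, then
    forward-resolve the hostname and confirm it maps back to the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False                      # no reverse DNS: not a verified crawler
    if not hostname_matches(hostname, bot):
        return False                      # claims to be e.g. googlebot but isn't
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

Anything claiming a big-three UA that fails this check is fair game for a datacentre block; cache the results, since two DNS lookups per request is far too slow to do inline.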

lucy24

9:47 pm on Sep 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Read the logs. Read the logs. Read the logs.

I've followed mine closely for the last 6 months or so and have got it down to where some days have no unfamiliar robots (that is, ones that are neither 403'd at the gate nor "Yeah, sure, go on in"). Maybe one or two stopping by to glance at the index page for reasons of their own, and the occasional Chinese whack-a-mole.

It helps of course to be small enough that I can actually read the entire log after all auto-deleting is out of the way. Robots don't seem to care about size, except tangentially. A big site is statistically more likely to have money associated with it. But you're not going to ignore the $10 bill someone dropped on the sidewalk just because you're looking for a bank to rob.