I'm still new to this side of web design - is there any reason to co-operate with any unidentifiable UA? i.e. should I keep a list of specific UAs that are 'goodies', and assume anyone else is 'bad'?
Cheers, Robin
Generally, not all robots are bad, and you need to coexist with them somehow, since they're what gets you into the search engines' indexes.
Ofc, there are things a "good robot" should do, which are imo:
1) retrieve, analyse and comply with robots.txt and the robots meta tag
2) in the UA, give some form of identification or a way to send feedback (email addy, website)
3) when retrieving pages, it shouldn't overload your server or spam you with requests. I consider it bad practice when a robot makes more than about 3 requests per 5 seconds.
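Point 3 can be checked against your own logs. Here is a minimal Python sketch, assuming you have already pulled one robot's request times out of the log as seconds. The function name `exceeds_rate` and its parameters are made up for illustration; the defaults just encode the "3 requests per 5 seconds" rule of thumb above.

```python
from collections import deque

def exceeds_rate(timestamps, max_requests=3, window_seconds=5.0):
    """Return True if more than max_requests fall inside any window_seconds span.

    timestamps: request times in seconds (hypothetical input, e.g. parsed
    from a server log for a single user agent).
    """
    recent = deque()
    for t in sorted(timestamps):
        recent.append(t)
        # Drop requests that are window_seconds or more older than this one.
        while t - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) > max_requests:
            return True
    return False

# Ten requests spread evenly over two hours stay far below the threshold.
slow_robot = [i * 720 for i in range(10)]
print(exceeds_rate(slow_robot))
```

A sliding window is only one way to read "3 requests per 5 seconds"; counting per fixed 5-second bucket would be simpler but can miss bursts that straddle a bucket boundary.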
To come back to ScoutAbout, which I also have in my logfiles:
- 10 requests over a period of roughly 2hrs
- has not retrieved robots.txt so far, but might still do so given the pace it runs at.
So, no, I would not consider ScoutAbout a bad robot.
Skirril
Sure... it MIGHT download it later. But that's not the point! That's the *first* thing it's supposed to download; it's the robot's way of asking for permission to spider your site.
If it's not grabbing robots.txt then I say it's a bad robot and the authors should get an email protesting it. And it would not be an idle threat to warn them that it will be banned from our servers until it complies.
Bolot
138.15.164.9 - - [17/Jul/2001:11:46:04 -0400] "GET / HTTP/1.1" 200 5826 "-" "ScoutAbout"
No, it did not request robots.txt and there is no information about where to email. It only requested one page, so it's not particularly invasive (not yet, anyway), but it still does not meet Bolot's criteria as a friendly bot.
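For anyone who wants to run the same check over a whole log, here is a rough Python sketch. It assumes the log is in the Apache "combined" format, which is what the line above looks like. The regex and the helper names `parse_line` and `agents_skipping_robots` are my own, not part of any library.

```python
import re

# Pattern for the Apache "combined" log format, inferred from the sample line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def agents_skipping_robots(lines):
    """Return the set of user agents that never requested /robots.txt."""
    seen, fetched = set(), set()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines in some other format
        seen.add(entry["agent"])
        if entry["path"] == "/robots.txt":
            fetched.add(entry["agent"])
    return seen - fetched

sample = ('138.15.164.9 - - [17/Jul/2001:11:46:04 -0400] '
          '"GET / HTTP/1.1" 200 5826 "-" "ScoutAbout"')
print(parse_line(sample)["agent"], parse_line(sample)["path"])
print(agents_skipping_robots([sample]))
```

Note that this groups by user agent only. A robot that fetches robots.txt under one name and crawls under another would still show up as skipping it.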
I know it's probably irrelevant, and I'm probably paranoid, but the last part of the URL has me wondering - nec.com. I have a "Ready" computer put out by NEC. Naahh, couldn't be any connection.
[proactiveresearch.com...]
redirects to
[researchrepublic.com...]
I believe that ScoutAbout and Lachesis may be working together - at least the ones that come from zeus.nj.nec.com and hades.nj.nec.com.
ScoutAbout requests robots.txt, and sometimes "/", and then Lachesis comes along later and requests "/" and other pages, but not robots.txt.
That's what I'm seeing anyway.
I had given Lachesis the boot with a 403, but it's so brain-dead that it just comes back and re-issues the request a few minutes later. I have added both ScoutAbout and Lachesis to robots.txt, and I'll wait to see if they collectively obey it. If not, I'll put the 403 back on (or redirect them back to themselves).
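Before waiting on the robots, it may be worth confirming that the robots.txt entries say what you intend. Below is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt text is an assumed example of what such entries could look like, not a copy of the actual file, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules: shut out the two robots, allow everyone else.
ROBOTS_TXT = """\
User-agent: ScoutAbout
Disallow: /

User-agent: Lachesis
Disallow: /

User-agent: *
Disallow:
"""

def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("ScoutAbout", "Lachesis", "SomeOtherBot"):
    print(agent, is_allowed(agent, "/"))
```

This only tells you what a compliant robot should do with those rules. Whether ScoutAbout and Lachesis actually comply still has to be read from the logs.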
Anybody else have more data to add to the pattern, or to break the pattern I'm seeing?
Jim