I don't know how it behaved before.
So, let me just say that Jetbot/1.0 does NOT obey robots.txt.
It's been running far too hard for my liking, so I thought I'd ban it via robots.txt to see if it would back off.
Thwack!
If it'da read and followed robots.txt they'da NOT hit the trap this time....
64.71.144.** - - [17/Oct/2004:19:00:13 -0700] "GET /robots.txt HTTP/1.0" 200 1705 "-" "Jetbot/1.0"
64.71.144.** - - [17/Oct/2004:19:00:14 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh HTTP/1.0" 403 480 "-" "Jetbot/1.0"
...and this time.
64.71.144.** - - [17/Oct/2004:19:02:23 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh-2 HTTP/1.0" 403 480 "-" "Jetbot/1.0"
At this rate, Jetbot will trip all of my traps instead of simply NOT crawling anymore.
Any ideas what I can do here?
Thanks,
Pendanticist.
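For anyone wanting to check their own logs for the same pattern, here is a rough Python sketch. It assumes Apache combined log format (as in the lines quoted above) and that the bot was banned outright, so the disallowed prefix is simply "/". The function name `robots_violations` and the `SAMPLE_LOG` list are made up for illustration; this is a starting point, not a finished tool.

```python
import re

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def robots_violations(lines, disallowed_prefixes):
    """Yield (agent, path) for every request to a disallowed path made by
    an agent that had already fetched /robots.txt earlier in the log."""
    prefixes = tuple(disallowed_prefixes)
    fetched_robots = set()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in combined log format
        agent, path = m["agent"], m["path"]
        if path == "/robots.txt":
            fetched_robots.add(agent)
        elif agent in fetched_robots and path.startswith(prefixes):
            yield agent, path


# The three log lines quoted in the post above.
SAMPLE_LOG = [
    '64.71.144.** - - [17/Oct/2004:19:00:13 -0700] "GET /robots.txt HTTP/1.0" 200 1705 "-" "Jetbot/1.0"',
    '64.71.144.** - - [17/Oct/2004:19:00:14 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh HTTP/1.0" 403 480 "-" "Jetbot/1.0"',
    '64.71.144.** - - [17/Oct/2004:19:02:23 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh-2 HTTP/1.0" 403 480 "-" "Jetbot/1.0"',
]

if __name__ == "__main__":
    # Assumption: the bot is banned site-wide, i.e. "Disallow: /".
    for agent, path in robots_violations(SAMPLE_LOG, ["/"]):
        print(f"{agent} ignored robots.txt and requested {path}")
```

Keep in mind that a well-behaved bot may cache robots.txt for a while, so a hit shortly after you change the file is not proof by itself. A request one second after the bot fetched the file, as here, is harder to excuse.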
I doubt it will deliver any traffic for a very long time, so why not wait until it's big enough for Brett to add a forum for it, and then allow it back into your site. Hopefully by then the developers may have fixed this very broken bot.
"We sent your block request to our spider admin...."
Problem is, they interpreted my query as to why the bot does NOT respect robots.txt as a request for them to physically block my domain from future crawls.
They never answered my question: "Why does Jetbot NOT respect robots.txt?", although the response had a small FAQ where it is stated that they DO respect robots.txt.
Judging by the way they handled a simple request, I doubt we'll be seeing any sub-forum for this puppy anytime soon.
Bad enough that bots have begun re-requesting pages I've 301'd well over a year ago. Duh.....
Naturally, since those 301s were taken down roughly six months ago, you'd think the bot would have compiled sufficient data from my site to run a fresh crawl effectively hitting on all 200s.
Anyway, now that they've blocked my domain, things should quiet down some...
<aside>
The only bot that works worth a pile-O-beans is Jeeves / Teoma. Smooth running and has NOT looked for an old file in months.
</aside>
Pendanticist.
User-agent: JetBot
Disallow: /
"The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored." (emphasis added)
Further, this construct appears in the example code in Martijn Koster's Internet Draft of a Method for Web Robots Control [robotstxt.org].
For those just tuning in, this is valid according to the Standard, but not universally respected:
User-agent: Slurp
User-agent: Googlebot
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats

User-agent: Slurp
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /stats
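If you want to see how a parser written against the original Standard reads the two layouts, Python's standard-library `urllib.robotparser` is a handy reference point. The sketch below feeds it both versions and asks whether a Jetbot-style user-agent may fetch a path; the helper name `allowed` and the test paths are made up for illustration, and it tells you nothing about how Jetbot's own parser behaves.

```python
from urllib.robotparser import RobotFileParser

# Stacked form: three User-agent lines share one set of Disallow lines.
STACKED = """\
User-agent: Slurp
User-agent: Googlebot
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats
"""

# JetBot given its own record; the other two remain stacked.
SEPARATE = """\
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats

User-agent: Slurp
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /stats
"""


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for name, text in (("stacked", STACKED), ("separate", SEPARATE)):
        for path in ("/cgi-bin/trap.cgi", "/stats", "/index.html"):
            print(name, path, allowed(text, "Jetbot/1.0", path))
```

Both layouts give the same answers with this parser, which is what the Standard calls for. A bot that treats the two differently is where the trouble starts.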
Just for the record, the second version of the Standard for Robots Exclusion, promulgated by Charles Kollar et al. in the Robot Exclusion Standard Revisited [kollar.com], says that the multiple-user-agent-lines-per-record construct should not be used, and that the following method should be used instead:
User-agent: Slurp Googlebot JetBot
Disallow: /cgi-bin/
Disallow: /stats
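A word of caution on the space-separated form: a parser written against the original Standard will not necessarily split that line into three names. As one data point, Python's `urllib.robotparser` appears to treat the whole value as a single agent name, so none of the three bots match the record. The sketch below shows this; the test path is made up, and other parsers may well behave differently.

```python
from urllib.robotparser import RobotFileParser

# The space-separated form from the revised proposal.
SPACE_SEPARATED = """\
User-agent: Slurp Googlebot JetBot
Disallow: /cgi-bin/
Disallow: /stats
"""

parser = RobotFileParser()
parser.parse(SPACE_SEPARATED.splitlines())

# With this parser the record matches no individual bot,
# so nothing is actually disallowed for JetBot.
jetbot_blocked = not parser.can_fetch("JetBot", "/cgi-bin/test.cgi")

if __name__ == "__main__":
    print("JetBot blocked from /cgi-bin/:", jetbot_blocked)
```

So if you go this route, test it against the bots you care about before relying on it.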
I would also like to acknowledge Sean Conner's work [conman.org] on a regular-expressions-based Standard. The use of regular expressions would have significantly improved the ability to specify disallowed resources in a concise, compact way. However, it appears that the "complexity" of regular expressions was thought to be too much for the average Webmaster, and the idea was never adopted (and judging by the errors we see in robots.txt files using simple prefix-matching, they may have been right).
Jim
To use your term, some second-tier robots don't understand robots.txt files where the User-agent lines are "stacked". Because they consider it to be an error, they will either go away or disregard the Disallows and spider the entire site. However, the original Standard for Robot Exclusion specifically allows for this.
I am suggesting that as a possible cause for the original subject of this thread.
Jim
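If stacked User-agent lines are the suspect, one defensive workaround is to "un-stack" the file so every record names exactly one robot. The Python sketch below does that mechanically. The function name `unstack_records` is hypothetical, and the script assumes a simple file made of User-agent lines followed by rule lines, with "#" comments; treat it as a starting point rather than a complete robots.txt rewriter.

```python
def unstack_records(text):
    """Rewrite robots.txt text so each record has exactly one User-agent line.

    A record with several stacked User-agent lines is repeated once per
    agent, each copy carrying the same rule lines. Comments are dropped.
    """
    records = []            # list of (agents, rule_lines) pairs
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        field = line.partition(":")[0].strip().lower()
        if field == "user-agent":
            if rules:
                # A rule line has been seen, so this starts a new record.
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(line.partition(":")[2].strip())
        elif agents:
            rules.append(line)                # keep the rule line as written
    if agents:
        records.append((agents, rules))

    blocks = []
    for rec_agents, rec_rules in records:
        for agent in rec_agents:
            blocks.append("\n".join([f"User-agent: {agent}", *rec_rules]))
    return "\n\n".join(blocks) + "\n"


STACKED = """\
User-agent: Slurp
User-agent: Googlebot
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats
"""

if __name__ == "__main__":
    print(unstack_records(STACKED), end="")
```

The output is longer and repetitive, but it sidesteps any robot that chokes on stacked lines while staying valid under the original Standard.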