Jetbot/1.0

NOT obeying robots.txt

         

pendanticist

2:04 am on Oct 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In this thread [webmasterworld.com] the owners say this bot behaves just as its predecessor did.

I don't know how it behaved before.

So, let me just say that Jetbot/1.0 does NOT obey robots.txt.

It's been crawling far too hard for my liking, so I thought I'd ban it via robots.txt to see if it would back off.

Thwack!

pendanticist

2:28 am on Oct 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Tee, hee, hee

If it'da read and followed robots.txt they'da NOT hit the trap this time....

64.71.144.** - - [17/Oct/2004:19:00:13 -0700] "GET /robots.txt HTTP/1.0" 200 1705 "-" "Jetbot/1.0"
64.71.144.** - - [17/Oct/2004:19:00:14 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh HTTP/1.0" 403 480 "-" "Jetbot/1.0"

...and this time.

64.71.144.** - - [17/Oct/2004:19:02:23 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh-2 HTTP/1.0" 403 480 "-" "Jetbot/1.0"

At this rate, Jetbot will trip all of my traps instead of simply NOT crawling anymore.
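For anyone wanting to check their own logs for the same pattern, here is a rough sketch that flags any client which fetched /robots.txt and then requested a disallowed path anyway. The field layout is assumed from the log lines quoted above, and both the `find_violators` helper and the `/blahblah/` prefix are illustrative assumptions, not part of any real setup.

```python
import re

# Combined Log Format pattern; layout assumed from the log lines quoted above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def find_violators(log_lines, disallowed_prefixes):
    """Return user agents that fetched /robots.txt and later hit a disallowed path."""
    read_robots = set()   # (ip, agent) pairs that have requested robots.txt
    violators = set()
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        key = (match["ip"], match["agent"])
        path = match["path"]
        if path == "/robots.txt":
            read_robots.add(key)
        elif key in read_robots and any(path.startswith(p) for p in disallowed_prefixes):
            violators.add(match["agent"])
    return violators


sample = [
    '64.71.144.** - - [17/Oct/2004:19:00:13 -0700] "GET /robots.txt HTTP/1.0" 200 1705 "-" "Jetbot/1.0"',
    '64.71.144.** - - [17/Oct/2004:19:00:14 -0700] "GET /blahblah/trapverbiage.cgi?id=uh,ohhh HTTP/1.0" 403 480 "-" "Jetbot/1.0"',
]

# "/blahblah/" is assumed to be a Disallow prefix in robots.txt.
violators = find_violators(sample, ["/blahblah/"])
print(violators)
```

This only proves the bot fetched robots.txt before the request; it cannot tell whether the bot parsed the file and ignored it or failed to parse it at all.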

Any ideas what I can do here?

Thanks,
Pendanticist.

uncle_bob

10:06 am on Oct 18, 2004 (gmt 0)

10+ Year Member



I'd recommend banning Jetbot's entire IP range. That's what I did after I saw it ignore my robots.txt.

I doubt it will deliver any traffic for a very long time, so why not wait until it's big enough for Brett to add a forum for it, and then allow it back into your site? Hopefully by then the developers will have fixed this very broken bot.

pendanticist

1:22 am on Oct 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I submitted my complaint via the link provided and they did get back to me.

"We sent your block request to our spider admin...."

Problem is, they interpreted my query as to why the bot does NOT respect robots.txt as a request for them to physically block my domain from future crawls.

They never answered my question: "Why does Jetbot NOT respect robots.txt?", although the response had a small FAQ where it is stated that they DO respect robots.txt.

Judging by the way they handled a simple request, I doubt we'll be seeing any sub-forum for this puppy anytime soon.

Bad enough that bots have begun re-requesting pages I 301'd well over a year ago. Duh.....

Naturally, since those 301s were taken down roughly six months ago, you'd think the bot would have compiled sufficient data from my site to run a fresh crawl effectively hitting on all 200s.

Anyway, now that they've blocked my domain, things should quiet down some...

<aside>
The only bot that works worth a pile-O-beans is Jeeves / Teoma. Smooth running and has NOT looked for an old file in months.
</aside>

Pendanticist.

jdMorgan

3:56 am on Oct 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's possible that JetBot has a problem. Either they require the exact User-agent string they describe on their JetBot information page [jeteye.com], that is:
User-agent: JetBot
Disallow: /


or perhaps like many other robots, they cannot handle multiple User-agent lines per record, as described in the original Standard for Robots Exclusion [robotstxt.org]. That is, the robots from Google, Yahoo, and the other 'big players' fully conform to the Standard, but many smaller ones don't. The standard clearly states that multiple User-agent lines may appear in a robots.txt record, and that the Disallow lines that follow them should be interpreted as a policy that applies to all listed User-agents. From the original Standard:

"The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored." (emphasis added)

Further, this construct appears in the example code in Martijn Koster's Internet Draft of a Method for Web Robots Control [robotstxt.org].

For those just tuning in, this is valid according to the Standard, but not universally respected:

User-agent: Slurp
User-agent: Googlebot
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats


Therefore, it might work better to try:
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats

User-agent: Slurp
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /stats


This lack of support for multiple user agent lines per record is very common. Even a well-known robot that now has its own WebmasterWorld forum had this problem when it first rolled out. They fixed it quickly -- possibly because of an e-mail I sent them.
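As a point of comparison, here is a small sketch of how a parser that follows the original Standard treats the stacked record. It uses Python's standard-library `urllib.robotparser`; the robot names and paths come from the examples above, while `/cgi-bin/trap.cgi` and `OtherBot` are made up for illustration. It says nothing about how JetBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# The stacked record from the example above: three User-agent lines, one policy.
stacked = """\
User-agent: Slurp
User-agent: Googlebot
User-agent: JetBot
Disallow: /cgi-bin/
Disallow: /stats
"""

parser = RobotFileParser()
parser.parse(stacked.splitlines())

# Every robot named in the record should be kept out of /cgi-bin/.
for agent in ("Slurp", "Googlebot", "Jetbot/1.0"):
    print(agent, parser.can_fetch(agent, "/cgi-bin/trap.cgi"))

# A robot not named in the record (and with no "*" record) is unrestricted.
print("OtherBot", parser.can_fetch("OtherBot", "/cgi-bin/trap.cgi"))
```

A robot that only remembers the last User-agent line before the Disallow lines would apply this policy to JetBot alone, and one that only remembers the first would apply it to Slurp alone, which is why splitting the record can help.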

Just for the record, the second version of the Standard for Robots Exclusion, promulgated by Charles Kollar et al. in the Robot Exclusion Standard Revisited [kollar.com], says that the multiple-user-agent-lines-per-record construct should not be used, and that the following method should be used instead:

User-agent: Slurp Googlebot JetBot
Disallow: /cgi-bin/
Disallow: /stats


There are several robots.txt syntax and robot behaviour differences evident between the two documents cited above. However, in the face of two conflicting "standards," the wise search engine spider operator will support both methods.
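To illustrate what "supporting both methods" might look like on the spider side, here is a minimal sketch of a tolerant parser that accepts stacked User-agent lines as well as several names on one line. The function names `parse_robots` and `is_allowed` are hypothetical, and the code handles only User-agent and Disallow with simple prefix matching; it is not a complete implementation of either document.

```python
def parse_robots(text):
    """Map each lower-cased robot name to its list of Disallow prefixes."""
    rules = {}
    agents = []        # robot names in the record currently being read
    seen_rule = False  # True once the current record has a Disallow line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            agents, seen_rule = [], False    # a blank line ends the record
            continue
        if ":" not in line:
            continue                         # unrecognised lines are ignored
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:                    # new record without a blank line
                agents, seen_rule = [], False
            # Accept one name per line (stacked) or several names on one line.
            agents.extend(name.lower() for name in value.split())
        elif field == "disallow" and agents:
            seen_rule = True
            for name in agents:
                rules.setdefault(name, []).append(value)
    return rules


def is_allowed(rules, agent, path):
    """Prefix-match path against the rules for agent, falling back to '*'."""
    name = agent.split("/")[0].lower()
    prefixes = rules.get(name, rules.get("*", []))
    return not any(p and path.startswith(p) for p in prefixes)


stacked = (
    "User-agent: Slurp\nUser-agent: Googlebot\nUser-agent: JetBot\n"
    "Disallow: /cgi-bin/\nDisallow: /stats\n"
)
flat = "User-agent: Slurp Googlebot JetBot\nDisallow: /cgi-bin/\nDisallow: /stats\n"

# Both spellings should yield the same policy.
print(parse_robots(stacked) == parse_robots(flat))
print(is_allowed(parse_robots(flat), "Jetbot/1.0", "/cgi-bin/test.cgi"))
```

Splitting the User-agent value on whitespace is the trade-off here: it accepts the space-separated form, but a robot name that itself contains spaces would be broken into separate tokens.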

I would also like to acknowledge Sean Conner's work [conman.org] on a regular-expressions-based Standard that was apparently never adopted. The use of regular expressions would have significantly improved the ability to specify disallowed resources in a concise, compact way. However, it appears that the "complexity" of regular expressions was thought to be too much for the average Webmaster (and judging by the errors we see in robots.txt files using simple prefix-matching, they may have been right).

Jim

Wizcrafts

7:05 pm on Oct 20, 2004 (gmt 0)

10+ Year Member



Jim;
I have my Robots.txt Disallow directives stacked under a universal listing, like this:

User-agent: *
Disallow: /bbb.html
Disallow: /contact-info.html
Disallow: /emails.html

Is this still regarded as valid and understood by modern spiders?

Wiz

jdMorgan

10:03 pm on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, of course -- that's standard practice. But that is not the issue here.

To use your term, some second-tier robots don't understand robots.txt files where the User-agent lines are "stacked." Because they consider the stacking to be an error, they will either go away or disregard the Disallows and spider the entire site. However, the original Standard for Robots Exclusion specifically allows for this.

I am suggesting that as a possible cause for the original subject of this thread.

Jim