Forum Moderators: open


lnbot

FAST Enterprise Crawler 6 used by LexisNexis

         

caribguy

3:04 pm on Nov 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does not understand:

User-agent: *
Disallow: /*?not_to


208.68.138.nnn - - [10/Nov/2008:04:36:34 -0600] "GET /robots.txt HTTP/1.1" 200 1144 "-" "FAST Enterprise Crawler 6 used by LexisNexis (lnbot@lexisnexis.com)"
208.68.138.nnn - - [10/Nov/2008:04:38:43 -0600] "GET / HTTP/1.1" 200 10493 "http://www.referrer.com/page.html" "FAST Enterprise Crawler 6 used by LexisNexis (lnbot@lexisnexis.com)"
208.68.138.nnn - - [10/Nov/2008:04:39:13 -0600] "GET /widgets HTTP/1.1" 200 10493 "http://www.example.com/" "FAST Enterprise Crawler 6 used by LexisNexis (lnbot@lexisnexis.com)"
208.68.138.nnn - - [10/Nov/2008:04:54:52 -0600] "GET /widgets/?not_to_be_crawled=10 HTTP/1.1" 200 29977 "http://www.example.com/" "FAST Enterprise Crawler 6 used by LexisNexis (lnbot@lexisnexis.com)"

jdMorgan

2:08 am on Nov 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do they claim to support the query-string/wild-card extension to the robots.txt Standard? That extension is semi-proprietary, and not supported by all robots. It is not part of the original Standard for Robot Exclusion.

You have to be very careful not to present bots with extended robots.txt syntax unless they explicitly claim to support it; if they make no such claim, then you can't expect them to honour it.

Such things as Crawl-Delay:, Sitemap:, query-string awareness, and wild-card URL-path matching are not consistently supported by all robots.
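
For illustration, those extended directives look something like this - the values below are placeholders, and only crawlers that explicitly support each extension will act on it:

User-agent: *
Crawl-delay: 10             # extension: request-pacing hint
Disallow: /*?sessionid=     # extension: wild-card / query-string matching

Sitemap: http://www.example.com/sitemap.xml   # extension: sitemap location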

If FAST/lnbot doesn't support wild-cards and query strings, you can tell them to stay out of /widgets/ completely, cloak that page for that spider and serve it a short or blank page with a <meta name="robots" content="noindex,nofollow"> on it, or simply disallow that spider entirely (whichever of these options is appropriate for your site).
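
In original-standard robots.txt terms, the first and last of those options would look something like the sketch below. The user-agent token is a guess based on the logged string - check which token this crawler actually responds to before relying on it, and use one record or the other, not both:

# Option 1: keep this crawler out of /widgets/ only
User-agent: FAST Enterprise Crawler
Disallow: /widgets/

# Option 3: shut this crawler out of the whole site
User-agent: FAST Enterprise Crawler
Disallow: /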

Jim

caribguy

5:10 am on Nov 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi JD,

Not sure what they claim; I understand that wildcards are not supported by all bots. Since I couldn't find anything about this bugger through a search, I thought maybe the collective WebmasterWorld wisdom would give me an insight...

Perhaps I'm overreacting, but I'm getting a bit tired lately of these uninvited guests that show up for Thanksgiving dinner and grab the whole turkey hot out of the oven before it's even been carved... It gets more tempting by the day to disallow bots by default and only let the big three come in and play.

Lord Majestic

2:19 pm on Nov 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



JD is right.

You have a non-standard disallow directive that is only honoured by a handful of bots; if you use it (as you did), then you should not be surprised that it is not supported by all bots. Have you tried contacting them at their email address about this issue? I would not be surprised if they tell you that your robots.txt is non-standard and that's why it was not honoured.

I don't know if you should disallow all bots, but if you specify a non-standard robots.txt directive then you should limit it to the bots that you know can execute it correctly.

If the /*?not_to URLs are bot traps, then you have really shot yourself in the foot with this one - sadly there is no other standard way to match them, but what you can do is redirect all /*?not_to URLs to /trap.php?not_to and disallow the latter in a standards-compliant way. In that case you will have a point if bots still reach the final trap, though even that is a bit unreliable - the robots.txt standard is fuzzy on how to deal with redirects and this confuses plenty of developers, though logically the robots.txt rules should be checked against redirect targets.
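
A rough sketch of that redirect, assuming an Apache server with mod_rewrite (which the log format above suggests) - the /trap.php name comes from the suggestion above and the not_to_be_crawled parameter from the log excerpt, so adjust both to whatever your trap actually uses:

RewriteEngine On
# don't loop once the request is already for the trap script
RewriteCond %{REQUEST_URI} !^/trap\.php$
# any URL carrying the bot-trap query-string parameter...
RewriteCond %{QUERY_STRING} (^|&)not_to_be_crawled= [NC]
# ...gets redirected to one fixed trap URL (the query string is carried along)
RewriteRule ^ /trap.php [R=302,L]

The robots.txt then only needs a plain prefix rule that every bot should understand:

User-agent: *
Disallow: /trap.php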

Conclusion - you need to keep your robots.txt simple and standards-compliant.