Cuil quietly fixes Twiceler crawler

I've been griping about Twiceler since Cuil (formerly Cuill) went live. Although their Twiceler crawler was "allowed" in my robots.txt file, it was included with several other robots' user-agent strings in a multiple-user-agent robots policy record. Just to be clear, it looks something like this:

User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
User-agent: twiceler
Disallow: /cgi-bin

User-agent: *
Disallow: /

This syntax was defined by the original Standard for Robot Exclusion as agreed by consensus on 30 June 1994 on the robots mailing list, and is valid.

However, Twiceler apparently couldn't parse that, and as a result believed that it was disallowed by the second policy record in the example above.
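For anyone who wants to see what correct handling looks like, here is a minimal Python sketch of a parser that groups consecutive User-agent lines into a single policy record. This is only an illustration under my own assumptions, not Cuil's code: the names `parse_records` and `is_allowed` are made up for this example, and matching is a simple case-insensitive substring test on the robot name with prefix matching on paths.

```python
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
User-agent: twiceler
Disallow: /cgi-bin

User-agent: *
Disallow: /
"""


def parse_records(text):
    """Group robots.txt lines into (agents, disallows) policy records."""
    records = []
    agents, disallows = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # A User-agent line that follows rules starts a new record;
            # consecutive User-agent lines accumulate in the same record.
            if disallows:
                records.append((agents, disallows))
                agents, disallows = [], []
            agents.append(value.lower())
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records


def is_allowed(records, robot_name, path):
    """Return True if robot_name may fetch path under the parsed records."""
    robot_name = robot_name.lower()
    matched, default = None, None
    for agents, disallows in records:
        if any(a != "*" and a in robot_name for a in agents):
            matched = disallows  # a record naming this robot wins
            break
        if "*" in agents and default is None:
            default = disallows  # fall back to the catch-all record
    rules = matched if matched is not None else default
    if rules is None:
        return True
    # An empty Disallow value means "nothing is disallowed".
    return not any(rule and path.startswith(rule) for rule in rules)


records = parse_records(ROBOTS_TXT)
print(is_allowed(records, "twiceler", "/index.html"))
print(is_allowed(records, "twiceler", "/cgi-bin/test.cgi"))
print(is_allowed(records, "SomeOtherBot", "/index.html"))
```

With the record grouped this way, "twiceler" is governed by the first record (everything except /cgi-bin is allowed), and only robots not named there fall through to the catch-all record that disallows everything.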

Twiceler seems to have started correctly parsing multiple-user-agent policy records as of November 18th, 2009. I noticed that it no longer just 'went away' after fetching robots.txt (in which it is allowed) on that date.

No change to the Twiceler user-agent string ( "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)" ) was apparent when this behavior change was noted.

It appears that the change was backed out on December 22nd, 2009, as Twiceler reverted to its former "fetch robots.txt and leave" behavior. Then on January 22nd, 2010, it began fetching pages from my site again, and some of those pages started to appear in the Cuil.com search results (albeit with the common mis-ascribed "sample images").

I'm not sure what they're up to over there, but I take this as a welcome improvement and a sign of life at Cuil -- at least they're evidently working on their infrastructure.

Jim
