Cuil quietly fixes Twiceler crawler

To handle multi-user-agent policies in robots.txt

         

jdMorgan

6:43 pm on Jan 23, 2010 (gmt 0)


I've been griping about Twiceler since Cuil (formerly Cuill) went live. Although their Twiceler crawler was "allowed" in my robots.txt file, it was included with several other robots' user-agent strings in a multiple-user-agent robots policy record. Just to be clear, it looks something like this:

User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
User-agent: twiceler
Disallow: /cgi-bin

User-agent: *
Disallow: /


This syntax was defined by the original Standard for Robot Exclusion as agreed by consensus on 30 June 1994 on the robots mailing list, and is valid.
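For anyone who wants to check how a spec-following parser reads a record like this, here is a minimal sketch using Python's standard-library urllib.robotparser. The paths and the "SomeOtherBot" name are made up for illustration. Note that this particular parser matches on the product token, so the sketch passes just "twiceler" rather than the full user-agent string.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
User-agent: twiceler
Disallow: /cgi-bin

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
twiceler_page = parser.can_fetch("twiceler", "/some-page.html")
twiceler_cgi = parser.can_fetch("twiceler", "/cgi-bin/script")
other_page = parser.can_fetch("SomeOtherBot", "/some-page.html")

print("twiceler, normal page:", twiceler_page)
print("twiceler, /cgi-bin:   ", twiceler_cgi)
print("other bot, any page:  ", other_page)
```

If the record is read as intended, the named robots are kept out of /cgi-bin only, and everything else falls under the catch-all record.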

However, Twiceler apparently couldn't parse that, and as a result believed that it was disallowed by the second policy record in the example above.
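To illustrate what correct handling involves, here is a rough sketch of the grouping logic: consecutive User-agent lines accumulate into one record, and the catch-all record is consulted only if no named record matches. This is my own simplified illustration, not Cuil's code; parse_records and is_allowed are hypothetical names, only Disallow is handled, and matching is a case-insensitive substring test on the robot name.

```python
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
User-agent: twiceler
Disallow: /cgi-bin

User-agent: *
Disallow: /
"""


def parse_records(text):
    """Return a list of (agents, disallow_prefixes) records."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            # A blank line ends the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:
                # A User-agent line after rules starts a new record.
                records.append((agents, rules))
                agents, rules = [], []
            # Consecutive User-agent lines share the same record.
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def is_allowed(records, user_agent, path):
    """Apply the first named record that matches, else the '*' record."""
    ua = user_agent.lower()
    chosen = None
    for agents, rules in records:
        if any(a != "*" and a in ua for a in agents):
            chosen = rules
            break
    if chosen is None:
        for agents, rules in records:
            if "*" in agents:
                chosen = rules
                break
    if chosen is None:
        return True  # no applicable record: nothing is disallowed
    # An empty Disallow value disallows nothing.
    return not any(rule and path.startswith(rule) for rule in chosen)


records = parse_records(ROBOTS_TXT)
twiceler_ua = "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
print(is_allowed(records, twiceler_ua, "/index.html"))
print(is_allowed(records, twiceler_ua, "/cgi-bin/test"))
print(is_allowed(records, "SomeOtherBot/1.0", "/index.html"))
```

The point is that a robot should fall through to the "User-agent: *" record only when none of the named records lists it.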

Twiceler seems to have started correctly parsing multiple-user-agent policy records as of November 18th, 2009. On that date, I noticed that it no longer just 'went away' after fetching robots.txt (in which it is allowed).

No change to the Twiceler user-agent string ( "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)" ) was apparent when this behavior change was noted.

It appears that the change was backed out on December 22nd, 2009, as Twiceler reverted to its former "fetch robots.txt and leave" behavior. But then on January 22nd, 2010, it began fetching pages from my site again, and some of those pages started to appear in the Cuil.com search results (albeit with the common mis-ascribed "sample images").

I'm not sure what they're up to over there, but I take this as a welcome improvement and a sign of life at Cuil -- at least they're evidently working on their infrastructure.

Jim

keyplyr

8:27 pm on Jan 23, 2010 (gmt 0)


Now if they'd stop their stealth (no UA) scraping of Twitter/Facebook links then I might take them seriously and give them access.

jmccormac

8:38 pm on Jan 23, 2010 (gmt 0)


Never saw any traffic from Cuil. It crawled stuff forbidden by robots.txt. Deep-sixed - permanently.

Regards...jmcc