Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Twiceler/cuil.com craziness (FWIW)

 7:16 am on Dec 6, 2009 (gmt 0)

Anyone else seeing odd things with Cuil's crawler in recent days? All hits are always for robots.txt but today -- 36 times in ~12 hours?! Both with the usual UA --

Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)

-- and with no UA at all (scroll down to see differently-named servers). Usually Twiceler visits a few times a day. Never, ever like this:

[11:23:29] crawl-14c.cuil.com
[12:11:36] crawl-4c.cuil.com
[19:21:00] crawl-1c.cuil.com
[19:24:52] crawl-1c.cuil.com
[20:41:26] crawl-14c.cuil.com
[20:45:52] crawl-14c.cuil.com
[20:57:06] crawl-15c.cuil.com
[21:01:28] crawl-15c.cuil.com
[21:06:45] crawl-12c.cuil.com
[21:09:12] crawl-17c.cuil.com
[21:09:33] crawl-19c.cuil.com
[21:11:17] crawl-12c.cuil.com
[21:13:37] crawl-17c.cuil.com
[21:13:58] crawl-5c.cuil.com
[21:14:07] crawl-19c.cuil.com
[21:14:23] crawl-16c.cuil.com
[21:14:53] crawl-7c.cuil.com
[21:17:45] crawl-4c.cuil.com
[21:18:16] crawl-5c.cuil.com
[21:18:55] crawl-16c.cuil.com
[21:19:14] crawl-7c.cuil.com
[21:20:32] crawl-9c.cuil.com
[21:24:47] crawl-9c.cuil.com
[21:28:59] crawl-18c.cuil.com
[21:29:49] crawl-2c.cuil.com
[21:30:11] crawl-3c.cuil.com
[21:33:45] crawl-18c.cuil.com
[21:34:18] crawl-8c.cuil.com
[21:34:20] crawl-2c.cuil.com
[21:35:14] crawl-3c.cuil.com
[21:38:36] crawl-8c.cuil.com
[21:43:49] crawl-6c.cuil.com
[21:47:55] crawl-6c.cuil.com

And these were without any UA at all. At least they did read/heed robots.txt --
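For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that tallies hits per host and measures the gap between repeat visits from the same host. It assumes lines shaped like the `[HH:MM:SS] host` listing above; the function names `tally_hosts` and `repeat_gaps` are made up for illustration, and the sample is only a handful of lines copied from that listing.

```python
import re
from collections import Counter

# A few lines copied from the listing above; a real run would read a log file.
SAMPLE = """\
[21:06:45] crawl-12c.cuil.com
[21:09:12] crawl-17c.cuil.com
[21:09:33] crawl-19c.cuil.com
[21:11:17] crawl-12c.cuil.com
[21:13:37] crawl-17c.cuil.com
"""

# Assumed line shape: "[HH:MM:SS] hostname"
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(\S+)$")


def tally_hosts(text):
    """Count hits per host, skipping lines that do not match the assumed shape."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(4)] += 1
    return counts


def repeat_gaps(text):
    """Return (host, seconds) pairs for each repeat hit from an already-seen host."""
    last_seen = {}
    gaps = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        hours, minutes, seconds, host = match.groups()
        stamp = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if host in last_seen:
            gaps.append((host, stamp - last_seen[host]))
        last_seen[host] = stamp
    return gaps


if __name__ == "__main__":
    print(tally_hosts(SAMPLE))
    print(repeat_gaps(SAMPLE))
```

The sketch ignores midnight rollover, since the timestamps carry no date.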




 9:45 pm on Dec 6, 2009 (gmt 0)

Cuil's stealthy behavior also comes from ramp1hq.cuil.com at Layer42, and they have earned themselves a ban.


 2:21 pm on Dec 7, 2009 (gmt 0)

I've had this thing requesting robots.txt and not proceeding any further (it wouldn't get in anyway) for more than a few days now.

These IPs make multiple requests, in no specific order but quite close together.

Lord Majestic

 10:45 pm on Dec 8, 2009 (gmt 0)


Such behavior can happen in search engines when they are cleaning up a large list of URLs to eliminate those that have been disallowed. As this process is likely to be done in parallel it can manifest itself as described above.
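To illustrate the idea (this is a toy model, not a claim about how Cuil's pipeline actually works): if a URL list is split across several workers and each worker pulls robots.txt on its own before filtering its share, the site sees one robots.txt request per worker, all close together. The worker names, `ExampleBot`, and the `fetch_robots` stand-in below are hypothetical; no network traffic is generated.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site being cleaned up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

fetch_log = []  # one entry per simulated robots.txt request


def fetch_robots(worker_name):
    """Stand-in for an HTTP GET of /robots.txt; it only records who asked."""
    fetch_log.append(worker_name)
    return ROBOTS_TXT


def clean_shard(worker_name, urls):
    """One worker: fetch robots.txt independently, then drop disallowed URLs."""
    parser = RobotFileParser()
    parser.parse(fetch_robots(worker_name).splitlines())
    return [url for url in urls if parser.can_fetch("ExampleBot", url)]


def clean_in_parallel(shards):
    """Run every shard on its own thread and merge the surviving URLs."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(clean_shard, name, urls)
                   for name, urls in shards.items()]
        kept = []
        for future in futures:
            kept.extend(future.result())
    return kept


if __name__ == "__main__":
    shards = {
        "worker-1": ["http://example.com/a", "http://example.com/private/x"],
        "worker-2": ["http://example.com/b", "http://example.com/private/y"],
    }
    print(clean_in_parallel(shards))
    print(len(fetch_log), "robots.txt requests for one site")
```

A shared robots.txt cache across workers would collapse those requests into one, which is the sort of thing that is easy to skip in a one-off cleanup job.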



 3:09 am on Dec 9, 2009 (gmt 0)

All requests on my sites look normal today -- and that's actually a new thing, because prior to last month, Twiceler apparently did not understand multi-user-agent policy records in robots.txt, and as a result didn't crawl the sites. That's changed now, and they're crawling away (at a normal rate).
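For reference, a multi-user-agent record is one where several User-agent lines share a single set of rules. Below is a small sketch of such a record and how Python's standard `urllib.robotparser` reads it; the paths and the `ExampleBot` name are made up, and this says nothing about how Twiceler's own parser is written.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the first record names two user agents that
# share one rule set; every other robot is shut out by the second record.
ROBOTS_TXT = """\
User-agent: Twiceler
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if user_agent may fetch path under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Twiceler", "ExampleBot", "OtherBot"):
        print(agent, allowed(agent, "/index.html"), allowed(agent, "/private/x"))
```

A crawler that fails to match itself to the grouped record would fall through to the catch-all record and stay away from the whole site, which would be consistent with the no-crawl behavior described above.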

Combined with Lord Majestic's speculation above and the "ramp" hosts with no UA, it wouldn't surprise me if they're preparing to roll out a new index some time soon.



 1:47 am on Dec 18, 2009 (gmt 0)

FWIW... Twiceler's still hammering away at the same site, every single day, usually in the late afternoon/early evening (Pacific). The next time I'm procrastinating something dreadful, I'll e-mail them about their overkill hits to robots.txt:

[17:12:36] crawl-2c.cuil.com
[17:12:37] crawl-6c.cuil.com
[17:12:57] crawl-7c.cuil.com
[17:12:59] crawl-8c.cuil.com
[17:13:05] crawl-14c.cuil.com
[17:13:30] crawl-4c.cuil.com
[17:17:55] crawl-5c.cuil.com
[17:17:58] crawl-17c.cuil.com
[17:17:58] crawl-12c.cuil.com
[17:18:05] crawl-19c.cuil.com
[17:18:11] crawl-9c.cuil.com
[17:18:22] crawl-16c.cuil.com
[17:18:23] crawl-3c.cuil.com
[17:18:34] crawl-1c.cuil.com

Do any of you ever get any traffic from them? I don't.


 7:05 am on Dec 29, 2009 (gmt 0)

Despite what Cuil's PR people would claim, in Irish, "Cuil" means fly or bug. Despite the claims of genius made about its founders, Cuil is a pest and sends zero traffic to two of my sites. One of them is one of the largest Irish web directories and the other is a very large domain history and domain statistics website. They've been hammering away for months, but normally when they start getting problematic they automatically get slapped with a 503. They've been 403ed on the web directory site for not following robots.txt.
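The automatic 503 could be done many ways; as one rough sketch (not the setup actually used on these sites), a sliding-window counter per client returns 503 once a client goes over a limit within a time window. The class name `Throttle`, the default limit, and the window length are all assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class Throttle:
    """Sliding-window counter: 503 once a client exceeds max_hits per window."""

    def __init__(self, max_hits=10, window_seconds=60.0, clock=time.monotonic):
        self.max_hits = max_hits
        self.window = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self.hits = defaultdict(deque)  # client -> timestamps of recent hits

    def status_for(self, client):
        """Record one request from client and return the HTTP status to send."""
        now = self.clock()
        recent = self.hits[client]
        # Forget hits that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        return 503 if len(recent) > self.max_hits else 200


if __name__ == "__main__":
    throttle = Throttle(max_hits=3, window_seconds=60.0)
    for _ in range(5):
        print(throttle.status_for("crawl-1c.cuil.com"))
```

Rejected requests still count toward the window here, so a crawler that keeps hammering stays throttled until it backs off.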

Lord Majestic's speculation is a possibility. The last I heard of Cuil was that it was trying some social search engine experiments and some Twitter stuff was being integrated.



All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved