
LargeSmall Crawler

         

Pfui

5:06 pm on Sep 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Currently:

ec2-[yada-yada].compute-1.amazonaws.com
LargeSmall Crawler (LargeSmall; [onespot.com;...] info@onespot.com)

robots.txt? Yes

(See also: "amazonaws.com plays host to wide variety of bad bots [webmasterworld.com]")

Formerly:

prod-crawler-1.largesmall.com
LargeSmall Crawler

dev-app-1.largesmall.com
LargeSmall Crawler

Prior versions/hosts did not ask for robots.txt. Nice that it does now, given that OneSpot aggregates and sells what it crawls.
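For anyone who'd rather not be aggregated, a minimal robots.txt rule. This is a sketch that assumes the crawler matches on its "LargeSmall" token; that's not confirmed anywhere, so test before relying on it:

```
# Hypothetical: assumes the bot honors its "LargeSmall" token
User-agent: LargeSmall
Disallow: /
```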

Pfui

2:59 am on Sep 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Addendum:

In the following still-active version, which just hit me crawling from its own host, LargeSmall still does NOT request robots.txt:

prod-crawler-1.largesmall.com
LargeSmall Crawler

robots.txt? NO

Whoa. onespot.com and largesmall.com have numerous same-content pages. Can you say, "Duplicate content [google.com]"?

jdMorgan

12:20 am on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This "LargeSmall" user-agent is the most annoying thing I've seen yet out of Amazon's Compute Cloud. It's hit my server exactly 40 times in the past ten hours, fetching robots.txt repeatedly -- sometimes from the same IP address in the same 2- to 4-second time period.

I guess its response to not liking the fact that it's Disallowed (or to not understanding a particular robots.txt file) is to simply fetch it again... and again... and again... Another badly-broken 'bot.
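Since robots.txt clearly isn't getting through to this one, a server-side block may be the only option. A sketch for Apache with mod_rewrite, assuming the User-Agent string always contains "LargeSmall" (adapt to your own setup):

```
# Hypothetical .htaccess sketch: return 403 Forbidden to any request
# whose User-Agent contains "LargeSmall" (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} LargeSmall [NC]
RewriteRule .* - [F]
```

The [NC] flag makes the match case-insensitive, and [F] sends a 403 without serving any content.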

Jim