Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

very unusual swarm of crawlers

 8:15 pm on Feb 26, 2012 (gmt 0)

I don't believe there's any need to obscure the Class D's, as they're all colos or server farms, with the exception of the MS range.

Has anybody ever seen such an attempt in such a short period?
The first seven requests (pre-line break) took 21 seconds from first to last.

I'm most puzzled by the MS range, and whether it's some kind of tool or open proxy?
Anybody know?

Collectively, the entire crawl must be part of some software or tool, as it's impossible to change IPs and servers that quickly by hand.
Anybody know or recognize a culprit?

- - [26/Feb/2012:19:09:27 +0000] "HEAD MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
- - [26/Feb/2012:19:09:27 +0000] "HEAD MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
- - [26/Feb/2012:19:09:27 +0000] "HEAD MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"
- - [26/Feb/2012:19:09:27 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "Mozilla/5.0 (compatible; TweetmemeBot/2.11; +http://tweetmeme.com/)"
- - [26/Feb/2012:19:09:32 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 403 533 "-" "NING/1.0"
- - [26/Feb/2012:19:09:48 +0000] "GET MyFolder/Partial-Page-Name.. HTTP/1.1" 403 533 "-" "JS-Kit URL Resolver, [js-kit.com...]

- - [26/Feb/2012:19:12:55 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 404 194 "-" "Mozilla/5.0 (compatible; Evrinid Iudex 2.0.0; +http://www.evri.com/evrinid)"
- - [26/Feb/2012:19:13:02 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 404 194 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
- - [26/Feb/2012:19:21:14 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 404 194 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2"
- - [26/Feb/2012:19:25:59 +0000] "HEAD MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "MetaURI API/2.0 +metauri.com"
- - [26/Feb/2012:19:30:29 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 403 533 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/534.24 (KHTML, like Gecko)"
- - [26/Feb/2012:19:32:00 +0000] "GET /robots.txt HTTP/1.1" 200 4815 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (FlipboardProxy/1.1; +http://flipboard.com/browserproxy)"
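For anyone who wants to measure a burst like this rather than eyeball it, here is a minimal Python sketch. It assumes combined-format log entries like the ones above (the client IP is ignored), and the names LOG_RE, parse_line and burst_span_seconds are made up for illustration. The sample uses three of the entries from this post.

```python
import re
from datetime import datetime

# Pulls timestamp, method, status and user agent out of a combined-format
# log entry. The leading client IP is not needed for timing, so it is skipped.
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict for one log entry, or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": m.group("method"),
        "status": int(m.group("status")),
        "agent": m.group("agent"),
    }

def burst_span_seconds(lines):
    """Seconds between the earliest and latest parsable entry."""
    times = [entry["time"] for entry in map(parse_line, lines) if entry]
    if not times:
        return 0.0
    return (max(times) - min(times)).total_seconds()

SAMPLE = [
    '- - [26/Feb/2012:19:09:27 +0000] "HEAD MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "UnwindFetchor/1.0 (+http://www.gnip.com/)"',
    '- - [26/Feb/2012:19:09:27 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 403 - "-" "Mozilla/5.0 (compatible; TweetmemeBot/2.11; +http://tweetmeme.com/)"',
    '- - [26/Feb/2012:19:09:32 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 403 533 "-" "NING/1.0"',
]

print(burst_span_seconds(SAMPLE))  # 5.0
```

Lines that do not parse are skipped rather than raising, which seems the safer choice for messy real-world logs.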



 11:37 pm on Feb 26, 2012 (gmt 0)

The pattern looks exactly like the usual Twitter swarm.

You or someone else puts a message on Twitter with a link leading to a page, and all the dumb bots subscribed to the global Twitter stream start attacking. Some fetch the post, some just resolve the shortened URL and take metadata (such as JS-Kit URL Resolver and MetaURI), and others are typical "sharing services" (such as FlipBoard and PaperLi) making a living off our posts.

I have Wordpress connected to Twitter, and these same patterns start every time within 1-2 seconds after a Publish, then continue more slowly for hours after the Twitter post.

I have many of them blocked. Especially if they do not have a real name (such as JS-Kit, which is merely a library and could be anyone), or all the ones running hidden inside Amazon EC2, where you really have no idea who they are, no matter what name they pretend to have.
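As a rough illustration of tagging these by name, the sketch below checks a user-agent string against the bot names that appear in the log excerpt at the top of this thread. The list and the function name swarm_token are assumptions for the example, not anyone's actual block list, and a name match says nothing about who is really behind the request.

```python
# Substrings taken from the user agents quoted in this thread.
# Matching is case-sensitive on purpose so that "NING/" does not
# hit ordinary lowercase words.
SWARM_TOKENS = (
    "UnwindFetchor",
    "TweetmemeBot",
    "NING/",
    "JS-Kit URL Resolver",
    "Evrinid",
    "MetaURI",
    "FlipboardProxy",
)

def swarm_token(user_agent):
    """Return the first known swarm token found in the UA, else None."""
    for token in SWARM_TOKENS:
        if token in user_agent:
            return token
    return None

print(swarm_token("NING/1.0"))
print(swarm_token("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2"))
```

Anything that returns None here still needs an IP-based check, since the bare-browser user agents in the swarm carry no name at all.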


 12:28 am on Feb 27, 2012 (gmt 0)

Many thanks, DeeCee.

They all requested the same partial and incomplete page name.

All were denied because this directory is highly browser restricted and is used for temp pages and to fulfill immediate needs.

All the IPs were previously denied on my main site, with the exception of one, and it was enjoyable to learn of a colo I wasn't previously aware of ;)


 12:31 am on Feb 27, 2012 (gmt 0)

- - [26/Feb/2012:19:13:02 +0000] "GET MyFolder/Partial-Page-Name HTTP/1.1" 404 194 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

If this is a regular unidentified visitor from MS, then I'll make a line just for it.


 1:17 am on Feb 27, 2012 (gmt 0)

Oh, ###. When did they start claiming to be NT 6? My current plainclothes-msie block only says

MSIE\ 7\.0;\ Windows\ NT\ 5\.[12]

and doesn't include the 65.52 range.

:: wandering off to edit ::


 1:36 am on Feb 27, 2012 (gmt 0)

It's definitely linked to Twitter; I get that swarm of IPs too as soon as I publish an article and tweet about it.

There's some additional info about the MS IP here:

It has no hostname and seems to go through a local IP range ... I wouldn't make a line for it.


Lucy, 6.0 is Vista


This is mine on a standard Vista / IE8 install:
Useragent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)

6.1 for Windows 7

Useragent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)
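To make the NT-number mapping concrete, here is a small Python sketch that reads the "Windows NT x.y" token out of a user-agent string. The table covers only the client versions relevant around the time of this thread, and the names NT_NAMES and windows_name are invented for the example. Server editions share these numbers (Server 2008 also reports 6.0, for instance), so treat the result as a hint only.

```python
import re

# Client Windows releases by NT version number.
NT_NAMES = {
    "5.0": "Windows 2000",
    "5.1": "Windows XP",
    "5.2": "Windows XP x64 / Server 2003",
    "6.0": "Windows Vista",
    "6.1": "Windows 7",
}

NT_RE = re.compile(r"Windows NT (\d+\.\d+)")

def windows_name(user_agent):
    """Map the NT token in a UA to a Windows name, or None if absent."""
    m = NT_RE.search(user_agent)
    if m is None:
        return None
    version = m.group(1)
    return NT_NAMES.get(version, "unknown NT " + version)

print(windows_name("Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"))  # Windows Vista
```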


 1:46 am on Feb 27, 2012 (gmt 0)


The partial/incomplete page name seems to be because someone pasted it wrong for the Tweet (or a URL shortener service). You might still be able to find what was posted using Twitter search.

If you block with that pattern, I believe it might block quite a few normal users as well.
I see the 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0' prefix quite a bit in logs from various user IP ranges around the world. (All loading like normal user browsers.)


 3:35 am on Feb 27, 2012 (gmt 0)

I had something more specific in mind:

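# Deny only when both hold: the 65.52-65.55 range AND the exact bare MSIE 7 / NT 6.0 user agent.
# The parentheses in the user agent are literal, so they are escaped in the pattern.
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 7\.0;\ Windows\ NT\ 6\.0\)$
RewriteRule .* - [F]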
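One detail worth checking in an anchored pattern like this: the parentheses in the user agent are literal characters, so the regex has to escape them, otherwise they act as a group and the pattern never matches the real string. The Python sketch below shows the difference. Python's re module stands in for Apache's PCRE here, which should behave the same for this simple case, and the sample IPs and the name would_block are hypothetical.

```python
import re

UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

IP_RE = re.compile(r"^65\.5[2-5]\.")

# Unescaped parentheses form a capture group, so this expects the text
# "Mozilla/4.0 compatible; ..." with no literal "(" or ")".
UNESCAPED_UA_RE = re.compile(r"^Mozilla/4\.0 (compatible; MSIE 7\.0; Windows NT 6\.0)$")

# Escaped parentheses match the literal characters in the real UA.
ESCAPED_UA_RE = re.compile(r"^Mozilla/4\.0 \(compatible; MSIE 7\.0; Windows NT 6\.0\)$")

def would_block(ip, user_agent):
    """True when both conditions hold, like two stacked RewriteCond lines."""
    return bool(IP_RE.search(ip) and ESCAPED_UA_RE.search(user_agent))

print(UNESCAPED_UA_RE.search(UA) is None)   # True: the unescaped form misses
print(would_block("65.52.0.1", UA))         # True
print(would_block("203.0.113.7", UA))       # False: same UA, other range
```

Because the pattern is anchored at both ends, a normal Vista install with a longer user agent (Trident, .NET CLR tokens and so on) does not match even from inside the range.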


 3:43 am on Feb 27, 2012 (gmt 0)

@DeeCee: What he said. Mine now reads

# lock out plainclothes MSIEbot completely

RewriteCond %{REMOTE_ADDR} ^(65\.5[2-5]|157\.(5[4-9]|60)|207\.46)\.
RewriteCond %{HTTP_USER_AGENT} MSIE\ 7\.0;\ Windows\ NT\ [56]\.\d
RewriteRule (\.html|/)$ - [F]
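If you would rather keep ranges like these as CIDR blocks (for a firewall or a block list), the octet regex in the REMOTE_ADDR line above can be written as five networks. The sketch below is my own translation, not anything from the poster, and it cross-checks the two forms against a handful of made-up test addresses using Python's standard ipaddress module.

```python
import ipaddress
import re

# The address pattern from the rule above.
RANGE_RE = re.compile(r"^(65\.5[2-5]|157\.(5[4-9]|60)|207\.46)\.")

# Assumed CIDR equivalents:
#   65.52-65.55   -> 65.52.0.0/14
#   157.54-157.55 -> 157.54.0.0/15
#   157.56-157.59 -> 157.56.0.0/14
#   157.60        -> 157.60.0.0/16
#   207.46        -> 207.46.0.0/16
NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "65.52.0.0/14",
        "157.54.0.0/15",
        "157.56.0.0/14",
        "157.60.0.0/16",
        "207.46.0.0/16",
    )
]

def in_regex(ip):
    return RANGE_RE.search(ip) is not None

def in_networks(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)

# Edge addresses on both sides of each range (arbitrary test values).
SAMPLES = [
    "65.51.255.255", "65.52.0.0", "65.55.255.255", "65.56.0.0",
    "157.53.9.9", "157.54.1.1", "157.59.255.255", "157.60.200.1",
    "157.61.0.1", "207.46.13.1", "207.47.0.1",
]

for ip in SAMPLES:
    print(ip, in_regex(ip), in_networks(ip))
```

Both checks should agree on every sample; if they ever disagree after an edit, one of the two lists has drifted.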


 5:14 am on Feb 27, 2012 (gmt 0)

Yeah.. That would be better. :)
I only keep httpd.conf or .htaccess rules for specific Bad Crawler patterns.

For IP-specific stuff, I instead run a command to tag new bad guys or IP range policies onto one of my DNSBL lists. Then they can bang away to their hearts' content. Plus, if they "touch" anything they should not be messing with, such as trying to hack Wordpress or pushing on the Joomla Admin doors, my honeypots eventually catch them and do the tagging automagically.

General httpd access and incoming email paths are "hidden" behind DNSBL blocking, and Wordpress blogs run CrudArrest to block spam and security issues.
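For readers who have not used a DNSBL: a lookup is just a DNS query for the IPv4 address with its octets reversed, prefixed to the list's zone. The sketch below shows that convention in Python. The zone dnsbl.example.org and both function names are placeholders, not the poster's actual setup, and is_listed needs a working resolver, so only the name-building part is exercised here.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """True if the DNSBL zone returns an A record for this address.

    Requires network access. A resolution failure is treated as
    'not listed', which is an assumption; some setups may prefer
    to fail closed instead.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

# 192.0.2.99 is a documentation address; the zone is hypothetical.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```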


 3:57 pm on Feb 27, 2012 (gmt 0)

The majority of the bots in the OP are from amazonaws.com IPs, a notorious home base for Twitter swarmers. [webmasterworld.com...]
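One way to spot these in bulk is a reverse DNS lookup followed by a suffix check on the hostname. The sketch below assumes that approach; the function names are invented, the sample hostname is a made-up one in the usual EC2 style, and reverse_hostname needs network access. Not every AWS address has a PTR record, so a miss proves nothing.

```python
import socket

def is_aws_hostname(hostname):
    """True if the hostname is amazonaws.com or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")

def reverse_hostname(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

# Hypothetical hostname, shown only to illustrate the suffix check.
print(is_aws_hostname("ec2-203-0-113-5.compute-1.amazonaws.com"))  # True
print(is_aws_hostname("amazonaws.com.example.net"))                # False
```

The leading dot in the suffix matters: without it, a lookalike such as notamazonaws.com would pass.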

Unfortunately, they're not one-hit wonders. Long after the original tweet(s), most of the AWS-cloaked bots hit day in and day out, HEADs and GETs, good URLs and bad, 200s or 403s, ad nauseam.

In the case of re-re-re-tweets, the bots' brain-dead re-re-re-swarming is almost more comical than wasteful. Almost.


 6:49 pm on Feb 27, 2012 (gmt 0)

Yes, Pfui

I think you summed that up nicely. :)
Matches perfectly. Twitter swarmers are some of the most brain-dead bots out there. They reflect the same level of intelligence as the "sites" that send out the swarm.

Personally, I finally came to the conclusion that nothing good ever reaches out from Amazon EC2, and added whole Amazon-owned CIDR ranges to my block lists.
Amazon really needs to start watching what its customers do from there. Also symptomatic is that quite a few content scrapers coming from there show up using Russian versions of various user-agent strings. They do not even bother hiding WHAT they are; they merely try to hide WHO they are with a cheap account in the EC2 cloud.


 11:29 am on Mar 12, 2012 (gmt 0)

These requests pretty much stopped within a few days; however, some three weeks later an odd-guy-out follows (and from our dear friends at Amazon):

- - [12/Mar/2012:09:22:13 +0000] "GET /SameFolder/SamePartialPagename HTTP/1.1" 403 533 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1"

FF 0.10.1 must be a new version ;)
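A long-obsolete browser version like that is a cheap signal in itself. As a sketch, the Python below pulls the Firefox version out of a user agent and flags anything older than a cutoff. The cutoff of 3.6 is an arbitrary assumption for the example, and the function names are invented.

```python
import re

FIREFOX_RE = re.compile(r"Firefox/(\d+)\.(\d+)")

def firefox_version(user_agent):
    """Return (major, minor) from a Firefox token, or None if absent."""
    m = FIREFOX_RE.search(user_agent)
    if m is None:
        return None
    return (int(m.group(1)), int(m.group(2)))

def looks_stale(user_agent, minimum=(3, 6)):
    """True if the UA claims a Firefox older than the (assumed) cutoff."""
    version = firefox_version(user_agent)
    return version is not None and version < minimum

UA_OLD = "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1"
UA_NEW = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2"

print(firefox_version(UA_OLD), looks_stale(UA_OLD))  # (0, 10) True
print(firefox_version(UA_NEW), looks_stale(UA_NEW))  # (10, 0) False
```

Comparing tuples of integers avoids the string-comparison trap where "10" sorts before "3".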


 1:59 am on Mar 13, 2012 (gmt 0)

I concur that it's similar to a twitter swarm, but not the same exact swarm I see when I test twitter, which I do a lot.

BTW, if you want to test bot blocking code, twitter rocks: just post a link to some URL and within seconds a bunch of bots jump on it faster than flies on fresh feces. It's a great way to get the bots to actually do something useful for a change by helping you test your filters and firewalls, and it's fun to watch.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved