
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278 message thread spans 10 pages: < < 278 ( 1 [2] 3 4 5 6 7 8 9 10 > >     
amazonaws.com plays host to wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.

 

GaryK




msg:3866818
 1:21 am on Mar 10, 2009 (gmt 0)

While I appreciate the information, Jim, that's actually all the more reason for me to want to block the entire net range. :)

jdMorgan




msg:3868480
 12:21 am on Mar 12, 2009 (gmt 0)

you may

-- or you may not. Since this thread will be read by many, I thought that background info "might" be useful.

Jim

enigma1




msg:3870785
 3:08 pm on Mar 15, 2009 (gmt 0)


"Requesting my robots.txt leads to a site-wide ban."

I'm curious as to how you do that, and also why?

You could do it by generating the robots.txt dynamically. You hook into the 404 handler, examine the query, and then synthesize the robots.txt on the fly (you set things up so there is no physical robots.txt file). You can check the UA for a URL-like string, which would imply the visitor wants to enter as a spider.
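
A minimal sketch of the UA check described above, in Python. The regular expression and the names URL_LIKE, looks_like_spider and synthesize_robots_txt are assumptions made for illustration, not anything the poster specified; the "disallow everything" body is likewise only one possible policy.

```python
import re

# Hypothetical heuristic: a UA "looks like a spider" when it embeds
# something URL-like (http://, https://, or a www. host name).
URL_LIKE = re.compile(r"(?:https?://|www\.)[^\s;)\]]+", re.IGNORECASE)


def looks_like_spider(user_agent: str) -> bool:
    """Return True when the User-Agent string contains a URL-like token."""
    return bool(URL_LIKE.search(user_agent))


def synthesize_robots_txt(user_agent: str) -> str:
    """Build a robots.txt body on the fly (assumed policy, for illustration).

    Self-declared spiders get a blanket disallow; everything else gets an
    empty Disallow line, which permits crawling.
    """
    if looks_like_spider(user_agent):
        return "User-agent: *\nDisallow: /\n"
    return "User-agent: *\nDisallow:\n"
```

A 404 handler would call synthesize_robots_txt() only when the requested path is /robots.txt. Plain browser UAs such as the MSIE strings listed earlier contain no URL and fall through to the permissive branch.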

Is it good to ban IPs based on it? No, it's not. Or maybe it is, for the competitors. Countermeasures, if they ever figure it out, include exposing URLs via links or images inside their page HTML that point at the robots.txt file or disallowed directories on the other server, triggering the latter to ban real visitors and popular spiders. I wouldn't want to go down that path.

For the original issue with the amazonaws bots, they pretend to have a search engine via the UA but they don't. They have no public search page, at least from the URLs they advertise, so I block them by host and redirect them to a blackhole. I don't even want to waste b/w on 403 or 404 content.
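
Blocking "by host" can be sketched roughly as follows, in Python. The names BLOCKED_SUFFIXES, is_blocked_host and reverse_lookup are hypothetical; the thread does not say how any poster actually implements the block. Only the suffix test is pure logic; reverse_lookup() needs working DNS and returns an empty string when the IP has no PTR record.

```python
import socket

# Assumed block list: any host name ending in one of these suffixes.
BLOCKED_SUFFIXES = (".amazonaws.com",)


def is_blocked_host(hostname: str) -> bool:
    """True when the (reverse-resolved) host name falls under a blocked suffix."""
    host = hostname.rstrip(".").lower()
    return any(host.endswith(suffix) for suffix in BLOCKED_SUFFIXES)


def reverse_lookup(ip: str) -> str:
    """Return the PTR name for an IP address, or '' when none can be resolved."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return ""
```

Typical use would be is_blocked_host(reverse_lookup(remote_ip)). The leading dot in the suffix keeps a name such as amazonaws.com.example.org from matching.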

Pfui




msg:3871018
 10:24 pm on Mar 15, 2009 (gmt 0)

@enigma1: I may be missing your point but I'm unaware of any of the 50-plus AWS UAs to date "pretend[ing] to have a search engine via the UA."

AmazonAWS [aws.amazon.com] is a server cloud, née farm, not a public search engine per se. (A9.com [a9.com], another Amazon-owned site, is search-related.) It's anybody's guess who/what the amazonaws.com-based bots are crawling for, literally, because the AWS cloud is a cloak.

Pfui




msg:3871021
 10:27 pm on Mar 15, 2009 (gmt 0)

And now, courtesy of Yet Another AWSbot, the domain's first logspam! Oh, joy...

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
robots.txt? NO

Referer: FAKE; Education-related portal in Mexico

[edited by: Pfui at 10:29 pm (utc) on Mar. 15, 2009]

Pfui




msg:3871731
 10:04 pm on Mar 16, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
IIITBOT/1.1 (Indian Language Web Search Engine; [webkhoj.iiit.net;...] pvvpr at iiit dot ac dot in)

robots.txt? YES

dstiles




msg:3871756
 10:48 pm on Mar 16, 2009 (gmt 0)

The included URL goes to a page saying "503 Service Temporarily Unavailable". Otherwise a long-standing domain for an Indian research institute. Maybe that's why they checked for robots.txt. Still going to be unlucky here, though.

Ironically, given another current thread, the domain registration suggests contacting a certain .org site for more info. Good ole netsol! :)

GaryK




msg:3871818
 12:13 am on Mar 17, 2009 (gmt 0)

At least NetSol is a legitimate registrar. :)

incrediBILL




msg:3871825
 12:29 am on Mar 17, 2009 (gmt 0)

Alexa/Internet Archiver uses Elastic Cloud Compute services.

So you may want to allow those sub-ranges of ECS.

Jim,

I think your first sentence negates your second sentence because blocking Alexa and Internet Archiver is an added bonus IMO.

dstiles




msg:3871893
 2:40 am on Mar 17, 2009 (gmt 0)

netsol legit? Hmm. Legally, I suppose but they've "made some bad choices" over the years. :)

Pfui




msg:3881319
 9:22 pm on Mar 29, 2009 (gmt 0)

v0.5 on page 1 of this thread. Up a notch now, w/ same behavior:

ec2-[yada-yada].compute-1.amazonaws.com
Twitturly / v0.6
robots.txt? NO

I usually see Twitturly coming in from .algx.net or .a2webhosting.com

enigma1




msg:3891458
 4:01 pm on Apr 13, 2009 (gmt 0)

@pfui, well, they do pretend to have something like a search engine. Here is an example from my logs:

174.129.111.#*$! - - [10/Apr/2009:19:49:26 -0400] "GET /robots.txt HTTP/1.0" 200 0 "-" "linkdexbot/Nutch-1.0-dev (http://www.example.com/; crawl at bla dot com)"

IP resolves to ec2-#*$!.compute-1.amazonaws.com and states "crawl"

But none of the recorded "crawl" links offers any public search facility. Also, since I don't want to go around in circles over UAs, the various host names, and how AWS changes the IPs every time, I block on ultradns as found in the DNS records, along with everyone else who comes from it.

PS: I altered the URLs in the entry.
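
The "robots.txt? YES/NO" tallies in this thread can be reproduced from an access log along these lines. This is a sketch in Python that assumes the combined log format seen in the entry above; LOG_LINE and robots_report are hypothetical names, and lines that do not match the pattern are simply skipped.

```python
import re

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_report(lines):
    """Map each User-Agent to True if it ever requested /robots.txt, else False."""
    report = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a combined-format line; ignore it
        agent = match.group("agent")
        path = match.group("path").split("?", 1)[0]  # drop any query string
        report[agent] = report.get(agent, False) or path == "/robots.txt"
    return report
```

Feeding it the lines of a log file gives one YES/NO flag per UA, which can then be combined with a host check to see which amazonaws.com visitors bothered to ask.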

Pfui




msg:3904599
 12:53 am on May 1, 2009 (gmt 0)

I've yet to see anything even remotely related to a 'real person in real time' come from amazonaws.com, presumably because it's basically Just Another Server Farm. So I block on the host and move on. If only they would move on.

Too bad 403s don't stop bad bots and UAs, ditto zombies, for good.

enigma1




msg:3904874
 11:51 am on May 1, 2009 (gmt 0)

Too bad 403s don't stop bad bots and UAs, ditto zombies, for good.

Yes, 403s don't, so what I do instead is redirect them to another place, offloading the traffic. Something like a localhost address will do, or another blackhole. Load their own servers instead.
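
One way to picture the redirect, as a sketch only: a small WSGI wrapper in Python that answers blocked hosts with a 301 pointing at a localhost address. The names BLACKHOLE_URL and blackhole_middleware are hypothetical, and the sketch assumes the server fills in REMOTE_HOST (many do not unless reverse lookups are enabled). The poster may well do this in the web server configuration instead.

```python
# Assumed blackhole target: the client's own loopback address.
BLACKHOLE_URL = "http://127.0.0.1/"


def blackhole_middleware(app, is_blocked):
    """Wrap a WSGI app so that blocked hosts are redirected instead of served.

    `is_blocked` is any callable taking a host name and returning True/False.
    """
    def wrapped(environ, start_response):
        host = environ.get("REMOTE_HOST", "")
        if host and is_blocked(host):
            # Empty body: no bandwidth spent on error pages.
            start_response(
                "301 Moved Permanently",
                [("Location", BLACKHOLE_URL), ("Content-Length", "0")],
            )
            return [b""]
        return app(environ, start_response)

    return wrapped
```

Requests with no resolved host name pass through untouched, so a failed reverse lookup never blocks a visitor by accident.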

Pfui




msg:3909760
 8:32 pm on May 8, 2009 (gmt 0)

Ditto. Good ol': [127.0.0.1...]

Pfui




msg:3909767
 8:40 pm on May 8, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
rdfbot/1.0 (Indian Language Web Search Engine; Rediff.com; rdfbotsupport AT rediffmailpro DOT com)

robots.txt? YES

-----
Related, from 11/08:

ec2-[yada-yada].compute-1.amazonaws.com
rdfbot/Nutch-1.0-dev

robots.txt? YES

-----
P.S. Bot bits:

host-202-137-236-nn.rediffdns.com
rdfbot/Nutch-1.0-dev

robots.txt? YES

host-202-137-237-nnn.rediffdns.com
IIITBOT/1.1 (Indian Language Web Search Engine; [webkhoj.iiit.net;...] pvvpr at iiit dot ac dot in)

robots.txt? YES

enigma1




msg:3910819
 8:02 am on May 11, 2009 (gmt 0)

Well here they are again:

67.202.42.--- - - [--/May/2009:--:--:-- -0400] "GET / HTTP/1.1" 301 20 "-" "AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)"

In the same IP range as Pfui mentioned at the beginning of the thread.

If you check the DNS records, they all point to ultradns. So instead of chasing the various IP ranges around, block based on the DNS. I found it to be more effective.

Pfui




msg:3911669
 9:21 am on May 12, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; heritrix/1.14.2 yptrino +http://www.buddybuzz.net/yptrino)

robots.txt? YES

Pfui




msg:3918938
 8:15 am on May 23, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Caliperbot/1.0 (+http://www.conductor.com/caliperbot)

robots.txt? NO

Pfui




msg:3922632
 7:39 pm on May 29, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.5) Gecko/2008120121 Firefox/3.0.5

robots.txt? NO

enigma1




msg:3923048
 1:59 pm on May 30, 2009 (gmt 0)

It's worth checking the response of the IPs when you see the amazonaws PTR. I ran across one entry in particular that is strange:

72.44.61.194 - - [19/May/2009:22:14:44 -0400] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729) RPT-HTTPClient/0.3-3"

When I check whether the IP responds to HTTP requests (port 80), it loads up something like the main Google search page. Does anyone have info about it?

Pfui




msg:3923263
 2:28 am on May 31, 2009 (gmt 0)

Verrrrry interesting about the IP. WHOIS sez Amazon (Amazon Web Services, Elastic Compute Cloud, EC2) but files are certainly Google's. From .js files ('/csi?v=3&s=webhp&action=' etc.) to images:

[72.44.61.194...]
[72.44.61.194...]

Check out the 404 for this dir:

[72.44.61.194...]

Looks like all sorts of companies/hosts hide behind AWS. I wonder if your discovery might have anything to do with Bill's "safebrowsing diagnostic spidering possibly going on at Google that may not be the standard Googlebot [webmasterworld.com]" topic?

If nothing else, that "RPT-HTTPClient [webmasterworld.com]" UA appendage is uncool -- and seriously ancient.

Well, at least the Google-on-AmazonAWS IP didn't show me as logged in to Google...

enigma1




msg:3923424
 11:55 am on May 31, 2009 (gmt 0)

Yes, you see? But what do we know. Lots of strange things. In fact you can start checking IP ranges and subdomains, even from Google itself. Take a look at this one:
209.85.227.141
Now that points to Google all right, and port 80 responds with the Google home page, doesn't it? In fact it translates to the Google appspot: "Run your web applications on Google's infrastructure." And then you search for subdomains of the appspot, all hosted by Google, and all kinds of stuff appears. Looks to me like a hacker's paradise.

And that's just one range while we remember discussions about strange google visits that do not appear to be from googlebot. So to me AWS looks tiny in comparison.

PS: Here is a related thread
[webmasterworld.com...]

Pfui




msg:3924330
 1:31 am on Jun 2, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
LinkbackPlugin/0.1 Laconica/0.7.4

robots.txt? NO

(microblogger link-checker, apparently.)

Pfui




msg:3937544
 1:04 am on Jun 21, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; redditbot/1.0; +http://www.reddit.com/feedback)

robots.txt? NO

Pfui




msg:3937545
 1:06 am on Jun 21, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; woriobot +http://worio.com)

robots.txt? YES

[edited by: Pfui at 1:10 am (utc) on June 21, 2009]

Pfui




msg:3937546
 1:15 am on Jun 21, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

robots.txt? NO

(Interesting cloak -- an old-old FF localized for British English.)

Mokita




msg:3939176
 12:59 am on Jun 24, 2009 (gmt 0)

ec2-174-129-91-58.compute-1.amazonaws.com
deva/deva-1.0 (Deva Fetcher; devarajaswami at yahoo dot com)

robots.txt - Yes, but promptly ignored it by requesting the home page.

Pfui




msg:3946312
 11:54 pm on Jul 4, 2009 (gmt 0)

After all these months of AmazonAWS bot-running others' cloaked UAs, I guess this one was inevitable -- whether it's real or fake:

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]

robots.txt? NO

dstiles




msg:3948610
 7:07 pm on Jul 8, 2009 (gmt 0)

If you're blocking amazonaws by IP you may not be aware of the EU / Irish range. Found it today when something triggered a trap.

AMAZON-EU-AWS
Amazon Web Services, Elastic Compute Cloud, EC2, EU
79.125.0.0 - 79.125.63.255

Full range of Amazon on that block:
IE-AMAZON-20070824
Amazon Data Services Ireland Ltd
79.125.0.0 - 79.125.127.255

Full range now blocked here.

Anyone know of any other ex-USA ranges?
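
For anyone whose firewall or config wants CIDR notation rather than start-end ranges, the two ranges above can be converted with Python's standard ipaddress module. A minimal sketch; the helper names range_to_cidrs and ip_in_cidrs are made up for illustration.

```python
import ipaddress


def range_to_cidrs(first: str, last: str) -> list:
    """Convert an inclusive start-end IP range into the CIDR blocks covering it."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(network) for network in networks]


def ip_in_cidrs(ip: str, cidrs) -> bool:
    """True when the address falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


# The EC2 EU range and the full Amazon Ireland allocation quoted above.
ec2_eu = range_to_cidrs("79.125.0.0", "79.125.63.255")
amazon_ie = range_to_cidrs("79.125.0.0", "79.125.127.255")
```

The EC2 range collapses to a single /18 and the full allocation to a single /17, so blocking the /17 covers both.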

enigma1




msg:3950616
 11:24 am on Jul 11, 2009 (gmt 0)

Here is some other info, not sure if it was posted before, but I see lots of IPs from amazonaws used as Tor proxy servers. These may be transparent proxies serving spam/scraping worldwide.

URL
www DOT torproxylist DOT com
without spaces and real dots instead of DOT.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved