Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 278-message thread spans 10 pages; this is page 3.
amazonaws.com plays host to a wide variety of bad bots
Most recently seen: Gnomit
Pfui




msg:3828720
 3:04 am on Jan 18, 2009 (gmt 0)

ec2-67-202-57-30.compute-1.amazonaws.com
Mozilla/5.0 (compatible; X11; U; Linux i686 (x86_64); en-US; +http://gnomit.com/) Gecko/2008092416 Gnomit/1.0"

- robots.txt? NO
- Unbalanced quotation mark in UA (closing only)
- site in UA yields this oh-so-descriptive info:

<html>
<head>
</head>
<body>
</body>
</html>

----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you; or newly renamed. Here are the actual UA strings; in no particular order:

NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO

feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO

Twitturly / v0.5
robots.txt? NO

YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO

YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes

Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO

PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES

EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES

Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO

TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO

Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO

Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES

yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO

Mozilla/5.0
robots.txt? NO

Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES

TinEye
robots.txt? NO

Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES

nnn/ttt (n)
robots.txt? YES

AideRSS/1.0 (aiderss.com)
robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO

----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO

WebClient
robots.txt? YES

----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:

Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES

Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO

-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
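
For readers who want to script a similar blanket block, here is a minimal Python sketch. It is not the poster's actual setup, which the thread does not show. It assumes you already have the visitor's reverse-DNS hostname in hand; the function names are made up for illustration, and the hostname pattern is inferred only from the EC2 names quoted in this thread.

```python
import re
from typing import Optional

# EC2 reverse-DNS names as quoted in this thread, e.g.
#   ec2-67-202-57-30.compute-1.amazonaws.com
# The pattern is an assumption based on those examples only.
EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})"
    r"\.(?:[a-z0-9-]+\.)*amazonaws\.com\.?$",
    re.IGNORECASE,
)

def is_amazonaws_host(hostname: str) -> bool:
    """True if the hostname is amazonaws.com or one of its subdomains."""
    host = hostname.strip().rstrip(".").lower()
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")

def ec2_host_to_ip(hostname: str) -> Optional[str]:
    """Recover the dotted IP embedded in an ec2-a-b-c-d hostname, if any."""
    match = EC2_HOST.match(hostname.strip())
    if not match:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)

print(is_amazonaws_host("ec2-67-202-57-30.compute-1.amazonaws.com"))
print(ec2_host_to_ip("ec2-67-202-57-30.compute-1.amazonaws.com"))
```

Reverse DNS is set by whoever controls the IP block, so a careful setup would also resolve the hostname forward again and confirm it maps back to the same IP before trusting it.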

 

enigma1




msg:3950616
 11:24 am on Jul 11, 2009 (gmt 0)

Here is some other info; not sure if it was posted before, but I see lots of IPs from amazonaws used as Tor proxy servers. These may be transparent proxies serving spam/scraping worldwide.

URL
www DOT torproxylist DOT com
(remove the spaces and use real dots in place of DOT)

Pfui




msg:3950688
 5:12 pm on Jul 11, 2009 (gmt 0)

enigma, I'm confused. Do you have a log entry you could post, please? TIA

Pfui




msg:3950724
 7:31 pm on Jul 11, 2009 (gmt 0)

I'm a Tor [en.wikipedia.org] noob but a quick search of related info yielded 14 publicly accessible amazonaws.com Hosts/IPs running open proxies, arguably in violation of the AWS Customer Agreement, e.g., section 5.4.5. Network:

You may not operate network services such as:
Open proxies.

(etc.)

Two of the following IPs, the 79s, map to --

ec2-[yada-yada].eu-west-1.compute.amazonaws.com

-- and the remainder to this thread's (in)famous:

ec2-[yada-yada].compute-1.amazonaws.com

67.202.11.nnn
67.202.30.nn
67.202.44.nnn
67.202.47.nn
67.202.37.nnn
75.101.155.nnn
75.101.201.nn
79.125.50.nn
79.125.60.nn
174.129.110.nnn
174.129.140.nnn
174.129.156.nnn
174.129.145.nnn
174.129.210.nnn

enigma1




msg:3950929
 11:46 am on Jul 12, 2009 (gmt 0)

Pfui, if you check the list on the site I posted, you will see quite a few domains there that belong to amazonaws (as well as to various other hosts and ISPs).

So basically someone runs Tor on his system or server and provides a portal to others. All your server and my server see is the IP of the portal/proxy, with no indication of anything else, as these are transparent.

I caught one doing it only because it used the standard HTTP ports, so when I scanned port 80 it responded. When I searched for info about that particular IP, I found that site with the Tor list, which includes several amazonaws IPs.

Pfui




msg:3963318
 7:35 pm on Jul 31, 2009 (gmt 0)

And the bots go on and on and on and on. Multiple-file request sessions x2 days on backwater sites. Per usual, undeterred by 403s, ditto 404s, even 301s (to 127.0.0.1:)

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3

07/30 04:18:09 /
07/30 04:18:28 /
07/30 04:19:03 /m/
07/30 04:19:05 /mobile/
07/30 04:19:06 /mobi/
07/30 04:19:06 /iphone/
07/30 04:19:09 /pda/
07/30 04:19:25 /m/
07/30 04:19:28 /mobile/
07/30 04:19:32 /mobi/
07/30 04:19:33 /iphone/
07/30 04:19:33 /pda/

[edited by: Pfui at 7:46 pm (utc) on July 31, 2009]

Pfui




msg:3963322
 7:46 pm on Jul 31, 2009 (gmt 0)

Another example of the kind of activity that ticks me off no matter who or what is doing it. I don't mind reddit per se. I DO mind no robots.txt then 20 requests to the exact same file. All 403'd to no avail, per usual w/ amazonaws.

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; redditbot/1.0; +http://www.reddit.com/feedback)

07/27 09:59:07
07/27 09:59:09
07/27 10:00:11
07/27 10:00:12
07/27 10:01:06
07/27 10:01:08
07/27 10:02:08
07/27 10:02:09
07/27 10:03:08
07/27 10:03:09
07/27 10:04:07
07/27 10:04:08
07/27 10:05:08
07/27 10:05:09
07/27 10:06:11
07/27 10:06:12
07/27 10:07:13
07/27 10:07:14
07/27 10:08:08
07/27 10:08:10
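
The pattern above (one host hammering one file) is easy to flag automatically. The following is a rough Python sketch under stated assumptions: the log has already been parsed into (host, path, datetime) tuples sorted by time, and the function name, the limit of 5 hits and the 10-minute window are arbitrary illustrative choices, not anything the poster describes using.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def repeat_offenders(hits, limit=5, window=timedelta(minutes=10)):
    """Return the (host, path) pairs requested more than `limit` times
    within any `window`-long stretch.

    `hits` is an iterable of (host, path, datetime) sorted by time.
    """
    recent = defaultdict(deque)   # (host, path) -> timestamps still in window
    flagged = set()
    for host, path, when in hits:
        stamps = recent[(host, path)]
        stamps.append(when)
        # Drop timestamps that have fallen out of the sliding window.
        while when - stamps[0] > window:
            stamps.popleft()
        if len(stamps) > limit:
            flagged.add((host, path))
    return flagged
```

Fed the 20 timestamps listed above for a single file, the defaults would flag that host/file pair, since all 20 requests fall inside one 10-minute window.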

Pfui




msg:3963335
 8:06 pm on Jul 31, 2009 (gmt 0)

And one more, sadly inevitable what with so many AWS servers in play...

Here's a zombied [en.wikipedia.org] amazonaws.com machine that was part of a small spam-botnet with Chinese fellow travelers:

ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:17

121.28.7.nnn
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:20

210.52.58.nn
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:25

(Botnets adore that UA, so much so that I 403 it from the get-go.)
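
One way to spot this sort of coordinated visit in a log is to look for the same UA arriving from several different hosts within a few seconds. The Python sketch below is only a heuristic under assumptions: the function name, the 10-second window and the three-host threshold are invented for illustration, and the sample data simply reuses the (masked) hosts and times from the post, with the year taken from the post date.

```python
from datetime import datetime, timedelta

def ua_bursts(hits, window=timedelta(seconds=10), min_hosts=3):
    """Return UAs seen from at least `min_hosts` distinct hosts within `window`.

    `hits` is a list of (host, ua, datetime) sorted by time.
    """
    flagged = set()
    for i, (_, ua, start) in enumerate(hits):
        hosts = set()
        for host, other_ua, when in hits[i:]:
            if when - start > window:
                break
            if other_ua == ua:
                hosts.add(host)
        if len(hosts) >= min_hosts:
            flagged.add(ua)
    return flagged

UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
hits = [
    ("ec2-[yada-yada].compute-1.amazonaws.com", UA, datetime(2009, 7, 31, 9, 50, 17)),
    ("121.28.7.nnn", UA, datetime(2009, 7, 31, 9, 50, 20)),
    ("210.52.58.nn", UA, datetime(2009, 7, 31, 9, 50, 25)),
]
print(ua_bursts(hits) == {UA})  # True
```

That UA string is also what a stock IE6 on Windows XP SP2 sends, so on a busy site a burst check like this would need a tighter window or extra signals to avoid catching ordinary visitors.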

Pfui




msg:3965053
 12:58 am on Aug 4, 2009 (gmt 0)

ec2-[yada-yada]-159.compute-1.amazonaws.com
OMGCrawler 1.0

robots.txt? YES

GaryK




msg:3965839
 3:21 am on Aug 5, 2009 (gmt 0)

OMGCrawler visited me on the 3rd too. I kicked it out cause it's from AWS.

Pfui




msg:3970021
 6:56 pm on Aug 11, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
GingerCrawler/1.0 (Language Assistant for Dyslexics; www.gingersoftware.com/crawler_agent.htm; support at ginger software dot com)

robots.txt? YES

See also the GingerCrawler thread: GingerCrawler/1.0 [webmasterworld.com]

Pfui




msg:3973197
 5:27 am on Aug 17, 2009 (gmt 0)

ec2-[yada-yada]-98.compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727;)

robots.txt? NO

Note the misconfigured UA 'ending.'

Pfui




msg:3979935
 9:23 am on Aug 28, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
Acquia Crawler

robots.txt? Yes BUT... Three minutes after home page grab.

This just in (from -0700)... 403s to all files but robots.txt do not dissuade this new pest hailing from multiple AWS hosts:

08/28 00:40:22 /
08/28 00:43:43 /robots.txt
08/28 01:30:59 /
08/28 01:34:06 /robots.txt
08/28 01:51:42 /
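
The "robots.txt? Yes BUT..." verdict can be put into numbers by measuring how long after its first page request a visitor first asks for robots.txt. Here is a small Python sketch, assuming hits are available as (timestamp, path) pairs in log order; the helper names are hypothetical, and the year is assumed from the post date because the log excerpt omits it.

```python
from datetime import datetime

def parse_stamp(stamp, year=2009):
    # The log excerpt has no year; 2009 is an assumption from the post date.
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")

def robots_lag_seconds(hits):
    """Seconds between the first page request and the first robots.txt request.

    Positive: pages were fetched before robots.txt. Negative: robots.txt came
    first. None: one of the two never happened.
    """
    first_page = first_robots = None
    for stamp, path in hits:
        when = parse_stamp(stamp)
        if path == "/robots.txt":
            if first_robots is None:
                first_robots = when
        elif first_page is None:
            first_page = when
    if first_page is None or first_robots is None:
        return None
    return (first_robots - first_page).total_seconds()

hits = [
    ("08/28 00:40:22", "/"),
    ("08/28 00:43:43", "/robots.txt"),
    ("08/28 01:30:59", "/"),
    ("08/28 01:34:06", "/robots.txt"),
    ("08/28 01:51:42", "/"),
]
print(robots_lag_seconds(hits))  # 201.0
```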

Pfui




msg:3989522
 3:57 am on Sep 15, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
LWP::Simple/5.808

robots.txt? NO

Twitter-related.

[edited by: Pfui at 4:03 am (utc) on Sep. 15, 2009]

Pfui




msg:3989524
 3:59 am on Sep 15, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
bitlybot

robots.txt? NO

Twitter-related.

Pfui




msg:3989526
 4:00 am on Sep 15, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
PycURL/7.18.2

robots.txt? NO

Twitter-related.

Pfui




msg:3992964
 4:48 pm on Sep 21, 2009 (gmt 0)

ec2-[yada-yada].compute-1.amazonaws.com
LargeSmall Crawler (LargeSmall; [onespot.com;...] info@onespot.com)

robots.txt? Yes

Pfui




msg:3995335
 11:59 pm on Sep 24, 2009 (gmt 0)

LargeSmall [webmasterworld.com] is a pest but this kind of activity is just asinine --

ec2-174-129-236-193.compute-1.amazonaws.com
larbin_2.6.3 (larbin2.6.3@unspecified.mail)

09/24 13:56:58 /robots.txt
09/24 14:00:55 /robots.txt
09/24 14:03:55 /robots.txt
09/24 14:09:55 /robots.txt
09/24 14:13:22 /robots.txt
09/24 14:23:35 /robots.txt
09/24 14:50:42 /robots.txt
09/24 15:01:36 /robots.txt
09/24 15:08:40 /robots.txt
09/24 15:12:41 /robots.txt

O, if only I had a nickel for every useless, log-filling hit from amazonaws.com!

Pfui




msg:3996567
 3:01 pm on Sep 27, 2009 (gmt 0)

This time bitlybot requested robots.txt -- and just as promptly ignored it.

ec2-174-129-227-79.compute-1.amazonaws.com
bitlybot
09/27 04:49:16 /robots.txt
09/27 04:49:17 /dir/filename.html
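
Whether a request like the second one actually violates robots.txt can be checked with Python's standard `urllib.robotparser`. This is a sketch only: the thread never shows the site's real robots.txt, so the rules below are hypothetical, and `allowed` is just an illustrative wrapper name.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """True if `user_agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Hypothetical rules; the actual file is not quoted in the thread.
RULES = """\
User-agent: bitlybot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(allowed(RULES, "bitlybot", "/dir/filename.html"))
print(allowed(RULES, "SomeOtherBot", "/dir/filename.html"))
```

Under these assumed rules the bitlybot request for /dir/filename.html would be disallowed, while a differently named bot would be free to fetch it.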

Pfui




msg:3999276
 2:14 pm on Oct 1, 2009 (gmt 0)

ec2-75-101-138-11.compute-1.amazonaws.com
mefashpesh (pishpush.com)

robots.txt? NO

Pfui




msg:3999367
 4:44 pm on Oct 1, 2009 (gmt 0)

ec2-174-129-158-130.compute-1.amazonaws.com
taptubot *** please read [taptu.com...] ***

robots.txt? Yes

Pfui




msg:4000904
 4:49 pm on Oct 4, 2009 (gmt 0)

Firefox/1.6a1? Oh, please.

ec2-67-202-51-187.compute-1.amazonaws.com
Mozilla/5.0 (Windows; U; Windows NT 5.2 x64; en-US; rv:1.9a1) Gecko/20060214 Firefox/1.6a1

robots.txt? No

keyplyr




msg:4002000
 9:23 am on Oct 6, 2009 (gmt 0)

UA: MetaURI API +metauri.com
rDNS: ec2-75-101-232-27.compute-1.amazonaws.com. [Verified]
robots.txt: No

Pfui




msg:4011990
 2:41 am on Oct 23, 2009 (gmt 0)

ec2-75-101-221-99.compute-1.amazonaws.com
ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)

robots.txt? Yes BUT -- ignored it.

Last Feb. (up-thread; mssg.#: 3848081), the preceding UA was A-OK w/ robots.txt. No longer, at least not when run by amazonaws.com.

Still fully compliant when run from archive.org using this one:

ia310738.us.archive.org
ia_archiver-web.archive.org

dstiles




msg:4012428
 9:28 pm on Oct 23, 2009 (gmt 0)

Why do you think this is genuine ia_archiver? I don't accept the UA anyway but could it be fake?

Pfui




msg:4012434
 10:07 pm on Oct 23, 2009 (gmt 0)

I just report 'em as I see 'em emerge from amazonaws.com's cloud cover. I don't know if they're fake or not.

Given the current "google.com -- spoof? spider? botnet zombie? employee? [webmasterworld.com]" mystery sightings, I guess everything could be fake.

GaryK




msg:4013667
 5:34 pm on Oct 26, 2009 (gmt 0)

[api.samepoint.com;...] admin@samepoint.com
174.129.119.nnn
ec2-174-129-119-nnn.compute-1.amazonaws.com
-----
Address: Amazon Web Services, Elastic Compute Cloud, EC2
NetRange: 174.129.0.0 - 174.129.255.255
-----
ROBOTS.TXT? No
-----

Took the default root page and one xml file then left.
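
A WHOIS NetRange like the one above can be turned into CIDR form and tested against visitor IPs with Python's standard `ipaddress` module. The sketch below uses hypothetical function names, and the sample address fills in a made-up last octet because the post masks it. The range is the one quoted in the post, as of 2009; current AWS allocations may differ.

```python
import ipaddress

def netrange_to_cidrs(first: str, last: str):
    """Convert a WHOIS-style first/last address pair into CIDR networks."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))

def in_netrange(ip: str, first: str, last: str) -> bool:
    """True if `ip` falls inside the first..last range, inclusive."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in netrange_to_cidrs(first, last))

# NetRange as quoted in the post.
print(netrange_to_cidrs("174.129.0.0", "174.129.255.255"))
# Last octet is invented for the example; the post shows it as "nnn".
print(in_netrange("174.129.119.1", "174.129.0.0", "174.129.255.255"))
```

The quoted range collapses to the single block 174.129.0.0/16, which is the form most firewall and server deny rules expect.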

GaryK




msg:4013672
 5:37 pm on Oct 26, 2009 (gmt 0)

PostRank/2.0 (postrank.com)
174.129.141.nnn
ec2-174-129-141-nnn.compute-1.amazonaws.com
-----
Address: Amazon Web Services, Elastic Compute Cloud, EC2
NetRange: 174.129.0.0 - 174.129.255.255
-----
ROBOTS.TXT? No
-----

Did one HEAD request and left.

blend27




msg:4014003
 4:05 am on Oct 27, 2009 (gmt 0)

I have per page set to the Max allowed.

A proposed WebmasterWorld feature, for this thread particularly: if I click on #3 in the main nav menu from the main threads link list in this part of the universe, take me to the last post on page 3 minus 1.

amazonaws is tracked and 403d as always here.

Pfui




msg:4020607
 7:25 pm on Nov 6, 2009 (gmt 0)

Another botlike smackdown from amazonaws. Atypical pattern for most bots -- single files hit twice -- but clearly going off a hit list because not all files in /dir were hit (ditto any of thousands of files on the site). Most of the hit pages had been Twitter mentions/tweeted, but not all.

ec2-174-129-193-62.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.5) Gecko/2008120121 Firefox/3.0.5

robots.txt? NO

18:45:31 /dir/file07.html
18:45:32 /dir/file07.html
18:45:33 /dir/file01.html
18:45:34 /dir/file01.html
18:45:36 /dir/file06.html
18:45:36 /dir/file06.html
18:45:38 /dir/file04.html
18:45:38 /dir/file04.html
18:45:39 /dir/file02.html
18:45:40 /dir/file02.html
18:45:41 /dir/file05.html
18:45:42 /dir/file05.html
18:45:43 /dir/file03.html
18:45:44 /dir/file03.html
18:45:45 /dir/file09.html
18:45:46 /dir/file09.html
18:45:48 /dir/file08.html
18:45:48 /dir/file08.html
18:45:50 /dir/file10.html
18:45:51 /dir/file10.html

FWIW: Alleged UA is old; Mac FF is currently 3.5.5.

Pfui




msg:4022855
 2:35 am on Nov 11, 2009 (gmt 0)

(Emphasis mine...)

ec2-174-129-58-178.compute-1.amazonaws.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10

robots.txt? NO
Fake ref? YES: http://www.google.com/search?q=sitename.com/

Aside:

UAs with that User-Agent: intro swarmed out of nowhere about a year ago, as I recall. Used to see multiple scores a day; now maybe once or twice, tops. (Never did figure out who/what miscoded the string and made its hits so easy to send packing.) UAs ran the gamut. Here's a very partial listing:

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YPC 3.2.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPath.2)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
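
Several of the giveaways reported on this page are purely syntactic: the header name repeated inside the UA value, a stray closing quote, an empty field before the closing parenthesis. A quick Python sketch of checks for those is below. The function name and the flag wording are invented here, and the checks are only as good as the handful of examples in the thread; they say nothing about well-formed UAs that are simply fake.

```python
def ua_red_flags(ua: str) -> list:
    """Return a list of syntactic oddities found in a User-Agent string."""
    flags = []
    # "User-Agent: Mozilla/..." sent as the value itself.
    if ua.lower().startswith("user-agent:"):
        flags.append("header name repeated inside the value")
    # A lone opening or closing double quote.
    if ua.count('"') % 2:
        flags.append("unbalanced double quote")
    # Mismatched counts of round or square brackets.
    for open_ch, close_ch in ("()", "[]"):
        if ua.count(open_ch) != ua.count(close_ch):
            flags.append(f"unbalanced {open_ch}{close_ch}")
    # A trailing semicolon right before a closing parenthesis.
    if ";)" in ua.replace(" ", ""):
        flags.append("empty field before closing parenthesis")
    return flags

print(ua_red_flags(
    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; "
    "rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"
))
```

A non-empty result is a reason to look closer at the visitor, not proof of a bot on its own.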

Pfui




msg:4026807
 7:31 pm on Nov 17, 2009 (gmt 0)

Many, many AWS-based UAs still hitting home and specific pages. robots.txt? NEVER.

DAILY (multiple times; always HEAD requests):

ec2-75-101-197-164.compute-1.amazonaws.com
PycURL/7.18.2

ec2-174-129-141-109.compute-1.amazonaws.com
PostRank/2.0 (postrank.com)

WEEKLY (approx.; always HEAD requests):

ec2-174-129-91-231.compute-1.amazonaws.com
Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)

(Two days earlier, Netcraft sent its minion...)

lager.netcraft.com
Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)

