- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:
<html>
<head>
</head>
<body>
</body>
</html>
----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:
NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO
feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO
Twitturly / v0.5
robots.txt? NO
YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO
YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes
Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO
PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES
EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES
Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO
TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO
Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO
Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES
yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO
Mozilla/5.0
robots.txt? NO
Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES
TinEye
robots.txt? NO
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES
nnn/ttt (n)
robots.txt? YES
AideRSS/1.0 (aiderss.com)
robots.txt? NO
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO
----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO
WebClient
robots.txt? YES
----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:
Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
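For anyone who wants to do the same in a script rather than in server config, here is a minimal sketch of the hostname test. It assumes you already have the reverse-DNS name for the visiting IP (the lookup itself, and any forward confirmation, is not shown), and the function name is my own invention, not anything from a particular server or library:

```python
def is_amazonaws_host(hostname: str) -> bool:
    """Return True if hostname is amazonaws.com or any subdomain of it.

    Assumption: hostname came from a reverse-DNS lookup done elsewhere.
    """
    # Normalize: trim whitespace, drop a trailing root dot, ignore case.
    host = hostname.strip().rstrip(".").lower()
    # Require a dot before the suffix so "notamazonaws.com" does not match.
    return host == "amazonaws.com" or host.endswith(".amazonaws.com")


if __name__ == "__main__":
    for name in (
        "ec2-174-129-236-193.compute-1.amazonaws.com",
        "ia310738.us.archive.org",
    ):
        print(name, "->", "block" if is_amazonaws_host(name) else "allow")
```

The leading-dot check matters: a plain "ends with amazonaws.com" test would also catch unrelated look-alike domains.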
You may not operate network services such as:
Open proxies.
(etc.)
Two of the following IPs, the 79s, map to --
ec2-[yada-yada].eu-west-1.compute.amazonaws.com
-- and the remainder to this thread's (in)famous:
ec2-[yada-yada].compute-1.amazonaws.com
67.202.11.nnn
67.202.30.nn
67.202.44.nnn
67.202.47.nn
67.202.37.nnn
75.101.155.nnn
75.101.201.nn
79.125.50.nn
79.125.60.nn
174.129.110.nnn
174.129.140.nnn
174.129.156.nnn
174.129.145.nnn
174.129.210.nnn
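Since the ec2-... hostnames seen in this thread embed the IP with dashes in place of dots, going from hostname back to IP can be sketched as below. This is only an illustration based on the hostname shapes posted here (compute-1 and eu-west-1.compute); the regex and the function name are my assumptions, not an official AWS naming spec:

```python
import re
from typing import Optional

# Matches names like ec2-174-129-236-193.compute-1.amazonaws.com
# (pattern inferred from hostnames posted in this thread).
_EC2_HOST = re.compile(
    r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.(?:.+\.)?amazonaws\.com$",
    re.IGNORECASE,
)


def ec2_host_to_ip(hostname: str) -> Optional[str]:
    """Return the dotted IP embedded in an ec2-a-b-c-d hostname, else None."""
    match = _EC2_HOST.match(hostname.strip().rstrip("."))
    if match is None:
        return None
    octets = match.groups()
    # Reject anything that cannot be a real IPv4 octet.
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)


if __name__ == "__main__":
    print(ec2_host_to_ip("ec2-174-129-236-193.compute-1.amazonaws.com"))
```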
So basically, someone runs Tor on his system or server and provides a portal to others. Your server and my server see only the IP of the portal/proxy, with no indication of anything else, as these are transparent.
I just caught one doing it because it used the standard HTTP ports, so when I scanned port 80 it responded. When I searched for information about that particular IP, I found a site with a Tor list, and that list includes several amazonaws IPs.
ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
07/30 04:18:09 /
07/30 04:18:28 /
07/30 04:19:03 /m/
07/30 04:19:05 /mobile/
07/30 04:19:06 /mobi/
07/30 04:19:06 /iphone/
07/30 04:19:09 /pda/
07/30 04:19:25 /m/
07/30 04:19:28 /mobile/
07/30 04:19:32 /mobi/
07/30 04:19:33 /iphone/
07/30 04:19:33 /pda/
[edited by: Pfui at 7:46 pm (utc) on July 31, 2009]
ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; redditbot/1.0; +http://www.reddit.com/feedback)
07/27 09:59:07
07/27 09:59:09
07/27 10:00:11
07/27 10:00:12
07/27 10:01:06
07/27 10:01:08
07/27 10:02:08
07/27 10:02:09
07/27 10:03:08
07/27 10:03:09
07/27 10:04:07
07/27 10:04:08
07/27 10:05:08
07/27 10:05:09
07/27 10:06:11
07/27 10:06:12
07/27 10:07:13
07/27 10:07:14
07/27 10:08:08
07/27 10:08:10
Here's a zombied [en.wikipedia.org] amazonaws.com machine that was part of a small spam-botnet with Chinese fellow travelers:
ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:17
121.28.7.nnn
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:20
210.52.58.nn
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
07/31 09:50:25
(Botnets adore that UA, so much so that I 403 it from the get-go.)
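A rough sketch of that "403 it from the get-go" rule, for those who filter in application code. The set name and function name are hypothetical, and the exact-match approach is just one way to do it; most people would put the equivalent rule in their server config instead:

```python
# Exact UA strings to refuse outright. Only the one named above is listed;
# add your own as needed (this set is illustrative, not a recommended list).
FORBIDDEN_UAS = frozenset({
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
})


def should_forbid(user_agent: str) -> bool:
    """Return True if the request's UA exactly matches a forbidden string."""
    return user_agent.strip() in FORBIDDEN_UAS


if __name__ == "__main__":
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
    print(403 if should_forbid(ua) else "pass")
```

Exact matching keeps longer MSIE 6.0 strings (with .NET CLR tokens and the like) from being caught by accident.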
robots.txt? YES
See also the GingerCrawler thread: GingerCrawler/1.0 [webmasterworld.com]
robots.txt? Yes BUT... Three minutes after home page grab.
This just in (from -0700)... 403s to all files but robots.txt do not dissuade this new pest hailing from multiple AWS hosts:
08/28 00:40:22 /
08/28 00:43:43 /robots.txt
08/28 01:30:59 /
08/28 01:34:06 /robots.txt
08/28 01:51:42 /
robots.txt? Yes
ec2-174-129-236-193.compute-1.amazonaws.com
larbin_2.6.3 (larbin2.6.3@unspecified.mail)
09/24 13:56:58 /robots.txt
09/24 14:00:55 /robots.txt
09/24 14:03:55 /robots.txt
09/24 14:09:55 /robots.txt
09/24 14:13:22 /robots.txt
09/24 14:23:35 /robots.txt
09/24 14:50:42 /robots.txt
09/24 15:01:36 /robots.txt
09/24 15:08:40 /robots.txt
09/24 15:12:41 /robots.txt
O, if only I had a nickel for every useless, log-filling hit from amazonaws.com!
robots.txt? Yes
robots.txt? Yes BUT -- ignored it.
Last Feb. (up-thread; mssg.#: 3848081), the preceding UA was A-OK w/ robots.txt. No longer, at least not when run by amazonaws.com.
Still fully compliant when run from archive.org using this one:
ia310738.us.archive.org
ia_archiver-web.archive.org
Given the current "google.com -- spoof? spider? botnet zombie? employee? [webmasterworld.com]" mystery sightings, I guess everything could be fake.
Took the default root page and one xml file then left.
ec2-174-129-193-62.compute-1.amazonaws.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.5) Gecko/2008120121 Firefox/3.0.5
robots.txt? NO
18:45:31 /dir/file07.html
18:45:32 /dir/file07.html
18:45:33 /dir/file01.html
18:45:34 /dir/file01.html
18:45:36 /dir/file06.html
18:45:36 /dir/file06.html
18:45:38 /dir/file04.html
18:45:38 /dir/file04.html
18:45:39 /dir/file02.html
18:45:40 /dir/file02.html
18:45:41 /dir/file05.html
18:45:42 /dir/file05.html
18:45:43 /dir/file03.html
18:45:44 /dir/file03.html
18:45:45 /dir/file09.html
18:45:46 /dir/file09.html
18:45:48 /dir/file08.html
18:45:48 /dir/file08.html
18:45:50 /dir/file10.html
18:45:51 /dir/file10.html
FWIW: Alleged UA is old; Mac FF is currently 3.5.5.
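The every-file-twice pattern in that log is easy to flag mechanically. Here is a small sketch, assuming log lines trimmed to "HH:MM:SS /path" as in the excerpt above; the function name and the two-second window are my own choices, and it does not handle logs that cross midnight:

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def find_rapid_repeats(lines: Iterable[str], window_seconds: int = 2) -> List[str]:
    """Return paths requested again within window_seconds of the previous hit."""
    window = timedelta(seconds=window_seconds)
    last_seen: Dict[str, datetime] = {}
    repeats: List[str] = []
    for line in lines:
        stamp, path = line.split(maxsplit=1)
        path = path.strip()
        when = datetime.strptime(stamp, "%H:%M:%S")
        previous = last_seen.get(path)
        # Only count a repeat if it follows the earlier hit closely in time.
        if previous is not None and timedelta(0) <= when - previous <= window:
            repeats.append(path)
        last_seen[path] = when
    return repeats


if __name__ == "__main__":
    sample = [
        "18:45:31 /dir/file07.html",
        "18:45:32 /dir/file07.html",
        "18:45:33 /dir/file01.html",
        "18:45:34 /dir/file01.html",
    ]
    print(find_rapid_repeats(sample))
```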
ec2-174-129-58-178.compute-1.amazonaws.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10
robots.txt? NO
Fake ref? YES: http://www.google.com/search?q=sitename.com/
Aside:
UAs with that User-Agent: intro swarmed out of nowhere about a year ago, as I recall. Used to see multiple scores a day; now maybe once or twice, tops. (Never did figure out who/what miscoded the string and made its hits so easy to send packing.) UAs ran the gamut. Here's a very partial listing:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YPC 3.2.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPath.2)
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
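Spotting these is about as simple as it gets: the header value itself begins with the header name. A minimal sketch, assuming you have the raw UA value as a string (the function name is made up for illustration):

```python
def has_doubled_ua_prefix(ua_value: str) -> bool:
    """Return True if the UA value itself starts with a literal 'User-Agent:'."""
    return ua_value.lstrip().lower().startswith("user-agent:")


if __name__ == "__main__":
    bad = "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
    good = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
    print(has_doubled_ua_prefix(bad), has_doubled_ua_prefix(good))
```

No legitimate browser sends its UA that way, so the test can run ahead of any per-UA rules.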