Forum Moderators: open
- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:
<html>
<head>
</head>
<body>
</body>
</html>
----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:
NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO
feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO
Twitturly / v0.5
robots.txt? NO
YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO
YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes
Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO
PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES
EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES
Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO
TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO
Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO
Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES
yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO
Mozilla/5.0
robots.txt? NO
Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES
TinEye
robots.txt? NO
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES
nnn/ttt (n)
robots.txt? YES
AideRSS/1.0 (aiderss.com)
robots.txt? NO
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO
----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO
WebClient
robots.txt? YES
----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:
Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
My only concern: what happens when the "next big thing" launches on their cloud-computing facility, and those of us blocking it because of all the bad-bot noise miss out on being in the Google killer until it's too late?
-----
Jim:
"robots.txt? YES" <== Asked for it
"robots.txt? NO" <== Didn't ask
To my recollection, most that asked for robots.txt honored it. I didn't really pay attention because that's the only file I let unauthorized bots -- and amazonaws.com, the host -- access.
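The YES/NO bookkeeping above is easy to automate. A minimal sketch, assuming Apache combined-format access logs (the sample lines and bot names below are made up for illustration; real logs vary):

```python
import re
from collections import defaultdict

# Combined Log Format: IP - - [date] "GET /path HTTP/1.1" status size "referer" "user-agent"
LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def robots_askers(log_lines):
    """Return {user_agent: True/False} -- did this UA ever request /robots.txt?"""
    asked = defaultdict(bool)
    for line in log_lines:
        m = LINE.search(line)
        if not m:
            continue
        path, ua = m.groups()
        asked[ua] = asked[ua] or path == "/robots.txt"
    return dict(asked)

sample = [
    '1.2.3.4 - - [29/Jan/2009:09:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 24 "-" "TinEye"',
    '1.2.3.4 - - [29/Jan/2009:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "WebClient"',
]
print(robots_askers(sample))  # TinEye asked for robots.txt; WebClient never did
```

Whether a bot that asked then *honored* the file is a separate check, of course -- that means looking for disallowed paths in its later requests.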
By the way, here's another one. If this were an eBay item, they'd get yanked for keyword-spamming the spider's brand!
Nutch/Nutch-1.0-dev (A Nutch-based crawler.; [lucene.apache.org...] nutch-agent AT lucene.apache.org)
robots.txt? YES
-----
Bill:
At this point, I don't know what I think about AWS, other than I'm amazed at (& irked by) all of its/their bots. (Plus their CloudFront info is so PC-centric that this Mac/Apache person's eyes just glaze over.) Why spider constantly, I wonder? Are they trying to rival Google? What are they doing with all of the data? Where are they going with all of their really geeky doodads?
Oh, to be a fly on the wall in Bezos's office, eh?
-----
All:
What are your thoughts on their Web Services? Are you intrigued? Have you tested anything and noticed more spidering? Their cookie systems are fantastically inbred -- notice anything new in that dept.?
What I do know is that between amazon.com and imdb.com and amazonfresh.com (fresh.amazon.com) and amazonaws.com (aws.amazon.com), they're all over the map. Literally. And their spiders are squished all over my server. Literally.
: )
[edited by: Pfui at 9:21 pm (utc) on Jan. 29, 2009]
Generally, I'm using Amazon Web Services (the Alexa information thingy) and the e-commerce web service, though neither of them that heavily (my bill averages $3/month). I usually don't mind bot traffic (traffic is free and my servers can easily handle the load right now) as long as the bot behaves. If it doesn't, I don't care whether it's an official Amazon bot or just one hosted on their platform: it gets blocked, most likely via iptables, so I don't have all those 403 errors in my logs.
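Blocking at the packet-filter level, as described above, keeps the requests out of the web server's logs entirely. A minimal sketch of the idea (the address range below is purely illustrative, not a verified Amazon allocation; look up the host's actual netblocks before blocking anything):

```
# Drop web traffic from an offending range before the web server ever sees it.
# 67.202.0.0/18 is an example range only -- verify real netblocks first.
iptables -A INPUT -s 67.202.0.0/18 -p tcp --dport 80 -j DROP
```

Unlike an .htaccess 403, a DROP rule gives the bot no response at all, which also saves the bandwidth of serving error pages.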
Mozilla/5.0 (compatible; kmky-not-a-bot/0.2; [kilomonkey.com...] )
robots.txt? NO
Even the bot's home page is rude ('notabot.txt' timed out): "This Server contains no useful information. The Domain name is not for sale. Goodbye."
Well, 403 back atcha, not-notabot.
(Hmm... That domain's listed to Kilo Monkey LLC, and an Aaron Flin -- of Dogpile metasearch fame?)
Amazon.com Associates Central - Help
No relevant products are showing on my page. Why is this the case?
"You must allow our spider (Mozilla/5.0 (compatible; AMZNKAssocBot/4.0)) to crawl your website. The crawl is needed in order to identify the content of your website and provide matching products. If you do not allow our spider to crawl your website(s) we will display selected products from our product lines in the Omakase Links."
FWIW, I've found this name works A-OK in rewrites:
Mozilla.*AMZNKAssocBot
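In an Apache .htaccess, that pattern might be used along these lines. This is a hedged sketch, not the poster's actual rules, and note that matching %{REMOTE_HOST} only works if hostname lookups (or a prior reverse-DNS rewrite map) are enabled:

```apache
# Sketch: 403 amazonaws.com-hosted UAs, but let the Associates bot
# fetch pages and let everyone fetch robots.txt.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Mozilla.*AMZNKAssocBot
RewriteCond %{REMOTE_HOST} \.amazonaws\.com$ [NC]
RewriteRule !^robots\.txt$ - [F]
```

The negated first condition is what makes the UA pattern a whitelist entry rather than a block target.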
Oh, and coming up, Yet Another AWS Bot. (If nothing else, this thread provides a comprehensive list of their automatons, eh? :)
Hmmm,
Let me ask it, hold on, brb.... cool I am back! I asked and it is not saying anything, I even tried a "pretty please" thingy, still not talking to me, and I am the owner. I think I'll have to learn that new Telepathic computer language that enables humans to talk to text files, the one that everybody is talking about you know...
It's nice and shiny though ;)
UA: -
robots.txt? NO
Mozilla/5.0 (compatible; NetcraftSurveyAgent/1.0; +info@netcraft.com)
robots.txt? NO
Python-urllib/2.4
robots.txt? NO
ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
robots.txt? YES
P.S./FWIW:
In addition to owning AmazonAWS.com, IMDb.com, A9.com, and who knows what else, Amazon.com owns Archive.org (WayBackMachine; Internet Archive) a.k.a. Alexa.com a.k.a. ia_archiver and ia_archiver-web.
Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do.
But a blanket "Disallow: /" means Do Not Crawl Here. Go Away. Now. And that Disallow includes the root page because it's in the /rootdir. Even if the root page retrieval is basically simultaneous with that of robots.txt (as is often the case), there still should be no caching or referencing of the root page's data.
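For reference, the blanket ban is just two lines. The trailing slash matters: a Disallow with an empty value means the opposite (allow everything):

```
User-agent: *
Disallow: /
```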
Yeah. And if wishes were horses... :)
"Nokia6680/1.0 (4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html find@mtld.mobi)"
This one reads robots.txt, but doesn't care about its contents.
"To my recollection, most that asked for robots.txt honored it."
Really bad bots request robots.txt in order to get at the disallowed areas and to confuse webmasters ("robots.txt? YES").
"Technically, there's nothing in robots.txt that prevents any bot from doing whatever the heck its runners program it to do."
Requesting my robots.txt leads to a site-wide ban.
"Requesting my robots.txt leads to a site-wide ban."
I'm curious as to how you do that, and also why? There are some bots that actually honor it. :)
That said, before I figured out robots.cgi, I was really reluctant to let any bot read a directory-detailed robots.txt, even a baited one. Now, the riffraff only see Disallow: /
FWIW, I allow all to access robots.txt but other than the Big 3, I block access to everything other than robots.txt, just in case. Additionally, the Big 3 are kept out of numerous directories via subdir .htaccess --
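The robots.cgi idea mentioned above can be as simple as branching on the User-Agent. A hypothetical sketch -- the whitelist names and disallowed paths are illustrative, not anyone's real configuration:

```python
# Serve a detailed robots.txt only to known majors; riffraff see a blanket ban.
WHITELIST = ("Googlebot", "Slurp", "msnbot")  # the "Big 3" of the era

DETAILED = """User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

BLANKET = """User-agent: *
Disallow: /
"""

def robots_body(user_agent: str) -> str:
    """Return the robots.txt body to serve for this User-Agent string."""
    if any(name in user_agent for name in WHITELIST):
        return DETAILED
    return BLANKET

print(robots_body("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

This way an unknown bot never learns which directories exist, while the whitelisted crawlers still get real directory-level instructions.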
Whoa.
As I write this, I'm again whelmed by what a time-and-effort drain it is keeping the ramparts intact. An obsessively intriguing drain, but a drain nonetheless. And all of it unseen by site owners!
Anyway, getting back to AWS's bevy o' bots, Yet Another:
robots.txt? YES
UA: goroam/1.0-SNAPSHOT (goraom geo crawler; [goroam.net...] info@goroam.net)
Requested robots.txt - Yes
Obeyed it - Yes (I white-list bots, so all but a tiny few are banned)
Came from: 67.202.0.n
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.18) Gecko/20081112 Fedora/2.0.0.18-1.fc8 Firefox/2.0.0.18
Could be a page thumbnailer, such as the one Ask.com uses for the "preview" function in its search results.
GaryK,
Alexa/Internet Archiver uses Elastic Compute Cloud (EC2) services.
So you may want to allow those sub-ranges of EC2.
Jim