So I go to Alexa.com for my once-monthly check on how -- and what -- they're doing with my stuff, all of which is coded NOARCHIVE and has been for months and months. And a search for mysite.com yields links AND cached pages. Here's their header info for one of my pages:
"This is a version of http://www.example.com/dir/filename.html as it looked when our crawler examined the site on 6/4/2006. The page you see below is the version in our index that was used to rank this page in the results to your recent query. This is not necessarily the most recent version of the page - to see the most recent version of this page..."
That's the same scrape date and language for the other cached pages, too.
So what's the problem? Ah. Well, here's the code in every single one of the original AND cached-copy pages:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
And here's the default code in robots.txt:
User-agent: *
Disallow: /
And here's the specific robots.txt entry vis-a-vis Alexa / ia_archiver / ia_archiver-web.archive.org:
# HOST: archive.org
User-agent: ia_archiver
User-agent: ia_archiver-web.archive.org
# heritrix:
User-agent: archive.org_bot
Disallow: /
And one of the Cached pages is a landing page in an every-single-bot-known-to-man robots.txt-Disallowed AND mod_rewrite-blocked directory.
DogGONEit.
I even Disallow robots.txt in robots.txt for every specified entity. But right there in Alexa's Website Directory section, there are not one but two links to, you guessed it:
http://www.example.com/robots.txt
http://example.com/robots.txt
Apropos of what's coming up, below, the links are Alexa's:
[alexa.com...]
-----
2.) When ALEXA means MSN means msnscache.com
Curiously, all of the "Cached"-marked links route through the easily tpyo'd "msnscache" --
[cc.msnscache.com...]
-- and carry this reads-like-a-bad-translation disclaimer:
"MSN is not affiliated with the content nor parties responsible for the page displayed below."
Caches? Dammit! Alexa? MSN? Huh?
But wait! There's more.
-----
3.) When "ia_archiver" means Mozilla. Firefox. Or no UA at all.
Sleuthing a bit... "ia_archiver" didn't visit on June 4th, the stated date of the scrape. msnbot, but not msnbot-media, DID hit on the 4th, and it retrieved robots.txt 13 times, in addition to one of the pages Alexa now shows as Cached. Beats me who/what snagged the other one -- neither agent, as properly identified, hit it on the 4th.
Btw, Alexa uses its own identified User-agents in connection with its Internet Archive / Wayback Machine -- "ia_archiver" and "ia_archiver-web.archive.org" -- AND regular browser User-agents, too, and the latter do NOT necessarily check for robots.txt. (Amazon and its A9 use still more.) So if you're checking your logs, look for:
207.241.236.
209.237.238.
crawling025.archive.org
ocr.book1.archive.org
vm01-staging.alexa.com
"Mozilla/5.0 (compatible;archive.org_bot/1.7.1; collectionId=316; Archive-It; +http://www.archive-it.org)"
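If you'd rather not eyeball the logs, here is a minimal Python sketch of that check. It assumes an Apache-style access log where the client IP or hostname is the first field on each line; the function names and the sample log lines are made up for illustration, and only the IP prefixes, hostnames and User-agent strings come from the list above.

```python
# Markers taken from the list above. Everything else here (function names,
# sample lines, the assumption that the client is the first field) is
# illustrative only.
CLIENT_PREFIXES = ("207.241.236.", "209.237.238.")
CLIENT_HOSTS = (
    "crawling025.archive.org",
    "ocr.book1.archive.org",
    "vm01-staging.alexa.com",
)
UA_MARKERS = ("archive.org_bot", "ia_archiver")


def classify(line):
    """Return why a log line looks archive.org/Alexa-related, or None."""
    fields = line.split(None, 1)
    if not fields:
        return None
    client = fields[0]
    if client.startswith(CLIENT_PREFIXES):
        return "client-ip"
    if client in CLIENT_HOSTS:
        return "client-host"
    lowered = line.lower()
    if any(marker in lowered for marker in UA_MARKERS):
        return "user-agent"
    return None


def find_archive_hits(log_lines):
    """Return (line_number, reason) for every suspicious line."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        reason = classify(line)
        if reason is not None:
            hits.append((number, reason))
    return hits


# Hypothetical log lines, for demonstration only.
SAMPLE_LOG = [
    '207.241.236.10 - - [04/Jun/2006:10:00:00 +0000] "GET /dir/filename.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows; U)"',
    'vm01-staging.alexa.com - - [04/Jun/2006:10:05:00 +0000] "GET / HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.7 - - [04/Jun/2006:10:06:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible;archive.org_bot/1.7.1; collectionId=316; Archive-It; +http://www.archive-it.org)"',
    '192.0.2.8 - - [04/Jun/2006:10:07:00 +0000] "GET / HTTP/1.1" 200 512 "-" "msnbot/1.0"',
]

if __name__ == "__main__":
    for number, reason in find_archive_hits(SAMPLE_LOG):
        print(number, reason)
```

Matching the client field separately from the User-agent string matters here, because the browser-UA visits would slip past a check that looks at the User-agent alone.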
Almost done. Really.
-----
4.) When will they ever learn?
In past months, I and many, many others have sent e-mails to Alexa and filed Web-form complaints about, for example, their showing private WHOIS info, and about their showing porn sites as related sites or worse, as co-owned sites:
Alexa: Now showing other sites owned
[webmasterworld.com...]
Amazon-owned Alexa breaks rules. Again.
Now hitting bare and badly.
[webmasterworld.com...]
Now this. Caches. Code violations. Cloaked User-agents. More untrustworthiness. AND now MSN's in the mix, too. So many bad and misbehaving bots, big and little, so little time.
So-o-o-o- if anyone's still awake...
Are you finding your own Do-Not-Crawl and/or Do-Not-Cache pages in Alexa's SERPs? Are you getting fed up with their screw-ups?
I would change that to this (with a blank line before the next User-agent line each time):
# HOST: archive.org
User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

# heritrix:
User-agent: archive.org_bot
Disallow: /
Without the explicit instruction for each user-agent, directly after the line for that user-agent alone, you are possibly allowing the site to be spidered.
Without the blank line before the next User-agent line, that User-agent will not spot the User-agent line that applies to it. I have seen this bug with Google on several sites.
With the whole site disallowed from spidering, they should never get to see the robots tags on the pages themselves.
Either use robots.txt or the meta robots tag, but not both. They often conflict.
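For anyone who wants to test the two record layouts discussed above (several User-agent lines sharing one Disallow, versus one record per User-agent), here is a rough sketch using Python's standard-library urllib.robotparser. It only shows how that one parser reads the two files. It says nothing about how Alexa's, MSN's or Google's crawlers actually parse them, which is the real question here. The constant and function names are mine; the "msnbot" entry is included only as a control that has no matching record.

```python
from urllib.robotparser import RobotFileParser

# Layout 1: several User-agent lines sharing a single Disallow.
GROUPED = """\
# HOST: archive.org
User-agent: ia_archiver
User-agent: ia_archiver-web.archive.org
# heritrix:
User-agent: archive.org_bot
Disallow: /
"""

# Layout 2: one record per User-agent, separated by blank lines.
SPLIT = """\
# HOST: archive.org
User-agent: ia_archiver
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

# heritrix:
User-agent: archive.org_bot
Disallow: /
"""

AGENTS = ["ia_archiver", "ia_archiver-web.archive.org", "archive.org_bot", "msnbot"]


def blocked_agents(robots_text, agents,
                   url="http://www.example.com/dir/filename.html"):
    """Map each agent name to True if this parser considers the URL blocked."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    print("grouped:", blocked_agents(GROUPED, AGENTS))
    print("split:  ", blocked_agents(SPLIT, AGENTS))
```

If a given parser treats both layouts the same way, the layout is not what lets pages through for that parser. A crawler with its own stricter or buggier parser could still behave differently, so the one-record-per-agent form remains the safer choice.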
"If we find a NOINDEX or NOARCHIVE tag, we throw away the copy. If there is a NOFOLLOW tag, the robot will not follow any links found on that page. This allows users to control access to their own data, without needing their site administrators to update "robots.txt"."
It's my experience that the majors heed their 'own' robots.txt entry before any generic one. Thus Googlebot and msnbot follow my instructions and are allowed to go various places, and they do. Alexa is not, and it doesn't.
However, the problematic part of this is Alexa using MSN SERPs -- because it's in those SERPs that Do-Not-Crawl / Do-Not-Cache violations are readily evident, and despite this, from MSN's "Control which pages of your website are indexed" info linked above:
"You can prevent MSNBot and other standards-compliant crawlers from crawling a server or collecting information and links from specific pages on your website by using a robots.txt file and/or meta tags."
Long story short --
A quick check via [search.msn.com...] shows that indeed, MSN includes Cached pages in its SERPs -- the ones used by Alexa -- the end result being that both SEs are in effect violating NOARCHIVE tags and robots.txt-Disallowed instructions, and their own stated codes of conduct.
So, again, are you finding your own Do-Not-Crawl and/or Do-Not-Cache pages in Alexa's/MSN's SERPs? In my case, the violative SERPs are not site-wide, but they shouldn't be 'visible' at all.
ia_archiver etc. is for the Internet Archive (archive.org) only and not for the Alexa search results, which come from MSN Search. What rules do you have for MSNBot? Secondly, none of the sites I have using noarchive have a cache link in Alexa results. As you say that there are cache links in MSN Search, the problem is with them not Alexa which is merely reusing the data. The big question is why MSN Search cannot or did not respect the noarchive, which they usually have no problem with. Do your pages validate? Is there anything before the robots meta tag which could offer an explanation why MSNbot is ignoring it? This is an MSN Search problem, not an Alexa problem.
Again, Googlebot (et al) and msnbot/msnbot-media follow my instructions. They're allowed to go various places, and they do. Alexa (via any/all of its bots) is not, and it doesn't. (Rarely G's and MSN's bots try to go where they're not supposed to -- presumably following inbound links -- but I block them with sub-directory rewrites.)
MSN is including Cached pages in its SERPs -- the SERPs used by Alexa. I first found them while checking Alexa. Subsequently I checked MSN. Same thing. Same problem.
(In April, Alexa was using Google's SERPs. I don't know when they changed to MSN's.)
g1smd: My robots.txt file is protocol-proper and nearly verbatim with what each major SE advises re their own, and increasingly specific, and varied, requirements (see the links I provided). Oh, and it also works. Clearly, you think not. C'est la vie.
noarchive robots meta tag not by robots.txt anyway. So MSNbot is (unusually) ignoring the noarchive robots meta tag. Does the page validate? Is there a possible technical reason why MSNbot either passes over or does not succeed in parsing the robots meta tag? This can include improperly-nested elements in the HTML prior to the meta element. What does the top of the page (from the first line of the source code to the body element) look like?
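One quick way to rule out the "tag can't be parsed" theory is to run the page source through a tolerant HTML parser and see whether the robots directives come out. Here is a minimal sketch with Python's standard-library html.parser. The class and function names are mine, the sample page is a hypothetical stand-in built around the META line quoted earlier in the thread, and a real crawler's parser may well be stricter or looser than this one.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots meta directives that appear before the body element."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "body":
            self.in_body = True
        if tag != "meta" or self.in_body:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of lowercased robots directives found in the head."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


# Hypothetical page, for demonstration only.
SAMPLE_PAGE = """<HTML>
<HEAD>
<TITLE>Page Title</TITLE>
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
</HEAD>
<BODY><P>Hello</P></BODY>
</HTML>"""

if __name__ == "__main__":
    print(robots_directives(SAMPLE_PAGE))
```

If the directive shows up here but the page is still cached, the markup is probably not the culprit. The check deliberately ignores meta elements that appear after the body starts, on the assumption that a crawler may do the same.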
1.) Over 250,000 pages are marked in a similar fashion:
<HTML>
<HEAD>
<TITLE>Page Title</TITLE>
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<META HTTP-EQUIV="Reply-to" CONTENT="webmaster@example.com">
<META NAME="Description" CONTENT="Descriptive sentences">
<META NAME="Keywords" CONTENT="50 unique, site-specific words">
<META NAME="Copyright" CONTENT="Legally sufficient statement">
<META NAME="Author" Content="Name">
</HEAD>
2.) Approx. 200 pages are allowed to be crawled at all, and only by a select few majors. MSN's results total 193 (www.example.com) and 760 (example.com). Google's are 185 and 246; Yahoo's are 126 and 123.
3.) Fewer than 20 pages contain pre-HEAD #exec commands and/or intra-HEAD scripts. Depending on how the page was generated, some have an old <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> line.
For example, two pages with Caches in MSN's SERPs are straight .html, and one has a bit of JavaScript that also appears on five or so other pages. Both have the old DOCTYPE. Both have been in existence and infrequently tweaked for at least five years. Neither is rewritten from nor redirected to any other page.
4.) One of the pages is allowed (not Disallowed) in robots.txt. The other is explicitly Disallowed by directory:
User-agent: msnbot
(snipped)
Disallow: /dirX
...BUT...
5.) Further clicking through MSN's SERPs shows the scope of the problem is significantly larger, and more erratic, than first thought.
On "Page 13..."
9 entries total (one with a full description; see P.S.)
4 Cached
3 Disallowed
On "Page 17 ..."
10 entries total (none with a full description)
0 Cached
7 Disallowed
I can't go any further than "Page 20..."
3 entries total (none with a full description)
0 Cached
3 Disallowed
6.) It's almost as if MSN spidered a lot of pages/posts over time and never removed the now-Disallowed files, because msnbot/msnbot-media simply doesn't crawl the Disallowed pages, and hasn't for two years or more.
HTH
P.S.
FWIW, throughout MSN's SERPs, a result shows either the META Title plus details and the page URL, OR the site URL as the 'title' and the page URL. For example:
Page Title
DMOZ description OR META "Description" snip + first line(s) = ~28 words
http://www.example.com

www.example.com
http://www.example.com/dir/filename.html
Go figure.