Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches

1.) When NO means YES

So I go to Alexa.com for my once-monthly check on how -- and what -- they're doing with my stuff, all of which is coded NOARCHIVE and has been for months and months. And a search for mysite.com yields links AND cached pages. Here's their header info for one of my pages:

"This is a version of http://www.example.com/dir/filename.html as it looked when our crawler examined the site on 6/4/2006. The page you see below is the version in our index that was used to rank this page in the results to your recent query. This is not necessarily the most recent version of the page - to see the most recent version of this page..."

That's the same scrape date and language for the other cached pages, too.

So what's the problem? Ah. Well, here's the code in every single one of the original AND cached-copy pages:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">
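For anyone who wants to double-check their own pages before pointing fingers, here's a minimal Python sketch that looks for a NOARCHIVE directive in a page's robots META tag. The names RobotsMetaParser and has_noarchive are made up for illustration; it only confirms that the tag is in the HTML, not that any crawler honors it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noarchive(html):
    """Return True if the page carries a robots NOARCHIVE directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOARCHIVE"></head></html>'
print(has_noarchive(page))
```

Feed it the HTML of both the original page and the cached copy; if both come back True, the directive was there to be read.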

And here's the default code in robots.txt:

User-agent: * 
Disallow: /

And here's the specific robots.txt entry vis-a-vis Alexa / ia_archiver / ia_archiver-web.archive.org:

# HOST: archive.org  
User-agent: ia_archiver 
User-agent: ia_archiver-web.archive.org 
# heritrix: 
User-agent: archive.org_bot 
Disallow: /
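As a sanity check that those entries say what I think they say, here's a rough sketch using Python's standard urllib.robotparser. Assumptions: the two records above sit in one file separated by a blank line, and the helper name allowed_agents is mine. It shows how Python's parser reads the file, which is not necessarily how any particular crawler reads it.

```python
from urllib.robotparser import RobotFileParser

# The two records quoted above, assumed to live in the same robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

# HOST: archive.org
User-agent: ia_archiver
User-agent: ia_archiver-web.archive.org
# heritrix:
User-agent: archive.org_bot
Disallow: /
"""

AGENTS = [
    "ia_archiver",
    "ia_archiver-web.archive.org",
    "archive.org_bot",
    "msnbot",
]


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent to True/False: may it fetch the URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


results = allowed_agents(
    ROBOTS_TXT, AGENTS, "http://www.example.com/dir/filename.html"
)
for agent, allowed in results.items():
    print(agent, "allowed" if allowed else "disallowed")
```

Under this parser every agent in the list comes back disallowed, msnbot included, since it falls through to the catch-all record.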

And one of the Cached pages is a landing page in an every-single-bot-known-to-man robots.txt-Disallowed AND mod_rewrite-blocked directory.

DogGONEit.

I even Disallow robots.txt in robots.txt for every specified entity. But right there in Alexa's Website Directory section, there are not one but two links to, you guessed it:

http://www.example.com/robots.txt
http://example.com/robots.txt

Apropos of what's coming up, below, the links are Alexa's:

[alexa.com...]

-----
2.) When ALEXA means MSN means msnscache.com

Curiously, all of the "Cached"-marked links route through the easily tpyo'd "msnscache" --

[cc.msnscache.com...]

-- and carry this reads-like-a-bad-translation disclaimer:

"MSN is not affiliated with the content nor parties responsible for the page displayed below."

Caches? Dammit! Alexa? MSN? Huh?

But wait! There's more.

-----
3.) When "ia_archiver" means Mozilla. Firefox. Or no UA at all.

Sleuthing a bit... "ia_archiver" didn't visit on June 4th, the stated date of the scrape. msnbot, but not msnbot-media, DID hit on the 4th, and it retrieved robots.txt 13 times, in addition to one of the pages Alexa now shows as Cached. Beats me who/what snagged the other one -- neither agent, as properly identified, hit it on the 4th.
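If you want to do the same sleuthing on your own logs, here's a minimal sketch that counts hits per User-agent for one day, optionally limited to one path such as /robots.txt. It assumes Apache-style combined log format; the sample lines, IP addresses, and the name hits_by_agent are invented for illustration and are not from my logs.

```python
import re
from collections import Counter

# Combined log format: host ident user [date:time zone] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def hits_by_agent(lines, day, path=None):
    """Count requests per User-agent on `day` (e.g. '04/Jun/2006')."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("day") != day:
            continue
        if path is not None and match.group("path") != path:
            continue
        counts[match.group("agent")] += 1
    return counts


# Made-up sample lines using documentation-only IP ranges.
MSNBOT = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
SAMPLE = [
    '192.0.2.10 - - [04/Jun/2006:01:02:03 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "%s"' % MSNBOT,
    '192.0.2.10 - - [04/Jun/2006:01:02:09 +0000] "GET /dir/filename.html HTTP/1.0" 200 5120 "-" "%s"' % MSNBOT,
    '198.51.100.7 - - [04/Jun/2006:05:00:00 +0000] "GET /other.html HTTP/1.1" 200 2048 "-" "-"',
    '192.0.2.10 - - [05/Jun/2006:01:02:03 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "%s"' % MSNBOT,
]

print(hits_by_agent(SAMPLE, "04/Jun/2006", "/robots.txt"))
print(hits_by_agent(SAMPLE, "04/Jun/2006"))
```

Requests with no User-agent at all show up under "-", which is exactly the kind of visitor worth a second look.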

Btw, Alexa uses its own identified User-agents in connection with its Internet Archive / Wayback Machine -- "ia_archiver" and "ia_archiver-web.archive.org" -- AND regular browsers, too, and the latter do NOT necessarily check for robots.txt. (Amazon and its A9 use still more.) So if you're checking your logs, look for:

207.241.236. 
209.237.238. crawling025.archive.org 
ocr.book1.archive.org 
vm01-staging.alexa.com 
"Mozilla/5.0 (compatible;archive.org_bot/1.7.1; collectionId=316; Archive-It; +http://www.archive-it.org)"
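Here's a rough sketch of how that checklist could be applied to a (host, User-agent) pair pulled from a log. The IP prefixes and agent tokens come from the list above; widening the three hostnames to "anything ending in .archive.org or .alexa.com" is my own assumption, as is the function name archive_related. Hostname matching only works if your logs record reverse-DNS names.

```python
IP_PREFIXES = ("207.241.236.", "209.237.238.")
# Assumption: generalizes the hostnames listed above to their parent domains.
HOST_SUFFIXES = (".archive.org", ".alexa.com")
AGENT_TOKENS = ("ia_archiver", "archive.org_bot")


def archive_related(host, user_agent):
    """Return the reasons (possibly none) a hit looks Alexa/archive.org-related."""
    host = host.lower()
    agent = user_agent.lower()
    reasons = []
    if host.startswith(IP_PREFIXES):
        reasons.append("ip prefix")
    if host.endswith(HOST_SUFFIXES):
        reasons.append("hostname")
    if any(token in agent for token in AGENT_TOKENS):
        reasons.append("user-agent")
    return reasons


# A browser-looking User-agent from a listed IP block still gets flagged.
print(archive_related("207.241.236.5", "Mozilla/5.0 (Windows; U) Firefox/1.5"))
print(archive_related(
    "crawling025.archive.org",
    "Mozilla/5.0 (compatible;archive.org_bot/1.7.1; collectionId=316; "
    "Archive-It; +http://www.archive-it.org)",
))
print(archive_related("192.0.2.1", "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"))
```

Checking the host as well as the User-agent is the point: a visitor calling itself Firefox, or nothing at all, is still caught by its address.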

Almost done. Really.

-----
4.) When will they ever learn?

In past months, I and many, many others have sent e-mails to Alexa and filed Web-form complaints about, for example, their showing private WHOIS info, and about their showing porn sites as related sites or worse, as co-owned sites:

Alexa: Now showing other sites owned
[webmasterworld.com...]

Amazon-owned Alexa breaks rules. Again.
Now hitting bare and badly.
[webmasterworld.com...]

Now this. Caches. Code violations. Cloaked User-agents. More untrustworthiness. AND now MSN's in the mix, too. So many bad and misbehaving bots, big and little, so little time.

So-o-o-o, if anyone's still awake...

Are you finding your own Do-Not-Crawl and/or Do-Not-Cache pages in Alexa's SERPs? Are you getting fed up with their screw-ups?

NOARCHIVE and robots.txt instructions meaningless. Three strikes?
