Alexa ignoring noarchive

caching pages despite meta directive

amznVibe

2:03 pm on Jul 29, 2008 (gmt 0)

I just noticed Alexa is caching nearly 5000 pages from one of my sites.
Even in the cached copy you can see the noarchive meta tag!

They also still identify themselves as the Wayback Machine's ia_archiver.
The pages do not appear on the Wayback Machine, however.

Do they have a way to request cached copy removal or do they just not care?

Atomic

2:24 am on Nov 1, 2008 (gmt 0)

I just noticed this myself! What's going on here?!

amznVibe

7:25 am on Nov 1, 2008 (gmt 0)

Apparently they have no motivation to follow basic search engine ethics.

The top 4 (or 5) engines follow it, but maybe Alexa feels special.

piatkow

8:47 am on Nov 1, 2008 (gmt 0)

Luckily I have never seen a hit on my site from Alexa. As my site includes gig listings for small venues that have a bad habit of closing down at short notice, I don't want complaints from people who looked at a month-old archive.

tangor

9:00 am on Nov 1, 2008 (gmt 0)

Denied ia_archiver for years. Will continue to do so.

DilipShaw

9:01 am on Nov 1, 2008 (gmt 0)

Alexa is not considered a mainstream search engine, so why bother! Secondly, how many of us search Alexa? Probably none!

frakilk

10:04 am on Nov 1, 2008 (gmt 0)

Alexa has a search engine?!

incrediBILL

10:17 am on Nov 1, 2008 (gmt 0)

Alexa has never claimed to implement the NOARCHIVE directive; they honor robots.txt and always have. As Alexa puts it:

Alexa does not wish to crawl anything you want to remain private. All you have to do is tell us. How? By using a simple robots.txt file. Robots.txt files are the most widely used standard on the Web for telling crawlers where they should and should not go on your site. All major crawlers respect these robots.txt files, including those from Google, MSN, Yahoo!, etc.. There is extensive information about how to create a robots.txt file on our site at Webmasters Help page.

If you don't have access to robots.txt, such as a Blogger site, just write to them:

Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.

I wrote to them once for just such a reason and they took the data out, no problem.
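For anyone who wants to confirm what a robots.txt actually tells ia_archiver before emailing anyone, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URL, and the helper name is_allowed are all made up for illustration. It only shows what a well-behaved crawler should conclude from the file, not what Alexa's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block ia_archiver everywhere,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder, not a site from this thread.
    url = "http://www.example.com/gigs/listing.html"
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

To test a live file instead of a string, RobotFileParser also has set_url() and read().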

amznVibe

3:52 pm on Nov 1, 2008 (gmt 0)

But robots.txt can only prevent indexing of the entire site or certain directories.

NOARCHIVE is an entirely different request.
It simply means don't make a duplicate of our content publicly available.

incrediBILL

4:03 pm on Nov 1, 2008 (gmt 0)

Well, in this case, after doing some research, I found they appear to be showing cached pages for sites that are actually blocked in the Internet Archive.

I don't really give two toots if they use or don't use NOARCHIVE, but if you've had your site pulled from the archive it shouldn't show up as cache on Alexa.

That's not cool whatsoever.

amznVibe

4:25 pm on Nov 1, 2008 (gmt 0)

Oh, and check this out: for half a decade, until August 2006, their webmaster page said specifically that they check for and obey NOARCHIVE:

[web.archive.org...]

After retrieving any HTML file, we check for the presence of the NOINDEX, NOARCHIVE, and NOFOLLOW tags in the "<head>" element of the document. If we find a NOINDEX or NOARCHIVE tag, we throw away the copy. If there is a NOFOLLOW tag, the robot will not follow any links found on that page. This allows users to control access to their own data, without needing their site administrators to update "robots.txt".

Then in October 2006 they took that out of their policy:
[web.archive.org...]

(And the irony is that their own cache shows this.)
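The check described in that old policy text is simple enough to sketch. The following is only an illustration of the described behavior, written with Python's standard html.parser. The class and function names are invented here, and it is not Alexa's or the Internet Archive's actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = {key: (value or "") for key, value in attrs}
            if attr.get("name", "").lower() == "robots":
                # content is a comma-separated list, e.g. "noarchive, nofollow"
                for token in attr.get("content", "").split(","):
                    if token.strip():
                        self.directives.add(token.strip().lower())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def crawl_decision(html: str) -> dict:
    """Apply the quoted policy: discard the copy on NOINDEX or NOARCHIVE,
    and skip the page's links on NOFOLLOW."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "keep_copy": not ({"noindex", "noarchive"} & found),
        "follow_links": "nofollow" not in found,
    }


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOARCHIVE"></head></html>'
    print(crawl_decision(page))
```

A crawler honoring the old wording would drop any page where keep_copy comes back False, which is exactly what the cached copies here suggest is no longer happening.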

incrediBILL

4:30 pm on Nov 1, 2008 (gmt 0)

See, that's a prime example of why you should block the Internet Archive and always use NOARCHIVE.

frontpage

5:03 pm on Nov 1, 2008 (gmt 0)

I have known this for a while and I do not trust Archive.org.

1) I have previously contacted archive.org via email to request that they stop spidering my websites and received acknowledgment that they would stop.

2) If you go to Archive.org's Wayback Machine you will get a 'Blocked Site Error' if you try to retrieve details on my websites.

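3) I use an .htaccess directive to add a noarchive header to all my webpages.

# The file pattern below is a placeholder; the original expression was lost in posting.
<Files ~ "\.(html?|php)$">
Header append X-Robots-Tag "noarchive"
</Files>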

4) In addition, robots.txt specifically bars the ia_archiver spider.

5) I also block ia_archiver via mod_security:

SecRule HTTP_User-Agent "ia_archiver" "deny,log,status:406"

Conclusion: Despite specifically asking archive.org to stop spidering my sites and using noarchive and robots.txt to reinforce my request, they still attempt to deep crawl my websites. My mod-security logs are replete with daily instances of attempted spidering by ia_archiver.

Yes, they will not show the results but archive.org still attempts to download my data.
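To put numbers on "daily instances", something like the sketch below can tally ia_archiver requests per day. It assumes Apache combined-format access log lines, where the user agent is the last quoted field. The regex, the function name hits_per_day, and the sample lines are illustrative only, and a mod_security audit log would need a different pattern.

```python
import re
from collections import Counter

# Captures the date part of the timestamp and the final quoted field
# (the user agent) of a combined-format log line.
LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\].*"(?P<agent>[^"]*)"\s*$'
)


def hits_per_day(lines, needle="ia_archiver"):
    """Count requests per day whose user agent contains needle."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and needle.lower() in match.group("agent").lower():
            counts[match.group("day")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; addresses come from a documentation range.
    sample = [
        '192.0.2.10 - - [01/Nov/2008:09:00:01 +0000] "GET /a.html HTTP/1.1" 406 312 "-" "ia_archiver"',
        '192.0.2.10 - - [01/Nov/2008:09:00:05 +0000] "GET /b.html HTTP/1.1" 406 312 "-" "ia_archiver"',
        '192.0.2.77 - - [01/Nov/2008:09:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
        '192.0.2.10 - - [02/Nov/2008:03:15:42 +0000] "GET /c.html HTTP/1.1" 406 312 "-" "ia_archiver"',
    ]
    for day, count in sorted(hits_per_day(sample).items()):
        print(day, count)
```

Feeding it open("access.log") instead of the sample list works the same way, since the function only needs an iterable of lines.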

incrediBILL

8:06 am on Nov 3, 2008 (gmt 0)

FWIW, there's a good information site about NOARCHIVE and Archive.org that's worth a read called The NoArchive Initiative [noarchive.net] with many aspects covered.

amznVibe

9:54 am on Nov 3, 2008 (gmt 0)

Well, that site is apparently not even a year old, but it does confirm
that all the major bots obey NOARCHIVE independently of NOINDEX:
[noarchive.net...]

I feel if I am going to let a bot eat my bandwidth I should at least be able to request it doesn't make a duplicate of my data publicly available. Using the data to produce stats and benchmarks is fine, just don't violate copyrights.

tangor

10:06 am on Nov 3, 2008 (gmt 0)

I've been trying to kill this for years, still unsuccessful. I think something else is going on since this bot asks for pages created LONG AFTER I SNUFFED 'EM.

Can't win, it seems like.