They also still identify themselves as the Wayback Machine's ia_archiver.
The pages do not appear on the Wayback Machine, however.
Do they have a way to request cached copy removal or do they just not care?
Alexa does not wish to crawl anything you want to remain private. All you have to do is tell us. How? By using a simple robots.txt file. Robots.txt files are the most widely used standard on the Web for telling crawlers where they should and should not go on your site. All major crawlers respect these robots.txt files, including those from Google, MSN, Yahoo!, etc. There is extensive information about how to create a robots.txt file on our site at Webmasters Help page.
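For anyone who wants to sanity-check their rules before relying on them, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents and the example.com URL are assumptions made up for illustration, not anything published by Alexa or archive.org, and the check only shows how a compliant parser would read the file, not how ia_archiver actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bar ia_archiver from the whole site,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host.
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "http://example.com/page.html"))
```

This only tells you what a well-behaved robot should conclude from the file. Whether a given crawler honours it is a separate question.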
If you can't edit robots.txt, as on a Blogger site, just write to them:
Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
I wrote to them once for just such a reason, and they took the data out. Not a problem.
I don't really give two toots if they use or don't use NOARCHIVE, but if you've had your site pulled from the archive it shouldn't show up as cache on Alexa.
That's not cool whatsoever.
[web.archive.org...]
After retrieving any HTML file, we check for the presence of the NOINDEX, NOARCHIVE, and NOFOLLOW tags in the "<head>" element of the document. If we find a NOINDEX or NOARCHIVE tag, we throw away the copy. If there is a NOFOLLOW tag, the robot will not follow any links found on that page. This allows users to control access to their own data, without needing their site administrators to update "robots.txt".
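The behaviour described in that old policy text can be sketched roughly as follows. This is only an illustration of the stated rules (discard the copy on NOINDEX or NOARCHIVE, skip links on NOFOLLOW), built on Python's standard html.parser. The class and function names are hypothetical, and it is not archive.org's actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = {name: (value or "") for name, value in attrs}
            if attr.get("name", "").lower() == "robots":
                # Directives are comma separated and case insensitive.
                for token in attr.get("content", "").split(","):
                    token = token.strip().lower()
                    if token:
                        self.directives.add(token)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def crawl_decision(html: str) -> dict:
    """Apply the quoted policy: drop the copy on noindex/noarchive, skip links on nofollow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "keep_copy": not ({"noindex", "noarchive"} & found),
        "follow_links": "nofollow" not in found,
    }
```

A page whose head contains a robots meta tag with content "NOARCHIVE, NOFOLLOW" would come back with both keep_copy and follow_links set to False under this sketch.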
Then in October 2006 they took that out of their policy:
[web.archive.org...]
(Note the irony that their own cache shows this.)
1) I have previously contacted archive.org via email to request that they stop spidering my websites and received acknowledgment that they would stop.
2) If you go to Archive.org's waybackmachine you will get a 'Blocked Site Error' if you try to retrieve details on my websites.
3) I use a .htaccess directive to add a noarchive to all my webpages.
<Files ~ >
Header append X-Robots-Tag "noarchive"
</Files>
4) In addition, robots.txt specifically bars the ia_archiver spider.
5) I also block ia_archiver via mod_security:
SecRule HTTP_User-Agent "ia_archiver" "deny,log,status:406"
Conclusion: Despite my specifically asking archive.org to stop spidering my sites, and despite using noarchive and robots.txt to reinforce the request, they still attempt to deep crawl my websites. My mod_security logs are replete with daily instances of attempted spidering by ia_archiver.
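For reference, a crawler that honours the X-Robots-Tag header from point 3 would need a check along these lines before storing a copy. This is a minimal sketch under stated assumptions: the function name is hypothetical, it only looks at unscoped directives (values carrying a user-agent prefix are ignored), and it says nothing about what ia_archiver actually does with the header.

```python
def forbids_archiving(header_values) -> bool:
    """Return True if any X-Robots-Tag header value contains a noarchive directive.

    header_values is the list of raw values of every X-Robots-Tag header
    on a response, for example ["noarchive"] or ["noindex, nofollow"].
    """
    for value in header_values:
        # A single header may carry several comma-separated directives.
        directives = {token.strip().lower() for token in value.split(",")}
        if "noarchive" in directives:
            return True
    return False


if __name__ == "__main__":
    print(forbids_archiving(["noarchive"]))
    print(forbids_archiving(["noindex, nofollow"]))
```

The header only works as a request to cooperative robots. Actually refusing the connection is left to the user-agent rule in point 5.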
Yes, they will not show the results, but archive.org still attempts to download my data.
I feel that if I am going to let a bot eat my bandwidth, I should at least be able to request that it not make a duplicate of my data publicly available. Using the data to produce stats and benchmarks is fine, just don't violate copyrights.