
robots.txt now deletes archive.org wayback machine

     
3:44 pm on Apr 25, 2018 (gmt 0)

tangor, Senior Member from US (joined Nov 29, 2005; posts: 10450; votes: 1090)


At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog. Also, the blog URL which previously pointed to an msnbc.com page now points to a generic parked page.

[blog.archive.org...]

Interesting news from archive.org. If true, then robots.txt suddenly has real teeth. Note, this is the last paragraph of the article and the only part that webmasters are concerned with. The rest of the blog post deals with politics.
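For reference, the blog post doesn't show the file itself, but a Wayback-specific exclusion would presumably look something like this minimal robots.txt sketch (ia_archiver is the user-agent token the Archive has historically documented):

# block the Wayback Machine's crawler sitewide
User-agent: ia_archiver
Disallow: /

Per the quoted paragraph, the mere presence of a rule like this on the live site is what triggers the automated exclusion.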
3:49 pm on Apr 25, 2018 (gmt 0)

tangor, Senior Member from US (joined Nov 29, 2005; posts: 10450; votes: 1090)


Tempted to test this to see if the "automated" part actually works and will delete past history with wayback machine. If this really works, you can bet there's a zillion folks (and spammers, too) who would like to be forgotten that dang quick!
6:18 pm on Apr 25, 2018 (gmt 0)

Preferred Member (joined Mar 25, 2018; posts: 500; votes: 101)


I thought that was already the case. When I learned about the Wayback Machine, I found out my sites were archived. So to prevent further archiving I added a robots.txt entry, and some time later the entire archives of my sites were gone.
5:04 pm on Apr 25, 2018 (gmt 0)

ken_b, Senior Member from US (joined Oct 5, 2001; posts: 5893; votes: 120)


This popped up today on the archive.org blog and I wonder if I'm misunderstanding the statement. It sounds to me like the exclusion is implemented retroactively. Or am I reading that wrong? I thought exclusions only worked going forward.

... a robots.txt exclusion request specific to the Wayback Machine was placed... . That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, ... (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog ...

Source: [blog.archive.org...]




[edited by: not2easy at 9:10 pm (utc) on Apr 25, 2018]
[edit reason] splice cleanup [/edit]

5:18 pm on Apr 25, 2018 (gmt 0)

ken_b, Senior Member from US (joined Oct 5, 2001; posts: 5893; votes: 120)


Well that was easy, I found the answer ... Yes, it is retroactive.

The robots.txt file will do two things:
1: It will remove documents from your domain from the Wayback Machine.
2: It will tell us not to crawl your site in the future.
[web.archive.org ]

I guess I need to pay more attention to these things.
5:55 pm on Apr 25, 2018 (gmt 0)

Preferred Member from CA (joined Feb 7, 2017; posts: 575; votes: 59)


I did inadvertently ban the Wayback Machine's IP for a long time, but the archives still had remnants. I did not put them into my robots.txt; deleting all of your site's info seems a bit harsh. There was also a bot that crawled and shared data with the Wayback Machine, but I cannot remember its name. I unbanned that one as well.
7:56 pm on Apr 25, 2018 (gmt 0)

keyplyr, Senior Member from US (joined Sept 26, 2001; posts: 12913; votes: 893)


The Internet Archive (Wayback Machine) has never been truthful about supporting robots.txt.

You can disallow their crawler (Archive-It) and they will still come back to scrape your pages spoofing as a human browser.

They unapologetically violate international copyright law, strip away all scripts and advertising, and serve your web property from their own server, keeping the user on their site. If you sell products or publish ads, this does not serve your interests.
9:21 pm on Apr 25, 2018 (gmt 0)

Preferred Member (joined Mar 25, 2018; posts: 500; votes: 101)


The Internet Archive (Wayback Machine) has never been truthful about supporting robots.txt.
You can disallow their crawler (Archive-It) and they will still come back to scrape your pages spoofing as a human browser.

I have the exact opposite experience. "In my case", their robot has always obeyed my robots.txt file; however, you have/had to explicitly write a rule for their bot "ia_archiver", because it doesn't look at the * rules.
Also, I wrote them once to get one of my sites removed, and they did it in a matter of days; I just had to prove I was the owner of the domain name.
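To illustrate the point above, a hedged robots.txt sketch (the /private/ path is just a placeholder): per this poster's experience a wildcard rule alone was not honored, and an explicit ia_archiver section was needed.

# generic rule - reportedly not read by ia_archiver
User-agent: *
Disallow: /private/

# explicit rule - reportedly required for the Archive's bot
User-agent: ia_archiver
Disallow: /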
10:18 pm on Apr 25, 2018 (gmt 0)

keyplyr, Senior Member from US (joined Sept 26, 2001; posts: 12913; votes: 893)


Not what I said. Yes, their bot (the new one is Archive-It) obeys robots.txt, but THEY don't. They impersonate a human and continue to scrape.
11:04 pm on Apr 25, 2018 (gmt 0)

ken_b, Senior Member from US (joined Oct 5, 2001; posts: 5893; votes: 120)


So this is where my posts got moved to.

So is this retroactive behavior unique to the archive.org / wayback machine?

I ask because it seems like I've seen it said here umpteen times that disallowing some bot or another in robots.txt alone wouldn't get your pages out of a system; you had to put a noarchive meta tag on the pages themselves to get them to disappear from an index.

I've followed that last method; using noarchive has worked best for getting pages dropped, and I wish I had used it on a whole ton of others.
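For anyone unfamiliar with the noarchive method, it is a meta tag placed in each page's <head>; support varies by search engine and archiver, so treat this as a generic sketch rather than anything the Archive has documented:

<!-- ask indexers not to keep a cached/archived copy of this page -->
<meta name="robots" content="noarchive">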
11:34 pm on Apr 25, 2018 (gmt 0)

keyplyr, Senior Member from US (joined Sept 26, 2001; posts: 12913; votes: 893)


Until this new development (and it has yet to be verified) if you wanted to get your digital property removed from the Internet Archive (Wayback Machine) you had to explicitly request it using a form they have somewhere.

I have done that about a dozen times. Why a dozen you ask... because a few weeks later my entire site would be right back on the servers again and searchable for their users. I later started to block their IP range. That worked until they acquired new ranges.

After a couple of years of playing Whack-a-Mole I finally stopped them, and it has stayed stopped.

If they have now added actual support for robots.txt, then I say "finally."
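For anyone weighing the IP-blocking route described above, a minimal .htaccess sketch for Apache 2.4 (the CIDR range below is purely illustrative, not a verified Archive range; check their current allocations via WHOIS before relying on anything like this):

# deny one example address range, allow everyone else
<RequireAll>
Require all granted
Require not ip 207.241.224.0/20
</RequireAll>

As the post above notes, this is a moving target: a block only holds until they acquire new ranges.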
1:20 am on Apr 26, 2018 (gmt 0)

keyplyr, Senior Member from US (joined Sept 26, 2001; posts: 12913; votes: 893)


So this is where my posts got moved to
We had faith in your Sherlock abilities :)
3:14 pm on Sept 19, 2018 (gmt 0)

Full Member (joined Feb 1, 2006; posts: 271; votes: 2)


ia_archiver is not obeying anymore....

User-agent: ia_archiver
Disallow: /

Does anyone know a solution?
3:55 pm on Sept 19, 2018 (gmt 0)

Preferred Member from CA (joined Feb 7, 2017; posts: 575; votes: 59)


Ban the Amazon AWS server farm IP range?
4:17 pm on Sept 19, 2018 (gmt 0)

Full Member (joined Feb 1, 2006; posts: 271; votes: 2)


Yeah, but that has to be done in .htaccess, doesn't it?
And isn't it a big range?
And then it does not delete the captures from the past, which the aforementioned robots.txt rule did.
4:59 pm on Sept 19, 2018 (gmt 0)

justpassing, Preferred Member (joined Sept 13, 2018; posts: 355; votes: 71)


ia_archiver is not obeying anymore....

User-agent: ia_archiver
Disallow: /

Does anyone know a solution?


This works for me:

User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
5:49 pm on Sept 19, 2018 (gmt 0)

lucy24, Senior Member from US (joined Apr 9, 2011; posts: 15869; votes: 869)


Yeah, but that has to be done in .htaccess, doesn't it?
Yes, that's the difference between a Disallow (robots.txt) and a flat-out Deny/Ban/Block (.htaccess, config, or the equivalent in non-Apache servers). One is a sign saying No Admittance; the other is a lock on the door.

What's the problem?
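To make the "lock on the door" concrete, here is a minimal Apache 2.4 sketch that denies requests by User-Agent string; as keyplyr noted earlier, this is trivially defeated by a bot that spoofs a human browser, which is why some pair it with IP blocking:

# flag requests whose User-Agent matches either archive bot token
SetEnvIfNoCase User-Agent "(ia_archiver|archive\.org_bot)" deny_bot
# allow everyone except flagged requests
<RequireAll>
Require all granted
Require not env deny_bot
</RequireAll>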
5:59 pm on Sept 19, 2018 (gmt 0)

Full Member (joined Feb 1, 2006; posts: 271; votes: 2)


The problem is that through .htaccess it does not delete (actually, it just hides) the captures from the past, which the aforementioned robots.txt did.

I'll give what justpassing says a try, but I am afraid it won't work.
(On 30% of my sites the disallow for ia_archiver still does work, so justpassing, you may just be lucky.)
(What I am trying to say is that the bot is still obeying "sometimes".)