Forum Moderators: goodroi

Message Too Old, No Replies

robots.txt now deletes archive.org wayback machine


tangor

3:44 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog. Also, the blog URL which previously pointed to an msnbc.com page now points to a generic parked page.

[blog.archive.org...]

Interesting news from archive.org. If true, then robots.txt suddenly has real teeth. Note, this is the last paragraph of the article and the only part that webmasters are concerned with. The rest of the blog post deals with politics.

tangor

3:49 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Tempted to test this to see if the "automated" part actually works and will delete past history from the Wayback Machine. If this really works, you can bet there's a zillion folks (and spammers, too) who would like to be forgotten that dang quick!

Travis

6:18 pm on Apr 25, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I thought that was already the case. When I learned about the Wayback Machine, I found out my sites were archived. So to prevent further archiving, I added a robots.txt entry, and some time later the entire archives of my sites were gone.

ken_b

5:04 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This popped up today on the archive.org blog and I wonder if I'm misunderstanding the statement. It sounds to me like the exclusion is implemented retroactively. Or am I reading that wrong? I thought exclusions only worked going forward.

... a robots.txt exclusion request specific to the Wayback Machine was placed... . That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, ... (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog ...

Source: [blog.archive.org...]




[edited by: not2easy at 9:10 pm (utc) on Apr 25, 2018]
[edit reason] splice cleanup [/edit]

ken_b

5:18 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well that was easy, I found the answer ... Yes, it is retroactive.

The robots.txt file will do two things:
1: It will remove documents from your domain from the Wayback Machine.
2: It will tell us not to crawl your site in the future.
[web.archive.org...]

I guess I need to pay more attention to these things.
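Going by the quoted documentation, a single robots.txt section naming the Wayback Machine's agent would do both things at once: purge the matching captures and stop future crawls. A minimal sketch (the /old-blog/ path is just an illustration):

```
# Hypothetical example: excludes only this path for the Wayback Machine's agent
User-agent: ia_archiver
Disallow: /old-blog/
```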

TorontoBoy

5:55 pm on Apr 25, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I inadvertently banned the Wayback Machine's IP for a long time, but the archives still kept remnants. I never put them in my robots.txt; deleting all of a site's info seems a bit harsh. There was also a bot that crawled and shared data with the Wayback Machine, but I cannot remember its name. I unbanned that one as well.

keyplyr

7:56 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The Internet Archive (Wayback Machine) has never been truthful about supporting robots.txt.

You can disallow their crawler (Archive-It) and they will still come back to scrape your pages spoofing as a human browser.

They unapologetically violate international copyright law, strip away all scripts and advertising and serve your web property from their own server keeping the user on their site. If you sell products or publish ads, this is not serving your interests.

Travis

9:21 pm on Apr 25, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



The Internet Archive (Wayback Machine) has never been truthful about supporting robots.txt.
You can disallow their crawler (Archive-It) and they will still come back to scrape your pages spoofing as a human browser.

I have the exact opposite experience. In my case, their robot has always obeyed my robots.txt file; however, you have (or had) to write a rule explicitly for their bot "ia_archiver", because it doesn't look at the * rules.
Also, I once wrote to them to get one of my sites removed, and they did it in a matter of days; I just had to prove I was the owner of the domain name.
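To illustrate the point about wildcard rules: in a file like the sketch below, ia_archiver would reportedly skip the catch-all section and act only on the section that names it explicitly, so both are needed if you want other crawlers covered too.

```
# Honored by most crawlers, reportedly ignored by ia_archiver
User-agent: *
Disallow: /

# The section ia_archiver actually reads
User-agent: ia_archiver
Disallow: /
```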

keyplyr

10:18 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's not what I said. Yes, their bot (the new one is Archive-It) obeys robots.txt, but THEY don't. They impersonate a human and continue to scrape.

ken_b

11:04 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So this is where my posts got moved to.

So is this retroactive behavior unique to the archive.org / wayback machine?

I ask because it seems like I've seen it said here umpteen times that disallowing some bot or another in robots.txt would not get your pages out of a system; you had to put a noarchive meta on the pages themselves to get them to disappear from an index.

I've followed that last method: using noarchive has worked best on the many pages where I dropped it in, and I wish I had used it on a whole ton of others.
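For reference, the noarchive method mentioned above is a per-page meta tag placed in the document head. This form addresses all bots that honor the directive; a specific bot name can replace "robots" to target just one:

```html
<meta name="robots" content="noarchive">
```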

keyplyr

11:34 pm on Apr 25, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Until this new development (and it has yet to be verified) if you wanted to get your digital property removed from the Internet Archive (Wayback Machine) you had to explicitly request it using a form they have somewhere.

I have done that about a dozen times. Why a dozen you ask... because a few weeks later my entire site would be right back on the servers again and searchable for their users. I later started to block their IP range. That worked until they acquired new ranges.

After a couple of years of playing Whack-a-Mole I finally stopped them, and it has stayed stopped.

If they have now added actual support for robots.txt, then I say "finally."

keyplyr

1:20 am on Apr 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So this is where my posts got moved to
We had faith in your Sherlock abilities :)

mirrornl

3:14 pm on Sep 19, 2018 (gmt 0)

10+ Year Member



ia_archiver is not obeying anymore....

User-agent: ia_archiver
Disallow: /

Does anyone know a solution?

TorontoBoy

3:55 pm on Sep 19, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Ban the Amazon AWS farm IP range?

mirrornl

4:17 pm on Sep 19, 2018 (gmt 0)

10+ Year Member



Yeah, but that has to be done in .htaccess, doesn't it?
And isn't it a big range?
And then it doesn't delete the past captures, which the robots.txt method did.

justpassing

4:59 pm on Sep 19, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



ia_archiver is not obeying anymore....

User-agent: ia_archiver
Disallow: /

Someone knows a solution?


This works for me :

User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

lucy24

5:49 pm on Sep 19, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeh, but has to be done in .htaccess isn't it?
Yes, that's the difference between a Disallow (robots.txt) and a flat-out Deny/Ban/Block (htaccess, config, or equivalent in non-apache servers). One is a sign saying No Admittance; the other is a lock on the door.

What's the problem?
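As a sketch of the "lock on the door" approach: an Apache 2.4 .htaccess fragment like this refuses matching requests outright instead of merely asking (the user-agent patterns are the bot names mentioned in this thread; adjust to taste).

```apache
# Flag any request whose User-Agent matches the archive bot names
SetEnvIfNoCase User-Agent "ia_archiver|archive\.org_bot" block_archive

# Deny flagged requests, allow everyone else (Apache 2.4 syntax)
<RequireAll>
    Require all granted
    Require not env block_archive
</RequireAll>
```

Note that, as keyplyr describes, this only stops requests that identify themselves; against a scraper spoofing a browser user-agent, the fallback is blocking by IP range.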

mirrornl

5:59 pm on Sep 19, 2018 (gmt 0)

10+ Year Member



The problem is that through .htaccess it does not delete (actually just hides) the past captures, which the mentioned robots.txt method did.

I'll give what justpassing suggests a try, but I'm afraid it won't work.
(On about 30% of my sites the ia_archiver disallow still does work, so justpassing, you may just be lucky.)
(What I'm trying to say is that the bot still obeys, but only sometimes.)