>What about the WayBackMachine? Not doing a good enough job?
Kahle is the *founder* of the WaybackMachine, which is located at archive.org. They are one and the same.
ahh, w/o archive, how else can we look back stuff like this?
|Correct me if I'm wrong, but isn't the web already an open archive? |
It's open, but it's not an archive - it is unthinkable for an archive to contain only the latest version of a book (rather than all revisions, no matter how minor), and it is also unthinkable for an archive to delete stuff when it has "gone out of print".
The web is just a current snapshot, whereas an archive (a good one!) is a full-featured movie.
|One problem with archive.org is if the bot is *ever* blocked by robots.txt, ALL previously stored pages are deleted. |
Not deleted, but made "unavailable." A real-time check is made to see if robots.txt exists before previously stored pages are shown. That doesn't mean they are deleted -- far from it. They're merely waiting until such time that the information can be useful once again.
I had a domain blocked and then I sold the domain name and moved everything to a new domain. The new owners don't have a robots.txt. Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years worth. I have no control over this because I no longer own the domain.
Archive.org is a slimy operation in my opinion. No respect for the rights of webmasters.
|Not deleted, but made "unavailable." |
Some are made "unavailable", but some are lost - 404s abound for all sorts of reasons. Whatever the reason, it ain't an archive if some items are "unavailable" - excusable for old-fashioned archives, where old rare books could be subject to restoration procedures, but not acceptable for a digital archive that can generate copies at near zero cost.
|Suddenly all the stuff I thought I had been blocking came alive and available on archive.org -- six years worth |
Sorry to hear that, but it is naive to "protect" content using robots.txt. If it was meant for registered (paying) users, then you should have password-protected it.
The way I see the current situation, there is a balanced compromise: pages are either free and can make it into the search engines, or paid and kept out of them.
In that case, I stand corrected Scarecrow. I will call your attention then to:
You are going to have to read up on the DMCA, and how to do a proper removal request. If you created that content, and didn't sell the rights to that content when you sold the domain, then you have the copyright on it.
Crawling the web means never having to say you're sorry. I can show you pages of mine at archive.org, where if you click "show source" you can see that there's a NOINDEX, NOFOLLOW meta in the headers.
Class action, anyone?
Actually, I think the ROBOTS, NOARCHIVE meta is working. I use that on everything now. Of course, before Yahoo kicked in this year, it was always GOOGLEBOT, NOARCHIVE - which did not work at archive.org.
I guess technically, archive.org isn't "indexing."
Highlighted in red:
"Note: Currently only few robots support this tag!"
No mention of the meta tag there in the official RFC.
It doesn't look like the meta tags are officially recognized. Only robots.txt. And, as you point out, archiving and indexing are not necessarily the same. There may be people who actually want archive.org to archive their site, but not have it appear in search engines. That would make sense for a site that expected almost all traffic to come from links on other sites, and not through SEs.
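For what it's worth, here is a minimal Python sketch of how a crawler that chose to honor these meta tags might read them. This is an illustration only and says nothing about how archive.org or Alexa actually process pages; the names RobotsMetaParser, robots_directives and PAGE are made up for the example. It also shows why a GOOGLEBOT, NOARCHIVE tag would mean nothing to a bot that only looks for ROBOTS or for its own name.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name (lowercased) -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def robots_directives(html, bot_name="robots"):
    """Directives that apply to bot_name: the generic ROBOTS tag plus its own."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot_name.lower(), set())
    return generic | specific


# Hypothetical page using the two tag styles discussed in this thread.
PAGE = (
    '<html><head>'
    '<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">'
    '<meta name="GOOGLEBOT" content="NOARCHIVE">'
    '</head><body></body></html>'
)

print(sorted(robots_directives(PAGE, "googlebot")))
print(sorted(robots_directives(PAGE, "ia_archiver")))
```

The second call never sees NOARCHIVE, because that directive was addressed to GOOGLEBOT only.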
This is getting humorous. You can actually call up old copies of your robots.txt at archive.org. I have many dozens listed and they clearly show:
Time to dash off a fax to millionaire whiz-kid Brewster Kahle and ask him to take out all six years worth. Their own evidence gives me a prima facie case, which I wouldn't be able to prove otherwise!
Um, didn't Amazon buy webarchive.org? Why should Google give Amazon their database?
Brewster Kahle started the Internet Archive (www.archive.org, aka the Wayback Machine) and Alexa at about the same time. The first is a nonprofit, the second is for-profit. Written into Alexa's charter is a provision that requires Alexa to hand over their crawl to the Archive after a delay of six months.
Alexa was sold to Amazon in 1999 for $250 million, but Kahle was already rich before then. Alexa has a certain amount of independence from Amazon, written into the deal, despite the fact that Amazon got all of the stock.
The Archive paid Alexa $1.7 million in 2002 for web hosting. But then, the Kahle/Austin Foundation in San Francisco donated $2.4 million to the Archive in 2002. While the Archive is a legally-recognized nonprofit, and even got a $174,000 donation from the Library of Congress in 2002, the tight arrangement with Alexa/Amazon means that there are commercial motives at work. There are enough interconnects so that the for-profit/nonprofit issue gets rather clouded. Alexa sells sets of crawl data to customers, for example. But their donation of the crawl to Archive.org makes it look less greedy, and it's probably a tax-writeoff also.
The robots.txt policy at archive.org is deceptive, and they should change it. The robots.txt is checked in real time when you request any or all of their pages for a domain. If it finds an exclusion for ia_archiver on the site requested, it says so and doesn't show you any pages. But the pages exist nonetheless at archive.org if they were crawled -- try deleting your ia_archiver exclusion and make another request. Use the www.mydomain.com/* format to ask for all pages.
It's clear from all of the robots.txt I checked for a half-dozen sites of mine, going back several years, that the ia_archiver exclusion did not prevent crawling of the site. You can see my exclusion right there in the archived version of the robots.txt.
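As a point of comparison, here is a small Python sketch of what a well-behaved crawler would conclude from an ia_archiver exclusion. The robots.txt text below is an assumption (the usual two-line form of such an exclusion, not a copy of anyone's actual file), and www.example.com and the helper may_fetch are placeholders. It uses the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the common form of an ia_archiver exclusion.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if user_agent is allowed to fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://www.example.com/page.html"  # placeholder URL
print(may_fetch(ROBOTS_TXT, "ia_archiver", url))   # the excluded bot
print(may_fetch(ROBOTS_TXT, "SomeOtherBot", url))  # a bot not named in the file
```

A crawler that respected the file would get a "no" for ia_archiver on every URL of the site, so none of those pages should have been fetched.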
What are they up to? Their vacuum cleaner is set on max to suck up everything, same as all the other search engines out there. Archive.org even has a reputation as very much in the public interest. Who cares what webmasters think?
Thank you for setting the record straight. Google should "affiliate" with a nonprofit and ask Amazon for their database too.
I forgot to mention in my last post, I bet that Google has plans for their archive. Archive.org may be a joke compared to what Google may one day have.
I just thought of something. I'll check it after this posts:
...would I have to opt in to say I want my site distributed. If so how do I opt out?
Well, your site is already publicly available to a world-wide audience, any of whom can save it onto their hard-drive and 'archive' it for their own use... In essence, you've distributed it yourself in the first place. Legally, I'm sure you could say you don't want it redistributed for profit, but I don't think you could make a sound legal argument that information you made publicly available can't be archived for not-for-profit reasons.
While a private party couldn't save up copies of the New York Times and then reprint them for a commercial project, libraries all over the world archive copies of newspapers for later viewing... and I don't think they could jump on a library for making copies of their microfiche archives for the purpose of giving them to another library.
The only legitimate argument I can see here is against Google for caching the pages in the first place (redistributing your material for commercial gain); but trying to stop Google from giving their collected data away to a not-for-profit archiving scheme? Well, that would be like a publishing house trying to sue someone for donating their personal book collection to a library. I don't think it would fly.
>I forgot to mention in my last post, I bet that Google has plans for their archive. Archive.org may be a joke compared to what Google may one day have.
There are some problems with that idea. Archive.org has the advantage of being a non-profit org in terms of legal status. Google could create their own non-profit, but since they couldn't monetize it I see little incentive. Some sort of partnership with archive.org would make more sense.
Good point. Well they could use a nonprofit for PR and sending Traffic to the Big G.
|ahh, w/o archive, how else can we look back stuff like this? |
I think that pic should be on the cover of the prospectus....
Big, big privacy problem.
You don't need archive.org for that.. try:
Scarecrow used to display this on his site :)
It's still up at Stanford, so it's not the Archive's problem. If you want to worry about privacy, have a look at Sergey's love affair with data mining [www-db.stanford.edu]. He can't spell too well. I think the "datamine maling list achive" should be "datamine mauling list archive."
maul vt beat, bruise; to injure by beating; mangle; to handle roughly
>Big, big privacy problem.
No. At least not in any new sense. The archive.org bot has been spidering and archiving the web for many years. The privacy issue is the same whether or not they ever partner with Google.
As for privacy issues, Google has had that over its head ever since they acquired the deja.com Usenet archives, and some other old Usenet archives. DejaNews, and its X-No-Archive header option, came into existence circa 1995. Yet Google has in the Google Groups archives Usenet posts from as early as 1981. Google has archived a Usenet post by Tim Berners-Lee announcing the WWW project from 1991. People posting to Usenet back in the 1980s couldn't have anticipated that their posts would many years later all be available via some newfangled technology like web servers. While the knowledgeable on Usenet would have been aware that someone, somewhere might possibly be archiving their posts, that someday they would be trivially accessible worldwide via a simple search was something they likely never envisioned. IMO the privacy issue of a WWW archive pales next to having an archive of Usenet posts from the 1980s.
I have established that the ia_archiver disallow in robots.txt does not do a good job of preventing crawling by Alexa. (Alexa feeds the WayBack Machine at www.archive.org.)
There are thousands of pages of mine from several domains available at archive.org over the last few years. For at least half of these, you can also find a robots.txt captured by archive.org at the same time as these pages, and you can see this in the robots.txt:
The page at www.archive.org that explains their exclusion policy [archive.org] is deceptive at best. The box with the information in it is titled, "Removing Documents From the Wayback Machine." They tell you to put the above in robots.txt and submit your site for crawling at Alexa. Two things will happen, they say:
1. It will remove all documents from your domain from the Wayback Machine.
2. It will tell us not to crawl your site in the future.
On the off-chance that Alexa would pick up the site submission under these specific conditions and treat it as a removal request, I tried this. In fact, Alexa merely queues your site for future crawling. It makes no effort to check the robots.txt immediately. In no way can archive.org's exclusion procedure be considered a removal procedure. It's simply a referral to the generic site submission box at Alexa.
I've established that there is no effective removal procedure outlined at archive.org. I've also pointed out that when you put the above disallow in your robots.txt, what happens is that if anyone requests an old page from your site at archive.org, a real-time check is made by archive.org of your robots.txt. If the disallow for ia_archiver is found, then they tell the searcher that they can't show any pages due to the robots.txt.
But the pages still exist at archive.org, and the next time the situation may be different. What happens under two conditions, a real-time hung request for robots.txt by archive.org, and a 403 Forbidden response from their request for your robots.txt? I tried both of these. You can try them yourself. The block for the archive.org IP addresses that will cover at least 90 percent of these real-time requests for robots.txt is this:
Linux kernel block:
/sbin/route add -net 22.214.171.124 netmask 255.255.252.0 reject
.htaccess block:
Deny from 126.96.36.199/255.255.252.0
Either of these will block the range 188.8.131.52 to 184.108.40.206, which is where archive.org comes from when it fetches your robots.txt in real time, in response to a request for a page from your domain. (This is not the range for Alexa's crawlers, so don't confuse the two).
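If you want to see exactly which addresses a 255.255.252.0 netmask covers before you commit to a block, Python's ipaddress module will work it out. This is only a sketch: 10.0.4.0 is a private placeholder network chosen for the example, NOT archive.org's real range, and block_range is a made-up helper name. Substitute the network address you actually intend to block.

```python
import ipaddress


def block_range(network_address, netmask):
    """Return (network, first address, last address, address count)."""
    # strict=True raises ValueError if the address is not the start of the block
    net = ipaddress.ip_network(f"{network_address}/{netmask}", strict=True)
    return net, net[0], net[-1], net.num_addresses


# Placeholder network, not archive.org's actual range.
net, first, last, count = block_range("10.0.4.0", "255.255.252.0")
print(net)                  # same block in CIDR notation
print(first, "to", last)    # first and last address covered
print(count, "addresses")
```

A 255.255.252.0 mask is a /22, so either form of the block covers 1024 consecutive addresses.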
The kernel block makes archive.org's request hang. But after a 20-second timeout, archive.org goes ahead and shows the searcher the pages, as if a robots.txt had been found without an ia_archiver disallow in it.
The .htaccess block returns a 403 Forbidden. This is completely ignored by archive.org, and the requested pages are shown immediately, as if a robots.txt had been found without an ia_archiver disallow in it.
This archive.org behavior means that the only way to make sure no one can see your old pages is to keep your site up 100 percent of the time from now until eternity, and make sure the robots.txt is always available, and always shows the ia_archiver disallow in it.
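To spell out the logic of what I observed, here it is as a tiny Python model. It encodes only the three outcomes described in this post (robots.txt fetched, request hung until timeout, 403 Forbidden); it is not archive.org's code, and the names RobotsFetch and pages_shown are invented for the illustration.

```python
from enum import Enum


class RobotsFetch(Enum):
    OK = "robots.txt fetched"
    TIMEOUT = "request hung until timeout"
    FORBIDDEN = "403 Forbidden"


def pages_shown(fetch, has_ia_archiver_disallow=False):
    """Model of the observed behavior: are archived pages shown to the searcher?"""
    if fetch is RobotsFetch.OK:
        # Only a successfully fetched robots.txt with the disallow hides pages.
        return not has_ia_archiver_disallow
    # Timeout or 403: treated as if no disallow existed (fails open).
    return True


for outcome in RobotsFetch:
    print(outcome.value, "->", pages_shown(outcome, has_ia_archiver_disallow=True))
```

In other words, the check fails open: any outcome other than a readable robots.txt containing the disallow results in the pages being shown.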
Here's the clincher: The WayBack Machine at www.archive.org plans to implement keyword searching in the near future so that old pages can be found based on the content in the page, like a normal search engine!
Right now this feature is in beta mode at recall.archive.org and I get frequent "Server busy" errors. But they are clearly working on it, and have stated that they plan to offer this within months, not years. Up until now, archive.org has been useful only if you enter a domain name followed by a wildcard, or a specific page on a domain.
I've also established that Alexa's crawler ignores the NOINDEX, NOFOLLOW meta. And I retract my earlier statement that it appears to respect the ROBOTS, NOARCHIVE meta. Further research shows that archive.org is at least six months behind. Since the ROBOTS version of NOARCHIVE wasn't used by me until Yahoo came online earlier this year, it is impossible for me to say whether the ROBOTS, NOARCHIVE will work at archive.org. There are relatively few pages from 2004 in archive.org from any sites at this point, and I cannot tell what's going to happen yet with the ROBOTS, NOARCHIVE meta.
Archive.org's lack of a real removal policy, Alexa's very poor record in terms of respecting robots.txt during the crawl, and the practice of ignoring document headers make the whole thing a privacy time bomb. Look for an Archive-Watch dot org site within a few weeks. Meanwhile, I've faxed them a request for a total removal for 12 domains, and am curious to see how they will respond.
|<meta name="Google" contents="nodistribute, noresale"> |
That's very good. I think I will contact w3.org and see if we can make this fly.
>Archive.org's lack of a real removal policy, Alexa's very poor record in terms of respecting robots.txt during the crawl, and the practice of ignoring document headers make the whole thing a privacy time bomb. Look for an Archive-Watch dot org site within a few weeks. Meanwhile, I've faxed them a request for a total removal for 12 domains, and am curious to see how they will respond.
Archive.org *does* have a real removal policy:
Scroll down to "Copyright Policy". That is how to make a DMCA takedown request. Please note that the form of the request MUST exactly conform to how it is specified on that page. Particularly about the statement made under penalty of perjury. If it isn't in that exact form, archive.org should ignore your removal request.
|If it isn't in that exact form, archive.org should ignore your removal request. |
They would not be well-advised to do this. Furthermore, they should clean up their robots.txt act, start obeying standard ROBOTS instructions in document headers, and implement a removal policy that is simpler - something akin to Google's removal system. We shall see.
>They would not be well-advised to do this.
Actually, they would be well advised to do this. Their legal counsel likely would insist on it, in fact. The DMCA is a law, and one specifically designed to protect those running Internet servers. By faithfully complying with the law, they cover their backsides.
>Furthermore, they should clean up their robots.txt act, start obeying standard ROBOTS instructions in document headers, and implement a removal policy that is simpler - something akin to Google's removal system. We shall see.
Don't hold your breath. In particular, there is no RFC regarding robots instructions in document headers, so that is a problem. There is an RFC for robots.txt, and they ought not be spidering what is disallowed by that.
Sorry, rfgdxm1, I think you are way off base on this one. First of all, the public relations angle is probably more significant than what some legal counsel might advise. I didn't just fall off the turnip truck when it comes to privacy activism.
Secondly, even your legal position wouldn't hold up very well.
Archive.org comes along and jumps my robots.txt despite my explicit disallow, and ignores my NOARCHIVE meta in all my documents, and then in the only removal policy you can find, archive.org insists that I have to file with them:
"A statement by you that you have a good-faith belief that the disputed use is not authorized by the copyright owner, its agent, or the law; A statement by you, made under penalty of perjury, that the above information in your notice is accurate and that you are the owner of the copyright interest involved or are authorized to act on behalf of that owner; and Your electronic or physical signature."
I have material on my sites that is posted with written permission of the copyright owner. I cannot claim that I am acting on behalf of the copyright owner; all the written permission says is that I can post the material on my site.
Now archive.org comes along and steals this, in defiance of my specific requests, and says the only way I can get it removed is to claim, under penalty of perjury, that I'm acting on behalf of the copyright owner! Don't you see the double standard in your position?
This is not a DMCA issue. That law is geared for very specific items that are in dispute. This is a spidering issue. Archive.org claims a larger index than Google, going back to 1996, and DMCA has almost nothing to do with it. Neither does "fair use," because of the quantity of material involved in archive.org's collection.
This is an issue of common courtesy for web spiders, and the rights of webmasters.
|I have material on my sites that is posted with written permission of the copyright owner. I cannot claim that I am acting on behalf of the copyright owner; all the written permission says is that I can post the material on my site. |
Have you considered the possibility that people who allow you to post their copyrighted material on your sites might prefer their data to be cached for posterity?
>Sorry, rfgdxm1, I think you are way off base on this one. First of all, the public relations angle is probably more significant than what some legal counsel might advise. I didn't just fall off the turnip truck when it comes to privacy activism.
Seriously now. Who puts something they want to be private on a web server when it is neither password protected nor encrypted? Does this seem sane to you?
>Archive.org comes along and jumps my robots.txt despite my explicit disallow, and ignores my NOARCHIVE meta in all my documents...
I would agree that if archive.org is ignoring robots.txt, then they shouldn't be doing that. The NOARCHIVE meta is something that is not defined by an RFC, and can be ignored. However, if the page was blocked by robots.txt, then it shouldn't have been fetched in the first place, so the NOARCHIVE meta would never become relevant.
That's a bit like saying "Would anybody who doesn't want their material copied and distributed show it in a cinema? Or sell it, printed on newsprint, on every corner?" Damn right they do, and a big fuss they make about somebody copying and redistributing their publicly available material. Just because I don't charge the select individuals I invite (via links) to visit my pages a viewing fee doesn't mean I waive my copyright.
|Just because I don't charge the select individuals I invite (via links) to visit my pages a viewing fee doesn't mean I waive my copyright. |
So, would you say a library is violating a newspaper's copyrights by keeping back issues archived for public viewing?
I suppose it boils down to whether a website falls under movie/music style copyright protection, or whether it's more like a newspaper or book.
UPDATE: archive.org placed an internal block on all the domains that I requested they remove, and notified me that they had done this per my request. They did it within a couple days of receiving the fax, and did not even argue with me about it. This block is past, present, and future, and is independent of whether my sites are online or whether I have them excluded in robots.txt.
My request was a faxed request on letterhead. It had nothing to do with the DMCA, and I didn't have to do any "under penalty of perjury" stuff. I just politely told them that their robots.txt was deceptive and didn't do what they said it would do, and now I was requesting that they remedy the situation for all my domains, past, present, and future.
And they did it! I didn't think they'd want to argue with me on this, because they are not on very firm ground.
Of course, I still have to block the Alexa crawler, because I don't imagine there's much communication in that direction between archive.org and Alexa. But even if Alexa gets through, my sites won't show up on archive.org now. I only need to block Alexa for bandwidth reasons.