UPDATE: archive.org placed an internal block on all the domains that I requested they remove, and notified me that they had done this per my request. They did it within a couple days of receiving the fax, and did not even argue with me about it. This block is past, present, and future, and is independent of whether my sites are online or whether I have them excluded in robots.txt.
My request was a faxed request on letterhead. It had nothing to do with the DMCA, and I didn't have to do any "under penalty of perjury" stuff. I just politely told them that their robots.txt was deceptive and didn't do what they said it would do, and now I was requesting that they remedy the situation for all my domains, past, present, and future.
And they did it! I didn't think they'd want to argue with me on this, because they are not on very firm ground.
Of course, I still have to block the Alexa crawler, because I don't imagine there's much communication in that direction between archive.org and Alexa. But even if Alexa gets through, my sites won't show up on archive.org now. I only need to block Alexa for bandwidth reasons.
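For anyone who wants to check that a robots.txt exclusion for the Alexa crawler actually parses the way they expect, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token, the sample rules, and the `example.org` URL are assumptions for illustration and are not taken from the posts above. It only shows how the rules read on paper; it says nothing about whether a given crawler honors them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ia_archiver" is assumed to be the token the
# Alexa crawler identifies itself with; verify it against your own logs.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The excluded crawler is denied; any other well-behaved bot is allowed.
alexa_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", "http://example.org/page.html")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.org/page.html")
```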
|So, would you say a library is violating a newspaper's copyrights by keeping back issues archived for public viewing? |
Libraries pay for subscriptions. To make it comparable to archive.org, you'd need a rogue library that goes to a legitimate library, and photocopies every issue of a publication so that they don't have to pay for a subscription, and then makes the photocopy available for public viewing.
|I only need to block Alexa for bandwidth reasons. |
Congrats - you will probably save a few cents on traffic at the expense of leaving behind an incomplete version of web history. I hope it won't be classed as incomplete though.
|Libraries pay for subscriptions. |
Nor do you charge for access to the information that the crawlers tried to get - they only "sell" it for what they "bought" it for.
Instead of the DMCA, it would have been more beneficial for humankind to have legislation similar to the one that applies to printed media, which would compel all publishers of publicly available content on the web to submit, or at the very least not resist having, said content added to digital equivalents of the Library of Congress.
Having said that, I just want to say "Amen" to the opt-out principle of current search engines/archivers; without this principle there would be no Web as we know it. One can't have one's cake and eat it too. IMHO an association of people who crawl the web should be created with the intention of blacklisting those websites that selectively ban legitimate non-abusive bots - if you don't like legit non-abusive bot X, then no bots will crawl you. Sounds like a fair compromise to me.
I started my websites to reach humans, not machines.
Crawlers cost me because:
1) Spiders consume 80 percent of my bandwidth.
2) I lose some control over my content when they cache it.
I tolerate three crawlers because:
1) They bring me eyeballs (Google, Yahoo, MSN). Without these three I wouldn't have many eyeballs. Should I fall on my knees and thank G/Y/M? No, they're not doing it to be nice. They're doing it to get rich. I'm not getting rich; I'm nonprofit. But that's the way the web evolved and I have to live with it.
2) I can opt out of the cache copy on Google and Yahoo.
I do not tolerate Alexa/archive.org because:
1) They don't bring me eyeballs.
2) Their robots.txt and ROBOTS header policies are deceptive or nonexistent.
It's a balance of power question, and we all draw the line in different places according to our needs.
|No. At least not in any new sense. The archive.org bot has been spidering and archiving the web for many years. The privacy issue is the same whether or not they ever partner with Google. |
I meant the storing of all that content.
I had to email Google to get my name and address out of its records. Do I have to do that for every company that wants to archive that information?
|Libraries pay for subscriptions. |
Libraries pay for subscriptions because that's what you have to do in order to access the content in the first place. A publicly available website which is free to view for everyone is different... that's like a library keeping archived copies of a free publication. My apologies, my original example wasn't precise enough.
|To make it comparable to archive.org, you'd need a rogue library that goes to a legitimate library, and photocopies every issue of a publication so that they don't have to pay for a subscription, and then makes the photocopy available for public viewing. |
Nonsense. That would be like archive.org using someone else's name and password to access a subscription site, crawling it, and then making it publicly available. I'm assuming your website is freely available for public viewing?
My point is that libraries make a lot of copyrighted material freely available to the general public... This would not be legal for someone to do for profit, but libraries are allowed to abide by a slightly different set of rules, because they are performing a public service. Archive.org is attempting to essentially create a library of internet content, past and present.
|I'm assuming your website is freely available for public viewing? |
Not for bots it isn't. I monitor rate of access and total fetches around the clock, and cut off bots all the time. A bot trying to go deep for 130,000 pages worth, at 10 pages per second, is not what I would call "public access." Maybe "denial of service" would be a good description.
First I cut them off with 403 Forbidden. Most of them keep trying anyway, so the next level cuts them off at the kernel's route table. That block stays there until the next time I reboot, which is usually months later. Then the cycle starts all over again.
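The two-stage approach described above (403 first, then a harder block for bots that keep hammering) can be sketched roughly as follows. This is an illustration only, not the poster's actual setup: the `BotThrottle` class, its thresholds, and the in-memory `blocked` set are all hypothetical, and the code merely records which addresses would be handed off to a network-level block such as a route table entry. It does not touch the routing table itself.

```python
import time
from collections import defaultdict, deque


class BotThrottle:
    """Two-stage throttle: answer 403 first, then hard-block repeat offenders."""

    def __init__(self, max_hits=20, window_seconds=10.0, strikes_before_block=5):
        # All three thresholds are made-up defaults; tune them to your traffic.
        self.max_hits = max_hits
        self.window = window_seconds
        self.strikes_before_block = strikes_before_block
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.strikes = defaultdict(int)  # ip -> requests made while over the limit
        self.blocked = set()             # ips to hand off to a network-level block

    def check(self, ip, now=None):
        """Return "ok", "forbidden" (send a 403), or "blocked" (drop entirely)."""
        now = time.monotonic() if now is None else now
        if ip in self.blocked:
            return "blocked"
        recent = self.hits[ip]
        recent.append(now)
        # Discard timestamps that have slid out of the rate window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) <= self.max_hits:
            return "ok"
        # Over the limit: count a strike, and escalate once there are too many.
        self.strikes[ip] += 1
        if self.strikes[ip] >= self.strikes_before_block:
            self.blocked.add(ip)
            return "blocked"
        return "forbidden"


# Simulate a bot fetching 10 pages per second from one (documentation) address.
throttle = BotThrottle()
results = [throttle.check("192.0.2.7", now=i * 0.1) for i in range(30)]
```

With these defaults the first 20 requests pass, the next few get a 403, and a bot that keeps trying anyway lands in `blocked`. A visitor fetching one page per second never trips the limit.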
Some of these jokers are from China. Some from Japan. Some from Africa. Some from Europe. Some from MetaCarta.com, which is doing it for U.S. intelligence. Some are "Dudes in Dorm Rooms" with personal bots using the Big U bandwidth.
Death to these bots, and I don't care if it cuts into your widget sales. Give me back the old Internet!