Forum Moderators: phranque

archive.org running amok

         

jackson

4:09 pm on Jul 7, 2004 (gmt 0)

10+ Year Member



Recently been hit by this mess:

ia11005.archive.org - - [05/Jul/2004:20:59:19 -0500] "GET /robots.txt HTTP/1.0" 200 980 "-" "ia_archiver-web.archive.org"
ia11012.archive.org - - [05/Jul/2004:20:59:21 -0500] "GET /delta/board/images/next.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11025.archive.org - - [05/Jul/2004:20:59:24 -0500] "GET /delta/board/images/icons/book.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
cgi9.archive.org - - [05/Jul/2004:20:59:25 -0500] "GET /robots.txt HTTP/1.0" 200 980 "-" "ia_archiver-web.archive.org"
ia11015.archive.org - - [05/Jul/2004:20:59:26 -0500] "GET /delta/board/images/previous.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11030.archive.org - - [05/Jul/2004:20:59:27 -0500] "GET /delta/board/images/newicons/book.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11003.archive.org - - [05/Jul/2004:20:59:29 -0500] "GET /delta/board/images/all.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11012.archive.org - - [05/Jul/2004:20:59:30 -0500] "GET /delta/board/images/threaded.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
cgi4.archive.org - - [05/Jul/2004:20:59:31 -0500] "GET /delta/board/images/greyflat.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11017.archive.org - - [05/Jul/2004:20:59:35 -0500] "GET /delta/board/images/adm.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11023.archive.org - - [05/Jul/2004:20:59:36 -0500] "GET /delta/board/images/new.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ia11013.archive.org - - [05/Jul/2004:20:59:47 -0500] "GET /delta/board/showflat.php?Cat=&Board=mainboard&Number=79&page=&view=&sb=&o=&part=all&vc=1 HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:39 -0500] "GET /delta/board/images/previous.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:39 -0500] "GET /delta/board/images/all.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:40 -0500] "GET /delta/board/images/next.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:40 -0500] "GET /delta/board/images/threaded.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:43 -0500] "GET /delta/board/images/greyflat.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:43 -0500] "GET /delta/board/images/newicons/book.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:44 -0500] "GET /delta/board/images/icons/book.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:44 -0500] "GET /delta/board/images/new.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:44 -0500] "GET /delta/board/images/adm.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
ia11034.archive.org - - [05/Jul/2004:21:00:48 -0500] "GET /robots.txt HTTP/1.0" 200 980 "-" "ia_archiver-web.archive.org"
cgi9.archive.org - - [05/Jul/2004:21:01:13 -0500] "GET /delta/board/stylesheets/stylesheet2.css HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"

Apologies for this huge chunk of log data, but it might make for some interesting analysis. First of all, archive.org - if indeed that is what it is - comes in. It goes to the robots.txt file - where it's been banned/blocked. It then proceeds to trawl through a series of directories and files I don't even have on my site - such as "/delta/board/images/next.gif" - and gets 404'ed there.

And then, lo and behold, I have another host (ip68-12-201-226.ok.ok.cox.net) coming in and ferreting for the same or similar directories and files, and they get 403'ed.

Can someone please explain:

i) what is happening here?
ii) how to "kill off" this process?

Any advice would be appreciated.
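
For anyone who wants to slice a log like this rather than eyeball it, here is a minimal Python sketch. It assumes the lines are in Apache's "combined" format, as the excerpt appears to be; the names `parse_line`, `summarize` and `SAMPLE` are made up for illustration. It counts requests by user agent, status and referer host. In the sample, the cox.net requests carry a web.archive.org referer, which would be consistent with a browser viewing an archived page whose images still point at the live address.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed layout: Apache "combined" log format, matching the excerpt above.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

def summarize(lines):
    """Count requests by (user agent, status, referer host)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip anything that is not in the assumed format
        ref = rec["referer"]
        ref_host = urlparse(ref).netloc if ref != "-" else "-"
        counts[(rec["agent"], rec["status"], ref_host)] += 1
    return counts

# Three lines copied from the log excerpt above.
SAMPLE = [
    'ia11005.archive.org - - [05/Jul/2004:20:59:19 -0500] "GET /robots.txt HTTP/1.0" 200 980 "-" "ia_archiver-web.archive.org"',
    'ia11012.archive.org - - [05/Jul/2004:20:59:21 -0500] "GET /delta/board/images/next.gif HTTP/1.0" 404 27109 "-" "ia_archiver-web.archive.org"',
    'ip68-12-201-226.ok.ok.cox.net - - [05/Jul/2004:21:00:39 -0500] "GET /delta/board/images/previous.gif HTTP/1.1" 403 27167 "http://web.archive.org/web/20010608211415/66.70.139.171/delta/board/showflat.php?Board=mainboard&Number=79" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"',
]

if __name__ == "__main__":
    for key, n in summarize(SAMPLE).items():
        print(n, key)
```

Feeding it the whole access log instead of `SAMPLE` (for example via `open(path)`) should show at a glance which agents are collecting the 404s and which requests arrive with an archive.org referer.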

ergophobe

4:16 pm on Jul 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



archive.org is home of the wayback machine, a group whose mission is to keep snapshots of the web. I'm surprised it ignores the robots.txt though.

Go to archive.org and type in your url and see what happens.

The idea is that when you sell your domain and it turns into a porn site, the people who used to use that content can go to archive.org and find the original pages.

Funny side note - I've been asked to help on a site that was on a server that died recently. Lost lots of content, much of which I could retrieve from the wayback machine.

ergophobe

4:19 pm on Jul 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



According to their FAQ [archive.org]


The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine

Also check out Removing Documents From the Wayback Machine [archive.org]

According to that page


To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

User-agent: ia_archiver
Disallow: /
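
A quick way to sanity-check a rule like this before relying on it is Python's standard `urllib.robotparser`. This is only a sketch: the helper name `allowed` is made up, and it shows how Python's parser reads the file, which is not necessarily how the Archive's crawler reads it.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule quoted from the Archive's page above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if Python's robots.txt parser would let user_agent fetch path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

if __name__ == "__main__":
    for agent in ("ia_archiver", "ia_archiver-web.archive.org", "Googlebot"):
        print(agent, allowed(agent, "/delta/board/images/next.gif"))
```

Python matches the `User-agent` token as a substring of the agent name, so the longer `ia_archiver-web.archive.org` string seen in the log is covered by the same rule here; other parsers may be stricter.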

Tom

vkaryl

11:25 pm on Jul 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've had reasons to be VERY VERY GRATEFUL to the wayback machine. Bless their hearts, without them I might have seriously for the first time in my LIFE considered suicide! [Okay, so that's an exaggeration. But NOT MUCH OF ONE....]

pendanticist

12:16 am on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



FWIW - I've noticed "ia_archiver" has also increased its frequency of visits, somewhat similar to what the original poster cites.

However, the UA is quite a bit different.

In other words, the bot I mention is the ONLY 'Wayback' to ever have come to visit my site, which has been online since '98.

How long has the one, mentioned by the poster, been around?

Do they have different bots for whatever reason?

balam

4:46 am on Jul 25, 2004 (gmt 0)

10+ Year Member



> To exclude the Internet Archive's crawler [...]

...I had to ban it via .htaccess, regardless of what they may say in their FAQ. I caught them violating my robots.txt on more than one site, so adios ia_archiver.

Disregarding copyright issues, I'm not particularly keen on the idea of folks seeing what my sites looked like years ago. But I will agree with vkaryl - the Wayback Machine has been a lifesaver when I've needed to rescue someone's lost site.

(So, in effect, I do get to have the cake & eat it, too!)
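
The actual .htaccess rules are not shown in this thread. As a rough illustration of the same idea (refusing the request at the server by User-Agent instead of trusting the bot to honor robots.txt), here is a Python WSGI sketch. Everything in it is an assumption made for the example: the `block_agents` wrapper, the `hello_app` placeholder, and the choice to match on the `ia_archiver` token.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: block any agent whose User-Agent contains this token.
BLOCKED_AGENT_SUBSTRINGS = ("ia_archiver",)

def block_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""
    def wrapper(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper

def hello_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

def status_for(agent):
    """Call the wrapped app with a fake request and return the status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = agent
    seen = []
    body = block_agents(hello_app)(environ, lambda status, headers: seen.append(status))
    list(body)  # consume the response body
    return seen[0]

if __name__ == "__main__":
    print(status_for("ia_archiver-web.archive.org"))
    print(status_for("Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"))
```

A User-Agent match only stops bots that identify themselves honestly, so it is a complement to robots.txt rather than a guarantee.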

figment88

5:05 am on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Remember this was started by the same people who brought you Alexa. They are big on vision, crappy on execution. I wouldn't be surprised at all if they try to respect robots.txt but fail due to incompetence.

plumsauce

7:36 am on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




they are also very big on *monetizing* formerly volunteer created resources such as imdb.com

so you supply the content, they organize and copyright it and sell it.

for that reason alone, they should be banned.

ergophobe

4:47 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know anything about their motives, but the concept is fantastic.

> I'm not particularly keen on the idea of folks seeing what my sites looked like years ago.

I'm a professional historian by training and vocation and one of the problems with the web is that it is a very important communication medium and not at all archived. Unlike print, handwritten manuscripts, movies and most radio and television broadcasts, there is no systematic way of preserving what's on the net. From a historian's perspective 200 years from now, what will be most interesting will be the first "primitive" web sites, just as books printed prior to 1500 (the so-called 'incunabula') are many times more valuable than something printed in 1510.

Maybe some non-profit group needs to do like the government does for other types of documents - they preserve them, but nobody gets access for 25-50 years.

Tom

ergophobe

5:00 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> they are also very big on *monetizing* formerly volunteer created resources such as imdb.com. so you supply the content, they organize and copyright it and sell it. for that reason alone, they should be banned.

Where are you getting this information? They do not and cannot claim copyright on any materials in the IA any more than the Library of Congress claims copyright over the books that are in there. This is just alarmist.


> Remember this was started by the same people who brought you Alexa.

Well, if you think about it, Mozilla was started by the same people that bring you AOL. Much Linux development has been brought to you by the same people that bring you WordPerfect and Star Office. What does that say? It's an irrelevant relationship. The Internet Archive is supported by Alexa, but also by

- the Smithsonian
- the Long Now Foundation (and they are neighbors of the Long Now Foundation out in the Presidio).

If not them, then who will take snapshots of the internet for posterity?

Tom

figment88

5:34 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ergophobe, your comparisons are bad. I am not talking about groups of people, I am talking about a few specific individuals. Both Alexa and the WayBackMachine are small enterprises - not like AOL or even Mozilla. We are talking just a couple handfuls of people led by Brewster Kahle - a so-called visionary who can't execute very well.

The WayBackMachine may get financial support from those other institutions, but the personnel overlap with Alexa. Both Alexa and the Archive are located in the Presidio in San Francisco with offices by each other.

fiestagirl

5:42 pm on Jul 25, 2004 (gmt 0)

10+ Year Member



I was using the wayback machine last week to rescue some content and I think I may know what is happening.

When I searched for all of the content that is indexed by them for this site, I got a few messages like this:

Not in Archive.
The page you requested has not been archived. If the page is still available on the Internet, we will begin archiving it during our next crawl.

So I am imagining that they have gone off to my client's site looking for these pages that haven't existed since 1998 and started asking for them.

I also encountered content that had been excluded by robots.txt. I should say that the link was there but the content wasn't.
The message was:
We're sorry, access to www.coolsite.com/page.html has been blocked by the site owner via robots.txt.

So they may check robots.txt and still visit the page if they've been excluded. The content isn't indexed but the link is. Just like you find on GG.

ergophobe

6:18 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Figment, what about my more general point - if not them, then who?

I don't see anyone else trying to record states of the web. I would rather have an incompetent archivist than no archivist at all.

Sure, ideally someone would step up and do this who had fantastic resources, expertise and execution and the clout to guarantee storage for 50 or 100 years barring planetary catastrophe. As near as I can see, though, that institution is not around yet and the early days of the internet are being lost. At least someone is trying to do something.

Tom

vkaryl

6:51 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually the WayBackMachine has a copy of my earliest site available. It's embarrassing, yes. I was a total newbie to the world of the net, though I'd been using computers since 1984.

No matter that it's embarrassing, I'm glad it's still there. I was impressed by what I DID know back then, considering that I didn't know much; and I'm amazed at what I've learnt since - including how much I STILL have to learn!

Ergophobe, I too am glad that someone (no matter how incompetent, assuming that to be true) has at least some of the "early net" stashed somewhere....

figment88

6:52 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ergophobe, I do not really care to comment on your more general point. If you care to discuss the value of the WayBackMachine, start a new topic.

I was responding to the original poster with my thought that a possible reason the archive seems to ignore the robots.txt is incompetence. I am not even voicing an opinion on whether this is the most likely reason. I am just saying that sometimes things don't work like you expect because they are broken.

plumsauce

7:16 pm on Jul 25, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Where are you getting this information? They do not and cannot claim copyright on any materials in the IA any more than the Library of Congress claims copyright over the books that are in there. This is just alarmist.

Ergophobe,

The example I used was imdb.com, now owned by amazon. Their site *claims* copyright. Whether this claim is sustainable does not lessen the chilling effect of the purported claim. The relevant case law on this of course is Feist v. Rural Telephone. The *fact* that a certain actor appeared in a specific film is not open to copyright claim. Yet, imdb.com seeks to claim copyright on their compilation.

archive.org previously offered to *sell* a *licensed* copy of their archive for local usage. They, of course, claim copyright on this compilation. I did not find this on the site just now. A program with similar features is now found at alexa.com. And it is *not* free.

But, archive.org still have:


Can people download sites from the collections?

Our terms of use specify that users of the collections are not to copy data from the collections. If there are special circumstances that you think the Archive should consider, please contact info@.

so the flow is archive.org->alexa.com->amazon.com

I do not object to the presentation or collection of material. I *do* object to the taking of material and subsequently claiming copyright on it.

How long before amazon decides to *fund* and then monetize dmoz.org?