Forum Moderators: open
Well, it showed up again. It took robots.txt, then immediately disobeyed it:
207.241.238.** - - [28/Sep/2005:21:43:26 -0700] "GET /robots.txt HTTP/1.0" 301 243 "-" "InternetArchive/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
207.241.238.** - - [28/Sep/2005:21:43:28 -0700] "GET / HTTP/1.0" 200 1433 "-" "InternetArchive/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
I don't understand why robots.txt would return a 301. Do you?
But the log didn't show it followed any of those 301s. Strange behaviour.
I still can't figure out just what purpose this new bot serves. Just because it's from the Internet Archive does not mean the data is being harvested for the Wayback Machine itself (which is what I object to). I'm sure this company has more than one venture at hand. And I'm still seeing attempts from the infamous ia_archiver.
At first I was thinking they might be trying to figure out how many sites are banning their normal user agent via .htaccess or (using ISAPI_Rewrite) httpd.ini.
Then I had some espresso and realized they could determine that by the response ia_archiver gets from sites that forcibly ban them like I do.
So I still don't know what this thing does. As long as it behaves itself I'll continue to allow it access so that maybe I can figure out what it's up to.
The number one reason would be if you've implemented a non-www-to-www domain redirect (or vice versa), and they asked for robots.txt using the 'wrong' domain. I believe that's the only possible reason that wouldn't be obvious in standard-format access log data.
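For example, a typical canonical-host redirect in .htaccess (just a sketch; the domain is hypothetical, and I'm assuming mod_rewrite here) would also 301 a request for robots.txt made to the 'wrong' hostname:

```apache
# Send every request on a non-www hostname (including /robots.txt)
# to the www host with a 301, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

A bot that resolved the site by IP or bare domain would hit that rule on its very first request, which is exactly the robots.txt fetch.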
So it got a 301 for some reason, but didn't follow it.
Regardless of the cause, if the 'bot can't handle a simple 301, it needs some more work -- Kinda like another 'bot we know... The two of them apparently violate opposite boundary conditions of Einstein's well-known dictum, "Make everything as simple as possible, but no simpler."
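The fix on the bot side isn't hard. A minimal sketch (not Nutch's actual code; the function names and the injected `fetch` callback are hypothetical) of following a bounded redirect chain when fetching robots.txt:

```python
# Sketch of well-behaved robots.txt fetching: follow 301/302 redirects,
# but cap the number of hops so a redirect loop can't hang the crawler.
MAX_REDIRECTS = 5

def fetch_robots(url, fetch):
    """fetch(url) -> (status, payload): payload is the Location header
    for a redirect, or the response body for a 200."""
    for _ in range(MAX_REDIRECTS):
        status, payload = fetch(url)
        if status in (301, 302):
            url = payload          # retry at the redirect target
            continue
        if status == 200:
            return payload         # the robots.txt body
        return None                # 404 etc.: treat as "no robots.txt"
    return None                    # too many hops / redirect loop
```

With this, a request to example.com/robots.txt that 301s to www.example.com/robots.txt would be retried once at the new location before the bot fetches any pages.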
Jim
I was not aware of the bug following redirects when fetching robots.txt files, but, now that you mention it, I can see how this happens. I will try to fix this ASAP.
Thanks,
Doug
Some background: I work part-time for the Internet Archive, using Nutch to index and search content collected by other crawlers (Heritrix and Alexa). Nutch has recently been re-architected to be more easily distributed (modelled after Google's MapReduce and GFS technologies). This crawl was done to test Nutch's new distributed platform. Crawling is a demanding task with many complex failure modes, and thus makes a great test. A successful two-day crawl bodes well for a two-week indexing job; indexing is much more predictable.
So this crawl was done for the Archive, even though its results will not be used by the archive. That's why I used InternetArchive as the user-agent.
Would you have preferred I used a different user-agent?
BTW, why do you object to your content being used by the Wayback Machine? Just curious...
Almost sent a reply to this and then realized your inquiry was meant for keyplr.
Since Gary has replied, I'll add my two-cents.
jdMorgan has always allowed archive.org, and to some extent I agree with his position for allowing the spidering and archiving.
My gripe with archive.org all along, and my reason for denial, has been that there "previously" was a link on their site offering archived data for sale by the terabyte.
I decided to check today for the aforementioned link and was surprised to find that it no longer exists (or at the least, it's very well hidden).
It has me wondering whether it's a total change in policy and methods, or if they are just no longer providing the link but still selling terabytes :)
Don
I used InternetArchive as the user-agent... Would you have preferred I used a different user-agent?
BTW, why do you object to your content being used by the Wayback Machine? Just curious...
Your local librarian will probably disagree, but then again they must be immoral, making a living essentially on somebody else's content...
How is the local librarian making a living when I check a book out of the library?
Simple - they are employed to do their job: they literally make a living from it. If there were no libraries, there would be no librarians.
Most civilized countries have laws ensuring that libraries get content from publishers - this has been in place for decades, if not centuries. Almost the same happens with the search engines - I say almost because they actually drive traffic to your sites, unlike a library, where I can get a book for free in full, read it, and avoid making a purchase.
Search engines can't survive for long without driving traffic; this is a far better bargain than real-world publishers get when they are legally required to send X copies to libraries, where the same copy will be read by many people.
If everyone were so protective of their sites, there would be no Internet as we know it: if the Web were unsearchable, it would be one hell of a place, with just a handful of human-selected sites available to all people. Is this what you want? An even worse monopoly than just a single dominating search engine? You guys are sleepwalking into a monopoly that generates unholy-sized threads every time it changes the algorithm. This is unhealthy for content owners in the first place.
I don't think librarians earn a commission when someone checks out a book.
In addition, and at least in the majority of the US: local libraries, their contents, expenses, and wages are funded by local taxpayers, making them servants of that same tax-paying public.
In the state of my residence, each private telephone line is charged a monthly fee to pay for the internet service (machines and otherwise) that local libraries provide to that very same tax-paying public.
It's a bad idea to compare a public and a publicly supported library to a private website or any other commercial venture.
In addition, there are many states in the US that do not allow access to their digital archives to non-residents of that state. Ohio is one example.
How is this deemed "public access?"
Perhaps the procedures of public libraries are different in the UK and other countries?
Don
It's a bad idea to compare a public and a publicly supported library to a private website or any other commercial venture.
"The Internet Archive is a 501(c)(3) non-profit that was founded to build an ‘Internet library,’ with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format."
[archive.org...]
What more do you want from IA apart from being a good netizen, i.e. obeying robots.txt etc.?
AFAIK, in the USA (and correct me here if I am wrong) the Library of Congress gets a few copies of pretty much anything published in the USA: books, newspapers, etc.
Comparing IA to libraries in general, and to the LoC in particular, is perfectly normal.
Now tell me, what do you have against Google crawling your site to make even more money for the company's shareholders?
They are employed regardless of whether or not anyone checks out a book.
So what? They still make money from somebody else's content! What's so different between you and a book publisher?
It's hard to debate someone who twists others' words and uses faulty logic.
If it's faulty, then demonstrate it, please. I assume here that you don't mind Google (a for-profit company) crawling your site, but you don't like anybody else, including non-profit entities like IA, whose job is nothing but that of a digital library - something protected by laws in most countries. The fact of the matter is that publishers in the real world can't argue against their content being used in libraries -- this surely hits sales. You - an electronic publisher - are in exactly the same position, and it is only the current lack of explicit laws that does not require you to allow bots to crawl your site: if your content is so secret, then take it off the public internet.
[edited by: Lord_Majestic at 11:08 pm (utc) on Nov. 7, 2005]
Please read my 4th, 5th and 6th paragraph in message #19 of this thread.
I read it carefully, but you seem to ignore the simple fact that IA is a non-profit registered charity. Do you object to Google crawling your site and reselling terabytes of data (which include your sites) to AOL and others who pay them big $$$$$$$ for search technology? Do you object to the Library of Congress charging money for some of their services? "Hypocritical" comes to mind.
[edited by: Lord_Majestic at 11:13 pm (utc) on Nov. 7, 2005]