Forum Moderators: open
Well, it showed up again. It took robots.txt, then immediately disobeyed it:
207.241.238.** - - [28/Sep/2005:21:43:26 -0700] "GET /robots.txt HTTP/1.0" 301 243 "-" "InternetArchive/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
207.241.238.** - - [28/Sep/2005:21:43:28 -0700] "GET / HTTP/1.0" 200 1433 "-" "InternetArchive/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
I don't understand why robots.txt would return a 301. Do you?
But the log didn't show it followed any of those 301s. Strange behaviour.
I still can't figure out just what purpose this new bot serves. Just because it's from the Internet Archive does not mean the data is being harvested for the Wayback Machine itself (which is what I object to). I'm sure this company has more than one venture at hand. And I'm still seeing attempts from the infamous ia_archiver.
At first I was thinking they might be trying to figure out how many sites are banning their normal user agent via .htaccess or (using ISAPI_Rewrite) httpd.ini.
Then I had some espresso and realized they could determine that by the response ia_archiver gets from sites that forcibly ban them like I do.
So I still don't know what this thing does. As long as it behaves itself I'll continue to allow it access so that maybe I can figure out what it's up to.
The number one reason would be if you've implemented a non-www-to-www domain redirect (or vice versa), and they asked for robots.txt using the 'wrong' domain. I believe that's the only possible reason that wouldn't be obvious in standard-format access log data.
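For example, a typical canonical-host redirect in .htaccess (just a sketch; the domain is hypothetical, and I'm assuming mod_rewrite here) would also 301 a request for robots.txt made to the 'wrong' hostname:

```apache
# Send every request on a non-www hostname (including /robots.txt)
# to the www host with a 301, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

A bot that resolved the site by IP or bare domain would hit that rule on its very first request, which is exactly the robots.txt fetch.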
So it got a 301 for some reason, but didn't follow it.
Regardless of the cause, if the 'bot can't handle a simple 301, it needs some more work -- Kinda like another 'bot we know... The two of them apparently violate opposite boundary conditions of Einstein's well-known dictum, "Make everything as simple as possible, but no simpler."
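The fix on the bot side isn't hard. A minimal sketch (not Nutch's actual code; the function names and the injected `fetch` callback are hypothetical) of following a bounded redirect chain when fetching robots.txt:

```python
# Sketch of well-behaved robots.txt fetching: follow 301/302 redirects,
# but cap the number of hops so a redirect loop can't hang the crawler.
MAX_REDIRECTS = 5

def fetch_robots(url, fetch):
    """fetch(url) -> (status, payload): payload is the Location header
    for a redirect, or the response body for a 200."""
    for _ in range(MAX_REDIRECTS):
        status, payload = fetch(url)
        if status in (301, 302):
            url = payload          # retry at the redirect target
            continue
        if status == 200:
            return payload         # the robots.txt body
        return None                # 404 etc.: treat as "no robots.txt"
    return None                    # too many hops / redirect loop
```

With this, a request to example.com/robots.txt that 301s to www.example.com/robots.txt would be retried once at the new location before the bot fetches any pages.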
Jim
I was not aware of the bug following redirects when fetching robots.txt files, but, now that you mention it, I can see how this happens. I will try to fix this ASAP.
Thanks,
Doug
Some background: I work part-time for the Internet Archive, using Nutch to index and search content collected by other crawlers (Heritrix and Alexa). Nutch has recently been re-architected to be more easily distributed (modelled after Google's MapReduce and GFS technologies). This crawl was done to test Nutch's new distributed platform. Crawling is a demanding task with many complex failure modes, and thus makes a great test. A successful two-day crawl bodes well for a two-week indexing job; indexing is much more predictable.
So this crawl was done for the Archive, even though its results will not be used by the archive. That's why I used InternetArchive as the user-agent.
Would you have preferred I used a different user-agent?
BTW, why do you object to your content being used by the Wayback Machine? Just curious...
Almost sent a reply to this and then realized your inquiry was meant for keyplr.
Since Gary has replied, I'll add my two-cents.
jdMorgan has always allowed archive.org, and to some extent I agree with his position for allowing the spidering and archiving.
My gripe with archive.org all along, and my reason for denial, has been that there "previously" was a link on their site offering archived data for sale by the terabyte.
I decided to check today for the aforementioned link and was surprised to find that it no longer exists (or at the least, it's very well hidden).
It has me wondering whether it's a total change in policy and methods, or if they are just no longer providing the link but still selling terabytes :)
Don
I used InternetArchive as the user-agent... Would you have preferred I used a different user-agent?
BTW, why do you object to your content being used by the Wayback Machine? Just curious...
Your local librarian will probably disagree, but then again they must be immoral, making a living essentially on somebody else's content...
How is the local librarian making a living when I check a book out of the library?
Simple - they are employed to do their job: they literally make a living from it. If there were no libraries, there would be no librarians.
Most civilized countries have laws ensuring that libraries get content from publishers - this has been in place for decades, if not centuries. Almost the same happens with the search engines - I say almost because they actually drive traffic to your sites, unlike a library, where I can get a book for free in full, read it, and avoid making a purchase.
Search engines can't survive for long without driving traffic; this is a far better bargain than real-world publishers get when they are legally required to send X copies to libraries, where the same copy will be read by many people.
If everyone were so protective of their sites, there would be no Internet as we know it: if the Web were unsearchable, it would be one hell of a place, with just a handful of human-selected sites available to all people. Is this what you want? An even worse monopoly than just a single dominating search engine? You guys are sleepwalking into a monopoly that generates unholy-sized threads every time it changes the algorithm. This is unhealthy for content owners in the first place.
I don't think librarians earn a commission when someone checks out a book.
In addition, and at least in the majority of the US: local libraries, their contents, expenses, and wages are funded by local taxpayers, making them servants of that same tax-paying public.
In the state of my residence, each private telephone line is charged a monthly fee to pay for the internet service (machines and otherwise) that local libraries provide to that very same tax-paying public.
It's a bad idea to compare a public and a publicly supported library to a private website or any other commercial venture.
In addition, there are many states in the US that do not allow access to their digital archives to non-residents of that state. Ohio is one example.
How is this deemed "public access?"
Perhaps the procedures of public libraries are different in the UK and other countries?
Don
It's a bad idea to compare a public and a publicly supported library to a private website or any other commercial venture.
"The Internet Archive is a 501(c)(3) non-profit that was founded to build an ‘Internet library,’ with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format."
[archive.org...]
What more do you want from IA apart from being a good netizen, i.e. obeying robots.txt etc.?
AFAIK, in the USA (and correct me here if I am wrong) the Library of Congress gets a few copies of pretty much anything published in the USA: books, newspapers, etc.
Comparing IA to libraries in general, and to the LoC in particular, is perfectly normal.
Now tell me, what do you have against Google crawling your site to make even more money for the company's shareholders?
They are employed regardless of whether or not anyone checks out a book.
So what? They still make money from somebody else's content! What's so different between you and a book publisher?
It's hard to debate someone who twists others' words and uses faulty logic.
If it's faulty, then demonstrate it, please. I assume here that you don't mind Google (a for-profit company) crawling your site, but you don't like anybody else, including non-profit entities like IA, whose job is nothing but that of a digital library - something protected by laws in most countries. The fact of the matter is that publishers in the real world can't argue against their content being used in libraries -- this surely hits sales. You - an electronic publisher - are in exactly the same position, and it is only the current lack of explicit laws that does not require you to allow bots to crawl your site: if your content is so secret, then take it off the public internet.
[edited by: Lord_Majestic at 11:08 pm (utc) on Nov. 7, 2005]
Please read my 4th, 5th and 6th paragraph in message #19 of this thread.
I read it carefully, but you seem to ignore the simple fact that IA is a non-profit registered charity. Do you object to Google crawling your site and reselling terabytes of data (which include your sites) to AOL and others who pay them big $$$$$$$ for search technology? Do you object to the Library of Congress charging money for some of their services? "Hypocritical" comes to mind.
[edited by: Lord_Majestic at 11:13 pm (utc) on Nov. 7, 2005]