
Forum Moderators: Ocean10000


NLUX IAHarvester

New UA for Internet Archive

     
8:28 pm on Jun 24, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


UA: Mozilla/5.0 (compatible; NLUX_IAHarvester/3.3.0 +http://crawl.bnl.lu/)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: us.archive.org (Internet Archive)
207.241.224.0 - 207.241.239.255
207.241.224.0/20

They must think if they keep changing their UA, they'll be able to scrape our properties without our notice.
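For anyone who wants to test log entries against the range and UA listed above, here is a minimal sketch in Python. The names `IA_NETWORK`, `UA_TOKEN`, `is_ia_harvester` and `range_bounds` are made up for illustration, and matching on the `NLUX_IAHarvester` substring assumes that token stays stable across versions, which nothing here confirms.

```python
import ipaddress

# Range quoted in the post above for us.archive.org.
IA_NETWORK = ipaddress.ip_network("207.241.224.0/20")

# Substring taken from the quoted UA. Assumption: the token itself
# does not change when the version number does.
UA_TOKEN = "NLUX_IAHarvester"


def range_bounds(network):
    """Return the first and last address of a network as strings."""
    return str(network[0]), str(network[-1])


def is_ia_harvester(remote_ip, user_agent):
    """True if the IP is inside the quoted range or the UA carries the token."""
    try:
        in_range = ipaddress.ip_address(remote_ip) in IA_NETWORK
    except ValueError:
        # Malformed address in the log line: treat as not in range.
        in_range = False
    return in_range or UA_TOKEN.lower() in user_agent.lower()


if __name__ == "__main__":
    # Confirms the dotted range and the /20 notation describe the same block.
    print(range_bounds(IA_NETWORK))
    print(is_ia_harvester("207.241.230.5", "anything"))
```

Checking both the IP and the UA is deliberate: the UA keeps changing, so the range is the more durable signal, while the token catches requests from other hosts.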
8:33 pm on June 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2663
votes: 112


Looks like Archive.org is doing some National Library crawling. There's an initiative among national libraries to crawl their local web for reference purposes. Perhaps your site was on a Luxembourgish IP range or was categorised as being related to Luxembourg.

Regards...jmcc
8:40 pm on June 24, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Well I've been in conflict with them for about 10 years. I almost took them to court. I had to send numerous C&Ds and had my attorney serve them.

Finally they removed my property from their so-called "Wayback Machine." Soon after that they installed a mechanism to remove your property from their server, but it was virtually ignored for a couple more years. IMO it was more for "show."

For the next few years I would occasionally find my pages again copied to their server, each time serving them with removal notices.

After the 2016 US elections, they moved their server from San Francisco to Canada and abroad. Now they have registered IP ranges I can block. Still they do a drive-by every now and then to see if they can sneak in.
8:49 pm on June 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2663
votes: 112


Their methodology for the national library surveys/scrapes isn't very good and isn't very precise. The main focus is on the local ccTLDs and the problem is that the people looking at the end product are librarians rather than Search heads. Thus they get an awful lot of rubbish (parking/holding pages etc, and hacked pages and sites) mixed in with the searches. Like most Infinite Monkeys based surveys, they follow links so if your site showed up in some LU Dmoz clone, it was following links from it.

Regards...jmcc
8:50 pm on June 24, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1194
votes: 288


About this spider: it shows up because the Bibliothèque nationale de Luxembourg is using the open source crawler developed by the Internet Archive. [webarchive.jira.com...]

They say the crawler respects the robots.txt file (except for the home page of a site), but they don't mention how to identify their robots.
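Since the robots.txt token is not documented, a rule has to guess at it. A rough sketch with Python's standard `urllib.robotparser`, assuming the crawler answers to the `NLUX_IAHarvester` name from its UA string (an assumption, not something they confirm); `ASSUMED_TOKEN`, `ROBOTS_TXT` and `build_parser` are illustrative names, and `example.com` is a placeholder. This only shows how a compliant parser reads the rules, not what the crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the robots.txt token matches the name in the UA string.
ASSUMED_TOKEN = "NLUX_IAHarvester"

# Block the assumed token everywhere, leave everyone else unrestricted.
ROBOTS_TXT = f"""\
User-agent: {ASSUMED_TOKEN}
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # How a compliant crawler using this token would read the rules.
    print(parser.can_fetch(ASSUMED_TOKEN, "http://example.com/page.html"))
    # Any other agent falls through to the "*" group.
    print(parser.can_fetch("SomeOtherBot", "http://example.com/page.html"))
```

If the token guess is wrong, the rule simply never matches, so it is worth pairing with an IP or UA check on the server side.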

"Well I've been in conflict with them for about 10 years. I almost took them to court. I had to send numerous C&Ds and had my attorney serve them."

Sorry to hear about that.

Personally, I had my sites removed from the Wayback Machine smoothly, in 2006 if I'm not mistaken. I just wrote to them asking how to get a site removed. They told me they just needed proof that I was the owner. This could be done by writing to them from an email address at the domain concerned (I guess you can't remove yahoo by writing from a yahoo.com email), or by showing them a bill from the registrar. I was even able to remove a domain name I no longer owned: I showed them I owned it from xxxx to xxxx, and they removed all pages from the archive in a matter of days.

Also, their robot obeys my robots.txt. I don't know why, because I often read that it does not. Maybe, since my sites are not hosted in the USA, they have different rules.
8:54 pm on June 24, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Looking back through my records, they first infringed my property in 2003 before they actually established themselves as a non-profit.