NLUX IAHarvester

New UA for Internet Archive

keyplyr

8:28 pm on Jun 24, 2018 (gmt 0)

UA: Mozilla/5.0 (compatible; NLUX_IAHarvester/3.3.0 +http://crawl.bnl.lu/)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: us.archive.org (Internet Archive)
207.241.224.0 - 207.241.239.255
207.241.224.0/20
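For anyone who wants to test log entries against that range, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_reported_range` and the sample addresses are made up for illustration; only the 207.241.224.0/20 block comes from the post above.

```python
import ipaddress

# Block quoted in the post; the dotted range and the CIDR describe the same addresses.
IA_NETWORK = ipaddress.ip_network("207.241.224.0/20")

def in_reported_range(ip: str) -> bool:
    """Return True if the given IPv4 address falls inside 207.241.224.0/20."""
    return ipaddress.ip_address(ip) in IA_NETWORK

# First and last address of the block, to compare against the dotted range above.
first, last = IA_NETWORK[0], IA_NETWORK[-1]

if __name__ == "__main__":
    print(first, "-", last)
    # Hypothetical sample addresses, not taken from any real log.
    for sample in ("207.241.230.17", "207.241.240.1"):
        print(sample, in_reported_range(sample))
```

This only answers whether an address is in the block; how to act on a match (firewall rule, server config, and so on) depends on your own setup.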

They must think that if they keep changing their UA, they'll be able to scrape our properties without our noticing.

jmccormac

8:33 pm on Jun 24, 2018 (gmt 0)

Looks like Archive.org is doing some National Library crawling. There's an initiative among national libraries to crawl their local web for reference purposes. Perhaps your site was on a Luxembourgish IP range or was categorised as being related to Luxembourg.

Regards...jmcc

keyplyr

8:40 pm on Jun 24, 2018 (gmt 0)

Well I've been in conflict with them for about 10 years. I almost took them to court. I had to send numerous C&Ds and had my attorney serve them.

Finally they removed my property from their so-called "Wayback Machine." Soon after that they installed a mechanism to remove your property from their server, but it was virtually ignored for a couple more years. IMO it was more for "show."

For the next few years I would occasionally find my pages again copied to their server, each time serving them with removal notices.

After the 2016 US elections, they moved their server from San Francisco to Canada and abroad. Now they have registered IP ranges I can block. Still they do a drive-by every now and then to see if they can sneak in.

jmccormac

8:49 pm on Jun 24, 2018 (gmt 0)

Their methodology for the national library surveys/scrapes isn't very good and isn't very precise. The main focus is on the local ccTLDs and the problem is that the people looking at the end product are librarians rather than Search heads. Thus they get an awful lot of rubbish (parking/holding pages etc, and hacked pages and sites) mixed in with the searches. Like most Infinite Monkeys based surveys, they follow links so if your site showed up in some LU Dmoz clone, it was following links from it.

Regards...jmcc

Dimitri

8:50 pm on Jun 24, 2018 (gmt 0)

About this spider: this is because the Bibliothèque nationale de Luxembourg is using the open source crawler developed by the Internet Archive. [webarchive.jira.com...]

They say the crawler respects the robots.txt file (except for the home page of a site), but they don't mention how to identify their robots.
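Since the token to use isn't documented, the sketch below simply assumes the crawler matches on `NLUX_IAHarvester`, taken from the UA string reported at the top of the thread. The robots.txt content and the example.com URLs are hypothetical. It uses Python's standard `urllib.robotparser` to check what such a rule set would allow, which says nothing about what the crawler actually does (and by their own account the home page may be fetched regardless).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The user-agent token is an assumption based on the
# UA string in the thread; the crawler's real matching token is not documented.
ROBOTS_TXT = """\
User-agent: NLUX_IAHarvester
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# True if the rules above disallow the assumed token from an ordinary page.
blocked = not parser.can_fetch("NLUX_IAHarvester", "https://example.com/page.html")

# Other bots fall through to the wildcard entry and are only kept out of /private/.
others_ok = parser.can_fetch("SomeOtherBot", "https://example.com/page.html")

if __name__ == "__main__":
    print("NLUX_IAHarvester blocked:", blocked)
    print("Other bots allowed:", others_ok)
```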

"Well I've been in conflict with them for about 10 years. I almost took them to court. I had to send numerous C&Ds and had my attorney serve them."

Sorry to hear about that.

Personally, I had my sites removed from the Wayback Machine smoothly, in 2006 if I'm not mistaken. I just wrote to them asking how to get a site removed. They told me they just needed proof that I was the owner, which could be provided by writing to them from an email address at the domain name concerned (I guess you can't remove yahoo by writing from a yahoo.com email), or by showing them a bill from the registrar. I was even able to remove a domain name I no longer owned: I showed them I had owned it from xxxx to xxxx, and they removed all pages from the archive in a matter of days.

Also, their robot obeys my robots.txt. I don't know why, because I often read that they do not. Maybe, since my sites are not hosted in the USA, they have different rules.

keyplyr

8:54 pm on Jun 24, 2018 (gmt 0)

Looking back through my records, they first infringed my property in 2003 before they actually established themselves as a non-profit.