About this spider: this is because the Bibliothèque nationale de Luxembourg is using the open source crawler developed by the Internet Archive. [webarchive.jira.com...]
They say the crawler respects the robots.txt file (except for the home page of a site), but they don't mention how to identify their robot.
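If you want to see what your own robots.txt would tell a crawler once you know its name, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent token `ExampleArchiveBot`, the rules, and the example.com URLs are all made up for illustration, since the real token is exactly what isn't documented. It only evaluates the rules as written; it says nothing about whether a given crawler actually honors them (for instance, the home page exception mentioned above would not show up here).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleArchiveBot" is a made-up token:
# the crawler's real user-agent string is not given anywhere in this thread.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleArchiveBot", "SomeOtherBot"):
        for path in ("/", "/articles/1.html", "/private/x.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

With rules like these, the named bot is blocked everywhere, while any other bot falls back to the `*` group and is only kept out of `/private/`.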
Well I've been in conflict with them for about 10 years. I almost took them to court. I had to send numerous C&Ds and had my attorney serve them.
Sorry to hear about that.
Personally, I had my sites removed from the Wayback Machine smoothly, in 2006 if I'm not mistaken. I just wrote to them asking how to get a site removed, and they told me they only needed proof that I was the owner. This could be done by writing to them from an email address at the domain concerned (I guess you can't get yahoo removed by writing from a yahoo.com address), or by showing them a bill from the registrar. I was even able to get a domain removed that I no longer owned: I showed them I had owned it from xxxx to xxxx, and they removed all its pages from the archive in a matter of days.
Also, their robot obeys my robots.txt. I don't know why, because I often read that it does not. Maybe, since my sites are not hosted in the USA, they have different rules.