netarkivet.dk/webcrawler

heritrix variant

         

keyplyr

12:23 am on Dec 10, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: Mozilla/5.0 (compatible; heritrix/3.3.0 +http://netarkivet.dk/webcrawler/)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: kb.dk (Det Kongelige Bibliotek, the Royal Library)
Parent: Forskningsnettet (Danish Research Network)
130.225.0.0 - 130.226.255.255
130.225.0.0/16, 130.226.0.0/16
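
The start-end range above can be converted to CIDR notation mechanically, which is handy when pasting it into a firewall or server config. A minimal sketch, assuming only Python's standard-library ipaddress module and the range exactly as posted:

```python
import ipaddress

# Range as listed for Forskningsnettet in the post above.
first = ipaddress.ip_address("130.225.0.0")
last = ipaddress.ip_address("130.226.255.255")

# summarize_address_range yields the smallest set of CIDR blocks
# that covers the start-end range exactly.
cidrs = [str(net) for net in ipaddress.summarize_address_range(first, last)]
print(cidrs)  # ['130.225.0.0/16', '130.226.0.0/16']
```

Because 225 is odd, the range cannot collapse into a single /15, so it comes out as two /16 blocks.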

Heritrix has always followed robots.txt directives, but it did not with this UA. Perhaps the netarkivet.dk/webcrawler people have edited this field. They were blocked by other filters, but I found this interesting.
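
For anyone who would rather catch this one with a filter than rely on robots.txt, here is a rough sketch of the idea. The helper name is_netarkivet_crawler, the choice to match on the "heritrix" and "netarkivet.dk" UA tokens, and the two /16 blocks are all illustrative assumptions based on the details in this post, not anything published by netarkivet.dk.

```python
import ipaddress

# Assumption: the two /16 blocks that make up the range quoted in this post.
NETWORKS = [
    ipaddress.ip_network("130.225.0.0/16"),
    ipaddress.ip_network("130.226.0.0/16"),
]

def is_netarkivet_crawler(user_agent: str, remote_ip: str) -> bool:
    """Hypothetical helper: True if a request looks like this heritrix variant."""
    ua = user_agent.lower()
    # Both tokens appear in the UA string reported above.
    ua_match = "heritrix" in ua and "netarkivet.dk" in ua
    ip = ipaddress.ip_address(remote_ip)
    ip_match = any(ip in net for net in NETWORKS)
    # Require both, so other heritrix users and other hosts on the
    # same research network are not flagged.
    return ua_match and ip_match

ua = "Mozilla/5.0 (compatible; heritrix/3.3.0 +http://netarkivet.dk/webcrawler/)"
print(is_netarkivet_crawler(ua, "130.225.1.1"))   # address chosen for illustration
print(is_netarkivet_crawler(ua, "192.0.2.10"))    # documentation address, outside the range
```

Requiring both the UA and the IP match is a design choice. Switch the final `and` to `or` if you want to block the whole range regardless of what UA it sends.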

jonasjacek

7:13 pm on Dec 13, 2016 (gmt 0)

5+ Year Member



This seems to be the Danish version of the bl.uk LDDC bot ([webmasterworld.com]).
On the English website (see [netarkivet.dk]) it says:

Since 2005 the collection and preservation of the Danish part of the internet is included in the Danish Legal Deposit Law. The task is undertaken by the two legal deposit libraries in Denmark, State and University Library and The Royal Library.

Netarchive cannot be accessed by the general public. The archive is only accessible to researchers who have requested and been granted special permission to use the collection for specific research purposes.


Also very interesting:

If you want to know more about NetarchiveSuite, the open source software developed and used by the institutions behind Netarkivet, please check the NetarchiveSuite website ([sbforge.org]).


If you want to see your robots.txt files respected, send a pull request!

keyplyr

11:03 pm on Dec 13, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



seems to be the danish version of bl.uk lddc bot
Good catch