
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


British Library scraper

bl.uk_lddc_bot/3.1.1

6:17 pm on Aug 14, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3134
votes: 4


This has been around for a few months at least, but this is the first time (I think!) I've noticed it. It seems innocuous until you read their remit.

UA: bl.uk_lddc_bot/3.1.1 (+http://www.bl.uk/aboutus/stratpolprog/digi/domresproj/index.html)

IP: 194.66.232.93

(Another IP in the 194.66.232.0/24 range has also made a scrape attempt recently.)
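For anyone wanting to log or filter this crawler, a minimal sketch (Python, using only the UA token and CIDR range quoted above — the double check guards against third parties faking the UA string; the function name is my own):

```python
import ipaddress

# British Library crawler details as reported in this thread
BL_UA_TOKEN = "bl.uk_lddc_bot"
BL_RANGE = ipaddress.ip_network("194.66.232.0/24")

def is_bl_bot(user_agent: str, remote_ip: str) -> bool:
    """True only when both the UA token and the source IP match,
    so a faked UA from elsewhere is not treated as the BL bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address in the log line
    return BL_UA_TOKEN in user_agent and ip in BL_RANGE

print(is_bl_bot("bl.uk_lddc_bot/3.1.1", "194.66.232.93"))  # True
print(is_bl_bot("bl.uk_lddc_bot/3.1.1", "10.0.0.1"))       # False
```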

The bot info page linked to in the UA includes:

"The content harvested at this time will not be made available."

and

"This work is undertaken in anticipation of forthcoming Legal Deposit regulations that will make it the Library’s statutory responsibility to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage."

The page does include opt-out and complaints options.
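If the opt-out works via robots.txt (an assumption — the BL page quoted here doesn't say which mechanism it uses, nor the exact token the bot matches on), a rule like the hypothetical one below would be interpreted as follows by a standards-following parser:

```python
import urllib.robotparser

# Hypothetical robots.txt denying the BL crawler while allowing others;
# the "bl.uk_lddc_bot" token is an assumption based on its UA string.
ROBOTS_TXT = """\
User-agent: bl.uk_lddc_bot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("bl.uk_lddc_bot/3.1.1", "http://example.com/page"))  # False
print(rp.can_fetch("Mozilla/5.0", "http://example.com/page"))           # True
```

Whether the bot actually honours such a rule is, of course, up to the bot.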

My worry would be: who will ultimately be able to view the details and how; and can it be easily scraped from them.
9:01 am on Aug 16, 2012 (gmt 0)

Senior Member from US 

tangor

joined:Nov 29, 2005
posts:7051
votes: 423


Another example of why whitelisting is the way to go. Pick your battles, and manage them before the "war" starts.
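The deny-by-default whitelist approach can be sketched as below; the tokens listed are illustrative assumptions, not a vetted list:

```python
# Deny-by-default UA whitelist: anything not carrying a known-good token
# is refused, so new scrapers need no blacklist entry. Tokens here are
# illustrative only — maintain your own vetted list.
ALLOWED_UA_TOKENS = (
    "Googlebot",
    "bingbot",
    "bl.uk_lddc_bot",  # only if you decide to let the BL crawler in
)

def allow_request(user_agent: str) -> bool:
    """Admit only user agents carrying a whitelisted token."""
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)

print(allow_request("bl.uk_lddc_bot/3.1.1"))   # True
print(allow_request("SomeRandomScraper/0.1"))  # False
```

In practice you would pair the UA check with an IP or reverse-DNS check, since UA strings are trivially forged.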
9:12 am on Aug 16, 2012 (gmt 0)

Senior Member


joined:Aug 30, 2002
posts: 2529
votes: 47


A British Library record for a website might be a very powerful thing if it came to pursuing a copyright infringement, but the idea of the BL approaching the web as it does with print is interesting. Possibly one to whitelist.

Regards...jmcc
1:35 pm on Aug 16, 2012 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


I think I'd let that in for sites with factual content that ought to be archived for long-term storage.
4:26 pm on Aug 16, 2012 (gmt 0)

Senior Member


joined:Aug 29, 2006
posts:1312
votes: 0


"This work is undertaken in anticipation of forthcoming Legal Deposit regulations"

It would be interesting to see what the "forthcoming regulations" actually say - when I edited a magazine it was a legal requirement to send copies of each issue to the Legal Deposit libraries (no doubt web publishing will be treated differently).

How the bot is programmed would also be interesting - it should have no business "harvesting" from non-UK servers, but content can be hosted anywhere.

"to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage"

That seems to rule out the vast majority of UK websites.

But they would still have to crawl all of them to decide what is worth preserving.

And it only takes the stroke of a politician's pen to make access a legal requirement.

"My worry would be: who will ultimately be able to view the details and how; and can it be easily scraped from them."

If the project goes ahead I would expect the content to be globally scrapable.

But at least it isn't the Wayback Machine.

...
7:01 pm on Aug 16, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3134
votes: 4


Nor is it in the BL database (at least, not at the moment).