
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


British Library scraper

bl.uk_lddc_bot/3.1.1

6:17 pm on Aug 14, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3134
votes: 4


This has been around for a few months at least, but this is the first time (I think!) I've noticed it. It seems innocuous until you read their remit.

UA: bl.uk_lddc_bot/3.1.1 (+http://www.bl.uk/aboutus/stratpolprog/digi/domresproj/index.html)

IP: 194.66.232.93

(Another IP in the 194.66.232.0/24 range has also made a scrape attempt recently.)
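For anyone wanting to log or filter this crawler, a minimal sketch (Python, using only the UA token and CIDR range quoted above — the double check guards against third parties faking the UA string; the function name is my own):

```python
import ipaddress

# British Library crawler details as reported in this thread
BL_UA_TOKEN = "bl.uk_lddc_bot"
BL_RANGE = ipaddress.ip_network("194.66.232.0/24")

def is_bl_bot(user_agent: str, remote_ip: str) -> bool:
    """True only when both the UA token and the source IP match,
    so a faked UA from elsewhere is not treated as the BL bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address in the log line
    return BL_UA_TOKEN in user_agent and ip in BL_RANGE

print(is_bl_bot("bl.uk_lddc_bot/3.1.1", "194.66.232.93"))  # True
print(is_bl_bot("bl.uk_lddc_bot/3.1.1", "10.0.0.1"))       # False
```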

The bot info page linked to in the UA includes:

"The content harvested at this time will not be made available."

and

"This work is undertaken in anticipation of forthcoming Legal Deposit regulations that will make it the Library’s statutory responsibility to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage."

The page does include opt-out and complaints options.
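If the opt-out works via robots.txt (an assumption — the BL page quoted here doesn't say which mechanism it uses, nor the exact token the bot matches on), a rule like the hypothetical one below would be interpreted as follows by a standards-following parser:

```python
import urllib.robotparser

# Hypothetical robots.txt denying the BL crawler while allowing others;
# the "bl.uk_lddc_bot" token is an assumption based on its UA string.
ROBOTS_TXT = """\
User-agent: bl.uk_lddc_bot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("bl.uk_lddc_bot/3.1.1", "http://example.com/page"))  # False
print(rp.can_fetch("Mozilla/5.0", "http://example.com/page"))           # True
```

Whether the bot actually honours such a rule is, of course, up to the bot.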

My worry would be: who will ultimately be able to view the details and how; and can it be easily scraped from them.
9:01 am on Aug 16, 2012 (gmt 0)

Senior Member from US 

tangor

joined:Nov 29, 2005
posts:7051
votes: 423


Another example of why whitelisting is the way to go. Pick your battles, and manage them before the "war" starts.
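The deny-by-default whitelist approach can be sketched as below; the tokens listed are illustrative assumptions, not a vetted list:

```python
# Deny-by-default UA whitelist: anything not carrying a known-good token
# is refused, so new scrapers need no blacklist entry. Tokens here are
# illustrative only — maintain your own vetted list.
ALLOWED_UA_TOKENS = (
    "Googlebot",
    "bingbot",
    "bl.uk_lddc_bot",  # only if you decide to let the BL crawler in
)

def allow_request(user_agent: str) -> bool:
    """Admit only user agents carrying a whitelisted token."""
    return any(token in user_agent for token in ALLOWED_UA_TOKENS)

print(allow_request("bl.uk_lddc_bot/3.1.1"))   # True
print(allow_request("SomeRandomScraper/0.1"))  # False
```

In practice you would pair the UA check with an IP or reverse-DNS check, since UA strings are trivially forged.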
9:12 am on Aug 16, 2012 (gmt 0)

Senior Member


joined:Aug 30, 2002
posts: 2529
votes: 47


A British Library record for a website might be a very powerful thing if it came to pursuing a copyright infringement, but the idea of the BL approaching the web as it does with print is interesting. Possibly one to whitelist.

Regards...jmcc
1:35 pm on Aug 16, 2012 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


I think I'd let that in for sites with factual content that ought to be archived for long-term storage.
4:26 pm on Aug 16, 2012 (gmt 0)

Senior Member


joined:Aug 29, 2006
posts:1312
votes: 0


"This work is undertaken in anticipation of forthcoming Legal Deposit regulations"

It would be interesting to see what the "forthcoming regulations" actually say - when I edited a magazine it was a legal requirement to send copies of each issue to the Legal Deposit libraries (no doubt web publishing will be treated differently).

How the bot is programmed would also be interesting - it should have no business "harvesting" from non-UK servers, but content can be hosted anywhere.

"to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage"

That seems to rule out the vast majority of UK websites.

But they would still have to crawl all of them to decide what is worth preserving.

And it only takes the stroke of a politician's pen to make access a legal requirement.

"My worry would be: who will ultimately be able to view the details and how; and can it be easily scraped from them."

If the project goes ahead I would expect the content to be globally scrapable.

But at least it isn't the Wayback Machine.

...
7:01 pm on Aug 16, 2012 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3134
votes: 4


Nor is it in the BL database (at least, not at the moment).