
Search Engine Spider and User Agent Identification Forum

    
British Library scraper
bl.uk_lddc_bot/3.1.1
dstiles
msg:4484758
6:17 pm on Aug 14, 2012 (gmt 0)

This has been around for a few months at least, but this is the first time (I think!) I've noticed it. It seems innocuous until you read their remit.

UA: bl.uk_lddc_bot/3.1.1 (+http://www.bl.uk/aboutus/stratpolprog/digi/domresproj/index.html)

IP: 194.66.232.93

(Another IP in the 194.66.232.0/24 range has also made a scrape attempt recently.)
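For anyone who would rather keep it out (or just log it) in the meantime, something like this in .htaccess should do. A minimal sketch only, Apache 2.2 syntax: the UA substring and the /24 are simply the values seen above, so treat them as assumptions that may change.

# Tag requests whose UA contains the BL bot token (case-insensitive)
SetEnvIfNoCase User-Agent "bl\.uk_lddc_bot" bl_bot
# Deny tagged requests and the observed address range; allow everyone else
Order Allow,Deny
Allow from all
Deny from env=bl_bot
Deny from 194.66.232.0/24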

The bot info page linked to in the UA includes:

"The content harvested at this time will not be made available."

and

"This work is undertaken in anticipation of forthcoming Legal Deposit regulations that will make it the Library’s statutory responsibility to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage."

The page does include opt-out and complaints options.

My worry would be: who will ultimately be able to view the details, and how; and can the content easily be scraped from them?

 

tangor
msg:4485381
9:01 am on Aug 16, 2012 (gmt 0)

Another example of why whitelisting is the way to go. Pick your battles, and manage them before the "war" starts.
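For anyone new to the idea, the skeleton of a UA whitelist looks something like this in .htaccess (mod_rewrite). The tokens are purely illustrative, not a recommended list - a real whitelist needs a vetted token list and some thought about blank or faked UAs:

# Forbid any request whose UA matches none of the whitelisted tokens
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp|Mozilla) [NC]
RewriteRule .* - [F]

Note that bl.uk_lddc_bot/3.1.1 contains none of those tokens, so under rules like these it stays out until you explicitly decide to add it.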

jmccormac
msg:4485383
9:12 am on Aug 16, 2012 (gmt 0)

A British Library record for a website might be a very powerful thing if it ever came to taking action over a copyright infringement, but the idea of the BL approaching the web as it does with print is interesting. Possibly one to whitelist.

Regards...jmcc

g1smd
msg:4485471
1:35 pm on Aug 16, 2012 (gmt 0)

I think I'd let that in for sites with factual content that is going to be archived for long-term storage.

Samizdata
msg:4485527
4:26 pm on Aug 16, 2012 (gmt 0)

"This work is undertaken in anticipation of forthcoming Legal Deposit regulations"

It would be interesting to see what the "forthcoming regulations" actually say - when I edited a magazine it was a legal requirement to send copies of each issue to the Legal Deposit libraries (no doubt web publishing will be treated differently).

How the bot is programmed would also be interesting - it should have no business "harvesting" from non-UK servers, but content can be hosted anywhere.

"to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage"

That seems to rule out the vast majority of UK websites.

But they would still have to crawl all of them to decide what is worth preserving.

And it only takes the stroke of a politician's pen to make access a legal requirement.

"My worry would be: who will ultimately be able to view the details, and how; and can the content easily be scraped from them?"

If the project goes ahead I would expect the content to be globally scrapable.

But at least it isn't the Wayback Machine.

...

dstiles
msg:4485585
7:01 pm on Aug 16, 2012 (gmt 0)

Nor is it in the BL database (at least, not at the moment).
