---- British Library scraper


Samizdata - 4:26 pm on Aug 16, 2012 (gmt 0)


"This work is undertaken in anticipation of forthcoming Legal Deposit regulations"

It would be interesting to see what the "forthcoming regulations" actually say - when I edited a magazine it was a legal requirement to send copies of each issue to the Legal Deposit libraries (no doubt web publishing will be treated differently).

How the bot is programmed would also be interesting - it should have no business "harvesting" from non-UK servers, but content can be hosted anywhere.

"to collect, preserve and provide long-term access to the UK’s online intellectual and cultural heritage"

That seems to rule out the vast majority of UK websites.

But they would still have to crawl all of them to decide what is worth preserving.

And it only takes the stroke of a politician's pen to make access a legal requirement.

My worry would be who will ultimately be able to view the details, and how, and whether they can be easily scraped from the archive.

If the project goes ahead I would expect the content to be globally scrapable.

But at least it isn't the Wayback Machine.

...


Thread source: http://www.webmasterworld.com/search_engine_spiders/4484756.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com