
Search Engine Spider and User Agent Identification Forum

    
British Library bot
UK law now requires British Library to harvest UK web sites.
dstiles




msg:4569424
 8:25 pm on Apr 30, 2013 (gmt 0)

This is as of 6th April this year. I first saw the crawler about seven days ago (i.e. around the 24th).

IP range: 194.66.224.0 - 194.66.239.255
Bot IPs seen so far are in the range: 194.66.232.84 - 194.66.232.93 but that will no doubt be extended.
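
For anyone wanting to test log entries against these addresses, here is a minimal Python sketch. It relies on the fact that 194.66.224.0 - 194.66.239.255 is the same span as the CIDR block 194.66.224.0/20, and it assumes IPv4 address strings as input. The names `BL_NETWORK`, `BL_SEEN_FIRST`, `BL_SEEN_LAST` and both functions are made up for illustration; they are not anything the British Library publishes.

```python
import ipaddress

# Full range quoted in the post: 194.66.224.0 - 194.66.239.255 (a /20).
BL_NETWORK = ipaddress.ip_network("194.66.224.0/20")

# Narrower span in which the bot has actually been seen so far.
# Hypothetical constants; expect this span to grow.
BL_SEEN_FIRST = ipaddress.ip_address("194.66.232.84")
BL_SEEN_LAST = ipaddress.ip_address("194.66.232.93")


def in_bl_range(ip: str) -> bool:
    """True if the IPv4 address lies anywhere in the quoted range."""
    return ipaddress.ip_address(ip) in BL_NETWORK


def in_seen_bot_range(ip: str) -> bool:
    """True if the IPv4 address lies in the span observed so far."""
    return BL_SEEN_FIRST <= ipaddress.ip_address(ip) <= BL_SEEN_LAST
```

Checking against the whole /20 is the more durable test, since the observed span will probably be extended; the narrower check is only useful for spotting visits from elsewhere in the range.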

Today's UA: Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.6.0 Safari/534.34

I do not think that is the genuine bot IP; possibly someone looking to see why the bot is blocked. An earlier UA was:

bl.uk_lddc_bot/3.1.1 (+http :// www.bl.uk / aboutus / legaldeposit / websites / websites / faqswebmaster / index.html)

(link broken up by me)
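
A rough Python sketch for picking that UA out of a log line follows. It assumes the product token is always `bl.uk_lddc_bot` followed by a slash and a dotted version number, which is only what the single example above shows; `BL_UA_PATTERN` and `bl_bot_version` are hypothetical names.

```python
import re

# Matches the leading product token seen above, e.g. "bl.uk_lddc_bot/3.1.1".
# Assumption: the version is a run of dot-separated digits.
BL_UA_PATTERN = re.compile(r"^bl\.uk_lddc_bot/(\d+(?:\.\d+)*)")


def bl_bot_version(user_agent: str):
    """Return the version string if the UA looks like the BL bot, else None."""
    match = BL_UA_PATTERN.match(user_agent)
    return match.group(1) if match else None
```

A UA string is trivial to fake, so a match here proves nothing on its own; it is best combined with a check against the IP range given above.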

It's worth reading the legal web page. It claims the RIGHT to harvest ALL UK-based web content, which has annoyed one of my clients who, although hosting in the UK, was specifically told about 15 years ago that he should not trade with UK citizens.

There is an option to block through robots.txt, but if that's obeyed then surely it negates their mandate? They also say we can block by IP. Hmm. But then, this is UK bureaucracy, which hasn't yet caught up with modern technology - i.e. later than 1950.
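
As a sketch of the robots.txt route, the Python below feeds a candidate rule set to the standard library's robots.txt parser to see how a compliant crawler would read it. The token `bl.uk_lddc_bot` is an assumption taken from the UA above; the BL's own page should be checked for the name the crawler actually honours. `ROBOTS_TXT`, `allowed` and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Candidate rules: shut out the BL bot, leave everyone else alone.
# Assumption: the crawler matches on the token "bl.uk_lddc_bot".
ROBOTS_TXT = """\
User-agent: bl.uk_lddc_bot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str) -> bool:
    """Report whether these rules permit the given UA to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("bl.uk_lddc_bot/3.1.1", "http://example.com/page.html"))
    print(allowed("Mozilla/5.0", "http://example.com/page.html"))
```

This only shows what the rules say; whether the bot obeys them is a separate matter, and a server-side block by IP does not depend on its cooperation.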

Currently blocked but clients canvassed as to what they want done; though I suspect we will have to comply. :(

 

jmccormac




msg:4569559
 7:21 am on May 1, 2013 (gmt 0)

This is going to be fun. Does the BL have the resources to spider large (>100M pages) websites?

Regards...jmcc

dstiles




msg:4569668
 1:56 pm on May 1, 2013 (gmt 0)

It has been pointed out to me...

I do not think that is the genuine bot IP

should read

I do not think that is the genuine bot UA

Thanks, Lucy. :)
