
Search Engine Spider and User Agent Identification Forum


 1:40 am on Aug 1, 2014 (gmt 0)

File under:
Which part of "Disallow:" did you not understand?

Short version:
- - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" 200 885 "http://www.bdbrandprotect.com" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"
- - [30/Jul/2014:13:12:57 -0700] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" "BPImageWalker/2.0 (www.bdbrandprotect.com)"

That's from my personal site, where the piwik files live. The formulation is
<img src="http://www.example.com/piwik/piwik.php?idsite=3&rec=1
et cetera on all pages, hence the single request.

robots.txt on this site says in part:
User-Agent: *
Disallow: /piwik
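
For anyone who wants to double-check a rule like this, here is a minimal sketch using Python's standard urllib.robotparser. It assumes only the two robots.txt lines quoted above (the rest of the file isn't shown), and the helper name is_allowed is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the portion of robots.txt quoted in the post; the real file is longer.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /piwik
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Hypothetical helper: may a compliant robot fetch this path?"""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    ua = "BPImageWalker/2.0 (www.bdbrandprotect.com)"
    # The request seen in the log falls under the Disallow rule.
    print(is_allowed(ua, "/piwik/piwik.php?idsite=3&rec=1"))
    # A path outside /piwik is not covered by the quoted rule.
    print(is_allowed(ua, "/index.html"))
```

A robot that honored the file would never have sent the piwik request at all, which is the whole complaint.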

Long version:
robots.txt plus 620 image requests-- including the entire contents of three roboted-out subdirectories-- from my main site.

The MSIE UA was used only for requesting robots.txt. (I serve the same file to everyone.) All other image requests-- i.e. 620 + 1-- are
BPImageWalker/2.0 (www.bdbrandprotect.com)
IP range for the full visit was (really).
BrandProtect as a whole is; the robots stuck to the narrower range.
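
Counting this sort of thing by hand gets old fast. A rough sketch of tallying requests to roboted-out paths per user agent, assuming the usual combined log format, could look like the following. The IP 192.0.2.10 is a documentation placeholder (not the real address), only the /piwik prefix comes from the robots.txt quoted above, and count_violations is a hypothetical name.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Path prefixes that robots.txt disallows; only /piwik is known from the post.
DISALLOWED_PREFIXES = ("/piwik",)


def count_violations(lines, prefixes=DISALLOWED_PREFIXES):
    """Count requests to disallowed paths, grouped by user-agent string."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("path").startswith(prefixes):
            hits[match.group("ua")] += 1
    return hits


# The two log lines from the post, with a placeholder IP filled in.
SAMPLE = [
    '192.0.2.10 - - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" '
    '200 885 "http://www.bdbrandprotect.com" '
    '"Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"',
    '192.0.2.10 - - [30/Jul/2014:13:12:57 -0700] '
    '"GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" '
    '"BPImageWalker/2.0 (www.bdbrandprotect.com)"',
]

if __name__ == "__main__":
    for ua, count in count_violations(SAMPLE).items():
        print(count, ua)
```

Since the robots.txt fetch and the image fetches used different UA strings, grouping by IP instead of UA may be the more useful view on a real log.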

There is a sister robot called LinkWalker
LinkWalker/3.0 (http://www.brandprotect.com)
that crawls pages. It did its stuff about 2 1/2 hours earlier. Mysteriously this one does seem to honor robots.txt, barring the common initial pattern of

robots.txt 301
/ 301
robots.txt 200
/ 200

meaning that it requested the front page before it had actually seen robots.txt. Apart from that, though, it behaved itself. It did not ask for any css or js.
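
The same check can be sketched in a few lines, assuming the visit has already been reduced to (path, status) pairs in request order. The function name and the data layout are illustrative only.

```python
def fetched_before_robots(requests):
    """Return the paths requested before robots.txt first came back 200."""
    early = []
    for path, status in requests:
        if path == "/robots.txt":
            if status == 200:
                break  # the robot has now actually seen the rules
            continue  # a redirected robots.txt request tells it nothing yet
        early.append(path)
    return early


# The four-request pattern described above.
PATTERN = [("/robots.txt", 301), ("/", 301), ("/robots.txt", 200), ("/", 200)]

if __name__ == "__main__":
    print(fetched_before_robots(PATTERN))
```

For the pattern above, the only path that slips through is the front page.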

Since it began its crawl on the front page and I've never met the range before, I don't know what prompted its interest. If I'm only going to see it once in three years, it may not be worth blocking ;)



 6:21 pm on Aug 1, 2014 (gmt 0)

BrandWatch/BrandProtect is worth blocking because it's violated your rules 600-plus times already. No reason to wait for it to do it again.

Plus, BrandWatch has marauded in one form or another for years. From my notes for May - Sept., 2010. Note the verrry subtle UA variation (the "+" before the URL):

By HOST: mail0-brandwatch.brandwatch.net [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

By IP: [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; http://www.brandwatch.net)

Related: "Brandwatch/Magpie-crawler" [webmasterworld.com...]

I block BrandWatch and its variations by Host, CIDR and UA.
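
As a rough illustration of that three-way check (not the poster's actual setup, which isn't shown), here is a sketch in Python. The CIDR below is a documentation placeholder rather than the real range, and the host suffix and UA tokens are assumptions taken from the strings quoted above.

```python
import ipaddress

# Placeholder range (RFC 5737 documentation block), NOT the real CIDR.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
# Assumed from the reverse-DNS host and UA strings quoted in the post.
BLOCKED_HOST_SUFFIXES = (".brandwatch.net",)
BLOCKED_UA_TOKENS = ("magpie-crawler", "brandwatch")


def should_block(ip, host, user_agent):
    """Return True if the request matches any blocked CIDR, host, or UA."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True

    # Reverse-DNS name may be missing; normalize case and a trailing dot.
    host = (host or "").lower().rstrip(".")
    if host.endswith(BLOCKED_HOST_SUFFIXES):
        return True

    # Substring match catches both UA variants (with and without the "+").
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


if __name__ == "__main__":
    print(should_block("198.51.100.7", "mail0-brandwatch.brandwatch.net", "-"))
    print(should_block("198.51.100.7", "example.org", "Mozilla/5.0"))
```

Matching on a UA token rather than the full string is what makes the "+" variation irrelevant.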


 7:11 pm on Aug 1, 2014 (gmt 0)

Mac's Network Utilities WHOIS says:
COGECODATA CDSI (NET-158-106-64-0-1) -
Brandprotect Inc CDSI-BRANDPROTECT (NET-158-106-67-0-1) -


 7:48 pm on Aug 1, 2014 (gmt 0)

User-agent: BPImageWalker
Disallow: /

has always worked for me


 12:44 am on Aug 2, 2014 (gmt 0)

Speak of the devil... A few hours ago, from the same long ago CIDR (
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

At least they asked for and heeded robots.txt this year.

Note: Upstream server farm for above is 4d-dc.com ( [myip.ms...]

