
Forum Moderators: Ocean10000 & incrediBILL


BrandProtect

     

lucy24

1:40 am on Aug 1, 2014 (gmt 0)




File under:
Which part of "Disallow:" did you not understand?

Short version:
158.106.67.181 - - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" 200 885 "http://www.bdbrandprotect.com" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)" 
158.106.67.181 - - [30/Jul/2014:13:12:57 -0700] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" "BPImageWalker/2.0 (www.bdbrandprotect.com)"


That's from my personal site, where the piwik files live. The formulation is
<noscript>
<img src = "http://www.example.com/piwik/piwik.php?idsite=3&rec=1"
et cetera on all pages, hence the single request.

robots.txt on this site says in part:
User-Agent: *
...
Disallow: /piwik
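
For anyone who wants to double-check that the piwik request really falls under that rule, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes only the two rules quoted above (the elided lines are left out), and `is_allowed` is a hypothetical helper name, not anything the robot or the server uses.

```python
from urllib.robotparser import RobotFileParser

# Only the two rules quoted in the post; the elided "..." lines are omitted.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /piwik
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the quoted robots.txt rules permit this fetch."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    ua = "BPImageWalker/2.0 (www.bdbrandprotect.com)"
    url = "http://www.example.com/piwik/piwik.php?idsite=3&rec=1"
    print(is_allowed(ua, url))
```

Under these two rules alone, a compliant robot with any user-agent should not have requested the piwik URL.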

Long version:
robots.txt plus 620 image requests-- including the entire contents of three roboted-out subdirectories-- from my main site.

The MSIE UA was used only for requesting robots.txt. (I serve the same file to everyone.) All other image requests-- i.e. 620 + 1-- are
BPImageWalker/2.0 (www.bdbrandprotect.com)

IP range for the full visit was
158.106.67.128-200 (really).
BrandProtect as a whole is 158.106.64.0/18; the robots stuck to the narrower range.
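
If you would rather block only the range the robots actually used, the sketch below turns 158.106.67.128-200 into CIDR blocks with Python's standard `ipaddress` module and checks that they sit inside the wider /18. The variable names are illustrative only.

```python
import ipaddress

# Observed range for the visit, and the wider range quoted in the post.
first = ipaddress.ip_address("158.106.67.128")
last = ipaddress.ip_address("158.106.67.200")
wider = ipaddress.ip_network("158.106.64.0/18")

# Smallest list of CIDR blocks that covers first..last exactly.
blocks = list(ipaddress.summarize_address_range(first, last))

# Sanity check: every block lies inside the wider range.
all_inside = all(block.subnet_of(wider) for block in blocks)

if __name__ == "__main__":
    for block in blocks:
        print(block)
    print("inside", wider, "->", all_inside)
```

The odd-looking 128-200 span does not fit a single CIDR, so it comes out as three blocks (a /26, a /29 and a /32).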

There is a sister robot called LinkWalker
LinkWalker/3.0 (http://www.brandprotect.com)

that crawls pages. It did its stuff about 2 1/2 hours earlier. Mysteriously this one does seem to honor robots.txt, barring the common initial pattern of

robots.txt 301
/ 301
robots.txt 200
/ 200

meaning that it requested the front page before it had actually seen robots.txt. Apart from that, though, it behaved itself. It did not ask for any css or js.
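
A rough way to spot that pattern in logs is sketched below. It assumes the visit has already been reduced to (path, status) pairs in request order, with robots.txt written as `/robots.txt`; `fetched_before_robots` is a made-up helper name.

```python
from typing import Iterable, List, Tuple


def fetched_before_robots(requests: Iterable[Tuple[str, int]]) -> List[str]:
    """Return paths requested before robots.txt first came back with a 200."""
    seen_robots = False
    early: List[str] = []
    for path, status in requests:
        if path == "/robots.txt":
            # A redirect (301) does not count as having read the rules.
            if status == 200:
                seen_robots = True
            continue
        if not seen_robots:
            early.append(path)
    return early


# The four-request pattern described in the post.
visit = [("/robots.txt", 301), ("/", 301), ("/robots.txt", 200), ("/", 200)]

if __name__ == "__main__":
    print(fetched_before_robots(visit))
```

For the pattern above this flags only the first request for the front page, the one made while robots.txt was still being redirected.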


Since it began its crawl on the front page and I've never met the range before, I don't know what prompted its interest. If I'm only going to see it once in three years, it may not be worth blocking ;)

Pfui

6:21 pm on Aug 1, 2014 (gmt 0)




BrandWatch/BrandProtect is worth blocking because it's violated your rules 600-plus times already. No reason to wait for it to do it again.

Plus BrandWatch's marauded in one form or another for years. From my notes for May - Sept., 2010. Note the verrry subtle UA variation:

By HOST: mail0-brandwatch.brandwatch.net [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

By IP: 94.228.34.237 [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; http://www.brandwatch.net)

Related: "Brandwatch/Magpie-crawler" [webmasterworld.com...]

I block BrandWatch and its variations by Host, CIDR and UA.
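
As an illustration of the CIDR-plus-UA part of that approach (not Pfui's actual rules, which are not shown), here is a minimal Python sketch. The CIDRs and UA names are taken from this thread; `is_blocked` and the list names are hypothetical. Host-based blocking needs a reverse DNS lookup and is left out.

```python
import ipaddress
import re

# Ranges mentioned in this thread: the WHOIS-listed BrandProtect block
# and the CIDR reported for magpie-crawler.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("158.106.67.0/24"),
    ipaddress.ip_network("94.228.34.192/26"),
]

# Match on the robot name only, so the UA variants with and without
# the "+" before the URL are both caught.
BLOCKED_UA = re.compile(r"BPImageWalker|magpie-crawler", re.IGNORECASE)


def is_blocked(ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked range or UA name."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    return bool(BLOCKED_UA.search(user_agent))
```

Checking the IP as well as the UA matters here because the robots.txt request in the first post came in under a generic MSIE user-agent.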

not2easy

7:11 pm on Aug 1, 2014 (gmt 0)




Mac's Network Utilities WHOIS says:
COGECODATA CDSI (NET-158-106-64-0-1)
158.106.64.0 - 158.106.127.255
Brandprotect Inc CDSI-BRANDPROTECT (NET-158-106-67-0-1)
158.106.67.0 - 158.106.67.255

keyplyr

7:48 pm on Aug 1, 2014 (gmt 0)





User-agent: BPImageWalker
Disallow: /

has always worked for me

Pfui

12:44 am on Aug 2, 2014 (gmt 0)




Speak of the devil... A few hours ago, from the same long ago CIDR (94.228.34.192/26):

94.228.34.203
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

At least they asked for and heeded robots.txt this year.

Note: Upstream server farm for above is 4d-dc.com (94.228.32.0/20). [myip.ms...]
 
