

Search Engine Spider and User Agent Identification Forum

    
BrandProtect
lucy24




 
Msg#: 4692166 posted 1:40 am on Aug 1, 2014 (gmt 0)

File under:
Which part of "Disallow:" did you not understand?

Short version:
158.106.67.181 - - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" 200 885 "http://www.bdbrandprotect.com" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"
158.106.67.181 - - [30/Jul/2014:13:12:57 -0700] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" "BPImageWalker/2.0 (www.bdbrandprotect.com)"


That's from my personal site, where the piwik files live. The formulation is
<noscript>
<img src = "http://www.example.com/piwik/piwik.php?idsite=3&rec=1"
et cetera on all pages, hence the single request.

robots.txt on this site says in part:
User-Agent: *
...
Disallow: /piwik
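
Editorial aside: one way to confirm that the quoted rule covers the piwik request is Python's standard-library robots.txt parser. This is only a sketch under an assumption: it feeds in just the two rules quoted above, while the real file has more lines (the "..."), so it shows how a compliant parser reads these rules, not how the whole file behaves.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the post; the elided lines ("...") are left out.
rules = """\
User-Agent: *
Disallow: /piwik
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# User-agent string as it appears in the access log.
ua = "BPImageWalker/2.0 (www.bdbrandprotect.com)"

# The request from the log falls under "Disallow: /piwik".
print(parser.can_fetch(ua, "/piwik/piwik.php?idsite=3&rec=1"))  # False

# A path outside /piwik is allowed by these two rules alone.
print(parser.can_fetch(ua, "/index.html"))  # True
```

A robot that honored the file would therefore never have requested piwik.php.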

Long version:
robots.txt plus 620 image requests-- including the entire contents of three roboted-out subdirectories-- from my main site.

The MSIE UA was used only for requesting robots.txt. (I serve the same file to everyone.) All the other requests-- i.e. the 620 + 1 image requests-- are
BPImageWalker/2.0 (www.bdbrandprotect.com)
IP range for the full visit was
158.106.67.128-200 (really).
BrandProtect as a whole is 158.106.64.0/18; the robots stuck to the narrower range.
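
Editorial aside: for anyone who wants to pull such visits out of their own logs, here is a minimal sketch that parses one of the log lines above and checks the IP against the ranges mentioned. The regular expression assumes the usual Apache combined log format, and the name `LOG_RE` is made up for this illustration.

```python
import ipaddress
import re

# Assumed layout (Apache "combined"):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The second log line quoted in the post.
line = ('158.106.67.181 - - [30/Jul/2014:13:12:57 -0700] '
        '"GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" '
        '"BPImageWalker/2.0 (www.bdbrandprotect.com)"')

m = LOG_RE.match(line)
ip = ipaddress.ip_address(m["ip"])

# The narrower range seen during the visit: 158.106.67.128-200.
first = ipaddress.ip_address("158.106.67.128")
last = ipaddress.ip_address("158.106.67.200")
in_observed = first <= ip <= last

# The wider /18 block named in the post.
in_block = ip in ipaddress.ip_network("158.106.64.0/18")

print(m["agent"], in_observed, in_block)
```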

There is a sister robot called LinkWalker
LinkWalker/3.0 (http://www.brandprotect.com)
that crawls pages. It did its stuff about 2 1/2 hours earlier. Mysteriously this one does seem to honor robots.txt, barring the common initial pattern of

robots.txt 301
/ 301
robots.txt 200
/ 200

meaning that it requested the front page before it had actually seen robots.txt. Apart from that, though, it behaved itself. It did not ask for any css or js.


Since it began its crawl on the front page and I've never met the range before, I don't know what prompted its interest. If I'm only going to see it once in three years, it may not be worth blocking ;)

 

Pfui




 
Msg#: 4692166 posted 6:21 pm on Aug 1, 2014 (gmt 0)

BrandWatch/BrandProtect is worth blocking because it's violated your rules 600-plus times already. No reason to wait for it to do it again.

Plus BrandWatch's marauded in one form or another for years. From my notes for May - Sept., 2010. Note the verrry subtle UA variation:

By HOST: mail0-brandwatch.brandwatch.net [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

By IP: 94.228.34.237 [projecthoneypot.org...]
magpie-crawler/1.1 (U; Linux amd64; en-GB; http://www.brandwatch.net)

Related: "Brandwatch/Magpie-crawler" [webmasterworld.com...]

I block BrandWatch and its variations by Host, CIDR and UA.
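
Editorial aside: the post does not say how the blocking is configured, so the following is only a sketch of the three-way test (CIDR, UA, Host) in Python, not the poster's actual setup. The function `is_blocked` and the three lists are hypothetical names; the CIDR, UA token and host name are the ones quoted in this thread. The host name is passed in by the caller, so no reverse DNS lookup happens here.

```python
import ipaddress

# Values quoted in this thread; the list names are hypothetical.
BLOCKED_NETWORKS = [ipaddress.ip_network("94.228.34.192/26")]
BLOCKED_UA_TOKENS = ["magpie-crawler"]
BLOCKED_HOST_SUFFIXES = ["brandwatch.net"]


def is_blocked(ip, user_agent, host=None):
    """Return True if the request matches any CIDR, UA or host rule."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True

    # Substring match, so both UA variants (with and without "+") are caught.
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return True

    # Host is optional: match the domain itself or any subdomain of it.
    if host:
        h = host.lower()
        if any(h == s or h.endswith("." + s) for s in BLOCKED_HOST_SUFFIXES):
            return True
    return False


print(is_blocked("94.228.34.237", "Mozilla/5.0"))  # True
print(is_blocked("203.0.113.5", "Mozilla/5.0"))    # False
```

Checking all three independently means a crawler that changes its IP range but keeps its UA, or the reverse, is still caught.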

not2easy




 
Msg#: 4692166 posted 7:11 pm on Aug 1, 2014 (gmt 0)

Mac's Network Utilities WHOIS says:
COGECODATA CDSI (NET-158-106-64-0-1)
158.106.64.0 - 158.106.127.255
Brandprotect Inc CDSI-BRANDPROTECT (NET-158-106-67-0-1)
158.106.67.0 - 158.106.67.255
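
Editorial aside: WHOIS gives start and end addresses, while block lists usually want CIDR notation. A small sketch of the conversion with Python's `ipaddress` module follows; the helper name `range_to_cidrs` is made up for this example.

```python
import ipaddress


def range_to_cidrs(first, last):
    """Convert an inclusive first-last IPv4 range into CIDR blocks."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]


# The two WHOIS ranges above.
print(range_to_cidrs("158.106.64.0", "158.106.127.255"))  # ['158.106.64.0/18']
print(range_to_cidrs("158.106.67.0", "158.106.67.255"))   # ['158.106.67.0/24']

# The narrower range actually seen in the logs needs three blocks.
print(range_to_cidrs("158.106.67.128", "158.106.67.200"))
# ['158.106.67.128/26', '158.106.67.192/29', '158.106.67.200/32']
```

Going by this WHOIS output, the /18 is the upstream allocation and the /24 is the part registered to BrandProtect.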

keyplyr




 
Msg#: 4692166 posted 7:48 pm on Aug 1, 2014 (gmt 0)


User-agent: BPImageWalker
Disallow: /

has always worked for me

Pfui




 
Msg#: 4692166 posted 12:44 am on Aug 2, 2014 (gmt 0)

Speak of the devil... A few hours ago, from the same long ago CIDR (94.228.34.192/26):

94.228.34.203
magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

At least they asked for and heeded robots.txt this year.

Note: Upstream server farm for above is 4d-dc.com (94.228.32.0/20). [myip.ms...]
