File under: Which part of "Disallow:" did you not understand?
18.104.22.168 - - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" 200 885 "http://www.bdbrandprotect.com" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"
22.214.171.124 - - [30/Jul/2014:13:12:57 -0700] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" "BPImageWalker/2.0 (www.bdbrandprotect.com)"
That's from my personal site, where the piwik files live. The formulation is
<img src = "http://www.example.com/piwik/piwik.php?idsite=3&rec=1
et cetera on all pages, hence the single request.
robots.txt on this site says in part:
From my main site: robots.txt plus 620 image requests, including the entire contents of three roboted-out subdirectories.
The MSIE UA was used only for requesting robots.txt. (I serve the same file to everyone.) All other requests, i.e. the 620 + 1 image requests, came with the BPImageWalker/2.0 UA seen in the log lines above.
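For anyone who wants to run the same check on their own logs, here is a rough Python sketch that tallies, per user-agent, the requests that fall under a Disallow rule. The two sample lines are the ones quoted at the top of this post. The `ROBOTS_TXT` rules are a stand-in assumed for illustration, not the real file, and `disallowed_hits` is just a name made up for this sketch.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumption: placeholder rules, NOT the actual robots.txt from this site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /piwik/
"""

# The two log lines quoted at the top of the post.
SAMPLE = [
    '18.104.22.168 - - [30/Jul/2014:13:12:57 -0700] "GET /robots.txt HTTP/1.1" '
    '200 885 "http://www.bdbrandprotect.com" '
    '"Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)"',
    '22.214.171.124 - - [30/Jul/2014:13:12:57 -0700] '
    '"GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 200 247 "-" '
    '"BPImageWalker/2.0 (www.bdbrandprotect.com)"',
]


def disallowed_hits(log_lines, robots_txt=ROBOTS_TXT):
    """Count requests per user-agent that robots.txt says not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in Combined Log Format
        if not parser.can_fetch(match["agent"], match["path"]):
            hits[match["agent"]] += 1
    return hits


if __name__ == "__main__":
    for agent, count in disallowed_hits(SAMPLE).most_common():
        print(count, agent)
```

Fed a full day's log, the same function should make a robot like this one stand out quickly, provided the rules in `ROBOTS_TXT` are replaced with the site's real ones.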
IP range for the full visit was
BrandProtect as a whole is 126.96.36.199/18; the robots stuck to the narrower range.
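Checking whether a visitor's address sits inside the narrow range or only inside the wider allocation is easy to script. A minimal sketch with Python's `ipaddress` module follows; the `WIDE` and `NARROW` values are placeholders from the documentation address block, not BrandProtect's actual ranges, and `in_range` is a hypothetical helper name.

```python
import ipaddress

# Assumption: placeholder ranges from the 203.0.113.0/24 documentation block,
# standing in for "the whole allocation" and "the range the robots used".
WIDE = "203.0.113.0/24"
NARROW = "203.0.113.64/27"


def in_range(ip, cidr):
    """Return True if ip falls inside cidr.

    strict=False accepts a CIDR written with host bits set,
    which is how ranges are often quoted informally.
    """
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)


if __name__ == "__main__":
    for ip in ("203.0.113.70", "203.0.113.200"):
        print(ip, "narrow:", in_range(ip, NARROW), "wide:", in_range(ip, WIDE))
```

Blocking on the narrow range keeps the rule tight; blocking on the wide one covers the robots if they move within the allocation, at the cost of catching anything else that lives there.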
There is a sister robot called LinkWalker
that crawls pages. It did its stuff about 2 1/2 hours earlier. Mysteriously this one does seem to honor robots.txt, barring the common initial pattern of
meaning that it requested the front page before it had actually seen robots.txt. Apart from that, though, it behaved itself. It did not ask for any css or js.
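The "page first, robots.txt second" habit can be spotted mechanically. Below is a small Python sketch that, given requests in chronological order, lists what each client fetched before it had asked for robots.txt. The `VISIT` sequence is invented to illustrate the pattern described above and is not taken from the actual log; the client key could just as well be an IP range as a UA name.

```python
def fetched_before_robots(requests):
    """Map each client to the paths it requested before fetching /robots.txt.

    requests: iterable of (client, path) pairs in chronological order.
    Clients that asked for /robots.txt first do not appear in the result.
    """
    seen_robots = set()
    early = {}
    for client, path in requests:
        if path == "/robots.txt":
            seen_robots.add(client)
        elif client not in seen_robots:
            early.setdefault(client, []).append(path)
    return early


# Assumption: a made-up visit showing front page first, then robots.txt.
VISIT = [
    ("LinkWalker", "/"),
    ("LinkWalker", "/robots.txt"),
    ("LinkWalker", "/about.html"),
]

if __name__ == "__main__":
    print(fetched_before_robots(VISIT))
```

A client that never requests robots.txt at all ends up with its whole crawl in the result, which is a useful signal in its own right.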
Since it began its crawl on the front page and I've never met the range before, I don't know what prompted its interest. If I'm only going to see it once in three years, it may not be worth blocking ;)