While it asks for robots.txt under this user agent, it is obviously using other user agents to gain access. Since I whitelist, this UA never gets access, yet some of my pages, from an older site running older code, still ended up in their listings. The pages had clues, though.
First, the page is tagged with a code showing that the request passed the UA test, meaning they were masquerading as a browser to collect the data. The crawl came from the following IPs:
inetnum: 18.104.22.168 - 22.214.171.124
netname: Bulldog
descr: 40:1 Dynamic IP Pool
country: GB
role: Cable and Wireless Access Ltd
address: SE1 0SL
Sorry I can't provide more details, but if a request passes all the other filters, the most I do is identify the source of the crawl, not the user agent. I'm working on something better, but I didn't want to maintain that amount of forensic data on my end as it gets a little crazy after a while, esp. since you often don't find out where the data ended up for months, which makes for a LOT of storage.
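For anyone wondering what tagging a page with the crawl source could look like, here is a rough Python sketch of the idea, not the actual code in use. The names make_tag and read_tag, the one-letter pass/fail flag, and the hex packing are all made up for illustration, and it assumes IPv4 only. The point is that the tag travels with the page, so nothing has to be stored server-side.

```python
import ipaddress

def make_tag(ip, passed_ua_test):
    """Pack an IPv4 address and a UA-test flag into a short tag.

    Hypothetical format: 'p' (passed) or 'f' (failed), followed by
    the address as 8 hex digits.
    """
    flag = "p" if passed_ua_test else "f"
    return flag + format(int(ipaddress.IPv4Address(ip)), "08x")

def read_tag(tag):
    """Recover (ip, passed_ua_test) from a tag found in a scraped copy."""
    passed = tag[0] == "p"
    ip = str(ipaddress.IPv4Address(int(tag[1:], 16)))
    return ip, passed
```

Embed the output of make_tag somewhere unobtrusive in the served page (an HTML comment or a class name, say), and when the page turns up in someone's listings months later, read_tag tells you which address fetched it.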
Maybe I should try a new rule so that anything with "host" or "server" in its reverse DNS gets the axe, and see just how many cyber ships get stranded on that reef. It would certainly help locate hosting tucked inside normal service provider ranges, which is always problematic at best.
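A minimal sketch of that reverse DNS rule in Python, untested against real traffic. The function names and the choice to let an address through when it has no PTR record are assumptions for illustration; only the two keywords come from the idea above.

```python
import socket

# Keywords from the proposed rule; matching is a plain substring test.
SUSPECT_KEYWORDS = ("host", "server")

def looks_like_hosting(hostname, keywords=SUSPECT_KEYWORDS):
    """True if the reverse DNS name contains any suspect keyword."""
    name = hostname.lower()
    return any(k in name for k in keywords)

def should_block(ip, resolver=socket.gethostbyaddr):
    """Look up the PTR record for ip and apply the keyword rule.

    resolver is injectable so the rule can be tried without live DNS.
    """
    try:
        hostname = resolver(ip)[0]
    except OSError:
        # No reverse DNS (socket.herror/gaierror): assumed here to be
        # left for the other filters rather than blocked outright.
        return False
    return looks_like_hosting(hostname)
```

A plain substring match will also catch names like "localhost" or anything containing "ghost", so it is probably worth logging what the rule would block for a while before letting it actually block.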