I have a theory about why bots attempt to crawl pages they are forbidden to access: link checking.
Assume all bots are blocked from the entire site with "Disallow: /" in robots.txt.
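For reference, the blanket block in robots.txt looks like this:

```
User-agent: *
Disallow: /
```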
Assume that real search engines honor robots.txt and don't crawl your site.
Now assume you see entries in your log files showing those same spiders accessing various pages on your site in defiance of the robots.txt block.
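Here's a rough sketch of how you might pull those entries out of a combined-format access log. The log path and the user-agent substrings are placeholders for whatever your own server and the engines you care about actually use:

```python
import re

# Assumed log location and user-agent substrings; adjust for your setup.
LOG_FILE = "/var/log/apache2/access.log"
ENGINE_UAS = ("Googlebot", "bingbot", "DuckDuckBot")

# Crude combined-log-format parser: request method, path, and user agent.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*".*"(?P<ua>[^"]*)"$')

with open(LOG_FILE) as log:
    for line in log:
        m = LINE_RE.search(line)
        if not m:
            continue
        # With "Disallow: /" in place, any fetch other than robots.txt
        # itself defies the block.
        if m.group("path") != "/robots.txt" and any(
            ua in m.group("ua") for ua in ENGINE_UAS
        ):
            print(line.rstrip())
```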
The only explanation I can come up with is that bad bots, the scrapers that ignore robots.txt, scraped the site, and those links were then indexed on the scraper sites.
Assuming the search engines then crawled the scraper sites, they may be accessing your pages, despite the robots.txt block, to link-check the URLs found on the scraped pages.
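If that's what is happening, the checker wouldn't need to render your pages at all and wouldn't consult robots.txt either. Here's a minimal sketch of the kind of link checker I'm imagining; scraper.example and the page name are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

# Hypothetical scraper page that republished your URLs.
SCRAPED_PAGE = "http://scraper.example/copied-page.html"

parser = LinkCollector()
with urlopen(SCRAPED_PAGE) as resp:
    parser.feed(resp.read().decode("utf-8", errors="replace"))

# No robots.txt lookup anywhere: each absolute link just gets a bare
# request, which is exactly what a blocked-but-visited log entry looks like.
for link in parser.links:
    if not link.startswith("http"):
        continue  # skip relative links in this sketch
    try:
        status = urlopen(Request(link, method="HEAD")).status
    except OSError as exc:
        status = exc
    print(link, status)
```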
That's my theory about why search engines might appear to defy robots.txt, and I'm thinking about setting up a honeypot site designed just to test it.
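Something like this is what I have in mind for the honeypot: a standard-library server that disallows everything, serves one bait page linked nowhere legitimate, and logs the user agent and Referer of anything that fetches it anyway. The bait path and port are placeholders:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical bait path: link it only where scrapers will find it,
# and keep it disallowed for everyone in robots.txt.
BAIT_PATH = "/honeypot-page.html"

class HoneypotHandler(BaseHTTPRequestHandler):
    def _respond(self, send_body):
        if self.path == "/robots.txt":
            body, ctype = b"User-agent: *\nDisallow: /\n", "text/plain"
        elif self.path == BAIT_PATH:
            # Log who arrived and where they claim to have come from; the
            # Referer, when present, points at the page holding the link.
            print("HIT", self.client_address[0],
                  "| UA:", self.headers.get("User-Agent"),
                  "| Referer:", self.headers.get("Referer"))
            body, ctype = b"<html><body>bait</body></html>", "text/html"
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if send_body:
            self.wfile.write(body)

    def do_GET(self):
        self._respond(send_body=True)

    def do_HEAD(self):
        # A pure link check may only ever send HEAD requests; log those too.
        self._respond(send_body=False)

if __name__ == "__main__":
    HTTPServer(("", 8080), HoneypotHandler).serve_forever()
```

If a major engine's user agent shows up on the bait page, and especially if its Referer points at a scraper site, that would be pretty strong evidence for the theory.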
At a minimum, it would be nice if the search engine told us why it was on our site. Sending a simple Referer header pointing to where it found the link would be the simplest solution and would let us diagnose the situation.
Any thoughts on this?