lucy24 - 4:38 am on May 4, 2013 (gmt 0)
The part that enraged me was when I realized it had been INVITED IN. Not by me, by someone who wanted to read a group of seven pages and couldn't be bothered to ask his browser to save them in complete form. Or ask me if I could zip up the package for him, which I would readily have done. (It's a large and massively illustrated e-book, in preparation.) Nope, just fire up the scraping utility and let it loose on THE ENTIRE SITE.
btw, I was mistaken in my first post. The robot didn't collect 2752 files. It was only 2252. The other 500 (exactly) were the human.
Full sequence, deduced by picking apart UAs:
15:57:32 human visits one page, which has links to six others.
16:01:05 human lets the robot loose on the same page-- which happens to live in a roboted-out directory. For the next two minutes it gobbles up every page and supporting file linked from the starting page. By the time it reaches a dead end, it has collected everything the original human could have wanted.
16:03:57-16:04:50 human skips around the same group of seven pages, with all images. Why he didn't read the version the robot had just finished collecting is anyone's guess.
19:25:08 robot makes a final visit, once again attempting to get a nonexistent file with a name ending in /images, with the resulting redirect to /images/ followed by a resounding 403.
Each of the robot's three visits began with a pickup of robots.txt. I cannot begin to imagine what it does with them. Wallpaper, possibly.
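For anyone following along: a "roboted-out directory" just means a Disallow line in robots.txt, which any well-behaved robot is supposed to honor before crawling. A minimal sketch (directory name hypothetical, not the actual path on my site):

```
# robots.txt -- sits at the site root
# Well-behaved robots fetch this first and skip anything disallowed.
User-agent: *
Disallow: /ebook/
```

Of course, as this episode demonstrates, fetching the file three times and honoring it are two entirely different things.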
It's listed many places.
Yes, when I looked it up here, the most common type of hit was a cut-and-paste UA block list including the element "WebReaper". The word "leech" is prominent. This thread [webmasterworld.com] summed it up most concisely :) The words "highly antisocial" are a heck of a lot more polite than I was in Other Forum.
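Since robots.txt is clearly wallpaper to this thing, the usual next step is to refuse it at the server. A minimal Apache sketch, assuming mod_rewrite is available and that the scraper really does send "WebReaper" somewhere in its UA string (adjust the pattern to whatever your logs actually show):

```apache
# Return 403 to any request whose User-Agent contains "WebReaper"
# (case-insensitive). Belongs in httpd.conf or .htaccess.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} WebReaper [NC]
RewriteRule .* - [F]
```

The [F] flag sends the same resounding 403 the robot got on its /images/ request-- except this time on every request, not just the forbidden directory. UA strings are trivially spoofed, so this only stops the lazy ones, but the lazy ones are most of them.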