-- Search Engine Spider and User Agent Identification
---- Don't stop fearing the webreaper
lucy24 - 4:38 am on May 4, 2013 (gmt 0)
The part that enraged me was when I realized it had been INVITED IN. Not by me, by someone who wanted to read a group of seven pages and couldn't be bothered to ask his browser to save them in complete form. Or ask me if I could zip up the package for him, which I would readily have done. (It's a large and massively illustrated e-book, in preparation.) Nope, just fire up the scraping utility and let it loose on THE ENTIRE SITE.
btw, I was mistaken in my first post. The robot didn't collect 2752 files. It was only 2252. The other 500 (exactly) were the human.
Full sequence, deduced by picking apart UAs:
Each of the robot's three visits began with a pickup of robots.txt. I cannot begin to imagine what it does with them. Wallpaper, possibly.
It's listed many places.
Yes, when I looked it up here, the most common type of hit was a cut-and-paste UA block list including the element "WebReaper". The word "leech" is prominent. This thread [webmasterworld.com] summed it up most concisely :) The words "highly antisocial" are a heck of a lot more polite than I was in Other Forum.
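For anyone who hasn't seen one, the UA block lists mentioned above come down to a case-insensitive substring match on the User-Agent header. Here is a rough Python sketch of the idea, not anyone's actual configuration. The only token taken from this thread is "WebReaper"; the names `BLOCKED_UA_TOKENS` and `is_blocked` and the sample UA strings are invented for illustration.

```python
import re

# Hypothetical block list: only "WebReaper" comes from the thread.
# A real cut-and-paste list would carry many more tokens.
BLOCKED_UA_TOKENS = ["WebReaper"]

# One case-insensitive pattern that matches any token anywhere in the UA string.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_UA_TOKENS),
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blocked token."""
    # A missing or empty header is not treated as blocked in this sketch.
    return bool(user_agent) and _BLOCKED_RE.search(user_agent) is not None


if __name__ == "__main__":
    # Made-up sample strings, not copied from real logs.
    for ua in ["Mozilla/4.0 (compatible; WebReaper)", "Mozilla/5.0 Firefox"]:
        print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

A match like this only catches a robot that announces itself honestly. It does nothing once the UA string is spoofed.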