
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Don't stop fearing the webreaper

 5:26 am on May 3, 2013 (gmt 0)

I have seldom been so angry in my life.

When you take a quick look at your logs to make sure the latest htaccess edit hasn't had ::cough-cough:: unintended consequences, and find the access log at midday fully five times as fat as normal, it might be a good sign.

:: insert chorus of "I'm always a cockeyed optimist" here ::

When the accompanying error log is itself almost as fat as a normal day's entire access log, it is definitely not a good sign.

:: cut details of long and exhaustive investigation ::

User-Agent for three occurrences of collecting robots.txt:
WebReaper v10.0 - www.webreaper.net

User-Agent for two thousand, seven hundred forty-nine occurrences of ignoring robots.txt:
WebReaper [support@webreaper.net]

IP: Irrelevant.
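
For anyone wanting to reproduce this kind of tally, here is a minimal sketch in Python. It assumes the access log is in Apache's "combined" format, with the User-Agent as the last quoted field on each line. The sample lines, IP address, dates, and paths are invented for illustration; only the two WebReaper UA strings come from the log described above.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, User-Agent is the final quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each User-Agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample data (not taken from the real log).
sample = [
    '203.0.113.5 - - [02/May/2013:16:01:05 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "WebReaper v10.0 - www.webreaper.net"',
    '203.0.113.5 - - [02/May/2013:16:01:06 +0000] "GET /a.html HTTP/1.1" 200 5120 "-" "WebReaper [support@webreaper.net]"',
    '203.0.113.5 - - [02/May/2013:16:01:07 +0000] "GET /b.html HTTP/1.1" 200 4096 "-" "WebReaper [support@webreaper.net]"',
    '198.51.100.9 - - [02/May/2013:16:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

counts = count_user_agents(sample)
for agent, hits in counts.most_common():
    print(hits, agent)
```

Sorting by count puts the noisiest agent at the top, which is usually the first thing worth looking at when a log is several times its normal size.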

:: further detour to offending robot's www site, using Safari with fake UA ::

Q: Some sites result in an "Access denied" error in WebReaper. How can I download them?
Unfortunately, you can't. WebReaper obeys the internet Robots Exclusion Standard

This is a, um, uh... Dang! Can't think of the word, although I'm pretty sure it's only three letters.

So why am I so angry? Above and beyond the fact that this is the single biggest robot attack I have ever sustained-- I don't believe I even have 2749 files (of all kinds) on my site-- close study of the beginning of the visit reveals that it was triggered by someone I know. Not face-to-face personally, or I'd go over and tear their head off, but online. And their actual target-- this is where the personal knowledge comes in-- consisted of not 2700+ but seven html files plus css and images.


First stop: Forum I share with nameless offender, where I post a curt character assassination benefiting from forum's inexplicable lack of word censoring.

Second stop: Here to remind everyone. If you haven't met WebReaper in a while and are thinking of commenting-out the block-- don't do it just yet.



 10:23 pm on May 3, 2013 (gmt 0)

I don't fear the webreaper because it's not one of the 20 or so user agents whitelisted to access my server.
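
A whitelist of this kind can be sketched roughly as follows. The actual list and server setup are not given in this thread, so the tokens below are hypothetical placeholders, and the check is written in plain Python rather than the htaccess rules a real server would more likely use.

```python
# Hypothetical allow-list; the real one mentioned above is not shown in the thread.
ALLOWED_AGENT_TOKENS = ("Googlebot", "bingbot", "Firefox", "Safari")


def is_allowed(user_agent, allowed=ALLOWED_AGENT_TOKENS):
    """Return True only if the User-Agent contains one of the allowed tokens.

    Matching is a case-insensitive substring test. A missing or empty
    User-Agent is rejected.
    """
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in allowed)


print(is_allowed("WebReaper [support@webreaper.net]"))
print(is_allowed("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

The design choice is deny-by-default: anything not recognized is refused, so a new or renamed scraper needs no new rule. The User-Agent header is trivially faked (the first post mentions browsing with a fake UA), so this only stops tools that announce themselves honestly.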


 2:12 am on May 4, 2013 (gmt 0)

WebReaper must have been one of the first scraper tools identified in this forum. It's listed many places. Even wannaBrowser uses the UA to test your blocking defenses. My notes say I blocked it about 11 years ago.


 4:38 am on May 4, 2013 (gmt 0)

The part that enraged me was when I realized it had been INVITED IN. Not by me, by someone who wanted to read a group of seven pages and couldn't be bothered to ask his browser to save them in complete form. Or ask me if I could zip up the package for him, which I would readily have done. (It's a large and massively illustrated e-book, in preparation.) Nope, just fire up the scraping utility and let it loose on THE ENTIRE SITE.

btw, I was mistaken in my first post. The robot didn't collect 2752 files. It was only 2252. The other 500 (exactly) were the human.

Full sequence, deduced by picking apart UAs:

15:57:32 human visits one page, which has links to six others.
16:01:05 human lets robot loose on same page-- which happens to live in a roboted-out directory. For the next two minutes it gobbles up all pages and supporting files linked from the starting page. When it reaches a dead end, it has collected all the pages and supporting files that the original human would have wanted.
16:03:57-16:04:50 human skips around the same group of seven pages, with all images. Why he didn't read the version the robot had just finished collecting is anyone's guess.
19:09:53-19:16:14 robot returns, this time starting from the site's front page, and systematically devours every single human-accessible file, excluding only images called by CSS (which it is too stupid to request without their enclosing quotation marks, even when it results in forms like /paintings/"/paintings/refrats/music/bluenote.gif"), supplemented by every single javascript function on the entire site-- requested as if they were filenames, as in /paintings/refrats/playMusic('band'). The originally targeted group of pages are conspicuous by their absence, since they are linked from nowhere.
19:25:08 robot makes final visit, attempting once again to get nonexistent file with name ending in /images, with resulting redirect to /images/ followed by resounding 403.
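
The malformed requests in the 19:09 visit are easy to spot mechanically. Here is a minimal sketch, assuming the requested path has already been pulled out of each log line. The pattern and the function name are my own illustration, not anything WebReaper or Apache defines, and parentheses are legal in URLs, so expect some false positives.

```python
import re
from urllib.parse import unquote

# Heuristic (an assumption, not a standard): a quotation mark anywhere in the
# path, or a trailing "name(...)" that looks like a JavaScript call.
SUSPICIOUS = re.compile(r"""["']|\w\([^/]*\)$""")


def looks_like_scraper_artifact(path):
    """Return True if a requested path resembles unparsed CSS or JavaScript."""
    # Decode percent-escapes first, since a quote may be logged as %22.
    return bool(SUSPICIOUS.search(unquote(path)))


for path in (
    '/paintings/"/paintings/refrats/music/bluenote.gif"',
    "/paintings/refrats/playMusic('band')",
    "/paintings/refrats/index.html",
):
    print(looks_like_scraper_artifact(path), path)
```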

Each of the robot's three visits began with a pickup of robots.txt. I cannot begin to imagine what it does with them. Wallpaper, possibly.
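
For contrast, this is roughly what a compliant robot is expected to do with the file once it has it, sketched with Python's standard `urllib.robotparser`. The robots.txt contents, the `/ebook/` directory name, and the `example.com` URLs are hypothetical stand-ins; the thread does not show the real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one roboted-out directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ebook/
"""


def make_checker(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# A well-behaved robot asks before every fetch and skips anything disallowed.
for url in ("http://example.com/ebook/page1.html", "http://example.com/index.html"):
    verdict = "fetch" if checker.can_fetch("WebReaper", url) else "skip"
    print(verdict, url)
```

Under these assumed rules, a robot that began its crawl inside the disallowed directory, as in the 16:01 visit, would have stopped at the first page.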

"It's listed many places."

Yes, when I looked it up here, the most common type of hit was a cut-and-paste UA block list including the element "WebReaper". The word "leech" is prominent. This thread [webmasterworld.com] summed it up most concisely :) The words "highly antisocial" are a heck of a lot more polite than I was in Other Forum.


 8:38 am on May 4, 2013 (gmt 0)

A couple of years ago I was condensing and trimming some fat from my rewrite rules that block UAs. I can identify with the ironic justice of removing a block only to have that UA show up again after a long absence.


 7:53 pm on May 4, 2013 (gmt 0)

"removing a block only to have that UA show up again after a long absence"

Uh-oh, you mean someone's about to start asking for muieblackcat again? :) I've never met WebReaper before. I don't have a lot of, er, scrape-worthy content.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved