Yet another menace --
This is a Firefox extension, and its specifics do not appear in UA strings. (The current version is 1.3.4.) It lets users scrape pages and capture and edit sites, with total disregard for the sites themselves, their ToS/ToU, let alone copyrights. No robots.txt, no nothing. Just HEAD hit and run.
That's the nasty downside for sites and site owners.
There's also a nasty downside for users --
The referer shows their entire file path, the entire directory structure where the ScrapBook files are saved. In the case of the person who just hit us, hard, because of how they set up their PC, I now know their real name and their Mozilla profile name. Yikes.
How you can spot/stop this --
ScrapBook came to my attention in two ways: The first was that our traffic suddenly and atypically tripled. The second was the blast of 403s in my logs caused by the HEAD reqs and this referer:
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
(All "file:///C:/"-containing referers get 403'd because they wreak havoc with image blocks and such.)
I think Firefox saves the page from its cache, thus sending only the HEAD check. I've puzzled over a way to block it. Are you sure you're not blocking it due to some other rule?
Scenario:
User comes to a web page using Firefox w/ ScrapBook add-on.
User chooses to "save" that page to his ScrapBook folder.
The ScrapBook add-on looks in the browser cache to get the necessary files, and saves those files to a folder on the user's machine.
When the user wants to view that saved page, it comes from the folder and not the server - so no rule on the server will matter.
Now - I do see 403s in my logs for similar requests that come from "file://C...
But, I have the ScrapBook add-on, and I am able to save pages from my server and they are *not* blocked. Why some requests are blocked by my rewrite rules while I can still save pages with ScrapBook is the mystery. I'll look through my logs in a few hours to find out why.
Also, beware of blocking the string "Profile", since BlackBerry includes it in its UA.
1.) There is NO identifiable way via UA to know someone's using the ScrapBook FF add-on. (Unlike, say, AutoPager or other horn-blowing scrapers.) ScrapBook's specifics do NOT appear in UA strings.
2.) Neither is ScrapBook's activity visible via URI other than by HEAD requests. I only identified it because of its name in the referer, which contained the visitor's local PC file path.
3.) As of right now, the only way I know to block it is by blocking HEAD requests (which I do, except from aol.com and certain trusted bots), and/or PC-specific local file path referers. (Sketches of both follow this list.)
4.) Alternatively, if you're super savvy vis-a-vis server coding, you probably already block too-rapid file requests. If you don't, your sites are vulnerable (...as are mine. Dang I wish I knew how to get mod_bandwidth or its ilk working to limit hit rate).
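For #3, a minimal Apache sketch. The aol.com exception keys off %{REMOTE_HOST}, which only resolves if HostnameLookups is on (or you do your own rDNS), so treat the pattern as illustrative:
# Deny HEAD requests unless the client resolves to a trusted host
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteCond %{REMOTE_HOST} !\.aol\.com$ [NC]
RewriteRule .* - [F]
For #4, if mod_bandwidth won't cooperate, mod_evasive is one alternative for curbing too-rapid requests (the numbers below are illustrative; tune them to your traffic):
<IfModule mod_evasive20.c>
# Block an IP that requests the same URI more than 5 times in 2 seconds
DOSPageCount 5
DOSPageInterval 2
# ...or more than 60 URIs site-wide in 2 seconds
DOSSiteCount 60
DOSSiteInterval 2
# Seconds the offender stays blocked
DOSBlockingPeriod 60
</IfModule>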
What makes this nastier than just another cloaked scraper/bot is that if you use ScrapBook, you're also vulnerable to it silently bleeding your private info everywhere you use it.
If there is not some valid attempt to follow protocol, why would the requests even show up in your logs (even if they are mere HEAD requests)?
I looked at the plugin, and it requires a version of FF which I do not use, thus any testing on my end is out.
Don
Do you have any "no cache" pages?
no
How do the plugin and its subsequent requests react to those?
I wouldn't know.
If there is not some valid attempt to follow protocol, why would the requests even show up in your logs (even if they are mere HEAD requests)?
The ScrapBook HEAD requests did not show in my logs, but Pfui says they did for her.
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dir01/fileA.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dir02/fileB.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:38 -0700] "HEAD /dir03/fileC.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:39 -0700] "HEAD /fileD.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
The Scrap(e)Book scrape came after the visitor 'manually' browsed bunches of pages. Then they let 'er rip, leaving a mess in their wake. Wonder when they'll discover they got bubkes for their trouble? ;)
This particular backbone has a very large subnet range for commercial users.
Also, in this instance the GTB5 token in the UA is a usable approach.
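Something along these lines, presumably; a sketch only, and since GTB5 is just the Google Toolbar token, pair it with the IP range rather than blocking on it alone:
# Target this visitor: offending range plus the GTB5 toolbar token
RewriteCond %{REMOTE_ADDR} ^65\.207\.77\.
RewriteCond %{HTTP_USER_AGENT} GTB5
RewriteRule .* - [F]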
Apparently, though (as keyplyr noted), once the page is cached, denying the HEAD request seems to be the only recourse.
I'm still curious as to how the software handles NO CACHE pages.
Don
1.) Cache-related?
Header append Cache-Control "no-store, no-cache"
2.) Page-based?
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
3.) Robots-specific?
Header append X-Robots-Tag "noarchive,nosnippet,notranslate"
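Server-side, #1 and #3 can be combined in .htaccess, assuming mod_headers is loaded (#2 goes in the page HTML itself):
<IfModule mod_headers.c>
# 1) Tell caches not to keep a copy
Header append Cache-Control "no-store, no-cache"
# 3) Tell well-behaved robots not to archive, snippet, or translate
Header append X-Robots-Tag "noarchive,nosnippet,notranslate"
</IfModule>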
Basically, ScrapBook (mis)behaved like your typical bad bot, trying to scrape everything from html to gif/jpg files. It even tried to scrape a cgi script (and that really ticks me off):
65.207.77.nnn - - [08/Sep/2009:11:05:41 -0700] "HEAD /cgi-bin/scriptname.cgi HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
It ignored/didn't request robots.txt. It ignored #2 and #3 above. It based new/re-scrape retrievals on its saved file paths, which it screwed up:
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageA.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageB.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageC.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageD.gif
Why so much follow-up interest in what ScrapBook does, or doesn't, do on a server?
But for irresponsibly leaking the user's personal info, something I find alarming and inexcusable, ScrapBook's actions are simply typical of any nasty cloaked scraper or bot.
Which kind of NO CACHE?
Page-Meta
CONTENT="noarchive"
Why so much follow-up interest in what ScrapBook does, or doesn't, do on a server?
Because the HEAD requests are a follow-up to the initial harvesting, and no explanation has been provided of that initial crawl.
According to keyplyr, the HEAD requests are the result of a re-verification check when the pages have previously been cached by the software.
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
FWIW...
From the looks of how the person visited, w/o ScrapBook and then during/after ScrapBook, they first browsed around normally, ultimately landing on a page that's like a section TOC, with intra-site links galore. THEN they started the scrape of other pages. Here's the gear-shift, occurring w/in approx. 30 sec.:
Normal referer:
65.207.77.nnn - - [08/Sep/2009:11:04:56 -0700] "GET /dirA/image.gif HTTP/1.1" 404 3243
"http://www.sitename.com/dirA/filename.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
ScrapBook referer:
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dirB/filename.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
FWIW redux, the ScrapBook referers were NOT previously cached pages. In the second example, above, the visitor had not already hit the page being HEAD-requested.
Confusing? Sorry! In other words, it looks like my server file --
/dirA/filename.html
-- was morphed by ScrapBook into the visitor's local drive file:
/ScrapBook/data/20090908140451/index.html
Then, using that "index.html" as a kind of local site map, the scrape ran from local "index.html" to 26 server files -- none of which had been visited.
FWIW, not all links on the server-based page were scraped. I cannot explain the selectiveness, other than that the visitor could control or set a preference for what got scraped. Or, perhaps more likely, they realized they were getting nothing but 403s.
Anyway, after the 26 thwarted ScrapBook scrapes, the visitor resumed normal browsing, with normal referers.
Hope that helps/makes sense.
Additionally, there are 10 tools. It can scrape multiple URLs, create its own directories, edit pages, republish (output as new HTML), rediscover those pages, rediscover the web, drill down 1 layer, 2 layers, etc. -- the list continues. Additional add-ons can even be added to ScrapBook.
You really need to install it to get a better understanding.
You're fortunate that your server shows this log activity, mine does not.
You're fortunate that your server shows this log activity, mine does not.
keyplyr,
This is the standard log format used by most providers.
This seems to be utilized by a visitor highly focused on your site(s), rather than a random harvester.
Simply blacklisting the IP range when you become aware of it should deter return visits.
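In Apache 2.2 terms, a minimal sketch using the range from the logs above (a partial IP in Deny matches the whole range, so the truncated last octet isn't needed):
# Blacklist the offending range; everyone else stays allowed
Order Allow,Deny
Allow from all
Deny from 65.207.77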
- Correction: I just noticed the pre-transition info I added a bit ago shows a 404 in response to the GIF URI. (slaps head) They still started on the TOC-like page, but the scrape started a few seconds earlier, with graphics calls that reflected its wrongly set file paths rather than the real ones.
- My logs are Extended Log Format (ELF) on an older Apache. I've thought about adding more configurations but at some point, you know how it goes -- you end up with so much data that you never get around to half of it.
- A few minutes after I discovered the ScrapBook visitor up to no good, their IP (no rDNS) became a RewriteCond %{REMOTE_ADDR} entry. They'll need to request access next time around.
- I should've linked up the ScrapBook site [amb.vis.ne.jp] in my OP. Then maybe the creator, or a Mozilla add-on [addons.mozilla.org] rep, might have, akin to other bot-makers like Majestic and Digsby, noticed links from here in their logs and changed a thing or three about this poorly coded plug. (Apologies to Bill if you remove the links I added to this paragraph:)
I think the author means that he changed the code, and that the change should fix this problem as of the September 13th release of version 1.3.5 ... Language differences may enter into the ambiguity of his statement.
Jim