Yet another menace --
This is a Firefox extension, and its specifics do not appear in UA strings. (The current version is 1.3.4.) It lets users scrape pages and capture and edit sites, with total disregard for the sites themselves, their ToS/ToU, let alone copyrights. No robots.txt, no nothing. Just HEAD hit and run.
That's the nasty downside for sites and site owners.
There's also a nasty downside for users --
The referer shows their entire file path, the entire directory structure where the ScrapBook files are saved. In the case of the person who just hit us, hard, because of how they set up their PC, I now know their real name and their Mozilla profile name. Yikes.
How you can spot/stop this --
ScrapBook came to my attention in two ways: The first was that our traffic suddenly and atypically tripled. The second was the blast of 403s in my logs caused by the HEAD reqs and this referer:
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
(All "file:///C:/"-containing referers get 403'd because they wreak havoc with image blocks and such.)
I think Firefox saves the page from its cache, thus sending only the HEAD check. I've puzzled over a way to block it. Are you sure you're not blocking it due to some other rule?
Scenario:
User comes to a web page using Firefox w/ ScrapBook add-on.
User chooses to "save" that page to his ScrapBook folder.
The ScrapBook add-on looks in the browser cache to get the necessary files, and saves those files to a folder on the user's machine.
When the user wants to view that saved page, it comes from the folder and not the server - so no rule on the server will matter.
Now - I do see 403s in my logs for similar requests that come from "file://C...
But, I have the ScrapBook add-on, and I am able to save pages from my server and they are *not* blocked. Why some requests are blocked by my rewrite rules while I can still save pages with ScrapBook is the mystery. I'll look through my logs in a few hours to find out why.
Also, beware of blocking the string "Profile", since BlackBerry includes it in its UA.
1.) There is NO identifiable way via UA to know someone's using the ScrapBook FF add-on. (Unlike, say, AutoPager or other horn-blowing scrapers.) ScrapBook's specifics do NOT appear in UA strings.
2.) Neither is ScrapBook's activity visible via URI other than by HEAD requests. I only identified it because of its name in the referer, which contained the visitor's local PC file path.
3.) As of right now, the only way I know to block it is by blocking HEAD requests (which I do, except from aol.com and certain trusted bots), and/or PC-specific local file path referers. (Sketches of both follow this list.)
4.) Alternatively, if you're super savvy vis-a-vis server coding, you probably already block too-rapid file requests. If you don't, your sites are vulnerable (...as are mine. Dang I wish I knew how to get mod_bandwidth or its ilk working to limit hit rate).
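For #3, a minimal Apache sketch. The aol.com exception keys off %{REMOTE_HOST}, which only resolves if HostnameLookups is on (or you do your own rDNS), so treat the pattern as illustrative:
# Deny HEAD requests unless the client resolves to a trusted host
RewriteCond %{REQUEST_METHOD} ^HEAD$
RewriteCond %{REMOTE_HOST} !\.aol\.com$ [NC]
RewriteRule .* - [F]
For #4, if mod_bandwidth won't cooperate, mod_evasive is one alternative for curbing too-rapid requests (the numbers below are illustrative; tune them to your traffic):
<IfModule mod_evasive20.c>
# Block an IP that requests the same URI more than 5 times in 2 seconds
DOSPageCount 5
DOSPageInterval 2
# ...or more than 60 URIs site-wide in 2 seconds
DOSSiteCount 60
DOSSiteInterval 2
# Seconds the offender stays blocked
DOSBlockingPeriod 60
</IfModule>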
What makes this nastier than just another cloaked scraper/bot is that if you use ScrapBook, you're also vulnerable to it silently bleeding your private info everywhere you use it.
If there is not some valid attempt to follow protocol, why would the requests even show up in your logs (even if they are mere HEAD requests)?
I looked at the plugin, and it requires a version of FF which I do not use, thus any testing on my end is out.
Don
Do you have any "no cache" pages?
no
How do the plugin and its subsequent requests react to those?
I wouldn't know.
If there is not some valid attempt to follow protocol, why would the requests even show up in your logs (even if they are mere HEAD requests)?
The ScrapBook HEAD requests did not show in my logs, but Pfui says they did for her.
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dir01/fileA.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dir02/fileB.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:38 -0700] "HEAD /dir03/fileC.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
65.207.77.nnn - - [08/Sep/2009:11:05:39 -0700] "HEAD /fileD.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
The Scrap(e)Book scrape came after the visitor 'manually' browsed bunches of pages. Then they let 'er rip, leaving a mess in their wake. Wonder when they'll discover they got bubkes for their trouble? ;)
This particular backbone has a very large subnet range for commercial users.
Also, in this instance the GTB5 token in the UA is a usable approach.
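Something along these lines, presumably; a sketch only, and since GTB5 is just the Google Toolbar token, pair it with the IP range rather than blocking on it alone:
# Target this visitor: offending range plus the GTB5 toolbar token
RewriteCond %{REMOTE_ADDR} ^65\.207\.77\.
RewriteCond %{HTTP_USER_AGENT} GTB5
RewriteRule .* - [F]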
Apparently, though (as keyplyr noted), once the page is cached, denying the HEAD request seems to be the only recourse.
I'm still curious as to how the software handles NO CACHE pages.
Don
1.) Cache-related?
Header append Cache-Control "no-store, no-cache"
2.) Page-based?
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
3.) Robots-specific?
Header append X-Robots-Tag "noarchive,nosnippet,notranslate"
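Server-side, #1 and #3 can be combined in .htaccess, assuming mod_headers is loaded (#2 goes in the page HTML itself):
<IfModule mod_headers.c>
# 1) Tell caches not to keep a copy
Header append Cache-Control "no-store, no-cache"
# 3) Tell well-behaved robots not to archive, snippet, or translate
Header append X-Robots-Tag "noarchive,nosnippet,notranslate"
</IfModule>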
Basically, ScrapBook (mis)behaved like your typical bad bot, trying to scrape everything from html to gif/jpg files. It even tried to scrape a cgi script (and that really ticks me off):
65.207.77.nnn - - [08/Sep/2009:11:05:41 -0700] "HEAD /cgi-bin/scriptname.cgi HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
It ignored/didn't request robots.txt. It ignored #2 and #3 above. It based new/re-scrape retrievals on its saved file paths, which it screwed up:
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageA.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageB.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageC.gif
[Tue Sep 8 11:04:55 2009] [error] [client 65.207.77.nnn] File does not exist: /path-to-dir/imageD.gif
Why so much follow-up interest in what ScrapBook does, or doesn't, do on a server?
But for irresponsibly leaking the user's personal info, something I find alarming and inexcusable, ScrapBook's actions are simply typical of any nasty cloaked scraper or bot.
Which kind of NO CACHE?
Page-Meta
CONTENT="noarchive"
Why so much follow-up interest in what ScrapBook does, or doesn't, do on a server?
Because the HEAD requests are a follow-up to the initial harvesting, and no explanation has been provided of that initial crawl.
According to keyplyr, the HEAD requests are the result of a re-verification check when the pages have previously been cached by the software.
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
FWIW...
From the looks of how the person visited, w/o ScrapBook and then during/after ScrapBook, they first browsed around normally, ultimately landing on a page that's like a section TOC, with intra-site links galore. THEN they started the scrape of other pages. Here's the gear-shift, occurring w/in approx. 30 sec.:
Normal referer:
65.207.77.nnn - - [08/Sep/2009:11:04:56 -0700] "GET /dirA/image.gif HTTP/1.1" 404 3243
"http://www.sitename.com/dirA/filename.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
ScrapBook referer:
65.207.77.nnn - - [08/Sep/2009:11:05:37 -0700] "HEAD /dirB/filename.html HTTP/1.1" 403 0
"file:///C:/Documents%20and%20Settings/usernamehere/Application%20Data/
Mozilla/Firefox/Profiles/profilenamehere.default/ScrapBook/data/20090908140451/index.html"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5 (.NET CLR 3.5.30729)"
FWIW redux, the ScrapBook referers were NOT previously cached pages. In the second example, above, the visitor had not already hit the page being HEAD-requested.
Confusing? Sorry! In other words, it looks like my server file --
/dirA/filename.html
-- was morphed by ScrapBook into the visitor's local drive file:
/ScrapBook/data/20090908140451/index.html
Then, using that "index.html" as a kind of local site map, the scrape ran from local "index.html" to 26 server files -- none of which had been visited.
FWIW, not all links on the server-based page were scraped. I cannot explain the selectiveness, other than that the visitor could control or set a preference for what got scraped. Or, perhaps more likely, they realized they were getting nothing but 403s.
Anyway, after the 26 thwarted ScrapBook scrapes, the visitor resumed normal browsing, with normal referers.
Hope that helps/makes sense.
Additionally, there are 10 tools. It can scrape multiple URLs, create its own directories, edit pages, republish (output as new HTML), rediscover those pages, rediscover the web, drill down 1 layer, 2 layers, etc. -- the list continues. Additional add-ons can even be added to ScrapBook.
You really need to install it to get a better understanding.
You're fortunate that your server shows this log activity, mine does not.
You're fortunate that your server shows this log activity, mine does not.
keyplyr,
This is the standard log format used by most providers.
This seems to be utilized by a visitor highly focused on your site(s), rather than a random harvester.
Simply blacklisting the IP range when you become aware of it should deter return visits.
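In Apache 2.2 terms, a minimal sketch using the range from the logs above (a partial IP in Deny matches the whole range, so the truncated last octet isn't needed):
# Blacklist the offending range; everyone else stays allowed
Order Allow,Deny
Allow from all
Deny from 65.207.77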
- Correction: I just noticed the pre-transition info I added a bit ago shows a 404 in response to the GIF URI. (slaps head) They still started on the TOC-like page, but the scrape started a few seconds earlier, with graphics calls that reflected its wrongly set file paths rather than the real ones.
- My logs are Extended Log Format (ELF) on an older Apache. I've thought about adding more configurations but at some point, you know how it goes -- you end up with so much data that you never get around to half of it.
- A few minutes after I discovered the ScrapBook visitor up to no good, their IP (no rDNS) became a RewriteCond %{REMOTE_ADDR} entry. They'll need to request access next time around.
- I should've linked up the ScrapBook site [amb.vis.ne.jp] in my OP. Then maybe the creator, or a Mozilla add-on [addons.mozilla.org] rep, might have, akin to other bot-makers like Majestic and Digsby, noticed links from here in their logs and changed a thing or three about this poorly coded plug. (Apologies to Bill if you remove the links I added to this paragraph:)
I think the author means that he changed the code, and that the change should fix this problem as of the September 13th release of version 1.3.5 ... Language differences may enter into the ambiguity of his statement.
Jim