instapaper

instascraper

Pfui

4:11 pm on Oct 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Instapaper scrapes your content and makes it available for viewing and/or downloading without your layout, ads, Google Analytics code, etc. (Read: Copyright be damned.) Ironically, the company uses robots.txt on its own server [instapaper.com...] but its bots and apps ignore robots.txt (and 403s) on mine.

1.) Apparently initiated by users' bookmarklet clicks but automatic thereafter, Instapaper's bot-running is rapid-fire and relentless. For example, starting Oct. 6th, its host-named servers and bots began hitting the exact same 15 files in roughly 10 seconds, once every 24 hours:

06:35:59 /dir/file52.html
06:36:00 /dir/file51.html
06:36:01 /dir/file50.html
06:36:02 /dir/file49.html
06:36:02 /dir/file48.html
06:36:03 /dir/file45.html
06:36:04 /dir/file44.html
06:36:05 /dir/file43.html
06:36:05 /dir/file42.html
06:36:06 /dir/file40.html
06:36:07 /dir/file36.html
06:36:08 /dir/file32.html
06:36:08 /dir/file31.html
06:36:09 /dir/file30.html
06:36:10 /dir/file29.html

(Files are in a 50-plus docs set.)
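
For anyone wanting to flag this pattern in their own logs, here is a rough Python sketch (not anything Instapaper-specific). It assumes you have already pulled one client's hits for one day into (time, path) pairs; the function name, the 15-hit threshold and the 15-second window are my own illustrative choices.

```python
from datetime import datetime, timedelta

# The 15 hits from the Oct. 6 excerpt above.
SAMPLE = [
    ("06:35:59", "/dir/file52.html"), ("06:36:00", "/dir/file51.html"),
    ("06:36:01", "/dir/file50.html"), ("06:36:02", "/dir/file49.html"),
    ("06:36:02", "/dir/file48.html"), ("06:36:03", "/dir/file45.html"),
    ("06:36:04", "/dir/file44.html"), ("06:36:05", "/dir/file43.html"),
    ("06:36:05", "/dir/file42.html"), ("06:36:06", "/dir/file40.html"),
    ("06:36:07", "/dir/file36.html"), ("06:36:08", "/dir/file32.html"),
    ("06:36:08", "/dir/file31.html"), ("06:36:09", "/dir/file30.html"),
    ("06:36:10", "/dir/file29.html"),
]


def find_bursts(hits, min_hits=15, window_seconds=15):
    """Return (first_time, last_time, count) for each rapid-fire run.

    hits: list of ("HH:MM:SS", path) pairs from one client on one day.
    A run counts as a burst when at least min_hits requests fall within
    window_seconds of the run's first request.
    """
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t, _ in hits)
    window = timedelta(seconds=window_seconds)
    bursts = []
    i = 0
    while i < len(times):
        j = i
        # Extend the run while the next hit is still inside the window.
        while j + 1 < len(times) and times[j + 1] - times[i] <= window:
            j += 1
        count = j - i + 1
        if count >= min_hits:
            bursts.append((times[i].strftime("%H:%M:%S"),
                           times[j].strftime("%H:%M:%S"), count))
            i = j + 1  # skip past the burst just recorded
        else:
            i += 1
    return bursts


print(find_bursts(SAMPLE))  # [('06:35:59', '06:36:10', 15)]
```

Run against each day's hits per host, the same burst showing up at 24-hour intervals is the giveaway.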

Then last night, Instapaper hit each of the exact same 15 files twice, alternating named AND cloaked UAs with each split-second hit:

www6.instapaper.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
04:56:48 /dir/file52.html

www6.instapaper.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7
04:56:49 /dir/file52.html

2.) Also, apparently user-owned versions of its apps rifle directories independently AND in tandem with the company's servers. Note the times, and the directory traversal in the requested paths (the paths it started from were wrong anyway):

www6.instapaper.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
15:30:39 /dir/file.html

host81-155-205-252.range81-155.btcentralplus.com
InstapaperPro/3.0.3 CFNetwork/485.12.7 Darwin/10.4.0
15:30:47 /dir/../dir/dot.gif
15:30:50 /dir/../dir/site.gif
15:30:51 /dir/../dir/sub.gif
15:30:52 /dir/../dir/ind.gif
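
If you want to spot or collapse those `..` segments when going through logs, a minimal Python sketch follows. The two helper names are made up for illustration; `posixpath.normpath` is the standard-library call doing the work. Note that it also strips a trailing slash, so treat the result as a comparison key rather than a rewritten URL.

```python
import posixpath


def has_traversal(path):
    """True if any segment of the URL path is '..'."""
    return ".." in path.split("/")


def normalize_request_path(path):
    """Collapse '.' and '..' segments so equivalent paths compare equal."""
    return posixpath.normpath(path)


for requested in ("/dir/../dir/dot.gif", "/dir/file.html"):
    print(requested, has_traversal(requested), normalize_request_path(requested))
# /dir/../dir/dot.gif True /dir/dot.gif
# /dir/file.html False /dir/file.html
```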

3.) Instapaper's scrape-n-serve servers and bots include:

www1.instapaper.com [174.121.186.250]
www5.instapaper.com [184.172.0.213]
www6.instapaper.com [184.172.0.211]

(robtex: "www3.instapaper.com, www6.instapaper.com, mail.instapaper.com, www4.instapaper.com, www1.instapaper.com and at least two other hosts are subdomains to this hostname.")

Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)

4.) Other/customer(?) UA variations include:

cpc12-cmbg15-2-0-custnnn.5-4.cable.virginmedia.com
InstapaperPro/3.0.3 CFNetwork/548.0.3 Darwin/11.0.0

184.173.115.2nn-static.reverse.softlayer.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
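
To sort hits like the ones in 3.) and 4.) when reviewing logs, something along these lines might do. It is only a sketch: the function name and the labels are mine, and it trusts the logged hostname, which a reverse-DNS lookup alone does not prove without a forward-confirming lookup.

```python
def classify_hit(hostname, user_agent):
    """Label a logged hit by what the UA declares and where it came from."""
    declared = "instapaper" in user_agent.lower()
    company_host = hostname.lower().endswith(".instapaper.com")
    if declared and company_host:
        return "named UA, instapaper host"
    if declared:
        return "named UA, other host"
    if company_host:
        # Browser-looking UA arriving from an instapaper.com host.
        return "cloaked UA, instapaper host"
    return "unrelated"


NAMED = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 "
         "(KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)")
CLOAKED = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) "
           "Gecko/2009021906 Firefox/3.0.7")

print(classify_hit("www6.instapaper.com", NAMED))    # named UA, instapaper host
print(classify_hit("www6.instapaper.com", CLOAKED))  # cloaked UA, instapaper host
print(classify_hit("host81-155-205-252.range81-155.btcentralplus.com",
                   "InstapaperPro/3.0.3 CFNetwork/485.12.7 Darwin/10.4.0"))  # named UA, other host
```

The cloaked case is the one a UA-only filter will never see, which is why the hostname has to be part of the check.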

5.) Again, and always, for all Instapaper-related hits from anywhere:

robots.txt? NO

Instapaper's bot-running isn't new. It's way, waay past time they got a clue.

dstiles

8:54 pm on Oct 14, 2011 (gmt 0)

The Planet IPs I can understand but BT and Virgin dynamic?

Looks like at least partial interaction as a distributed bot?

Looking at its info as reported by SEs, I would guess it hits the web site for content that the bot can't get at or hasn't got time for?

Pfui

10:41 pm on Oct 14, 2011 (gmt 0)

Hard to say from the receiving end. For example, after I posted, guess who came by -- in tandem? The exact same softlayer address AND not one but two instapaper hosts, ALL using the same UA:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)

www5.instapaper.com
09:12:42 /dir/file07.html
09:12:50 /dir/file06.html

www6.instapaper.com
09:13:05 /dir/file01.html
09:13:22 /dir/file02.html

184.173.115.2nn-static.reverse.softlayer.com
09:13:46 /dir/file03.html
09:14:11 /dir/file05.html

For all: robots.txt? NO

That trio looks less user-initiated or distributed and more like company cloud infrastructure (amazonaws-style). It's simply unlikely that people suddenly bookmarked a total of 21 pages to read later when the site's up 24/7, includes HTML and text versions of the same files, and has a URL that's easy to reach and remember.

keyplyr

1:19 am on Oct 15, 2011 (gmt 0)

Nasty thing it is too. Been blocking it and all other "reader" type UAs that reformat HTML. I even crippled the new native RSS reader in Safari 5+.
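
For sites served through a Python WSGI stack, blocking by UA could look roughly like the sketch below. The class name and the default token list are illustrative assumptions, not anyone's actual setup. As reported earlier in the thread, the bots keep requesting after 403s and sometimes use cloaked browser UAs, so this only withholds content from requests that name themselves.

```python
class BlockReaderUAs:
    """WSGI middleware that answers 403 when the UA contains a blocked token."""

    def __init__(self, app, blocked_tokens=("instapaper",)):
        self.app = app
        # Compare case-insensitively.
        self.blocked_tokens = tuple(t.lower() for t in blocked_tokens)

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in self.blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrap the existing application object once at startup, e.g. `application = BlockReaderUAs(application)`.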

dstiles

8:31 pm on Oct 15, 2011 (gmt 0)

Pfui, the way I read it is: "download all pages and read off-line", so it's possible it was a single person using Instapaper, with the odd side effect of looking for icons?

Pfui

11:21 pm on Oct 15, 2011 (gmt 0)

No calls to favicons ever.

No need for Instapaper servers to hit the same 15 pages every night for almost two weeks now. That's not one person over and over and over again. That looks anticipatory at Instapaper's end, or perhaps related to app-start, like Safari and its wretched Top Sites hits [webmasterworld.com...] (Aside: I've not forgotten you; I've been avoiding re-tackling that snafu :)

Early-2011 info about how instapaper scrapes -- "the bookmarklet compresses the page data right there in the browser before sending it" -- suggests the former to me. [owni.eu...]

dstiles

7:54 pm on Oct 16, 2011 (gmt 0)

Sorry, I saw dot.gif and thought it may be some kind of icon.

It COULD be the same person if they have some kind of timed update set up.

Either way, I'm in favour of blocking it.