Instapaper scrapes your content and makes it available for viewing and/or downloading without your layout, ads, Google Analytics code, etc. (Read: Copyright be damned.) Ironically, the company uses robots.txt on its own server [instapaper.com...] but its bots and apps ignore robots.txt (and 403s) on mine.
1.) Apparently initiated by users' bookmarklet clicks but automatic thereafter, Instapaper's bot-running is rapid-fire and relentless. For example, starting Oct. 6th, its host-named servers and bots began hitting the exact same 15 files in about 10 seconds, every 24 hours:
06:35:59 /dir/file52.html
06:36:00 /dir/file51.html
06:36:01 /dir/file50.html
06:36:02 /dir/file49.html
06:36:02 /dir/file48.html
06:36:03 /dir/file45.html
06:36:04 /dir/file44.html
06:36:05 /dir/file43.html
06:36:05 /dir/file42.html
06:36:06 /dir/file40.html
06:36:07 /dir/file36.html
06:36:08 /dir/file32.html
06:36:08 /dir/file31.html
06:36:09 /dir/file30.html
06:36:10 /dir/file29.html
(Files are in a 50-plus docs set.)
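For anyone wanting to spot this kind of burst in their own logs, here is a minimal Python sketch. It assumes you have already pulled one host's hits for one day into (time, path) pairs; the function name `find_bursts` and the thresholds (10 hits inside 15 seconds) are my own illustrative choices, not anything standard.

```python
from datetime import datetime, timedelta

def find_bursts(hits, min_hits=10, window_seconds=15):
    """Return (start, end, count) for runs of hits that fit inside the window.

    hits: list of ("HH:MM:SS", path) tuples from a single host on a single day.
    The thresholds are arbitrary illustrative defaults.
    """
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t, _ in hits)
    window = timedelta(seconds=window_seconds)
    bursts = []
    i = 0
    while i < len(times):
        j = i
        # Extend the run while the next hit is still inside the window.
        while j + 1 < len(times) and times[j + 1] - times[i] <= window:
            j += 1
        count = j - i + 1
        if count >= min_hits:
            bursts.append((times[i].strftime("%H:%M:%S"),
                           times[j].strftime("%H:%M:%S"), count))
            i = j + 1  # skip past the burst just recorded
        else:
            i += 1
    return bursts

# The 15 hits from the log excerpt above.
SAMPLE = [
    ("06:35:59", "/dir/file52.html"), ("06:36:00", "/dir/file51.html"),
    ("06:36:01", "/dir/file50.html"), ("06:36:02", "/dir/file49.html"),
    ("06:36:02", "/dir/file48.html"), ("06:36:03", "/dir/file45.html"),
    ("06:36:04", "/dir/file44.html"), ("06:36:05", "/dir/file43.html"),
    ("06:36:05", "/dir/file42.html"), ("06:36:06", "/dir/file40.html"),
    ("06:36:07", "/dir/file36.html"), ("06:36:08", "/dir/file32.html"),
    ("06:36:08", "/dir/file31.html"), ("06:36:09", "/dir/file30.html"),
    ("06:36:10", "/dir/file29.html"),
]

print(find_bursts(SAMPLE))
```

Run against the excerpt, it reports a single burst covering all 15 hits. Loosen or tighten the thresholds to suit your own traffic.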
Then last night, Instapaper double-hit the exact same 15 files, alternating named AND cloaked UAs with each split-second hit:
www6.instapaper.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
04:56:48 /dir/file52.html
www6.instapaper.com
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7
04:56:49 /dir/file52.html
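The named/cloaked alternation can be picked out of a log mechanically. A rough Python sketch, assuming each log line has been parsed into a (host, UA, time, path) tuple; the function name `hosts_with_mixed_uas` and the "Instapaper" marker test are my own illustrative choices:

```python
from collections import defaultdict

def hosts_with_mixed_uas(records, marker="Instapaper"):
    """Map host -> sorted UAs, for hosts that sent both named and unnamed UAs.

    records: iterable of (host, user_agent, time, path) tuples.
    A UA counts as "named" if it contains the marker string.
    """
    uas_by_host = defaultdict(set)
    for host, ua, _time, _path in records:
        uas_by_host[host].add(ua)
    flagged = {}
    for host, uas in uas_by_host.items():
        named = any(marker in ua for ua in uas)
        cloaked = any(marker not in ua for ua in uas)
        if named and cloaked:
            flagged[host] = sorted(uas)
    return flagged

# The two hits from the excerpt above.
RECORDS = [
    ("www6.instapaper.com",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 "
     "(KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)",
     "04:56:48", "/dir/file52.html"),
    ("www6.instapaper.com",
     "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) "
     "Gecko/2009021906 Firefox/3.0.7",
     "04:56:49", "/dir/file52.html"),
]

for host, uas in hosts_with_mixed_uas(RECORDS).items():
    print(host, len(uas), "distinct UAs")
```

A host that shows up with both a self-identifying UA and a plain browser UA is the pattern to look for; a host with only one kind is left alone.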
2.) Also, apparently user-owned versions of its apps rifle directories independently AND in tandem with the company's servers. Note the times, and the directory traversal in the requested paths (the original paths were wrong to begin with):
www6.instapaper.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
15:30:39 /dir/file.html
host81-155-205-252.range81-155.btcentralplus.com
InstapaperPro/3.0.3 CFNetwork/485.12.7 Darwin/10.4.0
15:30:47 /dir/../dir/dot.gif
15:30:50 /dir/../dir/site.gif
15:30:51 /dir/../dir/sub.gif
15:30:52 /dir/../dir/ind.gif
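Requests like the `/dir/../dir/` ones above are easy to flag. A small Python sketch using only the standard library; the helper names `has_traversal` and `normalized` are hypothetical, and this only looks at the path as logged, not at how any particular server resolves it:

```python
import posixpath
from urllib.parse import unquote

def has_traversal(path):
    """True if the request path contains a '..' segment (after %-decoding)."""
    return ".." in unquote(path).split("/")

def normalized(path):
    """Collapse '.' and '..' segments to show what the path resolves to."""
    return posixpath.normpath(unquote(path))

# Paths taken from the excerpt above.
for p in ["/dir/file.html", "/dir/../dir/dot.gif", "/dir/../dir/site.gif"]:
    print(p, has_traversal(p), normalized(p))
```

A well-behaved client resolves `..` before sending the request, so seeing it in the raw request line at all is worth noting.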
3.) Instapaper's scrape-n-serve servers and bots include:
www1.instapaper.com [174.121.186.250]
www5.instapaper.com [184.172.0.213]
www6.instapaper.com [184.172.0.211]
(robtex: "www3.instapaper.com, www6.instapaper.com, mail.instapaper.com, www4.instapaper.com, www1.instapaper.com and at least two other hosts are subdomains to this hostname.")
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
4.) Other/customer(?) UA variations include:
cpc12-cmbg15-2-0-custnnn.5-4.cable.virginmedia.com
InstapaperPro/3.0.3 CFNetwork/548.0.3 Darwin/11.0.0
184.173.115.2nn-static.reverse.softlayer.com
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Instapaper/4.0 (+http://www.instapaper.com/)
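Because the cloaked Firefox UAs carry no Instapaper token, matching on UA alone misses them; the host name has to be checked too. A hedged Python sketch of that two-part test, based only on the hosts and UAs listed above. The name `is_instapaper`, the regex, and the host suffix are my own assumptions, and reverse-DNS names can be forged, so treat a host match as a hint rather than proof:

```python
import re

# Matches "Instapaper/4.0" and "InstapaperPro/3.0.3" style tokens.
UA_PATTERN = re.compile(r"\bInstapaper(Pro)?/\d", re.IGNORECASE)
# Assumed suffix, taken from the www1/www5/www6 host names listed above.
HOST_SUFFIXES = (".instapaper.com",)

def is_instapaper(host, ua):
    """True if the resolved host name or the UA string points to Instapaper."""
    host = host.lower().rstrip(".")
    if host.endswith(HOST_SUFFIXES):
        return True  # catches cloaked UAs sent from the company's servers
    return bool(UA_PATTERN.search(ua))

FIREFOX = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; "
           "rv:1.9.0.7) Gecko/2009021906 Firefox/3.0.7")
APP = "InstapaperPro/3.0.3 CFNetwork/548.0.3 Darwin/11.0.0"

print(is_instapaper("www6.instapaper.com", FIREFOX))   # host match
print(is_instapaper("cpc12-cmbg15-2-0-custnnn.5-4.cable.virginmedia.com", APP))  # UA match
print(is_instapaper("cpc12-cmbg15-2-0-custnnn.5-4.cable.virginmedia.com", FIREFOX))  # neither
```

The same two conditions translate directly into whatever blocking mechanism your server offers.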
5.) Again, as always, for all Instapaper-related hits from anywhere:
robots.txt? NO
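If you want to run the same robots.txt check on your own logs, a minimal Python sketch follows. It assumes (host, path) pairs already extracted from the log; the function name `hosts_without_robots_fetch` and the `polite.example` host are hypothetical, included only for contrast:

```python
def hosts_without_robots_fetch(records):
    """Return sorted hosts that requested something but never /robots.txt.

    records: iterable of (host, path) tuples from the access log.
    """
    seen = set()
    fetched = set()
    for host, path in records:
        seen.add(host)
        # Ignore any query string when comparing the path.
        if path.split("?", 1)[0] == "/robots.txt":
            fetched.add(host)
    return sorted(seen - fetched)

RECORDS = [
    ("www6.instapaper.com", "/dir/file52.html"),
    ("www6.instapaper.com", "/dir/file51.html"),
    ("polite.example", "/robots.txt"),        # hypothetical well-behaved bot
    ("polite.example", "/dir/file52.html"),
]

print(hosts_without_robots_fetch(RECORDS))
```

This only shows whether robots.txt was ever requested, not whether its rules were obeyed, but a bot that never asks cannot be obeying.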
Instapaper's bot-running isn't new. It's way, waay past time they got a clue.