newspaper

Pfui

3:12 am on Feb 26, 2019 (gmt 0)

UA: newspaper/0.2.8
robots.txt? NO
rss.xml, etc.? NO

First seen 02-23, a single hit by a single university. Then another. Then those same schools by Host and IP, then repeat hits, then private accounts today:

Duke; Princeton; .ny.cable.rcncustomer.com; .nvidia.com

Hit counts doubling every day -- not a good omen.

At first I thought someone included a link in a school somethingorother, but every hit's to a different file, even repeat visitors, typically to non-graphic file types including cgi.

And then today, and perhaps most bot-tellingly, hits attempt graphics but the file paths are wrong.

Anyone else see this thing? If yes, see any patterns or referrers, or anything?

(LTNS to the legacy folks around here:)

tangor

4:03 am on Feb 26, 2019 (gmt 0)

Not yet ... sounds like computer lab stuff being shared for group stuff... kill or not? Me, kill first until they can explain why. (Uni's are not my fave bot hits)

lucy24

6:06 am on Feb 26, 2019 (gmt 0)

:: detour to raw logs ::

Huh. I find just a couple hits, from almost a year ago. Back then, they were at version 0.2.6:

129.10.110.abc - - [29/Apr/2018:06:04:14 -0700] "GET /directory/subdir/pagename.html HTTP/1.1" 403 1838 "-" "newspaper/0.2.6"

(IP is Northeastern University).

And, more interestingly, going back still earlier (different page):

128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 3537 "-" "Goose/1.1.2" 
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 3537 "-" "Goose/1.1.2" 
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 1838 "-" "newspaper/0.0.9.8" 
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /images/banner-icon.png HTTP/1.1" 200 1768 "http://example.com/directory/subdir/" "newspaper/0.0.9.8"

(had to look that one up: Virginia Polytechnic)
along with (analytics lives on a different site, which happens to be IPv6, hence the different IP)

2001:468:c80:2129:1618:77ff:etcetera - - [22/Feb/2018:16:54:35 -0800] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 403 6827 "http://example.com/directory/subdir/" "newspaper/0.0.9.8"

The image banner-icon.png is used by the 403 page. So is the piwik file: the noscript version is in <img> tags. But they didn't request the stylesheet which also accompanies error documents.

Cross-checking "Goose" (which is to say, \bGoose, so as to avoid DuckDuckGoose) leads to

52.187.52.abc - - [25/Jun/2018:20:50:03 -0700] "GET /directory/subdir/pagename.html HTTP/1.1" 403 3537 "-" "Goose/1.0.25"

but that's not an academic IP. (It's mildly interesting to note that /directory/subdir/ is the same as the one requested by "newspaper" in April, though not the identical page.)
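
For anyone wanting to repeat the \bGoose cross-check outside a text editor, here is a minimal sketch in Python. The sample list is an assumption for illustration: the two Goose strings and the newspaper string come from this thread, while "DuckDuckGoose/7" is a made-up stand-in for the user-agent the word boundary is meant to skip.

```python
import re

# \b needs a non-word character (or the start of the string) right before
# "Goose", so a bare "Goose/..." matches but "DuckDuckGoose" does not.
GOOSE = re.compile(r"\bGoose")

# Hypothetical sample set; "DuckDuckGoose/7" is invented for the demo.
samples = [
    "Goose/1.1.2",
    "Goose/1.0.25",
    "DuckDuckGoose/7",
    "newspaper/0.2.8",
]

matches = [ua for ua in samples if GOOSE.search(ua)]
print(matches)  # ['Goose/1.1.2', 'Goose/1.0.25']
```

The same pattern should carry over to grep -P or a rewrite condition, though that is untested here.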

Huh.

Edit: It is possible that phranque or someone like him can explain why the two user-agents got such differently sized responses on identical 403'd requests. (I know why the piwik was bigger; that's an https site.)

Delving deeper into logs, I find a couple requests from way back in December 2016 (newspaper/0.0.9.8 from 5.9.142.abc--Hetzner, yawn--different requests on different days with same odd garbage attached to URL). I don't save headers that long, so can't verify that any or all of these are the same. Goose and newspaper sent different headers on that one day when both made requests from the same IP in the same time period. (Is this why the 403 came out different?)
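
One way to line up response sizes per user-agent, as a rough sketch rather than anything actually run against these logs: parse each line with a regular expression and collect the (request, status, size) combinations seen for each agent. It assumes the usual combined log format shown in the excerpts above; the names LOG_LINE and sizes_by_agent are made up for this example.

```python
import re
from collections import defaultdict

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

def sizes_by_agent(lines):
    """Map each user-agent to the set of (request, status, size) it produced."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip anything that is not a combined-format line
        size = 0 if m["size"] == "-" else int(m["size"])
        seen[m["agent"]].add((m["request"], int(m["status"]), size))
    return dict(seen)

# Three of the lines quoted earlier in this thread.
sample = [
    '128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 3537 "-" "Goose/1.1.2" ',
    '128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 1838 "-" "newspaper/0.0.9.8" ',
    '128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /images/banner-icon.png HTTP/1.1" 200 1768 "http://example.com/directory/subdir/" "newspaper/0.0.9.8"',
]

report = sizes_by_agent(sample)
for agent, hits in sorted(report.items()):
    print(agent, sorted(hits))
```

This only shows that the sizes differ by agent. Explaining why would still need the request headers, which the access log does not keep.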

SumGuy

4:13 am on Feb 27, 2019 (gmt 0)

I have what would otherwise look like a "normal" hit to default.html from 178.165.79.59 (Ukraine) on 12/18/2016 with user-agent newspaper/0.1.7 (no referer). It downloaded everything it should have except for favicon.ico. Then a hit to default.html (but no other files) from 77.133.242.134 (France) on 1/3/2017, user-agent was newspaper/0.0.9.8, again no referer. I didn't look further back than Jan 2015.

I got a single hit (to a pdf file) on 10/12/2015 from 23.98.64.120 (Microsoft) with user-agent Goose/1.0.25. No referer.

That's all the sightings I have for Goose and Newspaper (since Jan 2015).

Pfui

2:32 pm on Jul 28, 2019 (gmt 0)

Thanks for thoughts, gang. FWIW, am still seeing "newspaper/0.2.8" every day, from all over: Germany; France; Comcast and other U.S. Hosts; Japan; the despised .googleusercontent.com; etc. Still blocking its always-single hits to a score of files. Still no clue as to its origin or purpose(s).

Have never seen a Goose.

tangor

8:16 am on Jul 29, 2019 (gmt 0)

Oddly enough, Pfui ... I have never been visited by this particular ... Even so, if encountered, will be blocked on general principles. :)

The web is a wild and woolly place (always has been) but the bots are worse than ever!

iamlost

6:15 pm on Jul 29, 2019 (gmt 0)

name='newspaper3k',
version='0.2.8',
description='Simplified python article discovery & extraction.',
long_description=readme,
author='Lucas Ou-Yang',
author_email='lucasyangpersonal@gmail.com',
url='https://github.com/codelucas/newspaper/',

lucy24

9:11 pm on Jul 29, 2019 (gmt 0)

Simplified python article discovery & extraction

That seems an awfully verbose way to say "scraping".

Pfui

10:32 pm on Jul 29, 2019 (gmt 0)

Thanks for solving the mystery, iamlost. And lucy, lol. Exactly.
