
Forum Moderators: Ocean10000

newspaper

     
3:12 am on Feb 26, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


UA: newspaper/0.2.8
robots.txt? NO
rss.xml, etc.? NO

First seen 02-23, a single hit by a single university. Then another. Then those same schools by Host and IP, then repeat hits, then private accounts today:

Duke; Princeton; .ny.cable.rcncustomer.com; .nvidia.com

Hit counts doubling every day -- not a good omen.

At first I thought someone included a link in a school somethingorother, but every hit's to a different file, even repeat visitors, typically to non-graphic file types including cgi.

And then today, and perhaps most bot-tellingly, hits attempt graphics but the file paths are wrong.

Anyone else see this thing? If yes, see any patterns or referrers, or anything?

(LTNS to the legacy folks around here:)
4:03 am on Feb 26, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10572
votes: 1125


Not yet ... sounds like computer lab stuff being shared for group projects ... kill or not? Me, kill first until they can explain why. (Unis are not my fave bot hits)
6:06 am on Feb 26, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15936
votes: 889


:: detour to raw logs ::

Huh. I find just a couple hits, from almost a year ago. Back then, they were at version 0.2.6:
129.10.110.abc - - [29/Apr/2018:06:04:14 -0700] "GET /directory/subdir/pagename.html HTTP/1.1" 403 1838 "-" "newspaper/0.2.6" 
(IP is Northeastern University).

And, more interestingly, going back still earlier (different page):
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 3537 "-" "Goose/1.1.2" 
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 3537 "-" "Goose/1.1.2"
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /directory/subdir/ HTTP/1.1" 403 1838 "-" "newspaper/0.0.9.8"
128.173.237.abc - - [22/Feb/2018:16:54:35 -0800] "GET /images/banner-icon.png HTTP/1.1" 200 1768 "http://example.com/directory/subdir/" "newspaper/0.0.9.8"
(had to look that one up: Virginia Polytechnic)
along with (analytics lives on a different site, which happens to be IPv6, hence the different IP)
2001:468:c80:2129:1618:77ff:etcetera - - [22/Feb/2018:16:54:35 -0800] "GET /piwik/piwik.php?idsite=3&rec=1 HTTP/1.1" 403 6827 "http://example.com/directory/subdir/" "newspaper/0.0.9.8" 
The image banner-icon.png is used by the 403 page. So is the piwik file: the noscript version is in <img> tags. But they didn't request the stylesheet which also accompanies error documents.

Cross-checking “Goose” (which is to say, \bGoose, so as to avoid DuckDuckGoose) leads to
52.187.52.abc - - [25/Jun/2018:20:50:03 -0700] "GET /directory/subdir/pagename.html HTTP/1.1" 403 3537 "-" "Goose/1.0.25" 
but that’s not an academic IP. (It's mildly interesting to note that /directory/subdir/ is the same directory that newspaper requested in April, though not the identical page.)
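
For anyone wanting to repeat the cross-check, here is a minimal sketch of the word-boundary match in Python. The function name `is_goose` is made up for illustration, and the DuckDuckGoose string is only a stand-in for whatever that agent actually sends.

```python
import re

# \b is a word boundary, so "Goose" must begin a word.
# "DuckDuckGoose" does not match because "k" and "G" are both word characters.
GOOSE_UA = re.compile(r"\bGoose")

def is_goose(user_agent: str) -> bool:
    """Return True if the user-agent contains 'Goose' at the start of a word."""
    return GOOSE_UA.search(user_agent) is not None

if __name__ == "__main__":
    for ua in ("Goose/1.1.2", "Goose/1.0.25", "DuckDuckGoose/5.0"):
        print(ua, is_goose(ua))
```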

Huh.

Edit: It is possible that phranque or someone like him can explain why the two user-agents got such differently sized responses on identical 403'd requests. (I know why the piwik was bigger; that's an https site.)

Delving deeper into logs, I find a couple requests from way back in December 2016 (newspaper/0.0.9.8 from 5.9.142.abc--Hetzner, yawn--different requests on different days with same odd garbage attached to URL). I don't save headers that long, so can't verify that any or all of these are the same. Goose and newspaper sent different headers on that one day when both made requests from the same IP in the same time period. (Is this why the 403 came out different?)
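
A rough way to line up the differing 403 sizes per user-agent, assuming the logs are in the combined format shown above. The regex and the name `sizes_by_agent` are my own assumptions, not anything from the server config, and this only reports the sizes; it does not explain them.

```python
import re
from collections import defaultdict

# Assumed layout: Apache-style "combined" log line, as in the excerpts above.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

def sizes_by_agent(lines, status="403"):
    """Map each user-agent to the sorted distinct response sizes for one status code."""
    sizes = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or m.group("status") != status:
            continue  # skip unparseable lines and other status codes
        if m.group("size") == "-":
            continue  # no body size was logged
        sizes[m.group("agent")].add(int(m.group("size")))
    return {agent: sorted(values) for agent, values in sizes.items()}
```

Feeding it the February 2018 lines would list 3537 under Goose/1.1.2 and 1838 under newspaper/0.0.9.8, which makes the mismatch easy to spot across a whole log file.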
4:13 am on Feb 27, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:99
votes: 0


I have what would otherwise look like a "normal" hit to default.html from 178.165.79.59 (Ukraine) on 12/18/2016 with user-agent newspaper/0.1.7 (no referer). It downloaded everything it should have except for favicon.ico. Then a hit to default.html (but no other files) from 77.133.242.134 (France) on 1/3/2017, user-agent was newspaper/0.0.9.8, again no referer. I didn't look further back than Jan 2015.

I got a single hit (to a pdf file) on 10/12/2015 from 23.98.64.120 (Microsoft) with user-agent Goose/1.0.25. No referer.

That's all the sightings I have for Goose and Newspaper (since Jan 2015).
2:32 pm on July 28, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


Thanks for thoughts, gang. FWIW, am still seeing "newspaper/0.2.8" every day, from all over: Germany; France; Comcast and other U.S. Hosts; Japan; the despised .googleusercontent.com; etc. Still blocking its always-single hits to a score of files. Still no clue as to its origin or purpose(s).
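
For sites served through a Python app rather than straight Apache rules, blocking by user-agent might look roughly like the sketch below. It is an assumption-laden illustration, not how the blocking described here is actually done: `block_scrapers` and `BLOCKED_UA` are made-up names, and the pattern only covers agents that start with "newspaper/".

```python
import re

# Hypothetical block list: any user-agent beginning with "newspaper/".
BLOCKED_UA = re.compile(r"^newspaper/")

def block_scrapers(app):
    """Wrap a WSGI app so that blocked user-agents get a plain 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA.search(agent):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```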

Have never seen a Goose.
8:16 am on July 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10572
votes: 1125


Oddly enough, Pfui ... I have never been visited by this particular ... Even so, if encountered, will be blocked on general principles. :)

The web is a wild and woolly place (always has been) but the bots are worse than ever!
6:15 pm on July 29, 2019 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 25, 2003
posts:1351
votes: 443



name='newspaper3k',
version='0.2.8',
description='Simplified python article discovery & extraction.',
long_description=readme,
author='Lucas Ou-Yang',
author_email='lucasyangpersonal@gmail.com',
url='https://github.com/codelucas/newspaper/',
9:11 pm on July 29, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15936
votes: 889


Simplified python article discovery & extraction

That seems an awfully verbose way to say “scraping”.
10:32 pm on July 29, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


Thanks for solving the mystery, iamlost. And lucy, lol. Exactly.