Scrapy.org

     
9:07 pm on Jul 24, 2015 (gmt 0)

aristotle


I'm not sure I understand this, but it appears to be some kind of ready-made software for scrapers to customize.
Host: 23.29.134.16 
/
Http Code: 200 Date: Jul 24 16:45:31 Http Version: HTTP/1.1 Size in Bytes: 21786
Referer: [It came from a backlink]
Agent: Scrapy/1.0.1 (+http://scrapy.org)

NOTE: The referer was another site that links to my site. Apparently the bot followed the backlink to get to my site.
2:47 am on July 25, 2015 (gmt 0)

lucy24


Apparently the bot followed the backlink to get to my site.

Well, for a given definition of "follow". Sure, that other site was probably how the bot learned about yours. But a bot can always choose whether or not to send a referer. Sending one makes it look more human and sometimes makes it easier to get past barriers intended to sniff out robots. For comparison, consider the Googlebot. It sometimes sends a referer for supporting files such as scripts or stylesheets. I've got a lurking suspicion it does this to see whether the css/js/whatever content is really always the same, no matter what page is doing the invoking. I think the W3C link checker also has an option to send or not send a referer.
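
To illustrate the point that the referer is simply whatever the client decides to send, here is a minimal Python sketch using only the standard library. The `build_request` helper, the `example-bot/0.1` agent string, and the `links.example.tld` URL are hypothetical; nothing here is taken from Scrapy itself, and no request is actually sent.

```python
from urllib.request import Request


def build_request(url, referer=None):
    """Build an HTTP request, optionally claiming a referring page."""
    headers = {"User-Agent": "example-bot/0.1"}  # hypothetical agent string
    if referer is not None:
        # The header is just a string chosen by the client; the server
        # has no way to confirm the client ever visited that page.
        headers["Referer"] = referer
    return Request(url, headers=headers)


if __name__ == "__main__":
    claimed = build_request("http://www.example.tld/",
                            referer="http://links.example.tld/page")
    silent = build_request("http://www.example.tld/")
    print(claimed.get_header("Referer"))
    print(silent.has_header("Referer"))
```

So a referer in the log shows what the client claimed, not the route it took.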
3:41 am on July 25, 2015 (gmt 0)



Host: 23.29.134.16

On the 20th, Scrapy stopped by from nearby 23.29.134.14 (a.k.a. ip14.23-29-134.static.steadfastdns.net), which reminded me to drop-kick all of cesspool steadfastnetworks via iptables.

23.29.128.0 - 23.29.159.255
23.29.128.0/19
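
The two lines above give the same block twice, once as a start-end range and once in CIDR form. A short sketch with Python's standard `ipaddress` module can confirm that the two agree and check whether a given visitor falls inside. The `in_blocked_range` helper is a hypothetical name, and this only does the arithmetic; it does not touch iptables.

```python
import ipaddress

# Start and end of the range quoted in the post.
first = ipaddress.ip_address("23.29.128.0")
last = ipaddress.ip_address("23.29.159.255")

# Collapse the start-end range into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))


def in_blocked_range(addr):
    """Return True if addr falls inside any of the summarized blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in blocks)


if __name__ == "__main__":
    print(blocks)
    for host in ("23.29.134.14", "23.29.134.16", "5.45.60.115"):
        print(host, in_blocked_range(host))
```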

The hit similarly included a legit backlink/referrer but like lucy, I don't think it actually came from there first.
7:55 pm on July 26, 2015 (gmt 0)

lucy24


Follow-up:
23.29.134.15 - - [23/Jul/2015:16:01:30 -0700] "GET / HTTP/1.1" 301 505 "http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)" 
23.29.134.14 - - [23/Jul/2015:16:01:33 -0700] "GET / HTTP/1.1" 200 2638 "http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)"
The 301 presumably means that they first asked for the without-www form of a with-www name. But as it happens, I personally know this referer. (Hint: in the "linking text" section of WMT it's listed as "my mom's games", making the link older than dirt.) But the real-life link is not to my front page, let alone to the without-www version thereof; it's to /games/

So, nice try, scrapy.
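
The check described above (the claimed referer really links to /games/, yet the request was for /) can be roughed out in Python. This is only a sketch: it assumes the combined log format shown in the two lines above, and the `known_links` table is a hypothetical, hand-maintained mapping from a referring URL to the path it actually links to.

```python
import re

# Combined log format, as in the two sample lines above.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def referer_mismatch(entry, known_links):
    """True when the referer is a known linking page but links to a different path."""
    target = known_links.get(entry["referer"])
    return target is not None and target != entry["path"]


if __name__ == "__main__":
    sample = ('23.29.134.15 - - [23/Jul/2015:16:01:30 -0700] "GET / HTTP/1.1" '
              '301 505 "http://www.example.tld/" '
              '"Scrapy/1.0.1 (+http://scrapy.org)"')
    # Hypothetical table: this referring page is known to link to /games/.
    known_links = {"http://www.example.tld/": "/games/"}
    entry = parse_line(sample)
    print(entry["host"], entry["agent"], referer_mismatch(entry, known_links))
```

A mismatch does not prove the referer was faked, but it does mean the request cannot have come from clicking that link.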
2:44 am on July 29, 2015 (gmt 0)



Why would a Netherlands-based antivirus/anti-spyware company run Scrapy from its mothership? Came by today, no robots.txt, no referrer:

ams07-015.ff.avast.com [5.45.60.115]
Scrapy/1.0.1 (+http://scrapy.org)

5.45.60.0 - 5.45.60.255 [5.45.60.0/24]

Scrapy's the only "Browser Agent/s on [that] IP": [myip.ms...]
4:43 pm on July 29, 2015 (gmt 0)



Some of the earlier visits from the following sources on my sites:

12/20/13
162.204.244.13 (AT&T)
Scrapy/0.20.2 (+http://scrapy.org)
------------------------------------------
02/22/14
140.112.31.73
nlg10.csie.ntu.edu.tw
Scrapy/0.20.1 (+http://scrapy.org)
------------------------------------------
04/01/14
162.13.9.67 (RSPC-UK-Rackspace-Cloud-Servers)
Scrapy/0.22.2 (+http://scrapy.org)
------------------------------------------
10/23/14 (bot went ape-sh....t)
ip70-173-68-141.lv.lv.cox.net
Scrapy/0.24.4 (+http://scrapy.org)
------------------------------------------
10/23/14
195.228.45.176 (MT-HOSTING in HU)
Scrapy/0.24.4 (+http://scrapy.org)
------------------------------------------
04/01/15
45.55.170.178 (DIGITALOCEAN-11)
Scrapy/0.24.5 (+http://scrapy.org)


After April 1st, I banned it by UA (Scrapy) in .htaccess and never looked back :)
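
The post does not show the .htaccess rule itself, so the following is not that rule. It is a rough Python analogue of the same idea for sites served through WSGI rather than Apache: refuse any request whose User-Agent contains "Scrapy". The names `block_scrapy` and `hello` are hypothetical. Bear in mind that the agent string is set by the client, so this only stops crawlers that leave the default in place.

```python
def block_scrapy(app, needle="scrapy"):
    """Wrap a WSGI app; answer 403 when the User-Agent contains needle."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if needle in agent.lower():
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    # Serve on localhost:8000; requests announcing Scrapy receive a 403.
    make_server("127.0.0.1", 8000, block_scrapy(hello)).serve_forever()
```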
8:10 pm on July 29, 2015 (gmt 0)

keyplyr


I've always blocked it.