Scrapy.org

         

aristotle

9:07 pm on Jul 24, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure I understand this, but it appears to be some kind of ready-made software for scrapers to customize.
Host: 23.29.134.16 
/
Http Code: 200 Date: Jul 24 16:45:31 Http Version: HTTP/1.1 Size in Bytes: 21786
Referer: [It came from a backlink]
Agent: Scrapy/1.0.1 (+http://scrapy.org)

NOTE: The referer was another site that links to my site. Apparently the bot followed the backlink to get to my site.

lucy24

2:47 am on Jul 25, 2015 (gmt 0)




Apparently the bot followed the backlink to get to my site.

Well, for a given definition of "follow". Sure, that other site was probably how the bot learned about yours. But a bot can always choose whether or not to send a referer. If it does, it looks more human, and that sometimes makes it easier to get past barriers intended to sniff out robots. For comparison purposes, consider the Googlebot. It sometimes sends a referer for supporting files such as scripts or stylesheets. I've got a lurking suspicion it does this to see whether the css/js/whatever content is really always the same, no matter what page is doing the invoking. I think the W3C link checker also has an option to send or not send a referer.
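
To make the point about the referer concrete, here is a minimal Python sketch using only the standard library. It builds two requests without sending either one. The URLs, the `ExampleBot/0.1` user agent, and the `build_request` helper are all made up for illustration; nothing here is taken from Scrapy itself.

```python
from urllib.request import Request

# Hypothetical URLs, for illustration only.
TARGET = "http://www.example.tld/"
CLAIMED_SOURCE = "http://linking-site.example/page.html"


def build_request(url, referer=None, user_agent="ExampleBot/0.1"):
    """Build (but do not send) a request. The Referer header is optional
    and is whatever the client decides to put there."""
    headers = {"User-Agent": user_agent}
    if referer is not None:
        headers["Referer"] = referer
    return Request(url, headers=headers)


with_referer = build_request(TARGET, referer=CLAIMED_SOURCE)
without_referer = build_request(TARGET)

print(with_referer.get_header("Referer"))     # the claimed source URL
print(without_referer.get_header("Referer"))  # None: no header was set
```

The server only ever sees the header value, so a referer in the log shows what the client claimed, not where it actually came from.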

Pfui

3:41 am on Jul 25, 2015 (gmt 0)




Host: 23.29.134.16

On the 20th, Scrapy stopped by from nearby 23.29.134.14 (a.k.a. ip14.23-29-134.static.steadfastdns.net), which reminded me to drop-kick all of cesspool steadfastnetworks via iptables.

23.29.128.0 - 23.29.159.255
23.29.128.0/19

The hit similarly included a legit backlink/referrer but, like lucy, I don't think the bot actually came from there first.
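
For anyone who wants to double-check that the start/end range and the CIDR block quoted above describe the same addresses, a short sketch with Python's standard `ipaddress` module follows. The two host addresses are the ones reported earlier in this thread; the variable names are just for illustration.

```python
import ipaddress

first = ipaddress.ip_address("23.29.128.0")
last = ipaddress.ip_address("23.29.159.255")

# Collapse the start/end pair into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)  # [IPv4Network('23.29.128.0/19')]

network = blocks[0]
print(network.num_addresses)  # 8192

# Both reported Scrapy hosts fall inside the block.
for host in ("23.29.134.14", "23.29.134.16"):
    print(host, ipaddress.ip_address(host) in network)
```

A single block coming back confirms that one /19 rule covers the whole range.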

lucy24

7:55 pm on Jul 26, 2015 (gmt 0)




Follow-up:
23.29.134.15 - - [23/Jul/2015:16:01:30 -0700] "GET / HTTP/1.1" 301 505 "http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)" 
23.29.134.14 - - [23/Jul/2015:16:01:33 -0700] "GET / HTTP/1.1" 200 2638 "http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)"
The 301 presumably means that they first asked for the without-www form of a with-www name. But as it happens, I personally know this referer. (Hint: in the "linking text" section of WMT it's listed as "my mom's games", making the link older than dirt.) But the real-life link is not to my front page, let alone to the without-www version thereof; it's to /games/

So, nice try, scrapy.
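
The same check can be sketched in Python for anyone who wants to run it over a whole log. The regular expression assumes the Apache combined log format shown in the two lines above. `KNOWN_LINK_TARGET` and the "mismatch" test are assumptions standing in for knowledge of where the referring page really links, so treat this as a rough heuristic rather than a detector.

```python
import re

LOG_LINES = [
    '23.29.134.15 - - [23/Jul/2015:16:01:30 -0700] "GET / HTTP/1.1" 301 505 '
    '"http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)"',
    '23.29.134.14 - - [23/Jul/2015:16:01:33 -0700] "GET / HTTP/1.1" 200 2638 '
    '"http://www.example.tld/" "Scrapy/1.0.1 (+http://scrapy.org)"',
]

# Apache "combined" format: ip, ident, user, [time], "request", status, size,
# "referer", "user agent".
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Assumption: the path the referring page actually links to.
KNOWN_LINK_TARGET = "/games/"

entries = [parse(line) for line in LOG_LINES]
for entry in entries:
    mismatch = entry["path"] != KNOWN_LINK_TARGET
    print(entry["ip"], entry["status"], entry["path"], "referer mismatch:", mismatch)
```

Both requests ask for `/` while the real link points at `/games/`, so both are flagged.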

Pfui

2:44 am on Jul 29, 2015 (gmt 0)




Why would a Netherlands-based anti-viral/anti-spyware company run Scrapy from its mothership? Came by today, no robots.txt, no referrer:

ams07-015.ff.avast.com [5.45.60.115]
Scrapy/1.0.1 (+http://scrapy.org)

5.45.60.0 - 5.45.60.255 [5.45.60.0/24]

Scrapy's the only "Browser Agent/s on [that] IP": [myip.ms...]

blend27

4:43 pm on Jul 29, 2015 (gmt 0)




Some of the earlier visits from the following sources on my sites:

12/20/13
162.204.244.13 (AT&T)
Scrapy/0.20.2 (+http://scrapy.org)
------------------------------------------
02/22/14
140.112.31.73
nlg10.csie.ntu.edu.tw
Scrapy/0.20.1 (+http://scrapy.org)
------------------------------------------
04/01/14
162.13.9.67 (RSPC-UK-Rackspace-Cloud-Servers)
Scrapy/0.22.2 (+http://scrapy.org)
------------------------------------------
10/23/14 (bot went ape-sh....t)
ip70-173-68-141.lv.lv.cox.net
Scrapy/0.24.4 (+http://scrapy.org)
------------------------------------------
10/23/14
195.228.45.176 (MT-HOSTING in HU)
Scrapy/0.24.4 (+http://scrapy.org)
------------------------------------------
04/01/15
45.55.170.178 (DIGITALOCEAN-11)
Scrapy/0.24.5 (+http://scrapy.org)


After April 1st, I banned by UA (Scrapy) in .htaccess and never looked back :)
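
The ban above was done in .htaccess, and the actual rule isn't shown. For sites that run a Python application instead of relying on Apache rules, the same idea might look roughly like the WSGI wrapper below. This is a swapped-in illustration, not the rule used above: `block_user_agents`, `hello_app`, and `status_for` are hypothetical names, and the case-insensitive substring match is an assumption.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: block on a case-insensitive substring of the User-Agent.
BLOCKED_UA_SUBSTRINGS = ("scrapy",)


def block_user_agents(app, blocked=BLOCKED_UA_SUBSTRINGS):
    """Wrap a WSGI app and answer 403 to any blocked User-Agent."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application, for demonstration only."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(agent):
    """Run one fake request through the wrapped app and return its status."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = agent
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    list(block_user_agents(hello_app)(environ, start_response))
    return seen[0]


print(status_for("Scrapy/0.24.5 (+http://scrapy.org)"))  # 403 Forbidden
print(status_for("Mozilla/5.0"))                         # 200 OK
```

Like any UA-based ban, this only catches bots that leave the default Scrapy user agent in place.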

keyplyr

8:10 pm on Jul 29, 2015 (gmt 0)




I've always blocked it.