Python, Curl and Robots.txt

     
9:11 am on Oct 26, 2019 (gmt 0)
dstiles (Senior Member from GB)


I'm seeing various robots.txt accesses for python and curl, which suggests they (may) obey robots.txt.

Trouble is, I can't find the names to put into the file to discourage them - the obvious candidates are python and curl, but the nearest I can find is pycurl. Does anyone know about these ubiquitous bots?
12:34 pm on Oct 26, 2019 (gmt 0)
not2easy (Administrator from US)


Could be (?) one of the server-based HTTP clients like Pcore - see this old thread: [webmasterworld.com...]
5:26 pm on Oct 26, 2019 (gmt 0)
Senior Member (joined Nov 2005)


My approach is a bit more extreme: I simply 403 all non-Mozilla UAs except for robots.txt (and specific whitelisting). Then robots.txt's default is --

User-agent: *
Disallow: /

-- so even if something asks properly, they still get a No, thanks.
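
If you're on Apache with mod_rewrite, the idea works out to roughly this sketch (the whitelisted names are placeholders, not a recommendation):

# always let robots.txt itself through
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# skip specifically whitelisted agents (placeholder names)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
# anything whose UA doesn't start with "Mozilla" gets a 403
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteRule .* - [F]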

I only wish every robots.txt request meant friend rather than foe. The wholly obnoxious, log-exploding package o' exploits -- last night it reached 894 hits -- starts out faux-innocently enough. Then in the next second, WHAM:

[03:37:54] "GET / HTTP/1.1"
[03:37:54] "GET /robots.txt HTTP/1.1"
[03:37:55] "POST /dc6beecc/admin.php HTTP/1.1"
(891 additional hits not included:)
5:55 pm on Oct 26, 2019 (gmt 0)
lucy24 (Senior Member from US)


I don't think they're planning to obey robots.txt. I think they're just looking for ideas about what to get next.

Compliant entities are supposed to interpret robots.txt as broadly as possible, so if you have a rule matching "python" or "curl" (case-INsensitive) they should follow it. Someone hereabouts, possibly phranque, once explained it in some detail. But really, I tend to doubt that compliance forms any part of their intention.
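
For example, records along these lines ought to do it, if the requester genuinely checks itself against robots.txt (the tokens are illustrative -- a compliant client is supposed to match them loosely against its own UA string):

User-agent: python-requests
Disallow: /

User-agent: curl
Disallow: /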

:: detour to logs ::

Lot of this kind of thing:
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET /robots.txt HTTP/1.1" 200 3152 "-" "python-requests/2.22.0" 
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET / HTTP/1.1" 403 1837 "-" "python-requests/2.22.0"
Well, they do tend to request robots.txt before their other requests, in contrast to the popular malign-robot behavior of asking only after a series of (usually blocked) page requests.

At one time I must have seen a lot of “Python-urllib”, because I find a robots.txt disallow. They're still around, but haven't asked for robots.txt in the recent past. Over on the “install a deadbolt” side (as opposed to the robots.txt “post a No Admittance sign” side) I've got a comprehensive block on
^[Pp]ython
where the opening anchor doesn't mean “it’s OK if you say Python somewhere further along” but simply that Python always happens to come first--exceptions are vanishingly rare--so the server doesn't need to check the whole thing.
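
(For anyone reproducing the deadbolt on Apache, it amounts to something like this mod_rewrite sketch -- SetEnvIf plus a deny would do just as well:)

RewriteCond %{HTTP_USER_AGENT} ^[Pp]ython
RewriteRule .* - [F]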

Edit: I've stopped checking for “Mozilla” at all. By this time, almost 90% of all requests--including almost 3/4 of blocked requests--claim to be Mozilla, and most of the rest are known quantities one way or the other. So it’s no longer as dispositive as it was a few years ago.

If someone comes in claiming to be Chrome or Firefox, I set an environment variable called “lying_bot”. This is not used directly for access control, but it causes robots.txt (which is really robots.php) to issue the minimalist
User-Agent: *
Disallow: /
version. Yes, this also means that if humans snoopily ask for robots.txt, they probably won't see the real thing. But oh well.
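
(The variable itself is nothing exotic. On Apache it could be set with a single line along the lines of

SetEnvIfNoCase User-Agent "(Chrome|Firefox)" lying_bot

and robots.php simply checks whether that variable came in with the request. That's a sketch of the mechanism, not the exact rule.)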
10:34 pm on Oct 26, 2019 (gmt 0)
tangor (Senior Member from US)


Python requests are VERY few for me ... and mostly from geo IPs I don't support ... so those get robots.txt for free and a 403 for anything else. :)

I'm probably one of the few NOT using PHP ... so when I get a .php request, that also gets a 403 ... which has really knocked out a LOT of noise!
1:54 am on Oct 27, 2019 (gmt 0)
Senior Member (joined Nov 2005)


tangor, I don't use PHP either and I am soooo glad I don't have to worry about seemingly non-stop updates/breaches. (lucy, my robots.txt is actually robots.cgi:)
1:58 am on Oct 27, 2019 (gmt 0)
lucy24 (Senior Member from US)


I use a bit of PHP ... but it never appears in my visible URLs. So if need be, I can block .php requests comprehensively by checking %{THE_REQUEST}.
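
(In Apache terms that comprehensive block is only a couple of lines, something like:

RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F]

Checking THE_REQUEST rather than REQUEST_URI means only requests that literally contain .php as the client sent them get caught, not my own internal rewrites.)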
3:02 am on Oct 27, 2019 (gmt 0)
tangor (Senior Member from US)


Seems like great minds think alike!

Not saying anything ugly about PHP ... but it's a recognition that PHP is being targeted by bad actors everywhere ... and the noise is just getting louder.

Worse, the only UAs sending this my way claim to be "bing" or "google" (though the IP addresses are NOT b or g!).

Duh!
3:54 am on Oct 27, 2019 (gmt 0)
tangor (Senior Member from US)


Check that ... all kinds of IPs ... most of which "declare" they are g or b.

IP addresses are a tiny bit more reliable ... but even those can be spoofed.
11:04 am on Oct 27, 2019 (gmt 0)
dstiles (Senior Member from GB)


Pfui - so how do you allow (eg) google etc? What sequence, allowed first then bad ones or the other way up? I have to say I've always considered robots.txt a very poor tool, badly formulated and of no real use beyond a guide to "real" SEs.

Lucy - I've always included some nasty traps just for robots that follow things in robots.txt. Otherwise, in setenv and IIS traps I trap for lower-case-m mozilla - I've seen a few of those over the years - likewise firefox, chrome etc, and for anything that isn't Mozilla/5, plus ^Mozilla/5.0$ and \sMozilla (I occasionally get one of those in the middle of a UA, even now). I like the lying_bot idea but haven't yet made a php version of robots.txt. Time - ah, well. :(

I have traps within pages (IIS) and setenv (apache) to reject baddies such as python and curl; I was hoping that the ones that read robots.txt might not bother to read the pages in the first place. :(
7:43 pm on Oct 28, 2019 (gmt 0)
lucy24 (Senior Member from US)


Currently my only comprehensive Mozilla rule is
^Mozilla/[0-36]
This mainly intercepts robots whose script is so ancient, they're still claiming to be MSIE 3 or the like.

:: quick detour to raw logs ::

I particularly like this one:
Mozilla/6.0 (compatible; MSIE 7.0a1; Windows NT 5.2; SV1)
though there's something to be said for
Mozilla/2.0 (compatible; MSIE 3.02; Windows CE; 240x320)
(er ... a 1992-vintage phone?) The CE is reassuring; I might otherwise have suspected BC.
12:20 am on Oct 30, 2019 (gmt 0)
Senior Member (joined Apr 2016)


As a Python programmer I feel I should clarify some things:

Probably one of the few NOT using php ... so when I get a .php request that also gets a 403 ...

tangor, I don't use PHP either and I am soooo glad I don't have to worry about seemingly non-stop updates/breaches


UAs that include either 'Python-requests' or 'Python-urllib' are from users (most likely bots) running the Requests or urllib packages. The Python code is being executed on the client computer, not the server. This is not the same as a request for a .php URL, where the client is trying to get the server to execute PHP code on the server. Both of those Python packages are simply libraries for making HTTP requests.
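
For illustration, the client side of one of those log hits can be as little as this (the URL is a placeholder):

import requests  # sends "python-requests/<version>" as the User-Agent unless told otherwise

resp = requests.get("https://example.com/robots.txt")
print(resp.status_code, resp.request.headers["User-Agent"])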
8:39 am on Oct 30, 2019 (gmt 0)
tangor (Senior Member from US)


In the request I look for .php; in the UA I look for python ...

... and 403 all.

The first is upwards of 50% of log entries, the latter less than 1%. :)
4:46 pm on Oct 30, 2019 (gmt 0)
Senior Member (joined Nov 2005)


Speaking of python, these just in. Usually just a single hit from anywhere...

- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:43] 403 "python-requests/2.18.1"
- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:43] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:45] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:04] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:05] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:05] 403 "python-requests/2.18.1"

(I do so detest AWS.)
5:53 pm on Oct 30, 2019 (gmt 0)
lucy24 (Senior Member from US)


Your robots are out of date :) Most of my python-requests are currently /2.21.0 or /2.22.0 while the oldest I see is a 2.10.0 with an odd taste for the favicon. (But why? What do they plan to do with it?)

:: pause to delve deeper into logs ::

Oddly, python-requests for the favicon tend to come at the tail of a series of blocked requests from the same IP with no UA (which, of course, is a dandy way to get blocked). Elsewhere in the list is a recurring /.well-known/security.txt which makes for yet another convenient 403.*

Gooood robots. I like robots that provide multiple grounds for blocking them. So much easier than the scattering of humanoids who slip past the barriers.


* Tangentially: On my test site, I 403 all .well-known requests. So far, this has not prevented the security certificate from being updated as some sources say it will.
1:44 am on Oct 31, 2019 (gmt 0)
Senior Member (joined Apr 2016)


It's also worth pointing out that changing the default user-agent in Python-Requests is trivial. Those operating these bots aren't really trying very hard.
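
For example (the UA string here is just a placeholder):

import requests

# one extra argument and the request no longer announces itself as python-requests
headers = {"User-Agent": "Mozilla/5.0 (compatible; SomeBot/1.0)"}
resp = requests.get("https://example.com/", headers=headers)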
8:48 pm on Nov 12, 2019 (gmt 0)
Senior Member (joined Dec 2004)


-- Not saying anything ugly about php --

PHP is beautiful, never used it though...
 
