Python, Curl and Robots.txt

     
9:11 am on Oct 26, 2019 (gmt 0)
dstiles (Senior Member from GB)


I'm seeing various robots.txt accesses for python and curl, which suggests they (may) obey robots.txt.

Trouble is, I can't find the names to put into the file to discourage them - the obvious candidates are python and curl, but the nearest I can find is pycurl. Does anyone know about these ubiquitous bots?
12:34 pm on Oct 26, 2019 (gmt 0)
not2easy (Administrator from US)


Could be (?) one of the server-based HTTP clients like Pcore - see this old thread: [webmasterworld.com...]
5:26 pm on Oct 26, 2019 (gmt 0)
Senior Member (joined Nov 2005)


My approach is a bit more extreme: I simply 403 all non-Mozilla UAs except for robots.txt (and specific whitelisting). Then robots.txt's default is --

User-agent: *
Disallow: /

-- so even if something asks properly, they still get a No, thanks.
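
If you're on Apache with mod_rewrite, the idea works out to roughly this sketch (the whitelisted names are placeholders, not a recommendation):

# always let robots.txt itself through
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# skip specifically whitelisted agents (placeholder names)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
# anything whose UA doesn't start with "Mozilla" gets a 403
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteRule .* - [F]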

I only wish every robots.txt request meant friend rather than foe. The wholly obnoxious, log-exploding package o' exploits -- last night it reached 894 hits -- starts out faux-innocently enough. Then in the next second, WHAM:

[03:37:54] "GET / HTTP/1.1"
[03:37:54] "GET /robots.txt HTTP/1.1"
[03:37:55] "POST /dc6beecc/admin.php HTTP/1.1"
(891 additional hits not included:)
5:55 pm on Oct 26, 2019 (gmt 0)
lucy24 (Senior Member from US)


I don't think they're planning to obey robots.txt. I think they're just looking for ideas about what to get next.

Compliant entities are supposed to interpret robots.txt as broadly as possible, so if you have a rule matching "python" or "curl" (case-INsensitive) they should follow it. Someone hereabouts, possibly phranque, once explained it in some detail. But really, I tend to doubt that compliance forms any part of their intention.
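
For example, records along these lines ought to do it, if the requester genuinely checks itself against robots.txt (the tokens are illustrative -- a compliant client is supposed to match them loosely against its own UA string):

User-agent: python-requests
Disallow: /

User-agent: curl
Disallow: /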

:: detour to logs ::

Lot of this kind of thing:
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET /robots.txt HTTP/1.1" 200 3152 "-" "python-requests/2.22.0" 
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET / HTTP/1.1" 403 1837 "-" "python-requests/2.22.0"
Well, they do tend to request robots.txt before their other requests, in contrast to the popular malign-robot behavior of asking only after a series of (usually blocked) page requests.

At one time I must have seen a lot of “Python-urllib”, because I find a robots.txt disallow. They're still around, but haven't asked for robots.txt in the recent past. Over on the “install a deadbolt” side (as opposed to the robots.txt “post a No Admittance sign” side) I've got a comprehensive block on
^[Pp]ython
where the opening anchor doesn't mean “it’s OK if you say Python somewhere further along” but simply that Python always happens to come first--exceptions are vanishingly rare--so the server doesn't need to check the whole thing.
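
(For anyone reproducing the deadbolt on Apache, it amounts to something like this mod_rewrite sketch -- SetEnvIf plus a deny would do just as well:)

RewriteCond %{HTTP_USER_AGENT} ^[Pp]ython
RewriteRule .* - [F]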

Edit: I've stopped checking for “Mozilla” at all. By this time, almost 90% of all requests--including almost 3/4 of blocked requests--claim to be Mozilla, and most of the rest are known quantities one way or the other. So it’s no longer as dispositive as it was a few years ago.

If someone comes in claiming to be Chrome or Firefox, I set an environment variable called “lying_bot”. This is not used directly for access control, but it causes robots.txt (which is really robots.php) to issue the minimalist
User-Agent: *
Disallow: /
version. Yes, this also means that if humans snoopily ask for robots.txt, they probably won't see the real thing. But oh well.
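
(The variable itself is nothing exotic. On Apache it could be set with a single line along the lines of

SetEnvIfNoCase User-Agent "(Chrome|Firefox)" lying_bot

and robots.php simply checks whether that variable came in with the request. That's a sketch of the mechanism, not the exact rule.)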
10:34 pm on Oct 26, 2019 (gmt 0)
tangor (Senior Member from US)


Python requests are VERY few for me ... and mostly from geo IPs I don't support ... so those get robots.txt for free and a 403 for anything else. :)

I'm probably one of the few NOT using PHP ... so when I get a .php request, that also gets a 403 ... which has really knocked out a LOT of noise!
1:54 am on Oct 27, 2019 (gmt 0)
Senior Member (joined Nov 2005)


tangor, I don't use PHP either and I am soooo glad I don't have to worry about seemingly non-stop updates/breaches. (lucy, my robots.txt is actually robots.cgi:)
1:58 am on Oct 27, 2019 (gmt 0)
lucy24 (Senior Member from US)


I use a bit of PHP ... but it never appears in my visible URLs. So if need be, I can block .php requests comprehensively by checking %{THE_REQUEST}.
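
(In Apache terms that comprehensive block is only a couple of lines, something like:

RewriteCond %{THE_REQUEST} \.php [NC]
RewriteRule .* - [F]

Checking THE_REQUEST rather than REQUEST_URI means only requests that literally contain .php as the client sent them get caught, not my own internal rewrites.)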
3:02 am on Oct 27, 2019 (gmt 0)
tangor (Senior Member from US)


Seems like great minds think alike!

Not saying anything ugly about PHP ... but it's a recognition that PHP is being targeted by bad actors everywhere ... and the noise is just getting louder.

Worse, the only UAs sending this my way claim to be "bing" or "google" (though the IP addresses are NOT b or g!).

Duh!
3:54 am on Oct 27, 2019 (gmt 0)
tangor (Senior Member from US)


Check that ... all kinds of IPs ... most of which "declare" they are g or b.

IP addresses are a tiny bit more reliable ... but even those can be spoofed.
11:04 am on Oct 27, 2019 (gmt 0)
dstiles (Senior Member from GB)


Pfui - so how do you allow (eg) google etc? What sequence, allowed first then bad ones or the other way up? I have to say I've always considered robots.txt a very poor tool, badly formulated and of no real use beyond a guide to "real" SEs.

Lucy - I've always included some nasty traps just for robots that follow things in robots.txt. Otherwise, in setenv and IIS traps I trap for lower-case-m mozilla - I've seen a few of those over the years - likewise firefox, chrome etc, and for anything that isn't Mozilla/5, plus ^Mozilla/5.0$ and \sMozilla (I occasionally get one of those in the middle of a UA, even now). I like the lying_bot idea but haven't yet made a php version of robots.txt. Time - ah, well. :(

I have traps within pages (IIS) and setenv (apache) to reject baddies such as python and curl; I was hoping that the ones that read robots.txt might not bother to read the pages in the first place. :(
7:43 pm on Oct 28, 2019 (gmt 0)
lucy24 (Senior Member from US)


Currently my only comprehensive Mozilla rule is
^Mozilla/[0-36]
This mainly intercepts robots whose script is so ancient, they're still claiming to be MSIE 3 or the like.

:: quick detour to raw logs ::

I particularly like this one:
Mozilla/6.0 (compatible; MSIE 7.0a1; Windows NT 5.2; SV1)
though there's something to be said for
Mozilla/2.0 (compatible; MSIE 3.02; Windows CE; 240x320)
(er ... a 1992-vintage phone?) The CE is reassuring; I might otherwise have suspected BC.
12:20 am on Oct 30, 2019 (gmt 0)
Senior Member (joined Apr 2016)


As a Python programmer I feel I should clarify some things:

Probably one of the few NOT using php ... so when I get a .php request that also gets a 403 ...

tangor, I don't use PHP either and I am soooo glad I don't have to worry about seemingly non-stop updates/breaches


UAs that include either 'Python-requests' or 'Python-urllib' are from users (most likely bots) running the Requests or urllib packages. The Python code is being executed on the client computer, not the server. This is not the same as a request for a .php URL, where the client is trying to get the server to execute PHP code on the server. Both of those Python packages are simply libraries for making HTTP requests.
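
For illustration, the client side of one of those log hits can be as little as this (the URL is a placeholder):

import requests  # sends "python-requests/<version>" as the User-Agent unless told otherwise

resp = requests.get("https://example.com/robots.txt")
print(resp.status_code, resp.request.headers["User-Agent"])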
8:39 am on Oct 30, 2019 (gmt 0)
tangor (Senior Member from US)


In the request I look for .php; in the UA I look for python ...

... and 403 all.

The first is upwards of 50% of log entries, the latter less than 1%. :)
4:46 pm on Oct 30, 2019 (gmt 0)
Senior Member (joined Nov 2005)


Speaking of python, these just in. Usually just a single hit from anywhere...

- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:43] 403 "python-requests/2.18.1"
- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:43] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-233-238-226.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:44] 403 "python-requests/2.18.1"
- ec2-3-227-239-120.compute-1.amazonaws.com [09:06:45] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:04] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:05] 403 "python-requests/2.18.1"
- ec2-3-218-142-34.compute-1.amazonaws.com [09:07:05] 403 "python-requests/2.18.1"

(I do so detest AWS.)
5:53 pm on Oct 30, 2019 (gmt 0)
lucy24 (Senior Member from US)


Your robots are out of date :) Most of my python-requests are currently /2.21.0 or /2.22.0 while the oldest I see is a 2.10.0 with an odd taste for the favicon. (But why? What do they plan to do with it?)

:: pause to delve deeper into logs ::

Oddly, python-requests for the favicon tend to come at the tail of a series of blocked requests from the same IP with no UA (which, of course, is a dandy way to get blocked). Elsewhere in the list is a recurring /.well-known/security.txt which makes for yet another convenient 403.*

Gooood robots. I like robots that provide multiple grounds for blocking them. So much easier than the scattering of humanoids who slip past the barriers.


* Tangentially: On my test site, I 403 all .well-known requests. So far, this has not prevented the security certificate from being updated as some sources say it will.
1:44 am on Oct 31, 2019 (gmt 0)
Senior Member (joined Apr 2016)


It's also worth pointing out that changing the default user-agent in Python-Requests is trivial. Those operating these bots aren't really trying very hard.
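
For example (the UA string here is just a placeholder):

import requests

# one extra argument and the request no longer announces itself as python-requests
headers = {"User-Agent": "Mozilla/5.0 (compatible; SomeBot/1.0)"}
resp = requests.get("https://example.com/", headers=headers)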
8:48 pm on Nov 12, 2019 (gmt 0)
Senior Member (joined Dec 2004)


-- Not saying anything ugly about php --

PHP is beautiful, never used it though...
 
