I don't think they're planning to obey robots.txt. I think they're just looking for ideas about what to get next.
Compliant entities are supposed to interpret robots.txt as broadly as possible, so if you have a rule matching "python" or "curl" (case-INsensitive) they should follow it. Someone hereabouts, possibly phranque, once explained it in some detail. But really, I tend to doubt that compliance forms any part of their intention.
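For anyone who wants to see what that broad matching looks like in practice, here is a rough sketch using Python's standard-library urllib.robotparser, which happens to do a case-insensitive substring match on the part of the user-agent before the first slash. The rules and the allowed() helper are made up for illustration; they are not my actual robots.txt, and other parsers may match differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not a real site's robots.txt.
RULES = """\
User-agent: python
Disallow: /

User-agent: CURL
Disallow: /
"""


def allowed(user_agent, path="/"):
    """Return True if a compliant client with this user-agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "python" matches "python-requests", and "CURL" matches "curl",
    # because the parser lowercases both sides before comparing.
    for agent in ("python-requests/2.22.0", "curl/7.65.3", "Mozilla/5.0 (X11)"):
        print(agent, "->", "allowed" if allowed(agent) else "disallowed")
```

Of course this only tells you what a well-behaved client would conclude. It says nothing about whether the robot in question bothers to check.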
:: detour to logs ::
Lot of this kind of thing:
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET /robots.txt HTTP/1.1" 200 3152 "-" "python-requests/2.22.0"
aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET / HTTP/1.1" 403 1837 "-" "python-requests/2.22.0"
Well, they do tend to request robots.txt before their other requests, in contrast to the popular malign-robot behavior of asking only after a series of (usually blocked) page requests.
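If you'd rather not eyeball the logs, something along these lines can check who asked for robots.txt first. It's a minimal sketch that assumes the combined log format shown in the sample lines and assumes the lines are already in time order; LOG_RE, first_requests and asked_robots_first are names I made up for the example.

```python
import re

# Assumed layout: Apache/nginx "combined" log format, as in the sample lines.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def first_requests(lines):
    """Map each (ip, agent) pair to the path of its first logged request."""
    first = {}
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None:
            continue  # skip lines that don't fit the assumed format
        first.setdefault((m["ip"], m["agent"]), m["path"])
    return first


def asked_robots_first(lines):
    """Map each (ip, agent) pair to True if its first request was /robots.txt."""
    return {key: path == "/robots.txt" for key, path in first_requests(lines).items()}


SAMPLE = [
    'aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET /robots.txt HTTP/1.1" 200 3152 "-" "python-requests/2.22.0"',
    'aa.bb.cc.dd - - [31/Aug/2019:14:19:10 -0700] "GET / HTTP/1.1" 403 1837 "-" "python-requests/2.22.0"',
]

if __name__ == "__main__":
    for (ip, agent), ok in asked_robots_first(SAMPLE).items():
        print(ip, agent, "robots.txt first" if ok else "pages first")
```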
At one time I must have seen a lot of “Python-urllib”, because I find a robots.txt disallow. They're still around, but haven't asked for robots.txt in the recent past. Over on the “install a deadbolt” side (as opposed to the robots.txt “post a No Admittance sign” side) I've got a comprehensive block on
where the opening anchor doesn't mean “it’s OK if you say Python somewhere further along” but simply that Python always happens to come first--exceptions are vanishingly rare--so the server doesn't need to check the whole thing. Edit:
I've stopped checking for “Mozilla” at all. By this time, almost 90% of all requests--including almost 3/4 of blocked requests--claim to be Mozilla, and most of the rest are known quantities one way or the other. So it’s no longer as dispositive as it was a few years ago.
If someone comes in claiming to be Chrome or Firefox, I set an environment variable called “lying_bot”. This is not used directly for access control, but causes robots.txt (which is really robots.php) to issue the minimalist
version. Yes, this also means that if humans snoopily ask for robots.txt, they probably won't see the real thing. But oh well.
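The selection step itself is trivial. Here it is sketched in Python rather than PHP, purely as an illustration: both robots.txt bodies below are placeholders I invented, robots_body is a hypothetical name, and the only thing taken from my real setup is the “lying_bot” variable name. How the variable gets set is a separate matter and isn't shown.

```python
# Placeholder bodies, invented for this sketch; the real files differ.
FULL_ROBOTS = "User-agent: *\nDisallow: /private/\n"
MINIMAL_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_body(environ):
    """Pick the robots.txt text to serve based on the request environment.

    `environ` is any dict-like mapping of environment variables, for
    example os.environ in a CGI script or a WSGI environ dict.
    """
    if environ.get("lying_bot"):
        return MINIMAL_ROBOTS
    return FULL_ROBOTS


if __name__ == "__main__":
    print(robots_body({"lying_bot": "1"}))  # flagged request gets the short version
    print(robots_body({}))                  # everyone else gets the full file
```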