DeuSu

keyplyr

2:32 am on May 3, 2016 (gmt 0)

UA: Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: deusu.de
Range: 85.93.90.0 - 85.93.91.255 (85.93.90.0/23)
Parent: hosteurope.de (plusserver.de)
Parent range: 85.93.64.0 - 85.93.95.255 (85.93.64.0/19)
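
A minimal robots.txt entry for anyone who wants to bar it the polite way first -- the "DeuSu" token is an assumption taken from the UA string above; their robot.html page should confirm the exact token:

User-agent: DeuSu
Disallow: /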

lucy24

5:50 am on May 3, 2016 (gmt 0)

Oh, ###, how did I overlook them? They're one of a clutch of robots that started showing up when I added my recent ebooks to a respectable directory. Not that all the resulting robots can be called respectable ;) but DeuSu -- always from the exact IP 85.93.91.84 -- does tend to ask for robots.txt a second or two before the page request (currently blocked on standard grounds).

I've been adding Disallow lines to robots.txt as a rock-bottom test of "If I ask them to stay out, will they stay out?" Can't think why I missed them; they have now been added. Then I'll see whether it makes a difference the next time I add a book.

fwiw, their www page says they may also crawl from 62.138.3.245 and 130.180.122.35
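
If they fail that test, the fallback is a server-side block. A minimal sketch in Apache 2.4 syntax (mod_authz_core), using the range and alternate IPs above:

# deny DeuSu's published range plus its two alternate crawl IPs
<RequireAll>
    Require all granted
    Require not ip 85.93.90.0/23
    Require not ip 62.138.3.245
    Require not ip 130.180.122.35
</RequireAll>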

keyplyr

10:23 am on May 3, 2016 (gmt 0)

a test of "If I ask them to stay out, will they stay out?"

So if you have a bot disallowed and they ignore it and take files anyway, do you then remove the Disallow line as irrelevant, or leave it in? Leave it in for a few days so the bot can update its cache?

In an effort to appear benign, some bad actors request robots.txt with no intention of supporting that standard. Some even post bot info pages with rhetoric about how to stop their bot and then continue to ignore it.
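
One way to handle that is to deny the flagged UA everywhere except robots.txt itself, so its (non-)compliance stays observable. A sketch, assuming Apache 2.4 with mod_setenvif and mod_authz_core; "SuspectBot" is a placeholder pattern, not a real bot:

# tag the suspect UA and deny it site-wide...
SetEnvIfNoCase User-Agent "SuspectBot" suspect_bot
<RequireAll>
    Require all granted
    Require not env suspect_bot
</RequireAll>

# ...but leave robots.txt readable, so the bot's behavior can still be watched
<Files "robots.txt">
    Require all granted
</Files>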

lucy24

8:04 pm on May 3, 2016 (gmt 0)

So if you have a bot disallowed and they ignore it and take files anyway, do you then remove the Disallow line as irrelevant, or leave it in?

If I'm considering allowing someone in, the first step is to deny them in robots.txt and see if they oblige. No hole-poking unless and until they've passed that test. I test for at least a month-- longer if it's a rare visitor. So they might not respond to the first robots.txt change, but they've got plenty of subsequent chances. In fact I just recently reviewed the first batch: robots.txt denials added on some date in March, behavior checked at the end of April. This time, two holes were poked, and two others were explicitly flagged as "Continue Blocking". (The latter two continue to be denied in robots.txt on the off chance that they will eventually mend their ways.)
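
In robots.txt terms, the probation period looks something like this sketch ("ExampleBot" is a hypothetical stand-in, not any of the bots discussed here):

# deny the candidate outright and watch the logs for a month or more
User-agent: ExampleBot
Disallow: /

# everyone else keeps whatever rules were already in place
User-agent: *
Disallow: /cgi-bin/

If it keeps taking ordinary pages anyway, it flunks and the server-side block stays; if it stays out, a hole gets poked.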

Some visitors don't even get the first test. If they show up out of the blue, with no prior history, I ### well expect them to read robots.txt before requesting anything else. So if their very first visit involves requests for files in roboted-out directories, they are SOL forever, more or less.

Disclaimer: Since I only changed my access-control system a month or two back, almost all of my baseline information is based on what happened in the old IP-based system. I haven't fully worked out what to do about brand-new robots, since my current default is to ignore any request for robots.txt that's immediately followed by a 403. (The assumption was that these are known non-compliant robots.) But each time I add an ebook to the directory, I check requests for that specific page for a few days. In fact this is an especially handy test, because a compliant robot's behavior is to first check robots.txt and then, if permitted, request exactly one interior file. No use saying you're working on a cached version of robots.txt if you've never seen it before.
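
That cross-check is easier if robots.txt fetches are isolated in their own log. A sketch, assuming Apache with access to the main server config (CustomLog is not available in .htaccess) and the stock "combined" LogFormat:

# tag robots.txt requests and write them to a separate log
SetEnvIf Request_URI "^/robots\.txt$" robots_fetch
CustomLog "logs/robots.log" combined env=robots_fetch

An IP that requests the new page but never shows up in robots.log has, by definition, never looked at robots.txt -- cached copy or not.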

Some even post bot info pages with rhetoric about how to stop their bot and then continue to ignore it.

Yes, I like the ones that politely explain exactly how to disallow their UA in robots.txt -- when careful study of logs reveals that neither their stated UA nor anyone else from the same IP has ever looked at robots.txt.

keyplyr

9:20 pm on May 3, 2016 (gmt 0)

I only changed my access-control system a month or two back

Really? I change mine a half dozen times a day, every day. Maybe I'm doing something wrong :)

lucy24

10:09 pm on May 3, 2016 (gmt 0)

:-P I meant the overall system. I still tweak htaccess files regularly. If nothing else, it's a good way to ensure that malign agents* don't sneak in and change the file. Whatever action they've taken will last no longer than the next time you yourself edit or replace it.


* This reminds me that in the latest bout of log-wrangling, I met a vast flurry of
172.111.186.dd - - [02/May/2016:10:03:59 -0700] "GET /admin/fckeditor/editor/----403/fckeditor/editor/ HTTP/1.1" 403 3027 "-" "compatible;Baiduspider/2.0; +http://www.baidu.com/search/spider.html" 
It's so thoughtful when your script itself incorporates the expected response. (I had to look up the IP, as it's a new one on me. Hilariously, it belongs to something calling itself Secure Internet. Within SoftLayer, which makes all plain. The fake Baidu is an optional extra.)

keyplyr

10:38 pm on May 3, 2016 (gmt 0)

Thanks for the range - added to Server Farm thread [webmasterworld.com...]

Secure Internet
172.111.128.0 - 172.111.255.255
172.111.128.0/17

Managed by: gaditek.com
Managed Cloud Hosting, Virtual Private Network, Security Services, SSL, Data Encryption
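
Blocking the whole allocation takes one line of the usual Apache 2.4 pattern (mod_authz_core assumed):

<RequireAll>
    Require all granted
    Require not ip 172.111.128.0/17
</RequireAll>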

wilderness

11:35 pm on May 3, 2016 (gmt 0)

The fake Baidu is an optional extra

compatible;Baiduspider/2.0;


Umh! Missing a trailing space.

keyplyr

1:56 am on May 4, 2016 (gmt 0)

Umh! Missing a trailing space.

And not from Baidu :)

lucy24

5:12 am on May 4, 2016 (gmt 0)

Missing a trailing space.

Well, when you're visited by a fake Known Entity, it doesn't gain or lose any points by misspelling itself. It's still fake. For a while I was blocking a "GoogleBot" spelled just like that.
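
Which suggests one cheap trick: match on the misspelling itself. SetEnvIf is case-sensitive by default, and the genuine UA spells it "Googlebot", so a capital B should only ever turn up in fakes. A sketch, assuming Apache 2.4 with mod_setenvif:

# the real crawler says "Googlebot"; "GoogleBot" is always an impostor
SetEnvIf User-Agent "GoogleBot" fake_google
<RequireAll>
    Require all granted
    Require not env fake_google
</RequireAll>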