DeuSu

keyplyr

2:32 am on May 3, 2016 (gmt 0)

UA: Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: deusu.de
Range: 85.93.90.0 - 85.93.91.255 (85.93.90.0/23)
Parent: hosteurope.de (plusserver.de)
Parent range: 85.93.64.0 - 85.93.95.255 (85.93.64.0/19)
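
A minimal robots.txt entry for anyone who wants to bar it the polite way first -- the "DeuSu" token is an assumption taken from the UA string above; their robot.html page should confirm the exact token:

User-agent: DeuSu
Disallow: /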

lucy24

5:50 am on May 3, 2016 (gmt 0)

Oh, ###, how did I overlook them? They're one of a clutch of robots that started showing up when I added my recent ebooks to a respectable directory. Not that all the resulting robots can be called respectable ;) but DeuSu -- always from the exact IP 85.93.91.84 -- does tend to ask for robots.txt a second or two before the page request (currently blocked on standard grounds).

I've been adding Disallow lines to robots.txt as a rock-bottom test of "If I ask them to stay out, will they stay out?" Can't think why I missed them; they have now been added. Then I'll see whether it makes a difference the next time I add a book.

fwiw, their www page says they may also crawl from 62.138.3.245 and 130.180.122.35
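
If they fail that test, the fallback is a server-side block. A minimal sketch in Apache 2.4 syntax (mod_authz_core), using the range and alternate IPs above:

# deny DeuSu's published range plus its two alternate crawl IPs
<RequireAll>
    Require all granted
    Require not ip 85.93.90.0/23
    Require not ip 62.138.3.245
    Require not ip 130.180.122.35
</RequireAll>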

keyplyr

10:23 am on May 3, 2016 (gmt 0)

a test of "If I ask them to stay out, will they stay out?"

So if you have a bot disallowed and they ignore it and take files anyway, do you then remove the Disallow line as irrelevant, or leave it in? Leave it in for a few days so the bot can update its cache?

In an effort to appear benign, some bad actors request robots.txt with no intention of supporting that standard. Some even post bot info pages with rhetoric about how to stop their bot and then continue to ignore it.
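
One way to handle that is to deny the flagged UA everywhere except robots.txt itself, so its (non-)compliance stays observable. A sketch, assuming Apache 2.4 with mod_setenvif and mod_authz_core; "SuspectBot" is a placeholder pattern, not a real bot:

# tag the suspect UA and deny it site-wide...
SetEnvIfNoCase User-Agent "SuspectBot" suspect_bot
<RequireAll>
    Require all granted
    Require not env suspect_bot
</RequireAll>

# ...but leave robots.txt readable, so the bot's behavior can still be watched
<Files "robots.txt">
    Require all granted
</Files>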

lucy24

8:04 pm on May 3, 2016 (gmt 0)

So if you have a bot disallowed and they ignore it and take files anyway, do you then remove the Disallow line as irrelevant, or leave it in?

If I'm considering allowing someone in, the first step is to deny them in robots.txt and see if they oblige. No hole-poking unless and until they've passed that test. I test for at least a month-- longer if it's a rare visitor. So they might not respond to the first robots.txt change, but they've got plenty of subsequent chances. In fact I just recently reviewed the first batch: robots.txt denials added on some date in March, behavior checked at the end of April. This time, two holes were poked, and two others were explicitly flagged as "Continue Blocking". (The latter two continue to be denied in robots.txt on the off chance that they will eventually mend their ways.)
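
In robots.txt terms, the probation period looks something like this sketch ("ExampleBot" is a hypothetical stand-in, not any of the bots discussed here):

# deny the candidate outright and watch the logs for a month or more
User-agent: ExampleBot
Disallow: /

# everyone else keeps whatever rules were already in place
User-agent: *
Disallow: /cgi-bin/

If it keeps taking ordinary pages anyway, it flunks and the server-side block stays; if it stays out, a hole gets poked.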

Some visitors don't even get the first test. If they show up out of the blue, with no prior history, I ### well expect them to read robots.txt before requesting anything else. So if their very first visit involves requests for files in roboted-out directories, they are SOL forever, more or less.

Disclaimer: Since I only changed my access-control system a month or two back, almost all of my baseline information is based on what happened in the old IP-based system. I haven't fully worked out what to do about brand-new robots, since my current default is to ignore any request for robots.txt that's immediately followed by a 403. (The assumption was that these are known non-compliant robots.) But each time I add an ebook to the directory, I check requests for that specific page for a few days. In fact this is an especially handy test, because a compliant robot's behavior is to first check robots.txt and then, if permitted, request exactly one interior file. No use saying you're working on a cached version of robots.txt if you've never seen it before.
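
That cross-check is easier if robots.txt fetches are isolated in their own log. A sketch, assuming Apache with access to the main server config (CustomLog is not available in .htaccess) and the stock "combined" LogFormat:

# tag robots.txt requests and write them to a separate log
SetEnvIf Request_URI "^/robots\.txt$" robots_fetch
CustomLog "logs/robots.log" combined env=robots_fetch

An IP that requests the new page but never shows up in robots.log has, by definition, never looked at robots.txt -- cached copy or not.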

Some even post bot info pages with rhetoric about how to stop their bot and then continue to ignore it.

Yes, I like the ones that politely explain exactly how to disallow their UA in robots.txt -- when careful study of logs reveals that neither their stated UA nor anyone else from the same IP has ever looked at robots.txt.

keyplyr

9:20 pm on May 3, 2016 (gmt 0)

I only changed my access-control system a month or two back

Really? I change mine a half dozen times a day, every day. Maybe I'm doing something wrong :)

lucy24

10:09 pm on May 3, 2016 (gmt 0)

:-P I meant the overall system. I still tweak htaccess files regularly. If nothing else, it's a good way to ensure that malign agents* don't sneak in and change the file. Whatever action they've taken will last no longer than the next time you yourself edit or replace it.


* This reminds me that in the latest bout of log-wrangling, I met a vast flurry of
172.111.186.dd - - [02/May/2016:10:03:59 -0700] "GET /admin/fckeditor/editor/----403/fckeditor/editor/ HTTP/1.1" 403 3027 "-" "compatible;Baiduspider/2.0; +http://www.baidu.com/search/spider.html" 
It's so thoughtful when your script itself incorporates the expected response. (I had to look up the IP, as it's a new one on me. Hilariously, it belongs to something calling itself Secure Internet. Within SoftLayer, which makes all plain. The fake Baidu is an optional extra.)

keyplyr

10:38 pm on May 3, 2016 (gmt 0)

Thanks for the range - added to Server Farm thread [webmasterworld.com...]

Secure Internet
172.111.128.0 - 172.111.255.255
172.111.128.0/17

Managed by: gaditek.com
Managed Cloud Hosting, Virtual Private Network, Security Services, SSL, Data Encryption
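
Blocking the whole allocation takes one line of the usual Apache 2.4 pattern (mod_authz_core assumed):

<RequireAll>
    Require all granted
    Require not ip 172.111.128.0/17
</RequireAll>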

wilderness

11:35 pm on May 3, 2016 (gmt 0)

The fake Baidu is an optional extra

compatible;Baiduspider/2.0;


Umh! Missing a trailing space.

keyplyr

1:56 am on May 4, 2016 (gmt 0)

Umh! Missing a trailing space.

And not from Baidu :)

lucy24

5:12 am on May 4, 2016 (gmt 0)

Missing a trailing space.

Well, when you're visited by a fake Known Entity, it doesn't gain or lose any points by misspelling itself. It's still fake. For a while I was blocking a "GoogleBot" spelled just like that.
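
Which suggests one cheap trick: match on the misspelling itself. SetEnvIf is case-sensitive by default, and the genuine UA spells it "Googlebot", so a capital B should only ever turn up in fakes. A sketch, assuming Apache 2.4 with mod_setenvif:

# the real crawler says "Googlebot"; "GoogleBot" is always an impostor
SetEnvIf User-Agent "GoogleBot" fake_google
<RequireAll>
    Require all granted
    Require not env fake_google
</RequireAll>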