Forum Moderators: phranque


To Serve or NOT to Serve

mostly about blocking bots, move on if not your thing.


topr8

9:59 pm on Sep 6, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been thinking about what to serve up to those I don't want to serve!

I hope that makes sense ... i'm talking about what to serve to stealth bots ... e.g. bots that do not identify themselves as what they are and that i don't want.

(obviously those that identify themselves i can allow or block as i see fit - although blocked bots may revisit in another disguise, i know that)

there's a bunch of possible options:

200 ... but an empty or minimal file
204 ... obviously empty
401 ... unauthorized
402 ... just for my own amusement
403 ... forbidden
404 ... not found
418 ... haha
500 ... server error
503 ... unavailable
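to make the options concrete, here's a minimal sketch of how one of these responses might be served before any real page rendering happens. this assumes a WSGI-style setup, and the `is_stealth_bot` check is a hypothetical placeholder, not a real detection method:

```python
# Minimal WSGI sketch: answer suspected stealth bots with a chosen
# status code and a near-empty body, before doing any real work.
# is_stealth_bot() is a hypothetical placeholder -- a real check
# would inspect headers, IP ranges, request patterns, etc.

RESPONSES = {
    200: "200 OK",
    204: "204 No Content",
    403: "403 Forbidden",
    404: "404 Not Found",
    410: "410 Gone",
    503: "503 Service Unavailable",
}

def is_stealth_bot(environ):
    # Toy heuristic for illustration: a missing Accept header
    # is treated as suspicious.
    return "HTTP_ACCEPT" not in environ

def application(environ, start_response, bot_status=200):
    if is_stealth_bot(environ):
        status = RESPONSES.get(bot_status, "403 Forbidden")
        # 204 must have no body; otherwise serve a minimal shell.
        body = b"" if bot_status == 204 else b"<html></html>"
        start_response(status, [("Content-Type", "text/html"),
                                ("Content-Length", str(len(body)))])
        return [body]
    # ... normal request handling would go here ...
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>real page</html>"]
```

the `bot_status` parameter is just a way of flipping between the options in the list above without touching the rest of the handler.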

quick proviso - i don't claim to be catching all bots, i'm sure some/many get through
likewise there might be the odd real user trapped who shouldn't be - that is collateral damage that i'm willing to accept.

i guess i've gone through phases, especially a 5xx stage and then 401 or 404

however i'm currently inclined towards 200 with a very minimal file.

the reasoning is that:

the majority of bot runners are dumb and just feed their list to their bot and keep doing so without modification, bandwidth is cheap and they don't care, they just scrape for their own reasons.

however, of course, some are smart, some doubtless way smarter than me. i'm of the view that they have their lists of uris, which they believe to be 'valid', so any response other than a 200 OK likely means they will keep trying, and if they still fail, try again in a different disguise.

... ultimately i'm talking about the 99% ... the 1% are going to get in anyway.

any thoughts?

LifeinAsia

11:32 pm on Sep 6, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



How about 301 (to their own IP)?

keyplyr

11:34 pm on Sep 6, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



403 is the obvious choice for bad bots. However I serve 404 to some of these hijacked ISP accounts that run vulnerability checks for wp-login and other popular hacks. I don't know if it helps diminish the frequency of these probes, but it is the proper response since I do not have those files on my server.

How about 301 (to their own IP)?
Besides the ethical issue, doing that may get you in trouble with your current host. I know GoDaddy has specific rules against forwarding requests to remote servers.

LifeinAsia

11:47 pm on Sep 6, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the ethical issue
Care to explain?

phranque

12:16 am on Sep 7, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



rules against forwarding requests to remote servers

a 301 isn't forwarding a request - it's providing a response.
it's the user agent's decision to make the subsequent request of the "remote" server, which would not involve godaddy unless godaddy was hosting the bad bot.

i wonder if godaddy's rules prevent a webmaster from passing through requests to proxy servers...

keyplyr

2:28 am on Sep 7, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My intention is not to start an ethical debate... just sharing my own view.

If a 301 to 127.0.0.1 is being used, then many hosting companies may have an issue with it; it's probably associated with other tactics not supported by the hosting company.

301 would not be the correct server response. That document was not moved permanently to "their own IP." Either a 403 to block, or a 404 if the file does not exist or a 410 if the file is gone.

Misleading the visitor by forwarding their request to a remote destination not associated with the document they were seeking may not be unethical for some but IMO it is not the proper response. I know many do this, but I think it is poor webmastering; again, just my own view. YMMV.

graeme_p

5:41 am on Sep 7, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you return a 403, a wrongly identified good bot has some inkling of what the problem is, and if they are keen to spider your site they could contact you.

topr8

9:26 am on Sep 7, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



403 is the 'correct' response, i agree, keyplyr. however i sometimes wonder, in a reverse of graeme_p's point, whether a bad bot could then be even more interested if it is forbidden.

keyplyr

9:42 am on Sep 7, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@topr8 - yes, that's what I meant above where I started serving bad bots 404s for vulnerability probes

They were being blocked with 403s due to malformed headers but they kept coming. Switching to 404 (as they should be for nonexistent file requests) helped reduce the hits.

whitespace

12:05 pm on Sep 7, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



i'm currently inclined towards 200 with a very minimal file.


Even for a "dumb" bot I can't see that returning a 200 OK response is going to help - I would have thought it could only make matters worse. What if this "bad bot" was trying to build some kind of public index? A 200 response is only going to encourage your URL being listed in that unscrupulous index.

How about 301 (to their own IP)?


There's always one! ;) However, this does assume that the "dumb" bot will follow the redirect - which I have my doubts - I imagine it will probably just get dropped (as with most other non-200 responses). Sending such a redirect response can also alert the bot that "I'm onto you!", rather than a more neutral response. A default Apache redirect response is perhaps a little bigger than a "minimal response". Do you really want your logs littered with these ambiguous 301 responses?

They were being blocked with 403s due to malformed headers but they kept coming. Switching to 404 ( as they should be for nonexistent file requests) helped reduce the hits.


I would agree, if the file does not exist then return a 404, regardless of whether it is a bad bot or not.

However, this isn't necessarily so easy. As with many database-driven web "apps" these days, the only way to determine whether it's a 404 is by routing the request through the application - which in some ways defeats the point. (Presumably you want to block the bad bot early, to consume as few resources as possible.)
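one way to square this is a short-circuit check for well-known probe paths before the request ever reaches the application. a sketch, where the path list is purely illustrative:

```python
# Sketch: answer known vulnerability-probe paths with a 404 before
# the request reaches the database-driven application. The path
# prefixes below are illustrative examples of common probes only.

PROBE_PREFIXES = ("/wp-login.php", "/xmlrpc.php", "/phpmyadmin/")

def early_status(path):
    """Return a status line to short-circuit with, or None to
    pass the request through to the full application."""
    if path.lower().startswith(PROBE_PREFIXES):
        return "404 Not Found"
    return None
```

the trade-off is that this only catches probes you've already seen in your logs; anything novel still has to be resolved by the application itself.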

bakedjake

1:40 pm on Sep 7, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



@topr8 - Along with the HTTP response code, another lever you can play with is the response time back to the bots.

So maybe you want to hold that HTTP connection open for an absurdly long time, because at least that bot (or one of its threads) is tied up and can't do anything else.

In SMTP land they call it tarpitting: exponential back-off from badly behaving clients.
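a rough sketch of that idea applied to HTTP, doubling the delay for each repeat offender up to a cap. the in-memory dict is a stand-in; a real setup would need shared, expiring state:

```python
import time

# Tarpitting sketch: delay responses to repeat offenders, doubling
# the delay on each hit up to a cap. _hits is an in-memory stand-in
# for whatever shared/persistent store a real server would use.

_hits = {}

def tarpit_delay(ip, base=0.5, cap=30.0):
    """Return (and record) the delay, in seconds, to impose on
    this client's next response."""
    count = _hits.get(ip, 0)
    _hits[ip] = count + 1
    return min(base * (2 ** count), cap)

def handle_bad_bot(ip):
    time.sleep(tarpit_delay(ip))  # tie up the bot's connection
    return "403 Forbidden"
```

note phranque's caveat below still applies: each held connection costs the server a slot too, so this only pays off if your server handles idle connections cheaply.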

topr8

7:54 pm on Sep 9, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



thanks bakedjake, i'll look into that - it sounds like a way of doing my bit to slow this relentless tide down.

keyplyr ... i'm glad to hear it has made a difference serving a 404 to those guys ... i have started doing that in the last month or so for various probes that i pick up in the access log, especially the endless wordpress ones.

>> (Presumably you want to block the bad bot early, to consume as few resources as possible.)
naturally whitespace yes i do, that is the objective - essentially to block as much as possible before i do any 'heavy lifting', part of that is deterring bots from coming back.
although i'm sure bandwidth is so cheap that many bots just work through their url lists and keep doing so whatever the response.

phranque

10:55 pm on Sep 9, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you want to hold that HTTP connection open for an absurdly long time

that seems like a resource-intensive form of "punishment", no?

iamlost

11:14 pm on Sep 10, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



router(config)#interface null 0
router(config-if)#no ip unreachables


Although I am tempted to return a 451 Unavailable For Legal Reasons :) since it was published early this year.