"Anonymous" Bots

         

aristotle

4:40 pm on Jul 3, 2015 (gmt 0)

We're always talking about ways to block various bots, but there's one category of them that I don't remember seeing any discussions about. These are bots that don't provide any information, except for IPs, that can be used to block them. I actually see more of these in my logs than any other kind of bot.

Here's an example of what I'm referring to:
Host: 173.244.181.29
/
Http Code: 403 Date: Jul 02 23:25:20 Http Version: HTTP/1.1 Size in Bytes: 13
Referer: -
Agent: Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0

This one has shown up 7 times already today on one of my sites, with each request from a slightly different IP at xlhost. I don't see any way to block it by file or file type, or by referer, or by UA.

In other words, you usually have to use IPs to block this category of requests. But since they keep coming from so many new and different places, it can consume a lot of time trying to keep up with them.
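(For anyone wondering what I mean by blocking by IP: the rules themselves are nothing fancier than deny lines in .htaccess. A rough sketch only -- the ranges below are documentation placeholders, not the actual xlhost allocations:)

# Apache 2.2 style (mod_authz_host), assuming AllowOverride permits it
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24
# Apache 2.4 would use "Require not ip ..." inside a <RequireAll> block instead.

The problem isn't writing the lines; it's that the list of ranges never stops growing.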

Another problem with this kind of bot is that it's not as noticeable in your logs, so if you're just scanning, as I usually do, you won't catch a lot of them.

Please tell me if I'm overlooking something here, some better way to deal with this, because I simply don't have time to try to keep up with these things.

lucy24

6:24 pm on Jul 3, 2015 (gmt 0)

This one

How do you know it's all the same bot? Obviously there's some unifying feature, or you'd assume it was seven random different robots. If it's all the same UA, why can't you block it?
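(For the record, if the UA really is the unifying feature, one mod_setenvif line plus a deny is all it takes. A sketch only -- "bad_ua" is just an arbitrary variable name:)

# flag the exact quoted UA string, then deny anything carrying the flag
# (a live human still running that exact Firefox build would be caught too -- that's the tradeoff)
BrowserMatch "^Mozilla/5\.0 \(Windows NT 6\.3; rv:36\.0\) Gecko/20100101 Firefox/36\.0$" bad_ua
Order Allow,Deny
Allow from all
Deny from env=bad_ua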

:: detour to own raw logs ::

Gosh, that's interesting. I've hardly ever seen this exact UA-- but the ones that did use it almost always had a tripartite request: robots.txt, favicon, front page, in that order. No humans.

Incidentally, the request you quoted did in fact get a 403. So what's the issue? Blocking by IP isn't a last-resort fallback; it should be considered your first choice, since it's less work for the server than any other rule you can think of.

aristotle

7:08 pm on Jul 3, 2015 (gmt 0)

Thanks for the reply, Lucy. That was just meant to be an example of the TYPE of request I'm referring to. Evidently it was blocked because I'd already blocked some xlhost IP ranges.

Yes, they can always be blocked. My point is that this type of request has so many variations, with new ones appearing every day, that it's just too time-consuming to try to keep up with them. Also, they're not as noticeable in the logs as named bots.

lucy24

9:31 pm on Jul 3, 2015 (gmt 0)

I generally disregard robots who only ask for the front page, unless there's some extra misbehavior like a blatantly fake UA or referer. Not worth the bother. Now, robots who ask for interior pages without first consulting robots.txt ... that's a problem.

Asking for the favicon was nasty, though, because in my automated log-wrangling that got them marked as human. (Query: What do they do with it?) If they hadn't asked for that, then they would have been handed off to a later function that marks iffy requests as robots if they happen to ask for robots.txt. (Can't do it universally on this basis, because you do get the occasional snoopy human. I've been known to do it myself!)

aristotle

12:38 am on Jul 4, 2015 (gmt 0)

Maybe we can agree that one characteristic of a good .htaccess file is its efficiency. My approach is to try to develop a fairly simple, short .htaccess file that blocks a large majority of unwanted requests. If you try to block anything and everything, you run into the law of diminishing returns.

keyplyr

1:43 pm on Jul 4, 2015 (gmt 0)

"develop a fairly simple short .htaccess file that blocks a large majoroty of unwanted requests."

18 years ago that was my plan too. Now mine is 1.2MB but still highly efficient. Even at this size, Google's PageSpeed tool says my server responds quickly, and to the eye pages load almost instantly on a fast network.

aristotle

1:54 pm on Jul 4, 2015 (gmt 0)

1.2MB? Do you mean megabytes?
That's hard to imagine. None of the .htaccess files for my five sites is more than 6 kilobytes.

At any rate, as I mentioned earlier, it's also a question of how much time you want to spend on this. Nobody has more than 24 hours in a day, and I don't enjoy looking at logs and working on .htaccess; I prefer to spend my time in other ways.

lucy24

6:09 pm on Jul 4, 2015 (gmt 0)

Now mine is 1.2MB

None of the .htaccess files for my five sites is more than 6 kilobytes.

Equally ?! to each.

I've got two layers of htaccess. The shared one-- mod_authwhateveritis and mod_setenvif, located in the userspace shared by all sites-- is currently 28K. The site-specific ones obviously vary; the biggest is around 18K.
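(In case the layering idea isn't obvious: Apache reads .htaccess files down the directory tree, so the shared file sits one level above the individual docroots and every site picks it up. A bare sketch of the division of labor -- the paths, UA substring, and ranges here are all invented for illustration:)

# /home/account/.htaccess -- shared layer, applies to every site below this directory
BrowserMatch "ExampleBadBot" bad_bot        # hypothetical UA substring
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.0/24                      # placeholder range

# /home/account/site-one/.htaccess -- site-specific layer, read on top of the shared one
ErrorDocument 403 /boilerplate/403.html     # hypothetical path
RewriteEngine On
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]   # hypothetical moved page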

I don't think the size of an htaccess file would ever be a significant part of site performance, compared to the time it takes to build pages and send out content, but still. The longer it is, the more you yourself have to plow through when editing it.

:: insert boilerplate about how the significant thing for the server isn't htaccess filesize but its mere existence or even potential existence ::

I have no idea how big a typical config file would be if it's all one site. The one that comes with MAMP is currently 21K, plus half a dozen or so included files (vhosts and so on) that are a couple of K each. But most of that filesize is in comments, not in executable code; Apache is very generous with informational lines. Since it's MAMP there is of course no access-control business except the built-in rules for ".ht" and the like.

keyplyr

10:36 pm on Jul 4, 2015 (gmt 0)

...it's also a question of how much time you want to spend on this.
Webmaster priorities vary greatly, I would assume. Some are hobbyists, some run fan pages or family sites, and others are in business, to varying degrees. My main site pays my monthly mortgage and most of my other bills, so I spend the amount of time necessary.

The longer it is, the more you yourself have to plow through when editing it.
Well, as I've said in other threads, I don't have very many rewrite rules, probably under a dozen lines, so there's nothing to "plow through." However, it's probably a matter of perception. I'm used to working with my file. I know where everything is :)

When adding or editing CIDRs, the Search function of my text editor takes the cursor to where it needs to be. The only challenge with an htaccess file of this size is having it open at the same time as several other files (access and error logs, several scripts, etc.), so it was necessary to buy a more robust text editor and add more RAM to my machine. The size also makes it nearly impossible to manage the file when I'm on the road using only a phablet.

blend27

10:50 pm on Jul 5, 2015 (gmt 0)

I don't enjoy looking at logs and working on .htaccess; I prefer to spend my time in other ways


It is not just the access logs you have to pay attention to. You know your sites better than anyone else. If you start setting up basic traps within your sites, you will see how many patterns occur over and over from the bots that are not completely stealthy. The completely stealthy ones are always one step ahead of us.

One of my sites has no link from the homepage to the homepage itself. Anything that sends a referrer as the root of that domain (http://www.domain.tld/) is banned on the spot.

That IP gets into a central table that all other sites feed from.

That IP later (0.5 seconds afterwards) gets pulled into a manual review pool, along with the IP range (I use other tools to get the rest of the hosting company's ranges to block), the headers sent, UA, country, URI accessed, referrer, and access patterns.

It's not that hard; it is a battle you have to enjoy, though...

lucy24

12:52 am on Jul 6, 2015 (gmt 0)

One of my sites has no link from the homepage to the homepage itself.
Well, no reason why any site would ever need a link from any page to itself. I'd count it as "worse than useless" because it confuses the user.
Anything that sends a referrer as the root of that domain ... is banned on the spot.
I've got a whole cluster of this kind of site-specific lockout. mod_rewrite by itself can't globally block auto-referers,* worse luck, but some patterns can be ruled out. No matter how your site is laid out, there will be universals, like claiming to come from EXAMPLE.COM when the site is really www.example.com.


* Not long ago I poked a hole for DuckDuckGo's faviconbot, inexplicably crawling from an AWS range ... only to find it blocked all over again for requesting the front page with an auto-referer. Honestly, DDG, are you trying to get locked out?
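To make those "universals" concrete, the two patterns come down to a couple of mod_rewrite conditions. A rough sketch only -- example.com stands in for the real site, it assumes the canonical hostname is www.example.com, and the index.html assumption won't fit every setup:

RewriteEngine On

# front-page request claiming to have been referred by the front page itself (the auto-referer)
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com/?$ [NC]
RewriteRule ^(index\.html)?$ - [F]

# referer that names this site but gets the hostname wrong (bare domain, odd capitalization)
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^https?://www\.example\.com
RewriteRule ^ - [F]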

keyplyr

2:02 am on Jul 6, 2015 (gmt 0)

Well, no reason why any site would ever need a link from any page to itself. I'd count it as "worse than useless" because it confuses the user.
Maybe you misunderstood (or maybe I did), but it is standard practice for sites to link their logo back to the "home" page. Almost every major site does this, including all of mine.

Anything that sends a referrer as the root of that domain ... is banned on the spot.
My version of that is: I block any "page" request that includes that same page as the referrer. Luckily my server config does not redundantly include the same page as the referrer when the browser reloads.

lucy24

5:09 am on Jul 6, 2015 (gmt 0)

it is standard practice for sites to link their logo back to the "home" page

Sure, but you don't need it to be an active link on the home page itself.

Luckily my server config does not redundantly include the same referrer page when the browser reloads.

I doubt it has anything to do with the server. A browser simply doesn't send a referer on a refresh/reload; it's functionally the same as a bookmark request.

I block any "page" request that includes that respective page as the referrer.

How? I mean, how, mechanically? Do all page requests detour via a quick php script, or is it a very small site? (Other than the front page, I can only block auto-referers for a handful of specified html pages, each coded individually.)

keyplyr

7:12 am on Jul 6, 2015 (gmt 0)

Sure, but you don't need it to be an active link on the home page itself.
Don't need to, but I have the logo linked to the homepage on 400 pages; if the logo on the homepage doesn't do the same thing, I compromise the trust factor. IMO consistency is an asset. Visitors need to trust the site. Besides, I built it into the page header I include atop all pages; not having it on one page would require a second header.

I doubt it has anything to do with the server. A browser simply doesn't send a referer on a refresh/reload; it's functionally the same as a bookmark request.
Of course, that's what I'm saying... so when my script identifies this behavior, it is a bot.

How? I mean, how, mechanically? Do all page requests detour via a quick php script, or is it a very small site? (Other than the front page, I can only block auto-referers for a handful of specified html pages, each coded individually.)
Well "detour" implicates leaving the normal path. My server-side scripting is more succinct than that IMO. This is what I meant when I said I got rid of most all the rewrites that once resided in htaccess. Depending on the request header, some hits run through cgi scripting that, if conditions matched, run several rules. The "no-self-referring" rule does allow for query string & parameter occurrence by method as well as host.

aristotle

12:29 pm on Jul 6, 2015 (gmt 0)

One of my sites has no link from the homepage to the homepage itself. Anything that sends a referrer as the root of that domain (http://www.domain.tld/) is banned on the spot.

I do that type of block too on a couple of my sites, but it mainly serves as a defense against botnets. I originally got the code for it from Lucy.

But that code doesn't work against the type of bots that I wanted to discuss when I started this thread, because they don't provide a referer.

keyplyr

12:45 pm on Jul 6, 2015 (gmt 0)

But that code doesn't work against the type of bots that I wanted to discuss when I started this thread, because they don't provide a referer.
I invite you to consider the benefit of parsing request headers. Bots will usually send malformed header data that can be used to identify them as bots.

There are several different approaches to pulling that info and writing it to a dynamic htaccess, a DB, or a server-side script, but all of this can be learned by searching the web. A good place to start would of course be your own host's knowledge base or member forum, since server admins often set things up a bit differently.
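Even without scripting you can get a taste of it at the .htaccess level. A minimal sketch -- the two headers checked here are just the classic tells, real checks look at far more, and the .html assumption won't suit every site:

RewriteEngine On

# page requests that arrive with no User-Agent or no Accept header at all
# (mainstream browsers send both; many scripted bots skip one or the other)
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule \.html$ - [F]

Parsing in a script goes a lot further, of course, since there you can examine the whole header set and log what the oddballs actually sent.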

aristotle

2:24 pm on Jul 6, 2015 (gmt 0)

Thanks, keyplyr.
Unfortunately that kind of approach is far beyond my current level of knowledge and skill, and I don't have time right now to do enough study to get to that level. I have to use simple-minded methods.

lucy24

4:18 pm on Jul 6, 2015 (gmt 0)

consider the benefit of parsing request headers

For a while I was handily blocking some types of request simply by looking at the value of one particular header. And then assorted ### mobiles started giving the same ### value to the ### header, so I had to throw that one out the window unless I went to complicated if/then constructions-- which simply wasn't worth the bother.

As has already been pointed out, if the time and effort involved in creating the rule is greater than the cost of all harm done by the robot, it may not be worth it. (Like when an insurance company discovers that the baseline cost of processing pre-approval for some service is more than the cost of paying for the occasional undeserving case. Pick your own analogy.)

keyplyr

4:27 pm on Jul 6, 2015 (gmt 0)

If you block all the ranges assigned to VPNs, server farms, data centers, and colos listed in these forums, you'll effectively stop a large percentage of bots, even botnets. However, it is just not possible to foresee which ISP account or home wifi router is being zombied, so that's always the Achilles' heel.

keyplyr

12:34 pm on Jul 7, 2015 (gmt 0)

"...unless I went to complicated if/then constructions-- which simply wasn't worth the bother."

I see that as the simple part. IMO a third of all the scripts out there are built on that framework, in various scripting languages.

The "complicated" part for me is always the implimentation, sewing it into the chain of processes.

lucy24

6:39 pm on Jul 7, 2015 (gmt 0)

I probably haven't spent enough time considering sites that are built around their own, hand-rolled php (or equivalent for other platforms). I always think in terms of either hard-coded HTML, or a CMS that works on the assumption that the user doesn't know anything. That is: you can tweak your WP code, but the essence of the system is that you don't have to.

Obviously if everything goes to php regardless, then there's little need for complicated rules in htaccess. It all starts with a simple
IF {all-possible-access-control-stuff comes back clean}
THEN {build page}
ELSE {send back a 403, with option on different types of physical content depending on exact result of first step}

It took me a year or so to wrap my brain around the idea that the server can record a 200 while the user sees a 404/410. Give me another couple of years and I may manage to do the same for a 403.

keyplyr

7:27 pm on Jul 8, 2015 (gmt 0)

While I do use it for a couple of contact forms and echo responses, I've always considered PHP an easy target for hacking. This idea probably comes from the old phpBB bulletin-board forums that were the prime vulnerability target prior to the WordPress trend.

In earlier PHP builds, the register_globals directive was an open door for bad actors to come in and move laterally from account to account on shared hosting; those php.ini files defaulted to register_globals = On. I'm aware all of that has been fixed for a while now, but I still have a bad feeling whenever I'm confronted with using out-of-the-box PHP, so I usually write my own with several security checks, as well as a custom CAPTCHA (if I use PHP at all).

I also don't feel that PHP is the best choice for server-side scripting for what I do. Perl via CGI has always been my method of choice.
