blocking user agents - 403 - Crawler, Spider, and User Agent ID forum at WebmasterWorld

blocking user agents - 403

UA continues request

smallcompany

7:28 am on Apr 15, 2010 (gmt 0)

When I block a specific UA in .htaccess with a RewriteCond followed by RewriteRule ^(.*)$ - [F,L], I expect it to drop off at that point, or to make another request for the same or a different page, whatever.

But what I see is that a specific UA and IP keeps requesting everything that is within the page, including images, JS, and CSS files.

The UAs are those like:

- Empty UA
- Mozilla/4.0 (compatible;)

Now I wonder how that is possible if the server has already issued a 403.

I must mention that this is not the case for all 403s issued and that the number of 403s varies for different sites.

There are cases where a single IP makes all those requests.
There are also cases where the IP changes, but within a pattern like 64.12.x.x, where the third octet changes sometimes and the fourth all the time.

The only thing I could think of was that they're using some kind of cache.

If they're new to the site, they should not be able to get past the 403 and actually "know" what's within the requested page - am I correct?

Thanks

Pfui

4:24 pm on Apr 15, 2010 (gmt 0)

1.) Who, What, Where, When, How

Many bots crawl from non-obvious Hosts/IPs using multiple, including cloaked, agents. Thus if an apparent newcomer already knows your file paths, there's a strong likelihood they've been to your site before.

Alternatively, if your file paths are 'visible' via robots.txt, and/or you've not specifically denied caching in page-based HTML, .htaccess, and/or via at least some of the majors' webmaster tools, again everything's visible.

2.) When 'No' Doesn't Mean No

Many bots couldn't care less about 403s, ditto many individual browser add-ons, link-checkers, file-downloaders, and users. The former is bad programming, imho. The latter can simply be clueless, or in too many cases, compromised.

My Solution (ymmv): When I'm hit by regularly 403-ignoring visitors of any kind, I rewrite them to 127.0.0.1. Then, if they're relentless or their hit rate's too rapid, I send a Cease-and-Desist (C&D) to the ISP. I'm often surprised how frequently the latter works. (Alas, in the case of notorious ISPs like theplanet or amazonaws, don't hold your breath.)
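
As a rough illustration of the "rewrite to 127.0.0.1" step, here is one way it could be written as an external redirect in .htaccess. This is a sketch, not necessarily the exact rule used above; "ExampleBadBot" is a hypothetical user agent string.

```apache
RewriteEngine On
# "ExampleBadBot" is a placeholder; substitute the offending UA pattern
RewriteCond %{HTTP_USER_AGENT} ExampleBadBot [NC]
# Redirect the client to its own loopback address instead of returning 403
RewriteRule ^ http://127.0.0.1/ [R=301,L]
```

A client that follows redirects ends up requesting its own machine; one that ignores 403s may well ignore redirects too.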

In extreme cases, you can place a firewall rule against them if you have the means, or, depending on your server software, you can deny them from the get-go so denials waste the least amount of resources and don't clog your site-level logs.

3.) Belt-and-Suspenders

If the offending Host/IP/bot is notorious -- search Goo, projecthoneypot.org, this forum's posts, etc. -- don't sweat locking out the address. However, if it's someone clueless, sooner or later they may revisit and realize what havoc they've been wreaking. That's why I wait awhile between 403s (with my e-address in graphic form) and oblivion (127.0.0.1).

For example, in the case of Safari's 'Top Sites' feature, the browser learns to revisit oft' visited pages on launch. (This supposedly cool code thing is a MAJOR headache on dynamic sites because indices get hit countless times/day for no real-time purpose whatsoever.) Anyway, before you kill because of 403-abuse, make sure it's not just a regular visitor's browser doing its thing.

4.) Huh-Wha--?

If the preceding is more geek than not, you'll find info and how-to details about most of these options in the appropriate forums here, e.g., Apache Web Server, and specifically their Library docs.

thetrasher

4:26 pm on Apr 16, 2010 (gmt 0)

Mozilla/4.0 (compatible;) = prefetch requests (Blue Coat proxy)

jdMorgan

12:08 am on Apr 21, 2010 (gmt 0)

To be clear, this is a "security check" implemented as what I call a "side-car" request. That is, the security check is not done ahead of the browser's fetches, but rather at the same time or a little later.

Therefore, this agent has already noted all URLs fetched by the real visitor's browser, and so your 403 has only the effect of blocking "this" request. The others (for images, css, js, etc.) occur not because of any 'hole' in your 403 handling, but because the security software is working from a list already made when the browser requested your page.

dstiles notes in a concurrent thread [webmasterworld.com] that the BlueCoat requests arrive with an HTTP header of "X_BLUECOAT_VIA" which you may be able to use in some way -- for example, to suppress logging of these requests or to 'give them a pass' through your access controls if you want to stay off BlueCoat's block list. I should note that I do not know that 403ing these requests will actually put you on their block list. This is just an example.
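
A minimal sketch of both ideas, assuming the header arrives on the wire as X-BlueCoat-Via (the underscore form quoted above resembles how headers show up in CGI-style environment variables) and that a "combined" log format is already defined. The variable name bluecoat_sidecar is made up for illustration.

```apache
# Flag any request that carries a non-empty Blue Coat header
SetEnvIf X-BlueCoat-Via ".+" bluecoat_sidecar

# Server or virtual-host config only (CustomLog is not allowed in .htaccess).
# Use in place of the existing CustomLog line: logs all but flagged requests
CustomLog logs/access_log combined env=!bluecoat_sidecar

# .htaccess: apply the UA block only when the header is absent or empty
RewriteEngine On
RewriteCond %{HTTP:X-BlueCoat-Via} ^$
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible;\)$"
RewriteRule ^ - [F]
```

Anyone can send this header, so giving it a pass also gives a pass to anything that fakes it.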

Jim

smallcompany

3:04 am on Apr 21, 2010 (gmt 0)

Oooo... got it. Things like Trend Micro or AVG, or anything else that "browses" alongside the big-brother browser, like those little fish attached to the body of a shark.

Those little pests...

My question now is:

Is there still really a reason to continue blocking such UAs, or should I simply let them do their thing?

The only reason those got onto my 403 list in the first place was requests for non-existent crap, which made me think something was wrong with those UAs.
If I had not seen a bunch of 404s coming from them, they would not have been 403ed later on.

I guess I could stop doing this, analyze the existing logs, and draw conclusions about how to condition this so some go through and others don't.

This is really a science in itself.