Folder view results in 301 to 404, fix required

JS_Harris

7:55 am on Feb 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello,

I redirect non-www requests to their www version towards the end of my htaccess file, after security and other fun things are taken care of.

I launched a site this week and immediately began seeing bot activity asking for common WordPress folders via HEAD requests. About half of those hit the www version and half do not, so my logs show half receiving a 404 error and the other half receiving a 301 redirect that then lands on a 404. It's not a WordPress site and the domain was never previously registered.

I'd like requests for any non-existent folder to receive a 404 error before the redirect, if possible, regardless of which hostname they ask for. How would I accomplish that?

For the curious, the requests are coming from several IPs:
- 82.81.163.202
- 81.218.175.83
- 177.129.90.37
- 195.234.215.147
- and others; none send a referrer, user agent or any browser information
- each makes 5 successive requests: /blogs/, /blog/, /wp/, /wordpress/ and the home page. None of these folders has ever existed.

whitespace

9:02 am on Feb 11, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Are you doing any other internal rewrites? No front controller, etc? So, all requests are for real files?

If so, then you could perhaps do an early check for non-existent files/directories (before your canonical www redirect):


RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* - [R=404]
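
To show where that check would sit relative to everything else, here is a rough sketch rather than a drop-in file. It assumes example.com stands in for the real domain, that the .htaccess lives in the document root, and that the only internal rewrite is the /page-... one mentioned later in the thread. Because /page-... URLs are not real files, the sketch exempts them from the check; that exemption is my assumption, not something from the original ruleset.

```apache
RewriteEngine On

# (bot control and security rules would go here)

# 1. Early 404 for anything that is neither a real file nor a real directory.
#    The first condition skips URLs that a later internal rewrite will handle.
RewriteCond %{REQUEST_URI} !^/page-
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ - [R=404]

# 2. Canonical www redirect (example.com is a placeholder).
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# 3. Internal rewrites come last.
RewriteRule ^page-(.*)$ /?code=$1 [L]
```

With this order, a HEAD request for /wp/ gets the 404 straight away, whichever hostname was requested, and never reaches the redirect.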


Although the other part of my brain is saying... just filter these 301s out of your logs. (?)

I redirect non-www requests to their www version towards the end of my htaccess file


Fair enough if you have other security business going on first, although the canonical www redirect would normally go nearer the start of your .htaccess file.

JS_Harris

10:53 am on Feb 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I do have some other internal rewrites... example
RewriteRule ^page-(.*)$ /?code=$1


My non-www to www is near the bottom of the file but that's because of the bot control and security taking up most of the space above.

not2easy

3:49 pm on Feb 11, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



My non-www to www is near the bottom of the file

As it should be. Positioning the www-non-www rewrite before other rewrites can cause requests to be processed more than once and in some cases can cause a 500 error. External redirects (301/302) will execute before any internal rewrites, unless that has been changed at the server level.

This:
RewriteRule ^page-(.*)$ /?code=$1

The target should be formatted with the protocol and path:
RewriteRule ^page-(.*)$ http://www.example.com/?code=$1 [R=301,L]

whitespace

6:30 pm on Feb 11, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Positioning the www-non-www rewrite before other rewrites can cause requests to be processed more than once and in some cases can cause a 500 error.


"can cause a 500 error"?! I'd have to see it to believe it.

External redirects (301/302) will execute before any internal rewrites, unless that has been changed at the server level.


Say what?! mod_rewrite directives execute top to bottom, in the order they appear. There is no server directive that changes this. (Are you thinking of mod_rewrite rewrites versus mod_alias Redirects?)

For example, what happens in the following?


RewriteRule ^foo$ bar
RewriteRule ^foo$ http://example.com/ [L,R]
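
To spell out what that pair does, here is the same example with comments, assuming the rules sit in an .htaccess file in the document root and the request is for /foo:

```apache
# Request: /foo  (per-directory context, so the pattern is tested against "foo")

# Rule 1 matches and rewrites the path to "bar".
# There is no [L] flag, so processing carries on to the next rule.
RewriteRule ^foo$ bar

# Rule 2 is now tested against "bar", not "foo", so it does not match
# and the external redirect is never issued.
RewriteRule ^foo$ http://example.com/ [L,R]
```

The internal rewrite wins simply because it comes first. Swap the two rules and the redirect fires instead.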


The target should be formatted with the protocol and path:


Errrm, no it shouldn't - not in this case. The original directive is an internal rewrite. By including the protocol (and especially the R flag) it becomes an external redirect.
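
Side by side, the two forms look like this. It is a sketch for comparison only (you would use one or the other, not both), with www.example.com as a placeholder hostname:

```apache
# Internal rewrite: the visitor keeps seeing /page-abc in the address bar,
# while the server quietly serves /?code=abc.
RewriteRule ^page-(.*)$ /?code=$1 [L]

# External redirect: the browser is sent a 301 and the address bar
# changes to http://www.example.com/?code=abc.
RewriteRule ^page-(.*)$ http://www.example.com/?code=$1 [R=301,L]
```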

lucy24

8:47 pm on Feb 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



towards the end of my htaccess file

Domain-name canonicalization (the fancy term for with/without www) would normally be your very last external redirect within mod_rewrite. (The second-to-last is generally the index.html redirect.) When you talk about order of things in htaccess, you're really talking about the order within any given module, since each mod is an island. For most people, mod_rewrite will end up having the highest line count,* so you'll find it easiest to group it at the end of your htaccess.

If so, then you could perhaps do an early check for non-existent files/directories (before your canonical www redirect):

Urk. That strikes me as a last-choice option, especially in htaccess, since the server then has to look for everything twice (once as part of the RewriteRule, and then later when it comes time to serve up the file for real). I'd do something like this instead:

RewriteCond %{THE_REQUEST} same-as-below
RewriteRule ^(admin|wp|blahblahetcetera) - [R=404]

That way you're not letting anyone in, but you're also not giving away any information about what's really on your site. By returning the 404 manually, you save the server the work of going to look for the file. The flag R=anything-outside-the-300-range carries an implied [L], so the request will never reach the www redirect and will get an immediate 404. Yes, it is perfectly all right to lie to malign robots ;)
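
Filled in with the folder names from the first post, the rule might look something like the sketch below. The RewriteCond is left out because its pattern isn't given here, example.com is a placeholder, and the trailing (/|$) is my own addition so that only those exact folder names are caught.

```apache
RewriteEngine On

# Hand-made 404 for folders that have never existed on this site.
# Matches /blog, /blog/..., /blogs/..., /wp/..., /wordpress/...
# [R=404] stops rewriting, so these requests never reach the redirect below.
RewriteRule ^(blogs?|wp|wordpress)(/|$) - [R=404]

# Canonical www redirect, placed after the rule above.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Remember to take a name out of the list if you ever create that folder for real.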

Later, of course, you'll eyeball your 404s and check for IPs that should be blocked categorically. Make sure you've got an [L] exemption for your 403 page right at the beginning of your mod_rewrite section, or you'll get robots asking for it by name when they used the wrong www. Ask how I know this.
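
The exemption itself is a one-liner. This sketch assumes the error page is a file called /403.html in the document root; that name is made up, so substitute whatever your ErrorDocument actually points to.

```apache
# Custom page for forbidden requests (hypothetical filename).
ErrorDocument 403 /403.html

RewriteEngine On

# First rule in the mod_rewrite section: leave requests for the 403 page
# alone, so no later block or redirect rule ever touches it.
RewriteRule ^403\.html$ - [L]

# (bot control, redirects and internal rewrites follow)
```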


* I think my access-control directives "Deny from" actually take up more space-- but those are located in a different htaccess file, one directory upstream, so that's separate anyway.

JS_Harris

9:16 pm on Feb 11, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Lucy, if the same non-existent directories are continually targeted then standard bot control works if I place it over the targets. That's a much simpler solution than playing cat and mouse with these people, as long as I remember that I've blocked those folders if/when I do actually create any of them :)

The original directive is an internal rewrite. By including the protocol (and especially the R flag) it becomes an external redirect.

Correct, I am only internally rewriting with that example, not redirecting anyone. /page-code is more user friendly than /?code=thisnthat and I don't really want to give away the name of the file performing the work right in the address bar.

Make sure you've got an [L] exemption for your 403 page right at the beginning of your mod_rewrite section, or you'll get robots asking for it by name when they used the wrong www. Ask how I know this.
How? :) You made me think, should I be giving bots that are filtered by user agent a 404 or is just F ideal? A 404 tells them very little while a 403 suggests I'm onto them.
RewriteCond %{HTTP_USER_AGENT} ^(example|random|badguy|bot|list) [NC]
RewriteRule .* - [F]
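
For comparison, the 404 flavour of the same rule would be something like this sketch, with the same placeholder user-agent list:

```apache
# Same user-agent filter, answering "not found" instead of "forbidden".
RewriteCond %{HTTP_USER_AGENT} ^(example|random|badguy|bot|list) [NC]
RewriteRule ^ - [R=404]
```

Either way the rewriting stops there; the only difference is the status code the bot sees and which line it lands on in the logs.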