.htm to Extensionless URLs - Plus Renaming Files
.htaccess on deck
MarkOly



 
Msg#: 4587284 posted 11:04 pm on Jun 24, 2013 (gmt 0)

After much deliberation, I've decided to convert from .htm extensions to extensionless URLs. I'm also changing the names of most pages and moving them to subfolders - about 80 pages. I've pieced together the .htaccess code based on the great examples I've cherry picked here.

RewriteEngine On
RewriteBase /

#1 - Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.html?$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 - Redirect index.html or .htm in any directory to root of that directory and force www
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

#3 - Redirect all .html requests to .htm on canonical host.
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.htm [R=301,L]

#4 - Redirect direct client request for old URL with .htm extension
# to new extensionless URL if the .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.htm\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.htm$ http://www.example.com/$1 [R=301,L]

#5 - Redirect any request for a URL with a trailing slash to extensionless URL
# without a trailing slash unless it is a request for an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ http://www.example.com/$1 [R=301,L]

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

#7 - Internally rewrite extensionless URL request
# to .htm file if the .htm file exists
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]


I'm wondering if it would be a good idea to trim some fat from this. For one thing, on the 80 specific URL redirects (#1), will the inclusion of html extensions be a huge extra burden? Considering that there's 80 lines to go through, would it be a good idea to only include the necessary .htm extensions?

If there's one error I see more than any other in my logs, it's .html requests. That's why I added #3 (redirect .html to .htm). I know you want to avoid multiple redirects, so I'll probably want to get rid of #3. I could easily combine it with #4 (redirect .htm to extensionless) if I could delete the file check in #4 (RewriteCond %{REQUEST_FILENAME} -f). So I'm wondering how important that file check is. There's another file check in #7 (the internal rewrite to .htm), so it doesn't seem all that necessary. It looks like the file check would prevent Apache from cycling through again when a bad file name is requested. But from what I've read, filename and directory checks use a lot of resources. So it seems like more resources would be spent checking every request against the filesystem than would be saved by keeping the occasional bad-filename request from cycling through once more. Am I missing something?

I'm also wondering how important #5 is (remove trailing slash from files), which requires a directory check (RewriteCond %{REQUEST_FILENAME} !-d). I don't have problems with this error now - the .html requests are a lot more common. But if I were using extensionless URLs, I bet it would be a different story. Is this a common error once you convert to extensionless?

If you see anything else I should be concerned with, please let me know. Thanks for any help!

MarkOly

 

lucy24




 
Msg#: 4587284 posted 12:25 am on Jun 25, 2013 (gmt 0)

Looking primarily at the rule list itself rather than at your own questions (which are good ones):

# Then repeat the above 80 times.

Is there no pattern at all that will allow you to collapse the redirects into a smaller number of rules? 80 is not a vast number in the greater scheme of things, but it's an awful lot of lines filling up the htaccess. Are there a lot of residual URLs that will continue to have extensions? Can those fit into a pattern?

You may find it cleaner to rewrite (not redirect) to a php script that does the lookup and issues the redirect. If you take this approach, leave the rule in the same place-- as if it were a page-specific redirect, even though on the surface it's a rewrite alone.
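A minimal sketch of the .htaccess side of that approach, assuming a hypothetical lookup script named redirect-map.php (the script would read the old path, look up the new URL, and issue the 301 itself):

# Hedged sketch: internally hand any remaining .htm/.html request to a lookup script
# (redirect-map.php is a made-up name for illustration)
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ /redirect-map.php?old=$1 [L,QSA]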

.html to .htm: Then again, you could say
^(.+\.htm)l to $1 ;)
But you may find it conceptually easier to stay with the
(capture)\.html to $1.htm form.

That's assuming your paths contain literal periods so you can't use [^.]. Your other rules suggest they don't, so you can use the same formulation here.
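Spelled out as full rules (a sketch, assuming no literal periods in the paths), the two forms would be:

# Shorter form: capture everything up to and including ".htm", drop the trailing "l"
RewriteRule ^([^.]+\.htm)l$ http://www.example.com/$1 [R=301,L]
# Conceptually plainer form: capture the name, re-append ".htm"
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.htm [R=301,L]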

#3 - Redirect all .html requests to .htm on canonical host.

#4 - Redirect direct client request for old URL with .htm extension

Seems like these should go the other way around. Otherwise you're potentially redirecting the same request twice: from html to htm to nothing. Besides, didn't rule group #1 already take care of any requests for the old URLs?

The -f or !-f test is always a last resort. If you've got an alternative, such as feeding in a list of known names, use it.
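One hedged way to "feed in a list of known names" instead of hitting the filesystem (the page names below are hypothetical placeholders):

# Rewrite extensionless requests to .htm only for names on a known list,
# instead of running a -f check on every request
RewriteCond $1 ^(about|contact|products/widgets)$
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]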

#5 - Redirect any request for a URL with a trailing slash to extensionless URL without a trailing slash unless it is a request for an existing directory

Why do you need this rule? Have you been getting requests with spurious directory slash? There's quite a long list of "rules you don't need unless you need them", and this would seem to qualify.

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$

If you've got human visitors using http 1.0 you'll need a second condition that says
%{HTTP_HOST} example\.com
(no anchors needed) to skip over requests with no hostname at all.

Dideved



 
Msg#: 4587284 posted 1:49 am on Jun 25, 2013 (gmt 0)

^(.+\.htm)l to $1 ;)


Wink back. ;)

Unless it wasn't meant for me, in which case... whistles nonchalantly and casually steps away. :p

MarkOly



 
Msg#: 4587284 posted 4:15 am on Jun 25, 2013 (gmt 0)

Besides, didn't rule group #1 already take care of any requests for the old URLs?

OMG you're right! I completely eliminated rule #4 by sheer brute force! The only thing left for #4 to catch is bad .htm requests.

Is there no pattern at all that will allow you to collapse the redirects into a smaller number of rules?
You may find it cleaner to rewrite (not redirect) to a php script that does the lookup and issues the redirect.

Unfortunately, there are no patterns that would make a real difference. The file names are all over the map and bear little resemblance to the products or subjects they represent. They were created in the FrontPage 2000 days, as you can probably tell from the .htm extensions. It was finally time to give the site a makeover. Embarrassingly, I have yet to dabble in PHP. So I think I'm siding with embracing my 80-line redirect set, but going completely minimalist on everything that's not essential - meaning no worrying about .html requests (except for #2, the index rule), no -f tests (except for #7), and no -d tests. I can always keep an eye on my logs in case anything becomes a problem.

^(.+\.htm)l to $1 wink

Apache humor? Nice! Did you spell bad things upside-down with your calculator when you were a kid?

That's assuming your paths contain literal periods so you can't use [^.]. Your other rules suggest they don't, so you can use the same formulation here.

No, no periods. I did my homework. I wanted to try and be (.*)-free if I could.

Why do you need this rule? Have you been getting requests with spurious directory slash? There's quite a long list of "rules you don't need unless you need them", and this would seem to qualify.

Most of the examples I read here include #4, #5, and #7 together as the standard set for doing extensionless URLs. I suspect that once you go extensionless, it's a common thing for a slash to get slapped on the end every now and then. I'll do without it but keep an eye on it.

So here's my updated ruleset - minimalist except for the 80-line rule:

RewriteEngine On
RewriteBase /

#1 - Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.htm$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 - Redirect index.html or .htm in any directory to root of that directory and force www
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteCond %{HTTP_HOST} example\.com
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

#7 - Internally rewrite extensionless URL request
# to .htm file if the .htm file exists
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]

I like that much better. I think eliminating the 3 rules makes the 80-line rule a little easier to swallow. Then I can watch the logs and patch any holes as they come up.

Thanks for the help Lucy! I really appreciate it.

lucy24




 
Msg#: 4587284 posted 5:46 pm on Jun 25, 2013 (gmt 0)

Depending on what happens in your server, it may be necessary-- horrid possibility!-- to preface every single one of your 80 redirects with a RewriteCond looking at THE_REQUEST. But let's not borrow trouble :)

Rule #7 will probably run faster if you express the pattern as
^([^.]+[^./])$
That way, requests for directories will meet only a single-character hiccup at the very end, while requests for non-page files will have no hiccup at all. I don't know whether there is any measurable difference in parsing speed between [^./] and \w (assuming for the sake of discussion that you have no filenames ending in hyphen or other non-word character).

Here too it may be necessary to include a condition looking at THE_REQUEST. More necessary, possibly, since it's an internal rewrite. If so, it should come before the existing condition. (When there's more than one RewriteCond, joined by the default AND, list them in order of likelihood to fail, starting with the most likely.)
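One possible shape for rule #7 with such a condition listed ahead of the -f check (a sketch only; it is essentially the form rule #7 takes in the final ruleset later in the thread):

# Hedged sketch of rule #7 with a THE_REQUEST condition placed first
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^([^.]+[^./])$ /$1.htm [L]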

g1smd




 
Msg#: 4587284 posted 7:05 pm on Jun 25, 2013 (gmt 0)

If you re-include Rule 3 it should instead redirect both .html and .htm requests to extensionless. If you have no pages which map old to new URLs simply by removing the extension then you probably don't need this rule. However, I'd leave it in simply in case someone does ask for one of your new pages but for whatever reason adds an extension to the URL request. A slightly modified Rule 4 could achieve the same result.
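The modified rule would look something like this (a sketch; it is essentially the rule MarkOly later adds as #8):

# Redirect any remaining .htm or .html request to the extensionless URL;
# old-name-to-new-name moves are already handled by the rule #1 block
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]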

The code in Rule 6 doesn't match the plain-English description in the comment. As coded, the first RewriteCond stops the redirect when any of the www. or ftp. or mail. sub-domains are requested by HTTP. Is the comment what you want to do and the code wrong, or is the code right and the comment wrong?

MarkOly



 
Msg#: 4587284 posted 7:36 pm on Jun 25, 2013 (gmt 0)

Depending on what happens in your server, it may be necessary-- horrid possibility!-- to preface every single one of your 80 redirects with a RewriteCond looking at THE_REQUEST. But let's not borrow trouble!

I think it should be okay. I already have a small list of these in place that have been working. Yeah that would be a PIA.

Rule #7 will probably run faster if you express the pattern as
^([^.]+[^./])$

I had to go over this a million times. You reminded me about the directory pages. So I thought yeah, what about the directory pages, but wait, that rule doesn't allow the directory pages. Then it finally dawned on me that I need to actually allow the directory page requests through. So I'm good now. I'll change #7.

Here too it may be necessary to include a condition looking at THE_REQUEST. More necessary, possibly, since it's an internal rewrite. If so, it should come before the existing condition.

Is that to save resources? To keep #7 from doing a file check on every single request that hits it?

Thanks Lucy!

MarkOly



 
Msg#: 4587284 posted 8:03 pm on Jun 25, 2013 (gmt 0)

A slightly modified Rule 4 could achieve the same result.

Yeah, that's what I was thinking too in my first comments. I kind of threw #3 in there as a sacrificial lamb, leading to the idea of combining it into #4. But then when Lucy opened my eyes to the fact that every single valid .htm request that could ever come through is covered by one of the 80 lines in rule #1, that led me to just remove #4 completely. Then I thought, you know what, I might as well go all the way and remove everything else that's not essential, since I'm already taxing things with the 80-line rule. Then I can see what errors come through in my logs and add a rule if needed.

As coded, the first RewriteCond stops the redirect when any of the www. or ftp. or mail. sub-domains are requested by HTTP.

That's what I'm trying to do. I think I should have worded it "#6 - Redirect requests for non-www, non-ftp, or non-mail subdomains to www subdomain." Does that fix it?

Thanks for the help g1smd!

lucy24




 
Msg#: 4587284 posted 8:11 pm on Jun 25, 2013 (gmt 0)

Post written before preceding post was visible
The code in Rule 6 doesn't match the plain-English description in the comment. As coded, the first RewriteCond stops the redirect when any of the www. or ftp. or mail. sub-domains are requested by HTTP. Is the comment what you want to do and the code wrong, or is the code right and the comment wrong?

Uh-oh, did I read the rule backward?

:: repeat look at #6 ::

#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteCond %{HTTP_HOST} example\.com
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]

To me the rule says: if the request is for a named subdomain other than www or ftp or mail, then redirect to www. Come to think of it, was this rule intended for domain-name canonicalization? If so, use the ordinary
!^(www\.example\.com)?$
pattern. (One condition only.) I don't think you need to say anything about mail or ftp at all, unless those really are named subdomains accessed via http.
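The whole canonicalization rule with that single condition would be something like this (a sketch):

# Redirect any hostname other than www.example.com to www;
# an empty hostname (HTTP/1.0 request with no Host header) is left alone
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]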

Is that to save resources? To keep #7 from doing a file check on every single request that hits it?

Ordering of RewriteConds isn't a carved-in-stone issue, because details depend on your individual site. At bottom there are two possibilities:
--two or more conditions which ALL have to be met
and/or
--two or more conditions, of which AT LEAST ONE has to be met

mod_rewrite works on a sudden-death principle. If a condition has to succeed, and it doesn't, mod_rewrite doesn't even look at the rest of the list. Conversely, if only one condition in a group has to succeed, and it does, mod_rewrite doesn't care about the rest of the group.

I say "condition", lower case, because all of this includes RewriteRule as well as any RewriteConds. If the pattern in the body of the rule says ^foobar and the request doesn't begin in /foobar, the RewriteConds aren't even evaluated. In effect, the first condition failed. You can say something analogous about a pattern in the form (foo|bar), though technically this isn't about mod_rewrite; it's how Regular Expressions work.

What this means is:

If more than one condition has to be met, list them starting with most likely to fail. Not much use making the server run through a long list of things that apply to 1/10 of all requests if the last thing on the list only applies 1/1000 of the time.

If any one of a group of conditions has to be met, list them starting with most likely to succeed.

In each case the object is simply to let the rewrite engine finish its stuff and get out of there sooner.
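Applied to the original rule #4, the principle looks like this: the cheap THE_REQUEST pattern fails for most traffic, so it sits in front of the filesystem-touching -f check.

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.htm\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.htm$ http://www.example.com/$1 [R=301,L]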

Dideved



 
Msg#: 4587284 posted 12:13 am on Jun 26, 2013 (gmt 0)

#7
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]


Rule #7 will probably run faster if you express the pattern as
^([^.]+[^./])$


I feel like there should be an informal rule around here that any talk of performance must come with a benchmark to back it up, because people's gut feelings are notoriously unreliable. And I truly don't mean that as an insult against anyone in particular. This is such a widespread problem in the whole field of computing that it's discussed in some of the most renowned books:

It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
- Knuth


I actually bothered to benchmark the OP's pattern, Lucy's alternative, as well as a .* variation, and they all fell within each other's range of variability. That is, on some runs, OP's pattern was faster; on some runs, Lucy's was faster; and -- yes -- on some runs, even .* was faster. (The .* pattern came with the added benefit that it would work correctly for all valid URLs.)

In addition to none of these patterns being a clear performance winner, it's also worth mentioning that the difference between them was always measured not even in microseconds, but nanoseconds. That difference is so infinitesimally small that for all practical purposes, there is no performance difference.

MarkOly



 
Msg#: 4587284 posted 2:18 am on Jun 26, 2013 (gmt 0)

To me the rule says: If the request is for a named subdomain other that www or ftp or mail, then redirect to www. Come to think of it, was this rule intended for domain-name canonicalization? If so, use the ordinary
!^(www\.example\.com)?$
pattern. (One condition only.) I don't think you need to say anything about mail or ftp at all, unless those really are named subdomains accessed via http.

Yeah, it's for domain-name canonicalization. For some reason, I was thinking of ftp as a subdomain. So no, that's not accessed via http. But webmail is accessed via http. That's what I was thinking of when I used 'mail'. I don't really ever use the webmail. I did need it once when I couldn't access emails the normal way. So I might as well not lock myself out of it. I'll change it to (www|webmail) then.

If more than one condition has to be met, list them starting with most likely to fail. Not much use making the server run through a long list of things that apply to 1/10 of all requests if the last thing on the list only applies 1/1000 of the time.
If any one of a group of conditions has to be met, list them starting with most likely to succeed.
In each case the object is simply to let the rewrite engine finish its stuff and get out of there sooner.

Okay that makes sense. Thanks for explaining it!


In addition to none of these patterns being a clear performance winner, it's also worth mentioning that the difference between them was always measured not even in microseconds, but nanoseconds. That difference is so infinitesimally small that for all practical purposes, there is no performance difference.

That's interesting. It did make me wonder when I thought about the Hosts file software I used to use. You know, the blacklist of bad sites that gets added to your Hosts file so you never access one of them by accident. That thing was gigantic, like 140,000 lines long! If I'm not mistaken, it goes through the entire list with every request. Yet the delay when loading pages was almost imperceptible. I say almost, and that's why I don't use it anymore. My 80-line rule might be spitting in the ocean. I don't know if it's apples to apples though. I'm still going to keep it light at first go just to see what happens.

MarkOly



 
Msg#: 4587284 posted 2:37 pm on Jun 28, 2013 (gmt 0)

So far so good! I put the new htaccess in place and switched over to the extensionless pages late Wednesday night. I'm looking at my logs and there are no errors at all that could be tied to the htaccess code. Just a few image-related errors - a few image file names that I changed - plus the apple-touch-icon.png error, which has been on the bottom of my todo list.

I looked at the 301's and they all seem to be working fine. Googlebot is following the redirects to the new pages without any errors. I noticed one of my old incoming links uses the index.htm version of the home page, so I'm glad I put that rule in there. Speed seems to be fine. I'm not noticing any delays on any of the redirects.

As of now, I'm only using #1, #2, #6, and #7 from above. Sometime in the next couple days, I'm going to add a rule for redirecting htm and html to extensionless - plus the rule redirecting trailing slash to non-trailing slash. I'll post the final code when I'm finished adding rules.

g1smd




 
Msg#: 4587284 posted 10:06 am on Jun 30, 2013 (gmt 0)

Run Xenu LinkSleuth over the site and check for errors.

Also construct a text file list of "good" and "bad" URLs. Duplicate the whole lot for both non-www and www versions. In LinkSleuth set the scan depth to the lowest possible then import that list of URLs and check you get the right results.

MarkOly



 
Msg#: 4587284 posted 5:35 pm on Jun 30, 2013 (gmt 0)

Run Xenu LinkSleuth over the site and check for errors.

Also construct a text file list of "good" and "bad" URLs. Duplicate the whole lot for both non-www and www versions. In LinkSleuth set the scan depth to the lowest possible then import that list of URLs and check you get the right results.

Well, I ran Xenu a couple days ago and got it down to no errors, so there are no bad URLs to add. But I did what you said and duplicated all the www versions as non-www and ran that list through at a depth of 1. That came out with no errors: the non-www URLs were all redirected properly to the www versions. Is that the point of doing this? To verify that the 'non-www to www' rule is doing its job? Also, why add the bad URLs to the list? Wouldn't you want to fix the errors first?

Now that I think about it, maybe I can use Xenu to test some of my other rules once I add them - like htm and html to extensionless. I ran my old list of .htm URLs through and they all look to be redirecting without error.

The one error that persists is for the USPS Express Mail Service Commitment page I have posted on my Shipping support page: [postcalc.usps.com...] It always reports: error code: 503 (temporarily overloaded). I wonder if Google sees that as a broken link? It's strange. The link works. But it always reports as 503.

I need to spend some time torture-testing with different error combinations using web-sniffer.net and awebguy's HTTP Response Header Checker. I know that before I made all these changes, I spent some time doing that and was very surprised at how many error combinations resulted in 200 responses that shouldn't have.

Thanks g1smd!

lucy24




 
Msg#: 4587284 posted 7:52 pm on Jun 30, 2013 (gmt 0)

It always reports: error code: 503 (temporarily overloaded). I wonder if Google sees that as a broken link? It's strange. The link works. But it always reports as 503.

Is the exact wording "temporarily overloaded" coming from Xenu or from the server? It's more precise than the definition of 503 ("service unavailable").

As a human, can you get to the page "cold" by simply typing in the URL, or do you have to follow some kind of procedure?

I was very surprised at how many error combinations resulted in 200-responses that shouldn't have.

If you've got dynamic pages, the response the server sends may not be the response the client receives. Only the received response matters.

MarkOly



 
Msg#: 4587284 posted 4:50 am on Jul 1, 2013 (gmt 0)

It always reports: error code: 503 (temporarily overloaded). I wonder if Google sees that as a broken link? It's strange. The link works. But it always reports as 503.

Is the exact wording "temporarily overloaded" coming from Xenu or from the server? It's more precise than the definition of 503 ("service unavailable").

As a human, can you get to the page "cold" by simply typing in the URL, or do you have to follow some kind of procedure?

"error code: 503 (temporarily overloaded)" is the exact wording coming from Xenu. You can retrieve the page cold by typing in the URL. It's a standard USPS online tool. The same thing happens with the standard USPS Rate calculator: [postcalc.usps.com...] I used to link to that and Xenu would always report it as "error code: 503 (temporarily overloaded)".

g1smd




 
Msg#: 4587284 posted 8:57 am on Jul 1, 2013 (gmt 0)

The reason you feed both good and bad URLs to Xenu in the text file list is to test that the site returns the correct response for both correct and incorrect requests, for wanted and unwanted requests. You add a selection of page names that don't exist, incorrect extensions, unwanted or unnecessary parameters, and so on. Some of my test files have thousands of URLs and can quickly verify that I haven't introduced problems when altering the site configuration.
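A hypothetical slice of such a test file (these URLs are made up for illustration; the idea is to mix known-good pages with deliberately wrong extensions, bogus names, stray parameters, and both hostname variants):

http://example.com/new-folder/new-page
http://www.example.com/new-folder/new-page
http://www.example.com/old-page.htm
http://www.example.com/new-folder/new-page.html
http://www.example.com/new-folder/new-page/
http://www.example.com/no-such-page
http://www.example.com/new-folder/new-page?tracking=123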

MarkOly



 
Msg#: 4587284 posted 5:22 am on Jul 2, 2013 (gmt 0)

The reason you feed both good and bad URLs to Xenu in the text file list is to test that the site returns the correct response for both correct and incorrect requests, for wanted and unwanted requests. You add a selection of page names that don't exist, incorrect extensions, unwanted or unnecessary parameters, and so on. Some of my test files have thousands of URLs and can quickly verify that I haven't introduced problems when altering the site configuration.

That sounds like a great idea. You can have different text files for different categories of error - beats entering URLs one at a time. The only thing I see that could be improved would be reporting the actual response code in the list view, rather than "ok" or "not found". The 301 redirects and the 200s all report as "ok". You can view the redirects in the html report though.

MarkOly



 
Msg#: 4587284 posted 4:51 pm on Jul 25, 2013 (gmt 0)

After reading more threads and doing more testing, I added a few rules to what I had - mainly things to protect me against sloppy incoming links: removing trailing characters and query strings, redirecting https to http. This is working great for me, except for a couple very minor things. I didn't want to mix up rule numbers, so the new rules start with 8:

RewriteEngine On
RewriteBase /

#1 Redirect requests for old URL to new URL
RewriteRule ^old-page\.htm$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.

#2 Redirect index requests in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(\.[a-z0-9]+)?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index(\.[a-z0-9]+)?$ http://www.example.com/$1? [NC,R=301,L]

#8 Redirect remaining .htm or .html requests to extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]

#9 Redirect URLs containing valid characters to remove query string except for specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?#\ ]*)\?[^\ ]*\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

#10 Redirect URLs containing valid characters to remove trailing invalid characters
RewriteRule ^([/0-9a-z._\-]*)[^/0-9a-z._\-] http://www.example.com/$1 [NC,R=301,L]

#11 Redirect URLs containing valid characters to remove trailing punctuation
RewriteRule ^(.*)[^/0-9a-z]+$ http://www.example.com/$1 [NC,R=301,L]

#5 Redirect requests with trailing slash to extensionless URL unless a directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+/\ HTTP/
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(([^/]+/)*[^.]+)/ http://www.example.com/$1 [R=301,L]

#6 Redirect requests for non-www and non-webmail subdomains to www subdomain
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

#13 Redirect https requests to http except for specific file types, folders, and file
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

#7 Internally rewrite extensionless URL requests to .htm file if .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^([^.]+[^./])$ /$1.htm [L]

There's one thing I would like to clean up. On #8, if a bogus .htm or .html request comes in, it gets a 301 first, then cycles through the htaccess again to get its 404. That might be a bad thing from a search engine's perspective. To make that an immediate 404, I can add a file exists check:

#8 Redirect remaining .htm or .html request to extensionless URL if file exists as an .htm version
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]

That will only work for .htm requests though, not .html requests. I'm wondering if I can modify the file-exists check to work for .html requests also? As far as I can tell, it looks like I'll need to add a rule above #8 to convert .html requests to .htm first.

Other than that, most of the punctuation type errors I tested redirected properly, at least the more important ones. At this point, I have 3 important punctuation errors that are still giving me a 200 response:

http://www.example.com/.
http://www.example.com?
http://www.example.com/?

The URLs still work, but with a 200 response. I understand that . and ? are special characters that get ignored by Apache. But from a Google standpoint, those are 3 separate URLs, aren't they? I haven't been able to find a solution for this, and I see that there's a lot of talk about it. Is this just one of those things you have to accept and move on?

I've done a lot of testing on this in Xenu and LiveHTTPHeaders. My error logs are clean. But I'm sure there are things I haven't thought of. So if you see any pitfalls I might be setting myself up for, please let me know.

Thanks!
MarkOly

g1smd




 
Msg#: 4587284 posted 7:27 pm on Jul 25, 2013 (gmt 0)

Do rules 10 and 11 really work as you expect? At first glance, the patterns look ambiguous and prone to mismatching.

I think rule 13 is in the wrong place/wrong order.

MarkOly



 
Msg#: 4587284 posted 11:48 pm on Jul 25, 2013 (gmt 0)

Do rules 10 and 11 really work as you expect? At first glance, the patterns look ambiguous and prone to mismatching.

Yeah, the rules do actually work. I copied them from this thread here: [webmasterworld.com...] - jdMorgan's last post. I did want to put a RewriteCond in #11 due to the (.*). I couldn't come up with it though. It was easy to come up with the RewriteCond for #8 because I basically just regurgitated the RewriteRule. But with a (.*), that's hard to regurgitate. I gave up eventually.

I think rule 13 is in the wrong place/wrong order.

I searched for examples where that rule was mixed in with a few others and just couldn't find any. The https to http discussion seems to occur in a vacuum. So here's the logic I applied: I looked at all the rules I already had (#13 was the last rule I made), and all of those rules were already converting https to http on their own. The only https-to-http hole I had was on actual valid pages, so I had to come up with a rule for that. Since the other rules already convert https to http by default, the only logical place to put #13 seemed to be at the end, right before the internal rewrite rule. Is that flawed? Yeah, I think I do need to do more testing. When I did test in Xenu, I created a lot of text files to test. But I didn't take the approach of looking at my rules and testing them specifically, especially for correct order. I'll have to try that.

Thanks!

lucy24




 
Msg#: 4587284 posted 12:08 am on Jul 26, 2013 (gmt 0)

Last things first:
http://www.example.com/.
http://www.example.com?
http://www.example.com/?

You mean, in RegEx terms,

http://www.example.com/\.
http://www.example.com\?
http://www.example.com/\?
?

:: detour for quick testing in MAMP with two unrelated browsers ::

The last two come out the same: browser adds / after domain name, so what htaccess sees both ways is
^\?

The first magically changes into / alone.

Look at your logs. Some of these may be non-problems: that is, the server never even sees the bad request, because the browser itself has silently edited the input.

:: further experimentation ::

Any literal . immediately after the / seems to disappear. Question marks remain, but are ignored, as are .? combinations.

Now then...

#11 Redirect URLs containing valid characters to remove trailing punctuation
RewriteRule ^(.*)[^/0-9a-z]+$ http://www.example.com/$1 [NC,R=301,L]

Oh say it ain't so ;)

Never mind whether the rule works as intended: Do you even need it? There's a longish list of potential Rules You Don't Need Until You Need Them; others involve things like multiple directory slashes, or garbage after an .html extension (assuming you're not parsing html as something else), or the superfluous punctuation we were just talking about.

Unless you've got an enormous site with armies of fumble-fingered visitors, you probably don't need to clutter up your htaccess with rules aimed at malformed requests that may never actually occur.

#8 Redirect remaining .htm or .html request to extensionless URL if file exists as an .htm version
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]


That will only work for .htm requests though, not html requests. I'm wondering if I can modify the file exists check to work for .html requests also? As far as I can tell though, it looks like I'll need to add a rule above #8 to convert .html requests to .htm first.

Seems like you should be able to say
RewriteCond %{REQUEST_FILENAME}l? -f

Here, again, you're getting into Things You Don't Need Until You Need Them territory. Just how many requests do you get that end in .html? but refer to files which never existed in the first place?




By the time you're done with this, we may be able to compile a whole new set of boilerplate: Rules addressing every possible malformed request, all collected in one convenient location :)

g1smd




 
Msg#: 4587284 posted 7:07 am on Jul 26, 2013 (gmt 0)

I think I would be happier if the Rule 11 pattern were changed, especially replacing the (.*) part with a more specific pattern matching valid URLs.

lucy24




 
Msg#: 4587284 posted 8:22 am on Jul 26, 2013 (gmt 0)

I think I would be happier if....

Dog Bites Man ;)

OK, it's a choice between
(a) look very closely at Rule Eleven
or
(b) look very closely at smooth-reader's report on 459-page Notes volume of Joseph Hall's Selections from Early Middle English, and/or same person's report on Volume I of Gairdner's 1904 edition of the Paston Letters
or
(c) continue hammering away at Zupitza's edition of Aelfric

Rule Eleven it is.

#11 Redirect URLs containing valid characters to remove trailing punctuation
RewriteRule ^(.*)[^/0-9a-z]+$ http://www.example.com/$1 [NC,R=301,L]


The rule is easiest to fine-tune if you're looking at an existing site with known URL patterns. What characters, other than alphanumerics, lowlines and hyphens, will actually occur in the body of an URL? (You may not even use _ lowlines, but they generally count as \w so they are no extra trouble.)

A lot of nasties like commas and periods are technically legal-- but if you don't use them, the whole thing becomes vastly easier. Here assuming that you don't have one of those hypothetical servers that kick up a fuss at the \w locution:

Pattern:
^([\w/-]+(\.\w+)?)?.+

RegEx merrily captures along until it meets something other than alphanumeric, lowline, hyphen or directory slash. If the something-else is a literal period, it can then pick up the period and any subsequent alphanumerics. (This is assuming the request doesn't contain two unrelated forms of garbage, such as a bogus extension or back-to-back directory slashes. Just how fumble-fingered are your human visitors?) Otherwise it's done. It is also done if the very first requested character is something unacceptable.

The special case of a request beginning with a literal period need not be considered, because the config file already has a rule blocking requests for any filename with leading period.

Everything up to this point is captured. If there is anything left over, it is ignored and a redirect is issued for the captured part.

Come to think of it, this rule will also redirect requests with extraneous path info after the extension. This is probably desirable.

MarkOly



 
Msg#: 4587284 posted 3:02 pm on Jul 26, 2013 (gmt 0)

Well, I typed up a response last night but got too bleary-eyed to finish it. I was going to finish it this morning, then you guys snuck in and submarined me. ;) Here's what I had, minus the rule #11 stuff, which I'll add at the end.

Seems like you should be able to say
RewriteCond %{REQUEST_FILENAME}l? -f

OMG I thought of that but didn't think it could possibly work! I'll try it.

Here, again, you're getting into Things You Don't Need Until You Need Them territory. Just how many requests do you get that end in .html? but that refer to files which never existed in the first place?

Actually, that rule (#8) is there to catch requests for valid pages that accidentally get tagged with .htm or .html. I noticed that the bogus requests were getting 301'd, then 404'd, when I was testing with Xenu - so I thought it would be a good idea for those to get the immediate 404.

Any literal . immediately after the / seem to disappear. Question marks remain, but are ignored, as are .? combinations.

But to Google, aren't those 3 separate URLs? What I was thinking about was a potential duplicate content issue that you can't really do anything about.

Okay new day:

Right now, I don't have any real problems with errors. But I did want to plan for the future when I grow up and start writing articles and people start linking to me from message boards. So I did want to put in rules for the more common errors. I was definitely drawing the line at the double slash rule. I thought about it. But resisted. That's the one I was teetering on, then pulled the plug.

The trailing punctuation errors seem obvious to me. When I email a link to somebody, I usually paste the link, then immediately follow it with a period, comma, etc. I don't know how easy it is for software to mess that up and lump the punctuation in with the link. But I think it's worth being prepared for the most common ones. So just the punctuation marks somebody may add immediately after pasting a link: . , ? ! ; :

^([\w/-]+(\.\w+)?)?.+

I didn't know about the \w option. That makes that easier. I'll just try this and see what happens. I'll take out #10 and #11 and plug it in and give it a go. Thanks Lucy!

lucy24




 
Msg#: 4587284 posted 9:52 pm on Jul 26, 2013 (gmt 0)

I think about when I email a link to somebody, I usually paste the link, then immediately follow it with a period, comma, etc. I don't know how easy it is for a software to mess that up and lump the punctuation in with the link.

That's a good point. It happens all the time with auto-generated links in some forums: once you've met http: or www. the auto-linking continues to the first space. And then you really do get a 404 if there happened to be punctuation at the end. But the problem tends to be with longer more complicated links, not the bare domain name. And you've got code for those longer links.

The googlebot does a lot of weird things but I don't think it ever asks for URLs with gratuitous trailing punctuation; when there is punctuation at the end, it's obviously from someone else's link.

I thought of that but didn't think it could possibly work

Well, don't take my unsupported word for it. Report back, one way or the other, and then everyone will know for sure.

g1smd




 
Msg#: 4587284 posted 10:45 pm on Jul 26, 2013 (gmt 0)

Yes, the appending of trailing commas, periods and other punctuation to links found in forums and emails is a real-world problem that should be catered for. It's a rare site that never encounters it.

MarkOly



 
Msg#: 4587284 posted 6:07 am on Jul 27, 2013 (gmt 0)

Reporting back. This one didn't work. But I think I see why.

RewriteCond %{REQUEST_FILENAME}l? -f

The left-hand side of a RewriteCond isn't a pattern - the variables get expanded and everything else is taken literally. So %{REQUEST_FILENAME}l? expands to something like example.com/page.html? or example.com/page.htmll?, depending on whether .htm or .html was requested, and neither of those is a valid file, so the -f check never passes. I think that's how it works anyway.

So I should probably just leave it without the file-exists check. There is still the possibility of a bogus htm/html request getting a 301, then a 404. I thought I read that Google doesn't like it when a 301 redirect leads them to a 404. For me, this error would be rare. It would require somebody using a bad file name plus adding htm/html - two separate mistakes in unison. For somebody else using this code though, that might be a caveat. If you expect that dual error to happen frequently - maybe from a malicious competitor - then it's a hole that might need to be plugged. And if Google doesn't care about the 301 to 404, then who cares.
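For the record, one way to make a single check cover both .htm and .html requests would be to test the rule's own capture against the .htm file on disk instead of using REQUEST_FILENAME. This is only a sketch, not something tested in this thread, and it assumes the site lives at the DOCUMENT_ROOT:

# Hedged sketch: $1 is the extension-stripped capture from the RewriteRule below,
# so one -f check covers both .htm and .html requests
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteCond %{DOCUMENT_ROOT}/$1.htm -f
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]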

This one here, I want to make sure I used it right:

# Redirect URL containing valid characters to remove trailing characters
RewriteRule ^([\w/-]+(\.\w+)?)?.+ http://www.example.com/$1 [R=301,L]

In that form, it's causing the last character of a valid URL to get truncated. So example.com/page is 301 redirecting to example.com/pag

That rule is a different animal than what I'm used to, so I didn't see anywhere to go. The .+ at the end is something I haven't seen before. I've seen it as (.+) - I guess there's probably not a difference. It does look like maybe this (\.\w+)? is missing a ^ somewhere to tell it 'no'?

lucy24




 
Msg#: 4587284 posted 7:17 am on Jul 27, 2013 (gmt 0)

it's causing the last character of a valid url to get truncated. So example.com/page is 301 redirecting to example.com/pag

Oh, ###. That means it's backtracking when it doesn't have to. (Where "doesn't have to" = Look, mod_rewrite, you don't have to execute this rule, you can just say it doesn't fit.)

Try this:

^([\w/-]+(\.\w+)?)?[^a-zA-Z\d].*

I would say simply \W in place of [^a-zA-Z\d]. Except, 1, I'm not sure all servers recognize the form, and 2, I kinda think you don't want _ as the last character in an extension. (Can it occur? You can definitely have numerals along with letters.)

If it's example.com/page.html, then adding l? would be adding a second l that's an optional l. So you'd be left with example.com/page.html or example.com/page.htmll. Neither is a valid file.

Ah, interesting. I wasn't sure if it would work at all. Sounds like appending a plain l (without the ?) would have been the right approach if your filenames had the opposite configuration: everything ends in .html but some people are asking for .htm

:: detour to test site's htaccess ::

Oh, that's why it works for me. Completely different pattern.
RewriteRule ^(([^./]+/)*[^./]+\.(html|php))/ http://www.example.com/$1 [R=301,L]

First, I limited the rule to specific extensions. Second, I named a specific character to come after the extension-- though I could have said .+ at this point. For some reason, only my test site has ever been plagued with bogus path info. Others may have tried-- but only from IPs that were already blocked anyway.

I generally constrain RewriteRules to requests for pages: patterns like

(^|\.html|/)$

This is admittedly easiest to achieve in access-control rules where you don't need to capture things; all you're deciding is whether the guy gets in at all. It's very rare for a malign robot to walk in off the street and start demanding image files, so all of those just whizz on past the server. (On my main site they're actually inside a FilesMatch envelope-- images inside, everything else outside-- but officially you are not supposed to do this. Nobody told me.)
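A hedged sketch of that kind of page-only access-control rule (the user-agent string is a made-up placeholder):

# Deny page requests from a hypothetical unwanted robot; requests for images,
# css and other non-page files don't match the pattern and sail straight past
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
RewriteRule (^|\.html|/)$ - [F]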

MarkOly



 
Msg#: 4587284 posted 6:44 pm on Jul 27, 2013 (gmt 0)

This one is really close:

# Redirect URL containing valid characters to remove trailing invalid characters
RewriteRule ^([\w/-]+(\.\w+)?)?[^a-zA-Z\d].* http://www.example.com/$1 [R=301,L]

It redirects the trailing character requests with a 301. Works for trailing periods. Only problem is that valid page requests redirect to themselves.

I've been monkeying around with this in the meantime myself. This would all be really easy if I didn't care about redirecting trailing periods. #10 seems to do everything I want, except for trailing periods. So I thought about it. By the time it gets to this rule, the only valid requests containing a period are going to be images, css, etc. So if I were to rule out those things, then I could modify #10 to also work for periods (by removing the two periods from the pattern):

# Redirect URL containing valid characters to remove trailing invalid characters except for specific file types and folders
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^([/0-9a-z_\-]*)[^/0-9a-z_\-]+$ http://www.example.com/$1 [NC,R=301,L]

It does work - and for multiple trailing chars, after I got daring and added the + before the $ in the pattern. But am I skating on thin ice here? When I first tested the above RewriteRule without the RewriteCond ruling out css, etc., things went haywire. So this seems like it could be really dangerous. I haven't tested it extensively, but so far so good. The only issue I found so far is that something like example.com/bogus. gets a 301 first to example.com/bogus, and then example.com/bogus gets its 404.
