Cruft Free URLs for Dummies

How to implement extensionless URLs?

         

Asia_Expat

4:31 pm on Jan 28, 2009 (gmt 0)

10+ Year Member



I've spent the last three hours reading through loads of WW threads in an attempt to put together a htaccess file to set up cruft free URLs on my established website... I realise that those who have the answers appreciate people who make the effort to figure things out for themselves before posting questions...

... but I've mentally crashed I'm afraid and I need my hand holding. I've decided to move to cruft free but I want to plan it very carefully. I will test on a specific directory for a few weeks (to see how search engines react) before rolling out to the whole website. After the test, I will add the code into the httpd.conf file for efficiency, so I need something that will work there as well as if it was placed in a subdirectory.

My pages are a mixture of html, xhtml and php extensions, so those are the ones I need to 301 redirect to cruft free.

If someone can help me (and comment the code so I can learn as well) I think this would make a good 'cruft free' thread for dummy webmasters like me.

This is what I have so far... am I even getting close?...

RewriteCond %{REQUEST_URI} !\.[a-z0-9]+$
RewriteCond %{REQUEST_FILENAME}.(php(4|5)?|html?|xhtml?) -f
RewriteRule ^(.*)$ /$1.html [L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.(php(4|5)?|html?|xhtml?)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php(4|5)?|html?|xhtml?)$ http://www.example.com/$1 [R=301,L]

jdMorgan

5:37 pm on Jan 28, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You really can't combine handling all these filetypes into two rules, unless the actual files are all of one type. In other words, you are rewriting all cruft-free URLs to .html files using this code:

RewriteCond %{REQUEST_URI} !\.[a-z0-9]+$
RewriteCond %{REQUEST_FILENAME}.(php(4|5)?|html?|xhtml?) -f
RewriteRule ^(.*)$ /$1.html [L]

Is that really what you want to do? What about your php and xhtml files?

If not, consider which filetype is most-often-requested and which filetypes you may come to prefer over time. This will determine the order in which you want to check for existing files of each type.

Let's leave off the second rule (the redirect) for now, as it's best to develop and test one step at a time to avoid confusion at many levels.

Also, the code *can* be made to work in both httpd.conf and .htaccess. But I'd suggest developing the code in your root .htaccess file, and then moving to httpd.conf after you've got it debugged. Doing it this way, the primary (if not only) change is that you'll need to add a slash to the beginning of your RewriteRule patterns for use in httpd.conf. (Others prefer to start each pattern with "^/?" so that it works either way, and you can do that too if you prefer -- and then remove the "?" after you move the code.)
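
A minimal sketch of the "^/?" form described above. The names old-page and new-page.html are hypothetical, chosen only for illustration. Because the leading slash is optional, the same line should match whether RewriteRule is handed "old-page" (root .htaccess) or "/old-page" (httpd.conf):

```apache
# Hypothetical page names, for illustration only.
# In /.htaccess the pattern is tested against "old-page";
# in httpd.conf it is tested against "/old-page". "^/?" accepts both.
RewriteRule ^/?old-page$ /new-page.html [L]
```

Once the code lives permanently in httpd.conf, the "?" can be dropped so the pattern reads "^/old-page$".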

Jim

g1smd

12:21 am on Jan 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't forget that you'll also need to change the URLs in your internal links to point to the new URL formats.

Asia_Expat

9:07 am on Jan 29, 2009 (gmt 0)

10+ Year Member



For simplicity, I guess I could just rename all files to .html and be done with it. If the user/bots are only ever going to see the cruft free version anyway, it shouldn't matter if I do that, as long as everything else is in order technically.

I'm actually having trouble getting this to work, so I can't even test for bugs just yet. I wonder if there's anything in the existing htaccess conflicting... I guess the third and fourth lines will no longer be required if I get the cruft free set up working properly...

rewritecond %{http_host} ^example.com [nc]
rewriterule ^(.*)$ http://www.example.com/$1 [r=301,nc]
rewritecond %{the_request} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.xhtml\ HTTP/
RewriteRule index\.xhtml$ http://www.example.com/%1 [R=301,L]
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?beta.example\.com [NC]
RewriteRule \.(jpe?g|gif|bmp|png)$ - [NC,F]

[edited by: Asia_Expat at 9:08 am (utc) on Jan. 29, 2009]

Asia_Expat

9:47 am on Jan 29, 2009 (gmt 0)

10+ Year Member



WAIT!...

I've now appended the following to the htaccess file listed above in the root directory and the cruft free version is now working...

RewriteCond %{REQUEST_fileNAME} !-d
RewriteCond %{REQUEST_fileNAME} !-f
rewriterule ^(([^/]+/)*[^./]+)$ /$1.html [L]

So... give me 30 mins or so and I'll try and figure out the 301 for myself...

Asia_Expat

10:01 am on Jan 29, 2009 (gmt 0)

10+ Year Member



OK... with the root directory htaccess, I've got cruft frees working AND got 301 headers being produced for the old .html extensions to the new cruft free versions and it looks like this... (I wonder if you could carefully analyse what I've done and tell me where I'm asking for trouble)...

RewriteCond %{REQUEST_fileNAME} !-d
RewriteCond %{REQUEST_fileNAME} !-f
rewriterule ^(([^/]+/)*[^./]+)$ /$1.html [L]
rewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.html\ HTTP/
rewriteRule ^([^.]+)\.html$ http://www.example.com/$1 [R=301,L]

Also, can you see any potential conflicts with what is already in the htaccess (posted a couple of posts above).
Further, I read in a few posts around the forum that search engines can sometimes add a slash to the end of cruft free URLs. What is the potential for this to happen to me and how can I prevent it?

jdMorgan

4:19 pm on Jan 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Some of the rules are in the wrong order, and you are playing fast and loose with capitalization -- something that I just can't recommend. You've also got some unnecessary redundancy. Also, appended query strings would break your "index.xyz" and "cruft-remover" redirect rules, and hostnames with appended FQDN periods or port numbers would not be canonicalized.

I'd suggest:


# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLS with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Internally rewrite extensionless URL requests to .html files unless
# the requested URL-path resolves to an existing directory
RewriteCond %{REQUEST_FILENAME}/ !-d
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

If any broken pipe "¦" characters appear in the code above, replace them with solid pipe "|" characters before use; posting on this forum modifies the pipe characters.

Generally, place external redirect rules first, in order from most-specific pattern (least URLs affected) to least-specific pattern, followed by internal rewrites, again in order from most- to least-specific.

Putting the external redirects first avoids having a redirect 'expose' the internal filepath resulting from a previous internal rewrite, and putting the most-specific rules first avoids chained or stacked redirects and rewrites.

Concise, accurate comments in the code are a very good thing.

Jim

Asia_Expat

8:25 am on Jan 30, 2009 (gmt 0)

10+ Year Member



([xs]?html?|php[456]?)

.... This is elegant, I didn't realise I could achieve this by listing the multiple file types (i.e. shtml, xhtml, html) WITHIN the square brackets in conjunction with another multiple choice, rather than listing them separately.

I've now tested the following on my server and as far as I can tell, everything is functioning as intended, EXCEPT that only the .html pages are being internally redirected due to the last line. I tried changing the 'html' to ([xs]?html?|php[456]?) and fiddled around for a while but couldn't make it work...

# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLS with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Internally rewrite extensionless URL requests to .html files unless
# the requested URL-path resolves to an existing directory
RewriteCond %{REQUEST_FILENAME}/ !-d
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

---- Edit reason: Forgot to exemplify domain name.

[edited by: Asia_Expat at 8:28 am (utc) on Jan. 30, 2009]

Asia_Expat

2:33 pm on Jan 30, 2009 (gmt 0)

10+ Year Member



Just thought of something else too...
I'm presently going through all the files with a copy of the website on my local drive and I guess I should also be taking the .html from my common includes as well... for example...

<?php include($_SERVER['DOCUMENT_ROOT'] . "/head-insert.html"); ?>

change to...

<?php include($_SERVER['DOCUMENT_ROOT'] . "/head-insert"); ?>

Am I correct?

[edited by: Asia_Expat at 2:33 pm (utc) on Jan. 30, 2009]

jdMorgan

4:16 pm on Jan 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



 ([xs]?html?|php[456]?) 

.... This is elegant, I didn't realise I could achieve this by listing the multiple file types (i.e. shtml, xhtml, html) WITHIN the square brackets in conjunction with another multiple choice, rather than listing them separately.

The "multiple types" are not inside the square brackets. Rather, it reads, "Match if (this part of the client's HTTP request line) begins with an optional 'x' or 's', followed by 'htm', followed by an optional 'l' OR if this part of the request line begins with 'php' followed by an optional '4', '5', or '6'." The square brackets define a group (a list) of alternate acceptable characters, and the trailing question mark means that the preceding character, alternate group, or parenthesized sub-pattern is optional.
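
To make that reading concrete, here is a small annotated sketch listing what the sub-pattern accepts when it is anchored with "\." in front and "$" behind, as in the rules above. The environment variable name PAGE_EXT is hypothetical and exists only so the line is a complete directive:

```apache
# Extensions accepted by  \.([xs]?html?|php[456]?)$
#   [xs]?html?  ->  .htm  .html  .xhtm  .xhtml  .shtm  .shtml
#   php[456]?   ->  .php  .php4  .php5  .php6
# Not accepted: .xml, .phps, .html5 (extra or different characters
# before the "$" anchor make the match fail)
#
# "-" means "no substitution"; the matched extension is only recorded
# in a (hypothetical) environment variable.
RewriteRule \.([xs]?html?|php[456]?)$ - [E=PAGE_EXT:$1]
```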

I've now tested the following on my server and as far as I can tell, everything is functioning as intended, EXCEPT that only the .html pages are being internally redirected due to the last line. I tried changing the 'html' to ([xs]?html?|php[456]?) and fiddled around for a while but couldn't make it work...

I'm not sure what you mean here. If you mean the rewriterule in the "last line" of the code just posted, then you cannot put a regular expressions pattern into a substitution filepath -- that won't work. Only .html files are supported by this code, because of the structure of the code, not because of the regex pattern(s); All extensionless URL requests are rewritten to .html files, as documented by the comments.

Also, be clear on the "direction of action" here: An incoming HTTP client request for a URL matching <zero or more directory-levels>/<page-name> is rewritten to the internal filepath /<zero or more directory-levels>/<page-name>.html.

In other words, this code says the server should serve the physically-existing file <page-name>.html when the URL <page-name> is requested, as long as there is no directory at <zero or more directory-levels>/<page-name>/. So, there is no "choice" about the .html extension on the existing file. As I stated at the outset, the code gets more complicated (and less efficient, too) if multiple file types need to be supported. But we need to understand the requirements here, and we also need to keep a mental wall between what is a URL and what is a filepath, otherwise, things won't make sense.

 I guess I should also be [removing] the ".html" from my common includes as well... for example... 

Probably not. I assume that this PHP include is a "file read" and takes place entirely within the server. Code in .htaccess does not apply to anything except HTTP requests, and will (and can) have no effect on filesystem reads. Again, we're looking at the "wall" between HTTP URLs on the Web, versus filesystem activity within the server.

If it helps, you can think of mod_rewrite (and indeed, Apache itself) as one big URL-to-filepath translator. The input to mod_rewrite is a URL-path (everything after the domain name, except for appended query strings). If the rule is an internal rewrite, the output from mod_rewrite is a filepath; The server then enters the content-handling phase, and sends the contents of that file to the client in the response-body of the HTTP response, along with a 200-OK response code. (If the rule is a redirect, the mod_rewrite output is another URL, which is sent directly to the client with 301 or 302 response code and no response body {i.e. no "page content"}, and "tells" the client to ask for the resource again at this new URL, but we are discussing internal rewrites here right now, and of course, the above description is simplified.)
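
A side-by-side sketch of the two kinds of rule just described, written for a root .htaccess file. The page names about and old-about are hypothetical and used only to show the difference in behaviour:

```apache
# Internal rewrite: URL-path in, filepath out.
# The client asks for /about and receives the contents of about.html
# with a 200 response; the URL the client sees does not change.
RewriteRule ^about$ /about.html [L]

# External redirect: URL-path in, new URL out.
# The client receives a 301 response with a Location header and no page
# content, then makes a second request for the new URL.
RewriteRule ^old-about$ http://www.example.com/about [R=301,L]
```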

In the simplest case (without mod_rewrite or other modules), the server takes URLs requested via HTTP from the Web and translates them into filepaths so that the operating system's disk and file handlers can read the file associated with the requested URL. Then the server sends the content of that file to the client. The basic purpose of a server is to allow the use of URLs on the Web instead of filepaths, so that the client does not have to know the details of what operating system and directory architecture the server uses.

OK, with all that in mind, do you have existing filetypes on this server other than ".html" or not?

If so, which of them are the most prevalent, and which of them are requested the most often? (The accuracy of your answer will affect performance but not functionality; A precise answer based on logs is good, but an educated guess will do.)

Jim

[edited by: jdMorgan at 4:17 pm (utc) on Jan. 30, 2009]

Asia_Expat

4:40 pm on Jan 30, 2009 (gmt 0)

10+ Year Member



Jim, I know it's frustrating trying to explain air to vacuum, I do it daily in my field. Thank you for your patience.

That's quite a post... and it's 23:30 HRS here in Chiang Rai, so I'll digest it properly when I wake up tomorrow (+ I've had a couple of beers with my wife)...

... but to answer your question, 70'ish percent .html... 29'ish percent .xhtml... 1 percent .php
.html is requested most often.

jdMorgan

5:26 pm on Jan 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, in response to an extensionless request, the server must look for a corresponding existing .html file, then a .xhtml file, and finally a .php file, and then rewrite to the appropriate file. Assuming you don't have any 'overlaps' between identically named files of different types, this will work. If you do have overlaps, you will have to eliminate them by changing the extensionless URLs, because otherwise these overlaps present an unresolvable problem to the server:

# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLS with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
### NEW/CHANGED STUFF STARTS HERE ###
#
# Skip the following three rules if the requested URL-path has a file extension, if it is
# blank (i.e. a "homepage" request), or if it exists as a directory when a slash is appended
RewriteCond $1 \.[a-z0-9]+$|^$ [NC,OR]
RewriteCond %{REQUEST_FILENAME}/ -d
RewriteRule (.*) - [S=3]
#
# Internally rewrite extensionless URL request to existing .html file
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule (.+) /$1.html [L]
#
# Internally rewrite extensionless URL request to existing .xhtml file
RewriteCond %{REQUEST_FILENAME}.xhtml -f
RewriteRule (.+) /$1.xhtml [L]
#
# Internally rewrite extensionless URL request to existing .php file
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]

The purpose of the "skip" rule construct is to prevent unnecessary and very CPU-intensive filesystem checks for "directory-exists." The performance savings are usually well worth the added complexity. The skip rule does not check the filesystem if the URL has a file extension or if the URL-path is blank. This prevents wasting CPU cycles on requests for image files or for your home page, which are usually the majority of requests. With that skip-rule in place, the individual "filetype rules" now no longer need to check for an extension, so their patterns have been simplified to improve performance as well.

As usual, if any broken pipe "¦" characters appear in the code, replace them with solid pipe "|" characters before use; posting on this forum modifies the pipe characters.

Jim

[edit] Corrected as noted in following post. [/edit]

[edited by: jdMorgan at 5:17 pm (utc) on Feb. 1, 2009]

Asia_Expat

5:13 pm on Feb 1, 2009 (gmt 0)

10+ Year Member



Do I get any points for spotting your error? :-D
... the section that set up the 'skip next 3 rules' has three RewriteConds, the third one should be a RewriteRule.

Thanks for this work of art htaccess file... however, there is a problem I didn't expect. Every page of my forum now redirects to the forum homepage.
http://www.example.com/forum/index.php?showtopic=123456789

I tried removing the last rule but that didn't fix it. If this is something to do with the attempt to cruft free the .php files, there are actually only two or three pages with that extension (except for the thousands of forum pages of course) and they can be changed to .html if necessary in order to make this work... but I can't figure it out at this stage.

[edited by: Asia_Expat at 5:14 pm (utc) on Feb. 1, 2009]

Asia_Expat

5:16 pm on Feb 1, 2009 (gmt 0)

10+ Year Member



I'm guessing the solution might be a rule to ignore the /forum/ directory?

jdMorgan

5:24 pm on Feb 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Both the first and second rule would redirect http://www.example.com/forum/index.php?showtopic=123456789, first because it's a request for "/index.php" and second because it's a request for a URL with a file extension.

So, either both rules must be modified to exclude the /forum/index.php URL-path, or the links must be changed to /forum/?showtopic=123456789

This is up to you to decide, based on what you can do (with that forum software) and what you want to do.

Well-spotted on the RewriteRule directive. I corrected it above to prevent drive-by copy-and-pasters from copying bad code. Thanks for noting it.

Jim

g1smd

5:26 pm on Feb 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yep. Either use a negative RewriteCond, or else make the left side of the Rule more specific.
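
As a rough sketch of the second option (a more specific left side), the extension-removing redirect could exclude /forum/ in the pattern itself with a negative lookahead. This is an illustration under assumptions, not tested code: it assumes Apache 2.x (whose regex engine supports lookaheads), a root .htaccess file, and that the "index" redirect is handled separately in the same way:

```apache
# Only act on direct client requests that carry a "page" extension
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^?\ ]+\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
# "(?!forum/)" makes the pattern fail for any URL-path starting with "forum/",
# so forum URLs are never redirected. $1 is the path without its extension.
RewriteRule ^(?!forum/)(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)$ http://www.example.com/$1? [R=301,L]
```

The negative RewriteCond approach does the same job and is usually easier to read when several directories need excluding.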

Asia_Expat

5:31 pm on Feb 1, 2009 (gmt 0)

10+ Year Member



I'm unwilling to make changes to the forum software because the forum URLs rank so well for me (I have duplicate content under very tight control and the forum brings me a large chunk of my traffic). Also, the forum vendors are developing a version of their own friendly URLs on the next release of the software. I want to see what they come up with for that.

I'm trying to think for myself here... if I was to add the following before everything, would the offending rules be prevented from affecting the forum directory?...

RewriteCond %{REQUEST_URI} "/forum/"
RewriteRule (.*) - [S=2]

jdMorgan

5:58 pm on Feb 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could do that, but as I noted in another recent thread "skip" rules are problematic to maintain: If you add another rule that also needs to be "skipped" but you forget to update the "skip count," then you can get unexpected and hard-to-find problems.

So I suggest just explicitly excluding the "forum/" URL-paths with a couple of new RewriteConds:


# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLS with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]

Jim

Asia_Expat

7:06 am on Feb 2, 2009 (gmt 0)

10+ Year Member



With the addition of the above RewriteConds, everything appears to be in order (although I haven't yet specifically checked that in each and every instance a 301 header is required, it is being properly produced).

So, as this is a 'cruft free for dummies' thread, let's recap on where we're at...

* All three file types (html, xhtml, php) are redirecting externally to cruft free version
* All index pages are still being correctly 301'd to the / of that directory, as they should be.
* /forum/ directory is unaffected by cruft modifying rules. Any directory can be excluded in this way as necessary.
* Make a backup of the website and start the mind-numbing task of removing the extensions from all internal links, so that you're ready to upload and flick the switch to cruft free (make a second copy of the original site architecture to a separate drive, just to protect yourself from being a dummy).

---------------------------------------

Regarding moving to cruft free URLs, have I covered every obvious eventuality? As most Apache servers will be set up in a very similar way to my own server, I guess this thread should help most people that want to do this... but is there anything, ANYTHING at all that anyone can think of, that I should consider carefully before making this move... it's a big decision that could bring a site to its knees in search engines if you get it wrong.

---------------------------------------

I'm now reading up on the cPanel documentation for instructions on modifying the httpd.conf file, as I'm worried the changes would be reversed if the server was restarted, thus breaking everything. Perhaps someone can offer a brief idiot's guide to making changes there. Can I just paste the above htaccess rules into the httpd.conf file in exactly the same format?

jdMorgan

2:36 pm on Feb 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just remember that the URL-paths "seen" by RewriteRule in .htaccess are "localized" to the current .htaccess file's directory. So if the .htaccess file is located at example.com/.htaccess, then the "/" is removed, and if it is located at example.com/subdir/.htaccess, then "/subdir/" is removed. In httpd.conf or other server configuration files RewriteRule will always see the entire URL-path.

So this means that RewriteRule patterns in httpd.conf or other server config file must start with a slash, as contrasted to those in /.htaccess, which must not include the leading slash.
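
A hedged sketch of the same internal rewrite in both locations. One extra difference is worth flagging as an addition to the point above: in server-config context the request has not yet been mapped to the filesystem, so %{REQUEST_FILENAME} does not yet hold a full filesystem path there; this sketch builds the path from %{DOCUMENT_ROOT} instead. Treat it as an illustration, not a drop-in replacement:

```apache
# In /.htaccess: pattern has no leading slash
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

# In httpd.conf (main server or <VirtualHost> section): pattern starts
# with a slash, and the file-exists test is built from DOCUMENT_ROOT
# plus the path captured by the RewriteRule pattern.
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^/(([^/]+/)*[^./]+)$ /$1.html [L]
```

Only one of the two forms would be active at a time; they are shown together purely for comparison.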

As for server-config dependencies, the server must have mod_rewrite installed and loaded. If this "cruft-free URL" code is to reside in /.htaccess (or any .htaccess file), then AllowOverride Options FileInfo (or AllowOverride All) must be set in the server config file so that .htaccess files will be allowed to set options and modify the filepath, and Options +FollowSymLinks or +SymLinksIfOwnerMatch must be set to enable mod_rewrite. I'm sure there are other dependencies -- These are just the ones that come to mind immediately.
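
The dependencies listed above might look like this in the server config. This is a minimal sketch: the directory path is hypothetical and the module path shown is the conventional one, which may differ on your build:

```apache
# httpd.conf: load mod_rewrite (conventional module path; may differ)
LoadModule rewrite_module modules/mod_rewrite.so

# Hypothetical document root; substitute your own path
<Directory "/var/www/example">
    # mod_rewrite in .htaccess needs one of FollowSymLinks
    # or SymLinksIfOwnerMatch enabled
    Options +FollowSymLinks
    # Allow .htaccess files to set options and use rewrite directives
    AllowOverride Options FileInfo
</Directory>
```

With that in place, the .htaccess file itself still needs "RewriteEngine on" before the first rule.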

Jim

Asia_Expat

1:40 pm on Feb 3, 2009 (gmt 0)

10+ Year Member



I've now rolled out this change (I'm using htaccess for now) and everything appears to be in order and I'm testing every eventuality I can think of...
One problem I have noticed (and it was there even before I went 'cruft free') is that additional slashes are not triggering a 404 error. For example...

www.example.com/omg//googly-moogly

Produces exactly the same content (with 200 OK response) as...

www.example.com/omg/googly-moogly

The potential for competitors to link to these and tank the website is enormous. I don't know what to do about it.
Maybe this should be a separate thread? I didn't want to post my htaccess all over again. Mods feel free to split this topic if you feel it necessary.

[edited by: Asia_Expat at 1:41 pm (utc) on Feb. 3, 2009]

g1smd

2:02 pm on Feb 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Double slash, and trailing punctuation and/or port number canonicalisation, have been covered in many previous threads. It's just a couple of lines of code, but the trick is to try to incorporate this without making any redirection chains for a request that has multiple problems that need correcting.

Asia_Expat

2:16 pm on Feb 3, 2009 (gmt 0)

10+ Year Member



OK, just to complete the thread, here's a link to a great thread by Jim... thanks to both of you for your help...
[webmasterworld.com...]

Asia_Expat

5:09 pm on Feb 3, 2009 (gmt 0)

10+ Year Member



Actually, I've been working on this and want to post my entire htaccess file... Please see the 'New Stuff' in the middle and the # comments I added. As you can see, I'm attempting to remove // or ///// etc but it's only working on slashes immediately after the .com
Also, I'm attempting to remove ? from the end of URLs but it's not working for ?widgets or ?abc etc etc
Note that I need to protect the /forum/ directory from being affected by attempts to remove ? due to the database driven forum...

See what you think...

AddType application/x-httpd-php .html .xhtml
Options -Indexes

<Limit GET POST>
order deny,allow
deny from all
allow from all
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all
</Limit>
AuthName example.com

RewriteEngine on
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLS with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
########### NEW STUFF
#########################
# Here, I'm attempting to remove two or more contiguous slashes from anywhere in the URL
# but it's only working for multiple slashes directly after the .com
# Slashes in deeper directories are not being fixed
RewriteRule ^(([^/]+/)*)/+(.*)$ /$1$3 [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ //+([^\ ]*)
RewriteRule .* /%1 [R=301,L]
#
# Remove ? from the end of URLs
# It works for a single ? but if anything else is added (e.g. ?fluff) the URL still resolves as duplicate content
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
########################
#####################
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Skip the following three rules if the requested URL-path has a file extension, if it is
# blank (i.e. a "homepage" request), or if it exists as a directory when a slash is appended
RewriteCond $1 \.[a-z0-9]+$|^$ [NC,OR]
RewriteCond %{REQUEST_FILENAME}/ -d
RewriteRule (.*) - [S=3]
#
# Internally rewrite extensionless URL request to existing .html file
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule (.+) /$1.html [L]
#
# Internally rewrite extensionless URL request to existing .xhtml file
RewriteCond %{REQUEST_FILENAME}.xhtml -f
RewriteRule (.+) /$1.xhtml [L]
#
# Internally rewrite extensionless URL request to existing .php file
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]

ErrorDocument 404 /404.php

jdMorgan

5:22 pm on Feb 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Canonical domain name is missing from first new rule, and the redirect rules are not in order from most-specific to least-specific pattern (the domain canonicalization redirect should almost always be the last one, since it affects *all* URL-paths unless the requested hostname is canonical).

You can add "?" to the end of the URLs in all preceding redirects to unconditionally remove query strings if you like; As long as you don't use query strings on *any* URL on the site (beware of host-provided server and script-support control panels possibly using query strings, though), this is safe and does not require a new rule.
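
A small sketch of what the trailing "?" does, using hypothetical page names in a root .htaccess file. By default a redirect carries the original query string over to the new URL; ending the substitution with "?" sends the client to the new URL with no query string:

```apache
# Query string passed through (default behaviour):
#   /old-page?x=1        ->  http://www.example.com/new-page?x=1
RewriteRule ^old-page$ http://www.example.com/new-page [R=301,L]

# Query string cleared by the trailing "?":
#   /other-old-page?x=1  ->  http://www.example.com/new-page
RewriteRule ^other-old-page$ http://www.example.com/new-page? [R=301,L]
```

The index and extension redirects posted earlier in this thread already end in "?", and because each is guarded by a "!^forum/" condition, those rules never fire for /forum/ requests, so forum query strings are left alone. On Apache 2.4 and later, the [QSD] flag is an alternative way to discard the query string.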

There are better "double-slash removers" posted here. More later, if I get the time.

Jim

Asia_Expat

3:05 pm on Feb 5, 2009 (gmt 0)

10+ Year Member



I took a couple of days to update this thread because I was having real trouble figuring out a good set of rules to fix up double slashes (yes, I made a copy paste error in my above post). Every method I tried, including examples from WW either shifted the double slash up or down to the next directory, or even added another... it was bizarre.

Anyhooo... I found one that 'appears' to function perfectly... It removes double slashes, triple, 10, 20, 90 slashes... even if the extra slashes are added on multiple directories...

# 301 to fix double slash in URL path
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# 301 to fix multiple slashes before URL path
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ //+([^\ ]*)
RewriteRule .* http://www.example.com/%1 [R=301,L]

----------------------------------

Jim,
Regarding the blank query... I've been staring at my htaccess for about 45 minutes and I'm sorry to say I just don't understand what you mean, sorry, I tried. I have a /forum/ directory in which every URL has a ? mark... so if I figure out where to add the "?" as you suggest, will my forum directory still be protected by the...
RewriteCond %1 !^forum/
... I'm guessing it would... but I'm seeing stars where my infinity signs should be and เ where my ไ should be :(

[edited by: Asia_Expat at 3:06 pm (utc) on Feb. 5, 2009]

g1smd

3:22 pm on Feb 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Note that
(.*)//(.*)
is brutally inefficient.

I would at least replace the first

(.*)//
with
 ([^/]+/)+/
or similar.
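
One way to apply that suggestion, offered as a sketch rather than a tested drop-in: anchor the pattern and match whole path segments, so the regex engine has much less backtracking to do on ordinary URLs that contain no double slash. It assumes, like the rule it replaces, that %{REQUEST_URI} still contains the repeated slashes on your server:

```apache
# 301 to collapse the first run of repeated slashes in the URL-path.
#   %1 = the leading segments before the run (e.g. "/omg", or empty)
#   %3 = everything after the run (e.g. "googly-moogly")
RewriteCond %{REQUEST_URI} ^((/[^/]+)*)//+(.*)$
RewriteRule ^ http://www.example.com%1/%3 [R=301,L]
```

A URL with slash runs in several separate places is corrected one run per redirect, so it still produces a short redirect chain; avoiding that is the harder part mentioned earlier in the thread.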

Asia_Expat

3:30 pm on Feb 5, 2009 (gmt 0)

10+ Year Member



OK g1smd, I'll do some experimenting... but before I forget, I just noticed that...
www.example.com/holy/moly/index
... is resolving the same as...
www.example.com/holy/moly/
... and producing a 200 OK header.

... As far as I can tell, once this (and the query string issue) are fixed up, this will be the perfect htaccess!

[edited by: Asia_Expat at 3:31 pm (utc) on Feb. 5, 2009]

Asia_Expat

3:32 pm on Feb 5, 2009 (gmt 0)

10+ Year Member



... but it's only brutally inefficient if traffic is searching for those double slashes, right?... if so, is it really anything to be too concerned about?

g1smd

4:53 pm on Feb 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As coded, the rule is run for every URL request hitting your server, and likely does a few hundred trial matches until it gives up as a "no match" result and processing passes to the next rule.