... but I've mentally crashed I'm afraid and I need my hand holding. I've decided to move to cruft free but I want to plan it very carefully. I will test on a specific directory for a few weeks (to see how search engines react) before rolling out to the whole website. After the test, I will add the code into the httpd.conf file for efficiency, so I need something that will work there as well as if it was placed in a subdirectory.
My pages are a mixture of html, xhtml and php extensions, so those are the ones I need to 301 redirect to cruft free.
If someone can help me (and comment the code so I can learn as well) I think this would make a good 'cruft free' thread for dummy webmasters like me.
This is what I have so far... am I even getting close?...
RewriteCond %{REQUEST_URI} !\.[a-z0-9]+$
RewriteCond %{REQUEST_FILENAME}.(php(4|5)?|html?|xhtml?) -f
RewriteRule ^(.*)$ /$1.html [L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.(php(4|5)?|html?|xhtml?)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php(4|5)?|html?|xhtml?)$ http://www.example.com/$1 [R=301,L]
RewriteCond %{REQUEST_URI} !\.[a-z0-9]+$
RewriteCond %{REQUEST_FILENAME}.(php(4|5)?|html?|xhtml?) -f
RewriteRule ^(.*)$ /$1.html [L]
If not, consider which filetype is most-often-requested and which filetypes you may come to prefer over time. This will determine the order in which you want to check for existing files of each type.
Let's leave off the second rule (the redirect) for now, as it's best to develop and test one step at a time to avoid confusion at many levels.
Also, the code *can* be made to work in both httpd.conf and .htaccess. But I'd suggest developing the code in your root .htaccess file, and then moving to httpd.conf after you've got it debugged. Doing it this way, the primary (if not only) change is that you'll need to add a slash to the beginning of your RewriteRule patterns for use in httpd.conf. (Others prefer to start each pattern with "^/?" so that it works either way, and you can do that too if you prefer -- and then remove the "?" after you move the code.)
Jim
I'm actually having trouble getting this to work, so I can't even test for bugs just yet. I wonder if there's anything in the existing htaccess conflicting... I guess the third and fourth lines will no longer be required if I get the cruft free set up working properly...
rewritecond %{http_host} ^example.com [nc]
rewriterule ^(.*)$ http://www.example.com/$1 [r=301,nc]
rewritecond %{the_request} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.xhtml\ HTTP/
RewriteRule index\.xhtml$ http://www.example.com/%1 [R=301,L]
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?beta.example\.com [NC]
RewriteRule \.(jpe?g|gif|bmp|png)$ - [NC,F]
[edited by: Asia_Expat at 9:08 am (utc) on Jan. 29, 2009]
I've now appended the following to the htaccess file listed above in the root directory and the cruft free version is now working...
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
So... give me 30 mins or so and I'll try and figure out the 301 for myself...
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.html\ HTTP/
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1 [R=301,L]
Also, can you see any potential conflicts with what is already in the htaccess (posted a couple of posts above).
Further, I read in a few posts around the forum that search engines can sometimes add a slash to the end of cruft free URLs. What is the potential for this to happen to me and how can I prevent it?
I'd suggest:
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLs with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Internally rewrite extensionless URL requests to .html files unless
# the requested URL-path resolves to an existing directory
RewriteCond %{REQUEST_FILENAME}/ !-d
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
Generally, place external redirect rules first, in order from most-specific pattern (fewest URLs affected) to least-specific pattern, followed by internal rewrites, again in order from most- to least-specific.
Putting the external redirects first avoids having a redirect 'expose' the internal filepath resulting from a previous internal rewrite, and putting the most-specific rules first avoids chained or stacked redirects and rewrites.
Concise, accurate comments in the code are a very good thing.
Jim
([xs]?html?|php[456]?).... This is elegant, I didn't realise I could achieve this by listing the multiple file types (i.e. shtml, xhtml, html) WITHIN the square brackets in conjunction with another multiple choice, rather than listing them separately.
I've now tested the following on my server and as far as I can tell, everything is functioning as intended, EXCEPT that only the .html pages are being internally redirected due to the last line. I tried changing the 'html' to ([xs]?html?|php[456]?) and fiddled around for a while but couldn't make it work...
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLs with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Internally rewrite extensionless URL requests to .html files unless
# the requested URL-path resolves to an existing directory
RewriteCond %{REQUEST_FILENAME}/ !-d
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
[edited by: Asia_Expat at 8:28 am (utc) on Jan. 30, 2009]
<?php include($_SERVER['DOCUMENT_ROOT'] . "/head-insert.html"); ?>
change to...
<?php include($_SERVER['DOCUMENT_ROOT'] . "/head-insert"); ?>
Am I correct?
[edited by: Asia_Expat at 2:33 pm (utc) on Jan. 30, 2009]
([xs]?html?|php[456]?).... This is elegant, I didn't realise I could achieve this by listing the multiple file types (i.e. shtml, xhtml, html) WITHIN the square brackets in conjunction with another multiple choice, rather than listing them separately.
The "multiple types" are not inside the square brackets. Rather, it reads, "Match if (this part of the client's HTTP request line) begins with an optional 'x' or 's', followed by 'htm', followed by an optional 'l' OR if this part of the request line begins with 'php' followed by an optional '4', '5', or '6'." The square brackets define a group (a list) of alternate acceptable characters, and the trailing question mark means that the preceding character, alternate group, or parenthesized sub-pattern is optional.
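To see that reading of the pattern in action, here is a minimal sketch that tries the alternation against a handful of extensions. It uses Python's re module rather than Apache's regex engine, which should behave the same way for a pattern this simple; the candidate list is made up purely for illustration.

```python
import re

# The extension alternation from the rules above, written with solid pipes.
EXT = re.compile(r"([xs]?html?|php[456]?)$")

# Hypothetical sample extensions, chosen only to exercise each branch.
candidates = ["html", "htm", "xhtml", "shtml", "php", "php5", "php7", "txt", "xphp"]

# fullmatch() requires the whole string to match, like anchoring at both ends.
matched = [ext for ext in candidates if EXT.fullmatch(ext)]
print(matched)
```

The optional [xs] only applies to the "htm" branch, which is why "xphp" is rejected, and [456] is a list of single characters, which is why "php7" is rejected.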
I've now tested the following on my server and as far as I can tell, everything is functioning as intended, EXCEPT that only the .html pages are being internally redirected due to the last line. I tried changing the 'html' to ([xs]?html?|php[456]?) and fiddled around for a while but couldn't make it work...
I'm not sure what you mean here. If you mean the RewriteRule in the "last line" of the code just posted, then you cannot put a regular-expression pattern into a substitution filepath -- that won't work. Only .html files are supported by this code, because of the structure of the code, not because of the regex pattern(s); all extensionless URL requests are rewritten to .html files, as documented by the comments.
Also, be clear on the "direction of action" here: An incoming HTTP client request for a URL matching <zero or more directory-levels>/<page-name> is rewritten to the internal filepath /<zero or more directory-levels>/<page-name>.html.
In other words, this code says the server should serve the physically-existing file <page-name>.html when the URL <page-name> is requested, as long as there is no directory at <zero or more directory-levels>/<page-name>/. So, there is no "choice" about the .html extension on the existing file. As I stated at the outset, the code gets more complicated (and less efficient, too) if multiple file types need to be supported. But we need to understand the requirements here, and we also need to keep a mental wall between what is a URL and what is a filepath, otherwise, things won't make sense.
I guess I should also be [removing] the ".html" from my common includes as well... for example...
Probably not. I assume that this PHP include is a "file read" and takes place entirely within the server. Code in .htaccess does not apply to anything except HTTP requests, and will (and can) have no effect on filesystem reads. Again, we're looking at the "wall" between HTTP URLs on the Web, versus filesystem activity within the server.
If it helps, you can think of mod_rewrite (and indeed, Apache itself) as one big URL-to-filepath translator. The input to mod_rewrite is a URL-path (everything after the domain name, except for appended query strings). If the rule is an internal rewrite, the output from mod_rewrite is a filepath; The server then enters the content-handling phase, and sends the contents of that file to the client in the response-body of the HTTP response, along with a 200-OK response code. (If the rule is a redirect, the mod_rewrite output is another URL, which is sent directly to the client with 301 or 302 response code and no response body {i.e. no "page content"}, and "tells" the client to ask for the resource again at this new URL, but we are discussing internal rewrites here right now, and of course, the above description is simplified.)
In the simplest case (without mod_rewrite or other modules), the server takes URLs requested via HTTP from the Web and translates them into filepaths so that the operating system's disk and file handlers can read the file associated with the requested URL. Then the server sends the content of that file to the client. The basic purpose of a server is to allow the use of URLs on the Web instead of filepaths, so that the client does not have to know the details of what operating system and directory architecture the server uses.
OK, with all that in mind, do you have existing filetypes on this server other than ".html" or not?
If so, which of them are the most prevalent, and which of them are requested the most often? (The accuracy of your answer will affect performance but not functionality; A precise answer based on logs is good, but an educated guess will do.)
Jim
[edited by: jdMorgan at 4:17 pm (utc) on Jan. 30, 2009]
That's quite a post... and it's 23:30 HRS here in Chiang Rai, so I'll digest it properly when I wake up tomorrow (+ I've had a couple of beers with my wife)...
... but to answer your question, 70'ish percent .html... 29'ish percent .xhtml... 1 percent .php
.html is requested most often.
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLs with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
### NEW/CHANGED STUFF STARTS HERE ###
#
# Skip the following three rules if the requested URL-path has a file extension, if it is
# blank (i.e. a "homepage" request), or if it exists as a directory when a slash is appended
RewriteCond $1 \.[a-z0-9]+$|^$ [NC,OR]
RewriteCond %{REQUEST_FILENAME}/ -d
RewriteRule (.*) - [S=3]
#
# Internally rewrite extensionless URL request to existing .html file
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule (.+) /$1.html [L]
#
# Internally rewrite extensionless URL request to existing .xhtml file
RewriteCond %{REQUEST_FILENAME}.xhtml -f
RewriteRule (.+) /$1.xhtml [L]
#
# Internally rewrite extensionless URL request to existing .php file
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]
As usual, replace all broken pipe "¦" characters with solid pipe characters before use; Posting on this forum modifies the pipe characters.
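The lookup order in the last four rules can be modelled roughly as below. This is only a sketch of the decision logic, not of mod_rewrite itself: the file and directory sets are hypothetical stand-ins for the filesystem checks (-f and -d), and the paths carry a leading slash for readability.

```python
import re

# Hypothetical contents of the document root (illustration only).
EXISTING_FILES = {"/about.html", "/about.php", "/docs/guide.xhtml", "/contact.php"}
EXISTING_DIRS = {"/docs"}

# Checked in the same order as the three rewrite rules: most-requested type first.
EXTENSIONS = [".html", ".xhtml", ".php"]

HAS_EXTENSION = re.compile(r"\.[a-z0-9]+$", re.IGNORECASE)


def resolve(url_path):
    """Return the file that would be served for url_path, or None if no rewrite applies."""
    # Skip rule: homepage request, extension already present, or existing directory.
    if (
        url_path == "/"
        or HAS_EXTENSION.search(url_path)
        or url_path.rstrip("/") in EXISTING_DIRS
    ):
        return None
    # First existing file wins, so the order of EXTENSIONS matters.
    for ext in EXTENSIONS:
        if url_path + ext in EXISTING_FILES:
            return url_path + ext
    return None


for path in ["/about", "/docs/guide", "/contact", "/docs", "/about.html", "/missing"]:
    print(path, "->", resolve(path))
```

Note that "/about" resolves to the .html file even though a .php file with the same name exists, which is why the order of the checks should follow the most-requested file type.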
Jim
[edit] Corrected as noted in following post. [/edit]
[edited by: jdMorgan at 5:17 pm (utc) on Feb. 1, 2009]
Thanks for this work of art htaccess file... however, there is a problem I didn't expect. Every page of my forum now redirects to the forum homepage.
http://www.example.com/forum/index.php?showtopic=123456789
I tried removing the last rule but that didn't fix it. If this is something to do with the attempt to cruft free the .php files, there are actually only two or three pages with that extension (except for the thousands of forum pages of course) and they can be changed to .html if necessary in order to make this work... but I can't figure it out at this stage.
[edited by: Asia_Expat at 5:14 pm (utc) on Feb. 1, 2009]
So, either both rules must be modified to exclude the /forum/index.php URL-path, or the links must be changed to /forum/?showtopic=123456789
This is up to you to decide, based on what you can do (with that forum software) and what you want to do.
Well-spotted on the RewriteRule directive. I corrected it above to prevent drive-by copy-and-pasters from copying bad code. Thanks for noting it.
Jim
I'm trying to think for myself here... if I was to add the following before everything, would the offending rules be prevented from affecting the forum directory?...
RewriteCond %{REQUEST_URI} "/forum/"
RewriteRule (.*) - [S=2]
So I suggest just explicitly excluding the "forum/" URL-paths with a couple of new RewriteConds:
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLs with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
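For anyone following along, the effect of the extra "forum/" condition on the index redirect can be sketched like this. It is a rough model only: the function name and the exclude_forum switch are made up for the illustration, and the regex is the THE_REQUEST pattern from the rule with the escaped spaces written as plain spaces.

```python
import re

# THE_REQUEST pattern from the index redirect rule, with solid pipes.
INDEX_REQUEST = re.compile(
    r"^[A-Z]+ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^ ]*)? HTTP/"
)


def index_redirect(request_line, exclude_forum):
    """Return the redirect target for request_line, or None if the rule does not fire."""
    m = INDEX_REQUEST.match(request_line)
    if not m:
        return None
    directory = m.group(1)  # what %1 holds in the rule
    # Models: RewriteCond %1 !^forum/
    if exclude_forum and directory.startswith("forum/"):
        return None
    # The trailing "?" in the substitution drops the query string,
    # so only the directory part survives in the target.
    return "http://www.example.com/" + directory


forum_request = "GET /forum/index.php?showtopic=123456789 HTTP/1.1"
print(index_redirect(forum_request, exclude_forum=False))
print(index_redirect(forum_request, exclude_forum=True))
print(index_redirect("GET /docs/index.html HTTP/1.1", exclude_forum=True))
```

Without the exclusion, every topic URL collapses to the forum homepage because the query string is discarded; with it, forum requests pass through untouched while other index pages are still redirected.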
So, as this is a 'cruft free for dummies' thread, let's recap on where we're at...
* All three file types (html, xhtml, php) are redirecting externally to cruft free version
* All index pages are still being correctly 301'd to the / of that directory, as they should be.
* /forum/ directory is unaffected by cruft modifying rules. Any directory can be excluded in this way as necessary.
* Make a backup of the website and start the mind numbing task of removing the extensions from all internal links, so that you're ready to upload and flick the switch to cruft free (make a second copy of the original site architecture to a separate drive, just to protect yourself from being a dummy).
---------------------------------------
Regarding moving to cruft free URLs, have I covered every obvious eventuality? As most Apache servers will be set up in a very similar way to my own server, I guess this thread should help most people that want to do this... but is there anything, ANYTHING at all that anyone can think of, that I should consider carefully before making this move... it's a big decision that could bring a site to its knees in search engines if you get it wrong.
---------------------------------------
I'm now reading up on the cPanel documentation for instructions on modifying the httpd.conf file, as I'm worried the changes would be reversed if the server was restarted, thus breaking everything. Perhaps someone can offer a brief idiot's guide to making changes there? Can I just paste the above .htaccess rules into the httpd.conf file in exactly the same format?
So this means that RewriteRule patterns in httpd.conf or another server config file must start with a slash, in contrast to those in /.htaccess, which must not include the leading slash.
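A small sketch of the leading-slash difference, using the extensionless-URL pattern from earlier in the thread. It assumes, as described above, that a rule in the root .htaccess sees "docs/guide" while the same rule in httpd.conf sees "/docs/guide"; the sample path and constant names are hypothetical.

```python
import re

# Same pattern in three forms; only the start of the pattern differs.
HTACCESS_PATTERN = re.compile(r"^(([^/]+/)*[^./]+)$")    # root .htaccess: no leading slash
SERVER_PATTERN = re.compile(r"^/(([^/]+/)*[^./]+)$")     # httpd.conf: leading slash
PORTABLE_PATTERN = re.compile(r"^/?(([^/]+/)*[^./]+)$")  # "^/?" accepts either input

# Assumed inputs for the same request in each context.
inputs = {"htaccess": "docs/guide", "httpd.conf": "/docs/guide"}

for context, url_path in inputs.items():
    for name, pattern in [
        ("htaccess pattern", HTACCESS_PATTERN),
        ("server pattern", SERVER_PATTERN),
        ("portable pattern", PORTABLE_PATTERN),
    ]:
        m = pattern.match(url_path)
        print(context, "|", name, "|", m.group(1) if m else "no match")
```

In both contexts the captured group is "docs/guide" without the slash, so a substitution such as /$1.html stays the same after the move.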
As for server-config dependencies, the server must have mod_rewrite installed and loaded. If this "cruft-free URL" code is to reside in /.htaccess (or any .htaccess file), then AllowOverride Options FileInfo (or AllowOverride All) must be set in the server config file so that .htaccess files will be allowed to set options and modify the filepath, and Options +FollowSymLinks or +SymLinksIfOwnerMatch must be set to enable mod_rewrite. I'm sure there are other dependencies -- These are just the ones that come to mind immediately.
Jim
www.example.com/omg//googly-moogly
Produces exactly the same content (with 200 OK response) as...
www.example.com/omg/googly-moogly
The potential for competitors to link to these and tank the website is enormous. I don't know what to do about it.
Maybe this should be a separate thread? I didn't want to post my htaccess all over again. Mods feel free to split this topic if you feel it necessary.
[edited by: Asia_Expat at 1:41 pm (utc) on Feb. 3, 2009]
See what you think...
AddType application/x-httpd-php .html .xhtml
Options -Indexes
<Limit GET POST>
order deny,allow
deny from all
allow from all
</Limit>
<Limit PUT DELETE>
order deny,allow
deny from all
</Limit>
AuthName example.com
RewriteEngine on
# Externally redirect direct client requests for index.xyz to "/" in same directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*)index\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule /?index\.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect direct client requests for URLs with "page" file extensions
# to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([^/]+/)*[^./]+)\.([xs]?html?|php[456]?)(\?[^\ ]*)?\ HTTP/
RewriteCond %1 !^forum/
RewriteRule \.([xs]?html?|php[456]?)$ http://www.example.com/%1? [R=301,L]
#
# Externally redirect requests for non-blank, non-canonical hostname to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.(beta\.)?example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
########### NEW STUFF
#########################
# Here, I'm attempting to remove two or more contiguous slashes from anywhere in the URL
# but it's only working for multiple slashes directly after the .com
# Slashes in deeper directories are not being fixed
RewriteRule ^(([^/]+/)*)/+(.*)$ /$1$3 [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ //+([^\ ]*)
RewriteRule .* /%1 [R=301,L]
#
# Remove ? from the end of URLs
# It works for a single ? but if anything else is added... e.g. ? or ?fluff... it resolves duplicate content
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
########################
#####################
#
# Return 403-Forbidden response for included-object requests with non-blank off-site referrers
RewriteCond %{HTTP_REFERER} !^(https?://(www\.)?(beta\.)?example\.com(/.*)?)?$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png|ico|css|js)$ - [NC,F]
#
# Skip the following three rules if the requested URL-path has a file extension, if it is
# blank (i.e. a "homepage" request), or if it exists as a directory when a slash is appended
RewriteCond $1 \.[a-z0-9]+$|^$ [NC,OR]
RewriteCond %{REQUEST_FILENAME}/ -d
RewriteRule (.*) - [S=3]
#
# Internally rewrite extensionless URL request to existing .html file
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule (.+) /$1.html [L]
#
# Internally rewrite extensionless URL request to existing .xhtml file
RewriteCond %{REQUEST_FILENAME}.xhtml -f
RewriteRule (.+) /$1.xhtml [L]
#
# Internally rewrite extensionless URL request to existing .php file
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]
ErrorDocument 404 /404.php
You can add "?" to the end of the URLs in all preceding redirects to unconditionally remove query strings if you like; As long as you don't use query strings on *any* URL on the site (beware of host-provided server and script-support control panels possibly using query strings, though), this is safe and does not require a new rule.
There are better "double-slash removers" posted here. More later, if I get the time.
Jim
Anyhooo... I found one that 'appears' to function perfectly... It removes double slashes, triple, 10, 20, 90 slashes... even if the extra slashes are added on multiple directories...
# 301 to fix double slash in URL path
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# 301 to fix multiple slashes before URL path
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ //+([^\ ]*)
RewriteRule .* http://www.example.com/%1 [R=301,L]
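As a rough check of how the first of those two rules behaves with long runs of slashes, the sketch below applies the same regex repeatedly, the way a client following each 301 would. It models only the pattern and the %1/%2 substitution, not Apache's own URL handling, and the function name is made up for the illustration.

```python
import re

# The RewriteCond pattern from the double-slash rule.
DOUBLE_SLASH = re.compile(r"^(.*)//(.*)$")


def redirect_chain(path):
    """Return the list of paths a client would be redirected through, in order."""
    hops = []
    while True:
        m = DOUBLE_SLASH.match(path)
        if not m:
            return hops
        # Models the substitution %1/%2: one "//" becomes "/" per redirect.
        path = m.group(1) + "/" + m.group(2)
        hops.append(path)


print(redirect_chain("/omg//googly-moogly"))
print(redirect_chain("/a///b//c"))
```

Each pass removes a single slash, starting from the right-most pair because the first (.*) is greedy. So a URL with many extra slashes would, under this model, reach the clean URL through a chain of redirects rather than in one hop.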
----------------------------------
Jim,
Regarding the blank query... I've been staring at my htaccess for about 45 minutes and I'm sorry to say I just don't understand what you mean, sorry, I tried. I have a /forum/ directory in which every URL has a ? mark... so if I figure out where to add the "?" as you suggest, will my forum directory still be protected by the...
RewriteCond %1 !^forum/
... I'm guessing it would... but I'm seeing stars where my infinity signs should be and เ where my ไ should be :(
[edited by: Asia_Expat at 3:06 pm (utc) on Feb. 5, 2009]
... As far as I can tell, once this (and the query string issue) are fixed up, this will be the perfect htaccess!
[edited by: Asia_Expat at 3:31 pm (utc) on Feb. 5, 2009]