Remove html from url

Tried many things but none work to remove html from url

         

Sawhorse

12:42 am on Jan 21, 2009 (gmt 0)

10+ Year Member



Hi,
I believe that this question has been asked and answered (probably 1000 times), but I cannot
get .htaccess to work correctly.

I would like to do two things.
1) Remove the index.html file and extension when visiting my site.
2) Remove the html file extension when viewing any page on my site.

I believe that I have correctly removed the index.html file display when accessing my site.
Below are the lines of code used.

RewriteEngine on

# For index.html and .htm .shtml .php .php4 .php5 in the root or in any folder
# Works for requests with or without parameters, and preserves original folders:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?htm|?php[45]?)$ http://example.com/$1 [R=301,L]

However, I have read most of the posts (I believe) and tried most of the responses, and have not yet been able to correctly remove the html extension.

This is a very, very simple site. It would be nice, though not necessary, for the html extension to be removed from the URL.

Does anyone have a specific solution to this issue?

[edited by: jdMorgan at 3:15 am (utc) on Jan. 21, 2009]
[edit reason] example.com [/edit]

g1smd

12:57 am on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Removing the extension from the URL means using extensionless URLs in the links on your pages. It is the links that 'define' the URL.

Next, you set up a rewrite so that when the user asks for www.example.com/somepage the server gets the content from /somepage.html without revealing that filename to the user.

Additionally, you'll want a redirect such that if a client directly asks for example.com/somepage.html or for www.example.com/somepage.html they are redirected to make a new request for www.example.com/somepage instead.

That combination of redirect and rewrite allows there to be just one URL for the content, and the URL to be different to the filename.

You're right, this question gets covered several times per week, sometimes several times per day, and there are hundreds of prior examples to choose from in the forum.

List the redirect before the rewrite and add [L] to the end of each of those rules.

You'll also need a site-wide 301 redirect from non-www to www to make sure that the content cannot be directly accessed at non-www URLs.

Check the sticky thread at the top of the forum for some examples, and post your best effort code here.
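
A minimal sketch of how the three pieces described above might fit together in a root-level .htaccess file. It assumes the pages are plain .html files, that www.example.com is the canonical hostname, and that %{DOCUMENT_ROOT} plus a slash gives the correct filesystem path on the server in question; none of this is confirmed for any particular host. An index-file redirect, if used, would normally be listed ahead of the first rule here.

```apache
RewriteEngine On

# 1) External redirect: a client that directly requests /page.html
#    is sent to /page. THE_REQUEST holds the original request line,
#    so the internal rewrite in step 3 does not re-trigger this rule.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*[^/.]+\.html(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^/.]+)\.html$ http://www.example.com/$1 [R=301,L]

# 2) Site-wide external redirect from non-www to www
#    (example.com is a placeholder hostname).
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 3) Internal rewrite: serve /page from /page.html, but only when
#    that file exists (assumes DOCUMENT_ROOT has no trailing slash).
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(([^/]+/)*[^/.]+)$ /$1.html [L]
```

The redirects come first so that a non-canonical request is answered with a 301 before any internal rewrite is applied.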

Sawhorse

1:11 am on Jan 21, 2009 (gmt 0)

10+ Year Member



I think I know what you are saying.

I am using extensionless URLs in the links of my pages.

I tried to rewrite so that when the user asks for www.example.com/somepage the server gets the content from /somepage.html without revealing that filename to the user. However, I only succeed some of the time.

I think I listed the redirect before the rewrite and added [L] to the end of each of those rules.

Below is my last try.

RewriteEngine on

# For index.html and .htm .shtml .php .php4 .php5 in the root or in any folder
# Works for requests with or without parameters, and preserves original folders:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?htm|?php[45]?)$ http://example.com/$1 [R=301,L]

RewriteCond %{REQUEST_fileNAME} !-d
RewriteCond %{REQUEST_fileNAME} !-f
rewriterule ^(([^/]+/)*[^./]+)$ /$1.html [L]

Most of the time it works, but sometimes I get the html extension. It is odd.

Any ideas?

[edited by: jdMorgan at 3:16 am (utc) on Jan. 21, 2009]
[edit reason] example.com [/edit]

Sawhorse

1:40 am on Jan 21, 2009 (gmt 0)

10+ Year Member



I may have a solution. I added RewriteBase /

RewriteEngine on
RewriteBase /

# For index.html and .htm .shtml .php .php4 .php5 in the root or in any folder
# Works for requests with or without parameters, and preserves original folders:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?htm|?php[45]?)$ http://example.com/$1 [R=301,L]

RewriteCond %{REQUEST_fileNAME} !-d
RewriteCond %{REQUEST_fileNAME} !-f
rewriterule ^(([^/]+/)*[^./]+)$ /$1.html [L]

What do you think?

[edited by: jdMorgan at 3:16 am (utc) on Jan. 21, 2009]
[edit reason] example.com [/edit]

jdMorgan

3:14 am on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The logic of your second rule is suspect; I'd suggest checking to see if the requested URL-path exists as a file when ".html" is added, rather than checking that the extensionless URL-path doesn't resolve to an existing directory or file. For one thing, this latter test will never pass, because a physical file needs an extension so that a MIME-type can be assigned for HTTP transmission.

RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

"RewriteBase /" is the default behavior, and adding this line changes nothing unless you've got a previous RewriteBase in the code which points elsewhere, and you now want to set it back to default...

Jim

Sawhorse

5:41 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



Great point on the logic of the second rule. I will change my original to your new suggestion.

As to the RewriteBase / I added: I felt that on some occasions the path was lost, so I added it and it seemed to help the situation. Since it is the default behavior, my thinking was "it won't hurt, might help", and it did seem to help. I have no idea why it did.

So, for those who follow this discussion, the following is now in my .htaccess file.
----------------------

RewriteEngine on
RewriteBase /

# For index.html and .htm .shtml .php .php4 .php5 in the root or in any folder
# Works for requests with or without parameters, and preserves original folders:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?htm|?php[45]?)$http://example.com/$1 [R=301,L]

RewriteCond %{REQUEST_fileNAME} !-d
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
rewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

---------------------------------
Thank you for your help.

jdMorgan

6:08 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Note that the first RewriteRule pattern is inconsistent with the RewriteCond pattern, and the rule is also missing a space. It should be:

RewriteRule ^(([^/]*/)*)index\.(s?html?|php[45]?)$ http://example.com/$1 [R=301,L]

Also, in the first RewriteCond of the second rule, I suggest you follow the documentation standard and use "%{REQUEST_FILENAME}" instead of using mixed-case.

Jim

[edited by: jdMorgan at 8:24 pm (utc) on Jan. 21, 2009]

Sawhorse

7:06 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



Yes, good catch about the space. When I keyed in http://example.com/$1 in place of my URL, I took the space out.

I also noticed that you changed (corrected?) "s?htm" to "s?html?". Is this a style change, or does it make a difference?

My bad about the case; you are quite correct that "%{REQUEST_fileNAME}" should be "%{REQUEST_FILENAME}".

Let's see if it works better.

Thanks - good eyes!

Sawhorse

7:16 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



A problem: I get a 404 when I use
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
When I comment it out, all is well.

Below is my code.

---------------------------
RewriteEngine on
RewriteBase /

# For index.html and .htm .shtml .php .php4 .php5 in the root or in any folder
# Works for requests with or without parameters, and preserves original folders:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?html?|?php[45]?)$ http://example.com/$1 [R=301,L]

RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

--------------------------------------------
This does not make sense. (Just to let you know, I was getting some intermittent 404s earlier, but now it is consistent.)

[edited by: Sawhorse at 7:18 pm (utc) on Jan. 21, 2009]

g1smd

8:01 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Note that
|s?html?|
matches shtm, shtml, htm, html, so the question marks are essential.

The snippet

|?php[45]?|
is incorrect. The leading question mark should be removed.

jdMorgan

8:13 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That would indicate that the filepath constructed by taking DocumentRoot plus a slash, plus the requested URL-path, plus ".html" is *not* the correct path to the .html file in the filesystem.

Your server error log should come in handy here, perhaps indicating an obvious problem when trying to convert the URL-path to a filepath using that method (the server error log shows filepaths, not URLs, and the problem may be obvious to you if the filepath it shows is incorrect).

If the error log file isn't available, then another way to find the problem is to do something like this as a temporary test:


RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(([^/]+/)*[^./]+)$ http://www.example.com/$1.html?constructed-filepath=%{DOCUMENT_ROOT}/$1.html [R=302,L]

This invokes a temporary (302) redirect, exposing the constructed filepath as a query string appended to the URL in your browser's address bar. Note that the query string is appended for your benefit only, and won't actually do anything if passed to an html file.

On some shared servers, you have to include an additional path-part after the document_root -- for example, "%{DOCUMENT_ROOT}/public/$1.html" or some such thing. I would say that such a server is mis-configured, but there must be some reason to do this, as I've seen it occasionally... Unfortunately, it's a bit difficult to debug, and staring at the error log or using the temporary code are the only two debugging methods that are relatively expedient.

Jim
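
For illustration, the file-exists check with an extra path-part might be sketched as below. The "/public" segment is only the example path-part mentioned above, not a value known to be correct for any particular host; the real value would come from the temporary redirect test or the error log.

```apache
# If the URL does not resolve to an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# and a matching .html file exists under the (assumed) extra "/public" path-part
RewriteCond %{DOCUMENT_ROOT}/public/$1.html -f
# then internally rewrite the extensionless URL to the .html file
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
```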

Sawhorse

8:39 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



I am guessing that it is the following problem "%{DOCUMENT_ROOT}/public/$1.html" I will do some testing.

Sawhorse

9:22 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



Since error logs are not available I added your temporary code.

Would you believe the following:

http://example.com/page_name.html?constructed-filepath=/services/webpages//page_name.html

Assuming the above is real I did this.

I commented out the temporary test above and instituted the code below.

#RewriteCond %{DOCUMENT_ROOT}/services/webpages//$1.html -f

I still get 404.

However, if I leave the temporary test code active I do not get a 404.

Now that does not make sense.

jdMorgan

9:43 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That looks backwards. According to what you posted, the value of DOCUMENT_ROOT is "/services/webpages/". Note that it includes the trailing slash, so we need to remove the "extra" slash from the file-exists check:

RewriteCond %{DOCUMENT_ROOT}$1.html -f

...yielding

# Externally redirect index.html, .htm, .shtml, .php, .php4, or .php5 in root or in any
# subdirectory to "/" in that same directory, preserving appended query string (if any)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(s?html?|php[45]?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(s?html?|php[45]?)$ http://example.com/$1 [R=301,L]
#
# If URL *does not* resolve to an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# and *does* resolve to an existing file with ".html" appended
RewriteCond %{DOCUMENT_ROOT}$1.html -f
# then internally rewrite extensionless URL to .html file
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]

Jim

Sawhorse

10:36 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



Well, it's acting funny.

I corrected my code with your code (copied your code).

On two pages it worked fine. On two other pages I got a 404.
(Please note that all the html files are in the root.)

I then added your temporary code:
RewriteRule ^(([^/]+/)*[^./]+)$ http://www.example.com/$1.html?constructed-filepath=%{DOCUMENT_ROOT}/$1.html [R=302,L]

I did not get any 404's on any page.

But here is something interesting: the browser's address bar showed a clean URL for the first two pages (the correct address, with no .html extension; these are the two pages that worked earlier). However, for the last two pages, the ones that originally returned 404s, the address bar showed the same URL as before (http://example.com/page_name.html?constructed-filepath=/services/webpages//page_name.html).

I do not see how adding the temp redirect code would have any effect, but it seems to be doing something.

Sawhorse

10:41 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



I think that last bit was due to web caching, but still... Still testing.

Sawhorse

10:55 pm on Jan 21, 2009 (gmt 0)

10+ Year Member



When I add your temporary code:
RewriteRule ^(([^/]+/)*[^./]+)$ http://www.example.com/$1.html?constructed-filepath=%{DOCUMENT_ROOT}/$1.html [R=302,L]

and leave in the following code:
RewriteCond %{DOCUMENT_ROOT}$1.html -f

In IE I get a blank page.
In FF I get the requested page.

If I do not add the temporary code:
In IE I get a 404.
In FF I get a 404.

jdMorgan

11:52 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The server does not care what browser you use. Please completely-flush (or disable) your browser caches.

There is no further need for the temporary redirect rule; its only purpose was to reveal the correct filepath for the RewriteCond to test. And of course, since the temporary redirect code still has the 'extra' slash in it, you will still see the double-slashed "constructed-filepath" value...

Jim

Sawhorse

1:47 am on Jan 22, 2009 (gmt 0)

10+ Year Member



Caches flushed

I know that I do not need the temporary redirect rule. And when I have removed the extra slash it was of course removed from the "constructed-filepath" value.

What I was trying to indicate (not very well) was that when I use the temporary redirect rule I do not get a 404. When I remove the temporary redirect rule I get a 404. And because the redirect rule is unnecessary and of no use, other than indicating the constructed filepath I have no explanation why this occurs.

If I can not correct this I will need to remove the following:

RewriteCond %{DOCUMENT_ROOT}$1.html -f

Which I do not want to do. Thoughts?

jdMorgan

2:21 am on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm thinking the answer is very simple; either Document_Root + Req_URI + ".html" is the correct path to the actual html file, or it isn't. If it isn't, then the RewriteCond will stop the rule from being invoked.

So you might want to compare the filepath as reported by the temporary rule to the filepaths you see (for example) when using FTP to upload files, and see if you can spot the discrepancy. Otherwise, all I can say is that you need a better host if you're going to use complex config code or scripts on your site, because not having access to the server error log files is fairly unacceptable in today's hosting market.

It's also telling that your Document_Root included that trailing slash, because that indicates a server misconfiguration, in that it should be possible to build a valid filepath using
%{DOCUMENT_ROOT}%{REQUEST_URI} on any server (even in the absence of Mod_Rewrite, just speaking generally here). But %{REQUEST_URI} *always* includes a leading slash, so with your server including a trailing slash on Document_Root, trying to build a path using
%{DOCUMENT_ROOT}%{REQUEST_URI} would give us the same double-slash problem that we've already been through using the $1 back-reference method above. As a result, you may also have a lot of trouble with off-the-shelf scripts on this server. :(

Jim

Sawhorse

5:41 pm on Jan 22, 2009 (gmt 0)

10+ Year Member



Good to know that if Document_Root + Req_URI + ".html" is the incorrect path to the actual html file then the RewriteCond will stop the rule from being invoked. However, I do not understand how the path could be incorrect when the temporary redirect rule reveals it and the RewriteCond then uses that same information.

You suggest that I compare filepath reported by the temporary rule to the filepaths you see using FTP to upload files. Well the filepath I see using FTP is the following:
example.com@ftp.example.com:/public/page.html

I agree that "not having access to the server error log files is fairly unacceptable." The host that is being used is a local telephone company (Windstream). I have asked to get access to the error logs.

Since this is a very, very basic site we probably will not be using many off-the-shelf scripts. Just a note that I am able to use Google maps on this site.

So, bottom line: what problem will I have if I just do not use RewriteCond %{DOCUMENT_ROOT}$1.html -f?

If I have not said this before I have certainly thought it. I really appreciate all the time you have spent with me on this issue.

jdMorgan

10:35 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, you don't have to use the RewriteCond if it causes problems. But the thing is, it is the logically correct thing to do. I can't predict what other 'weird' problems you may run into if you don't use it. Just be aware that requests for extensionless URL-paths that don't resolve to a directory (when a trailing slash is added by mod_dir) will be rewritten to an HTML-file filepath -- even if no such HTML file exists.

However, I want to emphasize that your server *is* mis-configured -- and in a way that is specifically warned-about in the Apache documentation: Including the trailing slash on DocumentRoot triggers a known bug in mod_dir, and that may be what is causing this RewriteCond to behave so oddly. Refer your host to the DocumentRoot directive [httpd.apache.org] description in the "Apache core" documentation, and ask them to fix your DocumentRoot declaration by removing the trailing slash. If they argue, ask them to read the last line of that section to you over the phone, and to tell you again that they don't see a problem... :)

If you can't get them to fix this or you can't get access to your log files, then I suggest you run --do not walk-- to the nearest exit. Since you haven't been around long, I'll repeat one of my favorite phrases: "Cheap hosting is the most expensive hosting you can buy!" (Think about how long you've been working on this one problem, and imagine my bill if this were a paid consultancy). There are simply too many nice fish in the hosting-services pond to put up with one that is sick or emaciated...

Jim

Sawhorse

4:05 am on Jan 23, 2009 (gmt 0)

10+ Year Member



Glad you feel that I do not have to use the RewriteCond if it causes problems, but I agree that logically it is the correct thing to do.

From all the problems we are having I must *strongly* agree with you that the server is mis-configured. Very nice information about the DocumentRoot directive description in the "Apache core" documentation!

I really doubt that a large phone company will listen to me. When I asked for the error logs, the low-level tech said that he would put in a ticket for the request. I asked why the logs were not readily available. He responded that he had never been asked for access to the error logs before. So, you see, I believe that this might be a losing battle.

I love your "favorite phrase" and I agree. I have another site that I built that is on a nice server that runs cPanel Version 11 as the WebHost Manager Interface. I can get to everything.

Again, I appreciate your sharing your knowledge. And I am sure others that read this thread will also. Thank you.

Caterham

1:23 pm on Jan 23, 2009 (gmt 0)

10+ Year Member



The DOCUMENT_ROOT isn't a variable to rely on. You don't know if the URI-to-filename translation was done by the core or by another module. If it's done by another module, the document_root isn't used and won't contain your path, of course. But that is not a mis-configuration, otherwise that would imply that all other URI-to-filename translations not done by the core are mis-configurations...

The DOCUMENT_ROOT is not (never was) and never will be a *reliable* way to determine the filesystem path to your web folder.

Are you hosted on apache 1.3? Apache 2 normalizes all paths, so you shouldn't fall into issues with multiple slashes (in the mapping phase, mod_rewrite's -d/-f checks or wherever).
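
One commonly used alternative that avoids DOCUMENT_ROOT entirely is to test %{REQUEST_FILENAME} with ".html" appended. The sketch below assumes the rules live in the .htaccess file of the web root and that the server has already mapped the requested URL-path to a filesystem path; it has not been tested on the server discussed in this thread.

```apache
# If the URL does not resolve to an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# and the server-mapped filesystem path plus ".html" is an existing file
RewriteCond %{REQUEST_FILENAME}.html -f
# then internally rewrite the extensionless URL to the .html file
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.html [L]
```

Because the path comes from the server's own URL-to-filename mapping, it should not depend on whether DocumentRoot carries a trailing slash. Requests with extra path segments after a real page name may still behave unexpectedly, so this is a starting point rather than a guaranteed fix.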

Sawhorse

5:07 pm on Jan 23, 2009 (gmt 0)

10+ Year Member



In answer to your question about the Apache version: that info was not available in the site management software, so I called support. At first they said they did not use Apache. Then the tech said no one had asked that question before. When I pressed him, he talked to others and said that since I had PHP5 they must be using Apache 2. They also said something to the effect that the organization they use runs a hybrid server system: Windows on a unix box.