Forum Moderators: phranque


Content Negotiation

everything but .htm is working..

         

youfoundjake

12:46 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here is my current .htaccess:

Options +FollowSymlinks
RewriteEngine on
rewritecond %{http_host} ^example.com [nc]
rewriterule ^(.*)$ http://www.example.com/$1 [r=301,nc]

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(.*)index\.html?$ http://www.example.com/$1 [R=301,L]

RewriteRule ^([a-z0-9\-]+)$ /$1.htm [L]

I found a program that renamed all my local *.htm files to just *, without any extension.
I uploaded them to the webserver, and of course now I have two versions of the same file:
www.example.com/filename.htm
www.example.com/filename

There are a few external links pointing to the .htm files, so I can't delete them off the server, but getting it to 301 to an extensionless file is proving to be difficult..

On a side note, before I added just the

RewriteRule ^([a-z0-9\-]+)$ /$1.htm [L]
and renamed filename.htm to just filename, I was able to view it using IE7, but Chrome and FF both showed the page as source code instead of rendering it...

Anyways, any tips?

encyclo

1:28 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You seem to have some of this backwards. Firstly, the physical files on the web server should all have the appropriate file extensions, because the server assigns MIME types to files based on the extension (hence the issues displaying extensionless files). Secondly, this isn't really content negotiation, which is handled by a separate module from mod_rewrite and is an entirely different technique, one which can often conflict with mod_rewrite rules.

So you should start by re-renaming the files. Then you need to define exactly what your aims are: do you want all files to be served without extensions, or just the HTML ones? Are you looking to do actual negotiation (multiple variants of the same content, for example in different languages or formats)?
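
To illustrate the MIME-type point, here is a minimal sketch, assuming mod_mime is loaded and that .htaccess is allowed to set these directives (AllowOverride FileInfo). On Apache 1.3 through 2.2, a file whose name has no mapped extension falls back to DefaultType, which is text/plain unless changed. That would be consistent with Firefox and Chrome showing the page as source.

```apache
# Map the .htm and .html extensions to the HTML media type (mod_mime).
# Most default configurations already do this through the mime.types file.
AddType text/html .htm .html

# Shown only to explain the symptom, not as a recommendation:
# files with no recognised extension get the DefaultType (text/plain by default).
# DefaultType text/html
```

Keeping the extensions on the physical files and hiding them with mod_rewrite avoids having to change the default type at all.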

youfoundjake

1:32 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Heh, yeah i figured there was a way to do it without physically renaming the files, good thing i make backups..
I'm wanting the files to be served without extensions, (no directories exist with the same name as a file)
As for why I'm doing it, i want it to look professional, www.example.com/about-us instead of www.example.com/about-us.htm, plus the .htm just looks horrible in the SERPS...
Concerning internal linking, I took off the .htm on all links, should that be ok?

encyclo

1:39 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You should read jdMorgan's post #:3372175 in this earlier thread:

[webmasterworld.com...]

youfoundjake

1:42 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



danke...

g1smd

1:46 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Use extensionless URLs in the links on your pages. Links 'define' URLs.

Set up a rewrite such that when an extensionless URL is requested, the server serves the content of the matching .html file. Since it is a rewrite, the true name of the file will not be exposed out to the web.

Set up a redirect, such that if .html URLs are requested, the user is redirected to the appropriate matching extensionless URL.

Make sure you also have the usual index to / and non-www to www redirects also in the same file.

.

Quick pointers.

Do all of this with RewriteRule and RewriteCond. Do not mix any Redirect or RedirectMatch code in with this.

List the redirects from most specific (index files) first to most general (non-www to www) last.

The redirects must have the full domain name in the target URL, and R=301. Force www at the same time in every one.

List all of the redirect(s) before the rewrite.

All rules (redirects or rewrites) must end with [L].

Don't feel free to mess with the capitalisation of the syntax, unless you're prepared for some future incompatibility with the code.

The index redirect is very inefficient due to the use of .* in the pattern. There are much better ways to code that.

Note that ^(.*)$ simplifies to (.*) here too.
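
As a rough sketch of how several of these pointers might apply to the non-www redirect from the first post: standard capitalisation, [L] on the rule, the full domain with R=301 in the target, and (.*) in place of ^(.*)$. Escaping the dot in the hostname and dropping the [NC] flag from the rule itself are extra tidying not mentioned in the list above.

```apache
# Externally redirect non-www requests to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Being the most general redirect, this would sit after the more specific redirects and before any internal rewrite.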

youfoundjake

1:57 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It seems as if it's being done backwards,
Then add some mod_rewrite code to internally rewrite extensionless *URL* requests to .html *files*, if those files exist.
Finally, add some mod_rewrite code to externally redirect any direct client (user or robot) requests for .html-extension URLs to the corresponding extensionless URLs.

g1smd

2:04 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A rewrite does not 'make' a URL. URLs are defined by links on the web. A rewrite changes nothing on the web - it only changes the association of requested URLs and the filenames that will match those URL requests.

A rewrite takes a URL request and finds the file on the server to pull the content from - a filename that is different to the one that might have been hinted at by the path in the URL.

You need an additional redirect to stop the files being directly accessed by their 'true' URL. This redirects the user to make a new request. This request is for the URL that you want users to 'see' and 'use' to access that content on the web.
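
The two mechanisms can be sketched side by side for a single page. This is a minimal illustration, assuming a hypothetical file about-us.htm in the document root and rules placed in the root .htaccess; it is not the complete rule set for the site.

```apache
RewriteEngine On

# External redirect: the client is told to make a new request for /about-us.
# THE_REQUEST holds the original request line sent by the client, so this
# only fires when the client itself asked for the .htm URL, not when the
# internal rewrite below produced the .htm path.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /about-us\.htm\ HTTP/
RewriteRule ^about-us\.htm$ http://www.example.com/about-us [R=301,L]

# Internal rewrite: no R flag and no hostname in the target, so the URL the
# client sees stays /about-us while the content comes from about-us.htm.
RewriteRule ^about-us$ /about-us.htm [L]
```

Without the THE_REQUEST check, the redirect would also match the internally rewritten path and the two rules would bounce the request back and forth.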

youfoundjake

2:09 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ugh, ok, let me chew on it for awhile and I'll post what I came up with, looks like I'm reading a bunch of threads tonight, better get some coffee and aspirin..friggin apache..

g1smd

2:30 am on Mar 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



[webmasterworld.com...] and others explain some details.

youfoundjake

2:03 am on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



K, here is what I currently have come up with..

Options All Indexes
IndexOptions FancyIndexing

Options +FollowSymlinks
RewriteEngine on

# If requested URL-path plus ".htm" exists as a file
RewriteCond %{DOCUMENT_ROOT}/$1.htm -f
# Rewrite to append ".htm" to extensionless URL-path
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.htm [L]

## Internally rewrite extensionless file requests to .htm files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.htm [L]

## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+htm\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.htm$ http://www.example.com/$1 [R=301,L]

rewritecond %{http_host} ^example.com [nc]
rewriterule ^(.*)$ http://www.example.com/$1 [r=301,nc]

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(.*)index\.html?$ http://www.example.com/$1 [R=301,L]

I updated the links on all my pages to no longer point to .htm even though the page names still end with .htm

i structured the .htaccess to
1. convert extensionless paths to .htm internally
2. convert external .htm requests to extensionless urls
3. 301 non-www and index.htm to www and root...

Hows that look?

[edited by: youfoundjake at 2:50 am (utc) on Mar. 9, 2009]

g1smd

2:07 am on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You have to do all the external redirects before you can start doing internal rewrites - otherwise you will expose the internal filepath back out into the URL (and that is NOT what you want).

Likewise, list the most-specific redirects first and the most-general redirects last, otherwise some requests will go through a redirection chain, instead of direct to target in just one hop.

Put [L] on every Rule. You missed one.

The

^(.*)$
simplifies to
(.*)
too.

Your index redirect uses (.*) which is very inefficient. You'll need the one from [webmasterworld.com...] - do note the correction on Page 3 of that thread.

Check my 'quick pointers' list again for all the steps.

youfoundjake

3:06 am on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm, ok, so I should now have

Options All Indexes
IndexOptions FancyIndexing
Options +FollowSymlinks
RewriteEngine on

## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+htm\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.htm$ http://www.example.com/$1 [R=301,L]

# If requested URL-path plus ".htm" exists as a file
RewriteCond %{DOCUMENT_ROOT}/$1.htm -f
# Rewrite to append ".htm" to extensionless URL-path
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.htm [L]

## Internally rewrite extensionless file requests to .htm files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.htm [L]

rewritecond %{http_host} ^example.com [nc]
rewriterule (.*) http://www.example.com/$1 [R=301,nc,L]

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(.*)index\.html?$ http://www.example.com/$1 [R=301,L]

Now, as far as the index redirect being inefficient, I'm not using any subdirectories and all the pages are .htm so do I really need to add the code for php and asp?

youfoundjake

3:22 am on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



K, So I've published the code..
Here are the response codes for http://example.com/page.htm

1. Requesting: http://example.com/page.htm
GET /page.htm HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: example.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Server Response: 301 Moved Permanently
Date: Mon, 09 Mar 2009 03:16:10 GMT
Server: Apache/1.3.41 (Unix) Resin/2.1.13 mod_fastcgi/2.4.6 mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.7a
Location: http://www.example.com/page
Keep-Alive: timeout=5, max=149
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
Redirecting to http://www.example.com/page ...

2. Requesting: http://www.example.com/page
GET /page HTTP/1.1
Connection: Keep-Alive
Keep-Alive: 300
Accept:*/*
Host: www.example.com
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Server Response: 200 OK
Date: Mon, 09 Mar 2009 03:16:10 GMT
Server: Apache/1.3.41 (Unix) Resin/2.1.13 mod_fastcgi/2.4.6 mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.7a
Last-Modified: Mon, 09 Mar 2009 01:54:48 GMT
ETag: "4ae34ea-1cd9-49b476e8"
Accept-Ranges: bytes
Content-Length: 7385
Keep-Alive: timeout=5, max=150
Connection: Keep-Alive
Content-Type: text/html

g1smd

7:52 pm on Mar 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You still haven't got the redirects before the rewrites. Only one is. They all should be. Check the list ordering of those as per the list above.

You can omit the .php and .asp stuff from the index rule, but you do need to replace the

.*/
part with the
/([^/]+/)*
pattern, and the
.*
part with the
(([^/]+/)*)
pattern.

youfoundjake

2:43 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Options All Indexes
IndexOptions FancyIndexing
Options +FollowSymlinks
RewriteEngine on

## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+htm\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.htm$ http://www.example.com/$1 [R=301,L]

# If requested URL-path plus ".htm" exists as a file
RewriteCond %{DOCUMENT_ROOT}/$1.htm -f
# Rewrite to append ".htm" to extensionless URL-path
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.htm [L]

## Internally rewrite extensionless file requests to .htm files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.htm [L]

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

rewritecond %{http_host} ^example.com [nc]
rewriterule ^(.*)$ http://www.example.com/$1 [r=301,nc]

My brain hurts..

Do I even want the non-www to www redirect before the file based rewrites?

Thanks for the help Ian..

youfoundjake

3:02 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



uh oh,
webmastertools is reporting to me:
When we tested a sample of URLs from your Sitemap, we found that some URLs redirect to other locations. We recommend that your Sitemap contain URLs that point to the final destination (the redirect target) instead of redirecting to another URL
oops, heeh..i corrected it..

g1smd

9:00 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, you MUST put the redirects first, because they are fixing up the URL out on the web that users are going to be using to access the content. Only when they have the correct URL can you rewrite the request to get the file from a different place in the server to that suggested by the URL path part.

Google is already seeing exposed URLs, because you have the instructions in the wrong order. For some URLs, it is .htaccess that needs to be fixed, not the sitemap.

That is, if you do the rewrite first, the internal file pointer will be updated to show the internal filepath location of the content and then a following redirect will expose that filepath back out into the URL. You do not want that to happen. That is why you do redirects first. Fix the URL the user sees before doing a rewrite to get the content from a different location.
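
A deliberately wrong ordering makes the effect easier to see. This is only a sketch, assuming the rules live in the root .htaccess (where a rewritten path is run through the rules a second time) and a hypothetical page.htm file; the trace in the comments is a reading of how mod_rewrite would process it, not captured server output.

```apache
# WRONG ORDER, for illustration only: internal rewrite listed before a redirect.
RewriteEngine On

# Internal rewrite: /page -> /page.htm
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^([^./]+)$ /$1.htm [L]

# External redirect: non-www -> www
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Trace for "GET /page" with "Host: example.com":
#   pass 1: the rewrite matches and the path becomes /page.htm
#   pass 2: the rules run again on /page.htm; the rewrite no longer matches,
#           the host redirect does, and $1 is now "page.htm"
#   result: 301 to http://www.example.com/page.htm (the filename leaks out)
```

With the redirect listed first, the same request would be redirected to http://www.example.com/page on the first pass, before the rewrite ever runs.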

Please check very carefully the complete list of what to do at the top of this thread. You do need all of those steps, otherwise you will get the above behavior as well as some requests going through a redirection chain.

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
- in this part, htm? should be html? and the / before index does NOT need escaping (that is, \/ should be / only). See also, above, for a more efficient replacement for the .* part of this line. I would use this as it is way more efficient:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/ 
- it works for both .htm and .html extensions.

I always use the full version with the .php and .a/jsp(x) checks on all sites, so that the code is completely portable, and so that I don't have to think about it. I did that after accidentally using the .php version on a site that used index.htm files - and didn't notice until Google had indexed some of the index.htm URLs.

Capitalisation of Apache Keywords in the non-www redirect is messed up, but I'm just repeating myself now. All of the steps are listed in this thread, and all of them are necessary.

[edited by: g1smd at 9:42 am (utc) on Mar. 10, 2009]

g1smd

9:42 am on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Example: I'd expect that a request for example.com/index.htm doesn't do what you want, or does a double redirect, as you have things coded right now. Use Live HTTP Headers to be sure. The rule order is very important.

youfoundjake

6:02 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



G1smd, thanks..
What I have written in post 3866008 takes http://example.com/pagename.htm and sends it to [example...] and from what I understand, the .htm is converted to extensionless, and the 301 redirect to www works..
For google complaining, my sitemap.xml contained .htm extensions, and they didn't like being redirected since I guess that the paths in the sitemap should be final, and not lead to a redirect..
I'm, for the time being, gonna put everything back as the way it was until I understand apache syntax better, so as to not frustrate those around me.. Including my buddy Ian.. :)
I'll start digging through the library here, but do you have any other recommended readings?

g1smd

6:39 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Heh, you're just a short check list and a couple of edits away from finishing the job here.

No need to be putting it back how it was.

jdMorgan

8:26 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Simplify:

IndexOptions FancyIndexing
Options All
RewriteEngine on
#
# Externally redirect client /index page requests to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
#
# Externally redirect client requests containing an htm/html extension to the extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*[^.]+\.html?
# externally redirect to extensionless URI
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [R=301,L]
#
# Externally redirect non-blank non-canonical hostname requests to canonical hostname
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# If requested extensionless URL-path does not resolve to an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if requested extensionless URL-path plus ".htm" does resolve to an existing file
RewriteCond %{REQUEST_FILENAME}.htm -f
# then append ".htm" to resolve the actual filename
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]

Jim

[edit] Corrected as noted below. [/edit]

[edited by: jdMorgan at 12:02 am (utc) on Mar. 11, 2009]

youfoundjake

10:29 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim,
Thanks, now, i get to study in detail the syntax and how it applies..

youfoundjake

11:33 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm, i got a 500 internal server error on that..

The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator, webmaster@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request.

g1smd

11:39 pm on Mar 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Go look in the Error Log file to see what it says. I am guessing it is the typo in this line:

RewriteRule ^(([^/]+/)*[^.]+\.html?$ http://www.example.com/$1 [R=301,L]

Unmatched brackets, two opening, but only one closing.

RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [R=301,L]

The addition is the closing parenthesis just before \.html?$. Always start by looking for obvious logic errors.

[Heh. Do I get extra points for spotting a jd typo?] :)

jdMorgan

12:02 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nah, I post more typos than anyone!

Jim

youfoundjake

12:09 am on Mar 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That worked, (you both get points in my book and yes, I do keep score..)

youfoundjake

2:20 am on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On a side note, google is bombing out on the site verification for WMT because the google***.html file is returning a 404 since .html is being rewritten...
I just added the meta verification tag, but just thought i'd share...

g1smd

10:17 am on Mar 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you make your rules 'too wide' they can sweep up robots.txt, stylesheets, javascript, images, SE verification files, and many other things that you didn't intend to be rewritten.

You can add a negative match RewriteCond to stop them being rewritten, or make your script deal with the request and serve the expected content. Your choice.
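
A negative-match RewriteCond for the verification file might look like the sketch below. It reuses the .htm/.html redirect posted earlier in the thread and adds one condition in front of it; the google[^/]*\.html pattern is only a guess at the verification file's name and would need adjusting to the real filename.

```apache
# Leave the search engine verification file alone (hypothetical name pattern)
RewriteCond %{REQUEST_URI} !^/google[^/]*\.html$
# Otherwise redirect direct client requests for .htm/.html to extensionless URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*[^.]+\.html?
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [R=301,L]
```

Both conditions must be true for the rule to fire, so a request for the verification file falls through and is served from its real .html filename.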