RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html\ HTTP/
RewriteRule ^index\.html$ [%{HTTP_HOST}...] [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html\ HTTP/
RewriteRule ^index\.htm$ [%{HTTP_HOST}...] [R=301,L]
- I have no experience of using .htaccess before today!
1. It doesn't work for index files in folders. It only works for the root.
2. It doesn't fix the domain to be www, so in combination with your other rule that does, it will create a redirection chain. This rule should fix the URL to be www as well as remove the index filename, all in one step.
The code you need is in a thread active yesterday and today, here in this forum.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html.*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html$ [domain.com...] [R=301,L]
It still doesn't want to work!
Here's the current .htaccess file:
RewriteEngine on
RewriteOptions MaxRedirects=40
RewriteBase /
RewriteCond %{QUERY_STRING} ^a=j$
RewriteRule ^(.*)$ /joinus.htm? [L,R=301]
RewriteCond %{QUERY_STRING} ^a=g$
RewriteRule ^(.*)$ /example.html? [L,R=301]
RewriteRule ^(.+\.html?$) /cgi-bin/example.cgi [NC,L]
RewriteRule ^sitemap.xml$ /cgi-bin/example.cgi?a=sX [NC,L]
RedirectMatch 301 /links/(.*) http://www.example.com/notfound.shtml
Redirect gone /$myEmail
Redirect gone /&usg=ALkJrhgz6MRDcFkp-kcCoKgNS9ERG-CLtQ
### Redirect the non www version to the www version ###
Options +FollowSymLinks
RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com/ [L,R=301]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
Any idea what I'm missing here?
In all, it seems you may be copying and pasting code without fully understanding it. Be warned that since this is server configuration code, that is a recipe for disaster.
Did you completely flush your browser cache before testing?
Options +FollowSymLinks
RewriteEngine on
RewriteOptions MaxRedirects=3
# RewriteBase /
#
# Return 410-Gone for myEmail URLs
RewriteRule myEmail - [G]
#
# Return 410-Gone for specific query string
RewriteCond %{QUERY_STRING} &usg=ALkJrhgz6MRDcFkp-kcCoKgNS9ERG-CLtQ
RewriteRule .* - [G]
#
# Internally rewrite links URLs to non-existent path to force a 404-Not Found response
RewriteRule ^links/ /some-path-that-does-not-exist [L]
#
# Externally redirect request with specific query strings
RewriteCond %{QUERY_STRING} ^a=j$
RewriteRule .* /joinus.htm? [R=301,L]
RewriteCond %{QUERY_STRING} ^a=g$
RewriteRule .* /example.html? [R=301,L]
#
# Externally redirect direct client requests for "<any-directory>/index.html" and
# "<any-directory>/index.htm" to "<any-directory>/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
#
# Internally rewrite specific URLs to example.cgi
RewriteRule ^([^/]*/)*[^.]+\.html?$ /cgi-bin/example.cgi [L]
RewriteRule ^sitemap\.xml$ /cgi-bin/example.cgi?a=sX [L]
#
# Externally redirect the non www hostname to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Externally redirect to fix up FQDN and appended port numbers
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.¦:[0-9]*) [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
In several cases, you've got internal rewrites that accept variant URLs -- for example, the "htm" and "html" variations. Be advised that best practices indicate that you should canonicalize these variant URLs before rewriting them. In other words, if possible, externally redirect all .htm URLs to .html URLs, and then internally rewrite only .html URLs. This will avoid the so-called "duplicate-content penalties" that result when a single resource (e.g. file) can be reached by more than one URL.
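As an illustration only (assuming the www.example.com hostname and the rule order shown above, with the redirect placed after the index-file redirect but before the internal rewrites to example.cgi), such a canonicalization redirect might look like this:
#
# Externally redirect requests for "<any-directory>/<page>.htm" to "<any-directory>/<page>.html"
RewriteRule ^(([^/]*/)*[^.]+)\.htm$ http://www.example.com/$1.html [R=301,L]
With a rule along those lines in place, the internal rewrite to example.cgi could then be narrowed to match only ".html" instead of ".html?".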
Replace any and all broken pipe "¦" characters above with solid pipe characters before use; Posting on this forum modifies the pipe characters.
Jim
Thank you for the above, it works perfectly.
Can I clarify one thing:
(\.¦:[0-9]*)
Do I need to modify the above in some way? If so, to what?
I have a second question too: the site in question uses a Perl CMS to spit out .htm and .html pages.
I've noticed that if you enter www.example.com/example, the correct 404 is returned.
However, if that is changed to either www.example.com/example.htm or www.example.com/example.html, a 200 is returned.
Is this something that can be addressed in the .htaccess file?
Can I clarify one thing:
(\.¦:[0-9]*)
Do I need to modify the above in some way? If so, to what?
From my post above:
Replace any and all broken pipe "¦" characters above with solid pipe characters before use; Posting on this forum modifies the pipe characters.
I've noticed that if you enter www.example.com/example, the correct 404 is returned. However, if that is changed to either www.example.com/example.htm or www.example.com/example.html, a 200 is returned.
We need more details here: What specifically is wrong with the www.example.com/example.html URL? Does it not exist as a "physically-existing" (static) file, or is there no data in the CMS to generate a page for that URL?
If the latter is true, then the CMS must be modified so that it handles this condition by returning a 404-Not Found or 410-Gone response -- 404 if the data to generate the page never existed, and 410 if it once did exist but has been obsoleted or removed.
An opportunity also exists here for further improvement: If the page once did exist but you intentionally obsoleted it, then the CMS database could support an entry that contains a replacement URL for use when the page has been obsoleted, but you now have another page that addresses the same or a very similar subject. Adding this kind of meta-data to your database allows centralized administration and provides really good support for bogus, obsoleted, or replaced resources.
What you're likely talking about here is a fairly common problem: On script-based sites where almost all URL requests are handled by the script, it is up to the script to take over 404/410/301 response handling for any and all URLs that it accepts for handling.
It's fairly common that this is not properly implemented, even on "professionally-written" CMS packages and shopping carts. This results in potentially-massive duplicate content problems and "shallow" spidering; Once a search engine discovers that your site returns a 200-OK response for just about *any* requested URL, it will artificially limit its spidering depth to avoid your "infinite URL-space" and this can result in sparse/stale search results listings and less-than-optimal search results ranking for your site(s). It is, in short, a very serious problem.
Jim
Let me clarify my post above:
www.example.com/example.html does not exist - either physically or in the CMS - but a 200 is returned, not a (correct) 410.
Both the duplicate content and shallow spidering are problems I'm trying to resolve right now.
The CMS was custom written a while back and my current web guy is still not too familiar with it.
Short of commissioning a new CMS, is there way to resolve the response handling issue via .htaccess, or am I out of luck on this one?
It should not be "too big of a deal" to a programmer familiar with CMSes, databases, and sending server response headers. Find the spot in the CMS code where you can decide if "the requested db entry exists." If it exists, proceed as normal. If not, release the db connection, send an error response to the client, and exit.
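To sketch that decision point, here is roughly what it might look like in a CGI-style Perl script. The lookup_page(), replacement_url(), was_removed(), and render_page() routines are placeholders for whatever your CMS and its database layer actually provide, so treat this as an outline rather than drop-in code:
use strict;
use warnings;
use CGI;

my $q    = CGI->new;
my $path = $ENV{REQUEST_URI} || '/';

# Hypothetical database lookup: returns page data if an entry exists, undef otherwise.
my $page = lookup_page($path);

if (defined $page) {
    # The requested db entry exists: proceed as normal.
    print $q->header(-type => 'text/html');
    print render_page($page);
}
elsif (my $new_url = replacement_url($path)) {
    # Obsoleted, but a replacement URL is recorded (per the replacement-URL meta-data idea above): 301 to it.
    print $q->redirect(-uri => $new_url, -status => '301 Moved Permanently');
}
elsif (was_removed($path)) {
    # The entry once existed but has been removed: return 410-Gone.
    print $q->header(-status => '410 Gone', -type => 'text/html');
    print "<html><body><h1>410 Gone</h1></body></html>";
}
else {
    # No data ever existed for this URL: return 404-Not Found, not a blank 200-OK page.
    print $q->header(-status => '404 Not Found', -type => 'text/html');
    print "<html><body><h1>404 Not Found</h1></body></html>";
}

# Release the db connection here if your CMS holds one open, then exit.
exit;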
Jim
Many threads here reflect a rush past requirements specification to coding, and this often leads to the correct answer to the wrong problem; For example, code is sometimes posted that will work, but that implements something which should not be done at all, for reasons such as search engine ranking or usability.
One of my favorite cartoons from a previous career showed the lead programmer heading upstairs from the "software engineering room." The caption read, "I'll go upstairs and find out what they want. The rest of you start coding!" :)
Jim
As they have no idea what they are doing, most people just jump in and try stuff in the hope that it will work.
I see many things that appear to work (done this myself, several times) but which will cause major problems with other unexpected URL inputs that weren't tested when implemented.
I have learnt to try a huge range of both expected and unexpected URLs when testing code out, but still manage to miss some on occasions.
I have recently revisited some of my old code from a few years ago and spotted some howlers. I have rewritten much of it to be more efficient, and expect to revisit it again later on.