Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
coding for .htaccess file
When can you remove old rewrite pages from the .htaccess file?
librarian




msg:4564745
 7:33 pm on Apr 14, 2013 (gmt 0)

Hi,
I have a couple of questions about the .htaccess file on a web site I created some time ago. It was a rebuild of a site the owner had created. Those pages had been in the search engine databases for several years when I took over. I created new pages and redirected the old pages (id12.html example) to the new pages that are .htm based. What I am wondering is, can I safely delete all the rewrite commands for pages (like id12.html) for the old site?

This is why. I have been looking at commands I can use in the .htaccess file to remove the index.htm leaving only the /. I did find some examples in a thread that I could use.

# Redirect index in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(\.[a-z0-9]+)?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index(\.[a-z0-9]+)?$ http://www.example.com/$1? [R=301,L]

It is more than I need since I only use the index.htm on the homepage. But the command does seem to work on the examples I tried.

In the same thread was a recommendation for a command to change html to htm on pages. This command should go after the index.htm removal command. Then the www to non www command (see below). This order made sense to me.

The rewrites for the old id12.html are below all the changes. So they come up as 404 now. It did work correctly before when the html to htm command was at the end of the rewrite list. Would it be safe to remove the old .html rewrites after 2 to 3 years? I haven't seen anything come through for a long time. Then I could place the html to htm command near the beginning.

I also have another question. Years ago, and I do mean years ago (1998), on another website I created a .htaccess file. At the time there was a lot of discussion about whether you should go non www to www or the reverse. I chose www to non www. Now everything I've been reading in the threads today seems to say I need to change things around. Would there be a real reason to do this? Problems? Even Google at one time in the Webmaster Tools account let me choose which way I wanted them to display the domain name. I chose without the www. To this day they seem to be doing this.

Any suggestions you might have would be very helpful.

Thank you.
Rhoda

 

phranque




msg:4565805
 12:19 am on Apr 18, 2013 (gmt 0)

generally your directives should go in order of most specific redirects to least specific, with your hostname canonicalization redirect last, followed by your internal rewrite directives, ordered from most specific to most general.

if you have any access control directives, these should precede the redirects.

do you still have any inbound links to the old urls?
if so you should keep those redirects in place.
i'm sure you can get your new rules to work with the old directives in place.

the www vs non-www discussion is ongoing and depends on a lot of things.
have you looked at previous WebmasterWorld threads on this subject?
site:webmasterworld.com www vs non-www - Google Search:
http://www.google.com/search?num=100&q=site%3Awebmasterworld.com%20www%20vs%20non-www [google.com]

lucy24




msg:4565888
 4:19 am on Apr 18, 2013 (gmt 0)

In the same thread was a recommendation for a command to change html to htm on pages.

Uh... Why? Are you sure you didn't blunder across an article written in 1998, when it was possible that some browsers on That Other Platform couldn't deal with filenames other than 8+3?

If you include the extension in your URL, use the form that your page files actually have. There are valid reasons for rewriting from static to dynamic-- an URL in .html might serve content from .php?long-icky-query for example --but I can't think of any earthly reason for switching from html to htm or vice versa.

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(\.[a-z0-9]+)?[^\ ]*

Waaaaayy overkill. You're fine up to "index". No reason even to code for the possibility of /directory/index and-that's-all unless you actually did once have :: shudder :: URLs in this form. And no reason to accept requests for extensions you don't use; what human would do that? You certainly don't need to help out robots who come in asking for bad extensions :) At most it's index\.(php|html?)\ HTTP and that's enough. You can add php\S+ --or php[^\ ]+ depending on how grumpy your server is --if-and-only-if you actually do get requests containing queries. (This is frankly not likely, because where would they come from?)

BUT WAIT! On most Apache installations you don't need a %{THE_REQUEST} element at all. A simple [NS] flag attached to the body of the rule will do the job just fine. The idea is simply to exclude any mod_dir activity.

generally your directives should go in order of most specific redirects to least specific

There are really two layers of nesting. First go from most severe to least severe:
access control (the ones ending in [F])
files that no longer exist (flag [G])
redirects (flag [R=301])
and finally the bare rewrites (generally [L], sometimes something fancier like [P])
and then finalfinally ;) you may have some just-passing-through rules with flags like [CO] or [E].

Then, within each of those groups, list requests from most specific to most general. The idea is to avoid redirecting anyone twice, and also to avoid redirecting someone who is destined to get locked out within mod_rewrite. Sometimes you'll need to put individual rules somewhere other than their default location, but start with this pattern.
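As a rough sketch of that layout, assuming www.example.com is the canonical host. Every pattern and path below is a made-up placeholder for illustration, not a rule taken from this thread:

```apache
RewriteEngine On

# 1. Access control: refuse the request outright (403)
RewriteRule ^private/ - [F]

# 2. Pages that are gone for good (410)
RewriteRule ^old-catalog/ - [G]

# 3. External redirects, most specific first
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,NS,L]

#    ... with hostname canonicalization last among the redirects
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 4. Internal rewrites (nothing visible to the visitor)
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]

# 5. Pass-through rules that only set an environment variable
RewriteRule \.pdf$ - [E=is_pdf:1]
```

The point of the grouping is that a request which is going to be refused never gets redirected first, and a request which needs a redirect gets exactly one.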

g1smd




msg:4565924
 8:24 am on Apr 18, 2013 (gmt 0)

You've stumbled upon code for a universal index redirect.

(\.[a-z0-9]+)? -- this part matches .php5 and so on.

[^\ ]* -- this part matches any attached parameters.

There's nothing wrong in using that code, but if you only ever see requests for a single extension such as .html or .php then the code can be simplified a little as it will then perform slightly faster.

In place of [A-Z]+\ I use [A-Z]{3,9}\
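Putting those pieces together, a narrower version for a site that only ever uses .htm or .html might look like the sketch below. The extension list and the www.example.com target are assumptions carried over from the example quoted at the start of the thread, and the optional query-string group is one possible way of writing the "attached parameters" part.

```apache
# Match only a direct client request for .../index.htm or .../index.html,
# optionally followed by a query string. {3,9} allows method names from
# three letters (GET) up to nine letters.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?(\?[^\ ]*)?\ HTTP/
# The trailing "?" on the target drops any query string from the redirect
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
```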

librarian




msg:4566092
 6:52 pm on Apr 18, 2013 (gmt 0)

Hi,
I want to thank everyone for replying. Finding commands by researching in the forums may have made things more complex than what I really need. In spite of all the reading many of the commands still seem like Greek. The site is simple, small in comparison to sites today. So I'm asking: what are the simplest commands to use for the following?

example.com/index.htm to example.com/ (there is no other index file on the site.)

For the command to move from html to htm, this was a simple command that I found (years ago) and it seems to have worked though it runs at the end of the .htaccess file.

RewriteBase /
RewriteRule ^(.*)\.html$ $1.htm [R=permanent]

I found this command below recently and tried it. It also worked but not at the beginning area of the .htaccess file.

# Redirect all .html requests to .htm on canonical host
RewriteRule ^(.*)\.html$ http://www.example.com/$1.htm [R=301,L]

After the above command at the top of the .htaccess file I placed:

RewriteCond %{HTTP_HOST} ^www.example.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

Would it confuse the search engines if I change things around to go from non www to www?

The order of the above commands made sense in the .htaccess file but the rewrites that followed caused a problem. Because the html to htm command was now at the top of the .htaccess file, the rules for the old original site files came after it. They were rewrites from the original pages to the new pages on the new site:

RewriteRule ^id1.html$ / [R=301,L]
RewriteRule ^id2.html$ /xxxx/filea.htm [R=301,L]
RewriteRule ^id3.html$ /xxxx/fileb.htm [R=301,L]
RewriteRule ^id4.html$ /xxxx/fileb.htm [R=301,L]

The html was changed to htm first, so the request came up as a 404. There were 20 of these types of pages. So I was forced to put the html to htm command back at the end of the .htaccess file. This is why I thought to remove the old html rules after several years.

I found my answer to the question of removing the files. I've been trying to get into the Bing Webmaster Tools account for a long time. It was complicated by an old account but I finally got in on Monday. What I found were close to 500 301 and 404 files that Bing says are 404s. The 301s are mostly 404s and the 404s are mixed directories for actual files on the site. Among these 404s are lots of the old html files. So I have to find a way to clean up this mess including the old htmls in a .htaccess file. Google has never presented problems like this.

I apologize if this was too long. I wanted to explain as much as possible. I've been reading in Webmaster World since 1998 (I think) and have learned a lot but there is always another problem to research.

Thank you for your help.

Rhoda

g1smd




msg:4566103
 8:21 pm on Apr 18, 2013 (gmt 0)

Make sure you escape literal periods in RegEx patterns. You have missed a few.

The target of a redirect should include the canonical hostname too.

If you have redirected from www to non-www for many years then I would not be tempted to suddenly reverse the direction. If you did, I would expect significant traffic loss for several months.

For consistency, replace
[R=permanent] with [R=301,L]
lucy24




msg:4566130
 9:35 pm on Apr 18, 2013 (gmt 0)

For the "index.html" redirect:

Comment-out the RewriteCond and instead add the NS flag to your existing [R=301,L] package. If the rule works as intended, you can dump the Condition. Never put something in a Condition if it can go in the body of the Rule. (In theory, NS is all you ever need in an index-redirect rule. In practice, there have been weird exceptions.)

In the Rule itself, you don't need to consider parameters, so all you need is a final
\.(html?|php)
with a pipe-separated list of only the extensions that your site actually uses. All others can take a 404 and lump it-- unless you've got desirable links giving the wrong name, and you can't persuade them to change.
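A sketch of what that could look like as a single rule with no Condition at all. It assumes example.com without www is the canonical host, as chosen earlier in the thread, and the extension list is only an example to be trimmed to what the site really uses:

```apache
# Redirect index.htm, index.html or index.php in any directory to the
# directory itself. NS tells mod_rewrite to skip internal subrequests,
# such as the one mod_dir makes when it looks for the index file.
RewriteRule ^(([^/]+/)*)index\.(html?|php)$ http://example.com/$1 [R=301,NS,L]
```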

The html/htm redirect is only needed if, again, you are getting actual requests from humans using the wrong extension. Or if a search engine has fallen in love with the wrong form and keeps requesting it year after year. Come to think of it, if you never use .html and the only requests are from search engines, you may be better off slapping on a general 410. It might persuade them to stop bugging you.

librarian




msg:4566431
 5:38 pm on Apr 19, 2013 (gmt 0)

Hi,
I want to thank everyone for your suggestions. They all made sense to me. I've applied many of them and they have worked in three different browsers. Here is what I changed:

# Redirect index in any directory to root of that directory
# RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(.[a-z0-9]+)?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index(.[a-z0-9]+)?$ http://example.com/.(html?|htm) [R=301,L, NS]

# Redirect all .html requests to .htm on canonical host
RewriteRule ^(.*).html$ http://www.example.com/$1.htm [R=301,L]

RewriteCond %{HTTP_HOST} ^www.example.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

RewriteRule ^">Home</a>$ http://example.com/ [R=301,L]
RewriteRule ^_wsn/page4.html$ http://example.com/xxxx/dddd-page-4.htm [R=301,L]
RewriteRule ^id1.html$ http://example.com/ [R=301,L]
RewriteRule ^id10.html$ http://example.com/xxxx/dddd-page-1.htm [R=301,L]

These commands are at the top of the file. Hopefully this is all correct. But it did seem to work no matter what I tested.

About the 500 Bing 404s. I went back today and they were all gone. But Bing said there were 808. I downloaded the CSV file so I still have them. In their place Bing now shows a new one. It is garbled like the others with more than one incorrect directory listed before the correct directory and file. This one is similar to the missing ones. I've decided to do nothing with the old 404s and add the new ones as they appear.

I don't plan to change from non www to www. Thanks for the opinion on that change.

Your help is much appreciated, lucy24, g1smd, and phranque.

Rhoda

g1smd




msg:4566447
 7:12 pm on Apr 19, 2013 (gmt 0)

Your first rule cannot work. The rule target
http://example.com/.(html?|htm) should be http://example.com/$1

The last three rules can never work, as previous rules will have already redirected those requests elsewhere. The four "most specific" rules at the end of your list must be listed first.

Escape all literal periods in RegEx patterns. A "." matches ANY character whereas you need "\." to match only a literal period.

Your non-www/www redirect doesn't cater for all non-canonical cases.
Replace
^www.example.com [NC] with !^(www\.example\.com)?$

The rules should be in this order:
- specific "one page" rules (all four)
- index redirect
- html to htm
- non-www/www
Every rule needs the [R=301,L] flag, and the rule target must always include the canonical protocol and hostname.
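Assembled in that order, the file might look something like the sketch below. It reuses three of the one-page rules posted above and assumes example.com without www is the canonical host, which is why the host condition is written against example\.com here. Flags are written without spaces inside the brackets, since a space there splits the flag list into separate arguments.

```apache
RewriteEngine On

# 1. Specific one-page redirects
RewriteRule ^_wsn/page4\.html$ http://example.com/xxxx/dddd-page-4.htm [R=301,L]
RewriteRule ^id1\.html$ http://example.com/ [R=301,L]
RewriteRule ^id10\.html$ http://example.com/xxxx/dddd-page-1.htm [R=301,L]

# 2. Index redirect (index.htm or index.html in any directory)
RewriteRule ^(([^/]+/)*)index\.html?$ http://example.com/$1 [R=301,NS,L]

# 3. html to htm
RewriteRule ^(.*)\.html$ http://example.com/$1.htm [R=301,L]

# 4. Anything still arriving on a non-canonical hostname
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

If NS alone turns out not to stop the index rule from firing on the server's own index lookup, a %{THE_REQUEST} condition can be put back in front of rule 2.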

librarian




msg:4566468
 8:33 pm on Apr 19, 2013 (gmt 0)

Thank you g1smd for finding the errors and explaining why. I reversed all the specific pages and put the three general rules at the end followed by the error document 404 page. Here is this command:

ErrorDocument 404 http//example.com/404errorpage.htm

I've never seen it with this included [R=301,L]. Is it required?

Here is a part of the file with changes I made:

RewriteRule ^xxxx/yyyy.htm$ http://example.com/xxxx/zzzz.htm [R=301,L]
RewriteRule ^xxxx/ssss.htm$ http://example.com/ [R=301,L]

# Redirect INDEX in any directory to root of that directory
RewriteRule ^(([^/]+/)*)index(.[a-z0-9]+)?$ http://www.example.com/$1 [R=301,L, NS]

# Redirect all .HTML requests to .htm on canonical host
RewriteRule ^(\.*).html$ http://example.com/$1.htm [R=301,L]

# Redirect all WWW to NON WWW
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

ErrorDocument 404 http//example.com/404errorpage.htm

A question. If these three files were in the wrong place so some of the request shouldn't have worked, why did every sample I tried work?

Another question. I looked at the .htaccess file for my site built in 1998-. That .htaccess file had a long command to change all capitals in a request to lower case. I had forgotten that there had been a problem I had needed a solution for. So I tried a request in capitals on the newer site and it locked up. It gave a 500 internal server error. Is there a better/easier solution?

Here is what I found in WebmasterWorld after much searching years ago:
# Uppercase-to-lowercase URL redirect - .htaccess-only solution
#
# Modified with work-around for Apache 1.3 mod_rewrite path bug
# (See [archive.apache.org...]
#
# If no uppercase characters in current URL-path, skip next 28 rules
RewriteRule ![A-Z] - [S=28]
#
# Else replace first instance of each uppercase letter present
RewriteRule ^([^A]*)A([^<]*) $1a$2<
RewriteRule ^([^B]*)B([^<]*) $1b$2<
RewriteRule ^([^C]*)C([^<]*) $1c$2<
RewriteRule ^([^D]*)D([^<]*) $1d$2<
RewriteRule ^([^E]*)E([^<]*) $1e$2<
RewriteRule ^([^F]*)F([^<]*) $1f$2<
RewriteRule ^([^G]*)G([^<]*) $1g$2<
RewriteRule ^([^H]*)H([^<]*) $1h$2<
RewriteRule ^([^I]*)I([^<]*) $1i$2<
RewriteRule ^([^J]*)J([^<]*) $1j$2<
RewriteRule ^([^K]*)K([^<]*) $1k$2<
RewriteRule ^([^L]*)L([^<]*) $1l$2<
RewriteRule ^([^M]*)M([^<]*) $1m$2<
RewriteRule ^([^N]*)N([^<]*) $1n$2<
RewriteRule ^([^O]*)O([^<]*) $1o$2<
RewriteRule ^([^P]*)P([^<]*) $1p$2<
RewriteRule ^([^Q]*)Q([^<]*) $1q$2<
RewriteRule ^([^R]*)R([^<]*) $1r$2<
RewriteRule ^([^S]*)S([^<]*) $1s$2<
RewriteRule ^([^T]*)T([^<]*) $1t$2<
RewriteRule ^([^U]*)U([^<]*) $1u$2<
RewriteRule ^([^V]*)V([^<]*) $1v$2<
RewriteRule ^([^W]*)W([^<]*) $1w$2<
RewriteRule ^([^X]*)X([^<]*) $1x$2<
RewriteRule ^([^Y]*)Y([^<]*) $1y$2<
RewriteRule ^([^Z]*)Z([^<]*) $1z$2<
#
# Set the redirect-required flag since at least one
# uppercase letter must have been replaced to get here
RewriteRule . - [E=Redirect:Yes]
#
# If any uppercase letters remain in the URL-path,
# then restart the mod_rewrite code from the top
RewriteRule [A-Z][^<]*< - [N]

When tested this has changed upper case to lower case on the older site.

Thank you for your help.

Rhoda

g1smd




msg:4566484
 10:00 pm on Apr 19, 2013 (gmt 0)

You are still missing many places where you need to ESCAPE LITERAL PERIODS in RegEx patterns.

ErrorDocument 404 http//example.com/404errorpage.htm
This will generate a 302 redirect. The Apache manual warns that the ErrorDocument directive must NOT include a protocol or hostname.
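A sketch of the local form, assuming the error page sits at the site root under the file name quoted above:

```apache
# A local URL-path (leading slash, no protocol or hostname):
# Apache serves this page itself and the response keeps its 404 status
ErrorDocument 404 /404errorpage.htm
```

ErrorDocument is not a mod_rewrite directive, so it takes no [R=301,L] or any other flags.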

lucy24




msg:4566509
 1:33 am on Apr 20, 2013 (gmt 0)

If these three files were in the wrong place so some of the request shouldn't have worked

Putting rules in the wrong order doesn't always prevent them from working. Sometimes it simply means that you can end up with more than one redirect, while if the rules were in the optimal order there would be only one.

Simple example. Rules in this order:

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

RewriteRule ^(([^/]+/)*)index\.html http://www.example.com/$1 [R=301,NS,L]

Someone comes along and requests
http://example.com/directory/index.html
First rule kicks in; they are redirected to
http://www.example.com/directory/index.html
This time they sail past the first rule and slam into the second one, resulting in a THIRD browser request, this time for
http://www.example.com/directory/

If the rules had been in most-specific-to-least-specific order, the request would first have run into the "index.html" redirect. And this rule by itself would also have taken care of the non-canonical domain name.
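The same two rules with only the order swapped, as a sketch:

```apache
# Most specific first: one redirect fixes both the path and the hostname
RewriteRule ^(([^/]+/)*)index\.html http://www.example.com/$1 [R=301,NS,L]

# Least specific last: catches whatever is still on the wrong hostname
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

With this order a request for http://example.com/directory/index.html is answered with a single 301 pointing at http://www.example.com/directory/ and the browser makes two requests in total rather than three.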

Now add your html-to-htm redirect to the list-- but add it in the wrong place. That's yet another browser request before the poor machine is finally allowed to receive a page. And if your human user is on satellite or dialup, they may start noticing delays as the browser has to make four separate round trips before anything shows up onscreen.

Another and worse result of rules in the wrong order is that something will work-- but it will work too well, and will end up redirecting things that were already supposed to be finished, wrapped up, taken care of.

So I tried a request in capitals on the newer site and it locked up. It gave a 500 internal server error. Is there a better/easier solution?

Better/easier than a 500 error? I sure hope so :) Without seeing your error logs it's impossible to know what exactly the problem was. All it takes is a comma in the wrong place.

Before reading this post I detoured to investigate the set of built-in RewriteMaps: toupper, tolower, escape, noescape. A resounding 500 from MAMP was all it took to draw my attention to the bit I'd previously overlooked: these nifty rules can only be used in a config file. That explains why they're not discussed more often in this forum. Ouch. But that illustrates the other extreme of Things That Can Yield A 500 Error: trying a rule that simply isn't allowed. Apache is not HTML; it can be terminally unforgiving.
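For anyone who does have access to the server configuration, a minimal sketch of the built-in tolower map follows. The map name lc, the virtual-host layout and the example.com target are illustrative assumptions; the RewriteMap line is the part that is not allowed in .htaccess.

```apache
<VirtualHost *:80>
    ServerName example.com
    RewriteEngine On

    # Declare a map named "lc" backed by the internal tolower function
    RewriteMap lc int:tolower

    # If the requested path contains any uppercase letter,
    # redirect to the all-lowercase version of the same path
    RewriteCond %{REQUEST_URI} [A-Z]
    RewriteRule ^/?(.*)$ http://example.com/${lc:$1} [R=301,L]
</VirtualHost>
```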

librarian




msg:4566874
 9:44 pm on Apr 21, 2013 (gmt 0)

Hi,
It's Sunday afternoon and nothing is working. I had to go back to an earlier .htaccess file.

Lucy: I understood what you explained about the order the rules should be placed in. Not sure the order is correct but here is what I came up with:

# Redirect all WWW to NON WWW
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

# Redirect INDEX in any directory to root of that directory
RewriteRule ^(([^/]+/)*)index(.[A-Z0-9]+)?$ http://example.com/$1 [R=301,L, NS]

(In order to have the rules follow in the order I changed the target above from www.example.com to example.com because this is what we chose years ago. Not sure if it's a problem to remove the www but I didn't want to put the www back after the previous rule removed it.)

# Redirect all .HTML requests to .htm on canonical host
RewriteRule ^(.*).html$ http://example.com/$1.htm [R=301,L]

This is what I had been using:

# RewriteBase /
RewriteRule ^(.*)\.html$ $1.htm [R=301,L]

Just tried this and it is not working. I need to reboot and clear the old tries. Thankfully the site is working. It's all these odd requests for html, index.htm that aren't. The single page redirects do work. The worst part is I can't see the problems/problem.

The INDEX rule seemed to work for a while but it seems to have a problem with capitals. It didn't remove the index.htm page. Instead it put it in capitals.

I spent the afternoon reading Apache tutorials. I now know more about the \. literal period but I'm not sure if I have too many or not enough at the moment.

Tomorrow has to be better.

Rhoda

lucy24




msg:4566918
 2:34 am on Apr 22, 2013 (gmt 0)

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{HTTP_HOST} !^$

The second Condition is not needed, because the ? in the first Condition does the same job. "Either exactly such-and-such or exactly nothing."

# Redirect all WWW to NON WWW

This needs to be your LAST redirect. It picks up only those requests that have not already been intercepted by specific redirects earlier.

It doesn't matter whether you personally choose to go with or without www. But it's absolutely essential for all your rules to use the same form. At best, inconsistency can result in multiple redirects. At worst, it creates an infinite loop.

RewriteBase /

Most people in most circumstances never need to think about RewriteBase at all. It is applied by Apache ONLY when the target of a RewriteRule begins in a raw directory name-- no leading slash, let alone protocol-plus-domain. And even then, plain / is the default. Since all your RewriteRules will have targets beginning in either http:// or / depending on whether they are meant to produce redirects or rewrites, the RewriteBase will never apply.

Are you actually getting requests in "html" for pages that really end in "htm"? You do need to code for all reasonable problems; that's why every site includes the www. and "index.html" redirects. But there's a long list of RewriteRules that you don't need to include unless, well, it turns out you need them. There are a heck of a lot of ways a request can be malformed: unwanted path after ".html", multiple directory slashes and so on. If you coded for every last possibility, your visitors would be in htaccess all day.

The INDEX rule seemed to work for a while but it seems to have a problem with capitals. It didn't remove the index.htm page. Instead it put it in capitals.

Uh-oh. Something somewhere else is doing this.

Now, speaking of index: You've got a potential ordering-of-rules issue with the pair of redirects, "index.xtn" and ".html", since the two possibilities overlap. So what you'll need is FIRST

RewriteRule {blahblah}index\.html? {target here}

and THEN

RewriteRule ({blahblah}\.htm)l {target ending in $1}

This way, whether the request is for "index.html" or "index.htm" you pick them up in the same rule. Throw in "index.php" and other extensions only if you actually use them. (This goes on the long list of Rules You Don't Need Unless You Need Them ;) Requests involving extensions that you've simply never used can be left to pick up a 404-- unless they come from desirable links that you want to hold on to.)

librarian




msg:4567102
 7:22 pm on Apr 22, 2013 (gmt 0)

Hi Lucy,
Thank you for your reply. After reading through your post this is what I came up with for order:

# 301 permanent redirect index.html(htm) to ONLY the root (not all folders)
RewriteRule ^index.html$ http://example.com/ [R=301,L]
RewriteRule ^index.htm$ http://example.com/ [R=301,L]

I wasn't sure what you meant should replace blahblah should be replaced with. What I did might be wrong. It did not work for Opera. A "moved permanently" came up. There was a link indicating you could go to the new page. There was no new page. This comes up most of the time.

# Redirect all .HTML requests to .htm on canonical host
RewriteRule ^(.*).html$ http://example.com/$1.htm [R=301,L]

"Moved permanently" comes up here in Opera. Chrome did not work. FireFox didn't work. index.html together did not work. When I try html alone Chrome says it encountered a redirect loop.

# Redirect all WWW to NON WWW
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

This rule seemed to work the best. Opera and FireFox removed the www but Chrome left it there.

I would be willing to give up on the html to htm but think the index.htm to / would be good to consolidate page rank. I occasionally see them both listed in a list of site pages.

I removed the html and index.htm from the .htaccess file. Now the homepage with index.htm comes up as a 404 page. Is it possible all the old commands are still in the memory? Could it clear up in time?

Bing also presents problems. There are 304 new error pages with 50 some 301 redirects most of which are 404s. HTML does not show up really with this new batch of 404s. There are a few html pages in the 301 file. Mostly the error pages list the domain name/directory 2/directory 4/real directory/real file name.htm. Some errors have more incorrect directories than the example. Bing has taken names of directories that exist on the site and then mixed them in the middle of the real domain/directory1/index.htm or any other correct page address.

Is this possible to fix in the .htaccess file or should I just forget it? I would like to work on getting more pages in Bing.

Thank you for all the help. I appreciate all your time.

Rhoda

phranque




msg:4567133
 10:00 pm on Apr 22, 2013 (gmt 0)

are you clearing your cache for each browser when testing your configuration changes?

lucy24




msg:4567134
 10:03 pm on Apr 22, 2013 (gmt 0)

# 301 permanent redirect index.html(htm) to ONLY the root (not all folders)
RewriteRule ^index.html$ http://example.com/ [R=301,L]
RewriteRule ^index.htm$ http://example.com/ [R=301,L]

You don't need the two separate rules. All you need is
RewriteRule ^index\.html?$ http://example.com/ [R=301,L]
This will pick up both. The question mark means the final "l" is optional, so the rule will apply to both html and htm.

I wasn't sure what you meant should replace blahblah should be replaced with. What I did might be wrong.

You may have noticed that one difference between WebmasterWorld and That Other Forum is that we're not supposed to write your code for you. Unless we get lazy or get tired of explaining that this is where you put the hook, and that is where you hold the pole, and over there is where you put the fish after you've caught it, and ... et cetera.

Yes, OK, so "et cetera" is not significantly more informative than "blahblah".

It did not work for Opera. A "moved permanently" came up. There was a link indicating you could go to the new page. There was no new page. This comes up most of the time.

Urk. Well, that definitely shouldn't be happening. Unless you've got some exceedingly weird browser settings. Note that a lot of browser-specific issues can be solved by the Universal Fix For What Ails You: empty your cache. If all this is happening somewhere other than your live site, you can also set page caching to expire immediately; an obedient browser will then never cache anything, and will make a fresh request every time. This can save a lot of aggravation.
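One way to set that up on a test copy of the site, sketched on the assumption that mod_headers is loaded and that .htaccess is allowed to set headers. It is meant for testing only and should come out again before the file goes live:

```apache
<IfModule mod_headers.c>
    # "always" is intended to attach the header to redirect responses too,
    # not just to pages that were served successfully
    Header always set Cache-Control "no-store, no-cache, must-revalidate, max-age=0"
</IfModule>
```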

# Redirect all .HTML requests to .htm on canonical host
RewriteRule ^(.*).html$ http://example.com/$1.htm [R=301,L]

Whoops! Forgot to escape the literal period again.

Now, you will not get a lot of requests for "blahblahxhtml" and "blahblah/html" and all the other things that a single . can represent. In fact what's most likely to happen is a type-in where they forgot the . entirely, so someone asks for
www.example.com/pagenamehtml
and then thanks to the unescaped . they get redirected to
www.example.com/pagenam.htm
when they're better off getting a 404 in the first place.
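Side by side, as a sketch using the example.com target from the rule quoted above:

```apache
# Unescaped: "." matches any character, so /pagenamehtml is redirected too
# RewriteRule ^(.*).html$ http://example.com/$1.htm [R=301,L]

# Escaped: only a request that really ends in ".html" matches
RewriteRule ^(.*)\.html$ http://example.com/$1.htm [R=301,L]
```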

# Redirect all WWW to NON WWW
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*) http://example.com/$1 [R=301,L]

This rule seemed to work the best. Opera and FireFox removed the www but Chrome left it there.

Oh, dear. This rule should not work at all-- unless there was a typo in your post and the rule itself says something different. As you've entered it here, the rule says "if the request is for anything other than www.example.com, then redirect to example.com". This will obviously create an infinite-redirect loop, since "example.com" is not "www.example.com". But there's got to be some other rule involved, or else all requests would lead to either an infinite loop or the wrong domain name.
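For a site that wants the bare domain, the Condition has to name the bare domain. A sketch, assuming example.com without www is the canonical host:

```apache
# Redirect every hostname except example.com itself
# (the "?" also lets through requests with an empty Host header)
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

Here a request that already uses example.com fails the Condition and is left alone, so there is nothing to loop on.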

I would be willing to give up on the html to htm but think the index.htm to / would be good to consolidate page rank. I occasionally see them both listed in a list of site pages.

Yes, it's one of the standard redirects. The conventional form for htaccess is
RewriteRule ^(([^/]+/)*)index\.html? http://www.example.com/$1 [R=301,L]

Watch out!
In all my examples-- haha-- I've been saying "www.example.com" because that's the form I use. If you prefer "example.com" then make sure you leave off the "www." (target) or "www\." (pattern) anywhere I've got it.

I removed the html and index.htm from the .htaccess file. Now the homepage with index.htm comes up as a 404 page. Is it possible all the old commands are still in the memory? Could it clear up in time?

Ouch. Start by emptying all browser caches. Although browsers don't download the htaccess file, they sometimes remember 301 instructions. So if you put in a request for a page that got redirected two seconds ago, the browser may proceed directly to the new request even though the redirect is now gone.

Bing also presents problems. There are 304 new error pages with 50 some 301 redirects most of which are 404s. HTML does not show up really with this new batch of 404s. There are a few html pages in the 301 file. Mostly the error pages list the domain name/directory 2/directory 4/real directory/real file name.htm. Some errors have more incorrect directories than the example. Bing has taken names of directories that exist on the site and then mixed them in the middle of the real domain/directory1/index.htm or any other correct page address.

Is this possible to fix in the .htaccess file or should I just forget it? I would like to work on getting more pages in Bing.

Almost everything is fixable. But this one sounds like a brand-new problem. Before you start working on it, make sure you know where the problem originated. It might be a bad link from outside; it might be a bad link from elsewhere on your own site, or-- worst case-- it might be an htaccess booboo. Doesn't bing usually say where the bad URL links from? If the mixed-up URL is because of a mistake at your end, make sure you fix the mistake before you try to deal with its consequences. If it's a transitory error that you caught within a few days, it may be simplest just to ignore the resulting 404s; they'll go away in time. And if it's a bad link from someone you don't especially care about, it may even be counterproductive to try to deal with it.

librarian




msg:4567280
 1:58 pm on Apr 23, 2013 (gmt 0)

Hi phranque,
Thank you for your reply. I have been clearing the cache regularly. Opera, Chrome and FireFox are all set to clear the cache when they close. I've also rebooted a couple of times when all the browsers are closed.

It has been a complete surprise to see how differently each browser behaves with the same .htaccess file. Even with only the www to non www running it will work in one and not another. Typing in the domain/index.htm still causes odd happenings. Chrome says a redirect loop was created. Another seems to hang up for a long time before opening. Opera is the one I use the most. FireFox seems to work with the .htaccess file the best.

Time may help clear out the problems when index.htm is not asked for again in tests.

Thanks again,

Rhoda

librarian




msg:4567298
 3:34 pm on Apr 23, 2013 (gmt 0)

Hi Lucy,
I don't know what to say. I want to thank you for all your help but I've decided I have to cool it for a while to give the browsers a chance to forget all tests. The caches have been emptied and the computer rebooted.

Opera seemed to be working with the www to non www which was the only rewrite in the .htaccess file besides the specific ones which had been working. It did have one problem that didn't seem related. The graphics were not showing but the pages were loading.

The other two browsers showed the graphics, but with domain/index.htm the redirect loop error showed up. I gave up and put the old .htaccess file back again. Suddenly Opera was showing the graphics. The other two lost their redirect loop. The actual site pages were showing in all three browsers.

Over the years I've done all my learning and research in WebmasterWorld, but I'm going to try to find some good tutorials. There were a couple on Sunday and they had more links to follow. After a while I will try working on the .htaccess file again. I hate defeat.

I want to thank everyone who took the time to help.

Rhoda

phranque




msg:4568494
 1:51 pm on Apr 27, 2013 (gmt 0)

install the Live HTTP Headers add-on for firefox and you will be able to see the response status chain, which may help you understand the problem.

librarian




msg:4569005
 7:48 pm on Apr 29, 2013 (gmt 0)

Hi Phranque,
Thank you for the information about the Live Headers. I installed it right away. I'm sure I will get use from it when I start work on the .htaccess file of my other really old site.

Your reply came at the right time. I was planning to update the progress with my work on the example.com .htaccess file. Nothing was going right in any of my three browsers. Opera wouldn't even display the site's graphics. The other two, Firefox and Chrome, did show them. Not good when the heart of the site is the graphics. They reappeared as soon as I put the old .htaccess file back.

It was good to go back and start over. I'm happy to say all is working now.

Here is what I did. I replaced the original version of the .htaccess file with my last version that didn't work. When I tried a few example.com/file.htm requests, nothing worked: not www to non-www, not example.com/index.htm or index.html, not html to htm. Then I tried some of the specific rewrites. Now they weren't working either. So I removed all the extra commands and ran only the specific rewrites. They worked.

Usually I try to follow the less is more idea. The first area I tackled was having index.htm and index.html resolve to example.com. Since I only have one index page on the whole site, I decided to put it in the specific rewrites group. When I did this it worked great. Looking at Bing's and Google's Webmaster Tools accounts, I saw both had a couple of 404s for index.htm and index.html. They are both in the specific area now. Bing shows they are now pooling all the links to example.com/.
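
A minimal sketch of one way to send the home page's index file to the bare root, assuming the rule lives in the root .htaccess, mod_rewrite is enabled, and example.com stands in for the real host. It is not necessarily the rule used on this site. The THE_REQUEST condition limits the redirect to what the browser literally asked for, so the internal DirectoryIndex lookup of index.htm does not trigger the rule a second time. That second trigger is one common cause of the redirect loop error browsers report.

```apache
# Sketch only: example.com is a placeholder host.
RewriteEngine On

# Match only a direct client request for /index.htm or /index.html
# (the character class allows a query string or the space before HTTP/).
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.html?[?\ ]
RewriteRule ^index\.html?$ http://example.com/ [R=301,L]
```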

Next I tried to work with the html to htm. Bing seems to be the engine that shows a lot of html files. Many are the files from the original build of the website. But there are other garbled pages they applied the html to. So I tried to use the command to change all the endings to htm. It was at the bottom area for the .htaccess file. Right away there were problems. What I saw in Firefox was that the command looked at the original file names like id2.html in the specific area above. It hung up for a bit and then picked the target file name from another id5.html which was not right. So forget that command. I wanted to leave those original file rewrites. The html to htm is now gone.
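
For anyone who does want a general html-to-htm rule alongside specific redirects, one possible sketch follows. It assumes the rule sits below the specific id*.html redirects in the root .htaccess, that each of those carries the L flag, and that example.com is a placeholder. The conditions make it fire only when no real .html file exists and a matching .htm file does, so an old name such as id2.html with no id2.htm counterpart is left to the specific rules or to the 404 page.

```apache
# Sketch only: place after the specific redirects; example.com is a placeholder.
# Skip requests for .html files that really exist.
RewriteCond %{REQUEST_FILENAME} !-f
# Only redirect when the same name with .htm is a real file.
RewriteCond %{DOCUMENT_ROOT}/$1.htm -f
RewriteRule ^(.+)\.html$ http://example.com/$1.htm [R=301,L]
```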

Last I tried the www to non-www command. It follows the specific rewrites and is before the 404errorpage.htm. This command works.
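
The ordering described above might look roughly like the sketch below. The host example.com and the target new-page.htm are placeholders, not the site's real names. Pointing each specific redirect straight at the non-www host means an old URL reaches its new page in a single 301 instead of two.

```apache
# Sketch only: example.com and new-page.htm are hypothetical.
RewriteEngine On

# 1. Specific page redirects come first.
RewriteRule ^id12\.html$ http://example.com/new-page.htm [R=301,L]

# 2. General www to non-www redirect follows the specific rules.
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

# 3. Custom error page last.
ErrorDocument 404 /404errorpage.htm
```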

The .htaccess file now works in all three browsers. You don't have to clear the cache. You can make a change and try it out right away. Lucy was right when she said leave some of it out.

Thank you to everyone who took the time to help.

Rhoda

lucy24




msg:4569029
 9:25 pm on Apr 29, 2013 (gmt 0)

Bing seems to be the engine that shows a lot of html files.

This may be a general bing characteristic. I similarly see them picking up a lot of without-to-with www redirects even though I have never allowed both forms. Someone in an unrelated thread recently described bing as "whitelisting" in search, while That Other Search Engine "blacklists". Wonder if one of the things they look for is a site that's diligent about redirecting to a single form of each page name?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved