Stuck and questions about my .htaccess file 301 redirect

         

KeesB

4:34 pm on Dec 22, 2014 (gmt 0)

10+ Year Member



I have been stuck on my .htaccess for a very long time now and I can't seem to get the tweak right.
I will first explain what I want to do with the 301 redirect and then explain the errors I run into.

I made a responsive HTML website, so I wanted new search-friendly URLs:

Old web pages:
www.example.com/?lang=nl&page=pagename
www.example.com/?lang=nl&page=hello
www.example.com/?lang=us&page=testing

New webpage:
www.example.com/pagename/
www.example.com/hello/
www.example.com/testing/

I have added 1 redirect rule that redirects all old pages to the new url with:

RewriteCond %{QUERY_STRING} lang=(fr|uk|us|nl)&page=([^&]+)
RewriteRule ^ http://www.example.com/%2/? [L,R=301]

It works, but is this the correct way? And doesn't this cause multiple redirects?

Then I added the same kind of rule for the alphabet pages:

Old:
www.example.com/?lang=nl&letter=a
www.example.com/?lang=nl&letter=b

New:
www.example.com/letter-a/
www.example.com/letter-b/

RewriteCond %{QUERY_STRING} letter=([a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z])$
RewriteRule ^ http://www.example.com/letter-%1/? [R=301,L]

Again, it works but doesn't this create multiple redirects?


With these 2 rules I can redirect most of my pages. I have some other pages whose names I changed, and I redirect those manually.

Old web pages:
www.example.com/?lang=nl&page=about

New webpage:
www.example.com/aboutus/

These are the errors I am stuck on:

1) Requesting this url: http://example.com/pagename redirects to: www.example.com/pagename.html/

2) Page loads the same content:

www.example.com/pagename.html
www.example.com/pagename/
www.example.com/pagename

These 3 all load the same page,
but I would like only www.example.com/pagename/ to work.


3) I am not sure whether the redirects and my .htaccess are good, because sometimes (once every 20 minutes) I see in my data that a page or file is requested repeatedly.

4) I am not sure if the order of things in my .htaccess is good:
error page first,
then removing .html and adding / (which doesn't work properly),
then redirects,
then redirect non-www to www,
then AddType for PHP,
then cache

This is what i have in my htaccess right now:


ErrorDocument 400 /error/
ErrorDocument 401 /error/
ErrorDocument 403 /error/
ErrorDocument 404 /error/
ErrorDocument 500 /error/

RewriteEngine On

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)/$ $1.html
RewriteRule ^([^/]+)/([^/]+)/$ /$1/$2.html
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteRule (.*)$ /$1/ [R=301,L]

RewriteCond %{QUERY_STRING} letter=([a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z])$
RewriteRule ^ http://www.example.com/letter-%1/? [R=301,L]

RewriteCond %{QUERY_STRING} lang=nl&page=about$
RewriteRule ^(.*)$ http://www.example.com/aboutus/? [L,R=301]

RewriteCond %{QUERY_STRING} lang=nl&page=beer$
RewriteRule ^(.*)$ http://www.example.com/wine/? [L,R=301]

RewriteCond %{QUERY_STRING} lang=nl&page=drinks$
RewriteRule ^(.*)$ http://www.example.com/food/? [L,R=301]

RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

AddType application/x-httpd-php .php .html

<IfModule mod_expires.c>
ExpiresActive on
ExpiresByType text/html "access plus 1 days"
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/jpg "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/x-icon "access plus 1 years"
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/javascript "access plus 1 month"
ExpiresByType text/javascript "access plus 1 month"
ExpiresByType application/x-shockwave-flash "access plus 1 month"
</IfModule>




Any help with my .htaccess file is very welcome.

lucy24

7:34 pm on Dec 22, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It works, but is this the correct way? And doesn't this cause multiple redirects?

It looks perfectly OK to me. You've remembered the ? in the target to strip off the query string. That eliminates any risk of multiple redirects.

That is, oops: OK except that where's the pattern? Are you redirecting everything to the root, or is this rule only meant to apply to root requests?

RewriteCond %{QUERY_STRING} letter=([a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z])$

Yikes. Is it possible you've misunderstood the use of a hyphen? All you need there is [a-z]. Just one letter, right?
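
As a hedged sketch of what that looks like in the rule itself: the character class is the only real change. The (?:^|&) anchor in front of letter= and the ^$ pattern are assumptions added here, so that the rule fires only for root requests whose query string has a parameter named exactly "letter".

```apache
# [a-z] matches exactly one lowercase letter, a through z.
# (?:^|&) is an added assumption: it keeps e.g. "newsletter=a" from matching.
RewriteCond %{QUERY_STRING} (?:^|&)letter=([a-z])$
# %1 is the letter captured by the condition; the trailing ? drops the old query string.
RewriteRule ^$ http://www.example.com/letter-%1/? [R=301,L]
```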

It looks as if some of your htaccess is backward.

On the plus side: The most important thing is to keep each module separate. Not for Apache's sake but for yours. You've done this. Personally I like to save mod_rewrite for last, because it's the largest section. But that's completely a matter of individual preference; the server doesn't care. You never need <IfModule> envelopes. Either you've got a mod or you haven't. Find out, and write the rules accordingly.

On the not-so-plus side: Within mod_rewrite, the general order is:
first group your rules in order of severity-- start with rules that end in [F], then [G] (if any), then [R] (i.e. normally [R=301,L]), and finally [L] without [R]. There may be individual exceptions, but that's the general layout
then within each of those groups, list rules from most specific to most general. So if you have a rule that applies to /page-one\.html it has to go before rules that apply to /page-[a-z]+\.html. Or, in your case,
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteRule (.*)$ /$1/ [R=301,L]

should be the last external redirect.
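
To illustrate the "most specific to most general" ordering with the rules from the original post: a renamed page has to be caught before the catch-all page rule, or the catch-all would send it to /about/ first. This is a sketch only; the ^$ pattern and the (?:^|&) anchor are assumptions, not part of the posted file.

```apache
# Specific: one renamed page, listed first.
RewriteCond %{QUERY_STRING} (?:^|&)lang=nl&page=about$
RewriteRule ^$ http://www.example.com/aboutus/? [R=301,L]

# General: every other ?lang=..&page=.. request, listed after the exceptions.
# %2 is the page name captured by the second group of the condition.
RewriteCond %{QUERY_STRING} (?:^|&)lang=(fr|uk|us|nl)&page=([^&]+)
RewriteRule ^$ http://www.example.com/%2/? [R=301,L]
```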

But let's talk a little about this rule. First of all: You're not doing anything with images and stylesheets, are you? Make sure the pattern of the rule applies only to requests for pages. Here, it looks as if you're working with extensionless URLs. That means that if you never have literal periods in directory names, the pattern can be simply
^([^.]+[^./])$

meaning "no dots anywhere, and the last character isn't a slash". This in turn means you can drop the third Condition, and also the first (because your real, physical files always do have an extension).

That leaves only the second condition. The -d and -f tests are an absolute last resort, because it means the server has to go physically look for the file an extra time on every request. (In htaccess, you can't rely on the server remembering anything from one nanosecond to the next.) Ordinarily you would have to keep the !-d condition. But here, the rule happens to do exactly what mod_dir does on real, physical directories: add a / final slash if it was absent. So you're left with a single conditionless rule: "If a request contains no . periods, and doesn't end in a / slash, then add the slash". You should give the full protocol-plus-domain in the target, though:

RewriteRule ^([^.]+[^./])$ http://www.example.com/$1/ [R=301,L]





Now, there's one part I don't see, and it's important. Where does your rewriting take place? Did you leave it out from the quoted material? Ordinarily I'd expect all the external redirects to be followed by something like

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ /$1.html [L]


or possibly

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ /index.php?pagename=$1 [L]


Normally we encourage people to quote only the relevant parts of their code (htaccess, stylesheet, whatever) rather than doing a massive code dump. But we need to know whether you left it out because it wasn't relevant to the question, or because it doesn't exist. In the redirects section, I'm not seeing an index redirect (/directory/index.html TO /directory/) and a domain-name-canonicalization redirect (http://example.com TO http://www.example.com). Those would normally be your last two external redirects.




1) Requesting this url: http://example.com/pagename redirects to: www.example.com/pagename.html/

2) Page loads the same content:

www.example.com/pagename.html
www.example.com/pagename/
www.example.com/pagename

These 3 all load the same page,
but I would like only www.example.com/pagename/ to work.

The third version-- /pagename without final slash --should be taken care of by now. For the .html part, you'll need an additional rule that redirects .html requests to the same thing minus .html plus / slash. We can come back to that.
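
One possible shape for that .html redirect, offered as an untested sketch rather than a finished rule. The %{THE_REQUEST} condition is an assumption about how the final file will work: once /pagename/ is internally rewritten to /pagename.html, the condition keeps this redirect from firing on the rewritten request, because THE_REQUEST holds the original request line sent by the client (for example "GET /pagename.html HTTP/1.1").

```apache
# Assumes index.html is handled by a separate, earlier rule;
# otherwise /index.html would be redirected to /index/.
# Redirect only when the client itself asked for /name.html.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^./?\s]+\.html[?\s]
RewriteRule ^([^./]+)\.html$ http://www.example.com/$1/ [R=301,L]
```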

3) I am not sure whether the redirects and my .htaccess are good, because sometimes (once every 20 minutes) I see in my data that a page or file is requested repeatedly.

How often is "repeatedly"? If it's 30 requests for the same thing, that's an infinite loop and we'll need to find the problem. Generally the fix involves a RewriteCond looking at %{THE_REQUEST} but we'll have to take a closer look at details. If the identical request only happens 3 or 5 or 10 times, it's a problem at the user's end. Without more information, there's no telling whether their browser has the hiccups or you've got a nasty robot doing its thing. For some reason it's very common for robots to make the identical request three times in a row.

KeesB

1:56 pm on Dec 23, 2014 (gmt 0)

10+ Year Member



Thanks for your detailed explanation.

That is, oops: OK except that where's the pattern? Are you redirecting everything to the root, or is this rule only meant to apply to root requests?


This rule is not meant to redirect to the root; it is meant to apply to &page=pagename requests.
Those should redirect to www.example.com/pagename/
or whatever name that page has (with a few exceptions where I changed the page name and redirect manually).
So I am guessing it's not correct after all?

Yikes. Is it possible you've misunderstood the use of a hyphen? All you need there is [a-z]. Just one letter, right?


I have a dictionary with words from a to z on the website, so I thought I needed to list them all for all 26 pages. If I change that to [a-z], will it work the same?

You never need <IfModule> envelopes. Either you've got a mod or you haven't. Find out, and write the rules accordingly.

Didn't know that, I will try and find out.


First of all: You're not doing anything with images and stylesheets, are you?

No I am not, should I?

Here, it looks as if you're working with extensionless URLs.

I used to, but now it's .html files.


Where does your rewriting take place? Did you leave it out from the quoted material?

I didn't leave it out; it doesn't exist. The .htaccess I posted is my complete file. I have had multiple headaches over this, and because I was not sure, I posted this question. I am very happy you answered my question, but it also makes me more confused than before :-)

Should all redirects, or all individual redirects, have:

RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ /$1.html [L]


How often is "repeatedly"?


There is no straight answer to this. One time it was requested over 500 times, and another time it was requested 60 times. I think it keeps requesting until the visitor leaves the page. It happens on all pages except the homepage. 9 times out of 10 it's Firefox users.

This is the reason I got stuck on my .htaccess file.

Here is a screenshot of my log: [imgur.com ]

This happens to pages, css file, images and icons.

wilderness

4:26 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One time it was requested over 500 times, and another time it was requested 60 times. I think it keeps requesting until the visitor leaves the page.


Generally speaking, this is most likely a timeout by the server caused by a 'loop' (endless search for a file that does not exist. Most common is Error Documents).

KeesB

4:57 pm on Dec 23, 2014 (gmt 0)

10+ Year Member



But the files do exist.

Most common is Error Documents


I don't see my error documents being wrong.

Should they perhaps be in full path: http://www.example.com/error/

I never had this problem until I changed the website and the .htaccess.

wilderness

5:47 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But the files do exist.


Then perhaps the path is incorrect (same as not existing) due to an actual path error or some type of incorrect rewrite.

Should they perhaps be in full path: http://www.example.com/error/


Standard as applied in htaccess:

ErrorDocument 403 /yourFileName.html
ErrorDocument 404 /yourFileName.html

lucy24

7:24 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Generally speaking, this is most likely a timeout by the server caused by a 'loop' (endless search for a file that does not exist. Most common is Error Documents).

But 500 times? Repeated requests in logs means a browser setting, not server. Is there a glitch in some version of Firefox, where they keep requesting the same thing 500 times instead of stopping at 30? (I don't know if 30 is the actual number. It's intentionally set to be higher than could ever really occur.)

KeesB, the ErrorDocument issue comes about like this:

-- visitor requests a file
-- server checks its information and learns that visitor isn't allowed to receive files
-- instead of the requested file, server puts in an internal request for, let's say, /forbidden.html (that's what the ErrorDocument directive does)
-- server checks its information and learns that visitor isn't allowed to receive files
-- instead of the requested file (in this case, "forbidden.html"), server puts in an internal request for /forbidden.html
-- server checks its information and learns that visitor isn't allowed to receive files
-- instead of the requested file ... et cetera

See how that works? After the cycle has repeated itself a set number of times (Apache's LimitInternalRecursion limit, 10 by default), the server says "This is going nowhere fast so we're going to stop trying" and instead sends out a 500 error.

For this reason, your htaccess should always have a line saying something like
<Files "forbidden.html">
Order Allow,Deny
Allow from all
</Files>
If you are on shared hosting and they tell you to use a particular name for error documents, these lines are already present in the config file.

Since each module is an island, this process has to be repeated for every module that can issue a 403. For example
RewriteRule forbidden\.html - [L]

This line goes before any RewriteRules that end in [F]. (This is one of the exceptions I talked about earlier. Or, ahem, earlier in some other thread recently.)


This rule is not meant to redirect to the root; it is meant to apply to &page=pagename requests.
Those should redirect to www.example.com/pagename/
or whatever name that page has (with a few exceptions where I changed the page name and redirect manually).
So I am guessing it's not correct after all?

Um, not sure. I got lost in the "not"s. If your ordinary URL looks like
example.com/?lang=nl&page=blahblah
and you want to redirect to
example.com/blahblah/
then the rule should say
RewriteCond %{QUERY_STRING} lang=(fr|uk|us|nl)&page=([^&]+)
RewriteRule ^$ http://www.example.com/%2/? [R=301,L]

That way, the server only looks at conditions when the request is for the index page-- and the rule only executes if the condition is met. mod_rewrite works on a "two steps forward, one step back" system, where conditions are only evaluated if the pattern of the rule potentially fits.
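
Annotated with the order in which mod_rewrite evaluates it, the rule above reads roughly as follows. The comments are only an illustration of the evaluation order described here; the rule itself is unchanged.

```apache
# Step 2: evaluated only if the pattern in step 1 matched.
#         %1 = language code, %2 = page name.
RewriteCond %{QUERY_STRING} lang=(fr|uk|us|nl)&page=([^&]+)
# Step 1: the pattern ^$ is tested first (empty path = request for the root).
# Step 3: if the condition also passed, redirect; the trailing ? drops the
#         old query string from the target.
RewriteRule ^$ http://www.example.com/%2/? [R=301,L]
```

A request for /img/logo.png never gets as far as step 2, so the query string is not inspected at all for it.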

I have a dictionary with words from a to z on the website, so I thought I needed to list them all for all 26 pages. If I change that to [a-z], will it work the same?

Yes, exactly. It means "any one character in the list a-z".

You never need <IfModule> envelopes. Either you've got a mod or you haven't. Find out, and write the rules accordingly.

Didn't know that, I will try and find out.

Inside an <IfModule> envelope, if you don't have the relevant module, the rule will simply be ignored. (This is true even if the contents of the <IfModule> envelope have nothing to do with the module being invoked.) If the rule currently works as intended, then you know you've got the mod.

You're not doing anything with images and stylesheets, are you?

No I am not, should I?

No, generally not. And if the rule only applies to pages, then don't ask the server to look at Conditions the rest of the time. See above about two steps forward, one step back.

Here, it looks as if you're working with extensionless URLs.

I used to, but now it's .html files.

Uh-oh, now wait. Do your final URLs -- the ones seen by the user -- end in / or in .html? My impression from all other rules is that everything is supposed to end in a / slash. Again, that's the visible URL. The physical file is a different matter.

Where does your rewriting take place? Did you leave it out from the quoted material?

I didn't leave it out; it doesn't exist. The .htaccess I posted is my complete file. I have had multiple headaches over this, and because I was not sure, I posted this question. I am very happy you answered my question, but it also makes me more confused than before :-)

You are not the first person to have this complaint :( But if you are redirecting to URLs that end in / and there's no rewriting, that means that every single one of your pages is called "index.html" and each one lives in its own directory. Somehow I don't think that's what is happening.

Should all redirects, or all individual redirects, have:
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ /$1.html [L]

No, absolutely not. As I said before, the -d or -f test is an absolute last resort in a RewriteCond. You only need it when there is no other way to distinguish between requests that match the rule, and requests that don't. In addition, the quoted rule is not a redirect. It's an internal rewrite. It means:
"If there's a request ending in / (such as example.com/subdir/name/) and I don't really have a directory by this name, then find the file /subdir/name.html and secretly serve that instead. But if there really is a directory /subdir/name/, then send back /subdir/name/index.html".

Here is a screenshot of my log:

Oh, dear. That's amazingly unhelpful. Did you edit parts of the log before taking the screenshot? If not, it looks as if everything-- including non-page files-- is being redirected to /pagename/

If you need to quote logs, paste in a few lines. If your site name is visible --most often in the Referer slot for non-page files-- replace it with "example". The TLD (.com, .org, .ca etc) doesn't matter. Also obfuscate part of the requester's IP, like "12.34.56.abc".

KeesB

5:23 pm on Dec 24, 2014 (gmt 0)

10+ Year Member



Um, not sure. I got lost in the "not"s. If your ordinary URL looks like
example.com/?lang=nl&page=blahblah
and you want to redirect to
example.com/blahblah/
then the rule should say
RewriteCond %{QUERY_STRING} lang=(fr|uk|us|nl)&page=([^&]+)
RewriteRule ^$ http://www.example.com/%2/? [R=301,L]
That way, the server only looks at conditions when the request is for the index page-- and the rule only executes if the condition is met. mod_rewrite works on a "two steps forward, one step back" system, where conditions are only evaluated if the pattern of the rule potentially fits.


Yes, this is correct, and this rule is working. The only problem here is that if a user requests example.com/blabla/, that page does not exist and should be redirected to example.com/error/. But instead, the error page gets loaded and the URL remains the same. The response header is: HTTP/1.1 404 Not Found.
This is giving some 404s in Webmaster Tools, which I then have to redirect manually again. There should be a better way for this.

Uh-oh, now wait. Do your final URLs -- the ones seen by the user -- end in / or in .html? My impression from all other rules is that everything is supposed to end in a / slash. Again, that's the visible URL. The physical file is a different matter.


Yes, I want the final URLs to end in /


You are not the first person to have this complaint :( But if you are redirecting to URLs that end in / and there's no rewriting, that means that every single one of your pages is called "index.html" and each one lives in its own directory. Somehow I don't think that's what is happening.


Sorry about that. This sounds like a problem. I don't have the files in their own directories. How do we set this up so that rewriting does take place?


Oh, dear. That's amazingly unhelpful. Did you edit parts of the log before taking the screenshot? If not, it looks as if everything-- including non-page files-- is being redirected to /pagename/


Oops, I didn't realize it was that unhelpful. I did remove a couple of lines, but those were requests from other visitors. I thought it would be easier to see a pattern if I removed them. If you look at the times, you can see it all happens in just a couple of seconds. This screenshot is 1/5 of the actual log of this request.

I can send the complete log file if that helps?

Do you think this would prevent the multiple redirects? I am not feeling too good about this rule.

RewriteCond %{ENV:REDIRECT_END} =1
RewriteRule ^ - [L,NS]


I am having difficulties with setting up the new htaccess. Could you help me with the correct rules in order?

lucy24

7:01 pm on Dec 24, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The only problem here is that if a user requests example.com/blabla/, that page does not exist and should be redirected to example.com/error/.

No it should not. A request for a nonexistent page should receive a 404 response, after which the human user will be shown your custom 404 page, for example /error.html The ErrorDocument directive should give the path to a real, physical page, not an URL* such as /error/
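
A minimal sketch of that setup, assuming the custom error page is saved as a physical file named error.html in the document root (the filename is hypothetical; use whatever the file is really called). The RewriteRule line is an added assumption that exempts the error page from any later rewrite rules.

```apache
# Local path to a real file: the server keeps the original 4xx status
# and the visitor's address bar keeps the requested URL.
ErrorDocument 403 /error.html
ErrorDocument 404 /error.html

RewriteEngine On
# Leave requests for the error page itself alone (hypothetical filename).
RewriteRule ^error\.html$ - [L]
```

Giving ErrorDocument a full http:// URL instead would make the server send a redirect to that URL, so the client would no longer receive the original 404 status.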

But instead, the error page gets loaded and the URL remains the same. The response header is: HTTP/1.1 404 Not Found.

That's what is supposed to happen when a page doesn't exist. What are you complaining about?

This is giving some 404s in Webmaster Tools, which I then have to redirect manually again. There should be a better way for this.

There is a better way; see previous post. The rule has to be constrained to requests for the root: not
RewriteRule ^ etcetera

which applies to all requests all the time, but
RewriteRule ^$ etcetera

which applies only to requests for the root. You can also express the pattern as
RewriteRule ^(index\.html)?$ etcetera

because search engines will sooner or later ask for "index.html" out of plain cussedness.

How do we set this up so that rewriting does take place?

You need to add a rule after all redirects so requests ending in / are rewritten (not redirected). But in order to hammer out this rule, we need to know what the real, physical filename is for each URL. In an earlier post I tossed out a couple of possibilities, but those were just guesses.

Since you're using URLs ending in / slash, we need to know if you've got any real, physical directories. Those have to be excluded from rewriting. As previously noted, the !-d condition is a last resort. It's better if you can say something like
RewriteCond %{REQUEST_URI} !^/(images|stylesheets|mystuff)/

where you list your real directories by name.

Do you think this would prevent the multiple redirects?

No, it wouldn't help at all, because environmental variables only apply to the present request. They disappear on each new request. Instead you need-- probably-- a RewriteCond looking at %{THE_REQUEST}. But we're not there yet. In fact, so far there's no evidence that multiple redirects are even happening. They will start happening if you add a RewriteRule without appending a Condition to one or more of your existing redirects, though.

:: detour to fish png out of Trash ::

In the snippet you posted-- which you could perfectly well have included as plain text within the post, not an image link-- every single request receives a 200, except one place where there are four consecutive identical requests each receiving the same 301. So the problem is not with infinite redirects. In fact: Are you sure the quoted snippet isn't an artifact of selecting-and-pasting, or a hiccup in the logs themselves? The one thing I'm definitely not seeing is anything that would suggest an infinite loop.


* Yes, I do know how it's pronounced. But in my mind it will always sound like "the duke of URL".

KeesB

9:30 pm on Dec 24, 2014 (gmt 0)

10+ Year Member



That's what is supposed to happen when a page doesn't exist. What are you complaining about?


Not complaining, just double checking. After a couple of errors like this I start to doubt things.

But in order to hammer out this rule, we need to know what the real, physical filename is for each URL. In an earlier post I tossed out a couple of possibilities, but those were just guesses.


I have 3 folders (img, includes, downloads). Except for robots.txt, sitemap.xml, favicon.ico and style.css, everything else is .html and sits in the public_html folder.

So the rule should be:
RewriteCond %{REQUEST_URI} !^/(img|includes|downloads)/


every single request receives a 200, except one place where there are four consecutive identical requests each receiving the same 301. So the problem is not with infinite redirects. In fact: Are you sure the quoted snippet isn't an artifact of selecting-and-pasting, or a hiccup in the logs themselves? The one thing I'm definitely not seeing is anything that would suggest an infinite loop.


I only removed a few lines from the snippet that had nothing to do with the pages/files being requested. I am positive it is not a hiccup or artifact, as it has been happening for 9 days already. I have multiple log files where this 'event' is happening, each with different IPs (not blacklisted) and browsers (mostly Firefox).
Also, the Google Analytics .js is triggered every time, which I don't think they like very much.

Is it OK if I post a complete 'event' log in here? About 500 lines, if I can't find a shorter one.
Perhaps I can provide a download link to the file if needed?

lucy24

10:44 pm on Dec 24, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So the rule should be:
RewriteCond %{REQUEST_URI} !^/(img|includes|downloads)/

That's just the condition. If all your files are .html located in the root, the body of the rule will be something like

RewriteRule ^([^./]+)/$ /$1.html [L]


meaning "if the request contains no literal periods, and it ends in a slash, then serve content from the same thing minus slash plus '.html'". The RewriteCond goes immediately before (not after!) the Rule.

But wait! Did the site ever use URLs in the form
/pagename.html

? If yes, you'll need to redirect those. If not, just glance at logs periodically and make sure search engines aren't requesting these forms directly. Which reminds me...

Any time you have URLs ending in / slash like /filename/ search engines will ask for two other things:
/filename
/filename/index.html
The first form is the one we've already dealt with. The second form will require a pair of redirects that look something like this

RewriteRule ^(img|includes|downloads)/index\.html http://www.example.com/$1/ [R=301,L]

RewriteRule ^([^./]+/)index\.html http://www.example.com/$1 [R=301,L]


First rule is for the three directories that really exist. Second rule is for anything left over. Note that if those three directories-- img, includes, downloads-- don't happen to have an accessible index.html page, then these redirected requests will end up getting a 403. So if you wanted to, you could replace the first [R=301,L] with an immediate [F] response, skipping one step.
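
Pulling the thread's advice together, the whole mod_rewrite section might be ordered along these lines. This is an untested sketch, not a drop-in file: error.html is a hypothetical filename, the %{THE_REQUEST} conditions and the (?:^|&) anchors are assumptions added to avoid redirect loops and accidental matches, and the index rule is a generalized variant of the pair above.

```apache
# Hypothetical error page; must be a real, physical file.
ErrorDocument 403 /error.html
ErrorDocument 404 /error.html

RewriteEngine On

# Exemption: never rewrite or redirect the error page itself.
RewriteRule ^error\.html$ - [L]

# --- External redirects, most specific first ---

# Renamed pages (one Cond/Rule pair per rename).
RewriteCond %{QUERY_STRING} (?:^|&)lang=nl&page=about$
RewriteRule ^(index\.html)?$ http://www.example.com/aboutus/? [R=301,L]

# Dictionary pages: ?lang=nl&letter=a -> /letter-a/
RewriteCond %{QUERY_STRING} (?:^|&)letter=([a-z])$
RewriteRule ^(index\.html)?$ http://www.example.com/letter-%1/? [R=301,L]

# All other old pages: ?lang=nl&page=hello -> /hello/
RewriteCond %{QUERY_STRING} (?:^|&)lang=(?:fr|uk|us|nl)&page=([^&]+)
RewriteRule ^(index\.html)?$ http://www.example.com/%1/? [R=301,L]

# Direct requests for index.html -> the directory URL.
# THE_REQUEST keeps this from firing when mod_dir maps / to index.html.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.html[?\s]
RewriteRule ^((?:[^./]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]

# Direct requests for /name.html -> /name/
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^./?\s]+\.html[?\s]
RewriteRule ^([^./]+)\.html$ http://www.example.com/$1/ [R=301,L]

# Missing trailing slash: /name -> /name/
RewriteRule ^([^.]+[^./])$ http://www.example.com/$1/ [R=301,L]

# Domain canonicalization: example.com -> www.example.com
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# --- Internal rewrite, after all redirects ---

# /name/ is served from /name.html, except inside real directories.
RewriteCond %{REQUEST_URI} !^/(img|includes|downloads)/
RewriteRule ^([^./]+)/$ /$1.html [L]
```

With this ordering a request for http://example.com/pagename should reach http://www.example.com/pagename/ in a single redirect, since the trailing-slash rule already puts the full www hostname in its target.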