Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
Redirect 404 as permanent, with www to non www
Redirecting 404 as permanent, stripping www with args
jmichaels




msg:4000249
 11:11 pm on Oct 2, 2009 (gmt 0)

I have tried this a few ways and keep ending up in redirect loops. I am pretty sure it is not as simple as using the error document, as that will not return a permanent 301, but a 200 OK. If I set the 404 error doc to a local path like /index.html, I lose a lot of my images and CSS files; they stop loading.

# Rewriting rules
RewriteEngine on

# Redirect if www to non-www, carry url arguments (case-insensitive)
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule ^/(.*) http://example.com/$1 [R,L]

So that is the basic part: take the www off, and keep the arguments. Simple. Now, what I want to add is that any and all 404 pages also get the same treatment.

So a 404 would come in as for example:
www.example.com/a-page-that-is-gone/
- or -
example.com/a-page-that-is-gone/

That needs to go to example.com/index.html, but as a 301 permanent, with no www. Basically, I want all old URLs that still resolve to keep working normally; those that do not resolve should get a 301 permanent redirect.

 

jd01




msg:4000430
 2:13 pm on Oct 3, 2009 (gmt 0)

I'm totally confused by some of your statements, so let me ask some questions... Maybe I'm totally missing or misunderstanding something?

> I am pretty sure it is not as simple as using the error document, as that will not return a permanent 301, but a 200 ok.

1.) Why would an error doc return a 200 OK, when it is by definition an error document and is served with the appropriate error code, not a 200? A properly functioning 404 error document, though it parses and displays, does not return a 200. It returns a 404.

2.) Why would the requested URL not be redirected from www.example.com to example.com prior to any 404 error being served when the .htaccess file, including all rewrite rules must be processed prior to any page being served, or the .htaccess file would do absolutely no good?

The .htaccess file should be processed on any request, prior to any requested location being served. IOW: Every 'external' request is run through the httpd.conf, then .htaccess before the server can even know if the resource exists, so if your server is functioning correctly, then it should absolutely redirect to example.com prior to serving a 404 error.

> That needs to go to example.com/index.html but as a 301 permanent, with no www. I basically want to have all old url's be valid and working, when they resolve normally, those that do not, I want a 301 permanent.

3.) Why would you want to serve a 301 and redirect everything to the homepage?

(AFAIK it is not recommended to do this. If the resource is not found, it should serve a 404 (Not Found), and if it has been removed it should technically serve a 410 (Gone). If you are going to redirect a resource which has been removed, or which you know cannot be found, it should be redirected to the new location of the resource, or to the location of a similar resource on the same topic. Again AFAIK, as far as search engine rankings go, it is not recommended to redirect every 'not found' request to the home page, or to any other page for that matter. It is much wiser to serve a custom 404 or 410 error page with a mini-sitemap, a sitemap, or the 'most used and visited links'. It could even be done dynamically, with a .php page served as your 404 error page that shows different links based on the REQUEST_URI.)
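
A rough sketch of that alternative in .htaccess might look like the following. The file names /404.htm and /410.htm and the path /old-brochure.html are made up for illustration; substitute whatever your site actually uses.

```apache
# Serve custom error pages from local paths so the error status code
# (404 or 410) is what the client actually receives.
# /404.htm and /410.htm are assumed names, not files known to exist.
ErrorDocument 404 /404.htm
ErrorDocument 410 /410.htm

# Hypothetical example: a page that was removed on purpose answers 410 Gone.
# With the "gone" status, mod_alias takes no target URL.
Redirect gone /old-brochure.html
```

Both error pages are given as local paths rather than full URLs, so the server delivers them internally along with the original error status.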

NOTE:
Your rule does not serve a 301 redirect. A bare [R] flag with no status code defaults to a 302 (temporary) redirect; you need [R=301] to make it permanent.

Also, if you are going to use a 'catch all' pattern on the left side of the rule, the / there is unnecessary: simply remove the start anchor and the slash. If the ruleset is in httpd.conf, the path being matched begins with a slash, so (.*) captures it and you should remove the / preceding $1 on the right side of the rule. If the ruleset is in .htaccess, the leading slash is stripped before matching, so leave the / on the right side where it is. NOTE: This also means your original pattern ^/(.*) won't match anything in .htaccess, because the path being tested there never starts with a /.

The two different versions follow.

# .htaccess Version

# Rewriting rules
RewriteEngine on

# Redirect if www to non-www, carry url arguments (case-insensitive)
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]

##### ### #####

# httpd.conf Version

# Rewriting rules
RewriteEngine on

# Redirect if www to non-www, carry url arguments (case-insensitive)
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com$1 [R=301,L]

Sorry if I didn't really answer your question, but I hope this helps a little bit and gives you some ideas.

> If I set the 404 error doc to a local path like /index.html I lose a lot of my images and css files loading.

What do you mean by lose? Where do they go? The page served should be exactly the same as what you see on the screen when you access the error page URL directly... Only the code served to the browser changes, so I'm not sure how you are losing images or files if they are present when you visit the page.

jd01




msg:4000439
 2:45 pm on Oct 3, 2009 (gmt 0)

> What do you mean by lose? Where do they go? The page served should be exactly the same as what you see on the screen when you access the error page URL directly... Only the code served to the browser changes, so I'm not sure how you are losing images or files if they are present when you visit the page.

I figured out the answer to this one... The only possibility I can think of is that you're using directory-relative URLs as links, which is not the best idea. It's best to use server-relative links or absolute links. (You could also be using some version of .. which is about the same as directory-relative and not recommended.)

Here are the differences:

Directory Relative:
the/path/to/the/file.html

Starts from the directory you are in:
So if you are viewing http://www.example.com/test/ the directory relative link above will attempt to take you to:

http://www.example.com/test/the/path/to/the/file.html

Server Relative:
/the/path/to/the/file.html

Starts from the server root:
So if you are viewing http://www.example.com/test/ the server relative link above will attempt to take you to:

http://www.example.com/the/path/to/the/file.html

(This resolves to the same place as the absolute form below. With either of the last two forms your links always work, even if you move a page from one directory to another. My guess at why you are not seeing your graphics: a page written for the domain root, using directory-relative links, is being served as the 404 error page for a URL such as http://www.example.com/error/test.html. The browser then looks for the file that really lives at http://www.example.com/the-file.css inside the directory /error/, or in long hand: http://www.example.com/error/the-file.css )

Absolute:
http://www.example.com/the/path/to/the/file.html

Either of the last two is recommended, and if you opt for server-relative links, a base href is also recommended as a general rule on linking.

jdMorgan




msg:4000485
 5:25 pm on Oct 3, 2009 (gmt 0)

The above is way too long for me to read through right now, but:

1) A URL which will not resolve to an existing resource should be 404ed immediately. There is no need to 301 it anywhere, and doing so will only confuse the search engines to the point that they may continue to think that URL is "good." And even if not, it forces them to take an extra processing step (they have to follow the 301 and issue a second HTTP request) in order to find out that the (now redirected) URL is still no good. Then they have to do additional back-end processing to figure out that the first "bad" URL redirected to a second correct-domain-but-still-bad URL, and that both are therefore "bad" and should be removed. Since this kind of back-end 'tidying-up' process takes time and compute resources, there's actually no guarantee that the search engines will ever get around to doing it, which could leave your site's listings in search results in a messy state for months.

2) If a resource is missing, serve a concise, clearly-worded, friendly, and somewhat-apologetic custom error page. Include text links to your home page, major category pages, HTML sitemap, and search facility, as applicable. Keep this list short (5 to 10 entries), but try to help the visitor find what they were looking for directly, and do not redirect them to the home page without indicating or acknowledging the error, or you will confuse them.

If you redirect bad URL requests, you will also create an 'infinite URL-space' which will cause search engines to arbitrarily-limit the depth to which they are willing to spider your site.

As with all error pages, your 404 error document should have the fewest number of external dependencies; Keep image, script, and stylesheet 'includes' to an absolute minimum. This reduces the chances of an 'infinite' 404 error loop should one of those included resources go missing, or should an included script malfunction. Despite any preference for an error page that matches the 'look and feel' of your other pages, it's wise to keep error documents as simple as possible...

In handling missing pages, stick with these conventions or your search rankings will suffer.

I hope this saves you from some future trouble...

Jim

jmichaels




msg:4001909
 5:06 am on Oct 6, 2009 (gmt 0)

I wanted to reply, and say thank you for all your input, and try to address most of what was asked.

As to why an error doc can return a 200, I found that info here:
[webmasterworld.com...]
That was the behavior I was seeing as well.

Why did I want to 301 everything back to the home page? This is an old site, 45 pages in all, written in an old programming language. It could not easily be ported, so I ripped the pages out with an HTTP downloader. The downloader massaged the HTML and the links to point at the new pages, so the site worked. However, the old URLs used a page=1 format, whereas the downloaded pages are now named like page_random.html.

I did not much care about the SEO aspects of this site, so all the 404's were to go to the /index.html page. Google and the rest could pick up on the new named pages as they see fit.

@jd01 thanks for the suggestions, I hope that clears up some of the issues I brought up. And yes, the reason I was getting the busted images on the error doc was that the resources were all relative. This is how the HTTP downloader did its work. Not a big deal; I was torn about even trying to keep the site alive, so having all 404s hit the home page was good enough.

I ended up doing it the right way. I set up a 404 page and made it simple. I then created about 50 rules, one for each of the old pages, to send them to the correct new pages. It took about an hour, but the site now works correctly, and Google will know what to do when it hits the old page=x based URLs.
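
A per-page setup of that kind might look roughly like this in .htaccess. This is only a sketch: the old script name (index.php) and the new file name (page_about.html) are placeholders, since the real names are not given in this thread, and /404.htm is likewise an assumed name for the error page.

```apache
RewriteEngine on

# Hypothetical mapping: old /index.php?page=1 -> new /page_about.html
# The trailing ? on the target drops the old query string.
RewriteCond %{QUERY_STRING} ^page=1$
RewriteRule ^index\.php$ http://example.com/page_about.html? [R=301,L]

# ...one rule pair like the above per old page...

# Redirect www to non-www for everything else, keeping path and arguments
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]

# Anything still not found gets a real 404 from a local error page
ErrorDocument 404 /404.htm
```

Putting the specific page rules ahead of the www rule means an old URL requested on www.example.com reaches its new page in a single 301 rather than two.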

Thanks for the suggestions.

TheMadScientist




msg:4002109
 2:02 pm on Oct 6, 2009 (gmt 0)

> As to why an error doc can return a 200, I found that info here:
> [webmasterworld.com...]
> That was the behavior I was seeing as well.

I was just reading through and noticed this, so I thought I would point it out: the reason the ErrorDocument returned a 200 is the configuration in the .htaccess file, which caused an external request for the document to be generated by the browser, rather than the location being found internally by the server.

On any website, including: www.example.com this will generate a 200 OK, assuming the page is served properly:
ErrorDocument 404 http://www.example.com/404.htm

On any website, including: www.example.com this will serve a 404 Not Found as expected, assuming there is a page at 404.htm:
ErrorDocument 404 /404.htm

The difference is where the request for 404.htm is made:
In the first example the request is made by the browser.
In the second example the request is made by the server.

jdMorgan




msg:4002136
 2:41 pm on Oct 6, 2009 (gmt 0)

Nit-picking in the interest of clarity...

> ErrorDocument 404 http://www.example.com/404.htm

This will generate a 302-Found redirect, which may or may not result in a 200-OK response when the client follows that redirect and issues a new HTTP request for the /404.htm page.

Jim

g1smd




msg:4002525
 10:53 pm on Oct 6, 2009 (gmt 0)

Yes, always omit the protocol and domain name from directives such as these:

ErrorDocument 404 /404.htm

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved