Redirecting 404 pages one level up until a 200 OK page is found

         

Imaster

4:09 pm on Jun 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello,

Recently, I redesigned my site and this has led to a lot of 404 error pages.

To quote an example, the url http://www.example.com/biology/students/john.html on my site is a 404 now.

What I would like to do is to redirect the user one level higher until there is an existing page (200 OK).

For example, when a visitor requests the URL http://www.example.com/biology/students/john.html (which is a 404), the server would redirect them to the URL http://www.example.com/biology/students/

If the URL http://www.example.com/biology/students/ is also a 404 error page, then it would take them one level further up to http://www.example.com/biology/ and so on until it reaches a real, existing page.
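
The walk-up behaviour described above can be modelled outside the server. The following is only a minimal Python sketch of the intended logic, not Apache configuration; the function name nearest_live_url and the set of live URLs are made up for illustration.

```python
import posixpath

def nearest_live_url(path, live):
    """Walk up one directory level at a time until a URL in `live` is found.

    `path` is a URL path such as "/biology/students/john.html"; `live` is a
    hypothetical set of URL paths that currently return 200 OK.
    """
    if path in live:
        return path
    current = path.rstrip("/")
    while current:
        # Drop the last path segment, then compare with a trailing slash.
        current = posixpath.dirname(current)
        candidate = current if current.endswith("/") else current + "/"
        if candidate in live:
            return candidate
        if current == "/":
            break
    return "/"  # fall back to the site root

live = {"/", "/biology/"}
print(nearest_live_url("/biology/students/john.html", live))  # /biology/
```

In this model /biology/students/ is not live, so the lookup continues to /biology/.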

Now I do not want to do a 301 redirect as this may reflect negatively in the search engines.

Is there any way I can do this using a 404 directive, or is there any other solution that is safe with the search engines?

Thanks.

jdMorgan

5:04 pm on Jun 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A 301 to the correct URL is the proper way to handle this. Otherwise, you'll have persistent duplicate-content issues.

However, to address your question directly, you may be able to use the "check for file exists" function of RewriteCond [httpd.apache.org]. Because this is an interesting and unusual question, I'll post some code -- something we try not to do here, since our focus [webmasterworld.com] is on helping you to write your own code, rather than providing a (clearly unsustainable) free coding service:


# If requested URL exists with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# serve contents from higher-directory file (Note that this is an internal rewrite, not an external redirect)
RewriteRule ^[^/]+/(.+)$ /$1 [L]
#
# Else if requested URL exists with two directory levels removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# serve contents from higher-directory file
RewriteRule ^[^/]+/[^/]+/(.+)$ /$1 [L]
#
# Else if requested URL exists with three directory levels removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# serve contents from higher-directory file
RewriteRule ^[^/]+/[^/]+/[^/]+/(.+)$ /$1 [L]

Whether this works or not will be determined by your server configuration; you may need to use RewriteBase in some cases where the URL-to-filepath mapping of the server is not direct. If you have problems, carefully inspect your server error log to determine how the expected filepaths differ from the ones the server is actually testing or trying to serve.

However, as I stated, the proper response is a 301 Moved Permanently. This will eliminate the old URLs from the search indexes over time, and avoid duplicate-content issues. Taking the first piece of code above as an example, modify it to:


# If requested URL exists with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# redirect to higher-directory URL (external redirect)
RewriteRule ^[^/]+/(.+)$ http://www.example.com/$1 [R=301,L]

Modify each of the rewrites to this redirect form, adding as many levels as you actually need.

The above code is not tested. And again, whether it works will also be determined by your existing server URL-to-filepath mapping configuration.

Jim

Imaster

6:18 pm on Jun 12, 2006 (gmt 0)

Hi JD,

Thanks very much.

I tried the code that you mentioned with the 301 redirect. It's behaving oddly.

For 404s, it takes no action. However, for any other existing 200 OK page, it redirects straight to the home page.

Any suggestions?

jdMorgan

5:31 am on Jun 13, 2006 (gmt 0)

Here,

I forgot a step in the logic. Also, in the following code, I've added a query string that will help you debug this if it doesn't work. The server filepath corresponding to the originally-requested URL will be appended to the redirect URL, so you can read it in your address bar. If it's correct and the rule works, then comment out the first RewriteRule and use the second. Otherwise, that query string will give a good indication of how the filepath needs to be tweaked in order for the 'file exists' checks to function properly.


# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$2 -f
# redirect to higher-directory URL (external redirect)
# (rule with debug data in query string)
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$2?original_path=%{DOCUMENT_ROOT}/$1/$2 [R=301,L]
# (rule with debug data removed)
# RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$2 [R=301,L]
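
As a rough illustration of what the debug rule puts in the address bar, here is the same substitution in Python. It covers only the string substitution, not the two file-exists conditions, and the document root shown is an assumed value; the function name debug_redirect is hypothetical.

```python
import re

DOCUMENT_ROOT = "/www/htdocs/example"  # assumed; substitute your own docroot

def debug_redirect(url_path):
    """Build the redirect target the debug rule would produce, or None.

    `url_path` has no leading slash, as in .htaccess context.
    """
    match = re.match(r"^([^/]+)/(.+)$", url_path)
    if match is None:
        return None
    first, rest = match.group(1), match.group(2)
    original = f"{DOCUMENT_ROOT}/{first}/{rest}"
    return f"http://www.example.com/{rest}?original_path={original}"

print(debug_redirect("abc/xyz.html"))
# http://www.example.com/xyz.html?original_path=/www/htdocs/example/abc/xyz.html
```

If the original_path value shown in the browser is not the real location of the file on disk, the RewriteCond paths need the same adjustment.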

Jim

ronburk

6:10 am on Jun 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"Now I do not want to do a 301 redirect as this may reflect negatively in the search engines."

Interesting approach. Seems like you plan to tell the search engines "well, the URL still exists, but the content has completely changed -- in fact, it's a duplicate of another page you already have in your index". No worries that that "may reflect negatively"?

Imaster

11:42 am on Jun 13, 2006 (gmt 0)

Hi Jim,

I tried all this. Still it's not working.

I have modified the code: as I understand it, the second condition and the redirect target should use $1 and not $2.

###################################
# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# redirect to higher-directory URL (external redirect)
# (rule with debug data in query string)
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$1?original_path=%{DOCUMENT_ROOT}/$1/$2 [R=301,L]
# (rule with debug data removed)
# RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$1 [R=301,L]
#######################################

For example:

The URL http://www.example.com/abc/xyz.html is a 404 error.
The URL http://www.example.com/abc/ is a 200 OK page.

When I load the page http://www.example.com/abc/xyz.html using the following code in the .htaccess file:

################################################
CODE 1
# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1 -f
# redirect to higher-directory URL (external redirect)
# (rule with debug data in query string)
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$1?original_path=%{DOCUMENT_ROOT}/$1/$2 [R=301,L]
# (rule with debug data removed)
# RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$1 [R=301,L]
################################################

I get the following in the rewrite.log file

############################################
REWRITE LOG FOR OUTPUT OF CODE 1
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] strip per-dir prefix: /www/htdocs/example/abc/xyz.html -> abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] applying pattern '^([^/]+)/(.+)$' to uri 'abc/xyz.html'
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc/xyz.html' pattern='!-f' => matched
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc' pattern='-f' => not-matched
xx.xx.xx.xx - - [date] [deleted] (1) [per-dir /www/htdocs/example/] pass through /www/htdocs/example/abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] strip per-dir prefix: /www/htdocs/example/abc/xyz.html -> abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] applying pattern '^([^/]+)/(.+)$' to uri 'abc/xyz.html'
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc/xyz.html' pattern='!-f' => matched
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc' pattern='-f' => not-matched
xx.xx.xx.xx - - [date] [deleted] (1) [per-dir /www/htdocs/example/] pass through /www/htdocs/example/abc/xyz.html
############################################

When I load the page http://www.example.com/abc/xyz.html using the following code in .htaccess file
################################################
CODE 2
# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$2 -f
# redirect to higher-directory URL (external redirect)
# (rule with debug data in query string)
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$2?original_path=%{DOCUMENT_ROOT}/$1/$2 [R=301,L]
# (rule with debug data removed)
# RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$2 [R=301,L]
################################################

I get the following in the rewrite.log file

############################################
REWRITE LOG FOR OUTPUT OF CODE 2
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] strip per-dir prefix: /www/htdocs/example/abc/xyz.html -> abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] applying pattern '^([^/]+)/(.+)$' to uri 'abc/xyz.html'
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc/xyz.html' pattern='!-f' => matched
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/xyz.html' pattern='-f' => not-matched
xx.xx.xx.xx - - [date] [deleted] (1) [per-dir /www/htdocs/example/] pass through /www/htdocs/example/abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] strip per-dir prefix: /www/htdocs/example/abc/xyz.html -> abc/xyz.html
xx.xx.xx.xx - - [date] [deleted] (3) [per-dir /www/htdocs/example/] applying pattern '^([^/]+)/(.+)$' to uri 'abc/xyz.html'
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/abc/xyz.html' pattern='!-f' => matched
xx.xx.xx.xx - - [date] [deleted] (4) RewriteCond: input='/www/htdocs/example/xyz.html' pattern='-f' => not-matched
xx.xx.xx.xx - - [date] [deleted] (1) [per-dir /www/htdocs/example/] pass through /www/htdocs/example/abc/xyz.html

I tried all the code that you provided and tried to work around the problem by debugging, but with no success.

Imaster

12:06 pm on Jun 13, 2006 (gmt 0)

Jim,

I just got it. If I use "-d" instead of "-f" in the second RewriteCond statement, it works, since the target is a directory and not a file.
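
The difference can be reproduced with a small Python sketch, assuming os.path.isfile and os.path.isdir are reasonable stand-ins for the "-f" (regular file) and "-d" (directory) tests. The directory and file names are invented for the example.

```python
import os
import tempfile

def check(path):
    """Return the results of a regular-file test and a directory test."""
    return {"-f": os.path.isfile(path), "-d": os.path.isdir(path)}

with tempfile.TemporaryDirectory() as docroot:
    os.mkdir(os.path.join(docroot, "abc"))
    with open(os.path.join(docroot, "abc", "index.html"), "w") as fh:
        fh.write("index")

    dir_result = check(os.path.join(docroot, "abc"))
    file_result = check(os.path.join(docroot, "abc", "index.html"))
    missing_result = check(os.path.join(docroot, "abc", "xyz.html"))

print(dir_result)      # {'-f': False, '-d': True}
print(file_result)     # {'-f': True, '-d': False}
print(missing_result)  # {'-f': False, '-d': False}
```

A directory fails the "-f" test even though it exists, which matches the "not-matched" lines in the rewrite log above.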

Now the situation is that a file or a directory may be missing at any level, and the request should then be sent to the nearest higher-level directory that exists (the 301 would always point to a directory, but the 404 may be a file or a directory).

I have around 5 levels of categories, like

www.example.com/a/b/c/d/e/xyx.html OR
www.example.com/a/b/c/d/abc.html and so on.

How do I write code that will check for a missing directory or a missing file? (I understand that we can use only one option at a time, for example "-f" or "-d", but not "-fd" together.)

Any idea as to how I may proceed?

Thanks again.

Imaster

12:15 pm on Jun 13, 2006 (gmt 0)

Now I seem to have an odd problem. All the requests to any pages under any level are being redirected to http://www.example.com/abc/

jdMorgan

2:23 pm on Jun 13, 2006 (gmt 0)

Try this rule first, only as an experiment. We need to know if your server is configured in a 'flat' manner, in order to find out if we need to adjust the RewriteCond variables:

# (debug rule with file-exists path in query string)
RewriteCond %{QUERY_STRING} !docroot_path=
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$1/$2?docroot_path=%{DOCUMENT_ROOT}/$1/$2 [R=301,L]

Adding the logic to test for directory-exists, the original code would look like this:

# If requested URL does not exist as a file
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-f
# and does not exist as a directory
RewriteCond %{DOCUMENT_ROOT}/$1/$2 !-d
# but it does exist as a file or directory with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$2 -f [OR]
RewriteCond %{DOCUMENT_ROOT}/$2 -d
# Then redirect to higher-directory URL (external redirect)
RewriteRule ^([^/]+)/(.+)$ http://www.example.com/$2 [R=301,L]

The problem with code like this is often that the server filepath has additional path info injected by an 'Alias' or rewrite directive at the httpd.conf level, and using the query string test is the easiest way to find out if this is the case. Otherwise, you cannot see the file-path info compared to the URL-path, and you can't be sure what paths the RewriteConds are testing. If those paths are wrong, the whole rule fails. And if the server adds additional path info, then a more-complex ruleset will be required.

But the key is to look at the filepaths that are being tested by the RewriteConds, and to be sure they resolve to the correct directories and files for each URL tested. That's what the query string trick is for. If not, then adjust the path construction of the RewriteConds until they produce the correct paths to be tested.

A tested-path error might be obvious to you from looking at the RewriteLog info, but since I'm not familiar with your site or server and what the correct paths are, and since I have maybe 30 minutes a day to devote to posting here at WebmasterWorld, it's not easy for me to spot path errors. The data from the query string trick should make it obvious to you, though.

Also, remember that this is only one part of the solution. The above code is only intended to work on URLs that specify a file or directory one level below the Web root. Once this is tested and working, then additional rules or modifications can be developed to handle deeper directories.
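
For anyone who wants to reason about the four conditions without touching the server, here is a hedged Python model of them. It assumes that the [OR] flag joins only the last two conditions, so the whole test reads "not a file AND not a directory AND (file one level up OR directory one level up)". The function name should_redirect and the file names are invented for the example.

```python
import os
import tempfile

def should_redirect(docroot, first, rest):
    """Model of the four RewriteConds for the URL-path first/rest."""
    requested = os.path.join(docroot, first, rest)
    one_up = os.path.join(docroot, rest)
    # Conditions 1 and 2: the requested path is neither a file nor a directory.
    missing = not os.path.isfile(requested) and not os.path.isdir(requested)
    # Conditions 3 and 4, joined by [OR]: it exists with the first level removed.
    exists_one_up = os.path.isfile(one_up) or os.path.isdir(one_up)
    return missing and exists_one_up

with tempfile.TemporaryDirectory() as docroot:
    os.mkdir(os.path.join(docroot, "abc"))
    # xyz.html now lives at the root; abc/xyz.html is gone.
    open(os.path.join(docroot, "xyz.html"), "w").close()
    open(os.path.join(docroot, "abc", "kept.html"), "w").close()

    moved = should_redirect(docroot, "abc", "xyz.html")
    still_there = should_redirect(docroot, "abc", "kept.html")
    nowhere = should_redirect(docroot, "abc", "gone.html")

print(moved, still_there, nowhere)  # True False False
```

Only the URL that is missing at its old location and present one level up is redirected; existing pages and pages that exist nowhere are left alone.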

Jim

Imaster

3:03 pm on Jun 13, 2006 (gmt 0)

Thanks a lot for your help, Jim.

I am testing the latest information that you provided. By doing the query string check, I found out that the url matches. Also there are no other special rewrites that we are doing through httpd.conf or other method.

For example, the query string showed the following info:

http://www.example.com/abc/xyz.html?docroot_path=/www/htdocs/example/abc/xyz.html

I have almost succeeded in getting the 301 redirect to work, but there is a problem. For example, using the code below:

############################################
RewriteEngine ON

# 6th Level Category

# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/$5/$6 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/$5/ -d
# redirect to higher-directory URL (external redirect)
# (rule with debug data removed)
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/(.+)$ http://www.example.com/$1/$2/$3/$4/$5/ [R=301,L]

######################################

I am able to do a 301 redirect to the correct path.

For example, if there is a 404 error page at www.example.com/a/b/c/d/e/xyx.html, then it correctly redirects to the path www.example.com/a/b/c/d/e/ (which is a 200 OK page).

But if I add one more condition like the following :

################################
RewriteEngine ON

# 6th Level Category

# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/$5/$6 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/$5/ -d
# redirect to higher-directory URL (external redirect)
# (rule with debug data removed)
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/(.+)$ http://www.example.com/$1/$2/$3/$4/$5/ [R=301,L]

# 5th Level Category

# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/$5 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1/$2/$3/$4/ -d
# redirect to higher-directory URL (external redirect)
# (rule with debug data removed)
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]+)/(.+)$ http://www.example.com/$1/$2/$3/$4/ [R=301,L]
##########################################

In this case, it follows the conditions I have given for the 5th level category and does not obey the conditions for the 6th level category.

So when I go to the URL www.example.com/a/b/c/d/e/xyx.html (404), instead of redirecting to www.example.com/a/b/c/d/e/ (a 200 OK page), it redirects to www.example.com/a/b/c/d/ (even though the page www.example.com/a/b/c/d/e/ exists).

In short, it follows the last set of conditions when applying the 301 redirect.

Is there any way to make it stop as soon as it encounters a 200 OK page?
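
One possible reading of this behaviour, sketched in Python rather than verified on the server, is that the browser is following two redirects in a row: the 6th-level rule sends it to /a/b/c/d/e/, and that new request then matches the 5th-level rule, because a directory passes the "!-f" test. The site layout below and the helper names are hypothetical.

```python
import re

RULE_6 = re.compile(r"^([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/(.+)$")
RULE_5 = re.compile(r"^([^/]+)/([^/]+)/([^/]+)/([^/]+)/(.+)$")

# Hypothetical site layout: directories that exist and files that exist.
DIRS = {"a/", "a/b/", "a/b/c/", "a/b/c/d/", "a/b/c/d/e/"}
FILES = {"a/b/c/d/e/index.html", "a/b/c/d/index.html"}

def redirect_target(path):
    """Apply the 6th-level rule, then the 5th-level rule, to one request."""
    for rule, keep in ((RULE_6, 5), (RULE_5, 4)):
        match = rule.match(path)
        if match is None:
            continue
        parent = "/".join(match.groups()[:keep]) + "/"
        # Only "!-f" is tested, so an existing directory also counts as
        # "not a file" and is allowed through.
        if path not in FILES and parent in DIRS:
            return parent
    return None

def follow(path):
    """Follow redirects the way a browser would; return every URL visited."""
    visited = [path]
    target = redirect_target(path)
    while target is not None:
        visited.append(target)
        target = redirect_target(target)
    return visited

print(follow("a/b/c/d/e/xyx.html"))
# ['a/b/c/d/e/xyx.html', 'a/b/c/d/e/', 'a/b/c/d/']
```

Under this model, also testing the requested path with "!-d", as in the earlier four-condition version, would stop the chain at the first existing directory.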

jdMorgan

4:01 pm on Jun 13, 2006 (gmt 0)

I think you're dropping the wrong parts of the URL.

You're also doing this the 'hard way' by using so many variables, but we'll get to that later after you get it working.

If you test $1/$2/$3/$4/$5 and it doesn't exist, then the next step is to check (and possibly redirect to) $1/$2/$3/$5, not $1/$2/$3/$4. You'll need to check/redirect to the same file, but one directory up. The fact that you're accessing the file "/" doesn't matter; it's still the index file of a directory.

So, taking a URL-view, if
/a/b/c/d/e/foo.html doesn't exist, then check/redirect to
/a/b/c/d/foo.html, and not to
/a/b/c/d/e

I hope that's clear.

Now in case I'm not back here for awhile, let me just demonstrate an easier way to do this, with the warning that it will be easier to do this later after you get all the bugs worked out:

Instead of working all the way through this with one set of rules for each possible directory depth, you can 'nest' the parentheses and do it with just one rule. Something like:


# If requested URL does not exist
RewriteCond %{DOCUMENT_ROOT}/$1$3/$4 !-f
# but it does exist with one directory level removed
RewriteCond %{DOCUMENT_ROOT}/$1$4 -f
# redirect to higher-directory URL (external redirect)
RewriteRule ^(([^/]+/)*)([^/]+)/([^/]*)$ http://www.example.com/$1$4 [R=301,L]

Note that in order to determine the correct back-reference number $1-$9, you should count left-parentheses. Therefore, $2 never appears as a back-reference, because it is re-used in the pattern, and all you need to back-reference is the end result, contained in $1. Variable $1 will contain *all* directory levels except for the last (bottom) one, which will be in $3. And $4 will contain the filename, if there is one. Also, note that $1 will contain its own trailing slash, so that slash is removed from the RewriteConds and the substitution URL expressions.
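
The placement of the outer parentheses matters here. The sketch below uses Python's re module and assumes Apache's regex engine numbers and fills groups the same way: when the repetition operator is applied directly to a capturing group, that group keeps only its last repetition, whereas a group wrapped around the whole repetition captures every directory level.

```python
import re

url = "a/b/c/d/e/foo.html"

# Repetition applied to the capturing group itself: group 1 is overwritten
# on every repetition and ends up holding only the last one.
repeated = re.match(r"^(([^/]+)/)*([^/]+)/([^/]*)$", url)
print(repeated.groups())  # ('d/', 'd', 'e', 'foo.html')

# Repetition wrapped inside an outer group: group 1 spans every repetition.
wrapped = re.match(r"^(([^/]+/)*)([^/]+)/([^/]*)$", url)
print(wrapped.groups())   # ('a/b/c/d/', 'd/', 'e', 'foo.html')
```

In the second form, group 1 plus group 4 gives a/b/c/d/foo.html, the same file one directory up.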

Again, I suggest not trying this until you get everything else working with say, up to three levels deep, because it may be *much* harder to debug. Get all issues with correct redirect URLs and with missing trailing slashes resolved first, then go for the 'elegant' solution.

Jim

Imaster

5:45 pm on Jun 13, 2006 (gmt 0)

If you test $1/$2/$3/$4/$5 and it doesn't exist, then the next step is to check (and possibly redirect to) $1/$2/$3/$5, not $1/$2/$3/$4. You'll need to check/redirect to the same file, but one directory up. The fact that you're accessing the file "/" doesn't matter; it's still the index file of a directory.

So, taking a URL-view, if
/a/b/c/d/e/foo.html doesn't exist, then check/redirect to
/a/b/c/d/foo.html, and not to
/a/b/c/d/e

Hi Jim,

Thanks for the information. The reason that I am doing so is because of the file structure that I have.

For example, if /a/b/c/d/e/foo.html doesn't exist, then I first need to check whether an index.html file exists at the path /a/b/c/d/e/index.html (which is equivalent to /a/b/c/d/e/). If it does not exist, then it should look for the path /a/b/c/d/index.html (which is equivalent to /a/b/c/d/).

Hence I was doing a $1/$2/$3/$4/$5 to $1/$2/$3/$4/

In my case,

/a/b/c/d/e/foo.html has not been moved to /a/b/c/d/foo.html

The reason I am doing so is that foo.html no longer exists because it is outdated, and hence I would like to redirect the user to the category (directory) under which it used to reside.

If that directory doesn't exist anymore either, then it should go one level higher, but look for index.html and not for foo.html in that higher-level directory.

I hope I have been able to clear it up.

Thanks a lot.

jdMorgan

9:40 pm on Jun 15, 2006 (gmt 0)

I tested this code on a live server, and it appears to do what you want:

# If requested URL *does not* exist
RewriteCond %{DOCUMENT_ROOT}/$1$3 !-f
# and next-higher directory level *does* exist
RewriteCond %{DOCUMENT_ROOT}/$1 -d
# Strip filename or lowest-level directory and externally redirect to next-higher-level directory-index URL
RewriteRule ^(([^/]+/)*)(.+)$ http://www.example.com/$1 [R=301,L]
#
# Else if requested URL *does not* exist
RewriteCond %{DOCUMENT_ROOT}/$1$3 !-f
# Strip filename or lowest-level directory, internally rewrite to next-higher directory-index URL, and restart mod_rewrite
RewriteRule ^(([^/]+/)*)(.+)$ /$1 [N,L]

If a requested URL does not exist, the URL is 'trimmed' of one path-part, starting with the filename or lowest-level directory path-part, and the result is checked for directory-exists. If that directory level exists, a 301 redirect is invoked.

Otherwise, the code will internally rewrite to the trimmed URL, and restart mod_rewrite processing. Therefore, the code effectively loops, trimming off one URL element per pass until an existing directory-level is reached. Then a 301 redirect is invoked.

This approach eliminates multiple 301 redirects as the filepath is traversed.

Note that this code should be placed as high as possible in your .htaccess file. Because it is recursive, all mod_rewrite code between the top of the file and this code will be re-executed every time a path-part is removed, and all non-mod_rewrite code will also be parsed (but ignored) by mod_rewrite. The point is that if there's a lot of code between the top of the file and this code, it will make this code run slowly and inefficiently. However, you may have code that does need to execute before this code, and that should be left in place.

Jim