Forum Moderators: phranque


Is there any mod rewrite bug in my .htaccess?


Marfola

3:02 pm on Feb 16, 2009 (gmt 0)

10+ Year Member



We migrated to a new server in early January. Roughly 10 days later, redirect errors appeared in Google Webmaster Tools. Our previous server ran on Apache/2.2.4. The new server runs on Apache/1.3.37.

FYI: The URLs in my site and sitemap are all final URLs, not 301 redirects. I don't have any chained redirects; I handle each 301 in a single hop. Using WebSniffer I have tested every single URL and every possible URL permutation, including www, non-www, trailing slash, no trailing slash, index.html, typos, etc. They all return one of three headers: 301 (permanent redirect to the final URL), 404 (not found), or 200 (for my target URLs).

That leaves underlying software as the cause.

jdMorgan mentions known bugs in Apache 1.3.x in his thread [webmasterworld.com]. I don't have root access, so I can't enable the RewriteLog directives or look for errors in the log. Can someone please tell me whether, and how, I need to modify my .htaccess to work around this bug?

Here’s my .htaccess:


Options +FollowSymLinks
RewriteEngine On

# If you use the RealUrl extension, then you'll have to enable the next line.
RewriteBase /

RewriteRule ^typo3.*$ - [L]
RewriteRule ^typo3$ typo3/index.php [L]

#script non-www to www redirect
RewriteCond %{HTTP_HOST} !^www\.mysite\.com [NC]
RewriteRule (.*) fileadmin/scripts/do-redirect.php

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-l

RewriteRule .* index.php [L]

jdMorgan

4:25 pm on Feb 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Rule #1 can be shortened to just:
 RewriteRule ^typo3 - [L] 

without changing how it works. Shorter=faster.

Take the [NC] flag off the RewriteCond in rule #3.

You forgot the [L] flag on the RewriteRule for rule #3.

The code would be clearer without the blank line between the RewriteConds and the RewriteRule in rule #4.
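Putting all four changes together, that section of the .htaccess would read something like this (a sketch, using the hostname as posted; the second typo3 rule is omitted because the shortened first rule already matches those requests and stops processing):

Options +FollowSymLinks
RewriteEngine On
RewriteBase /

RewriteRule ^typo3 - [L]

#script non-www to www redirect
RewriteCond %{HTTP_HOST} !^www\.mysite\.com
RewriteRule (.*) fileadmin/scripts/do-redirect.php [L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-l
RewriteRule .* index.php [L]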

Jim

Marfola

10:33 am on Feb 17, 2009 (gmt 0)

10+ Year Member



Thanks for the advice.

You've suggested removing the [NC] flag. Shouldn't this be included in a canonicalization condition?

While definitely worth fixing, it is unlikely that these issues caused the redirect errors reported in Google Webmaster Tools. Here's the error:

Errors for URLs in Sitemap
http://www.example.com/tipstravelkids/ontheroad/travelgames/ Redirect error
http://www.example.com/destinationguides/grandcanyon/plan/ Redirect error
http://www.example.com/vacationimages/photogallery/ireland/ Redirect error

As previously mentioned, none of the URLs reported by Google appear in my sitemap or site. All of the URLs listed return a proper 301. I have checked all possible permutations in WebSniffer. None have chained redirects. The only response I received in the Google Webmaster Forum suggested that Google might not like something WebSniffer doesn't mind, and that maybe the underlying software is causing a conflict. It's now clear that .htaccess isn't the problem. There is nothing in the httpd.conf that could be causing a conflict. Is there a more accurate header tool than WebSniffer?

Do you have any other suggestions?

Here's an excerpt of my do-redirect.php script:

if ($_SERVER["REQUEST_URI"] == '/fileadmin/scripts/do-redirect.php')
    exit;

$site_prefix = 'http://www.example.com';
$uri = $_SERVER["REQUEST_URI"];
$urlPart = explode("/", $uri);

/* if URL like example.com//jdk: 404 */
if ($urlPart[1] == '' && $urlPart[2] != '') {
    pagenotfound();
    exit;
} else
/* homepage redirect: non-www to www */
if ($urlPart[1] == '') {
    $url = $site_prefix . "/";
    page301($url);
    exit;
} else
/* if file exists, redirect to www */
if (is_file("../.." . $uri)) {
    page301($site_prefix . $uri);
    exit;
} else
/* Tips URLs redirect. Correct URL structure: www.example.com/tipstravelkids/<category>/<title>/index.html */
/* Check if URL is malformed */
/* if URL is truncated, return 404 */
if ($urlPart[1] == 'tipstravelkids' && $urlPart[3] == '') {
    pagenotfound();
    exit;
} else
/* if URL is longer, return 404 */
if ($urlPart[1] == 'tipstravelkids' && $urlPart[5] != '') {
    pagenotfound();
    exit;
} else
/* if $urlPart[4] is anything other than "index.html" or "", return 404 */
if ($urlPart[1] == 'tipstravelkids' && ($urlPart[4] != 'index.html' && $urlPart[4] != '')) {
    pagenotfound();
    exit;
} else
/* if well-formed, check the db to be sure the subpaths are correct: if so, redirect to the www index, otherwise 404 */
if ($urlPart[1] == 'tipstravelkids' && ($urlPart[4] == 'index.html' || $urlPart[4] == '')) {
    $category = rawurldecode($urlPart[2]);
    $title = rawurldecode($urlPart[3]);
    $res = mysql_query("SELECT uid FROM tt_news WHERE deleted = 0 AND tx_m2ettnews_urlCategory = '" . addslashes($category) . "' AND tx_m2ettnews_urlTitle = '" . addslashes($title) . "'");
    /* if <category> and <title> don't exist, return 404
       (mysql_query() returns FALSE on error; no matching row also means 404) */
    if ($res === false || mysql_num_rows($res) == 0) {
        pagenotfound();
        exit;
    }
    /* else redirect to the www index */
    $url = $site_prefix . "/" . $urlPart[1] . "/" . $urlPart[2] . "/" . $urlPart[3] . "/index.html";
    page301($url);
    exit;
} else

.... check other kind of URL ...

and here's the page301 function:

function page301($url) {
    @header("HTTP/1.1 301 Moved Permanently", true, 301);
    @header("Location: " . $url);
    exit;
}

[edited by: eelixduppy at 5:38 pm (utc) on Feb. 19, 2009]
[edit reason] please use example.com in code [/edit]

jdMorgan

2:29 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> You've suggested removing the [NC] flag. Shouldn't this be included in a canonicalization condition?

"Recommended" would be more accurate.

Note that the RewriteCond pattern is negated with "!". Include the [NC] only if you wish to allow your site to be indexed under www.example.com, www.Example.com, wWw.ExAmPlE.cOm, and all other possible case variants.

While it is true that domain names are supposed to be handled in a case-insensitive way, allowing case variations with [NC] introduces a dependency: with [NC], your site now depends on search engines to "get it right." If they introduce a bug that affects domain casing, your site suffers, and it will likely take you some time to discover why. And you will find that the "cure" is to remove the [NC] from that line.

You should actually use

RewriteCond %{HTTP_HOST} !^www\.example\.com$ 

With the end-anchor and without [NC], this RewriteCond means, "Redirect unless the requested hostname is exactly 'www.example.com' -- no FQDN, no port numbers, no case variation."

If your server is directly-accessible by IP address, you should also provide for HTTP/1.0 client access by allowing a blank hostname, in order to avoid a possible "infinite" redirection loop:

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ 

Subtle code changes -- big effects.
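
If you ever want to drop the PHP script and do the whole hostname canonicalization in mod_rewrite, the equivalent rule pair would look something like this (a sketch -- example.com stands in for your real hostname):

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

The [R=301] makes the redirect permanent, and the back-reference $1 preserves the originally-requested path.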

Be very careful interpreting GWT's use of the word "error" -- it is up to you to decide whether a 301 "error" is really an error or not. They may be saying it's an error because they found a link to a URL that gets redirected, and they "want" to find links only to the final (new) URL. Of course, you can control that on your own site, but you have no control over old or incorrect links to your site found by Google on other Web sites.

Jim

Marfola

7:29 pm on Feb 17, 2009 (gmt 0)

10+ Year Member



Hi Jim,

thanks for the clarification and recommendations. Both are extremely helpful.

With regard to the 'errors' reported in GWT: if it were only a URL or two, I wouldn't be concerned. Unfortunately, four different URLs from random sections of my site are reported each day. Some have external links, but not all do.

Mariella

jdMorgan

11:37 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My main point is this:
> They may be saying it's an error because they found a link to a URL that gets redirected, and they "want" to find links only to the final (new) URL.

It appears that your code says, "If URL is bad, then return 404. If URL is good, then 301."

Therefore it appears that any "good" URL requested by Google will get redirected, and Google will call that an error. They want you to link to the final URL, and not to a URL that will return a redirect.

In simple and general terms, the solution is to change your code so that it says, "If URL is bad, then return 404. If URL is good, include and send page content." So, if the URL is good, you want PHP to simply "include" the content associated with the originally-requested URL, instead of redirecting to a different URL.

This is conceptually similar to the difference in mod_rewrite between an external redirect and an internal rewrite. Right now, your PHP code is doing an external redirect, while an internal "rewrite" would be more appropriate.

You will find that your server load is much reduced by using the "include" method, and that users will see the "page" they requested load twice as fast. This is because the 301 response no longer needs to be sent to the client, and the client does not have to issue a second HTTP request using the URL supplied by your redirect. They will never see any URL in their browser address bar except the one in the link that they clicked.

You also won't have to deal with the problem of preventing direct access to the content file.
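
To make that concrete, here is a rough sketch of the "include" approach. Note that lookup_content_file() is a hypothetical helper standing in for your existing tt_news lookup, not a real function:

/* Sketch of the internal-include approach. lookup_content_file()
 * maps the requested URI to a content file on disk (e.g. via the
 * db check you already have) and returns FALSE if none exists. */
$uri = $_SERVER["REQUEST_URI"];
$file = lookup_content_file($uri);

if ($file === false) {
    pagenotfound();   /* bad URL: 404, exactly as before */
    exit;
}

/* Good URL: serve the content directly -- no 301, no second
 * client request, and the address bar keeps the requested URL. */
include($file);
exit;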

Jim

[edited by: jdMorgan at 11:39 pm (utc) on Feb. 17, 2009]

Marfola

9:58 am on Feb 18, 2009 (gmt 0)

10+ Year Member



Jim,

I have nowhere near your expertise, so please help me understand the following:

> It appears that your code says, "If URL is bad, then return 404. If URL is good, then 301."

The code posted refers to non-www requests, and thus all 'good' URLs will return a 301 rather than a 200. It is my understanding that this is the best practice for canonical redirects.

> Therefore it appears that any "good" URL requested by Google will get redirected, and Google will call that an error. They want you to link to the final URL, and not to a URL that will return a redirect.

None of our internal links and none of the URLs in our sitemap return a 301. They are all final URLs (200), not 301 redirects. The only backlinks that return a 301 are 'good' URLs that are non-www or non-index. All other backlinks return either a 200 if good or a 404 if bad.

> This is conceptually similar to the difference in mod_rewrite between an external redirect and an internal rewrite. Right now, your PHP code is doing an external redirect, while an internal "rewrite" would be more appropriate.


Wouldn't we create a duplicate-content issue if we used an internal rewrite rather than a 301 for a canonical redirect?

> If your server is directly-accessible by IP address, you should also provide for HTTP/1.0 client access by allowing a blank hostname, in order to avoid a possible "infinite" redirection loop:
> RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$

Our server isn't directly accessible by IP address. Should we include the recommended code anyway?

Mariella

[edited by: Marfola at 10:00 am (utc) on Feb. 18, 2009]

jdMorgan

5:26 pm on Feb 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the redirect is only invoked when the requested hostname is wrong and needs to be canonicalized, then there is no problem; just ignore what I wrote. I spent only two seconds looking at your PHP code, and did not see the "purpose" of the redirect documented in that code.

If your server isn't accessible by IP address, and there is no chance that it ever will be, then you don't need to allow for blank HTTP Host headers. But in most cases, it's pretty hard to say "I'll never, ever change hosts." And since the result of not including that simple modification is that your server may temporarily lock up if you ever do move to an IP-accessible hosting account and you do get an HTTP/1.0 request, the modification is cheap insurance against a bug that might be hard to find if you don't remember this thread...

I believe that writing robust and well-documented code saves time and money, and this dictates my coding style.

Jim

Marfola

8:09 am on Feb 19, 2009 (gmt 0)

10+ Year Member



Thanks, Jim.

[edited by: Marfola at 8:10 am (utc) on Feb. 19, 2009]