Forum Moderators: phranque

mod rewrite sending people 404s

         

jake66

3:08 am on May 12, 2007 (gmt 0)

10+ Year Member



A (very small) portion of my visitors get incorrectly rewritten URLs throughout my website, on every page they click on.

This has happened on three different hosts, so it is not a host-specific issue.

For an example, this is my current setup:
/index.php?cPath=21

Setup with mod_rewrite:
/category/Category.html

Stylesheet:
/style.css (yes, it's in the root)

Example of an Image:
/images/pixel_trans.gif

For MOST users, it works fine! I cannot reproduce this error on any test machine or test website I tried it on.

But nevertheless, some users get:
/category/style.css
/category/images/pixel_trans.gif
....for EVERYTHING (every image, every stylesheet, etc. that's in the source).

Anyone know what could be causing this?

Here are the htaccess rules:

RewriteEngine on
RewriteBase /
RewriteRule ^([^/]*)\.html$ $1.php?%{QUERY_STRING} [NC]
RewriteRule ^/?(manufacturers)/([^/]*)\.html$ index.php?manufacturers_id=$2&%{QUERY_STRING} [NC]
RewriteRule ^/?(product)/([^/]*)\.html$ product_info.php?products_id=$2&%{QUERY_STRING} [NC]
RewriteRule ^/?(category)/([^/]*)\.html$ index.php?cPath=$2&%{QUERY_STRING} [NC]

The pages this occurs on are all re-written themselves (such as /category/), but the actual files are located in the root ( / ).

jdMorgan

6:42 pm on May 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure what URLs are requested to get these errors, and you didn't include any error log data, so let's just clean this up first and see if that helps:

RewriteEngine on
RewriteBase /
RewriteRule ^([^.]+)\.html$ $1.php [NC,L]
RewriteRule ^manufacturers/([^.]+)\.html$ index.php?manufacturers_id=$1&%{QUERY_STRING} [NC,L]
RewriteRule ^product/([^.]+)\.html$ product_info.php?products_id=$1&%{QUERY_STRING} [NC,L]
RewriteRule ^category/([^.]+)\.html$ index.php?cPath=$1&%{QUERY_STRING} [NC,L]

Also, you state that "The pages this occurs on are all re-written themselves (such as /category/)", but it's impossible to tell what effect that may have without seeing the code that rewrites them.

Jim

jake66

9:49 pm on May 12, 2007 (gmt 0)

10+ Year Member



Thanks for responding!

Here is a snippet from my error log:

[Sat May 12 15:59:59 2007] [error] [client 24.84.X.X] File does not exist: /home/**/public_html/category/images/pixel_trans.gif

The actual filename is /home/**/public_html/images/pixel_trans.gif
Below is the code that rewrites the links within the website. Both files are called as includes.

url rewrite:

<?php

function callback($pagecontent) {
$pagecontent = preg_replace_callback("/(<[Aa][ \r\n\t]{1}[^>]*href[^=]*=[ '\"\n\r\t]*)([^ \"'>\r\n\t#]+)([^>]*>)/",'wrap_href',$pagecontent);
return $pagecontent;

}

function transform_uri($param) {
$uriparts = parse_url($param[2]);
$newquery='';
$scheme = $uriparts['scheme'].'://';
if (($scheme!= 'http://') && ($scheme!= 'https://')) return $param[1].$param[2].$param[3];
$host = $uriparts['host'];
if ($host!= $_SERVER['SERVER_NAME'] && $host!= $_SERVER['SERVER_ADDR']) return $param[1].$param[2].$param[3];

$path = $uriparts['path'];
list($file,$extension) = explode('.', basename($path));
if($extension!= 'php') return $param[1].$param[2].$param[3];
$extension = ".html";
$path = rtrim(dirname($path),'/');
$query = $uriparts['query'];
$anchor = $uriparts['anchor'];
if ($a = explode('&',$query)){
foreach ($a as $b) {
list($key,$val) = split('=',$b);
switch ($key) {
case 'cPath':
if(eregi('[_0-9]', $val)){
if($cat_arr = explode('_', $val)){
$count = false;
foreach($cat_arr as $value){
$cat_Q = tep_db_query("select c.categories_id, cd.categories_name from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.categories_id = '" . $value . "' and c.categories_id = cd.categories_id");
$cat_name = tep_db_fetch_array($cat_Q);
if(!$count){
$result .= $cat_name['categories_name'];
$count = true;
}
else{
$result .= '_' . $cat_name['categories_name'];
}
}
$cat = '/category/'. str_replace(' ' , '+' , $result);
}
else{
$cat = '/category/'.$val;
}
}
else{
$cat = '/category/'.$val;
}
break;
case 'language':
$lan = $val.'/'.$path;
break;
case 'products_id':
$name_Q = tep_db_query("select products_name from " . TABLE_PRODUCTS_DESCRIPTION . " where products_id = '" . $val . "'");
$pro = ($t = tep_db_fetch_array($name_Q))? '/product/' . str_replace(" ", "_" , $t['products_name']) : '/product/'.$val;
break;
case 'manufacturers_id':
$manufacturers_Q = tep_db_query("select manufacturers_name from " . TABLE_MANUFACTURERS . " where manufacturers_id = '" . $val . "'");
$man = ($t = tep_db_fetch_array($manufacturers_Q))? '/manufacturers/'.str_replace(" ", "_" , $t['manufacturers_name']) : $man = '/manufacturers/'.$val;
break;
case 'catid':
if(strstr($_SERVER["HTTP_USER_AGENT"],'Mozilla')) $newquery .= $key.'='.$val.'&';
break;
default:
if($newquery || $key) $newquery .= $key.'='.$val.'&';
}
}
}
if ($newquery) $newquery = '?'.rtrim($newquery,'&');
$path = '';
if(isset($man)) $path .= $man;
if(isset($cat)) $path .= $cat;
if(isset($pro)) $path .= $pro;

((isset($man) || isset($cat) || isset($pro)))? $host .= '' :$host .= '/';
if($file == 'index' || $file == 'product_info'){
if((isset($man) || isset($cat) || isset($pro))) $file= '';
}
if(eregi('reviews',$file)) $file = '/' . $file;
return $param[1].$scheme.$host.$file.$path.$extension.$newquery.$anchor.$param[3];

}
function wrap_href($param) {
return transform_uri($param);
}

ob_start("callback");

?>

sef:

<?php

//products_id
if(isset($HTTP_GET_VARS['products_id']) &&!eregi('^[0-9]*$',$HTTP_GET_VARS['products_id'])){
$name_Q= tep_db_query("select products_id from " . TABLE_PRODUCTS_DESCRIPTION . " where products_name = '" . str_replace("_"," ", $HTTP_GET_VARS['products_id']) . "'");
if(tep_db_num_rows($name_Q)){
$t = tep_db_fetch_array($name_Q);
$HTTP_GET_VARS['products_id'] = $t['products_id'];
}
}
// manufacturers_id
if(isset($HTTP_GET_VARS['manufacturers_id']) &&!eregi('^[0-9]*$',$HTTP_GET_VARS['manufacturers_id'])){
$band_Q = tep_db_query("select manufacturers_id from " . TABLE_MANUFACTURERS . " where manufacturers_name = '" . str_replace("_"," ", $HTTP_GET_VARS['manufacturers_id']) . "'");
if(tep_db_num_rows($band_Q)){
$t = tep_db_fetch_array($band_Q);
$HTTP_GET_VARS['manufacturers_id'] = $t['manufacturers_id'];
} else {

require('includes/unknown.php');

exit();

}
}

if(isset($HTTP_GET_VARS['cPath']) &&!eregi('^[_0-9]$', $HTTP_GET_VARS['cPath'])){

$cPath = $HTTP_GET_VARS['cPath'];
if(!eregi('^[_0-9]*$',$cPath)){
$cat_arr = explode('_' , $cPath);

$parent = 0;

$count = false;

foreach($cat_arr as $value){

/*echo($value . '<br>');*/

if(!$count ) {
$cat_Q = tep_db_query("select c.categories_id, c.parent_id from " . TABLE_CATEGORIES . " c left join " . TABLE_CATEGORIES_DESCRIPTION . " cd on (c.categories_id=cd.categories_id) where cd.categories_name = '" . str_replace('+' , ' ' , $value) . "'");

} else {

$cat_Q = tep_db_query("select c.categories_id, c.parent_id from " . TABLE_CATEGORIES . " c left join " . TABLE_CATEGORIES_DESCRIPTION . " cd on (c.categories_id=cd.categories_id) where c.parent_id = '" . (int)$parent . "' and cd.categories_name = '" . str_replace('+' , ' ' , $value) . "'");

}
if( $cat_name = tep_db_fetch_array($cat_Q) ) {
if(!$count) {
$result .= $cat_name['categories_id'];
$count = true;
} else {
$result .= '_' . $cat_name['categories_id'];
}

$parent = $cat_name['categories_id'];

} else {

require('includes/unknown.php');

exit();

/*

tep_redirect(tep_href_link('unknown.php', '', 'NONSSL', false));

echo ('error with category ' . $value . '<br>');

*/

}
}
$HTTP_GET_VARS['cPath'] = $result;
}
}

?>

jdMorgan

11:18 pm on May 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can probably fix this in one of two ways:

Either modify the php script to drop the "fake" subdirectory path from image and css references, or add more mod_rewrite code to correct the bad URLs that php is generating.

It may be a simple case of linking to these files using page-relative links. Since these relative links are resolved by the browser, and the browser thinks the page is located at /category/<something>.html, any image referenced using a page-relative link like <img src="images/pic.gif"> will be requested from /category/images/pic.gif. A simple solution is to use a server-relative link (in this case, <img src="/images/pic.gif">) so the browser builds the image URL starting from the root directory instead of the page's directory.
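
For the second option, a minimal sketch of what such extra mod_rewrite code might look like is shown here. It assumes the only assets affected are /style.css and files under /images/, as in the examples given earlier in the thread, and that the fake directory is one of category, product, or manufacturers; adjust both lists to match the real site.

```apache
# Sketch only: place after "RewriteBase /" and before the other rules.
# Assumption: affected assets are limited to style.css and anything under images/.
# A request such as /category/images/pixel_trans.gif is internally rewritten
# to /images/pixel_trans.gif; $2 is the part after the fake directory.
RewriteRule ^(category|product|manufacturers)/(images/.+|style\.css)$ $2 [NC,L]
```

This only papers over the bad requests on the server side; switching the page source to server-relative links removes their cause.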

Jim

jake66

12:31 am on May 13, 2007 (gmt 0)

10+ Year Member



Since these relative links are resolved by the browser, and the browser thinks the page is located at /category/<something>.html, any image referenced using a page-relative link like <img src="images/pic.gif"> will be requested from /category/images/pic.gif.

So is a difference in browser configuration what causes this mishap for some people, but not others?
(For example, I'm looking at my website right now and the odd rewrites do not occur. Everything looks perfect, as it should.)

PS: Can you explain the benefits of the cleanup you did on my htaccess? I realize it's old code, but were there any security holes or memory leaks? Just want to know for future reference. :)

jdMorgan

7:28 pm on May 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not so much a "browser configuration" issue as a "correctly-written client" issue.

For a given URL, mod_rewrite will either work every time or it won't work at all, unless your Apache installation is corrupt in an extremely unlikely way. I assume that you're sure that your script and back-end are working, and/or that you don't use page-relative addressing on pages which have been "rewritten out" of their URL-implied subdirectories.

If that is the case, then my next question would be, "Are you sure that these are actually humans using browsers, and not just several instances of a badly-written scraper 'bot?" I'd be checking their user-agent strings to be sure they're all valid, chasing their IP addresses to see where they're from, looking at the raw log file to see if they take a 'human' click-path through your site, etc. Have any of these "visitors" ever contacted you to report a problem with your site?

Jim

jake66

6:00 am on May 14, 2007 (gmt 0)

10+ Year Member



Ah, that makes sense now. I never actually thought that it could have been scrapers.

No, nobody has ever emailed me informing me of this problem - I only know it exists because it shows up in the 404 error logs.

I've seen a few of them show up in my "who's online" page; some of them have valid referrer strings and (seemingly) valid user-agents.

But seemingly valid users are also consistently trying to request some MSOFFICE file with several query strings. I initially assumed these to be a zombie computer or hacker bot probing the website. I guess the same could be said for the people that are getting these mysterious rewrites.

Also, if you have a moment: I am interested in your cleanup of my htaccess script. Are there any immediate dangers in using my old version, or did you simply modernize it?

jdMorgan

1:11 pm on May 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I fixed real errors in the regex patterns; among other things, those errors make your rules execute several times more slowly than necessary. The rules will work as you have them, but they force the matching engine to perform many retries before a match is found, thus wasting CPU time.

For more information, see the regular-expressions tutorial cited in our forum charter [webmasterworld.com].
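
As an illustration of the kind of change involved (one reading of the cleanup, not a full account of it), here is the original category rule next to the revised one, with comments on what differs. The page name "Widgets.html" is a made-up example.

```apache
# Original rule, shown commented out for comparison:
#   RewriteRule ^/?(category)/([^/]*)\.html$ index.php?cPath=$2&%{QUERY_STRING} [NC]
# - [^/]* can also match dots, so for "category/Widgets.html" it first consumes
#   all of "Widgets.html" and must then back up one character at a time
#   until \.html can match.
# - In .htaccess the leading slash is removed before matching, so ^/? adds
#   an optional element that never matches anything.
# - (category) captures a value the substitution never uses.
# - The rule has no [L] flag, so rule processing does not stop when it matches.

# Revised rule: [^.]+ stops at the first dot, so \.html matches immediately.
RewriteRule ^category/([^.]+)\.html$ index.php?cPath=$1&%{QUERY_STRING} [NC,L]
```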

[added] The MSOffice requests may be a result of the visitor browsing with the "Discussion/collaboration" options turned on in MSIE, or in some cases, people actually using MS Word as a browser. The requested filenames can be used to determine which is the case. [/added]

Jim

[edited by: jdMorgan at 1:13 pm (utc) on May 14, 2007]

jake66

4:45 pm on May 16, 2007 (gmt 0)

10+ Year Member



Strange! When I updated my htaccess with your edits, I got 404s when I tried to click on a category. Any suggestions as to what could have happened?

jdMorgan

12:54 am on May 17, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dunno... Do category URLs differ from the others? Is there additional path info for categories?
(An example URL, with the domain changed to example.com, would be helpful.)

Jim

jake66

5:26 am on May 23, 2007 (gmt 0)

10+ Year Member



jdMorgan, Can I sticky you a URL? I don't really want to post it publicly. :)

jdMorgan

2:04 pm on May 23, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since I cannot answer all the questions here due to time limitations, and I don't claim even to be able to do so, let's not go there. Instead, please replace the domain with example.com, and change anything else you really need to change in the URL-path. The closer the example URL is to the real one, the more useful it will be in diagnosing the actual problem.

The things that make a forum like this one work are that *all* members can contribute to a thread, and that the thread is then available to many members who may come along later with the same or a similar problem.

That's why every one of us here at WebmasterWorld owns "example.com" and sells widgets... although color, size, and texture variations do occur :)

Jim

jake66

7:27 am on Jun 6, 2007 (gmt 0)

10+ Year Member



Understood. :)

URL structure as follows (one example):
[widgetwebsite.com...]

jdMorgan

4:42 pm on Jun 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I fat-fingered typing the pattern in this first rule:

RewriteRule ^([^.]+)\.html$ $1.php [NC,L]

It should be:

RewriteRule ^([^./]+)\.html$ $1.php [NC,L]

Otherwise, it will match any and all requests for URL-paths ending in .html, including those under /category/, /product/, and /manufacturers/, and so pre-empt all the rules that follow it.
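
Putting the correction together with the earlier cleanup, the complete rule set would presumably look like the sketch below. The comments are an interpretation of why the category links returned 404s; the file names in them (Widgets.html, contact.html) are made-up examples.

```apache
RewriteEngine on
RewriteBase /

# Top-level pages only, e.g. contact.html -> contact.php.
# [^./]+ cannot cross a slash, so category/Widgets.html no longer matches here.
# With the earlier [^.]+ it did match, and was rewritten to
# category/Widgets.php, which does not exist: hence the 404.
RewriteRule ^([^./]+)\.html$ $1.php [NC,L]

RewriteRule ^manufacturers/([^.]+)\.html$ index.php?manufacturers_id=$1&%{QUERY_STRING} [NC,L]
RewriteRule ^product/([^.]+)\.html$ product_info.php?products_id=$1&%{QUERY_STRING} [NC,L]
RewriteRule ^category/([^.]+)\.html$ index.php?cPath=$1&%{QUERY_STRING} [NC,L]
```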

Jim