htaccess is sending bots to nonexistent pages

seems to be only yahoo and google that are affected?

         

jake66

5:56 am on Nov 21, 2005 (gmt 0)

10+ Year Member



i use htaccess to rewrite from /product_info/product_id=00 to:
Product+Category/Product_Brand.html

the problem:
yahoo and google are indexing pages THAT DO NOT EXIST and never have.

they're indexing VALID category names, but they are also linking them to other categories. for instance:

Old+Antiques/1920s.html (valid)
the spiders find this no problem.

but for the past month, they have been finding URLs like this:
Old+Antiques/1920s_1930s.html
what they are doing is merging one category with another, and this is producing a 200/OK response.

i can't figure out why.
could it be in my htaccess or my code?

my htaccess:

RewriteRule ^/?(category)/([^/]*)\.html$ index.php?cPath=$2&%{QUERY_STRING} [NC]

my code:

<?php
/*
SEF Link Transformer for osCommerce (SEF stands for Search Engine Friendly)
Version: Lite 0.8.0 Alpha
Original contribution base: Silencer (silencer@softhome.net)
New contribution with category names, brand names, and product names: Nimmit - www.freeriderstores.com

*/

function callback($pagecontent) {
$pagecontent = preg_replace_callback("/(<[Aa][ \r\n\t]{1}[^>]*href[^=]*=[ '\"\n\r\t]*)([^ \"'>\r\n\t#]+)([^>]*>)/",'wrap_href',$pagecontent);
return $pagecontent;

}

function transform_uri($param) {
$uriparts = parse_url($param[2]);
$newquery='';
$scheme = $uriparts['scheme'].'://';
if (($scheme!= 'http://') && ($scheme!= 'https://')) return $param[1].$param[2].$param[3];
$host = $uriparts['host'];
if ($host!= $_SERVER['SERVER_NAME'] && $host!= $_SERVER['SERVER_ADDR']) return $param[1].$param[2].$param[3];

$path = $uriparts['path'];
list($file,$extension) = explode('.', basename($path));
if($extension!= 'php') return $param[1].$param[2].$param[3];
$extension = ".html";
$path = rtrim(dirname($path),'/');
$query = $uriparts['query'];
$anchor = $uriparts['anchor'];
if ($a = explode('&',$query)){
foreach ($a as $b) {
list($key,$val) = split('=',$b);
switch ($key) {
case 'cPath':
if(eregi('[_0-9]', $val)){
if($cat_arr = explode('_', $val)){
$count = false;
foreach($cat_arr as $value){
$cat_Q = tep_db_query("select c.categories_id, cd.categories_name from " . TABLE_CATEGORIES . " c, " . TABLE_CATEGORIES_DESCRIPTION . " cd where c.categories_id = '" . $value . "' and c.categories_id = cd.categories_id");
$cat_name = tep_db_fetch_array($cat_Q);
if(!$count){
$result .= $cat_name['categories_name'];
$count = true;
}
else{
$result .= '_' . $cat_name['categories_name'];
}
}
$cat = '/category/'. str_replace(' ' , '+' , $result);
}
else{
$cat = '/category/'.$val;
}
}
else{
$cat = '/category/'.$val;
}
break;
case 'language':
$lan = $val.'/'.$path;
break;
case 'products_id':
$name_Q = tep_db_query("select products_name from " . TABLE_PRODUCTS_DESCRIPTION . " where products_id = '" . $val . "'");
$pro = ($t = tep_db_fetch_array($name_Q))? '/product/' . str_replace(" ", "_" , $t['products_name']) : '/product/'.$val;
break;
case 'manufacturers_id':
$brand_Q = tep_db_query("select manufacturers_name from " . TABLE_MANUFACTURERS . " where manufacturers_id = '" . $val . "'");
$man = ($t = tep_db_fetch_array($brand_Q))? '/brand/'.str_replace(" ", "_" , $t['manufacturers_name']) : $man = '/brand/'.$val;
break;
case 'osCsid':
if(strstr($_SERVER["HTTP_USER_AGENT"],'Mozilla')) $newquery .= $key.'='.$val.'&';
break;
default:
if($newquery || $key) $newquery .= $key.'='.$val.'&';
}
}
}
if ($newquery) $newquery = '?'.rtrim($newquery,'&');
$path = '';
if(isset($man)) $path .= $man;
if(isset($cat)) $path .= $cat;
if(isset($pro)) $path .= $pro;

((isset($man) || isset($cat) || isset($pro)))? $host .= '' :$host .= '/';
if($file == 'index' || $file == 'product_info'){
if((isset($man) || isset($cat) || isset($pro))) $file= '';
}
if(eregi('reviews',$file)) $file = '/' . $file;
return $param[1].$scheme.$host.$file.$path.$extension.$newquery.$anchor.$param[3];

}
function wrap_href($param) {
return transform_uri($param);
}

ob_start("callback");

?>

code part 2


<?php
if(isset($HTTP_GET_VARS['products_id']) &&!eregi('^[0-9]*$',$HTTP_GET_VARS['products_id'])){
$name_Q= tep_db_query("select products_id from " . TABLE_PRODUCTS_DESCRIPTION . " where products_name = '" . str_replace("_"," ", $HTTP_GET_VARS['products_id']) . "'");
if(tep_db_num_rows($name_Q)){
$t = tep_db_fetch_array($name_Q);
$HTTP_GET_VARS['products_id'] = $t['products_id'];
}
}
// manufacturers_id
if(isset($HTTP_GET_VARS['manufacturers_id']) &&!eregi('^[0-9]*$',$HTTP_GET_VARS['manufacturers_id'])){
$brand_Q = tep_db_query("select manufacturers_id from " . TABLE_MANUFACTURERS . " where manufacturers_name = '" . str_replace("_"," ", $HTTP_GET_VARS['manufacturers_id']) . "'");
if(tep_db_num_rows($brand_Q)){
$t = tep_db_fetch_array($brand_Q);
$HTTP_GET_VARS['manufacturers_id'] = $t['manufacturers_id'];
}
}
if(isset($HTTP_GET_VARS['cPath']) &&!eregi('^[_0-9]*$', $HTTP_GET_VARS['cPath'])){
if(!eregi('^[_0-9]*$',$cPath)){
$cat_arr = explode('_' , $cPath);
foreach($cat_arr as $value){
$cat_Q = tep_db_query("select categories_id from " . TABLE_CATEGORIES_DESCRIPTION . " where categories_name = '" . str_replace('+' , ' ' , $value) . "'");
$cat_name = tep_db_fetch_array($cat_Q);
if(!$count){
$result .= $cat_name['categories_id'];
$count = true;
}
else{
$result .= '_' . $cat_name['categories_id'];
}
}
$HTTP_GET_VARS['cPath'] = $result;
}
}
?>

not sure if it matters, but i am using oscommerce. this is NOT an issue with oscommerce: i have already hit their support boards, and nobody there can find any reason to believe it's related to osc in any way.

none of the 3 files i posted are stock oscommerce.

what would i have to do to make this produce a 404 instead of 200 response?
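
One possible direction for the 404 question, as a rough sketch only. As posted, code part 2 looks up each name in cPath but never checks whether the lookup failed or whether the names actually belong together, so the page is served with a 200 either way. The sketch below assumes category names contain no underscores and uses a hypothetical in-memory array ($categories, with made-up ids) in place of the real database tables. The function names resolve_cpath and status_for are invented for illustration.

```php
<?php
// Hypothetical stand-in for the categories tables: name => id and parent id.
// In the real store these rows would come from a database lookup.
$categories = [
    'Old Antiques' => ['id' => 3,  'parent_id' => 0],
    '1920s'        => ['id' => 12, 'parent_id' => 3],
    '1930s'        => ['id' => 13, 'parent_id' => 3],
];

/**
 * Turn a name-based cPath such as "Old+Antiques_1920s" into an id-based one
 * such as "3_12". Returns null if any name is unknown, or if a segment is
 * not a direct child of the segment before it.
 */
function resolve_cpath(string $cPath, array $categories): ?string
{
    $ids = [];
    $expectedParent = null; // no parent check for the first segment
    foreach (explode('_', $cPath) as $segment) {
        $name = str_replace('+', ' ', $segment);
        if (!isset($categories[$name])) {
            return null; // unknown category name
        }
        $row = $categories[$name];
        if ($expectedParent !== null && $row['parent_id'] !== $expectedParent) {
            return null; // valid name, but not a child of the previous one
        }
        $ids[] = $row['id'];
        $expectedParent = $row['id'];
    }
    return implode('_', $ids);
}

/** Status code the page should answer with for a given name-based cPath. */
function status_for(string $cPath, array $categories): int
{
    return resolve_cpath($cPath, $categories) === null ? 404 : 200;
}
```

In the page script, a null result would be the point to send header('HTTP/1.1 404 Not Found') before any output and stop, instead of carrying on with an empty or partial cPath. With the sample data above, "1920s" resolves, while "1920s_1930s" does not, because 1930s is a sibling of 1920s rather than a child.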

jdMorgan

4:14 pm on Nov 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mod_rewrite in .htaccess is dumb as a rock. All it can do is rearrange the format of the URLs requested by client browsers and 'bots. It can't add new, 'sensible' words to a requested URL; at most it could repeat text that was already there.
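
To make that concrete, here is a small sketch that imitates the posted RewriteRule pattern with PHP's preg_match. It is only an approximation of the match and substitution (it leaves out the appended query string and is not how Apache evaluates the rule), and the function name rewrite_target is made up for the example.

```php
<?php
// Rough PHP imitation of the posted rule's pattern; the "i" flag mirrors [NC].
// Whatever sits between "category/" and ".html" is copied into cPath unchanged.
function rewrite_target(string $requestPath): ?string
{
    if (!preg_match('#^/?(category)/([^/]*)\.html$#i', $requestPath, $m)) {
        return null; // pattern does not match, so the rule would not apply
    }
    return 'index.php?cPath=' . $m[2];
}
```

Feeding it "category/1920s_1930s.html" gives back a target whose cPath is "1920s_1930s", exactly the text that was requested. The pattern accepts any name in that position, so a merged name can only arrive in the request itself; the rule has no way to build one.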

So the problem here is to determine whether the robots are 'inventing' these URLs, or whether you have published them on your pages somehow. Do a search to see if you can find these URLs on any of your pages. If so, then you'll need to review the scripts that create and output the links on your pages, and find out why the URL-description data entries are getting run-on or merged together.
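
One way to run that search, sketched under the assumption that the generated HTML of each page has been saved or fetched into a string. The function name find_links_containing and the sample markup are invented for the example, and the regular expression is a simple one that only handles quoted href values.

```php
<?php
// Collect every quoted href value in an HTML string that contains $needle.
function find_links_containing(string $html, string $needle): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $m);
    $hits = [];
    foreach ($m[1] as $href) {
        if (strpos($href, $needle) !== false) {
            $hits[] = $href;
        }
    }
    return $hits;
}

// Made-up sample page output containing one normal and one merged link.
$sample = '<a href="/category/1920s.html">1920s</a> '
        . '<a class="x" href="/category/1920s_1930s.html">?</a>';
$found = find_links_containing($sample, '1920s_1930s');
```

Running this over the final output of each page (after the link transformer has rewritten the hrefs) would show whether the site itself ever emits a merged URL.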

This forum is primarily Apache server-guts related, so if you need help with PHP, ask over in the PHP forum. You'll get more expertise over there. I suggest narrowing down the problem and posting only the relevant code snippets, since big code dumps put people off and will be trimmed or deleted by the moderators for that reason.

Jim

jake66

2:01 am on Nov 22, 2005 (gmt 0)

10+ Year Member



i've sent a bot through the site to find every url linked. they aren't showing up as being linked anywhere.

i don't know whether the problem is from the htaccess or the php, which is why i posted it in this forum.

jake66

3:32 am on Nov 22, 2005 (gmt 0)

10+ Year Member



it seems to have fixed itself. either that, or there was a glitch with one of the categories i deleted today.

if anyone has any suspicions as to why this happened, i would love to hear them :)