the problem:
yahoo and google are indexing pages THAT DO NOT EXIST and never have.
they're indexing VALID category names, but they are also linking them to other categories. for instance:
Old+Antiques/1920s.html (valid)
the spiders find this no problem.
but for the past month, they have been finding URLs like this:
Old+Antiques/1920s__1930s.html
what they are doing is merging one category with another, and this is producing a 200/OK response.
i can't figure out why.
could it be in my htaccess or my code?
if any part of the code needs to be posted, please let me know.
this is from an oscommerce-based website, but i have already hit the osc forums and it has been determined to not be an oscommerce issue.
the htaccess and rewrite script i use is not stock oscommerce.
i have sent a bot to pull every page linked on my site; none of these URLs show up anywhere.
for reference, here is my htaccess rule:
RewriteEngine on
RewriteBase /
RewriteRule ^([^/]*)\.html$ $1.php?%{QUERY_STRING} [NC]
RewriteRule ^/?(category)/([^/]*)\.html$ index.php?cPath=$2&%{QUERY_STRING} [NC]
i use htaccess to rewrite from /product_info/product_id=00 to:
Product+Category/Product_Brand.html
the problem:
yahoo and google are indexing pages THAT DO NOT EXIST and never have.
if you don't have something to show for this page
Old+Antiques/1920s__1930s.html
why are you sending back 200/OK?
i would send a 404 from php code to the client for any page for which i don't have information to show.
:)
for example:
the default oscommerce category URLs are: /index.php?cPath=1
mine are rewritten to use the category name in the url.
i'm new to htaccess & php, perhaps there is something painfully obvious i am missing? how could i force a 404 response on these nonexistent URLs?
You will need to check within your php script for a result, and if there is not any information, set a '404 Not Found' header manually, before anything is output by your php page.
EG
// No row came back for this request: send a 404 before any output.
if (!$result = mysql_fetch_array($your_stuff)) {
    header($_SERVER['SERVER_PROTOCOL'] . " 404 Not Found");
    exit();
}
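A slightly fuller sketch of the same check, assuming the valid category names have already been loaded from the database into an array. The names `status_for_category`, `send_404_if_missing` and `$valid` are made up for illustration; they are not part of oscommerce or of the rewrite contribution.

```php
<?php
// Hypothetical helper: decide the HTTP status for a requested category name.
// $valid maps exact category names to ids (assumed to come from the database).
function status_for_category(string $name, array $valid): int
{
    // Exact match only, so "1920s__1930s" cannot fall through to "1920s".
    return array_key_exists($name, $valid) ? 200 : 404;
}

// Hypothetical helper: send the 404 header and stop if the name is unknown.
// It has to be called before the page has produced any output.
function send_404_if_missing(string $name, array $valid): void
{
    if (status_for_category($name, $valid) === 404) {
        $protocol = $_SERVER['SERVER_PROTOCOL'] ?? 'HTTP/1.1';
        header($protocol . ' 404 Not Found');
        exit();
    }
}

// Possible call at the top of the category page (illustrative data only):
// send_404_if_missing($_GET['cPath'] ?? '', ['1920s' => 12, '1930s' => 13]);
```

Keeping the decision in its own function makes it easy to check against a list of known good and bad names without actually sending headers.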
Hope this helps.
Justin
I guess that your rules are written in a way that non-existent categories are (wrongly) getting matched by one of the rules. This is usually not a problem, since the links within your site _should_ all point to sensible URLs, so (e.g.)
category1/bluewidget/ID55
would be the "same" page as
category1/blahblahblah/ID55
But you would never link to the latter, so it's not a problem.
OSCommerce can be notoriously finicky with caches and shared session directories which _might_ explain your overlapping category links.
Do you have 1 rule per category, or 1 rule to cover all cases?
if the URL you enter is not matched by any rules...
Most .htaccess files use pattern matching for efficiency, and SEs request random URLs to see how you handle 404 errors on your site... (also, if you are my competitor and I notice you load a blank page when no information is found rather than serving a 404 I might be inclined to give you some links --- free! =)
The best answer is to ensure your php page serves a proper error if there are no results returned.
Justin
If your site returns blank pages, I (or SEs requesting random pages, with similar patterns to real pages) can create multiple, duplicate content pages on your site for you...
A pattern matching .htaccess file is easy to get around, but extensive, exact rules are time consuming --- There is nothing wrong with the .htaccess rules posted in message 1 of this thread (except that [^.] would be more efficient than [^/] AND an L flag with the NC --- [NC,L] --- would help a little too).
Unless you have *very* defined patterns in your URLs the best place to correct this issue is in the php file, not the .htaccess. To stop this in the .htaccess, you basically have to have a rule for every page. Why? because if you use a pattern, it will match more than you would like...
E.g. if all of my page names were 3 letters, 1 uppercase followed by 2 lowercase, I would match them with the following:
RewriteRule ^[A-Z][a-z]{2}\.html$ /somestuff.html [L]
The above would match my pages (Dog, Cat, Log), but would also match any number of non-existent pages (Rrr, Www, Lrw).
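The over-matching is easy to see by running the same pattern through PHP's preg_match(). This is only a sketch: the `$real_pages` list and the `classify` function are hypothetical stand-ins for whatever your site actually knows about its own pages.

```php
<?php
// The RewriteRule pattern above, written as a PCRE for preg_match().
$pattern = '/^[A-Z][a-z]{2}\.html$/';

// Hypothetical list of pages that really exist on the site.
$real_pages = ['Dog.html', 'Cat.html', 'Log.html'];

// Report whether a URL passes the pattern and whether it is a real page.
function classify(string $url, string $pattern, array $real_pages): string
{
    if (!preg_match($pattern, $url)) {
        return 'no match';
    }
    return in_array($url, $real_pages, true)
        ? 'match, real page'
        : 'match, no such page';
}

foreach (['Dog.html', 'Rrr.html', 'Www.html', 'dog.html'] as $url) {
    echo $url, ': ', classify($url, $pattern, $real_pages), "\n";
}
```

Rrr.html and Www.html pass the pattern even though they are not in the list, so only a lookup against the real pages can tell them apart from Dog.html.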
To actually correct the issue in the .htaccess you need to have a rule for every page... Much better to correct the issue in the php that should have been this way from the start.
Justin
Also, try to figure out what URLs have been indexed. This will usually give you a clearer understanding of where the error is occurring.
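Besides checking the search engines themselves, the server access log shows which of these URLs the spiders actually requested and got a 200 for. A rough sketch, assuming Apache's common or combined log format and a list of valid paths you can export from your catalog; the function name `unexpected_200s` is made up for this example.

```php
<?php
// Hypothetical helper: list paths that returned 200 but are not known pages.
// Assumes Apache common/combined log lines such as:
// 1.2.3.4 - - [10/Oct/2006:13:55:36 -0700] "GET /a.html HTTP/1.1" 200 2326
function unexpected_200s(array $log_lines, array $valid_paths): array
{
    $found = [];
    foreach ($log_lines as $line) {
        // Capture the request path and the three-digit status code.
        if (preg_match('/"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) /', $line, $m)) {
            [, $path, $status] = $m;
            if ($status === '200' && !in_array($path, $valid_paths, true)) {
                $found[$path] = true;   // array keys de-duplicate repeats
            }
        }
    }
    return array_keys($found);
}
```

Feeding it the lines of the log (for instance from file()) together with the valid paths gives the list of pages that should have returned a 404.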
I wrote some other SEO tutorials (Advanced SEO URLs) which can be found among the contributions. Take a look, they might help your understanding.
for your reference, i have zipped all of the files as they are on my website (in relative folders as well)
i have also included the original contribution zip i downloaded from the oscommerce website, which rewrites these URLs.
you can download it here: (this doesn't work with a www for some reason):
s37.yousendit.com/d.aspx?id=1BBQL2TRG9CLI2HIKGXNFDFE00
it's in .zip format
7.44 KB in total size
i loaded it to yousendit.com to prevent unnecessary calls to my server (i don't know how many of you or lurkers are interested in downloading the file, and this is a very large forum..)
i don't know whether or not i'm allowed to post a zip with php and an htaccess file or not, if i am not... please delete it and i will simply post the source of the files needed.
:)
i hit reload on the offending URL....
all that showed were items relating to "1950s" instead of the 1920s items that originally showed.
once i renamed the subcategory back to 1920s & hit reload.... 1920s-related products were shown again.