Forum Moderators: phranque
Here is my .htaccess code:
RewriteEngine On
# Redirect to correct domain if incorrect to avoid canonicalization problems
RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
# Redirect URLs ending in /index.php or /index.html to /
RewriteCond %{THE_REQUEST} ^GET\ .*/index\.(php|html)\ HTTP
RewriteRule ^(.*)index\.(php|html)$ /$1 [R=301,L]
# Rewrite keyword-rich URLs for paged category pages
RewriteRule ^Products/.*-C([0-9]+)/Page-([0-9]+)/?$ category.php?category_id=$1&page=$2 [L]
# Rewrite keyword-rich URLs for category pages
RewriteRule ^Products/.*-C([0-9]+)/?$ category.php?category_id=$1&page=1 [L]
# Rewrite keyword-rich URLs for product pages
RewriteRule ^Products/.*-C([0-9]+)/.*-P([0-9]+)\.html$ /product.php?category_id=$1&product_id=$2&%{QUERY_STRING} [L]
# Rewrite media files
RewriteRule ^.*-M([0-9]+)\..*$ /media/$1 [L]
# Rewrite robots.txt
RewriteRule ^robots\.txt$ /robots.php [L]
[edited by: jdMorgan at 6:34 am (utc) on April 2, 2009]
[edit reason] example.com [/edit]
Your first two rules are reversed, and both should contain the domain name in the substitution URL. Note that we use only "example.com" here -- for our protection and yours.
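One possible arrangement, as a sketch only (assuming non-www is your canonical host, as in your own rules):

```apache
# Sketch: external redirects first, with the full canonical host in
# each substitution so a single 301 hop fixes everything at once.

# Redirect /index.php or /index.html requests to the directory URL.
# THE_REQUEST is tested so only direct client requests are caught.
RewriteCond %{THE_REQUEST} ^GET\ /([^?\ ]*/)?index\.(php|html)[?\ ]
RewriteRule ^(([^/]+/)*)index\.(php|html)$ http://example.com/$1 [R=301,L]

# Then redirect any other non-canonical hostname.
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```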
Your internal redirect rule patterns need some optimization, and there are some potentially very serious duplicate-content and googlebombing vulnerabilities in this approach. If other WebmasterWorld members do not address these additional problems, I will do so when I have a bit more time.
Jim
Study your code and how it differs from that in [webmasterworld.com...] and adopt that instead.
There are three major problems with each of your four rewrites, the ones where you use .* near the beginning of the pattern. Taking just one example:
# Rewrite keyword-rich URLs for category pages
RewriteRule ^Products/.*-C([0-9]+)/?$ category.php?category_id=$1&page=1 [L]
Using .*, the server will match everything in the URL right to the end. The .* pattern is 'greedy': it says 'grab the whole URL'. That is the wrong thing to do, because the regex engine then has to back off and retry hundreds of matches until it finds the right one, since there is more stuff to match after the point where you said 'get it all'. You should use a more specific pattern, one that can be parsed from left to right. I have no idea what goes in there, but I assume it is hyphenated keywords. The pattern 'as is' is inefficient and will be slow to operate. You might need something like (([^\-]+\-)+) instead. This says 'match up to the next hyphen, one or more times' and will be much more efficient.
The second problem is very common, and is a potential way for your search results to be completely destroyed. You likely want a URL like example.com/Products/my-cool-stuff-C94728/ but you don't 'qualify' or 'check' the keywords. That means a competitor could link to example.com/Products/poisonous-unsafe-junk-C94728/ and your site would return '200 OK' with duplicate content, which would be indexed and would rank. You need to capture that part of the URL and send it as an extra parameter (like &keywords=$n, for example) to your script, where the script will validate that the words are exactly right for that page of content. For non-valid words your script should use the header() command to return either a 404 error or a 301 redirect to the correct URL for that content.
The last problem is relatively minor, but is yet another way to destroy your SERPs. You allow a 'valid' URL either to have or to omit the trailing slash (see the /?$ in your code). That is, both versions return '200 OK' and the same content, which is more duplicate content. Pick one form to be the canonical URL, and issue a 301 redirect for all requests in the 'other' format. You already do that for hostnames: you redirect 'with-www' to 'without-www'. Do the same here and redirect 'with-slash' to 'without-slash'. That same rule should also force non-www within the same rule, otherwise you end up with a redirection chain, and this new redirect must be listed *before* all of the other redirects.
One minor issue is the use of mixed case. A mixed-case URL is much harder to pass on by speech and convey correctly, and allowing the server to respond to mixed case is another form of duplicate content. Use all lower-case for the whole URL if you can; it will prevent a lot of headaches in the future.
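Putting those fixes together for the category rule, a rough sketch (the &keywords= parameter name is just an assumption, and without-slash is taken as the canonical form, matching the advice above):

```apache
# Sketch only. External 301 first: force non-www and strip a trailing
# slash in a single hop, avoiding a redirect chain.
RewriteCond %{HTTP_HOST} !^example\.com$ [OR]
RewriteCond %{REQUEST_URI} ^/Products/.*-C[0-9]+/$
RewriteRule ^(Products/.+-C[0-9]+)/?$ http://example.com/$1 [R=301,L]

# Internal rewrite: a left-to-right pattern, passing the keywords to
# the script so it can validate them against the database. Note the
# captured keyword list keeps its trailing hyphen; the script should
# allow for that.
RewriteRule ^Products/(([^-/]+-)+)C([0-9]+)$ category.php?category_id=$3&keywords=$1&page=1 [L]
```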
Finally, you have a bunch of redirects completely missing. If I request example.com/category.php?category_id=458292&page=1, I will be served the content with a '200 OK' status. You should take these requests (both www and non-www, and with the parameters in *any* order) and redirect them to the canonical form, forcing non-www at the same time for those requests. Failure to do so is yet another source of duplicate content. Having said all that, your initial code was one of the best 'first go' coding examples seen in recent weeks. However, the job is a lot more involved than you first expected.
[edited by: jdMorgan at 4:01 pm (utc) on April 2, 2009]
[edit reason] edited at poster's request [/edit]
Normally those redirects would be listed first in your .htaccess file.
In this case you have to insert keywords into the new URL and there is no way for .htaccess to do that.
The solution is fairly simple. Use a rewrite to route those requests to a small redirect script that uses the category and/or product number to look up the keyword list in the database, then use PHP's header() function to send a 301 redirect to the correct URL. Be sure to send the 301 status explicitly: header('Location: ...') on its own defaults to a 302 redirect.
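A minimal sketch of such a script in PHP, with hypothetical names throughout (redirect.php, and a lookup_category_slug() stub standing in for the real database query):

```php
<?php
// Sketch of a hypothetical redirect.php for category URLs.
// lookup_category_slug() is a stand-in for the real database query,
// e.g. SELECT slug FROM categories WHERE id = ?
function lookup_category_slug($id) {
    $slugs = array(94728 => 'my-cool-stuff'); // illustrative data only
    return isset($slugs[$id]) ? $slugs[$id] : null;
}

// Build the canonical friendly URL, or return null for an unknown ID.
function canonical_category_url($id, $page) {
    $slug = lookup_category_slug($id);
    if ($slug === null) {
        return null;
    }
    return 'http://example.com/Products/' . $slug . '-C' . $id . '/Page-' . $page;
}

$id   = isset($_GET['category_id']) ? (int) $_GET['category_id'] : 0;
$page = isset($_GET['page']) ? (int) $_GET['page'] : 1;
$url  = canonical_category_url($id, $page);

if ($url === null) {
    // Unknown ID: fail hard rather than serving duplicate content.
    header('HTTP/1.1 404 Not Found');
} else {
    // header('Location: ...') alone sends a 302 by default, so the
    // 301 status line is sent explicitly first.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: ' . $url);
}
```

The same pattern extends to product URLs with a second lookup on the product ID.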
The URLs pointed to by redirects should also contain both the protocol and the full domain name, so that there is no ambiguity when the non-canonical version is requested. The redirect should fix both of those things at the same time as it fixes everything else.
Your redirect script is likely a dozen lines of PHP code and a database query, for each type of URL. It is a fairly simple job.
The redirect script will also need to send a 404 header for any completely invalid request (where the category or product ID does not exist at all).
Jim
The former should lead to content being served, and the latter should result in a redirect back to the friendly URL format. The requests could be differentiated by adding a 'hidden' parameter (like &friendly=true, for example) in the rewrite to trigger this selection. If the parameter is missing, serve a redirect to the friendly URL, stripping all parameters in that redirect. If the parameter is present, serve the content.
For my part, I would simply rewrite direct client requests for category.php and product.php URLs to this redirect.php script and let it generate the correct redirects. The script would need a quick database lookup to fetch the keyword part of the URL matching the category and product IDs, so that it can build the correct 'friendly' URLs.
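A rough sketch of that arrangement in .htaccess (the redirect.php name and the friendly parameter are hypothetical, and the first pattern is kept simple here for clarity):

```apache
# Sketch only. Internal rewrites append a 'hidden' marker so the
# scripts can tell rewritten requests from direct client requests.
RewriteRule ^Products/.*-C([0-9]+)/?$ category.php?category_id=$1&page=1&friendly=true [L]

# Direct requests for category.php (no marker) go to the redirect
# script instead; QSA carries the original parameters along so it
# can look up the keywords and send the 301.
RewriteCond %{QUERY_STRING} !(^|&)friendly=true(&|$)
RewriteRule ^category\.php$ redirect.php [QSA,L]
```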
It would also need to check that the category ID exists, that the product ID exists, and that the two form a valid combination when used together. For any request that fails this test, a 404 error needs to be issued. The same check belongs in your main script as well... you don't want to serve content when an incorrect category and product combination appears in a URL.