

Search engine friendly urls best practices

seo, search engine friendly urls, clean urls, htaccess php

         

topwebdesigns

9:38 am on Nov 20, 2009 (gmt 0)

10+ Year Member



For a few years now I've always created search engine friendly URLs manually through .htaccess, like so:

RewriteRule product/(.*)/(.*)-(.*).html$ index.php?page=product&cat=$1&id=$2&title=$3

RewriteRule category/(.*)-(.*).html$ index.php?page=category&id=$1&title=$2


Or something similar. However, I notice that nearly every script I look at nowadays uses something closer to:


RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) index.php

I'm guessing they then parse the URL in PHP (I often struggle to find the file that does this, and as such don't have a good idea of how they do it).

I've tried duplicating this myself, as I can see the benefits of it. However, I seem to be duplicating code between parsing and building a URL. Take these two smaller examples, for instance (any actual code would probably be a lot more in-depth):


<?php
class url {
    /**
     * $params
     *
     * Holds the parameters for a page.
     *
     * It's basically $_GET.
     *
     * @access private static
     * @var array
     * @see self::param()
     */
    private static $params = array();

    /**
     * param()
     *
     * Gives access to self::$params.
     *
     * If something doesn't exist it returns FALSE.
     *
     * @access public static
     * @param string
     * @return string|array
     * @use if(($value = self::param('id')) !== FALSE) { // Do something with $value }
     */
    public static function param($id = FALSE) {
        if(empty(self::$params)) {
            self::parse($_SERVER['REQUEST_URI']);
        }
        if($id !== FALSE) {
            return (isset(self::$params[$id])) ? self::$params[$id] : FALSE;
        } else {
            return self::$params;
        }
    }

    /**
     * build()
     *
     * Builds a url ready for output.
     *
     * @access public static
     * @param $app string
     * @param $page string
     * @param $params array
     */
    public static function build($app, $page = "index", $params = array()) {
        $fileType = '.html';
        $output = '';
        switch($app) {
            case 'catalog':
                $output = 'catalog';
                if($page == 'product') {
                    $output .= '/product/'.$params['cat_name'].'/'.$params['prod_id'].'-'.$params['prod_name'];
                } elseif($page == 'category') {
                    $output .= '/category/'.$params['cat_id'].'-'.$params['cat_name'];
                }
                break;
            case 'blog':
                $output = 'blog';
                if($page == 'article') {
                    $output .= '/article/'.$params['cat_name'].'/'.$params['article_id'].'-'.$params['article_title'];
                } elseif($page == 'category') {
                    $output .= '/category/'.$params['cat_id'].'-'.$params['cat_name'];
                }
                break;
        }
        return text::cleanUrl(SITE_URL.$output.$fileType);
    }

    /**
     * parse()
     *
     * Unpacks a url when the page is loaded.
     *
     * @access public static
     * @param string
     * @return array
     */
    public static function parse($rawURL) {
        // Strip leading/trailing slashes and the file extension, then the site url.
        $url = str_replace(SITE_URL, "", preg_replace('/^\/|\/$|\.[^.]+$/', '', $rawURL));

        $bits = explode("/", $url);

        switch($bits[0]) {
            case 'catalog':
                self::$params['app'] = 'catalog';
                switch($bits[1]) {
                    case 'product':
                        self::$params['page'] = 'product';
                        self::$params['cat_name'] = $bits[2];
                        // The last segment packs "id-title"; split on the first hyphen only.
                        list(self::$params['prod_id'], self::$params['prod_name']) = explode('-', $bits[3], 2);
                        break;
                    case 'category':
                        self::$params['page'] = 'category';
                        list(self::$params['cat_id'], self::$params['cat_name']) = explode('-', $bits[2], 2);
                        break;
                }
                break;
            case 'blog':
                self::$params['app'] = 'blog';
                switch($bits[1]) {
                    case 'article':
                        self::$params['page'] = 'article';
                        self::$params['cat_name'] = $bits[2];
                        list(self::$params['article_id'], self::$params['article_title']) = explode('-', $bits[3], 2);
                        break;
                    case 'category':
                        self::$params['page'] = 'category';
                        list(self::$params['cat_id'], self::$params['cat_name']) = explode('-', $bits[2], 2);
                        break;
                }
                break;
        }
        return self::$params;
    }
}
?>
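
To make the duplication clearer, here's roughly how it gets used (SITE_URL and the text::cleanUrl() helper come from elsewhere in my framework; the values below are made up):

<?php
// Building a link for output:
echo url::build('catalog', 'product', array(
    'cat_name'  => 'widgets',
    'prod_id'   => 42,
    'prod_name' => 'blue-widget'
));
// -> http://www.example.com/catalog/product/widgets/42-blue-widget.html

// Reading the parsed request when a page loads:
$app  = url::param('app');  // e.g. 'catalog'
$page = url::param('page'); // e.g. 'product'
?>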

Now, that's just a dumbed-down version of what I've actually got. There may be bugs in it; I just knocked it together quickly. What I'm more interested in is a discussion of the theories and best practices around doing it this way. How do other people do it, and what do you think is the best way to handle friendly URLs?

P.S. The preview seems to show all the formatting stripped out, so sorry if the code snippets are a tad unreadable.

rocknbil

7:38 pm on Nov 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't have time today for a full analysis, but a couple comments:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) index.php

This is actually server intensive because it has to check the file system on every request (thx jdMorgan). It's also a little bit lazy, but lazy is not always a bad thing; it can mean getting the job off your desk and on to more important things. :-)

Your initial method is much closer to efficient:

RewriteRule product/(.*)/(.*)-(.*).html$ index.php?page=product&cat=$1&id=$2&title=$3

This only rewrites based on the request and doesn't search the entire system; however, you shouldn't have to be concerned about the query string in the .htaccess. The directory doesn't even have to exist. This opens up a world of keyword-rich possibilities.

Let's say, instead, you do this

RewriteRule Keyword-Rich/.*$ products.cgi

(I intentionally changed it to .cgi; it's not always PHP. :-) But it's the same either way.)

Now, you're just passing anything that begins with /Keyword-Rich to products.cgi/.php.

What you do is write up a little parser that does this

/Keyword-Rich/Product-category/Product-Title

$uri = explode('/', $_SERVER['REQUEST_URI']);
$title = array_pop($uri); // Product-Title
$category = array_pop($uri); // Product-Category

Note I'm using words that can reflect keywords in most cases, not product or category id's as numbers. Why bother doing this if your URL is /Generic/1/2/345? In building your URL's, you would want to use something from the category and product titles.

If you use AdWords, some incoming URL's will still have a query string,

/Keyword-Rich/Product-category/Product-Title?bla-bla=AdWords-tag

So you'll have to split that off, just in case.

$tmp = explode('?',$title);
$title = $tmp[0];

Now using the method of choice, remove the dashes.

$title = preg_replace('/-/',' ',$title);
$category = preg_replace('/-/',' ',$category);

Depending on what you're doing, you may or may not need category, or maybe category needs to be a number.

select cat_id from categories where title = '$category';

This all being in a function,

return array($category_id, $title);

So when called,

list($cat_id, $product) = your_url_function();

But wait! We can still use old URL's by doing this:

if (! $cat_id) { $cat_id = $_GET['cat']; }
if (! $product) { $product = $_GET['prod']; }

(With sufficient sanitizing and error checking on those, of course.)

So now,

select * from products where cat_id=$category_id and p_title='$product';

And you have both SE friendly URL's and query strings working from the same script (you should do something about the query-string versions if they exist for very long, to avoid duplicate content).
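
Pulling those fragments together into one function, something like this (just a sketch; lookup_category_id() is a hypothetical stand-in for your own DB code):

<?php
// Rough consolidation of the above; sanitize everything before it hits SQL.
function parse_friendly_url() {
    $uri = explode('/', $_SERVER['REQUEST_URI']);
    $title = array_pop($uri);    // Product-Title (possibly with ?query attached)
    $category = array_pop($uri); // Product-Category

    // Split off any query string (AdWords tags etc.), just in case.
    $tmp = explode('?', $title);
    $title = $tmp[0];

    // Method of choice: remove the dashes.
    $title = preg_replace('/-/', ' ', $title);
    $category = preg_replace('/-/', ' ', $category);

    // Look the category up however your DB layer does it (hypothetical helper).
    $category_id = lookup_category_id($category);

    // But wait! We can still use old URL's (sanitize these, of course):
    if (!$category_id && isset($_GET['cat'])) { $category_id = (int) $_GET['cat']; }
    if (!$title && isset($_GET['prod'])) { $title = $_GET['prod']; }

    return array($category_id, $title);
}

list($cat_id, $product) = parse_friendly_url();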

It gets a little complicated and you have to make decisions on what to do about special characters and characters that are not **supposed** to be in URL's - ", ', ;, :, ?, etc. But using this method your CMS/product entry can be more aware of keyword rich titles and know they will go directly in the URL.

A second approach, which I use: products have a "url" field. If present, your programming uses it; otherwise, it uses the title.

select * from products where cat_id=$category_id and (p_title='$product' or p_url='$product');

This allows you to get up and running on the titles alone, and add the crafted URL's later. Of course, be sure you have error checking on input to make sure the url's and/or titles are unique.

Generating the URL's: as you extract the titles, sub out spaces for dashes. Be mindful of leading or trailing spaces, they will create trouble, and also manage the special characters as mentioned above, both on input and output.
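
For the generating side, a bare-bones sketch (the function name is made up; tighten the character handling to suit your data):

<?php
// Turn a title into a URL segment: trim, drop characters that aren't
// supposed to be in URL's, then sub out spaces for dashes.
function title_to_url($title) {
    $url = trim($title); // leading/trailing spaces create trouble
    $url = preg_replace('/[^A-Za-z0-9 -]/', '', $url); // ", ', ;, :, ? etc.
    $url = preg_replace('/ +/', '-', $url);
    return $url;
}

echo title_to_url(' Yellow Rabbits: Best Sellers! '); // Yellow-Rabbits-Best-Sellers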

topwebdesigns

11:44 am on Nov 21, 2009 (gmt 0)

10+ Year Member



Thanks for the response. I never considered the server intensity before; I suppose I fell into the mistake of seeing scripts like WordPress, IPB and Magento doing something and assuming it's the best way to do things.

Although, if we're talking about load etc., I would have thought doing an SQL query for the product title wouldn't be a very good approach. Perhaps I'm just old school, but I've always avoided using strings in a WHERE except in searches. Everything else works off ints, which I believe is less work for SQL?

As such, I always use something like domain.com/{p_id}-{p_title}.html and use p_id in the WHERE. Of course, this way p_title could be anything and it would still work, which is why a check has to be made, with a permanent redirect issued if the title is wrong, to avoid duplicate URLs. (I can't see any reason why the wrong title would be put in a URL, but it's always best to check.)

Also, is there any benefit to using "array_pop($uri)"?

I tend to know which array index I'm looking for and just use $uri[4] or whatever.

But the problem with duplicate code still seems to exist. If I change the URL I need to change at least two functions, and possibly the .htaccess if I change it to cater for something more specific (not to mention building a header redirect to cater for people accessing the old URL).

g1smd

1:42 pm on Nov 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Avoid the (.*) notation whenever possible. It is greedy, promiscuous and ambiguous, and very server intensive. Never have multiple (.*) patterns in a single rule; you could kill your server with the load.

Be specific about the exact URL format your site will use and craft .htaccess rewrite rules which closely match that URL format. This saves work in the script.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) index.php

The above is lazy. The code is very server intensive. It also maps requests for /robots.txt, and other such calls, to your script if the file is missing. Most scripts can't handle that request and so send garbage back to the bots.

When you use

([^/]+)/([^/]+)/([^/]+)
notation and fail to pass all three backreferences to your script, the unchecked variables open your site up to malicious linking. You link to
/cat/my-great-product/34567
and someone else links to
/cat/overpriced-low-quality-junk/34567
and you're in trouble if your site returns "200 OK" for both requests. Check all parts of the URL against your database, and issue a 301 redirect to the correct URL if the parts and names don't match up.
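
In PHP, that check might look something like this (a sketch; get_product_slug_from_db() is a hypothetical stand-in for your own lookup):

<?php
// $prod_id and $prod_name were extracted from the requested URL.
$real_name = get_product_slug_from_db($prod_id); // hypothetical helper

if ($real_name === false) {
    header('HTTP/1.1 404 Not Found'); // unknown id: don't return "200 OK"
    exit;
}
if ($prod_name !== $real_name) {
    // Wrong or malicious name part: redirect to the one true URL.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com/cat/' . $real_name . '/' . $prod_id);
    exit;
}
// Names match: serve the page with "200 OK" as normal.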

All of these things are mentioned very often in the WebmasterWorld Apache forum.

topwebdesigns

4:01 pm on Nov 21, 2009 (gmt 0)

10+ Year Member



Thanks for the points, g1smd. I agree it's important to check the URL and do a 301 redirect if the URL is incorrect.

However, could you clarify this statement: "Never have multiple (.*) patterns in a single rule"?

Are you suggesting that you shouldn't have something like this:

RewriteRule product/(.*)/(.*)-(.*).html$ index.php?page=product&cat_name=$1&prod_id=$2&prod_name=$3

That seems very restrictive when trying to use .htaccess to rewrite URLs.

Also, what are your thoughts on using something like ([0-9]+)? I generally use this to make sure I'm getting the right thing, e.g. prod_id should always be an int.

RewriteRule product/(.*)/([0-9]+)-(.*).html$ index.php?page=product&cat_name=$1&prod_id=$2&prod_name=$3

Is this more intensive as it needs to check that it's numerical? Or does it make little difference?

rocknbil

8:53 pm on Nov 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've always avoided using strings in a WHERE except in searches.

Yes, integer searches are faster, but you shouldn't take a significant performance hit for looking up a text value. Like anything else, it has to be weighed against what is most important to the task at hand: the nth detail or overall site performance - in terms of SEO and conversion. To wit,

domain.com/{p_id}-{p_title}.html

The whole point, really, of using SEO friendly URL's is to make the URL easily memorable and capitalize on keywords in the URL (which, by itself, is no "magic bullet" but is a major contributor to overall page indexing.) My point is to not waste this space on generic numbers.

Also, is there any benefit to using "array_pop($uri)"?

As opposed to what? There's always more than one way to do it. In the sample pseudo-code, the task is to get the last /delimited/elements/ in the URI, and this does it pretty reliably. You can do array counting and all sorts of things, but this is just one simple way to get at the elements you need to look up the product and category. The var $uri is not re-used, so there's no disadvantage to shortening it by the popped elements.

I tend to know which array index I'm looking for and just use $uri[4] or whatever.

Yes, but in a way, you're "trapping" yourself into this logic. :-) What if you decide to move something so that you now have an/extra/element in your URI's? You need to modify the programming. Or if you want to use your scenario on a new set of products without a category? You need to modify your programming. Or you add a new feature, say, cataloged articles, and want to apply this scenario to those? You need to duplicate your programming. Or move it to a subdomain? You guessed it . . . likely modify your programming.

The whole idea is to program it in such a way that you don't have to assume "rules" such as "product id is always $array[4], product is $array[5] . . ." In the above scenario, popping off the last two elements (or more!) allows you to evaluate them against your database so if something changes, you don't have to rework your script.
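
To illustrate with a made-up URI, the same two pops keep working no matter how deep the path gets:

<?php
// Both of these URIs yield the same last two elements:
//   /Keyword-Rich/Product-category/Product-Title
//   /store/uk/Keyword-Rich/Product-category/Product-Title
$uri = explode('/', '/store/uk/Keyword-Rich/Product-category/Product-Title');
$title = array_pop($uri);    // Product-Title
$category = array_pop($uri); // Product-category
// No $uri[4]-style assumptions to rework when the structure changes.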

coopster

2:28 pm on Dec 5, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Great question ... I'm going to pull back a moment from the technical talk though and pose this question in regards to best practices today ...

Is there still value today in having the "friendly" part of the category and/or product in the url?

CyBerAliEn

11:32 pm on Dec 10, 2009 (gmt 0)

10+ Year Member



Is there still value today in having the "friendly" part of the category and/or product in the url?

I'd argue: YES

Even though the notion of pretty/friendly URLs has been around a few years now, it still isn't 100% common. And the fact is, a URL like site.com/about/contact/ looks a lot better than, say, site.com?section=about&page=contact. It looks better to the human eye and the computer eye. Search engines will follow and check out links, but you're missing out on possible points by not having a relevant URL. If your page is all about yellow bunny rabbits, you'll certainly get better rankings with a URL like 'site.com/yellow-rabbits.html' (or such) versus 'site.com/rabbits/?color=yellow'. Is it a HUGE difference? NO! You won't fly to page 1 suddenly, lol. But the idea is to augment and capitalize on everything you have.

And the general movement, despite being "prettier" to look at, is LARGELY driven by the motivation to get better page rankings/placements/etc on the search engines (however small the effect is---many consider every little bit counts; if you can do it and it helps a little, why wouldn't you? especially if it can directly equate to $$$).

Most people (I am referring to my brother, my girlfriend, my friends, my family, co-workers, etc; people who are not web developers/programmers/technically-proficient) usually never even type a URL in; so it does not matter in this regard how simple/pretty a URL is. Most people get to where they want (even if they know what they want) by using search engines or bookmarks. If I had a penny for every time someone entered something common like "youtube" or "facebook" into their search bar or a search engine, or "gmail" into Google (or even "google" into a browser's Google search box) I'd be rich! haha

So will it make your site bad ass, top ranking? No. Will it help "a bit"? Yes. If you can do it, I say do it. As it really doesn't take a lot of effort to accomplish (especially with a simple Apache rewrite as noted above).

PS... I am one who uses the method you note in your first post. Crazy? Maybe. Why? I don't want to have to edit my .htaccess file for every possible routing request. I run every request through a "PHP router" to figure out what to do, where to send it, etc. In a sense, it IS more resource intensive; however, I have yet to experience any significant difference in performance. But I find it to be much more manageable (especially when your routing is more complex than just a single, simple process flow, "/products/yellow-banana.html" to "products.php?name=yellow-banana" etc). Though I'd agree... if you could easily get by with a few rules/conditions in Apache, I would recommend going that route.
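
For the curious, a stripped-down sketch of what I mean by a "PHP router" (the route table and handler names here are made up):

<?php
// index.php: every request that isn't a real file/dir lands here.
$path = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/');

// Hypothetical route table: regex => handler script.
$routes = array(
    '#^products/([a-z0-9-]+)\.html$#i' => 'products.php',
    '#^about/contact/?$#i'             => 'contact.php',
);

foreach ($routes as $pattern => $handler) {
    if (preg_match($pattern, $path, $matches)) {
        if (isset($matches[1])) {
            $_GET['name'] = $matches[1]; // e.g. yellow-banana
        }
        require $handler;
        exit;
    }
}

header('HTTP/1.1 404 Not Found');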

TheMadScientist

9:30 pm on Dec 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule product/(.*)/(.*)-(.*).html$ index.php?page=product&cat_name=$1&prod_id=$2&prod_name=$3

The best way is actually a forward-looking negative match...

RewriteRule product/([^/]+)/([^-]+)-([^.]+)\.html$ index.php?page=product&cat_name=$1&prod_id=$2&prod_name=$3

The preceding matches any character which is NOT a / one or more times, followed by any character which is NOT a hyphen one or more times, followed by a hyphen, followed by any character which is NOT a dot one or more times, followed by a . (dot)...

There are some huge differences in number of matches and the way they work which are so big I struggle with the math, but here's the English version:

.* matches any single character 0 or more times, so your first pattern matches the entire line to the end, then the second pattern matches everything to the end of the line, then your 3rd pattern matches anything to the end of the line... All patterns are matched and compared, then the best is used, so you can have from 1,000s to 100,000s or more unnecessary possible matches which have to be eliminated before 'the best match' can be determined.

Using forward-looking, negative matches eliminates this. Not only do they stop matching at a preset delimiter, the delimiter is checked for (by the Perl regex engine used by PHP, anyway) before any of the rest of the string is compared or considered a match. So in the preceding example, a /, a - and a . would be checked for prior to the regex matching the entire string, and if they are not present the match 'breaks' and the entire string is not compared.

Anything, including [A-Z0-9a-z] (which is more efficiently written as a case-insensitive [a-z0-9]), is more efficient than .* when it is repeated. .* matches everything, which is good when you only need to match everything once, but bad otherwise, because to find the best match, all matches and portions of matches must be compared.

* I hope I've said this in basic English well enough for people to get the point, and if I missed a little bit of technicality along the way, please just go with the point... .* is NOT the way to go, unless you are matching everything to the end of the line with a single (.*) pattern and IMO there is usually a better way even then.
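
If anyone wants to see it rather than take my word for it, here's a quick and dirty (unscientific) timing sketch; exact numbers will vary by machine, and the gap gets much bigger on long strings that don't match at all:

<?php
// Compare greedy .* patterns against negated-class patterns.
$url = 'product/some-category/12345-some-product-name.html';

$patterns = array(
    'greedy'  => '#product/(.*)/(.*)-(.*)\.html$#',
    'negated' => '#product/([^/]+)/([^-]+)-([^.]+)\.html$#',
);

foreach ($patterns as $name => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 100000; $i++) {
        preg_match($pattern, $url, $m);
    }
    printf("%s: %.4f seconds\n", $name, microtime(true) - $start);
}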

topwebdesigns

11:15 pm on Dec 11, 2009 (gmt 0)

10+ Year Member



Thanks a lot, that is yet another thing I hadn't contemplated before. No doubt it's only shaving milliseconds, but then again, if a site were to get 100,000 hits a day, an extra 36 milliseconds on each page winds up adding an extra hour's worth of processing per day (100,000 × 36 ms = 3,600 seconds).

TheMadScientist

1:56 am on Dec 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, it can actually add up in a hurry...

The following is from the PHP manual, and mod_rewrite uses a different regular expression engine, but it should give an idea of how they match:

Beware of patterns that contain nested indefinite repeats. These can take a long time to run when applied to a string that does not match. Consider the pattern fragment (a+)*

This can match "aaaa" in 33 different ways, and this number increases very rapidly as the string gets longer.

[us3.php.net...]

To expand on the topic a little: the .* pattern matching is not just forward...

I'm not sure if I can exactly explain how they work, because it's more of an 'understood thing' than something I usually try to put in words. The patterns match forward, taking into account all possible matches, then 'back off' or 'back match' from the end of the string. Then (if I'm remembering correctly) in your example, the second pattern would kick in and match to the end of the string, and would then 'back match' to the breaking point of the 3rd .*, which would then match to the end of the string, and the first pattern would be 'back-matched' to the first breaking point. All possible matches in between have to be compared to find the best match, so in a URL with 3 .* patterns and 16 characters you would have something like this:

url/url-url.html

The first .* matches 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 characters and they are stored.

The second .* matches 0,1,2,3,4,5,6,7,8,9,10,11,12 characters and they are all stored. Fragments of this match also match the overall pattern, so the first expression must be compared for a 16 character match, where this expression matches 0 characters; then it must be compared for a 15 character match, where this one matches 1 character; then... Another way of saying it is the first one matches all the way through, then backs off when this one kicks in, and all possibilities of both expressions matching any portion of the string must be evaluated... NVM what happens when the 3rd .* kicks in.

The third .* matches 0,1,2,3,4,5,6,7,8 characters and they are all stored. This one gets even worse than the first because there is a double overlap and I'm not even going to try to do the math on the possible combinations.

If someone feels like doing the math on all the possible combinations of matches, go ahead, but I don't have that kind of time. The easiest way to explain it is that the expressions basically match forward, then backward, then forward again until the best possible match is found. There could actually be millions of extra unnecessary possible matches in a normal-length URL.

Keeping in mind that (to the best of my knowledge) simply using a case-insensitive [a-z] rather than [a-zA-Z] saves 26 comparisons on a single character, because regular expressions look for the best possible match, really puts things in perspective.