Forum Moderators: coopster


URL special characters - need encoding/decoding help

html entities, url, special characters

         

nquinn

11:57 pm on Aug 24, 2009 (gmt 0)

10+ Year Member



Hi all,

I need some help handling special characters in my database.

GOAL: To create SEO-friendly URLs that include the title of the movie.

Example: www.domain.com/movie/name-of-movie,12345.htm

ENVIRONMENT: PHP5, MySQL with all fields set to UTF-8

THE PROBLEM:

Because the movie titles include a variety of special characters (HTML entities), I cannot output them to URLs directly. I have tried a variety of encoding/decoding methods, but am still having these issues. As far as I know, the original data source is all Latin-1.

My code:


function makeurl($input) {
    $input = str_replace("'", "", $input); // apparently decode doesn't work on this character
    $input = html_entity_decode($input); // first, undo html entities back into their raw characters
    $input = trim($input);
    $input = strip_tags($input);
    $input = str_replace(" & ", "-and-", $input);
    $input = str_replace("/", "-", $input);
    $input = str_replace(",", "", $input);
    $input = str_replace("'", "", $input);
    $input = str_replace('"', '', $input);
    $input = str_replace(":", "-", $input);
    $input = str_replace(" ", "-", $input);
    $input = str_replace(".", "", $input);
    $input = preg_replace("/[-\s]+/", '-', $input); // collapse runs of hyphens/whitespace into a single -

    $input = htmlentities($input); // re-encodes any remaining special characters as entities
    $input = strtolower($input);
    return $input;
}

Specific problems I need help with:

1. html_entity_decode does not seem to handle all HTML entities. According to a comment on php.net, it only handles about 100 of 250 possible entities.

For example, it does not properly decode the entity for a single quote (such as &#039;). It also has problems decoding other characters, like those in this title:

Prêt-à-Porter will decode to:
pr%26ecirc;t-%26agrave;-porter

Any suggestions?
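
For reference, here is a minimal sketch of the two calls in question, assuming the stored title contains named and numeric entities (the sample title and the variable names are made up for illustration). On PHP 5 releases of that era, html_entity_decode() called with no extra arguments leaves single-quote entities alone and does not produce UTF-8, so the sketch passes ENT_QUOTES and 'UTF-8' explicitly. It also shows that calling htmlentities() afterwards puts the entities back, which would explain entity names turning up in the finished URL.

```php
<?php
// Assumption: a title as it might be stored, with HTML entities in it.
$title = 'Pr&ecirc;t-&agrave;-Porter &#039;94';

// ENT_QUOTES makes the decoder handle &#039; as well as &quot;.
// The explicit 'UTF-8' charset makes the decoded string UTF-8.
$decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
echo $decoded, "\n";   // Prêt-à-Porter '94

// htmlentities() is the inverse step: it turns the raw characters
// back into entities, undoing the decode above.
$reencoded = htmlentities($decoded, ENT_QUOTES, 'UTF-8');
echo $reencoded, "\n"; // Pr&ecirc;t-&agrave;-Porter &#039;94
```

If the goal is a plain URL segment, the decoded string is the one to keep working with; the re-encoded one is only appropriate for HTML output.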

jatar_k

1:34 pm on Aug 25, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Not much to do beyond what you're already doing, if you are bent on having names in URLs.

Find every character that is improperly converted or handled, and handle it explicitly.

The only other option would be to normalize/simplify the data itself and substitute or remove some of the characters.
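
A rough sketch of that normalize-and-substitute approach might look like the following. The function name make_slug and the substitution table are hypothetical and deliberately tiny; a real table would need many more entries. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): reduce a title to an
// ASCII-only, hyphen-separated slug.
function make_slug($title)
{
    // Decode entities first; ENT_QUOTES also covers &#039;.
    $text = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
    $text = strip_tags($text);

    // Illustrative substitution table only, far from complete.
    $map = array(
        'ê' => 'e', 'è' => 'e', 'é' => 'e',
        'à' => 'a', 'â' => 'a',
        'ç' => 'c', 'ü' => 'u', 'ö' => 'o',
        '&' => ' and ',
    );
    $text = strtr($text, $map);

    $text = strtolower($text);
    // Drop quotes entirely so "Don't" becomes "dont".
    $text = str_replace(array("'", '"'), '', $text);
    // Collapse every remaining run of non-alphanumerics into one hyphen.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    return trim($text, '-');
}

echo make_slug('Pr&ecirc;t-&agrave;-Porter'), "\n";          // pret-a-porter
echo make_slug('Fast &amp; Furious: Tokyo Drift'), "\n";     // fast-and-furious-tokyo-drift
echo make_slug('Don&#039;t Look Now'), "\n";                 // dont-look-now
```

Anything not covered by the table simply becomes a hyphen, so the result is always safe to put in a URL, at the cost of losing characters the table does not know about.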

idfer

7:48 pm on Aug 25, 2009 (gmt 0)

10+ Year Member



I don't quite understand why you need to decode the movie name first. Are you getting the names from scraping other pages or something? If they're stored in the DB encoded for HTML, well, that's a bad idea for the reasons you've just run into.

Anyways, to insert the movie name into the url, all you need to do is pass it through urlencode(), something like:

$url = 'http://example.com/movie/' . urlencode($movie_name) . ',12345.htm';

And then when you're outputting the URL to HTML, pass it through htmlentities():

echo '<a href="'. htmlentities($url) . '">movie</a>';

No need to strip tags or anything, unless you really need to for SEO.
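
Put together, the two steps above could be sketched like this, assuming $movie_name already holds the raw UTF-8 title rather than an entity-encoded one (the sample value and the $link variable are illustrative, and the file is assumed to be saved as UTF-8):

```php
<?php
// Assumption: the title is a raw UTF-8 string, not HTML-encoded.
$movie_name = 'Prêt-à-Porter';

// Step 1: percent-encode the title for use inside the URL.
$url = 'http://example.com/movie/' . urlencode($movie_name) . ',12345.htm';
echo $url, "\n";
// http://example.com/movie/Pr%C3%AAt-%C3%A0-Porter,12345.htm

// Step 2: escape the finished URL for the HTML attribute it is printed in.
$link = '<a href="' . htmlentities($url, ENT_QUOTES, 'UTF-8') . '">movie</a>';
echo $link, "\n";
```

Note that urlencode() turns spaces into "+". For a path segment like this one, rawurlencode() may be the better fit, since it encodes spaces as %20.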