I've written a rudimentary function that does a bit of the legwork with the basic characters, but I would like one that will just convert any non-ASCII character into its ASCII counterpart. I've looked a bit on Google but haven't found anything. Does anyone have a function they've already written, know of an existing solution, or know of any resources on the common nasty characters from Rich Text/Word documents and their ASCII counterparts?
Thanks!
function superhtmlentities($text) {
    // Windows-1252 byte values in the 128-159 range mapped to HTML entity names.
    $entities = array(128 => 'euro', 130 => 'sbquo', 131 => 'fnof', 132 => 'bdquo', 133 => 'hellip', 134 => 'dagger', 135 => 'Dagger', 136 => 'circ', 137 => 'permil', 138 => 'Scaron', 139 => 'lsaquo', 140 => 'OElig', 145 => 'lsquo', 146 => 'rsquo', 147 => 'ldquo', 148 => 'rdquo', 149 => 'bull', 150 => '#45', 151 => 'mdash', 152 => 'tilde', 153 => 'trade', 154 => 'scaron', 155 => 'rsaquo', 156 => 'oelig', 159 => 'Yuml');
    $new_text = '';
    for ($i = 0; $i < strlen($text); $i++) {
        // Square brackets replace the old $text{$i} offset syntax, which PHP 8 no longer accepts.
        $num = ord($text[$i]);
        if (array_key_exists($num, $entities)) {
            switch ($num) {
                case 150:
                    // Byte 150 is written out as a plain hyphen.
                    $new_text .= '-';
                    break;
                default:
                    $new_text .= '&' . $entities[$num] . ';';
            }
        } else {
            // Keep bytes outside 127-159; unmapped bytes inside that range are dropped.
            if ($num < 127 || $num > 159) {
                $new_text .= $text[$i];
            }
        }
    }
    return $new_text;
}
Now all my pages with Word content validate!
The script you posted is, it would seem, just a slow way to convert from Windows-1252 to another encoding.
The entity given in the array for 150 is wrong: it should be an en dash, which is &ndash;, not &#45;, which is a hyphen. There's also no reason to special-case 150 in your switch statement.
[edit: whoops it is set to $num>159. My bad![/edit]
Finally, you need to go all the way through 159, the Y with diaeresis (Ÿ), so the test should be $num >= 160 or $num > 159.
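For what it's worth, here is a rough sketch of how the same table could be applied with the en-dash correction and without the per-character loop. The function name cp1252_to_entities() is made up for illustration, and it assumes the input really is single-byte Windows-1252 (running it on UTF-8 text would mangle multi-byte characters). Unlike the original function, it leaves unmapped bytes in the 127-159 range untouched instead of dropping them.

```php
<?php
// Sketch only: cp1252_to_entities() is a hypothetical name, not from this thread.
// Assumes $text is single-byte Windows-1252 data.
function cp1252_to_entities(string $text): string
{
    // Same table as the function above, except 150 now maps to ndash.
    $entities = [
        128 => 'euro',   130 => 'sbquo',  131 => 'fnof',   132 => 'bdquo',
        133 => 'hellip', 134 => 'dagger', 135 => 'Dagger', 136 => 'circ',
        137 => 'permil', 138 => 'Scaron', 139 => 'lsaquo', 140 => 'OElig',
        145 => 'lsquo',  146 => 'rsquo',  147 => 'ldquo',  148 => 'rdquo',
        149 => 'bull',   150 => 'ndash',  151 => 'mdash',  152 => 'tilde',
        153 => 'trade',  154 => 'scaron', 155 => 'rsaquo', 156 => 'oelig',
        159 => 'Yuml',
    ];

    // Build a byte => "&name;" lookup keyed by the raw single-byte character.
    $map = [];
    foreach ($entities as $byte => $name) {
        $map[chr($byte)] = '&' . $name . ';';
    }

    // strtr() with an array performs all the replacements in a single pass.
    return strtr($text, $map);
}
```

Called as cp1252_to_entities($text), this should turn a Word-style en dash (byte 150) into &ndash; and curly quotes into &ldquo; and &rdquo;, while plain ASCII passes through unchanged.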
More to the point, though, if you have the multibyte string (mbstring) extension compiled into PHP, you can use mb_detect_encoding() [php.net] to check whether what you have is in fact Windows-1252 and, if so, convert it to whatever encoding you want to use on your page with mb_convert_encoding() [php.net]. Be forewarned that you must convert to something like UTF-8 and serve your pages as such. If you convert to ISO-8859-1, it won't work right, because these code points (128-159) are not printable characters in ISO-8859-1.
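A minimal sketch of that conversion, assuming the mbstring extension is loaded. The helper name word_text_to_utf8() is invented for this example. Because guessing between single-byte encodings is unreliable, the sketch does not try to detect Windows-1252; it only checks whether the bytes are already valid UTF-8 and otherwise assumes Windows-1252, which is an assumption you would need to confirm for your own data.

```php
<?php
// Sketch only: word_text_to_utf8() is a hypothetical helper name.
// Requires the mbstring extension.
function word_text_to_utf8(string $text): string
{
    // Bytes that already form valid UTF-8 (including plain ASCII) are left alone.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise ASSUME the source is Windows-1252; this is not a detection result.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
```

The page then has to be served as UTF-8 as well, for example by sending header('Content-Type: text/html; charset=utf-8') before any output.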
The Unicode Support [webmasterworld.com] thread from our Forum Library [webmasterworld.com] should help you out. See especially message 8 and the links in the last message.
[edited by: ergophobe at 5:28 pm (utc) on Feb. 1, 2005]
With "systems" like that for "organizing" my knowledge, I can see I'm going to be in real trouble when my memory starts to fade!