Converting text for Word Document to ASCII

Getting rid of "smart" quotes and invalid characters


jusdrum

12:29 am on Jan 26, 2005 (gmt 0)

10+ Year Member



I have a few people who post content in a CMS. They write their stuff in Word, then cut and paste directly into the form. When this happens, it has characters in it (such as smart quotes, ellipses, etc) that cause the HTML validator to throw errors. I know it's a bit anal, but I'd like to be able to strip/replace these characters upon posting the text.

I've written a rudimentary function that does a bit of the leg work with the basic characters, but I would like one that will just convert any non-ASCII character into its ASCII counterpart. I've looked a bit on Google but haven't found anything. Does anyone have a function they've already written, know of a solution out there already, or know of any resources on common nasty characters from Rich Text/Word documents and their ASCII counterparts?

Thanks!

coopster

2:10 am on Jan 30, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Have a look at the open source project "htmlarea". It has a JavaScript function to strip out MS Word junk ...

ergophobe

12:16 am on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can also hook it up to integrate with tidy, which can strip Word junk as well as fix up the code in other ways (such as adding missing end tags, to the best of its ability).

Or is that what you meant Coop? I can't remember if you need tidy installed with htmlArea to get it to work or not.

jusdrum

5:37 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



I found a great function on php.net that makes it easy to strip nasty Word characters:

function superhtmlentities($text) {
    // Windows-1252 bytes 128-159 mapped to HTML entity names
    $entities = array(128 => 'euro', 130 => 'sbquo', 131 => 'fnof', 132 => 'bdquo', 133 => 'hellip', 134 => 'dagger', 135 => 'Dagger', 136 => 'circ', 137 => 'permil', 138 => 'Scaron', 139 => 'lsaquo', 140 => 'OElig', 145 => 'lsquo', 146 => 'rsquo', 147 => 'ldquo', 148 => 'rdquo', 149 => 'bull', 150 => '#45', 151 => 'mdash', 152 => 'tilde', 153 => 'trade', 154 => 'scaron', 155 => 'rsaquo', 156 => 'oelig', 159 => 'Yuml');
    $new_text = '';
    for ($i = 0; $i < strlen($text); $i++) {
        $num = ord($text[$i]);
        if (array_key_exists($num, $entities)) {
            switch ($num) {
                case 150:
                    $new_text .= '-';
                    break;
                default:
                    $new_text .= '&'.$entities[$num].';';
            }
        } else {
            // keep bytes outside 127-159; unmapped bytes inside that range are dropped
            if ($num < 127 || $num > 159) {
                $new_text .= $text[$i];
            }
        }
    }
    return $new_text;
}

Now all my pages with Word content validate!

ergophobe

6:51 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry, I think we misunderstood the question. The tidy solution may not actually handle that. I'm not sure. I was thinking, like Coopster, that you wanted to get rid of all the detritus Word adds during conversion to HTML. It's clear from your subtitle that's not the case, but I didn't understand your real question until you posted the followup.

The script that you give is just, it would seem, a slow way to convert from Windows-1252 to another encoding.

The entity code given in the array for 150 is wrong - it should be an en-dash, which is &#8211;, not &#45;, which is a hyphen. There's no reason to special-case 150 in your switch statement either.

Finally, you need to go all the way through 159, the Y Dieresis, so that should be $num>=160 or $num>159
[edit: whoops it is set to $num>159. My bad![/edit]
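
As a rough sketch of the correction suggested above, the table below maps each defined Windows-1252 byte in the 128-159 range to its Unicode code point and emits numeric character references, so byte 150 becomes &#8211; (en-dash) with no special case. It assumes the input really is single-byte Windows-1252 rather than UTF-8, and the function name cp1252_to_ncr() is invented for this example; it is not from php.net.

```php
<?php
// Hypothetical helper; assumes $text is single-byte Windows-1252.
function cp1252_to_ncr(string $text): string
{
    // Windows-1252 byte => Unicode code point (defined bytes in 0x80-0x9F only)
    $codepoints = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    // Build a byte => "&#NNNN;" replacement table
    $map = [];
    foreach ($codepoints as $byte => $cp) {
        $map[chr($byte)] = '&#' . $cp . ';';
    }

    // These five bytes are undefined in Windows-1252; this sketch drops them
    foreach ([0x81, 0x8D, 0x8F, 0x90, 0x9D] as $byte) {
        $map[chr($byte)] = '';
    }

    // All keys are one byte long, so strtr() does a single pass over the text
    return strtr($text, $map);
}

echo cp1252_to_ncr("1990\x962005"), "\n"; // prints 1990&#8211;2005
```

Numeric references are used here instead of named entities because bytes 142 and 158 (Z and z with caron) have no HTML 4 entity names, which is presumably why the php.net array skips them.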

More to the point, though, if you have the multi-byte extensions compiled into PHP, you can use mb_detect_encoding() [php.net] to find out if what you have is in fact Windows-1252 and, if so, convert to whatever encoding you want to use on your page using mb_convert_encoding() [php.net]. Be forewarned that you must convert to something like UTF-8 and serve your pages up as such. If you convert to ISO-8859-1, it won't work right because these code points (128-159) don't exist in ISO-8859-1.
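
A minimal sketch of the mbstring route, assuming the mbstring extension is loaded. Rather than relying on detection, it makes a blunt assumption: anything that is not already valid UTF-8 is treated as Windows-1252 pasted from Word. The function name word_to_utf8() is made up for illustration.

```php
<?php
// Hypothetical helper: normalise pasted text to UTF-8.
// Assumption: bytes that are not valid UTF-8 are Windows-1252.
function word_to_utf8(string $text): string
{
    // Leave text alone if it is already well-formed UTF-8 (plain ASCII included)
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise reinterpret the bytes as Windows-1252 and convert
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Smart quotes, an en-dash and an ellipsis as raw Windows-1252 bytes
echo word_to_utf8("\x93Hello\x94 \x96 done\x85"), "\n";
```

The output only displays correctly if the page is also served as UTF-8, for example with header('Content-Type: text/html; charset=utf-8') before any output is sent.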

The Unicode Support [webmasterworld.com] thread from our Forum Library [webmasterworld.com] should help you out. See especially message 8 and the links in the last message.

[edited by: ergophobe at 5:28 pm (utc) on Feb. 1, 2005]

jusdrum

7:22 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



Thanks ergophobe. Like I said though, I just found that function on php.net. You mentioned a few things that were wrong in terms of which characters were which. Do you know of a list of these characters and their corresponding HTML entities?

ergophobe

12:50 am on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Funny you should ask. When I was posting on this I thought about the odd workings of my mind - I don't have the characters memorized and I don't have a link bookmarked. When I need to, I just google on "Windows-1252 korpela". Pretty random, but it brings you straight to Jukka Korpela's excellent page [cs.tut.fi] on the subject.

With "systems" like that for "organizing" my knowledge, I can see I'm going to be in real trouble when my memory starts to fade!