It handles the basic stripping of angle brackets, ampersands, etc., but I am looking for something more elegant.
I found a thread on thelist at evolt that gives a PHP example for a similar situation:
[lists.evolt.org...]
But I want to use Perl to convert things like:
“ or ” to &#034;
(left and right curved double quote to standard ASCII double quote)
‘ or ’ to &#039; (single quote)
(left and right curved single quote to standard ASCII single quote)
– or — to &#045;
(en / em dash to hyphen)
etc...
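The conversions listed above could be sketched roughly as follows. This is only a minimal illustration, not an existing module: the `%ascii_entity` table and the `flatten_typography` helper are made-up names, and the sketch assumes the input has already been decoded into Perl characters (not raw bytes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup table: typographic character => ASCII numeric entity.
my %ascii_entity = (
    "\x{201C}" => '&#034;',   # left curved double quote
    "\x{201D}" => '&#034;',   # right curved double quote
    "\x{2018}" => '&#039;',   # left curved single quote
    "\x{2019}" => '&#039;',   # right curved single quote
    "\x{2013}" => '&#045;',   # en dash
    "\x{2014}" => '&#045;',   # em dash
);

# Hypothetical helper: expects a decoded character string, not raw bytes.
sub flatten_typography {
    my ($text) = @_;
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}\x{2014}])/$ascii_entity{$1}/g;
    return $text;
}

print flatten_typography("\x{201C}It\x{2019}s 5\x{2013}6 pm\x{201D}"), "\n";
# prints: &#034;It&#039;s 5&#045;6 pm&#034;
```

Adding another mapping is a matter of adding one table entry and one code point to the character class.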
I started playing with the CPAN HTML::Entities code, but after a few minutes I decided to ask here if anyone knew of something better before I spend too much time.
If nobody has anything else, here's a little toy I started. Enter some text with unescaped high ASCII, Unicode chars, Windows-1252 stuff, etc., then view the source of the page it returns and you'll see what "entities" does.
BUT I AM HOPING SOMEONE HAS SOMETHING BETTER ALREADY...
#!/usr/local/bin/perl
# ==========
# sgmconvt.pl
# ==========
#
#
use HTML::Entities;
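# encode_entities() treats this string as a regex character class, so the
# characters are listed without separators (commas would be encoded as well).
$unsafe_chars = "&<>\nü";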
#
use CGI;
$query = CGI->new;
#
$string = $query->param("f_string");
#
print "Content-Type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Online SGML Entities / Character Encoder Decoder Tool</title>\n";
print "</head>\n";
print "<body>\n";
print "<center>\n";
#
print "<FORM ACTION=\"\" METHOD=\"post\">\n";
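# Escape markup characters so submitted text cannot break out of the textarea.
print "<TEXTAREA NAME=f_string ROWS=4 COLS=40>", encode_entities( $string, '<>&' ), "</TEXTAREA><br>\n";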
print "<INPUT TYPE=reset VALUE=Reset>\n";
print "<INPUT TYPE=submit VALUE=Submit>\n";
print "</FORM>\n";
#
encode_entities( $string, $unsafe_chars );
print "<br>Encoded:<br>$string<br><br>\n";
#
print "</center>\n";
print "</body>\n";
print "</html>\n";
#
# eof
un-encoded characters
The characters are encoded, just not encoded in the charset you are using for the page. ;)
Basically, you can assume that content copied from Word is encoded in Windows-1252 (assuming the version of Windows is Western European). You are probably declaring either ISO-8859-1 or UTF-8 on your pages, and the Windows-1252 characters that have no ISO-8859-1 equivalent (the 0x80-0x9F range, which includes the curly quotes and the en and em dashes) are going to cause problems.
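To make the mismatch concrete, here is a small sketch using the core Encode module. The sample bytes are made up for illustration; the point is only that the same byte decodes to a different character depending on which charset you assume.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Four sample bytes, as they might arrive from text pasted out of Word:
# 0x93 and 0x94 are the curly double quotes in Windows-1252.
my $bytes = "\x93Hi\x94";

my $as_cp1252 = decode('cp1252',     $bytes);
my $as_latin1 = decode('iso-8859-1', $bytes);

# First character under each assumption.
printf "cp1252:     U+%04X\n", ord substr($as_cp1252, 0, 1);   # U+201C
printf "iso-8859-1: U+%04X\n", ord substr($as_latin1, 0, 1);   # U+0093
```

Read as Windows-1252 the byte is a left curly quote; read as ISO-8859-1 it is a C1 control character, which is why it shows up wrong on the page.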
You can look at using iconv to convert the incoming data to the character encoding of your choice: [search.cpan.org...]
[packages.debian.org...]
[gnu.org...]
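The links above are truncated, so it is not clear which iconv binding was meant. As a stand-in, the same kind of conversion can be sketched with Perl's core Encode module instead of iconv. The `cp1252_to_utf8` helper name is hypothetical, and the sketch assumes the incoming bytes really are Windows-1252.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: re-encode a byte string from Windows-1252 to UTF-8.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;
    from_to($bytes, 'cp1252', 'utf-8');   # converts the local copy in place
    return $bytes;
}

# A single Windows-1252 left curly quote becomes three UTF-8 bytes.
print unpack('H*', cp1252_to_utf8("\x93")), "\n";   # prints: e2809c
```

Plain ASCII passes through unchanged, so the helper can be applied to the whole field value.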
If you switch to UTF-8 for your site, you can add an accept-charset="UTF-8" attribute to your form and declare the page charset as UTF-8, and IE (and all modern browsers) will submit the textarea contents in UTF-8, including the curly quotes re-encoded as the UTF-8 counterparts.
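On the receiving side, a script would then decode the submitted bytes as UTF-8 before doing anything else with them. A minimal sketch, assuming the form carries accept-charset="UTF-8" and the field value arrives as raw UTF-8 bytes; the `utf8_field_to_ascii_html` helper is a made-up name, and writing everything non-ASCII as decimal character references is just one possible output choice.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take the raw bytes of a form field submitted as UTF-8
# and return pure-ASCII HTML, with markup characters and everything above
# U+007F written as decimal numeric character references.
sub utf8_field_to_ascii_html {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ s/([&<>"]|[^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

# Curly-quoted sample as UTF-8 bytes (E2 80 9C ... E2 80 9D).
print utf8_field_to_ascii_html("\xE2\x80\x9CR&D\xE2\x80\x9D"), "\n";
# prints: &#8220;R&#38;D&#8221;
```

Because the output is plain ASCII, it displays the same whatever charset the page ends up declaring.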