use Unicode::String qw(utf8);
$str_1 = utf8("csdude’s string");
$str_2 = utf8("🤪🤪🤪😀");
print qq~
$str_1
$str_2
~;

Oddly, both of these were posted by the same user!

I'm reminded of the behavior of this very forum before its most recent revision (some 7 years ago by now, how time flies). If the first non-ASCII character in a thread was UTF-8 encoded, the whole thread would be stored in the database as UTF-8; otherwise it defaulted to Latin-1. The current forums are strictly Latin-1, which may lead to picturesque results if you open a thread that predates the change.
The first one has an apostrophe that's encoded as iso-8859-1

Curly apostrophes and quotation marks are among the non-ASCII characters that exist in some of the 1-byte encodings loosely called Latin-1. But the curly apostrophe is not in the generic Latin-1 (ISO-8859-1) that overlaps with Unicode, only in Windows-Latin-1 (Windows-1252), where it is the single byte 0x92. So once a string containing a ’ has been encoded that way, its bytes can’t simply be reinterpreted as UTF-8.
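To make the point about the apostrophe concrete, here is a small sketch in PHP. It assumes PHP 7 or later with the mbstring extension loaded; the variable names are only for illustration. It shows the curly apostrophe as bytes in each encoding and why the Windows-1252 byte does not survive being read as UTF-8 or as strict ISO-8859-1.

```php
<?php
// Illustrative only. Assumes PHP 7+ with the mbstring extension loaded.

$curly = "\u{2019}";            // RIGHT SINGLE QUOTATION MARK, as a UTF-8 string
$utf8Hex = bin2hex($curly);     // three bytes in UTF-8: e2 80 99

// In Windows-1252 the same character is a single byte.
$cp1252 = mb_convert_encoding($curly, 'Windows-1252', 'UTF-8');
$cp1252Hex = bin2hex($cp1252);  // 92

// That lone byte is not a valid UTF-8 sequence.
$validUtf8 = mb_check_encoding($cp1252, 'UTF-8');

// Read as strict ISO-8859-1, byte 0x92 is the control character U+0092,
// which comes out as c2 92 in UTF-8 and not as an apostrophe.
$asLatin1Hex = bin2hex(mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1'));

// Decoded with the correct label, the apostrophe comes back intact.
$roundTrip = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

var_dump($utf8Hex, $cp1252Hex, $validUtf8, $asLatin1Hex, $roundTrip === $curly);
```

So the character is recoverable, but only by converting from the encoding the bytes were actually written in, not by relabelling them.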
DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci

<head>
<meta charset="utf-8">
</head>

<form action="xxx" accept-charset="utf-8">
cp1252 West European (latin1)

That explains much, since that is the very Windows codepage seen in your examples. I don't know if you want to take it as comforting, or the reverse, that your database appears to be doing exactly what it was told to do (whether by you, a factory default, a gremlin or some other agency).
$text = iconv(
"CP1252", // "from"
"UTF-8//IGNORE", // "to"; //IGNORE should force it to remove anything that can't be converted
$text);

// this
$text = htmlentities($text, ENT_QUOTES, "Windows-1252");
$text = html_entity_decode($text, ENT_QUOTES , "utf-8");
// and then this
mb_convert_encoding($text, 'CP1252', 'UTF-8');
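One thing worth checking in attempts like these: mb_convert_encoding() takes the target encoding before the source encoding and returns the converted string, so its result has to be assigned. Below is a minimal sketch of a conversion that is safe to run more than once. The helper name to_utf8_once is hypothetical, and it assumes mbstring is loaded and that anything which is not already valid UTF-8 was written as Windows-1252.

```php
<?php
// Hypothetical helper, not from the thread. Assumes the mbstring extension
// is loaded and that any non-UTF-8 input is Windows-1252.
function to_utf8_once(string $text): string
{
    // Already valid UTF-8 (plain ASCII included): leave it alone,
    // so a second pass cannot double-encode it.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Argument order is (string, to, from).
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$legacy = "csdude\x92s string";   // curly apostrophe stored as the cp1252 byte 0x92
$fixed  = to_utf8_once($legacy);  // apostrophe is now the UTF-8 bytes e2 80 99
$again  = to_utf8_once($fixed);   // second pass changes nothing

var_dump(bin2hex($fixed), $again === $fixed);
```

The validity check is a heuristic: a Windows-1252 string can, rarely, happen to be valid UTF-8 as well, so a sample of real rows should be inspected before running anything like this against the whole database.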
How long has the site been running, and how often is its content revisited?