We had to face several problems:
Since our host has set PHP and MySQL to handle iso-8859-1 (-15) charset encoding, we had to come up with a set of manipulations to 'tidy' up the input before storing it in DB.
We have noticed that characters outside the charset's scope already arrive entitized in the $_POST variable.
Besides, we want to get rid of the MS Word oddities, thus also entitizing those correctly.
It is also good practice to run htmlentities() on the input before storing it in the DB, to make sure, for instance, that accented characters are displayed correctly for every user. But doing so also turns the '&' of the previously created entities into '&amp;', ruining the effort, so the ampersand has to be restored.
We also want to get rid of the <br /> tags that nl2br() adds when text is echoed in a page (except inside an input field). Since users copy and paste like mad, the DB could otherwise be polluted by them.
And last, '&' is one of the forbidden characters, except when in an entity.
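The double-encoding problem described above can be reproduced in a few lines. This is only a minimal sketch, assuming ISO-8859-1 input and PHP 5.2.3 or later (where htmlentities() accepts a fourth double_encode argument). The function name restore_entities() and the sample string are made up for illustration and are not part of the class.

```php
<?php
// Hypothetical sample: "café" as ISO-8859-1 bytes, followed by an entity
// that is already present in the posted data.
$input = "caf\xE9 &#8230;";

// Plain htmlentities() also encodes the '&' of the existing entity.
$double = htmlentities($input, ENT_QUOTES, 'ISO-8859-1');
// $double is now "caf&eacute; &amp;#8230;"

// Option 1: restore the ampersand of anything that looks like an entity.
function restore_entities($str) {
    $pattern = '/&amp;(#\d{2,6};|#x[0-9a-fA-F]{2,6};|[a-zA-Z][a-zA-Z0-9]{1,7};)/';
    return preg_replace($pattern, '&$1', $str);
}
$restored = restore_entities($double);

// Option 2: ask htmlentities() to leave existing entities alone.
$single = htmlentities($input, ENT_QUOTES, 'ISO-8859-1', false);

echo $restored, "\n"; // caf&eacute; &#8230;
echo $single, "\n";   // caf&eacute; &#8230;
```

Where the fourth argument is available, option 2 makes the restore step unnecessary. Option 1 is closer to what the class does.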
I'm certain that many of you have also had to face such situations. Since I haven't found a ready-to-use solution, I made up our own as a PHP class. It is my very first one, and thus certainly improvable in many ways.
Your comments and suggestions will be appreciated.
Notawiz.
Here is the code of the class (attention: if the pipes in the regular expressions show up as '¦', they must be restored to '|'):
class entityHandler {
    function ms_cleanup($str) {
        $this->string = $str;
        // Only single-byte codes are handled here: code points above 255
        // (8216, 8220, 8226, ...) cannot occur as one byte, and chr() wraps them
        // modulo 256 onto unrelated characters (8226 would hit the double quote).
        $this->string = str_replace(chr(133), "&#8230;", $this->string); // ellipsis
        $this->string = str_replace(chr(145), "&#8216;", $this->string); // left single quote
        $this->string = str_replace(chr(146), "&#8217;", $this->string); // right single quote
        $this->string = str_replace(chr(147), "&#8220;", $this->string); // left double quote
        $this->string = str_replace(chr(148), "&#8221;", $this->string); // right double quote
        $this->string = str_replace(chr(149), "&#8226;", $this->string); // bullet
        $this->string = str_replace(chr(150), "&#8211;", $this->string); // en dash
        $this->string = str_replace(chr(151), "&#8212;", $this->string); // em dash
        $this->string = str_replace(chr(153), "&#8482;", $this->string); // trademark
        $this->string = str_replace(chr(169), "&#169;", $this->string); // copyright mark
        $this->string = str_replace(chr(174), "&#174;", $this->string); // registration mark
        return $this->string;
    }

    function entitize($str) {
        $entitized = htmlentities($this->ms_cleanup($str));
        // this may cause entitizing of "&" in entities => restore the ampersand
        $restored = $entitized; // default when no entity was double-encoded
        $search_pattern = "/&amp;(#\d{3,6};|[a-zA-Z]{2,6};|#[x0-9a-fA-F]{2,6};)/i";
        if (preg_match_all($search_pattern, $entitized, $entity_match)) {
            $old_matches = array();
            $replace_matches = array();
            foreach ($entity_match[0] as $e_key => $e_value) {
                $old_matches[] = "/$e_value/";
                $replace_matches[] = preg_replace("/&amp;/", "&", $e_value);
            }
            $restored = preg_replace($old_matches, $replace_matches, $entitized);
        }
        // finally, avoid having <br /> in DB when echoed string was copied/pasted
        $restored = $this->stripbr($restored);
        return $restored; // string to be stored in DB
    }

    // when echoed => nl2br. If then copied/pasted in textarea, strip the <br />
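    function stripbr($str) {
        // match both the raw tag and its entitized form (&lt;br /&gt;), since
        // entitize() calls this after htmlentities() has run
        $str = preg_replace('#(<|&lt;)br\s*/?\s*(>|&gt;)#i', "", $str);
        return $str;
    }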
    // check for forbidden characters
    function char_check($str, $comment_pattern, $old, $label, $invalid_comment, $length) {
        $error_pres = FALSE; // initialized so the checks below never read undefined variables
        $error_txt = "";
        $match_array = array();
        // split on entities so that their '&' is not reported as forbidden
        $split_pattern = "/&(#\d{3,6};|[a-zA-Z]{2,6};|#[x0-9a-fA-F]{2,6};)/";
        $chunks = preg_split($split_pattern, $str, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($chunks as $split_string) {
            if (preg_match($comment_pattern, $split_string, $match)) {
                $error_pres = TRUE;
                $match_array = $match; // will always be the last match of the tested string
            }
        }
        if ($error_pres == TRUE) {
            $new = array($label, $match_array[1], $length); // $match[1] must come AFTER the preg_match
            $invalid_presentation_comment = str_replace($old, $new, $invalid_comment);
            $error_txt = "$invalid_presentation_comment<br />\n";
        }
        return $error_txt;
    }
}
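The idea behind char_check() (cut the entities out first, then look for forbidden characters in what remains) can be tried on its own. The following is a simplified sketch, not the method above: find_forbidden_char() and the forbidden-character rule are hypothetical, and it only returns the offending character instead of building an error message.

```php
<?php
// Returns the last forbidden character found outside entities, or null.
function find_forbidden_char($str, $forbidden_pattern) {
    // Same entity shapes as in the class; the group is non-capturing here.
    $split_pattern = '/&(?:#\d{3,6};|[a-zA-Z]{2,6};|#[x0-9a-fA-F]{2,6};)/';
    $found = null;
    foreach (preg_split($split_pattern, $str, -1, PREG_SPLIT_NO_EMPTY) as $chunk) {
        if (preg_match($forbidden_pattern, $chunk, $match)) {
            $found = $match[1]; // first capture group = the offending character
        }
    }
    return $found;
}

// Hypothetical rule: '&', '<' and '>' are forbidden outside entities.
$forbidden = '/([&<>])/';
var_dump(find_forbidden_char('Tom &amp; Jerry &#8230;', $forbidden)); // NULL
var_dump(find_forbidden_char('Tom & Jerry', $forbidden));             // string(1) "&"
```

The pattern passed in needs one capture group around the forbidden characters, which is also what char_check() relies on when it reads $match_array[1].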
Can someone test it to see if it works as expected also on other platforms?
Thx
Notawiz
// cleanup of windows code page cp1252 characters 127-159
function ms_cleanup($str) {
    $this->string = $str;
    // 127 not in cp1252 (DEL)
    $this->string = str_replace(chr(128), "&#8364;", $this->string); // euro
    // 129 not in cp1252
    $this->string = str_replace(chr(130), "&#8218;", $this->string); // single low-9 quote
    $this->string = str_replace(chr(131), "&#402;", $this->string); // small f with hook
    $this->string = str_replace(chr(132), "&#8222;", $this->string); // double low-9 quote
    $this->string = str_replace(chr(133), "&#8230;", $this->string); // ellipsis
    $this->string = str_replace(chr(134), "&#8224;", $this->string); // dagger
    $this->string = str_replace(chr(135), "&#8225;", $this->string); // double dagger
    $this->string = str_replace(chr(136), "&#710;", $this->string); // circumflex
    $this->string = str_replace(chr(137), "&#8240;", $this->string); // per mille
    $this->string = str_replace(chr(138), "&#352;", $this->string); // capital s with caron
    $this->string = str_replace(chr(139), "&#8249;", $this->string); // single left angle quote
    $this->string = str_replace(chr(140), "&#338;", $this->string); // capital ligature oe
    // 141 not in cp1252
    $this->string = str_replace(chr(142), "&#381;", $this->string); // capital z with caron
    // 143 not in cp1252
    // 144 not in cp1252
    $this->string = str_replace(chr(145), "&#8216;", $this->string); // left single quote
    $this->string = str_replace(chr(146), "&#8217;", $this->string); // right single quote
    $this->string = str_replace(chr(147), "&#8220;", $this->string); // left double quote
    $this->string = str_replace(chr(148), "&#8221;", $this->string); // right double quote
    $this->string = str_replace(chr(149), "&#8226;", $this->string); // bullet
    $this->string = str_replace(chr(150), "&#8211;", $this->string); // en dash
    $this->string = str_replace(chr(151), "&#8212;", $this->string); // em dash
    $this->string = str_replace(chr(152), "&#732;", $this->string); // small tilde
    $this->string = str_replace(chr(153), "&#8482;", $this->string); // trademark
    $this->string = str_replace(chr(154), "&#353;", $this->string); // small s with caron
    $this->string = str_replace(chr(155), "&#8250;", $this->string); // single right angle quote
    $this->string = str_replace(chr(156), "&#339;", $this->string); // small ligature oe
    // 157 not in cp1252
    $this->string = str_replace(chr(158), "&#382;", $this->string); // small z with caron
    $this->string = str_replace(chr(159), "&#376;", $this->string); // capital Y with diaeresis
    // end of cp1252 differences
    return $this->string;
}
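For anyone who prefers a table over a long run of replacements, the same cp1252 cleanup can be sketched as a lookup map applied in one pass with strtr(). This is an assumption-laden sketch rather than a drop-in replacement: cp1252_to_entities() is a hypothetical standalone function, and it assumes single-byte (ISO-8859-1 / cp1252) input. On UTF-8 input it would corrupt multi-byte characters.

```php
<?php
// Map the cp1252 bytes 128-159 to numeric entities of their Unicode code points.
function cp1252_to_entities($str) {
    static $map = null;
    if ($map === null) {
        // cp1252 byte => Unicode code point (129, 141, 143, 144, 157 are unassigned)
        $codepoints = array(
            128 => 8364, 130 => 8218, 131 => 402,  132 => 8222, 133 => 8230,
            134 => 8224, 135 => 8225, 136 => 710,  137 => 8240, 138 => 352,
            139 => 8249, 140 => 338,  142 => 381,  145 => 8216, 146 => 8217,
            147 => 8220, 148 => 8221, 149 => 8226, 150 => 8211, 151 => 8212,
            152 => 732,  153 => 8482, 154 => 353,  155 => 8250, 156 => 339,
            158 => 382,  159 => 376,
        );
        $map = array();
        foreach ($codepoints as $byte => $codepoint) {
            $map[chr($byte)] = '&#' . $codepoint . ';';
        }
    }
    // strtr() with an array replaces every key in a single pass
    return strtr($str, $map);
}

echo cp1252_to_entities("\x93Hi\x94\x85"), "\n"; // &#8220;Hi&#8221;&#8230;
```

Since the map is built once and applied in a single pass, the order of the entries does not matter and no replacement can accidentally touch the output of an earlier one.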