Class for handling entities - my first attempt

Comments or improvements appreciated

         

Notawiz

4:46 pm on Aug 2, 2005 (gmt 0)

10+ Year Member



When we opened our website to users worldwide, a lot of annoying problems with characters (and entities) started to spoil the fun.

We had to face several problems:

  • people using different operating systems, and therefore sending different Accept-Charset headers
  • people writing texts in MS Word or other word processors, and then pasting them into input fields
  • people copying (entity-decoded) text strings echoed elsewhere in the page and pasting them into input fields
  • people entering non-Latin characters (Asian, Arabic, Cyrillic...)
  • people entering 'forbidden' characters

Since our host has set PHP and MySQL to handle the ISO-8859-1 (or -15) charset, we had to come up with a set of manipulations to tidy up the input before storing it in the DB.

We noticed that characters outside the charset's range already arrive entitized in the $_POST variable (this is usually done by the browser when it submits the form, rather than by the PHP engine).
Besides, we want to get rid of the MS Word oddities, so those have to be entitized correctly as well.

It is also good practice to run htmlentities() on input before storing it in the DB, to make sure that accented characters, for instance, are displayed correctly for every user. But doing so also turns the '&' of the previously created entities into '&amp;', ruining the effort. So the ampersand has to be restored.
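
The double-encoding problem can be reproduced in a few lines. This is only a minimal sketch, not the class posted further down: the sample string is made up, the input is assumed to be ISO-8859-1, and the entity pattern is one possible choice. It also shows that htmlentities() accepts a fourth `$double_encode` argument in PHP 5.2.3 and later, which may be an alternative to restoring the ampersand by hand.

```php
<?php
// Made-up sample: "é" as an ISO-8859-1 byte, plus an entity that is already present.
$input = "caf" . chr(233) . " &#8364; 5";

// Plain htmlentities() re-encodes the ampersand of the existing entity.
$double = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');
echo $double, "\n";   // caf&eacute; &amp;#8364; 5

// Option 1: restore the ampersand afterwards with a regex.
$entity_body = '(#\d{2,6};|#x[0-9a-fA-F]{2,6};|[a-zA-Z][a-zA-Z0-9]{1,7};)';
$restored = preg_replace('/&amp;' . $entity_body . '/', '&$1', $double);
echo $restored, "\n"; // caf&eacute; &#8364; 5

// Option 2: ask htmlentities() to leave existing entities alone.
$single = htmlentities($input, ENT_COMPAT, 'ISO-8859-1', false);
echo $single, "\n";   // caf&eacute; &#8364; 5
```

Option 1 also turns a literally typed "&amp;lt;" back into "&lt;", so whether it or option 2 fits better depends on what users are allowed to enter.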

We also want to get rid of the <br /> tags that nl2br() adds when text is echoed in a page (except inside an input field). Since users copy and paste like mad, the DB could be polluted by them.
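
A minimal sketch of that round trip, assuming the pasted text contains the literal tags: the text is echoed with nl2br(), comes back with the markup in it, and the tags are stripped before storage. The helper name `strip_br` is hypothetical, and preg_replace() is used here instead of the POSIX regex functions.

```php
<?php
// Hypothetical helper: remove <br>, <br/> and <br /> in any letter case.
function strip_br($str) {
    return preg_replace('/<br\s*\/?\s*>/i', '', $str);
}

$stored = "line one\nline two";
$echoed = nl2br($stored);        // "line one<br />\nline two"
$pasted = strip_br($echoed);     // only the tag is removed, the newline is kept

var_dump($pasted === $stored);   // bool(true)
```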

Lastly, '&' is one of the forbidden characters, except when it is part of an entity.
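
One way to picture this rule is to cut the entities out of the string first and then test only what remains. The sketch below is an illustration under assumptions: the function name `has_forbidden_char` and the forbidden set `& < >` are invented for the example, and it is not the class method posted further down.

```php
<?php
// Hypothetical helper: true if $str contains a forbidden character
// anywhere except inside a well-formed entity such as &#8364; or &eacute;.
function has_forbidden_char($str, $forbidden = '/[&<>]/') {
    $entity = '/&(?:#\d{2,6}|#x[0-9a-fA-F]{2,6}|[a-zA-Z][a-zA-Z0-9]{1,7});/';
    // Splitting on entities leaves only the text between them.
    $chunks = preg_split($entity, $str, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chunks as $chunk) {
        if (preg_match($forbidden, $chunk)) {
            return true;
        }
    }
    return false;
}

var_dump(has_forbidden_char('price: &#8364; 5'));  // bool(false)
var_dump(has_forbidden_char('fish & chips'));      // bool(true)
```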

I'm certain that many of you have also faced such situations. Since I haven't found a ready-to-use solution, I wrote our own as a PHP class. It is my very first one, and thus certainly improvable in many ways.

Your comments and suggestions will be appreciated.

Notawiz.

Here is the code of the class (attention: this forum replaces pipe characters, so if a pattern shows '¦' or '&#166;', restore it to '|'):


class entityHandler {

    // Replace MS Word (cp1252) bytes and two ISO-8859-1 marks with numeric entities.
    // Only single-byte values are mapped: code points above 255 (8216, 8217, ...)
    // cannot occur as one byte in an ISO-8859-1 / cp1252 string.
    function ms_cleanup($str) {
        $map = array(
            chr(133) => '&#8230;', // ellipsis
            chr(145) => '&#8216;', // left single quote
            chr(146) => '&#8217;', // right single quote
            chr(147) => '&#8220;', // left double quote
            chr(148) => '&#8221;', // right double quote
            chr(149) => '&#8226;', // bullet
            chr(150) => '&#8211;', // en dash
            chr(151) => '&#8212;', // em dash
            chr(153) => '&#8482;', // trademark
            chr(169) => '&#169;',  // copyright mark
            chr(174) => '&#174;',  // registered mark
        );
        return strtr($str, $map);
    }

    function entitize($str) {
        // charset given explicitly: the input is ISO-8859-1
        $entitized = htmlentities($this->ms_cleanup($str), ENT_COMPAT, 'ISO-8859-1');
        // this also entitizes the "&" of existing entities => restore the ampersand
        $search_pattern = '/&amp;(#\d{3,6};|[a-zA-Z]{2,6};|#[x0-9a-fA-F]{2,6};)/i';
        $restored = preg_replace($search_pattern, '&$1', $entitized);
        // finally, avoid having <br /> in DB when echoed string was copied/pasted
        $restored = $this->stripbr($restored);
        return $restored; // string to be stored in DB
    }

    // when echoed => nl2br. If then copied/pasted in textarea, strip the <br />
    // (matches the literal tag and the form already escaped by htmlentities)
    function stripbr($str) {
        return preg_replace('/(?:<|&lt;)br\s*\/?\s*(?:>|&gt;)/i', '', $str);
    }

    // check for forbidden characters (text inside entities is skipped)
    function char_check($str, $comment_pattern, $old, $label, $invalid_comment, $length) {
        $error_txt = '';
        $match_array = null;
        $split_pattern = '/&(#\d{3,6};|[a-zA-Z]{2,6};|#[x0-9a-fA-F]{2,6};)/';
        $chunks = preg_split($split_pattern, $str, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($chunks as $split_string) {
            if (preg_match($comment_pattern, $split_string, $match)) {
                $match_array = $match; // keeps the last match found in the tested string
            }
        }
        if ($match_array !== null) {
            // $comment_pattern needs a capture group: $match_array[1] is the offending character
            $new = array($label, $match_array[1], $length);
            $invalid_presentation_comment = str_replace($old, $new, $invalid_comment);
            $error_txt = "$invalid_presentation_comment<br />\n";
        }
        return $error_txt;
    }
}

Notawiz

9:39 am on Aug 3, 2005 (gmt 0)



After doing some more research and testing, I had to update the first function of the class, which handles the Windows cp1252 character soup.

Can someone test it to see if it also works as expected on other platforms?

Thx
Notawiz


// cleanup of Windows code page cp1252 characters 128-159
function ms_cleanup($str) {
    $map = array(
        // 127 is DEL (control character), left as is
        chr(128) => '&#8364;', // euro
        // 129 not in cp1252
        chr(130) => '&#8218;', // single low-9 quote
        chr(131) => '&#402;',  // small f with hook (florin)
        chr(132) => '&#8222;', // double low-9 quote
        chr(133) => '&#8230;', // ellipsis
        chr(134) => '&#8224;', // dagger
        chr(135) => '&#8225;', // double dagger
        chr(136) => '&#710;',  // circumflex
        chr(137) => '&#8240;', // per mille
        chr(138) => '&#352;',  // capital S with caron
        chr(139) => '&#8249;', // single left angle quote
        chr(140) => '&#338;',  // capital ligature OE
        // 141 not in cp1252
        chr(142) => '&#381;',  // capital Z with caron
        // 143 not in cp1252
        // 144 not in cp1252
        chr(145) => '&#8216;', // left single quote
        chr(146) => '&#8217;', // right single quote
        chr(147) => '&#8220;', // left double quote
        chr(148) => '&#8221;', // right double quote
        chr(149) => '&#8226;', // bullet
        chr(150) => '&#8211;', // en dash
        chr(151) => '&#8212;', // em dash
        chr(152) => '&#732;',  // small tilde
        chr(153) => '&#8482;', // trademark
        chr(154) => '&#353;',  // small s with caron
        chr(155) => '&#8250;', // single right angle quote
        chr(156) => '&#339;',  // small ligature oe
        // 157 not in cp1252
        chr(158) => '&#382;',  // small z with caron
        chr(159) => '&#376;',  // capital Y with diaeresis
        // end of cp1252 differences
    );
    return strtr($str, $map);
}