Converting accents, diacritics, etc to plain text

I have an occasional issue with users pasting entities in an attempt to get around my filters; e.g., they know that I'll filter "foo" so they use "föö" instead. Most of the time these show up when someone copies a pronunciation key from another site, so I don't want to forbid them entirely; I just want to convert each one to the closest ASCII character.

Since I'm pretty sure that this only occurs when they paste into a contenteditable area (unless there's a keyboard trick I don't know), I'm trying to convert those entities to plain text via onPaste before the text ever makes it to Perl for processing.
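For reference, a minimal sketch of how the paste could be intercepted. This is only an illustration under some assumptions: `getPastedText` is a hypothetical helper name, the element id matches the sample markup in this post, and `document.execCommand('insertText')` is used for brevity even though it is marked as deprecated and may need replacing with Selection/Range handling.

```javascript
// Hypothetical helper: return only the plain-text flavour of a paste event.
// jQuery wraps the native event, so clipboardData may live on originalEvent.
function getPastedText(event) {
  var clipboard = event.clipboardData ||
    (event.originalEvent && event.originalEvent.clipboardData);
  return clipboard ? clipboard.getData('text/plain') : '';
}

// Wire it up only when running in a browser with jQuery loaded.
if (typeof window !== 'undefined' && window.jQuery) {
  window.jQuery('#contenteditable').on('paste', function (e) {
    e.preventDefault();            // stop the browser inserting the raw clipboard content
    var text = getPastedText(e);   // any character conversion would be applied to `text` here
    document.execCommand('insertText', false, text);
  });
}
```

Reading the `text/plain` flavour also drops any markup that came along with the copied text, so only the characters themselves need converting afterwards.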

I'm about halfway there:

// HTML
<div id="contenteditable">
 thïš ïš ä prétty<b></b> thöröûgh štrïng
</div>

// JQuery
//  I know jQuery isn't necessary here, but I'm already using it for other things
String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g, function (v) {
    return '&#' + v.charCodeAt() + ';';
  });
};

alert($('#contenteditable').html().encodeHTML());

// Returns
// th&#239;&#353; &#239;&#353; &#228; pr&#233;tty<b></b> th&#246;r&#246;&#251;gh &#353;tr&#239;ng

That just gives me the decimal reference, though. I'd much rather have the named reference (like "&eacute;"), from which I could then just remove the "&" and "acute;" to leave the "e".

Short of defining a long array of every decimal reference that I can find, can you suggest how to either get the named reference instead of the decimal, OR to convert the decimal to named?

var entities = {
  '224': 'a',
  '192': 'A',
  // and so on
};

String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g, function (v) {
    return entities[v.charCodeAt()];
  });
};

var str = ($('#contenteditable').html().encodeHTML());

var str = 'thïš ïš ä prétty<b></b> thöröûgh štrïng';

var entities = {
  65 : 'A',
  // ...
  1514 : 'n'
};

String.prototype.encodeHTML = function () {
  return this.replace(/[\u00A2-\u0B7F]/g, function (v) {
    // fall back to the original character when there is no mapping
    return entities[v.charCodeAt()] || v;
  });
};

str = str.encodeHTML();
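As a possible alternative to maintaining the whole table by hand, here is a rough sketch that relies on Unicode decomposition, assuming the browser supports `String.prototype.normalize` (ES2015 and later). The function names and the small `extras` map are illustrative only and are not from the thread; letters that have no canonical decomposition (such as "ø" or "ß") still need a lookup entry.

```javascript
// Hypothetical helper: decompose accented letters, then drop the combining marks.
function stripDiacritics(text) {
  return text
    .normalize('NFD')                  // "é" becomes "e" followed by U+0301
    .replace(/[\u0300-\u036f]/g, '');  // remove combining diacritical marks
}

// Assumed, deliberately incomplete map for letters that do not decompose.
var extras = { 'ø': 'o', 'Ø': 'O', 'ß': 'ss', 'æ': 'ae', 'Æ': 'AE', 'đ': 'd', 'ł': 'l' };

function toPlainText(text) {
  // Anything still outside ASCII is looked up, or left alone if unmapped.
  return stripDiacritics(text).replace(/[^\x00-\x7F]/g, function (ch) {
    return extras[ch] || ch;
  });
}

console.log(toPlainText('thïš ïš ä prétty<b></b> thöröûgh štrïng'));
// prints "this is a pretty<b></b> thorough string"
```

With this approach the table only has to cover the handful of characters that survive decomposition, rather than every accented letter in the range.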
