
Converting accents, diacritics, etc to plain text

         

csdude55

7:51 pm on Nov 5, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have an occasional issue with users pasting entities in an attempt to get around my filters; e.g., they know that I'll filter "foo" so they use "föö" instead. Most of the time these show up when someone copies a pronunciation key from another site, so I don't want to forbid them entirely; I just want to convert each one to the closest ASCII character.

Since I'm pretty sure that this only occurs when they paste into a contenteditable area (unless there's a keyboard trick I don't know), I'm trying to convert those entities to plain text via onPaste before they ever make it to Perl for processing.

I'm about halfway there:

// HTML
<div id="contenteditable">
thïš ïš ä prétty<b></b> thöröûgh štrïng
</div>

// jQuery
// I know jQuery isn't necessary here, but I'm already using it for other things
String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g,
    function (v) { return '&#' + v.charCodeAt() + ';'; }
  );
};

alert($('#contenteditable').html().encodeHTML());

// Returns
// th&#239;&#353; &#239;&#353; &#228; pr&#233;tty<b></b> th&#246;r&#246;&#251;gh &#353;tr&#239;ng


That just gives me the decimal reference, though. I'd much rather have the named reference (like "&eacute;"), from which I could strip the "&" and "acute;" to leave the "e".

Short of defining a long array of every decimal reference that I can find, can you suggest how to either get the named reference instead of the decimal, OR to convert the decimal to named?
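One alternative to a lookup table, offered as a sketch rather than a drop-in answer, is Unicode normalization. `normalize('NFD')` splits a precomposed letter such as "é" into the base letter plus a combining mark, and the combining marks (U+0300 to U+036F) can then be stripped. The function name `stripDiacritics` is made up for this example. It only helps with letters that have a canonical decomposition; look-alikes such as "ø" or "ł" pass through unchanged and would still need a manual map.

```javascript
// Hypothetical helper: decompose accented letters, then drop the combining marks.
function stripDiacritics(text) {
  return text
    .normalize('NFD')                  // "é" becomes "e" + U+0301
    .replace(/[\u0300-\u036f]/g, '');  // remove combining diacritical marks
}

var sample = 'thïš ïš ä prétty<b></b> thöröûgh štrïng';
console.log(stripDiacritics(sample));
// this is a pretty<b></b> thorough string
```

Since this needs no table at all, it could run first, with a much smaller hand-built map covering only the characters it leaves untouched.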

csdude55

10:49 pm on Nov 5, 2021 (gmt 0)




I can get the hexadecimal UTF-16 code unit with this, but I have no idea if that helps:

function (v) { return v.charCodeAt().toString(16); }


As far as I can tell, that returns the same digits as the hexadecimal reference, just without the zero padding to four digits.
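To make the relationship between the forms concrete, here is a small illustration using "é" (written as an escape so the example does not depend on file encoding). The variable names are only for this example. Note that `charCodeAt` returns a single UTF-16 code unit, so characters above U+FFFF would need `codePointAt` instead.

```javascript
var ch = '\u00e9';                    // "é"
var code = ch.charCodeAt(0);          // 233, the UTF-16 code unit as a number
var hex = code.toString(16);          // "e9"
var padded = hex.padStart(4, '0');    // "00e9"

console.log('&#' + code + ';');       // decimal reference
console.log('&#x' + hex + ';');       // hexadecimal reference
console.log('\\u' + padded);          // JavaScript escape form
```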

NickMNS

12:11 am on Nov 6, 2021 (gmt 0)




You can try this.
[npmjs.com...]

I haven't used it myself in JS, but I have used a similar package for Python.

csdude55

1:08 am on Nov 6, 2021 (gmt 0)




It looks like that requires Node.js... I can install it, but it seems like overkill for such a relatively small thing.

I've created a list of all &#0; through &#9999;, and it looks like I have about 1000 of them that could easily replace another letter. So I COULD create an associative array for all of them, but that'll be around 16kb.

csdude55

6:24 am on Nov 6, 2021 (gmt 0)




As I'm going down this path, I think it's the only practical way to go. I'm seeing a lot of symbols that are totally subjective, and there's no WAY a program could figure them out. For example, &#3310; (&#3310;) looks like ES?

I also have tons of symbols that could pass for multiple characters; eg, &#3179; (&#3179;) could pass for an X or a K.

I've currently sorted through 3400 of them, though, and have 1162 of them saved for the array. I'm already at 23kb, and I'm only 1/3 of the way through! At this rate the array will realistically be closer to 100kb.

Anyway.

If I continue down this path, would you suggest that I save the symbols in the array, or the decimal references? Something like this (not tested, just typed for this post):

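var entities = {
  'à': 'a',
  'À': 'A',
  // and so on
};

// look up each non-ASCII character; leave it alone if it isn't in the map
var str = $('#contenteditable').html()
  .replace(/[\u0080-\u024F]/g, function (ch) { return entities[ch] || ch; });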


or this:

var entities = {
  '224': 'a',
  '192': 'A',
  // and so on
};

String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g,
    // fall back to the original character when there is no entry,
    // otherwise the string "undefined" would be inserted
    function (v) { return entities[v.charCodeAt()] || v; }
  );
};

var str = $('#contenteditable').html().encodeHTML();


or something totally different?

Since I do have a handful of decimal references in my database, I'm leaning towards the second one...


[edited by: not2easy at 5:44 pm (utc) on Nov 6, 2021]
[edit reason] disabled smileys [/edit]

csdude55

6:32 am on Nov 9, 2021 (gmt 0)




For the sake of posterity, here's the final script that I created. This was built to work in conjunction with the Profanity Filter posted in the Perl forum, preventing users from using similar-looking characters to sneak past the filter.

var str = 'thïš ïš ä prétty<b></b> thöröûgh štrïng';

// Step 1, convert anything not Basic Latin to decimal reference
// 0B7F is the highest one I could find listed anywhere, although
// it really goes much further
String.prototype.encodeHTML = function () {
  return this.replace(/[\u00A2-\u0B7F]/g,
    function (v) {
      return '&#' + v.charCodeAt() + ';';
    }
  );
};

// run the encodeHTML function
str = str.encodeHTML();

// Next, create a list of entities that look like Basic Latin characters
// my full list is 950 lines long, I can post it if anyone wants it
// I included decimal references for basic Latin, in case someone
// manually types &#65; in an attempt to sneak by
var entities = {
  // &#65; => "A"
  65 : 'A',
  // ... (remaining entries omitted)
  1514 : 'n'
};

// Next, convert any decimal references in str to the object value
// defined in "entities"
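// the g flag makes this cover every reference, not just the first one;
// references without an entry are left as they are
str = str.replace(/&#(\d+);/g, function (ref, num) {
  return entities[num] !== undefined ? entities[num] : ref;
});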


If you're trying to save space, I saved a small amount by renaming "entities" to something like "e" and chaining the assignments:

// numeric keys need bracket notation; e.65 is a syntax error
var e = {};
e[65] =
e[192] =
e[193] = 'A';


but I decided to keep mine in the original format for readability.
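Another way to shrink the table while keeping it readable, sketched here under my own assumptions: store each plain replacement once, followed by a string of its look-alikes, and expand that into the code-point lookup at load time. The names `groups` and `buildEntities` are invented for this example, and the three groups shown are samples, not the full list.

```javascript
// Hypothetical grouped table: each key is the plain replacement, each value
// lists the look-alike characters that should collapse to it.
var groups = {
  A: '\u00C0\u00C1\u00C2',   // À Á Â
  a: '\u00E0\u00E1\u00E2',   // à á â
  e: '\u00E8\u00E9\u00EA'    // è é ê
};

// Expand the groups into a code-point lookup: { 192: 'A', 193: 'A', ... }
function buildEntities(groups) {
  var entities = {};
  Object.keys(groups).forEach(function (plain) {
    Array.from(groups[plain]).forEach(function (ch) {
      entities[ch.codePointAt(0)] = plain;
    });
  });
  return entities;
}

var entities = buildEntities(groups);
console.log(entities[192], entities[233]); // A e
```

Because the key is the replacement string, a symbol that should become more than one letter (the "ES" case) fits the same structure.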

csdude55

6:46 pm on Nov 9, 2021 (gmt 0)




A moderately faster modification:

var str = 'thïš ïš ä prétty<b></b> thöröûgh štrïng';

var entities = {
  65 : 'A',
  // ... (remaining entries omitted)
  1514 : 'n'
};

String.prototype.encodeHTML = function () {
  return this.replace(/[\u00A2-\u0B7F]/g,
    function (v) {
      return entities[v.charCodeAt()] || v;
    }
  );
};

str = str.encodeHTML();


This variation skips the decimal-reference step. Instead, it checks each matched character against "entities" and replaces it if there's an entry; characters without an entry are left unchanged.
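One thing the single-pass version gives up is the check for manually typed references such as &#65;, since the regex only matches raw characters. A possible way to keep both in one pass is an alternation in the pattern. This is only a sketch: `toPlain` is a made-up name, and the table holds a few sample entries taken from the test string rather than the full list.

```javascript
// Sample entries only; the full table from the post is not reproduced here.
var entities = { 65: 'A', 228: 'a', 233: 'e', 239: 'i', 246: 'o', 251: 'u', 353: 's' };

// Hypothetical helper: one pass that handles both typed decimal references
// (&#233;) and raw characters in the \u00A2-\u0B7F range.
function toPlain(str) {
  return str.replace(/&#(\d+);|[\u00A2-\u0B7F]/g, function (match, num) {
    // num is set only when the decimal-reference branch matched
    var code = num !== undefined ? Number(num) : match.charCodeAt(0);
    var plain = entities[code];
    return plain !== undefined ? plain : match; // unknown input is left alone
  });
}

console.log(toPlain('thïš ïš &#65; pr&#233;tty štrïng'));
// this is A pretty string
```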

lucy24

9:48 pm on Nov 9, 2021 (gmt 0)




Be sure to let us know when the whole thing goes live. I want to start a stopwatch and see how long it takes someone to bypass the filters :)

csdude55

10:49 pm on Nov 9, 2021 (gmt 0)




Right! LOL I already had a much shorter version of this in Perl, though, so I THINK that I'm already past the "let's see what we can break" phase. Now it's just being proactive by expanding it to other characters, and moving it to JavaScript to take the load off of the server.

csdude55

6:24 am on Nov 10, 2021 (gmt 0)




Expanding on the final script, do you think there would be value to adding a condition to the beginning?

var str = 'thïš ïš ä prétty<b></b> thöröûgh štrïng';

if (/[\u00A2-\u0B7F]/.test(str)) {
  var entities = {
    65 : 'A',
    // ... (remaining entries omitted)
    1514 : 'n'
  };

  // yada yada yada

  str = str.encodeHTML();
}


The script runs onPaste. It's part of a 33kb .js file that is loaded when they open the page, so it's already in the cache by the time they get to the point where they're going to paste something. My thought is that if they're pasting a large article, I don't want it to take forever to show up unnecessarily.

Unless we're talking like "saving 1/2 a second" or something. Since this loads after ads and everything, I'm not as concerned with speed as I am with PHP and Perl scripts. Saving 5 seconds would matter, but 1/2 second, not so much.
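For reference, this is roughly how the early exit could sit inside a paste handler. It is a sketch under several assumptions: the element id "contenteditable" comes from the first post, `cleanPastedText` is a made-up name, the table holds sample entries only, and the handler inserts the clipboard contents as plain text (so any pasted HTML formatting is dropped).

```javascript
// Sample entries only; a real table would be much longer.
var entities = { 233: 'e', 239: 'i', 353: 's' };
var NON_BASIC = /[\u00A2-\u0B7F]/; // no g flag, so test() keeps no state

// Hypothetical helper: skip the replace entirely when nothing in range is present.
function cleanPastedText(text) {
  if (!NON_BASIC.test(text)) return text; // fast path for ordinary pastes
  return text.replace(/[\u00A2-\u0B7F]/g, function (ch) {
    return entities[ch.charCodeAt(0)] || ch;
  });
}

// Browser-only wiring, assuming an element with id "contenteditable".
if (typeof document !== 'undefined') {
  var box = document.getElementById('contenteditable');
  if (box) {
    box.addEventListener('paste', function (event) {
      event.preventDefault();
      var text = event.clipboardData.getData('text/plain');
      var sel = window.getSelection();
      if (!sel.rangeCount) return;
      var range = sel.getRangeAt(0);
      range.deleteContents();
      range.insertNode(document.createTextNode(cleanPastedText(text)));
      range.collapse(false); // leave the caret after the inserted text
    });
  }
}
```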

lucy24

6:21 pm on Nov 10, 2021 (gmt 0)




[\u00A2-\u0B7F]
Meaning, “If test string contains characters in this range, do extra stuff, otherwise proceed to Step 3(c)”? That’s probably a question for benchmark testing.

csdude55

6:30 pm on Nov 10, 2021 (gmt 0)




Your synopsis is correct :-) I don't know how to bench test JavaScript, though! I can bench server-side scripts, but I haven't found a way to do JS.

And I'm guessing that any tests on it would be completely dependent on the browser and user's computer, so I don't know how reliable it could actually be.

My concern is the same that it was when I had this in Perl; 99,999 out of 100,000 submissions are good, so I hate to slow down all of them just because of that 1.
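A rough way to time this in the browser console is `performance.now()`, which returns a millisecond timestamp. The sketch below compares "always replace" against "test first" on a long paste that contains nothing to convert, which is the common case described above. The function names and the sample table are assumptions for illustration, and the numbers will vary by browser and machine, so they are only useful for comparing the two variants on the same setup.

```javascript
var entities = { 233: 'e', 239: 'i', 353: 's' }; // sample entries only
var RANGE = /[\u00A2-\u0B7F]/;

function convert(str) {
  return str.replace(/[\u00A2-\u0B7F]/g, function (ch) {
    return entities[ch.charCodeAt(0)] || ch;
  });
}

function convertWithCheck(str) {
  return RANGE.test(str) ? convert(str) : str;
}

// Hypothetical timing helper: run fn repeatedly and return elapsed milliseconds.
function timeIt(fn, input, runs) {
  var start = performance.now();
  for (var i = 0; i < runs; i++) fn(input);
  return performance.now() - start;
}

// A long paste with nothing to convert.
var plain = 'a perfectly ordinary pasted article. '.repeat(2000);
console.log('always replace:', timeIt(convert, plain, 100).toFixed(1), 'ms');
console.log('test first:    ', timeIt(convertWithCheck, plain, 100).toFixed(1), 'ms');
```

Repeating each call many times and comparing totals smooths out timer resolution, which browsers deliberately coarsen.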