Perl's tr/// operator

         

csdude55

6:54 pm on Nov 3, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know that I've EVER used this, it's one of those things I learned about but never had a need for it!

And now that I can actually use it... it doesn't work right! LOL

Here's what I have:

tr/äéïöûš/aeious/;


The goal is to convert:

ä => a
é => e
ï => i
ö => o
û => u
š => s

But it doesn't do what I'm expecting; instead, I get this:

$_ = 'thïš ïš ä prétty thöröûgh štrïng';

tr/äéïöûš/aeious/;

print;
# Result:
# thasss asss ae praotty thasrasasgh sstrasng


Am I using it wrong, or does Perl just not recognize umlauts as a single character?

I found that if I use "use utf8;" then it works as expected, but I kinda hate to import an unnecessary module just for this; I'm really just making sure that a user isn't trying to get around filters.

I know that I could use a series of s///g, like:

s/ä/a/g;
s/é/e/g;
s/ï/i/g;
s/ö/o/g;
s/û/u/g;
s/š/s/g;


but that, too, feels like overkill when I'm focusing on speed and performance.

Any other suggestions?

lucy24

8:16 pm on Nov 3, 2021 (gmt 0)




This looks totally entrancing so I will come back to it.

Meanwhile, I note that if you reinterpret (not convert) “thïš ïš ä prétty thöröûgh štrïng” from UTF-8 to Latin-1, it turns into

thÃ¯Å¡ Ã¯Å¡ Ã¤ prÃ©tty thÃ¶rÃ¶Ã»gh Å¡trÃ¯ng

I can’t investigate further, but this leads to the suspicion that there is already some behind-the-scenes transformation going on. It especially intrigues me that ä is read as “ae” while almost everything else becomes “as” or “ss”.
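The "ae" versus "as"/"ss" pattern is what you would expect if tr/// is working on bytes rather than characters. A minimal sketch, assuming the script is saved as UTF-8 and "use utf8" is absent, so each accented letter is two bytes. The bytes are written as \x escapes here (my choice for illustration, not the original code) so the result does not depend on how the file is saved:

```perl
use strict;
use warnings;

# "thïš ïš ä prétty thöröûgh štrïng" as raw UTF-8 bytes (no "use utf8").
my $bytes = "th\xC3\xAF\xC5\xA1 \xC3\xAF\xC5\xA1 \xC3\xA4 pr\xC3\xA9tty "
          . "th\xC3\xB6r\xC3\xB6\xC3\xBBgh \xC5\xA1tr\xC3\xAFng";

# The search list äéïöûš is 12 bytes: C3 A4 C3 A9 C3 AF C3 B6 C3 BB C5 A1.
# tr/// pairs them with "aeious" one byte at a time, ignores the repeats
# of C3, and reuses the final "s" once the replacement list runs out:
#   C3 => a, A4 => e, A9 => o, AF/B6/BB/C5/A1 => s
(my $result = $bytes)
    =~ tr/\xC3\xA4\xC3\xA9\xC3\xAF\xC3\xB6\xC3\xBB\xC5\xA1/aeious/;

print "$result\n";   # thasss asss ae praotty thasrasasgh sstrasng
```

So ä (C3 A4) becomes "ae", ï (C3 AF) becomes "as", and š (C5 A1) becomes "ss", which matches the output in the first post.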

Isn’t there a “remove diacritics” option lurking somewhere?

csdude55

10:55 pm on Nov 3, 2021 (gmt 0)




Isn’t there a “remove diacritics” option lurking somewhere?

With Perl, the answer to that type of question is pretty much always "there's a module for that!" LOL

This would probably be the best to do that:

[metacpan.org...]

and there's also this one (that's more widely distributed):

[metacpan.org...]

But since my focus lately has been on speed and performance, I try to keep modules to a minimum. Especially in a case like this, where I need it maybe 1 time out of 10,000.

Bench testing; using six s///g; over 1000 iterations took 0.002947s, and using utf8 with tr///; took 0.000946s (excluding the time it took to load utf8, since there's no real way to bench test that).
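A comparison along these lines can be reproduced with the core Benchmark module. This is a rough sketch, not the test that produced the numbers above: the sub names with_subst and with_tr and the iteration count are placeholders, and timings will differ by machine. It assumes the file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                          # the literals below are characters, not bytes
use Benchmark qw(timethese);

my $sample = 'thïš ïš ä prétty thöröûgh štrïng';

# Six separate substitutions, each one a pass over the string.
sub with_subst {
    my ($s) = @_;
    $s =~ s/ä/a/g;
    $s =~ s/é/e/g;
    $s =~ s/ï/i/g;
    $s =~ s/ö/o/g;
    $s =~ s/û/u/g;
    $s =~ s/š/s/g;
    return $s;
}

# One transliteration, a single pass.
sub with_tr {
    my ($s) = @_;
    $s =~ tr/äéïöûš/aeious/;
    return $s;
}

# Iteration count is arbitrary; raise it if Benchmark warns about too few.
timethese(100_000, {
    'six s///g' => sub { with_subst($sample) },
    'tr///'     => sub { with_tr($sample) },
});
```

Note that "use utf8" is a pragma handled when the script is compiled, so there is no per-call cost to measure for it.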

Knowing that it's going to interpret the diacritic (new word of the day, thanks!) as multiple characters, though, I guess there aren't any other options. I dunno, right now it looks like use utf8; and tr///; are the way to go, unless something better comes along.

lucy24

11:51 pm on Nov 3, 2021 (gmt 0)




Now, if you don’t “use utf8” what does it use? Does everything get stored as Latin-1? If so, what happens with diacritics that aren’t in the Latin-1 character set (or its superset, Windows-Latin-1 or 1252)?

csdude55

12:08 am on Nov 4, 2021 (gmt 0)




If I just print with or without "use utf8", it just shows up with the diacritics. I'm guessing that the default display is cp1252 West European (latin1)?

I don't entirely understand the process, but just declaring "use utf8" makes tr/// work properly here. So I don't have to call a function or anything special, but what it displays doesn't appear to be utf8, either.
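One way to read this: "use utf8" only tells Perl that the script's own source (including the literals inside tr///) is UTF-8. It does not touch data that arrives from outside, which still has to be decoded. A hedged sketch, assuming the submitted text arrives as UTF-8 bytes and the file is saved as UTF-8; $raw is a made-up stand-in for that input, and Encode is a core module:

```perl
use strict;
use warnings;
use utf8;                            # source literals are characters
use Encode qw(decode encode);

# Hypothetical stand-in for user-submitted UTF-8 bytes: "thïš ïš ä prétty"
my $raw = "th\xC3\xAF\xC5\xA1 \xC3\xAF\xC5\xA1 \xC3\xA4 pr\xC3\xA9tty";

my $text = decode('UTF-8', $raw);    # bytes -> characters
$text =~ tr/äéïöûš/aeious/;          # now one character maps to one letter
my $out = encode('UTF-8', $text);    # characters -> bytes for output

print "$out\n";                      # this is a pretty
```

Printing a character string without encoding it back writes Latin-1 bytes for anything below U+0100, which may be why the output did not look like UTF-8.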

lucy24

5:35 pm on Nov 4, 2021 (gmt 0)




what it displays doesn't appear to be utf8, either

?
All visible characters exist somewhere in utf-8. Did you mean that things are getting unpacked or reinterpreted? (As when the text is suddenly littered with Å and Ã.) Or that some characters disappear?

csdude55

6:34 pm on Nov 4, 2021 (gmt 0)




Ummm... I have no idea what I was trying to say there :-O LOL

I'm not really working with "stored" data, though, I'm taking what's user-submitted and processing it. They post through a contenteditable area, so if they copy from Word or something then it comes through exactly how they copy it.

Does the charset of the HTML page matter? My current site uses:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

The rebuild uses:

<meta charset="UTF-8">

Since "use utf8" seems to modify the text automatically without running it through a function it makes me nervous to use, but I guess it doesn't really matter since the display is going to be forced to be UTF-8? MySQL stores data as "cp1252 West European (latin1)", but PHP sets it to UTF-8 again:

mysqli_set_charset($dbh, 'utf8');

I don't THINK that I can change it in MySQL without upsetting the last 20 years of stored data?

Either way. Since I'm trying to be proactive instead of reactive, I realize that I have a LONG list of potential diacritic characters to consider (about 350?). Maybe I really should back up and consider using Unicode::Diacritic::Strip?

[metacpan.org...]

It relies on Unicode::UCD for the list of diacritics so I KNOW it's going to be slow! But it'll be a lot more accurate than me trying to duplicate it manually.
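A lighter alternative to a hand-built list, not something tested in this thread: decompose each character with the core Unicode::Normalize module and delete the combining marks. strip_marks is a hypothetical helper name, and the input is assumed to be an already-decoded character string (written with \x{...} escapes here so the file encoding does not matter).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split letters from their accents, then drop the accents.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);      # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}+//g;      # remove nonspacing (combining) marks
    return $decomposed;
}

# "thïš ïš ä prétty thöröûgh štrïng" as characters
my $input = "th\x{EF}\x{161} \x{EF}\x{161} \x{E4} pr\x{E9}tty "
          . "th\x{F6}r\x{F6}\x{FB}gh \x{161}tr\x{EF}ng";

my $clean = strip_marks($input);
print "$clean\n";                     # this is a pretty thorough string
```

Letters with no decomposition, such as ø, ð, and þ, pass through unchanged, so they would still need their own mapping.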

lucy24

9:25 pm on Nov 4, 2021 (gmt 0)




They post through a contenteditable area, so if they copy from Word or something then it comes through exactly how they copy it.
This may actually depend on their operating system. But I suppose they’d notice if something in their input area didn’t match what they pasted in. (Long ago, I met a website whose text input only worked correctly if you intentionally told them to use a different language than the one you’re actually in. I have mercifully blocked the details.)

MySQL stores data as "cp1252 West European (latin1)", but PHP sets it to UTF-8 again
Oh, criminy. If that’s the setup, no solution will be perfect. There are at least three stages where text is moved from Point A to Point B (for example, from the input window to the place where it gets screened, and from there to the database, and from there to the visible html, and I’ve probably missed a few). In each of those stages, non-ASCII text has to be stored as some kind of numerical entity *, and then the next stage is faced with a numerical entity that may have a different meaning.

For example:
original text contains the character é (e-acute).

if it is stored in Latin-1 (either 8859-1 or 1252) it becomes E9
(we will not talk about what happens if it is stored in some other one-byte encoding: Mac OS Roman, for example, stores it as 8E, which is a forbidden character)
If it is stored in UTF-8 it becomes C3A9

If that E9 is opened in something that expects UTF-8, it will either disappear or it will merge with the following one or two letters, depending on what they are, because E9 by itself has no meaning.
If, contrariwise, that C3A9 is opened in something that expects Latin-1, it will be read as Ã©
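
The byte values in this example can be checked with the core Encode module. A small sketch; the variable names are just for illustration, and the "Broken" line shows Encode's default handling of malformed input, which is only one of the ways software can react to a stray E9:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $e_acute = "\x{E9}";                        # the character é (U+00E9)

my $latin1 = encode('ISO-8859-1', $e_acute);   # one byte
my $utf8   = encode('UTF-8',      $e_acute);   # two bytes

my $latin1_hex = uc unpack('H*', $latin1);
my $utf8_hex   = uc unpack('H*', $utf8);
print "Latin-1: $latin1_hex\n";                # Latin-1: E9
print "UTF-8:   $utf8_hex\n";                  # UTF-8:   C3A9

# Reading the UTF-8 bytes as Latin-1 yields two characters instead of one.
my $misread = decode('ISO-8859-1', $utf8);
my $codes   = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
print "Misread: $codes\n";                     # Misread: U+00C3 U+00A9

# A lone E9 is malformed UTF-8; by default Encode substitutes U+FFFD.
my $broken = decode('UTF-8', $latin1);
printf "Broken:  U+%04X\n", ord($broken);      # Broken:  U+FFFD
```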

I don’t think an existing database can be converted, as such. It would have to be downloaded, converted into a new encoding, and then re-uploaded. That’s the kind of thing you save for when the site is due for major revisions anyway--at which point you probably decide instead that it is not really necessary to preserve discussion threads from 2007 ;)

Oh yes and ... In Apache, contrary to ordinary usage, a charset declaration in the config file will override a charset declaration in an individual html document. Most of the time, this will not cause problems in modern browsers, but it's worth remembering.


* Yes, technically ASCII is also stored as numerical entities, but it doesn’t matter because those numbers are the same in all encodings. At least the ones that assume Roman script.

csdude55

4:02 am on Nov 5, 2021 (gmt 0)




Here's a weird little discovery... I have a backup table in MySQL that stores data before I apply any filters. Using phpMyAdmin I searched for:

select * from table where
comment like '%ä%' or
comment like '%é%' or
comment like '%ï%' or
comment like '%ö%' or
comment like '%û%' or
comment like '%š%'


This returns every row that contains a, e, i, o, u, or s, though! Not the diacritic version.

I know that I've had them come through in the past (which is why I started filtering them anyway), but does MySQL (or MariaDB) drop the accents before storage? Or does it simply drop the accents when I do a select query, making it impossible for me to find them? Or am I inadvertently doing something that's converting them before they ever even hit the Perl script, making this whole endeavor pointless?

In comparison, I DID find several references to &#[0-9]+;. I had 800 of them before April 2017, then they stopped until January 6, 2021... coincidentally, just days after I upgraded to a new server with MariaDB 10. Since then I've had 35 more, almost entirely from people copying a definition from another site.

So it looks like I ALSO need to use HTML::Entities to decode the entities to the corresponding Unicode character:

[metacpan.org...]

Sheesh.

lucy24

4:33 am on Nov 5, 2021 (gmt 0)




does MySQL (or MariaDB) drop the accents before storage?
It can’t, or they would no longer be there when you again pull up whatever is being stored in there. If you post to one of your forums using characters with diacritics, are the diacritics still there in the final displayed post?

What, exactly, does “like” mean? In English it sure sounds as if it means, in some sense, “like”: ä is like a, é is like e and so on.

HTML::Entities to decode the entities
I hope it works on all three forms:
decimal entity &#233;
hexadecimal entity &#xE9;
named entity &eacute;

csdude55

5:41 am on Nov 5, 2021 (gmt 0)




If you post to one of your forums using characters with diacritics, are the diacritics still there in the final displayed post?

I just now tested, using "î"... OK, it got stored in there exactly like that, so false alarm. More importantly, though...

What, exactly, does “like” mean?

Excellent point! If I use REGEXP, then it does work:

WHERE comment REGEXP '[äéïöûš]'

But when I tried to get a little more inclusive, it started finding regular letters again:

REGEXP '[àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸåÅçÇðÐøØéèêëçñøðå]'

So one or more of those are still matching regular letters :-(

I hope it works on all three forms:
decimal entity &#233;
hexadecimal entity &#xE9;
named entity &eacute;

It DOES appear to, but I have to install the module and find out for sure. It says:

use HTML::Entities;

$a = "V&aring;re norske tegn b&oslash;r &#230res";
decode_entities($a);
encode_entities($a, "\200-\377");

For example, this:

$input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
print encode_entities($input), "\n";

Prints this out:

vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
papier-m&acirc;ch&eacute; r&eacute;sum&eacute;


So I'm thinking:

use HTML::Entities;

$_ = "user input string";

# decode: named, decimal, and hexadecimal entities all become characters
decode_entities($_);

# encode again: non-ASCII characters come back as named entities
# wherever a name exists
encode_entities($_);

# there has to be a finite list of potential names, so then
# I could do something like this:
s/&(\w)(?:uml|acute|circ|grave|...);/$1/g;


But that would catch, for example, &nbsp; and convert it to "n", so if I go this route then I'm going to have to play around with it a bit.

lucy24

5:24 pm on Nov 5, 2021 (gmt 0)




$a = "V&aring;re norske tegn b&oslash;r &#230res";
A worthy sentiment, to be sure ;) but funny that the source elected &#230 rather than the equally valid &aelig; It raises a tangential issue, though: letters like å and ü may be used as camouflage to get past word censors--but in a different context they are distinct letters, to be unpacked as aa, ue and the like.

:: pawing through bookmarks ::

There’s a finite list of named entities [htmlhelp.com] (divided into three pages) though the list is quite a bit longer than I thought. I suppose they are intended for ancient websites that have to use ascii encoding; names like “oring” are easier to remember than a random string of numbers.

(\w)(?:uml|acute|circ|grave)
Yes, you’d definitely need to list them by name: uml circ grave acute tilde ... uh ... slash cedil. Fortunately the ¨ is always called an umlaut, even when it’s functionally a dieresis. (No language uses both, at least not with the same vowel.)
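
A sketch of what that explicit list could look like. The suffix list is an assumption built from the names above plus "ring" and "caron"; it is not claimed to be complete. [A-Za-z] is used instead of \w so digits and underscores never match:

```perl
use strict;
use warnings;

# Assumed, incomplete list of accent suffixes used in named entities.
my @suffixes = qw(uml acute circ grave tilde slash cedil ring caron);
my $alternatives = join '|', @suffixes;

# One base letter followed by a known accent suffix, e.g. &eacute; or &Ccedil;
my $accented_entity = qr/&([A-Za-z])(?:$alternatives);/;

my $html = 'na&iuml;ve r&eacute;sum&eacute;&nbsp;&amp; fa&ccedil;ade';
(my $plain = $html) =~ s/$accented_entity/$1/g;

print "$plain\n";   # naive resume&nbsp;&amp; facade
```

Because the suffix has to match a whole name, &nbsp; and &amp; are left alone. Ligatures and letters such as &aelig;, &eth;, and &thorn; don't fit the letter-plus-suffix pattern and would need a separate lookup table.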

And then there are the ligatures--æ and so on--and eth ð and thorn þ. I don’t want to spend too much time racking through my vocabulary, but I guess you could say “meaþead” to camouflage “meathead”, that kind of thing.

csdude55

6:39 pm on Nov 5, 2021 (gmt 0)




Gah, this is getting out of hand! LOL Luckily, my demographic is exclusively English, so I don't have to worry about legitimate uses for diacritics, etc. The only legit use would be when they're copying pronunciation keys, and I think everyone will survive if that's converted :-) LOL

Backing up a little bit, though, since my issue is with the user pasting these characters to a contenteditable, I realized that I can get at least halfway there using JavaScript. If I can get the rest of it in JS then that would be a lot better, since it wouldn't really affect the site performance in any noticeable way. I'll post over on the JavaScript forum, and maybe come back here if that doesn't lead anywhere.