Regex replacement of a weirdly encoded character - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster & phranque

Message Too Old, No Replies

Regex replacement of a weirdly encoded character

csdude55

7:44 am on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Every once in awhile I see where someone posts a weird character that doesn't encode correctly, like:

# this should say "example"
тАЬexampleтАЭ

So I have a few regex in place:

s#тАв#Х#g;
s#тАж#...#g;
s#тА[УФ]#-#g;
s#(Т|тА[ЩШ])#'#g;
s#(У|Ф|тАЬ?)#"#g;

That last one is the one in specific question. Today, someone posted the above тАЬexampleтАЭ, and while both of the characters were properly replaced with the ", it also had a "unknown" at the end; in Chrome it appeared as a diamond-shape with a ? in the middle.

It's worth noting that this was the last character in the string, and when I opened it in MySQL and went to the end, then used the arrow key to move to the left, it appeared to have an invisible character at the end. I don't know how to explain it... sorta like how in the 90s we would use Ctrl+whatever to hide characters and make directories impossible to enter? I would arrow to the left and nothing appeared to happen, then arrow again and it would move over properly.

I copied the exact text to Notepad and saw the same thing, as if an invisible character was there. Then I copied it to Notepad++ (and had the same invisible character) and set it to "Show All Characters", but nothing appeared there.

So I'm kind of at a loss here. Am I overlooking something in the regex, or did replacing it at the end of the string inadvertently leave something behind?

csdude55

7:48 am on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Update: I just found that I could copy and paste that invisible character to the SQL field of phpMyAdmin, and it appeared as a bold red dot. Maybe 4 times larger than a decimal.

Dimitri

11:39 am on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

Top Contributors Of The Month

This is not your question, but it looks like a problem of charset (character encoding). You have the charset of the page where the form is, the charset used by your programming language (PHP, Perl, node.js, etc...) and, you have the charset used by your SQL database. You need to ensure every elements of the chain is using the same charset (today, the most common would be UTF8).

This can be a mess I know.

not2easy

1:57 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

What you see is not always what you get. Dimitri's suggestion is spot on because mixed character sets may not show you what is actually causing you to see what you are seeing. Correcting for what you can see may not do what you expect it to do.

Dimitri

4:54 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

Top Contributors Of The Month

I forget to mention that you have also the charset of the output too ! You might get the right string everywhere, but if you output the string in a different charset, then you'll get odd results.

not2easy

5:19 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

I don't speak php fluently but I am pretty sure there is a way to convert the charset of posts to comply with your own charset settings. One script I use offers a simple checkbox to require that imported text be converted to UTF8 prior to being posted to sql tables so I hope it is a simple 'extra line of php' fix and not some elaborate proprietary parsing process that does that.

In other words, I do not know the answer, but I know it exists.

Dimitri

5:22 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

Top Contributors Of The Month

I think @csdude55 uses Perl (considering the category in which he made his post) , I just realized it :)

As for PHP's users:
[php.net...]

csdude55

6:16 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

This is true, but I honestly can't figure out how to fix it. As an example, I have a form with a simple textarea; someone on their mobile device types up what appears to be a simple description that includes a single or double quote, and their phone plugs in the curly quote instead.

(I checked, it mostly occurs with mobile; desktop users only have it when they're copying something from another website that doesn't use UTF-8)

So they type the curly quote in the textarea, then click submit. On my end, it goes through a Perl script that eventually adds it to MySQL (which shows "Server charset: UTF-8 Unicode (utf8)"). And then MySQL has it stored with characters like тА.

Then on the user's end, the page loads and shows them the quote they used, so they think it's all good. But on desktop, the user just sees a diamond or a square block for the unrecognized character.

Other than the series of regex to replace those characters, I don't know how to change the charset in Perl. I found this from about 10 years ago, but I haven't tested it and don't know if it's still applicable:

use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
       "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
       "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
       "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030

use Encode;
my $string = decode($encoding_name, $unknown);

If that's still a working solution, is it better than using a handful of regex to fix them as they show up?

For an immediate solution, is there any reason why I shouldn't just add another regex to the end of the list, like:

# the invisible character is in the first section, simply copied and pasted
s#Э##g;

iamlost

11:18 pm on Jul 16, 2019 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You mention MySQL so I'll remind of the historic problem of [ utf8 ], the MySQL encoding variation that is actually a limited subset version of real UTF-8; it did not recognise characters above 0xFFFD. So long as you are using MySQL 5.5.3 or later (I certainly hope so as current stable is 8.0.16) one should reference [ utf8mb4 ] encoding instead.

Just something to confirm, would be a simple solution.