
Forum Moderators: coopster & jatar k & phranque


A strange UTF-8 problem

Perl written CMS

     
9:47 pm on Aug 21, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member jetteroheller is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 22, 2005
posts: 3062
votes: 6


My CMS is written by me in Perl and serves pages on localhost.
I am currently converting everything to UTF-8.

It's on Ubuntu 16.04 LTS with the preinstalled Perl.
The browser is Chrome, Version 60.0.3112.90 (Official Build) (64-bit)

When my Perl server on localhost returns HTML files to Chrome, everything works fine.
The head contains
<meta http-equiv="content-type" content="text/html; charset=utf-8">
and the web site is shown correctly in UTF-8.

But when localhost delivers pages to interact with the CMS,
the head also contains
<meta http-equiv="content-type" content="text/html; charset=utf-8">
but it does not work.

All UTF-8 encoded characters are shown as two garbage characters each.
All my entries in text areas and text input fields are sent back ISO-8859-1 encoded.

On the page delivered by my CMS, I used the developer tools' JavaScript console:
document.characterSet
"UTF-8"

But although this says the page is UTF-8, all the German special characters are shown as two strange characters.

Any idea about the problem?
9:57 pm on Aug 21, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator lifeinasia is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 10, 2005
posts:5852
votes: 199


All my entries in text areas and text input fields are sent back ISO-8859-1 encoded.
Not sure if it will work in your case, but you can try:
<form accept-charset="UTF-8">

You should also look at your database settings. The database, table, and/or field settings may not be set to UTF-8.
11:01 pm on Aug 21, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member jetteroheller is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 22, 2005
posts: 3062
votes: 6


I also tried enclosing everything in <form accept-charset="UTF-8">; that did not work either.

I do not use a real database, I simply use a text file.
This was converted with "recode" and checked with the Bless hex editor.

The input and textarea fields are also not sent via a <form>.
The communication is done with
var xhttp=new XMLHttpRequest();

The Perl script answers the XMLHttpRequest by sending back a text.
This text contains JavaScript.

sub js_innerHTML
{
my ( $id, $html )= @_;
my $text= uri_escape ( $html );
$wm2::js.= "\nvar uesc=unescape('$text');\ntry{document.getElementById('$id').innerHTML=uesc}\n";
$wm2::js.= "catch(e){alert('js_innerHTML err:'+e)}\n";
}

This is the central sub that creates the answer.
$wm2::js is sent back to the browser.
The browser runs eval() on the text:

var js=xhttp.responseText;
try{eval(js)}
catch(e){alert(unescape('answer_from_server%0atry js catch%0a'+e));alert(js)}
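A minimal console sketch of where this protocol breaks, assuming the textarea contains a German umlaut: Perl's uri_escape() percent-encodes the UTF-8 bytes of "ö" (0xC3 0xB6), but JavaScript's unescape() decodes each %XX pair as a Latin-1 code point, producing two garbage characters instead of one umlaut. decodeURIComponent() decodes the same pairs as UTF-8:

```javascript
// What uri_escape() on the Perl side effectively sends for "ö"
// (its UTF-8 bytes 0xC3 0xB6, percent-encoded):
const escaped = "%C3%B6";

// unescape() treats each %XX as a Latin-1 code point -> two garbage chars:
const viaUnescape = unescape(escaped);            // "Ã¶"

// decodeURIComponent() decodes the %XX pairs as UTF-8 -> the original "ö":
const viaDecodeURI = decodeURIComponent(escaped); // "ö"

console.log(viaUnescape, viaDecodeURI);
```

This is exactly the "2 char garbage" symptom: one umlaut in, two Latin-1 characters out.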
12:05 am on Aug 22, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15928
votes: 884


The horrible thing about encoding in www pages is that a global setting will override local settings. Look at your response headers and verify that they say UTF-8. If they don't, you are in luck, because this is likely to be an easier fix than checking every step of the in-and-out-of-database process.
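One quick way to do that check from the page itself, assuming the existing xhttp object (the parseCharset helper below is illustrative, not part of any library):

```javascript
// Small helper to pull the charset parameter out of a Content-Type
// header value, e.g. "text/html; charset=utf-8" -> "utf-8":
function parseCharset(contentType) {
  const m = /charset=([^;\s]+)/i.exec(contentType || "");
  return m ? m[1].toLowerCase() : null;
}

// Inside the onload/onreadystatechange handler of the CMS's xhttp object,
// getResponseHeader() is the standard XMLHttpRequest call to read a header:
//   console.log(parseCharset(xhttp.getResponseHeader("Content-Type")));

console.log(parseCharset("text/html; charset=utf-8")); // "utf-8"
console.log(parseCharset("text/html"));                // null
```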

I do not use a special database, I use a simple text file.
This is converted by "recode" and checked by "bless hex editor"
Converted to what? 8859-1, ASCII or something else?

But despite this message, that's utf-8 all the German special characters are shown as 2 strange chars
Can you give a few examples? We need to make sure things are only getting mangled once. Also check what happens with œ and curly quotes; these specific characters are often useful as diagnostics.
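The "mangled once vs. mangled twice" distinction can be sketched in Node.js (not browser code; Buffer is a Node API) by re-reading UTF-8 bytes as Latin-1:

```javascript
// "ö" encoded as UTF-8 is the two bytes 0xC3 0xB6:
const utf8Bytes = Buffer.from("ö", "utf8");

// Misreading those bytes as ISO-8859-1/Latin-1 gives two characters:
const mangledOnce = utf8Bytes.toString("latin1");   // "Ã¶"

// If that result is encoded and misread again, one umlaut becomes four chars:
const mangledTwice = Buffer.from(mangledOnce, "utf8").toString("latin1");

console.log(mangledOnce, mangledOnce.length);   // 2 characters
console.log(mangledTwice, mangledTwice.length); // 4 characters
```

Counting the garbage characters per umlaut tells you how many times the text has been through the wrong decoder.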
5:31 am on Aug 22, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member jetteroheller is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 22, 2005
posts: 3062
votes: 6


My CMS has all my web sites in a folder 'internet'.
All the data for a web site are in /internet/my-domain.com/admin/source/source.txt

The conversion was:
sudo apt install recode
cd into the domain folder to be changed
recode -v iso-8859-1..utf8 source.txt

In the textarea it shows as ö
When I enter "" in the textarea, it is sent back to the Perl software ISO-8859-1 encoded.

I also tried to declare UTF-8 in the HTTP response header:

print $wm2::client "HTTP/1.0 200 OK", Socket::CRLF;
print $wm2::client "Content-type: text/html; charset=utf-8", Socket::CRLF;
print $wm2::client "Access-Control-Allow-Origin: null", Socket::CRLF;
print $wm2::client Socket::CRLF;
6:06 pm on Aug 23, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member jetteroheller is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 22, 2005
posts: 3062
votes: 6


I just found out a major part of the problem:
Input fields return ISO-8859-1 encoded text instead of UTF-8.

I posted this in
[webmasterworld.com...]
6:23 am on Aug 24, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member jetteroheller is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 22, 2005
posts: 3062
votes: 6


The problem was caused by the method used to send and receive data to and from the browser.

Lines starting with # show the old version.

sub my_encode
{
my ( $text ) = @_;
$text=~ s/\%/\%25/g;
$text=~ s/\n/\%0D/g;
$text=~ s/\"/\%22/g;
return $text;
}

I found that I only have to encode the characters ", % and newline in the data sent from server to browser.

sub js_innerHTML
{
my ( $id, $html )= @_;
#my $text= uri_escape ( $html );
my $text= my_encode ( $html );

#$wm2::js.= "\nvar uesc=unescape('$text');\ntry{document.getElementById('$id').innerHTML=uesc}\n";
$wm2::js.= "\ntry{document.getElementById('$id').innerHTML=my_decode(\"$text\")}\n";
$wm2::js.= "catch(e){alert('js_innerHTML err:'+e)}\n";
}
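The browser-side my_decode() is not shown in the post; a plausible counterpart to the Perl my_encode() above (the function body below is my assumption, only its name appears in the original) would reverse the three substitutions, decoding %25 last so a literal percent sign is never decoded twice:

```javascript
// Hypothetical browser-side counterpart to the Perl my_encode() above.
// Order matters: "%25" must be decoded last, otherwise a decoded "%"
// followed by digits could be mistaken for another escape sequence.
function my_decode(text) {
  return text
    .replace(/%0D/g, "\n")   // my_encode maps newlines to %0D
    .replace(/%22/g, '"')    // double quotes
    .replace(/%25/g, "%");   // literal percent signs, decoded last
}

console.log(my_decode('100%25%0D%22ok%22')); // 100%, newline, "ok"
```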

For the communication from browser to server, I replaced escape() with encodeURIComponent() on the browser side.

Searching with different queries for "escape unescape ISO-8859-1 UTF8" gave me no hint that these old JavaScript functions cause severe trouble for UTF-8 encoded web sites.
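The difference is easy to demonstrate in the console: escape() encodes code points below 256 as single Latin-1 bytes, while encodeURIComponent() encodes the UTF-8 bytes, which is what a UTF-8 server expects.

```javascript
// escape() emits the Latin-1 code point of "ö" (U+00F6) as one byte,
// which a UTF-8 server side cannot decode back to an umlaut:
console.log(escape("ö"));              // "%F6"

// encodeURIComponent() emits the two UTF-8 bytes of "ö":
console.log(encodeURIComponent("ö"));  // "%C3%B6"
```

escape()/unescape() are deprecated precisely because of this Latin-1 behavior; encodeURIComponent()/decodeURIComponent() are the standard replacements.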