Forum Moderators: coopster

Multi-lingual character encoding in PHP forms

Problem submitting non-English text

         

Brownie

4:43 pm on Sep 22, 2005 (gmt 0)

10+ Year Member



I have a PHP form which submits information to an email. Everything works fine when dealing with 'normal' text. However, if a user submits "München", the resulting text displays as "München".

The website is in 9 languages (EN, FR, DE, ES, IT, RU, JP, CN, & Korean). Page encoding is unicode...
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
...on all but JP, CN & KR.

Clearly it is an encoding problem, but I have looked around and cannot really see where to start to fix the problem. My PHP form submits to an external PHP file, which in turn checks configuration information on an external text file.

If I cannot get this to work with the European languages then there is no hope for the Asian forms.

StupidScript

4:57 pm on Sep 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you tried converting the characters into something more stable? The Asian characters are undoubtedly going to be your biggest challenge, but the example you gave could be solved by using
$inputB=htmlentities($input, ENT_QUOTES, 'UTF-8');
which would result in
M&uuml;nchen
and could be brought back to normal with
$input=html_entity_decode($inputB, ENT_QUOTES, 'UTF-8');
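
A minimal sketch of the entity round trip suggested above, assuming the submitted text is UTF-8 (the variable names are only illustrative). Note that htmlspecialchars() on its own only escapes &, <, > and quotes, so htmlentities() is the function that actually rewrites the umlaut:

```php
<?php
// Assumption: the input is UTF-8, matching the page's declared charset.
$input = "M\xC3\xBCnchen"; // "München" written as explicit UTF-8 bytes

// htmlentities() replaces characters that have named HTML entities.
$encoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

// html_entity_decode() reverses the conversion.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() leaves the umlaut untouched.
$specialOnly = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n";
```

This only helps when the recipient renders HTML; in a plain-text email the entity would show up literally.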

Brownie

10:45 pm on Sep 22, 2005 (gmt 0)

10+ Year Member



Thanks for the info. To start with, I was trying to understand where the process was tripping up. Surely this must be a common problem? If someone can explain why the encoding trips up, then perhaps I can decide on the best long-term solution?

Brownie

9:09 am on Sep 23, 2005 (gmt 0)

10+ Year Member



My first post should have displayed an umlaut (&uuml;) in Munchen (as the correct text). The incorrectly encoded text is also actually displaying incorrectly on the forum! :)

StupidScript

5:02 pm on Sep 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm.

My Red Hat 7.3 Linux / PHP 4.3.2 box didn't do what your system did. Here's my sample code:

<form action="<?php echo htmlspecialchars($_SERVER["SCRIPT_NAME"], ENT_QUOTES); ?>" method="post">
<?php
if (!empty($_POST["field1"])) {
 // Escape the submitted value before echoing it back into the page.
 $field1=htmlspecialchars($_POST["field1"], ENT_QUOTES);
 echo "<input type=text name='field1' value='".$field1."' /> ".$field1."<br />\n";
 // Mail the raw (unescaped) value.
 $to="myemail@example.com";
 $subject="Testing oddballs";
 $message=$_POST["field1"];
 $headers="From: myadmin@example.com\r\n";
 mail($to,$subject,$message,$headers);
}
else {
 echo "<input type=text name='field1' />\n";
}
?>
</form>

When München is input into the field, the output remains München both in the field and in the HTML text next to it when the form is submitted, and in the email I received on my Win98 box using the Calypso mail client.

What kind of server and/or version of PHP are you running?
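
The test script above sends only a From header, so the mail client has to guess the charset of the body. A hedged sketch of declaring it explicitly follows. Both helper functions are hypothetical names, not part of the thread, and whether a missing charset is actually the cause of the original problem is an assumption:

```php
<?php
// Hypothetical helper: builds mail headers that declare a UTF-8 body.
function buildUtf8MailHeaders(string $from): string
{
    return implode("\r\n", [
        'From: ' . $from,
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ]);
}

// Hypothetical helper: wraps a short subject as an RFC 2047 encoded-word
// so that non-ASCII characters survive in the Subject header.
function encodeUtf8Subject(string $subject): string
{
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

$headers = buildUtf8MailHeaders('myadmin@example.com');
$subject = encodeUtf8Subject("Gr\xC3\xBC\xC3\x9Fe aus M\xC3\xBCnchen");

// Sending is left commented out so the sketch has no side effects:
// mail('myemail@example.com', $subject, $_POST['field1'], $headers);
```

Long subjects would need to be split into several encoded-words, which this sketch does not attempt.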

AlexK

8:05 pm on Sep 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Health warning: I have no personal experience of your problem. Also, please forgive me if I point out things which you know even better than I do!

PHP will only handle certain charset-encodings [php.net] internally:

Encodings of the following types are safely used with PHP:
  1. A singlebyte encoding:
    • which has ASCII-compatible (ISO646 compatible) mappings for the characters in range of 00h to 7fh.
  2. A multibyte encoding:
    • which has ASCII-compatible mappings for the characters in range of 00h to 7fh.
    • which don't use ISO2022 escape sequences.
    • which don't use a value from 00h to 7fh in any of the compounded bytes that represents a single character.
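
As an illustration of the multibyte conditions quoted above, here is a small sketch that inspects the bytes of a UTF-8 string. It suggests why UTF-8 meets the last condition: the bytes that make up the umlaut all lie above 7fh. The sample word is just an example:

```php
<?php
$word = "M\xC3\xBCnchen"; // "München" as explicit UTF-8 bytes

// Collect the value of every byte that is outside the ASCII range.
$highBytes = [];
foreach (str_split($word) as $byte) {
    if (ord($byte) > 0x7F) {
        $highBytes[] = sprintf('%02X', ord($byte));
    }
}

echo strlen($word), "\n";            // byte count, not character count
echo implode(' ', $highBytes), "\n"; // the bytes that encode the umlaut
```

The word has seven characters but strlen() reports eight, because the byte-oriented string functions count the two-byte umlaut twice.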

I have read of problems with PHP using utf-8 encoding internally (sorry, cannot now give any reference) and therefore have placed a note in my mind to make sure that I maintain iso-8859-1 encoding. This will clearly be determined by both the server-encoding and the page-script encoding (another mind-notated reference was of someone reporting that his page-script-encoding determined the php-internal-encoding).
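
One way to see how a utf-8 / iso-8859-1 mismatch garbles text is to simulate it. This is only a sketch and assumes the mbstring extension is loaded. It reads the UTF-8 bytes of the example word as if each byte were one ISO-8859-1 character, which yields the familiar "MÃ¼nchen":

```php
<?php
// Assumption: the mbstring extension is available.
$utf8 = "M\xC3\xBCnchen"; // what a UTF-8 form submits for "München"

// Simulate a reader that treats each byte as one ISO-8859-1 character:
// converting "from ISO-8859-1 to UTF-8" re-encodes the two bytes separately.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Going the other way repairs the text, provided nothing else altered it.
$repaired = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');

echo bin2hex($misread), "\n";
```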

You are now dealing with quite a chain of transference:

  1. page delivery from server
    (you have stated the <meta>-declared encoding, but what do the page-headers say?)
  2. <Form>-encoding (have you set an ACCEPT-CHARSET attribute?)
    The default is "usually considered to be the character encoding used to transmit the document containing the FORM" but, in your situation, I would trust nothing!
  3. POST delivery from browser
    (check the headers again)
  4. reception by PHP
    (check the php-internal-encoding)
  5. delivery of results by PHP
    (what do you use to convert encoding? You have a choice of iconv [php.net], Multibyte Strings [php.net] or GNU Recode [php.net])
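
The checklist above could be sketched roughly as follows. The function name toUtf8 and the ISO-8859-1 fallback are assumptions made for illustration, and the mbstring extension is assumed to be loaded:

```php
<?php
// Step 1: state the charset in the real HTTP header, not only in <meta>.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Steps 4 and 5: check what arrived and convert only when needed.
// Hypothetical helper; the fallback charset is an assumption.
function toUtf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

$field = toUtf8($_POST['field1'] ?? '');
$safe  = htmlspecialchars($field, ENT_QUOTES, 'UTF-8');

// Step 2: ask the browser to submit the form as UTF-8.
echo '<form action="" method="post" accept-charset="utf-8">', "\n",
     '  <input type="text" name="field1" value="', $safe, '" />', "\n",
     '</form>', "\n";
```

The validity check is a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so it cannot prove which charset the browser really used.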

Brownie:
I have looked around and cannot really see where to start to fix the problem

The above should fix that!

ergophobe

11:37 pm on Sep 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a fairly lengthy post on this problem.

[webmasterworld.com...]

Look down the whole thread first as much of the meat is towards the bottom and coopster assembled a bunch of relevant links at the very end.

One problem that may be very hard to surmount is if your users carelessly paste in text that isn't displaying the characters as they expect, but they don't notice and send it as is.
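
The server cannot know what a user meant to paste, but it can at least flag input that is visibly broken. A rough sketch, assuming the form is supposed to deliver UTF-8; looksDamaged is a hypothetical name:

```php
<?php
// Hypothetical helper: true when the text is not valid UTF-8, or when it
// already contains U+FFFD, the replacement character that software inserts
// where it could not decode something.
function looksDamaged(string $text): bool
{
    // preg_match() with the /u modifier fails on invalid UTF-8 subjects.
    if (preg_match('//u', $text) !== 1) {
        return true;
    }
    return strpos($text, "\xEF\xBF\xBD") !== false;
}

var_dump(looksDamaged("M\xFCnchen"));     // lone ISO-8859-1 byte
var_dump(looksDamaged("M\xC3\xBCnchen")); // well-formed UTF-8
```

Text that was garbled into other valid characters before being pasted passes this check, so it catches only part of the problem described above.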

StupidScript

11:59 pm on Sep 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great stuff, AlexK. And a very good thread note, ergophobe. I'm going to wait on the sidelines for now ... ;)

AlexK

5:08 am on Sep 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



ergophobe:
I have a fairly lengthy post on this problem.

Wow! *That*'s a good post! Thank goodness somebody with direct-experience of this issue answered!

I am also going to take a little space to promote my Class on RFC-Compliant Request/Response Headers [webmasterworld.com] (roundly ignored by everybody!) since it offers the chance to programmatically-discover the charsets available/sent. One item that I wanted to add to the Class was to be able to auto-convert charsets, but felt I had insufficient experience at this point to tackle it. Hence my interest in this topic, and thanks once again for your posting.
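
For readers wondering what discovering the charsets programmatically might involve, here is an independent sketch that parses an Accept-Charset request header. It is not the Class mentioned above, and parseAcceptCharset is a hypothetical name. Many current browsers no longer send this header, so the example falls back to a sample value:

```php
<?php
// Hypothetical helper: returns charset => quality, highest preference first.
function parseAcceptCharset(string $header): array
{
    $prefs = [];
    foreach (explode(',', $header) as $part) {
        $pieces = explode(';', trim($part));
        $name = strtolower(trim($pieces[0]));
        if ($name === '') {
            continue;
        }
        $q = 1.0; // default quality when no q= parameter is given
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $prefs[$name] = $q;
    }
    arsort($prefs); // sort by quality, keeping the charset names as keys
    return $prefs;
}

// The sample header value is an assumption used when the browser sends none.
$prefs = parseAcceptCharset($_SERVER['HTTP_ACCEPT_CHARSET'] ?? 'utf-8, iso-8859-1;q=0.5');
print_r($prefs);
```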