British Pound Symbol and Plain Text Emails

neophyte

4:56 am on Dec 7, 2012 (gmt 0)

Hello All -

I'm sending out both an HTML and a plain text email containing an item cost preceded by the British pound sterling symbol (£).

On the HTML email the pound sign comes out fine (as expected) but the plain text email shows a question mark in a black diamond.

I've tried to solve this issue for the text email by running the cost and symbol string through html_entity_decode() but without any luck: still a black diamond/question mark.

I looked at the headers for my text email and it shows the content as being "Content-type: text/plain; charset=utf-8".

Should the character set of the plain text email be something other than utf-8 to solve this problem? Or am I also missing something else?

All assistance greatly appreciated.

SevenCubed

5:10 am on Dec 7, 2012 (gmt 0)

Have you tried just typing it in using ALT+0163?

lucy24

6:20 am on Dec 7, 2012 (gmt 0)

Uhm, wait, you're using an html entity in plain text? That won't work. You have to use the actual character-- but make sure your e-mail carries encoding information that can be correctly read by the recipient's e-mail program. Here it sounds as if your own e-mail reader is falling down on the job. Or you've got a mismatch between defaults for Read and Send.
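As a rough sketch of what "carries encoding information" can look like when the message is built in PHP: the helper name build_plain_text_mail() and the sample strings below are made up for illustration, and base64 is only one workable transfer encoding, not necessarily what the original poster's mailer does.

```php
<?php
// Hypothetical helper: assemble subject, headers and body for a UTF-8 plain text message.
function build_plain_text_mail(string $subject, string $body): array
{
    return [
        // RFC 2047 encoded-word; adequate for short subjects like this one
        'subject' => '=?UTF-8?B?' . base64_encode($subject) . '?=',
        'headers' => implode("\r\n", [
            'MIME-Version: 1.0',
            'Content-Type: text/plain; charset=UTF-8',
            'Content-Transfer-Encoding: base64',
        ]),
        // base64 keeps the two-byte pound sign intact through 7-bit relays
        'body' => chunk_split(base64_encode($body)),
    ];
}

// "\xC2\xA3" is the pound sign written out as its UTF-8 bytes
$mail = build_plain_text_mail('Your order', "Item cost: \xC2\xA312.50");

// Sending is left commented out; $to would be the recipient address.
// mail($to, $mail['subject'], $mail['body'], $mail['headers']);
```

The point is only that the declared charset and the actual bytes of the body have to agree; if the body held a lone Latin-1 A3 byte, the UTF-8 label would still be wrong.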

The question mark in a black diamond is the Unicode "I can't display this character" sign. (It has, ahem, an official name: U+FFFD REPLACEMENT CHARACTER.) In this case it's because your pound sign was initially encoded in Latin-1, giving it a byte value that's gibberish to UTF-8.

I made up the last sentence in the preceding paragraph. I am now going to look it up.

Yes. In Latin-1 the pound sign is the single byte A3. That byte is not valid on its own in UTF-8; 80 through BF occur only as continuation bytes in multi-byte characters whose first byte is C2 through F4. (In UTF-8 the pound sign is the two bytes C2 A3.) If your pound sign had happened to come immediately after something in the Latin-1 range Â (A-circumflex) through ß (German eszett = ss) with no intervening space, your UTF-8-encoded e-mail would render the pair as some other wholly unintended character. And if it had come right between ... oh, never mind.
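The byte-level claim is easy to check from PHP. This is a minimal sketch, assuming the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
$latin1Pound = "\xA3";      // pound sign as a single Latin-1 byte
$utf8Pound   = "\xC2\xA3";  // the same character, U+00A3, encoded as UTF-8

// A lone A3 is a continuation byte with no lead byte, so it is not valid UTF-8.
$latin1IsValidUtf8 = mb_check_encoding($latin1Pound, 'UTF-8');
$utf8IsValidUtf8   = mb_check_encoding($utf8Pound, 'UTF-8');

// Converting from Latin-1 yields the two-byte UTF-8 form.
$converted = mb_convert_encoding($latin1Pound, 'UTF-8', 'ISO-8859-1');

var_dump($latin1IsValidUtf8, $utf8IsValidUtf8, $converted === $utf8Pound);
```

A mail reader that is told "charset=utf-8" and then meets the lone A3 byte has nothing sensible to show, hence the black diamond.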

neophyte

6:58 am on Dec 7, 2012 (gmt 0)

Hi Lucy24 -

That's a lot of information - some that I understand... some that's way over my head. Just FYI: £ and other currency symbols are being drawn from a database.

Given all of the information you've related, what is the easiest way - given that these symbols are coming from a DB - to make this... and other various currency symbols... render correctly in a plain text email?

swa66

8:24 am on Dec 7, 2012 (gmt 0)

The easiest way is to use UTF-8 *everywhere*: in your database, in your processing, in your html, in your communication between the database and your php, in your string processing in php (mb_*), ...

It's hard to switch, but once you do it's a lot easier to go with UTF-8 everywhere.

[edited by: swa66 at 8:31 am (utc) on Dec 7, 2012]

topr8

8:27 am on Dec 7, 2012 (gmt 0)

add another database field? and insert the appropriate character depending on whether you are rendering the html or plain text version of the email

g1smd

8:49 am on Dec 7, 2012 (gmt 0)

A much easier way is to use the ISO 4217 currency codes: GBP, USD, etc.

This neatly avoids the problem where the seller lives in a country that uses the same named currency as the buyer but with a different value. When an email arrives, it might not be clear which "dollar" price is being quoted.
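A minimal sketch of that approach for the plain text version, assuming a hypothetical helper format_price_plain() and two decimal places (which is not correct for every currency; JPY, for example, has none):

```php
<?php
// Hypothetical helper: prefix the amount with an ISO 4217 code instead of a symbol.
function format_price_plain(float $amount, string $currencyCode): string
{
    // Pure ASCII output, so no charset mismatch can mangle it.
    return strtoupper($currencyCode) . ' ' . number_format($amount, 2, '.', ',');
}

$line = format_price_plain(1234.5, 'gbp');
echo $line, "\n";
```

Because the result is pure ASCII, it reads the same whatever charset the recipient's mail program assumes.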

swa66

9:07 am on Dec 7, 2012 (gmt 0)

g1smd:
ISO4217: sure, but if I go to a .co.uk website that quotes something in pounds, nobody would expect it not to be GBP.
But that's still not going to solve accented characters in e.g. a shipping address, you still have to deal with that. And then the easiest remains to be UTF-8 all the way.

topr8:
another field: do remember that the database, and the communication between the database and PHP, are by default in ISO Latin-1 (actually using a Swedish sorting order by default). If you do not change that to UTF-8, you're storing gibberish in the database if you squeeze in UTF-8. And it will haunt you down the line due to the multi-byte nature of UTF-8.
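The "gibberish" can be reproduced without a database. The sketch below only approximates what happens when UTF-8 bytes travel over a connection that believes they are Latin-1 and converts them to UTF-8 again. It assumes mbstring is available and uses ISO-8859-1 as a stand-in for MySQL's latin1.

```php
<?php
$utf8Pound = "\xC2\xA3"; // the pound sign correctly encoded as UTF-8

// Each of the two bytes is treated as a separate Latin-1 character
// and re-encoded, which is the classic double-encoding accident.
$stored = mb_convert_encoding($utf8Pound, 'UTF-8', 'ISO-8859-1');

// Hex dump of what would end up stored: four bytes instead of two.
$storedHex = bin2hex($stored);
echo $storedHex, "\n";
```

Viewed as UTF-8, those four bytes display as A-circumflex followed by a pound sign, which is the familiar mojibake seen when the connection charset is left at its default.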

neophyte

10:37 am on Dec 7, 2012 (gmt 0)

Thanks for everyone's replies.

swa66 -

I've always been interested in a "UTF8 Everywhere" approach and have tried to implement this in my projects. Interestingly, the table that holds a long list of currency characters (stored as html currency symbol codes like &#163; ... not the literal £ as I stated earlier) uses a UTF8 character set with utf8_unicode_ci collation on each field.

Additionally, I'm setting all of the following in my project init file:

ini_set('mbstring.language','Neutral');
ini_set('mbstring.internal_encoding','UTF-8');
ini_set('mbstring.encoding_translation','On'); // PHP_INI_PERDIR setting: ini_set() cannot change it at runtime
ini_set('mbstring.http_input','auto');
ini_set('mbstring.http_output','UTF-8');
ini_set('mbstring.detect_order','auto');
ini_set('mbstring.substitute_character','none');
ini_set('default_charset','UTF-8');

I thought I had everything covered... apparently not.

Do you think that if I used some form of mb_* string processing (and what would you suggest?) on these currency characters before sending the plain text email, the plain text display problem would be eliminated? Or is this a fundamental problem of using &#163; and such for the characters to begin with?

I'm very interested in a "best practices" approach to this issue so that I won't have to trouble over this in the future.

Thanks for your, and everyone's, help on this.

swa66

1:42 pm on Dec 7, 2012 (gmt 0)

I'm not 100% sure what all those ini_set actually cause - I don't use them myself.

What I do:

When connecting to a database on the command line (mysql -p -u xyz):
CHARSET utf8;
-> this stops the latin-1 conversions, I now talk utf-8 to the database.

When creating a database, e.g.:
CREATE DATABASE test DEFAULT CHARACTER SET = `utf8`;
-> this makes the database I create UTF-8

When connecting from php I only use mysqli (I've no clue how the obsolete mysql works in this respect).
I use this in my connect script:


// connect to the database
$mysqli = new mysqli($server, $user, $pass, $db);
// error handling suppressed
// set UTF-8
$mysqli->set_charset("utf8");

This does the same as the "CHARSET utf8" command line when talking directly to mysql.

I output polyglot html5 nowadays, where I do that, I use:


// print start of doc
if(isset($_SERVER["HTTP_ACCEPT"]) && stristr($_SERVER["HTTP_ACCEPT"],"application/xhtml+xml")){
  header('Content-Type: application/xhtml+xml;charset=UTF-8');
}
print('<!DOCTYPE html>'."\n");
print('<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'."\n");
print(' <head>'."\n");
print(' <meta charset="UTF-8" />'."\n");
print(' <title>...

Note the charset in there twice: once in the header and once in the document. It puts browsers in UTF-8 mode, so whatever form data you get them to send to you (unless you ask for another encoding) will now be UTF-8 encoded.

When processing input, I validate the input (of course), but part of that is validating that I get valid UTF-8 sequences if I expect text strings:


 if ( !mb_check_encoding($input, 'UTF-8')) {
   // error handling goes here
 }

Similarly, to check that some input is not too long (using 50 as an example here):


 if ( mb_strlen($input,'UTF-8') > 50 ) {
   // error handling goes here
 }
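Why the mb_* variants matter for a length check can be seen with the pound sign itself. This is a small sketch, assuming mbstring is loaded; the sample string is illustrative.

```php
<?php
$price = "\xC2\xA3" . "12.50";  // the pound sign as UTF-8 bytes, followed by the amount

$byteCount = strlen($price);              // counts bytes
$charCount = mb_strlen($price, 'UTF-8');  // counts characters

// Byte-based slicing can cut a multi-byte character in half.
$firstByte = substr($price, 0, 1);
$firstChar = mb_substr($price, 0, 1, 'UTF-8');

var_dump($byteCount, $charCount, bin2hex($firstByte), bin2hex($firstChar));
```

A limit enforced with strlen() would reject some strings that are within the character limit, and substr() could hand back half a character, which then shows up as a black diamond further down the line.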

Oh yes, and of course since I use polyglot html5, I'm only allowed 5 named entities anymore: &amp;, &quot;, &lt;, &gt; and &apos;. So I have my own output filters to replace those where needed.
I don't worry about SQL injection all that much because I use prepared statements everywhere I touch user input.
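The poster's own output filters are not shown. One way to get the same five-entity behaviour with a built-in function is sketched below; xml_escape() is a hypothetical name, and the ENT_XML1 flag assumes PHP 5.4 or later.

```php
<?php
// Hypothetical output filter: escapes only &, ", <, > and ' and leaves
// every other character (including the pound sign) as literal UTF-8.
function xml_escape(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$escaped = xml_escape("Fish & Chips <\xC2\xA35> \"it's\"");
echo $escaped, "\n";
```

With ENT_QUOTES and ENT_XML1 together, the single quote comes out as &apos; instead of a numeric entity, which matches the five names listed above.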

lucy24

7:42 pm on Dec 7, 2012 (gmt 0)

is this a fundamental problem of using &#163; and such for the characters to begin with?

For plain text, html &entities; -- all three kinds* -- will eventually have to be converted into single characters. Otherwise your plain-text e-mail will either display the literal string "&#163;" (urk! let's go for the How To Look Like An Amateur award!) or it will unintentionally turn the whole e-mail into html, making a mess of everything else.
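In PHP that conversion can be done with html_entity_decode(), provided the target charset is passed explicitly. Before PHP 5.4 the function defaulted to ISO-8859-1, which may be how a lone A3 byte ended up in the UTF-8 e-mail here (an assumption; the original call is not shown). A minimal sketch, with entities_to_utf8() as a made-up helper name and ENT_HTML5 assuming PHP 5.4 or later:

```php
<?php
// Hypothetical helper: turn decimal, hexadecimal and named entities
// into real UTF-8 characters for the plain text version of the e-mail.
function entities_to_utf8(string $text): string
{
    // Naming the charset avoids depending on the PHP version's default.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$plain = entities_to_utf8('&#163;10, &#xA3;20, &pound;30');
echo bin2hex(substr($plain, 0, 2)), "\n"; // hex of the first decoded character
```

All three entity forms decode to the same two-byte UTF-8 sequence, so the result can go straight into a body labelled "charset=utf-8".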

You can now count the number of places where encodings have to match:

changing an entity into a character for database storage
and/or
changing an entity {et cetera} for retrieval from database
and/or changing a character into an entity for both
and/or changing an html entity into a percent-encoded entity
and/or transfer from holding bin to e-mail
{et cetera}

Set everything explicitly and don't rely on defaults.

* The form &#nnn; is a decimal entity. Character number in base 10. There are also &#xnnn; hexadecimal entities and &pound; named entities. In raw html, all produce the intended result, independent of encoding. But they take up extra room and make your file hard to read. I'll bet you don't say &aring; and &ouml; throughout!

neophyte

1:19 am on Dec 8, 2012 (gmt 0)

swa66 and Lucy24 -

Thanks very much for your detailed explanations... I see that I've got quite a few things to consider/correct to make these symbols render correctly in plain text... live and learn.

Thanks again!