I thought it was the other way around. We've had a few threads [google.com] on the topic and I was under the impression from Japanese programmers and site designers that SHIFT-JIS was a better encoding. I was told that it worked better with databases used to power the site's back end. However, that doesn't seem to be as much of an issue these days.
Back in the days of the Netscape browser I had some real problems with older versions being able to handle EUC-JP encoding. None of the modern PC or mobile browsers seem to have issues with EUC-JP that I've heard of.
That's weird: I've always been told the contrary.
And I even remember that, after I left a Japanese web agency, the new webmasters quickly changed the site's encoding to EUC-JP, arguing that Shift-JIS can be buggy with MySQL databases (is that true?) and that I was very incompetent for using Shift-JIS.
And it looks like the majority of Japanese sites use EUC-JP now.
Actually, I've heard that UTF-8 is the best, but I still use shift-jis.
Shift-JIS can cause some display problems, with the wrong kanji being displayed. I've had it happen a couple of times.
I've had display issues with UTF-8 (e.g. with the PHP trim function) but absolutely never with Shift-JIS. Japanese clients never reported a single incident or mojibake with Shift-JIS.
We've always used Shift-JIS for the Japanese versions of our sites. No complaints so far.
I just read that Hankaku Kanas aren't supported on EUC-JP.
Still, the majority of sites use EUC-JP instead of shift-JIS.
I don't get it :(
As a stupid gaijin/gawilo that has to produce output in several languages, primarily English and other Latin languages as well as Chinese and Japanese, UTF-8 is the simplest solution for me since it is unambiguous and reasonably standardised and covers everything. And even then I actually encode all the non-7-bit characters as HTML entity codes to avoid them getting mangled between my code and the browser...
As to 'better' or 'worse' I think this is going to be like whether Fuji-san is better or worse than Mt Everest: it all depends on what you mean! B^>
Cool - thanks for the new (to me) word:
Mojibake is now English? Cool. ;)
|the majority of sites use EUC-JP instead of shift-JIS. |
Do you have stats on this or is this just from your personal experience? I still see a lot of Shift-JIS sites out there.
Unfortunately from what I've heard in the industry UTF-8 is still more problematic than either EUC-JP or Shift-JIS. (That goes for Chinese encoding as well.) There are character display issues with PHP and MySQL for instance that are the bane of developers of Japanese sites. I'm still looking forward to the day when Unicode will truly be the best encoding solution. They're heading in the right direction.
The trick of encoding the non-7-bit characters as &nnnnn; entity codes got round a lot of (Java) server problems for me in the early days and still works well. It essentially avoids any broken text-handling component anywhere in the path messing up the text.
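A minimal sketch of that entity-encoding trick, written in Python rather than the Java mentioned above. The function name `to_entities` is hypothetical; the sketch relies on the standard `xmlcharrefreplace` error handler, which replaces every character that does not fit in ASCII with a decimal numeric character reference.

```python
def to_entities(text: str) -> str:
    """Return text with every non-ASCII character replaced by &#NNNNN;."""
    # Characters that cannot be encoded as ASCII are swapped for
    # decimal numeric character references by the error handler.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_entities("Fuji-san: 富士山"))
```

This only deals with non-ASCII characters. Markup-significant characters such as `&` and `<` pass through unchanged and still need ordinary HTML escaping.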
Thanks again guys :)
From what I read, I think I'll choose the following option :
- use UTF-8 for western languages
- use shift-JIS for Japanese
What do you think?
Also, I have another question:
I have developed my own multilingual CMS that uses the dedicated ISO encoding for every language. All the PHP files are in ANSI mode, the texts for the interface are taken from flat text files, and the website content is taken from MySQL databases. I've heard that it's necessary to convert the PHP files into UTF-8 for the encoding to work. Is that true? Even if the PHP files contain only code? And I have to convert the flat text files and the MySQL tables, right?
Sorry but I'm getting very confused with all those encoding issues :(
|Do you have stats on this or is this just from your personal experience? |
You're right, it's from personal experience.
I went through several sites yesterday and they were all in EUC-JP. I can't find statistics for website encodings :(
Here are some official stats. I looked at the source of the top 25 Japanese websites [alexa.com] according to Alexa. (I skipped the English language sites.)
- Yahoo Japan = EUC-JP
- Google Japan = UTF-8
- Mixi = EUC-JP
- FC2 = Shift_JIS
- Rakuten = x-euc-jp
- YouTube Japan = none
- Livedoor = UTF-8
- Goo = UTF-8
- MSN Japan = UTF-8
- Wikipedia = UTF-8
- Amazon Japan = UTF-8
- Infoseek Japan = EUC-JP
- Nifty = Shift_JIS
- 2ch.net = x-sjis
- Nicovideo.jp = UTF-8
- Hatena = UTF-8
- Geocities Japan = EUC-JP
- BIGLOBE = Shift_JIS
- Sakura Internet = Shift_JIS
- Ameba = UTF-8
- Seesaa = UTF-8
- OCN = Shift_JIS
- Mobile Space = none
- Excite Japan = Shift_JIS
- Microsoft Japan = UTF-16
And the winner is:
- UTF-8 = 10
- Shift_JIS = 6
- EUC-JP = 4
- others = 3
- none = 2
So, I think I'll have to learn how to handle UTF-8 properly before I can use it seamlessly.
I just remembered, email!
You gotta set the encoding properly for that too!
From bill's list, the actual character encodings (compared to declared charsets) are slightly different in some cases. For the two sites marked "none" the encodings are UTF-8 for YouTube Japan and Shift_JIS for Mobile Space. Rakuten is actually EUC-JP, 2ch.net is Shift_JIS, and Microsoft Japan is UTF-8.
So the final tally is as follows:
- UTF-8 = 12
- Shift_JIS = 8
- EUC-JP = 5
[edited by: encyclo at 7:25 pm (utc) on July 1, 2007]
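One rough way to compare a declared charset with the actual bytes is to try strict decoding with each candidate encoding and see which one succeeds. The sketch below is only a heuristic under stated assumptions: the function name `guess_encoding` and the candidate order are hypothetical, short or pure-ASCII inputs are ambiguous, and some byte sequences are valid in more than one of these encodings, so the first match is not proof.

```python
# Candidate encodings, tried in order. UTF-8 goes first because it is
# the strictest: non-UTF-8 Japanese text rarely decodes as valid UTF-8.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def guess_encoding(raw: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes raw without errors, else None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    sample = "日本語のページ"
    for encoding in CANDIDATES:
        print(encoding, "->", guess_encoding(sample.encode(encoding)))
```

A result that disagrees with the page's declared charset is a hint that the declaration is missing or wrong, which is the situation described for Rakuten, 2ch.net and Microsoft Japan above.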
So, nobody knows how to safely transform shift-JIS/ISO sites into UTF-8?
I can hardly find any useful info on the internet :(
|how to safely transform shift-JIS/ISO sites into UTF-8? |
If you're running Linux, the best way is to use the
Also available via PHP:
I don't know the best way to convert documents under Windows, unfortunately, other than using the same library via PHP.
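Independently of the tools referred to above, the conversion itself is just "decode with the old encoding, encode with the new one", and it can be run on Windows as well. Here is a minimal Python sketch; the names `transcode` and `convert_file` are hypothetical, and the default source encoding is an assumption you would need to adjust per file.

```python
from pathlib import Path


def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Re-encode raw bytes from source_encoding to target_encoding."""
    # Strict decoding raises UnicodeDecodeError if the bytes are not
    # valid in source_encoding, instead of silently producing mojibake.
    text = raw.decode(source_encoding)
    return text.encode(target_encoding)


def convert_file(src: Path, dst: Path, source_encoding: str = "shift_jis") -> None:
    """Read src in source_encoding and write it to dst as UTF-8."""
    dst.write_bytes(transcode(src.read_bytes(), source_encoding))


if __name__ == "__main__":
    original = "日本語".encode("shift_jis")
    print(transcode(original, "shift_jis").decode("utf-8"))
```

Files produced on Windows sometimes use Microsoft's Shift-JIS variant, for which Python's `cp932` codec may be the better source encoding. After converting, the charset declared in the page or HTTP header has to be changed to UTF-8 too, otherwise browsers will still decode the new bytes with the old charset.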
Thanks for the follow up on that one encyclo.
|So the final tally is as follows: |
- UTF-8 = 12
- Shift_JIS = 8
- EUC-JP = 5
I got a little lazy there. Thanks for keeping me on my toes. ;) I always tell people to check out your excellent thread: Character encoding, entity references and UTF-8 [webmasterworld.com]
Thanks for the link Bill :)
Too bad all my previous searches never pointed me to it :(
Thanks Encyclo :)
But it will be more time saving for me if I can find a Windows based UTF-8 converter.
I've found one called "Character Set Converter 1.3.7" but it can't convert from Asian character sets :(
From what I read here, UTF-8 and PHP don't function too well together : [phpwact.org...]
I think I'll wait for a stable PHP6 before moving toward UTF-8.
What do you think?
Shift_JIS is also i-mode compatible, whereas not all keitai are UTF friendly yet.
For PHP there is a mod that can convert between encodings, because usually UTF-8 is used for RSS but Shift_JIS is used for i-mode. Many Japanese scripts that I've seen just use the same code to change the encoding.
A long time ago I used to create websites in Japanese just using Shift_JIS, and English pages with the normal Western ISO encoding. The backlinks from the English pages, which were more popular, had no effect on the Japanese pages. One day I changed everything to UTF-8 and the rankings for all the Japanese pages went up. This was about 3 years ago but I think it might still apply.
I have mostly problem free experiences with utf-8 for Japanese sites.
The only problems I have had is when moving MySQL databases to different servers. Sometimes all hell broke loose but I think it was from me not setting up the database import correctly for utf-8 on the new server.
I have more problems with email encoding than website encoding....
You're using MBstring, right?
All the problems I had came from using PHP functions that don't support multibyte encodings, like trim() for example.
Also, it can happen that a problem arises only on a certain Kanji which makes it very difficult to spot.
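To illustrate how a byte-oriented trim can break only certain characters, here is a sketch in Python that mimics trimming byte by byte. It is a stand-in for the general failure mode, not a reproduction of PHP's trim(); the function names are hypothetical. The idea is to strip the full-width space (U+3000, bytes E3 80 80 in UTF-8) from both ends of a string.

```python
FULLWIDTH_SPACE = "\u3000"  # ideographic space, common in Japanese text


def byte_trim(raw: bytes) -> bytes:
    """Strip the *bytes* of U+3000 from both ends (byte-oriented, unsafe)."""
    # bytes.strip() treats its argument as a set of single bytes, so any
    # leading or trailing E3 or 80 byte is removed, even when it belongs
    # to a different character.
    return raw.strip(FULLWIDTH_SPACE.encode("utf-8"))


def char_trim(text: str) -> str:
    """Strip U+3000 from both ends, working on whole characters."""
    return text.strip(FULLWIDTH_SPACE)


if __name__ == "__main__":
    for word in ("日", "あ"):
        padded = FULLWIDTH_SPACE + word + FULLWIDTH_SPACE
        trimmed = byte_trim(padded.encode("utf-8"))
        # errors="replace" makes the damage visible instead of raising
        print(word, "->", trimmed.decode("utf-8", errors="replace"))
        print(word, "->", char_trim(padded))
```

With these inputs the kanji survives the byte-oriented trim, but the hiragana loses its first byte because that byte also occurs in the full-width space. That is why such bugs tend to show up only on particular characters.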
I don't get why all the PHP packages weren't provided with MBstring already included. There are too many shared environments that don't have MBstring.
Just out of curiosity: what makes it so difficult to use UTF-8 on emails?
It seems like I get a lot of different emails with different encodings, so it is a bit of a hassle to always change the encoding to read them. If I am not careful when I forward an email, then I send mojibake email to others. Browsers seem better at sorting out what charset is used, but email clients or webmail seem to have trouble automatically understanding what to do.
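On the sending side, declaring the charset explicitly at least spares the recipient's client from guessing. A minimal sketch with Python's standard `email` package follows; the helper name `build_message` and the example addresses are placeholders, and choosing UTF-8 is an assumption (some older Japanese mail clients expected other encodings).

```python
from email.message import EmailMessage


def build_message(subject: str, body: str, charset: str = "utf-8") -> EmailMessage:
    """Build a plain-text message whose headers declare the body charset."""
    msg = EmailMessage()
    msg["Subject"] = subject  # non-ASCII subjects are encoded on output
    msg["From"] = "sender@example.com"      # placeholder address
    msg["To"] = "recipient@example.com"     # placeholder address
    # set_content() records the charset in the Content-Type header and
    # picks a suitable transfer encoding for the body.
    msg.set_content(body, charset=charset)
    return msg


if __name__ == "__main__":
    message = build_message("テスト", "こんにちは、世界")
    print(message["Content-Type"])
```

The same check works in the other direction: when reading a message, `get_content_charset()` reports what the sender declared, which can then be compared with how the text actually displays.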
Can someone help me with ISO/JIS charsets on MySQL?
I want to know what the best practice is for storing both ISO and JIS strings in a single MySQL column from a website that uses either ISO or JIS encodings.
Should the MySQL server be set to a specific charset?
Should I specify a charset when I store the data?
Thanks a lot for helping.