Asian characters in product-search

jcmoon

10:16 pm on Aug 25, 2008 (gmt 0)

We have scripts for an English site which we're getting to work on a Japanese counterpart. It's a shared Linux server, and our scripts are all Perl. I'm stuck in the product-search script.

Let's say a visitor types Japan (in Japanese) into the product-search ... the URL turns into http://example.co.jp/search?qu=%E6%97%A5%E6%9C%AC

For what it's worth, there are only a few dozen Japanese strings that should return results in this product-search script, so if we can merely recognize those few, we'd be fine.

So here's my question:
When the above happens ... when the visitor ends up at that sort of URL ... what does the Perl script see as the contents of that variable?
How do we jump from whatever Perl sees, to something we can recognize (like 日本, or %26%2326085%3B%26%2326412%3B, or even %E6%97%A5%E6%9C%AC)?

bill

4:56 am on Aug 26, 2008 (gmt 0)

The variable looks a lot like JIS code that has been escaped. A lot of mail clients used to return that sort of mojibake when dealing with Japanese text. However, that string produces gibberish for me when I convert it.

phranque

11:47 am on Aug 26, 2008 (gmt 0)

i used a url decoding tool for UTF-8 and it converted to two reasonable-looking Japanese (kanji) characters:

日本

(this actually displays properly when i paste in the form but it gets converted when submitted)

normally perl will see the (percent) encoded text and you are responsible for properly decoding the value.
something like this would typically work:
$value =~ s/%([\da-f][\da-f])/chr(hex($1))/egi;

to explain:
[\da-f] defines a character class for hexadecimal digits and it means a numerical digit or a letter 'a' through 'f'.
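the statement takes the text string in $value and replaces each percent sign followed by a pair of hexadecimal digits with a single byte: perl's hex function turns the two digits into a number and chr turns that number into the character with that value. the e flag evaluates the replacement as code, g repeats it for every match and i makes the match case-insensitive.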
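
As a minimal sketch, here is that substitution applied to the query value from the original post. The sub name url_decode is only for illustration, and the tr/+/ / step is an assumption about form-encoded input rather than part of the line above.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-decode a raw query-string value into bytes.
sub url_decode {
    my ($value) = @_;
    $value =~ tr/+/ /;                                 # assumption: '+' stands for a space
    $value =~ s/%([\da-f][\da-f])/chr(hex($1))/egi;    # %E6 becomes the byte 0xE6
    return $value;
}

my $bytes = url_decode('%E6%97%A5%E6%9C%AC');

# Show what Perl now holds: six raw bytes, not two characters.
my @hex = map { sprintf '%02X', ord($_) } split //, $bytes;
printf "%d bytes: %s\n", length($bytes), join(' ', @hex);
```

The result is a byte string, so it still has to be treated as UTF-8 before Perl sees it as two Japanese characters.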

i'm not sure that this is much help for your character set so you might be better served to look at the perl Encode module (which includes japanese character mappings) and try the decode_utf8 function:
[search.cpan.org...]
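
A rough sketch of what decode_utf8 does with the bytes in question, assuming the value has already been percent-decoded. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC";   # six bytes, as left by percent-decoding
my $chars = decode_utf8($bytes);          # two characters: U+65E5 U+672C

printf "%d bytes -> %d characters\n", length($bytes), length($chars);

# Encode back to bytes before printing. Printing the character string
# without an output encoding is a common source of mojibake.
binmode STDOUT;                           # write raw bytes
print encode_utf8($chars), "\n";
```

Decoding is useful for character-level work such as length or regex matching. For a plain equality test against a short list of known terms, comparing the raw bytes may be enough.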

hope this helps...

[edited by: Woz at 12:49 am (utc) on Aug. 27, 2008]
[edit reason] Spelling, per request. [/edit]

jcmoon

10:23 pm on Aug 26, 2008 (gmt 0)

Thanks for the tips! I tried the transforming line suggested, but for some reason the output was always the same as the input, no matter how I tweaked that line or tried to use pieces of it.

The decode_utf8 function did change things up, but the page ended up showing mojibake ... d'oh!

I might just not be doing those items correctly ...

But here's what my partner-in-crime found: do a google search for /perl multiple-byte characters ken lunde/ and you'll find a handful of PDFs from one of the geniuses at Adobe Systems. In one place he mentions in passing that a line like $string = "\xE6\x97\xA5\xE6\x9C\xAC"; has the same effect as putting the word "Japan" in Japanese in there.

And that, luckily, is something my scripts can anticipate, recognize, and react to.
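
One way this recognition could look, as a sketch only: a hash keyed by the UTF-8 bytes of each expected term. The table contents, the category names and the sub name lookup_term are all made up for illustration. If the value has already been percent-decoded (for example by a CGI module), the substitution simply finds nothing to replace.

```perl
use strict;
use warnings;

# Hypothetical lookup table: keys are the UTF-8 bytes of each expected
# Japanese search term, values are made-up internal category names.
my %known_terms = (
    "\xE6\x97\xA5\xE6\x9C\xAC" => 'japan',    # 日本
    "\xE6\x9D\xB1\xE4\xBA\xAC" => 'tokyo',    # 東京
);

sub lookup_term {
    my ($raw) = @_;                           # raw value of the qu parameter
    (my $bytes = $raw) =~ s/%([\da-f][\da-f])/chr(hex($1))/egi;
    return $known_terms{$bytes};              # undef when the term is not recognised
}

my $category = lookup_term('%E6%97%A5%E6%9C%AC');
print defined $category ? "matched: $category\n" : "no match\n";
```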

phranque

11:59 pm on Aug 26, 2008 (gmt 0)

the page will show mojibake if you haven't specified the correct character set for the page's content type.
you can do this using a meta tag such as:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

otherwise it uses the default character encoding for your browser which may be something like ISO-8859-1 for example.
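
From a Perl CGI script the character set can also be declared in the HTTP header the script prints. The following is a sketch under the assumption that the script writes its own headers; results_page is a hypothetical helper, and real code should HTML-escape anything that came from the visitor.

```perl
use strict;
use warnings;

# Hypothetical helper: build a complete CGI response for a results page.
sub results_page {
    my ($term_bytes) = @_;     # UTF-8 bytes of an already-recognised search term
    return join '',
        "Content-Type: text/html; charset=utf-8\r\n\r\n",
        qq{<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>\n},
        "<body><p>Results for: $term_bytes</p></body></html>\n";
}

binmode STDOUT;                # emit the bytes untouched
print results_page("\xE6\x97\xA5\xE6\x9C\xAC");
```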