Let's say visitors put Japan (in Japanese) into the product search ... the URL turns into http://example.co.jp/search?qu=%E6%97%A5%E6%9C%AC
For what it's worth, there are only a few dozen Japanese strings that should expect results in this product-search script, so if we can merely recognize those few, we'd be fine.
So here's my question:
When the above happens ... when the visitor ends up at that sort of URL ... what does the Perl script see as the contents of that variable?
How do we jump from whatever Perl sees, to something we can recognize (like 日本, or %26%2326085%3B%26%2326412%3B, or even %E6%97%A5%E6%9C%AC)?
日本
normally perl will see the (percent) encoded text and you are responsible for properly decoding the value.
something like this would typically work:
$value =~ s/%([\da-f][\da-f])/chr(hex($1))/egi;
to explain:
[\da-f] defines a character class for hexadecimal digits and it means a numerical digit or a letter 'a' through 'f'.
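the statement takes the text string in $value and replaces each percent sign followed by two hexadecimal digits with the single character whose code is that hex value: hex turns the two digits into a number and chr turns that number into a character. the e flag makes perl evaluate the replacement as code, g repeats it for every match, and i makes the match case-insensitive.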
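As a rough sketch of what that substitution does to the URL from the question (the variable name and the printf line are just for illustration, and the character class is written as [0-9a-f] so that only ASCII digits match):

```perl
use strict;
use warnings;

# The query-string value exactly as it appears in the URL from the question.
my $value = '%E6%97%A5%E6%9C%AC';

# Replace each %XX escape with the single byte it names.
$value =~ s/%([0-9a-f]{2})/chr(hex($1))/egi;

# At this point $value holds six raw bytes (the UTF-8 encoding of the
# two Japanese characters), not two characters.
printf "%d bytes: %s\n",
    length($value),
    join(' ', map { sprintf '%02X', ord } split //, $value);
# prints "6 bytes: E6 97 A5 E6 9C AC"
```

If the script reads the parameter through CGI.pm's param function rather than from the raw query string, this percent-decoding step has already been done for you.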
i'm not sure that this alone is much help for your character set, since it leaves you with raw bytes rather than japanese characters, so you might be better served to look at the perl Encode module (which includes japanese character mappings) and try the decode_utf8 function:
[search.cpan.org...]
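A minimal sketch of the decode_utf8 step, assuming the percent-decoding has already produced the six bytes shown in the question (the variable names are made up for the example):

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw bytes as they look after percent-decoding the query string.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC";

# Interpret those bytes as UTF-8, giving a string of two Unicode characters.
my $chars = decode_utf8($bytes);

printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints "6 bytes, 2 characters"

# The code points match the numeric entities mentioned in the question.
print join('', map { sprintf '&#%d;', ord } split //, $chars), "\n";
# prints "&#26085;&#26412;"

# Encode back to UTF-8 bytes before writing the string out.
print encode_utf8($chars), "\n";
```

One thing to watch for: once the string has been decoded, it generally needs to be encoded again (or the output handle given an encoding layer) before printing, and the page has to declare the same charset, otherwise garbled output is a likely result.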
hope this helps...
[edited by: Woz at 12:49 am (utc) on Aug. 27, 2008]
[edit reason] Spelling, per request. [/edit]
The decode_utf8 function did change things up, but the page ended up showing mojibake ... d'oh!
I might just not be doing those items correctly ...
But here's what my partner-in-crime found: do a Google search for /perl multiple-byte characters ken lunde/ and you'll find a handful of PDFs from one of the geniuses at Adobe Systems. In one place he mentions offhand that a line like $string = "\xE6\x97\xA5\xE6\x9C\xAC"; has the same effect as putting the word "Japan" in Japanese in there.
And that, luckily, is something my scripts can anticipate, recognize, and react to.
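For anyone wanting to do the same, here is one way the "few dozen known strings" idea might look, working purely on bytes. The %known_terms table, its values, and the recognize_term helper are all hypothetical names for this sketch, not part of any real script:

```perl
use strict;
use warnings;

# Hypothetical lookup table: keys are the raw UTF-8 bytes of the search
# terms we expect, values are placeholder internal product keys.
my %known_terms = (
    "\xE6\x97\xA5\xE6\x9C\xAC" => 'japan',   # 日本
    "\xE6\x9D\xB1\xE4\xBA\xAC" => 'tokyo',   # 東京
);

# Percent-decode a raw query value and look it up in the table.
# Returns the internal key, or undef when the term is not recognized.
sub recognize_term {
    my ($raw) = @_;
    return unless defined $raw;
    (my $bytes = $raw) =~ s/%([0-9a-f]{2})/chr(hex($1))/egi;
    return $known_terms{$bytes};
}

my $hit = recognize_term('%E6%97%A5%E6%9C%AC');
print((defined $hit ? $hit : 'no match'), "\n");
# prints "japan"
```

Because the keys are byte strings, this only matches when the browser actually submits the form in UTF-8; a form sent in another encoding would produce different bytes for the same word.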
otherwise it uses the default character encoding for your browser which may be something like ISO-8859-1 for example.