supporting UTF-8 and ISO-8859-1 input

how to support both UTF-8 and ISO-8859-1 in "input" URL?

         

vinnydtm

9:17 pm on Jan 22, 2009 (gmt 0)

10+ Year Member



Hi,

We have a site supporting only ISO-8859-1 encoding. So we get input like this:

http://example.com/Param%E9/

(%E9 is an "e" with an acute accent, "é", URL-encoded as ISO-8859-1.)

Now we want to support UTF-8 while continuing to support the existing ISO encoding, i.e. these should all work:

http://example.com/Param%E9/?charset=ISO-8859-1
http://example.com/Param%C3%A9/?charset=UTF-8
http://example.com/Param%C3%A9/ <- default if charset not defined (we haven't decided which to default).
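
For reference, here is a minimal Perl sketch of the two byte sequences involved. The helper name percent_encode_bytes is made up for illustration; only the core Encode module is assumed. It shows that "é" (U+00E9) is the single byte E9 in ISO-8859-1 and the two bytes C3 A9 in UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode every byte of a byte string.
sub percent_encode_bytes {
    my ($bytes) = @_;
    return join '', map { sprintf '%%%02X', ord($_) } split //, $bytes;
}

my $e_acute = "\x{E9}";    # the character U+00E9 (e with acute accent)

my $as_latin1 = percent_encode_bytes(encode('ISO-8859-1', $e_acute));
my $as_utf8   = percent_encode_bytes(encode('UTF-8', $e_acute));

print "ISO-8859-1: $as_latin1\n";    # ISO-8859-1: %E9
print "UTF-8:      $as_utf8\n";      # UTF-8:      %C3%A9
```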

So, the question is simply: can I do this in the Apache, possibly with rewrites?

The architecture is an Apache web server front end connected to Tomcat, which runs our application in the WebWork framework.

We have tried setting the Apache/Tomcat connector with URIEncoding="UTF-8" and other parameters that make the site fully consistent with UTF-8 encoding. But the tricky part is we still need to support the "input" URL with ISO encoding.

One option we are considering is to keep the input and URIEncoding as ISO and do the conversion in the application. But that seems a little unsatisfying.

Thanks for any insights!

-- Vince

jdMorgan

10:44 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You'll have to do the conversion in the application, because only the application is going to have the 'knowledge' and the flexibility to figure out what encoding is being used.

As for the output side, you can check the "Accept-Charset" header in the client request to determine which encoding to use when sending your responses.

Jim

Caterham

12:09 am on Jan 24, 2009 (gmt 0)

10+ Year Member



So, the question is simply: can I do this in the Apache, possibly with rewrites?

What should be done? Convert the string with charset=ISO-8859-1 into UTF-8 and pass it to Tomcat?

default if charset not defined (we haven't decided which to default).

But you know that this will be either A or B and not mixed?

We have tried setting the Apache/Tomcat connector with URIEncoding="UTF-8"

That should mean that your connector (mod_jk?) would expect %C3%A9 instead of %E9.

But the tricky part is we still need to support the "input" URL with ISO encoding.

Due to bookmarks? Are there clients which run into problems when you reference your pages like href="foo%C3%A9"?

and do the conversion in the application. But that seems a little unsatisfying.

You could use mod_perl in the URI-to-filename translation phase, but the question remains: What should be done?

vinnydtm

12:18 am on Jan 24, 2009 (gmt 0)

10+ Year Member



To answer Caterham:

1) Yes, like to convert ISO to UTF-8 before passing to tomcat
2) Yes, the input will not be mixed.
3) Yes. When we use URIEncoding="UTF-8" and all input is UTF-8, that's fine. But if I need to do something in the app when the input is ISO, as jdMorgan suggests, then we seem to need URIEncoding="ISO-8859-1" to get the raw bytes to convert.
4) More than that. We provide this as an XML interface: the input URL is the request, and we return the output. We can change the output from ISO to UTF-8, since XML declares its encoding in the header. But requests from customer applications may still be in ISO, so we must support that.
There are also the URLs already indexed by Google and other search engines.

Caterham

3:16 pm on Jan 25, 2009 (gmt 0)

10+ Year Member



1) Yes, like to convert ISO to UTF-8 before passing to tomcat

OK, that can be done either via mod_perl or a small C module, both running in the translate_name phase. A piped external RewriteMap (prg:) is also possible, but I'd prefer one of the first two options.

I have such a mod_perl script in use that translates any input that is not pure ASCII to UTF-8, though for another reason (the underlying filesystem). The query string check can be hacked in easily, but currently the script cannot detect whether the input is already UTF-8, because the Perl function is_utf8 didn't work as desired in this case, as far as I can remember.
I.e. when the query string is missing, the script cannot currently detect whether there is a UTF-8 character, so the input would be double-encoded. I know that my input is never UTF-8, so that's not a problem in my environment.
It also wouldn't be a problem for you if you settle on "if the query string is missing, always encode" or "if the query string is missing, never encode", so that no detection is necessary.
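
A minimal sketch of that "no detection" rule in Perl, under stated assumptions: the helper name to_utf8_bytes is hypothetical, the path has already been unescaped to raw bytes, and ISO-8859-1 is picked as the default only for illustration (the thread leaves the default undecided).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: a request without a charset parameter is treated as ISO-8859-1.
my $DEFAULT_CHARSET = 'ISO-8859-1';

# Hypothetical helper: return the unescaped path bytes re-encoded as UTF-8.
# $charset is the value of the charset query parameter, if any.
sub to_utf8_bytes {
    my ($bytes, $charset) = @_;
    $charset = $DEFAULT_CHARSET unless defined $charset && length $charset;
    return $bytes if uc($charset) eq 'UTF-8';    # declared UTF-8: pass through
    # Decode from the declared charset, then encode the characters as UTF-8.
    return encode('UTF-8', decode($charset, $bytes));
}

my $from_iso  = to_utf8_bytes("Param\xE9", 'ISO-8859-1');
my $from_utf8 = to_utf8_bytes("Param\xC3\xA9", 'UTF-8');
my $from_none = to_utf8_bytes("Param\xE9");      # falls back to the default
```

Because the declared charset decides everything, a UTF-8 request is never double-encoded, but a client that sends the wrong charset value gets the wrong result.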

To replace the mod_perl script, I wrote a small C module, mod_translate_utf, which also acts in the translate_name phase but additionally provides two internal RewriteMap functions for mod_rewrite (int:latin1_to_utf8 and int:utf8_to_latin1). The module can detect whether the input contains UTF-8 characters, so there would be no need for a query string specifying the encoding. A few issues have to be resolved before this module can be used: apr_xlate_open doesn't return a convset for some reason, so the translation via apr_xlate doesn't work at the moment. I'll take a look at why it hangs.

[edited by: jdMorgan at 10:09 pm (utc) on Jan. 25, 2009]
[edit reason] Disabled spurious smiley-faces to clarify. [/edit]

vinnydtm

10:34 pm on Jan 28, 2009 (gmt 0)

10+ Year Member



Hi, after a bit more research, I found my solution using RewriteMap.

In httpd.conf, define the RewriteMap:


RewriteMap utf8 prg:conf/utf8map.pl
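
The post does not show the rule that calls the map. As a rough sketch only, a prg: map is referenced from a rule with the ${mapname:key} syntax; the pattern and substitution below are assumptions, and how the rewritten request is handed to Tomcat (mod_jk, proxy, flags) is not covered here.

```apache
RewriteEngine On
RewriteMap utf8 prg:conf/utf8map.pl

# Hypothetical rule: send the captured path to the map program
# and use whatever it prints as the new path.
RewriteRule ^/(.*)$ /${utf8:$1}
```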

The Perl code for utf8map.pl is as follows; it doesn't need any parameters. I simply check whether the input looks like UTF-8 and, if not, encode it.


#!/usr/bin/perl
use Encode;

$| = 1; # Turn off buffering

# mod_rewrite sends one lookup key per line and reads one result per line
while (<STDIN>) {
    chomp;
    eval {
        #-- check if UTF8, no change required
        decode_utf8(&URLDecode($_), Encode::FB_CROAK);
        print $_,"\n";
    };
    if ($@) {
        #-- not valid UTF-8: treat the bytes as ISO-8859-1 and re-encode as UTF-8
        print &URLEncode(encode_utf8(&URLDecode($_))),"\n";
    }
}

#-- from [rami.info...]
sub URLDecode {
    my $theURL = $_[0];
    $theURL =~ tr/+/ /;
    $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
    $theURL =~ s/<!.(.|\n)*.>//g;
    return $theURL;
}

#-- percent-encodes every non-word character, including "/"
sub URLEncode {
    my $theURL = $_[0];
    $theURL =~ s/([\W])/"%" . uc(sprintf("%2.2x",ord($1)))/eg;
    return $theURL;
}
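
The detection step used above can be isolated into a few lines, which makes its limits easier to see. This is a sketch, not part of the posted script; the name looks_like_utf8 is hypothetical. It reports whether a byte string is well-formed UTF-8, which is all the script can know: a byte string that happens to be well-formed UTF-8 is left alone even if the sender meant it as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true (1) when the byte string is well-formed UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8_path   = looks_like_utf8("Param\xC3\xA9/");   # UTF-8 bytes for "é"
my $latin1_path = looks_like_utf8("Param\xE9/");       # ISO-8859-1 byte for "é"
my $ambiguous   = looks_like_utf8("\xC3\xA9");         # also ISO-8859-1 for "Ã©"

print "$utf8_path $latin1_path $ambiguous\n";          # 1 0 1
```

The last case is the ambiguous one: the bytes C3 A9 are "é" in UTF-8 but also the two characters "Ã©" in ISO-8859-1, and the check cannot tell which was intended.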

Comments on improvement welcome.

- Vince

Caterham

3:04 pm on Feb 2, 2009 (gmt 0)

10+ Year Member



Why do you need to URL-unescape/escape the input/output? Are you working with double-escaped input like %25E9? The path your rule matches against and request_uri are already unescaped by the core (unless you have a forward-proxy request). If you do need it, use the internal escape and unescape map functions before passing the value to your map; C code is faster.

prg maps have some drawbacks. Never use them without a lock file (RewriteLock directive). Otherwise, if there are simultaneous requests, one request might get the map result of the other request. The lock file is created once at main server startup and restart and will be locked when mod_rewrite runs your prg via apr_global_mutex_lock. If the mutex is locked, other threads will wait until the lock becomes available again, so that they can lock and run the map.
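
As a sketch, the lock file is declared once at server level next to the map. The lock file path below is an assumption; any local path writable at server startup should do.

```apache
# Server config context only (not .htaccess or <Directory>).
# The path is an assumption; use a local file, not one on NFS.
RewriteLock /var/lock/apache2/rewrite.lock
RewriteMap  utf8 prg:conf/utf8map.pl
```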

A small C module is always preferable, of course, but I don't have the time to recompile apr-util with a local iconv library, and apr-iconv has some problems on my OS. Let me know if you'd like to test the C module.