We have a site supporting only ISO-8859-1 encoding. So we get input like this:
http://example.com/Param%E9/
(%E9 is an "e" acute, URL-encoded as ISO-8859-1.)
Now we want to support UTF-8 while continuing to support the existing ISO encoding, i.e. these should all work:
http://example.com/Param%E9/?charset=ISO-8859-1
http://example.com/Param%C3%A9/?charset=UTF-8
http://example.com/Param%C3%A9/ <- default if charset is not defined (we haven't decided which encoding to default to).
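To make the byte-level difference concrete, here is a minimal Perl sketch that percent-encodes the same character under both encodings. The helper name percent_encode is made up for illustration and is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode every byte outside the unreserved set.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

my $e_acute = "\x{E9}";   # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# One byte in ISO-8859-1, two bytes in UTF-8.
print percent_encode(encode('ISO-8859-1', $e_acute)), "\n";   # %E9
print percent_encode(encode('UTF-8',      $e_acute)), "\n";   # %C3%A9
```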
So, the question is simply: can I do this in Apache, possibly with rewrites?
The architecture we have is an Apache web server front end connected to Tomcat, which runs our application in the Webwork framework.
We have tried setting URIEncoding="UTF-8" on the Apache/Tomcat connector, along with other parameters that make the site fully consistent with UTF-8 encoding. But the tricky part is that we still need to support "input" URLs with ISO encoding.
One option we are considering is to keep the input and URIEncoding as ISO and do the conversion in the application. But that seems a little unsatisfying.
Thanks for any insights!
-- Vince
As for the output side, you can check the "Accept-Charset" header in the client request to determine which encoding to use when sending your responses.
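A rough sketch of that check in Perl, limited to the two encodings discussed here. The function name pick_charset and the tie-breaking rule are assumptions for illustration. Many clients send no Accept-Charset header at all, so a site default is still needed.

```perl
use strict;
use warnings;

# Hypothetical helper: choose between UTF-8 and ISO-8859-1 for the response,
# based on the q-values in the client's Accept-Charset header.
sub pick_charset {
    my ($header, $default) = @_;
    return $default unless defined $header && $header =~ /\S/;

    my %q;
    for my $item (split /,/, $header) {
        $item =~ s/\s+//g;                       # drop optional whitespace
        my ($name, @params) = split /;/, $item;
        next unless defined $name && length $name;
        my $q = 1;                               # q defaults to 1 when omitted
        for my $param (@params) {
            $q = $1 if $param =~ /^q=([01](?:\.\d{0,3})?)$/i;
        }
        $q{lc $name} = $q;
    }

    # RFC 2616: "*" covers unlisted charsets; without "*", ISO-8859-1 counts as q=1.
    my $star   = exists($q{'*'}) ? $q{'*'} : undef;
    my $utf8   = exists($q{'utf-8'})      ? $q{'utf-8'}      : (defined($star) ? $star : 0);
    my $latin1 = exists($q{'iso-8859-1'}) ? $q{'iso-8859-1'} : (defined($star) ? $star : 1);

    return $default if $utf8 == $latin1;         # tie: keep the site's default
    return $utf8 > $latin1 ? 'UTF-8' : 'ISO-8859-1';
}

print pick_charset('utf-8, iso-8859-1;q=0.5', 'ISO-8859-1'), "\n";   # UTF-8
```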
Jim
So, the question is simply: can I do this in the Apache, possibly with rewrites?
default if charset not defined (we haven't decided which to default).
We have tried setting the Apache/Tomcat connector with URIEncoding="UTF-8"
But the tricky part is we still need to support the "input" URL with ISO encoding.
and do the conversion in the application. But that seems a little unsatisfying.
1) Yes, we'd like to convert ISO to UTF-8 before passing the request to Tomcat.
2) Yes, the input will not be mixed.
3) Yes. When we use URIEncoding="UTF-8" and all input is UTF-8, that's fine. But if I need to do something in the app when the input is ISO, as jdMorgan suggests, then we seem to need URIEncoding="ISO-8859-1" to get at the bytes to convert.
4) More than that. We provide this as an XML interface: the input URL is the request, and we return XML output. We can change the output from ISO to UTF-8, since XML declares its encoding in the header. But requests from customer applications may still be in ISO, so we must support that.
There are also the Google and other search-engine indexes to consider.
1) Yes, like to convert ISO to UTF-8 before passing to tomcat
I have such a script for mod_perl in use. It translates any input that is not pure ASCII to UTF-8, though for a different reason (the underlying filesystem). The query-string check can be hacked in easily, but currently the script cannot detect whether the input is already UTF-8, because the Perl function is_utf8 didn't work as desired in this case, as far as I can remember.
I.e., when the query string is missing, the script currently cannot detect whether the input already contains UTF-8 characters, so such input would be double-encoded. I know that my input is never UTF-8, so that's not a problem in my environment.
It wouldn't be a problem for you either if you settle on a rule such as "if the query string is missing, always encode" or "if the query string is missing, never encode", so that no detection is necessary.
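A fixed rule like that could look roughly as follows. This is only a sketch: the helper name path_to_utf8 and the chosen default are assumptions, and it expects the path to be percent-decoded already.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the encoding applied when no charset parameter is present.
my $DEFAULT_CHARSET = 'ISO-8859-1';

# Hypothetical helper: take the percent-decoded path bytes plus the optional
# charset value and return the path as UTF-8 bytes. No detection is involved;
# the charset parameter (or the default) decides.
sub path_to_utf8 {
    my ($path_bytes, $charset) = @_;
    $charset = $DEFAULT_CHARSET unless defined $charset && length $charset;
    return $path_bytes if uc($charset) eq 'UTF-8';            # already UTF-8
    # decode() croaks on an unknown encoding name, so validate $charset upstream.
    return encode('UTF-8', decode($charset, $path_bytes));
}

# "Param" + e-acute in ISO-8859-1 becomes "Param" + C3 A9.
print path_to_utf8("Param\xE9", undef) eq "Param\xC3\xA9" ? "ok\n" : "not ok\n";
```

With the default set to ISO-8859-1, a request without the charset parameter is always converted; switching the default to UTF-8 gives the "never encode" rule.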
To replace the mod_perl script I wrote a small C module, mod_translate_utf, which also acts in the translate_name phase but additionally provides two internal RewriteMap functions for mod_rewrite (int:latin1_to_utf8 and int:utf8_to_latin1). The module can detect whether the input contains UTF-8 characters, so there would be no need for a query string specifying the encoding. A few issues have to be resolved before this module can be used: apr_xlate_open doesn't return a convset for some reason, so the translation via apr_xlate doesn't work at the moment. I'll take a look at why it hangs.
[edited by: jdMorgan at 10:09 pm (utc) on Jan. 25, 2009]
In httpd.conf, define the RewriteMap:
RewriteMap utf8 prg:conf/utf8map.pl
The Perl code for utf8map.pl is as follows; it doesn't need any parameters. I simply check whether the input looks like UTF-8 and, if not, encode it.
#!/usr/bin/perl
use Encode;

$| = 1; # Turn off output buffering

# Read one lookup key per line and print exactly one result line for each.
while (<STDIN>) {
    chomp;
    eval {
        #-- check if UTF8, no change required
        decode_utf8(&URLDecode($_), Encode::FB_CROAK);
        print $_, "\n";
    };
    if ($@) {
        #-- not valid UTF-8: treat the bytes as ISO-8859-1 and re-encode as UTF-8
        print &URLEncode(encode_utf8(&URLDecode($_))), "\n";
    }
}

#-- from [rami.info...]
sub URLDecode {
    my $theURL = $_[0];
    $theURL =~ tr/+/ /;
    $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
    $theURL =~ s/<!.(.|\n)*.>//g;
    return $theURL;
}

sub URLEncode {
    my $theURL = $_[0];
    $theURL =~ s/([\W])/"%" . uc(sprintf("%2.2x",ord($1)))/eg;
    return $theURL;
}
Comments on improvement welcome.
- Vince
prg maps have some drawbacks. Never use them without a lock file (RewriteLock directive). Otherwise, if there are simultaneous requests, one request might get the map result of another. The lock file is created once at main server startup and restart, and mod_rewrite locks it via apr_global_mutex_lock whenever it runs your prg. While the mutex is locked, other threads wait until the lock becomes available again, and then lock it and run the map themselves.
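For the map defined earlier in this thread, that would be something along these lines. The lock file path is only an assumption; pick any location the server can write to. RewriteLock belongs in the main server config, not inside a virtual host, and it applies to Apache 2.2 and earlier (2.4 replaced it with the Mutex directive).

```apache
# Main server config, outside any <VirtualHost>.
# Assumption: the lock file path below is a placeholder.
RewriteLock /var/lock/apache2/rewrite.lock
RewriteMap  utf8 prg:conf/utf8map.pl
```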
A small C module is always preferable, of course, but I don't have the time to recompile apr-util with a local iconv library, and apr-iconv has some problems on my OS. Let me know if you'd like to test the C module.