What does that look like?
The original URL (in ISO-8859-1) looks like, e.g., lösung.html
The correct encoding of that would be l%F6sung.html
Converted to UTF-8, msnbot requests l\xc3\xb6sung.html
Googlebot percent-escapes that as l%C3%B6sung.html
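The difference between the two encodings above can be reproduced in a few lines of Python. This is just a sketch of the encoding steps described here; the filename lösung.html is the example from the post:

```python
from urllib.parse import quote

# Example filename from the post: "lösung.html" (ö = U+00F6)
name = "l\u00f6sung.html"

# Percent-encoded against ISO-8859-1 (the page's charset),
# which is what the post considers correct:
print(quote(name, encoding="iso-8859-1"))  # l%F6sung.html

# The same name percent-encoded against UTF-8,
# which is what Googlebot requests:
print(quote(name, encoding="utf-8"))       # l%C3%B6sung.html

# The raw UTF-8 bytes, sent unescaped, as msnbot does:
print(name.encode("utf-8"))                # b'l\xc3\xb6sung.html'
```

So both big spiders transcode the byte 0xF6 to the UTF-8 sequence 0xC3 0xB6 before requesting; they differ only in whether they percent-escape those bytes.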
I don't think the HTTP protocol allows user agents to fiddle with the charset of a URL. Consequently, both converted requests look very wrong to me. I see some small spiders getting this right without a problem, so the big two seem to be outsmarting themselves here.
I'm still pondering ways to work around this. Unfortunately, using only ASCII characters is not practical in my situation. As far as I understand, encoding the critical characters as %xx myself would be wrong as well, because the user agent is then supposed to encode the % again. Any other ideas?
Of course, in an ideal world, the engines would just fix their spiders. Since I don't know why they're converting to Unicode in the first place, and how deeply this conversion is rooted in their system architecture, I have no idea how hard or easy that would be. Any comments about this from the representatives of those two engines?