What does that look like?
The original URL (in ISO-8859-1) looks like, e.g., lösung.html
The correct encoding of that would be l%F6sung.html
Converted to UTF-8, msnbot requests l\xc3\xb6sung.html
Googlebot percent-escapes that as l%C3%B6sung.html
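The difference between the two encodings above can be reproduced in a few lines of Python. This is just a sketch of the encoding steps described here; the filename lösung.html is the example from the post:

```python
from urllib.parse import quote

# Example filename from the post: "lösung.html" (ö = U+00F6)
name = "l\u00f6sung.html"

# Percent-encoded against ISO-8859-1 (the page's charset),
# which is what the post considers correct:
print(quote(name, encoding="iso-8859-1"))  # l%F6sung.html

# The same name percent-encoded against UTF-8,
# which is what Googlebot requests:
print(quote(name, encoding="utf-8"))       # l%C3%B6sung.html

# The raw UTF-8 bytes, sent unescaped, as msnbot does:
print(name.encode("utf-8"))                # b'l\xc3\xb6sung.html'
```

So both big spiders transcode the byte 0xF6 to the UTF-8 sequence 0xC3 0xB6 before requesting; they differ only in whether they percent-escape those bytes.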
I don't think the HTTP protocol allows user agents to fiddle with the charset of a URL. Consequently, both converted requests look very wrong to me. I see some small spiders getting this right without a problem, so the big two seem to be outsmarting themselves here.
I'm still pondering ways to work around this. Unfortunately, using only ASCII characters is not practical in my situation. As far as I understand, encoding the critical characters as %xx myself would be wrong as well, because the user agent is then supposed to encode the % again. Any other ideas?
Of course, in an ideal world, the engines would just fix their spiders. Since I don't know why they're converting to Unicode in the first place, and how deeply this conversion is rooted in their system architecture, I have no idea how hard or easy that would be. Any comments about this from the representatives of those two engines?