Corrupted International Characters in Keyword Report (GA)

Forum Moderators: DixonJones

Nuttakorn

6:24 am on Jul 8, 2011 (gmt 0)

I have found two corrupted keywords among the top 20 traffic keywords from Baidu. Instead of the Chinese (Mandarin) keyword, the report shows the wrong international characters, like this: "ǽֽͼƭ". Is there any way to convert them back to Chinese? I searched Google support and saw that others reported a similar issue last year, so I thought Google might have solved it by now. Any advice is appreciated, thanks.

lucy24

6:54 am on Jul 8, 2011 (gmt 0)

What you've got, of course, is decimal HTML entities-- but the numbers are far too low for Chinese.

:: detour to converter [statman.info] so I can look them up in Character Viewer which is hexadecimal ::

1fd, 5bd, 37c, 1ad = UTF-8 c7bd, d6bd, cdbc, c6ad
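For anyone who wants to check the arithmetic, here is a minimal Java sketch of the same lookup. The decimal values 509, 1469, 892 and 429 are simply the hex code points above converted to decimal, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class CodePointBytes {
    /** Returns the UTF-8 encoding of one code point as lowercase hex. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Decimal entity values corresponding to 1fd, 5bd, 37c, 1ad.
        int[] decimalEntities = {509, 1469, 892, 429};
        for (int cp : decimalEntities) {
            System.out.println(cp + " = U+" + Integer.toHexString(cp)
                    + " = UTF-8 " + utf8Hex(cp));
        }
    }
}
```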

Can the Forums display this?

잽횽춼욭

Uhm, I guess not, but paste them into anything with an HTML preview and they'll jump right up again.

I know what's happening but I don't know how to fix it. It's interpreting UTF-16 as UTF-8 and then converting the pieces into decimal HTML entities. If you find a utility to do the retro-conversion (I can do the file-encoding part but not the HTML entities except by cut-and-paste) I want to hear about it.
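A rough sketch of such a retro-conversion, assuming the input really is a run of decimal entities like &#509; and that big-endian UTF-16 is the right guess for the original encoding. EntityRepair and its methods are hypothetical names, not an existing utility.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityRepair {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    /** Replaces decimal entities such as "&#509;" with the characters they name. */
    static String unescapeDecimalEntities(String text) {
        Matcher m = DECIMAL_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Encodes the text as UTF-8 and reads the same bytes back in the assumed charset. */
    static String reinterpret(String text, Charset assumedOriginal) {
        return new String(text.getBytes(StandardCharsets.UTF_8), assumedOriginal);
    }

    public static void main(String[] args) {
        String fromReport = "&#509;&#1469;&#892;&#429;";
        String garbled = unescapeDecimalEntities(fromReport);
        // UTF-16BE is the hypothesis from this post, not an established fact.
        System.out.println(reinterpret(garbled, StandardCharsets.UTF_16BE));
    }
}
```

The charset is a parameter on purpose, so a different guess about the original encoding can be tried by passing another Charset. With UTF-16BE the four code units all fall in the Hangul Syllables range (U+AC00 to U+D7A3).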

Edit: After looking at what happened to my Chinese characters I'm guessing that is what happened to yours too, and what you actually pasted in was a series of non-Roman, non-Chinese letters. (Which indeed look very strange because they're from all different unicode blocks!) That makes the conversion easier because all you need is a text editor or similar to do the reinterpreting.

Trying again with encoding set manually to UTF-8:

ǽֽͼƭ
잽횽춼욭

Nuttakorn

7:18 am on Jul 8, 2011 (gmt 0)

Actually, it looks like this in GA [dl.dropbox.com...]. Could you try converting this one into a meaningful word?

lucy24

8:15 am on Jul 8, 2011 (gmt 0)

Yup, that's what I got and what you posted before it got turned into entities. It's four letters but it looks like three because the second one is a Hebrew vowel.

Can you see the last two lines of my post? I had to set my browser's File Encoding manually to utf-8 to make the letters display. But it only applies to new text; entities don't change back.

Do the Chinese characters not mean anything? It's all, ahem, Chinese to me. So if your original site visitor searched for a nonsense phrase, there's not much we can do about it.

Nuttakorn

8:28 am on Jul 8, 2011 (gmt 0)

I can see only your second one. Actually, this has happened to me before and I mostly ignored it, but this time the keyword's traffic volume is quite significant, and we don't know whether it is a nonsense phrase or not.

lucy24

8:54 am on Jul 8, 2011 (gmt 0)

When I paste 잽횽춼욭 into g### and put a space between each one, I get 13,000 hits. Without spaces, nothing. But it looks as if they are all thematic lists of every character in the language, so it is just like searching for some random set of four short words in English.

Baidu can handle them without spaces, but it still seems to bring up garbage results. (I say this with extreme hesitation since I do not happen to know Chinese.) In fact a lot of the hits are not for text at all but for blocks of numbers that just happen to contain all four of those five-digit numbers, 51133 and so on.

Nuttakorn

5:51 am on Jul 11, 2011 (gmt 0)

Do you think we need to add this script to fix this and prevent future issues?

-----------------------------------------------
// Fragment: `request` and `isEmpty` are defined outside this snippet.
String documentReferer = request.getParameter("utmr");
if (isEmpty(documentReferer)) {
    documentReferer = "-";
} else {
    // documentReferer = URLDecoder.decode(documentReferer, "UTF-8");
    // fix: re-encode as ISO-8859-1 bytes, then decode those bytes as UTF-8
    documentReferer = new String(documentReferer.getBytes("ISO-8859-1"), "UTF-8");
}
String documentPath = request.getParameter("utmp");
if (isEmpty(documentPath)) {
    documentPath = "";
} else {
    // documentPath = URLDecoder.decode(documentPath, "UTF-8");
    // fix: same byte-level reinterpretation for the page path
    documentPath = new String(documentPath.getBytes("ISO-8859-1"), "UTF-8");
}
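
To see what the two "fix" lines do, here is a small self-contained sketch. It assumes the servlet container has already decoded UTF-8 request bytes as ISO-8859-1, which is the situation the fix is meant for. The class name and the sample character are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    /** What happens when raw UTF-8 bytes are wrongly decoded as ISO-8859-1. */
    static String misdecode(byte[] rawUtf8) {
        return new String(rawUtf8, StandardCharsets.ISO_8859_1);
    }

    /** Same operation as the "fix" lines, written with Charset constants. */
    static String fix(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u5899"; // an arbitrary CJK sample character
        String broken = misdecode(original.getBytes(StandardCharsets.UTF_8));
        // The round trip restores a value that was mis-decoded as Latin-1.
        System.out.println(fix(broken).equals(original));
        // Applied to an already-correct value, it damages the text instead.
        System.out.println(fix(original));
    }
}
```

Note the second case: if the parameter was decoded correctly to begin with, the same round trip turns every non-Latin-1 character into "?", so the fix only makes sense when the Latin-1 mis-decoding is known to be happening.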

lucy24

8:44 am on Jul 11, 2011 (gmt 0)

I dunno. Now you're speaking Hungarian ;)

What you've got here isn't the ordinary problem of Latin-1 being interpreted as UTF-8 or vice versa. It's UTF-16 being interpreted as UTF-8. So you'd need some way to extract the file-encoding information from the original-- or else write some fairly complicated routines to check for improbable juxtapositions, like a Phonetic Extension alongside a Hebrew vowel. (I think it is safe to say this would never occur in nature!)
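
A check for improbable juxtapositions does not have to be very complicated. Below is a minimal sketch that counts the distinct Unicode scripts in a keyword using Java's built-in Character.UnicodeScript. The class name and the threshold parameter are assumptions for illustration; a sensible threshold depends on the languages you expect, since some writing systems legitimately mix scripts.

```java
import java.util.EnumSet;
import java.util.Set;

final class ScriptMixCheck {
    /** Collects the scripts used in the text, ignoring script-neutral characters. */
    static Set<Character.UnicodeScript> scriptsIn(String text) {
        Set<Character.UnicodeScript> scripts =
                EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN;
            if (!neutral) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** Flags a keyword that mixes more scripts than the caller considers plausible. */
    static boolean looksGarbled(String keyword, int maxScripts) {
        return scriptsIn(keyword).size() > maxScripts;
    }

    public static void main(String[] args) {
        // The four characters from this thread: U+01FD, U+05BD, U+037C, U+01AD.
        String keyword = "\u01FD\u05BD\u037C\u01AD";
        System.out.println(scriptsIn(keyword));
        System.out.println(looksGarbled(keyword, 2));
    }
}
```

This only flags suspicious keywords; it says nothing about which encoding the text originally used.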

This particular group of characters can also be encoded in UTF-8, but I don't know if that's the case for the whole UTF-16 spectrum.
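
On that last point: every Unicode character can be written in either UTF-8 or UTF-16, but not every UTF-16 byte sequence happens to be valid UTF-8 when read byte for byte. A minimal sketch of a strict check, using only the standard java.nio.charset decoder (the class name is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validity {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The eight bytes discussed in this thread.
        byte[] fromThread = {
            (byte) 0xC7, (byte) 0xBD, (byte) 0xD6, (byte) 0xBD,
            (byte) 0xCD, (byte) 0xBC, (byte) 0xC6, (byte) 0xAD
        };
        // UTF-16BE bytes of an arbitrary CJK character, U+5899: 0x58 0x99.
        byte[] otherUtf16 = "\u5899".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(isValidUtf8(fromThread));
        System.out.println(isValidUtf8(otherUtf16));
    }
}
```

The eight bytes from this thread pass, while the UTF-16BE bytes of U+5899 do not, because 0x99 is a continuation byte with no lead byte in front of it.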