Spearmaster - 6:43 pm on Nov 3, 2002 (gmt 0)
Tokenizing Thai can be done, I suppose, if someone wanted to come up with a standard for tokenization. But the language is extremely complex and many words can be written in more than one way. There isn't even a proper standard used by the government for romanization - I've seen names, places, or streets written 3-4 different ways.

I have enough trouble trying to learn the alphabet - started three times, stopped three times - if only because I didn't really dedicate myself to the task. But as Hpche points out, there are a LOT of consonants and vowels to learn, not to mention tone symbols. A very small minority of these are considered obsolete, or not commonly used - you'd think they would remove them from the official alphabet, but no... LOL... so I depend on my kids or my staff to do my translation for me :)

Hpche, my mistake. Sometimes my fingers get ahead of my brain LOL. Two keystrokes can equal a word. One keystroke equals a letter, but one "position" in a display sequence may equal 2 letters (or 3 if I am not mistaken). This is what makes Thai printer drivers so essential.

You're also right about the URL encoding - it looked to me as if it represented a keystroke sequence, instead of upper ASCII. I'm so used to seeing URL encoding like %20 (space) etc. and completely forgot that upper ASCII would start at %80 and go up from there.

I would have thought some of the larger engines would have implemented some sort of substring matching by now. Thank goodness my keywords don't seem to be a major problem as far as false positives are concerned.
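
To make the display-position and URL-encoding points concrete, here is a rough Python sketch. It is only an illustration under a few assumptions: the sample word, the helper name `display_cells`, and the choice of TIS-620 as the single-byte Thai encoding are mine, not anything from the thread, and counting nonspacing marks is a simplification of how Thai actually stacks on screen.

```python
import unicodedata
from urllib.parse import quote


def display_cells(text: str) -> int:
    """Count display positions, assuming every nonspacing mark
    (Unicode category Mn) stacks on the previous letter."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")


# Hypothetical sample: consonant THO THAHAN + vowel SARA II + tone MAI EK.
word = "\u0e17\u0e35\u0e48"

print(len(word))            # 3 letters (three keystrokes)
print(display_cells(word))  # 1 display position

# Plain ASCII space is the familiar case.
print(quote(" "))                       # %20

# In a single-byte Thai encoding every letter is one byte at 0x80 or above.
print(quote(word, encoding="tis-620"))  # %B7%D5%E8

# The same word in UTF-8 takes three bytes per letter.
print(quote(word, encoding="utf-8"))    # %E0%B8%97%E0%B8%B5%E0%B9%88
```

So a run of %B7-style escapes is just the encoded bytes of the letters, one escape per letter in TIS-620, and says nothing about which keys were pressed.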
I should have said two bytes make a CJK word. Not necessarily two keystrokes - that will depend on the input system you use. And by word, I mean a character. Some "words" have more than one character, but each character in the word is a word by itself - if that makes any sense LOL.
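
A quick way to check the two-bytes-per-character claim is to encode a sample word and measure each character. This is a minimal sketch, assuming the common double-byte encodings Big5, GB2312 and Shift_JIS; the helper name `bytes_per_char` and the sample word are made up for the example.

```python
def bytes_per_char(text: str, encoding: str) -> list[int]:
    """Return the encoded byte length of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]


# Hypothetical sample: a two-character word (zhong + wen, "Chinese language").
word = "\u4e2d\u6587"

for enc in ("big5", "gb2312", "shift_jis"):
    # Each character takes two bytes in all three encodings.
    print(enc, bytes_per_char(word, enc))  # [2, 2]

# The whole two-character word is therefore four bytes.
print(len(word.encode("big5")))  # 4
```

The same characters take three bytes each in UTF-8, so the two-byte rule holds for these legacy encodings, not for every encoding.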