---- Searching in Asian languages


hpche - 5:10 pm on Nov 3, 2002 (gmt 0)


Thai is like English without the spacing - a string of letters.

Yes, this is true: each keystroke in Thai produces a separate character that is (usually) either a consonant, a vowel or a tone mark. A character by itself is not a word. I probably should have mentioned this in the first place :)

Two or more keystrokes in Thai equal a letter, and NOT a word.

Well, this is not really true: there are a lot of words made with only two keystrokes, and likewise letters made with only one keystroke.
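To make the keystroke point concrete, here is a small Python sketch that lists the characters in a two-character Thai word. It assumes one keystroke corresponds to one Unicode code point, which is a simplification, and `describe` is just a hypothetical helper name.

```python
import unicodedata

def describe(text):
    """Return (character, Unicode name) pairs, one per code point."""
    return [(ch, unicodedata.name(ch)) for ch in text]

# "ดี" ("good") is a complete word made of two characters:
# one consonant followed by one vowel sign.
word = "\u0e14\u0e35"

for ch, name in describe(word):
    print(f"U+{ord(ch):04X}  {name}")
```

So a two-character string can be a whole word, while a single consonant typed on its own is already a complete letter.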

I took a closer look at the source and discovered that a search was being sent to the script like this: %bd, where the % serves as an escape character and bd represents the actual keystrokes. This would indicate that its results are all served up (or stored) in this format, and thus eliminate the issue of pattern matching.


This is just URL encoding. It can be (and is) easily changed back to the unencoded form, and the searches are matched on the unencoded form, so pattern matching and parsing a word from a string are definitely possible.
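As a rough illustration of how reversible this is, here is a minimal Python sketch. It assumes the site sends Thai as single bytes in TIS-620; the thread does not say which encoding the script really used, so treat that as a guess. `decode_query` and `encode_query` are hypothetical helper names.

```python
from urllib.parse import quote, unquote

# Assumption: the search form uses a single-byte Thai encoding (TIS-620).
# The actual encoding used by the site is not stated in the thread.
ASSUMED_ENCODING = "tis_620"

def decode_query(raw):
    """Turn a percent-encoded query string back into readable text."""
    return unquote(raw, encoding=ASSUMED_ENCODING)

def encode_query(text):
    """Percent-encode text the way a browser would submit it."""
    return quote(text, encoding=ASSUMED_ENCODING)

decoded = decode_query("%bd")
print(decoded, f"U+{ord(decoded):04X}")  # a single Thai consonant
```

Under that assumption, %bd is one byte standing for one character, and once it is decoded the search script is working with ordinary text again.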

The reason why simple substring matches are not so useful is that you get too many false positives.


I guess so; however, the results without substring matching (as Google and others do at present) are also very poor. I'd also think that the type of false positives you mention is much more likely to occur in English, with 'only' 26 characters, than in Thai, with its 44 consonant characters, 15 vowel characters, 4 tone marks and a few other symbols besides.
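Here is a small Python sketch of the trade-off, not how any real search engine works. The sample sentences and the two helper functions are made up for illustration. It compares matching on whitespace-separated tokens with plain substring matching.

```python
def token_match(query, text):
    """True if query equals one of the whitespace-separated tokens."""
    return query in text.split()

def substring_match(query, text):
    """True if query occurs anywhere inside text."""
    return query in text

# English: substring matching gives a false positive for "cat".
english = "please concatenate the strings"

# Thai: a sentence is written without spaces, so the whole sentence
# is a single "token" and token matching never finds the word.
thai_query = "ภาษาไทย"  # "Thai language"
thai = "ฉันเรียน" + thai_query + "ทุกวัน"  # "I study Thai every day"

print(token_match("cat", english), substring_match("cat", english))
print(token_match(thai_query, thai), substring_match(thai_query, thai))
```

For the English sentence only the substring search hits (wrongly), while for the Thai sentence the substring search is the only one that finds the word at all.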

Dedicated Thai search engines (thaiseek.com, siamguru.com) do use substring matching, and it would be nice if there were some way that Google and others could do the same for Thai searches, and for other languages that have the same problem (if there are any?). Maybe they're not even aware of it. I should send them a link to this thread :)


Thread source: http://www.webmasterworld.com/asia_pacific_search_engines/284.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com