Searching in Asian languages
How to get good results when there are no spaces between words?
hpche
msg:800525
11:49 pm on Oct 17, 2002 (gmt 0)

This occurs for me when searching in Thai, but I think it would probably apply to Chinese and Japanese as well, and maybe to others like Arabic; I'm not sure.

The problem I'm finding is that search engines match search terms based on the spaces between words. That's fine for most languages, but it's a bit of a problem in a language like Thai, where spaces are used mainly at the end of a sentence and there are no spaces between the individual words within a sentence. You can therefore only get really accurate matches if your search term is basically a sentence in itself. I've looked at the Google and Alltheweb advanced search pages and there doesn't seem to be any way to search while disregarding spaces.

I can still sometimes get reasonable results from Google even so, but it's much less accurate than English-language searching, and I'm sure it's because of this problem with the spaces. Has anyone else encountered this, and do you know of any way to get round it?

Thanks in advance.

 

bill
msg:800526
7:40 am on Oct 18, 2002 (gmt 0)

Welcome to WebmasterWorld hpche.

The problem I'm finding is that search engines match search terms based on the spaces between words. <snip> You can therefore only get really accurate matches if your search term is basically a sentence in itself.

I'm not sure I follow you exactly here hpche, but I know that there aren't spaces in Japanese or Chinese text either, so native searchers know that they need to add a space between keywords when searching. Generally I've found that searching for whole sentences in Japanese comes up with poor results. If I add spaces, then engines like Google and AlltheWeb can see where to parse the keywords.

I can still sometimes get reasonable results from Google even so, but it's much less accurate than English-language searching, and I'm sure it's because of this problem with the spaces.

I'm not sure the spaces are your problem. I usually get better results with spaces. Maybe you could tell us a bit more about why this is causing you trouble?

Terje
msg:800527
5:05 pm on Oct 25, 2002 (gmt 0)

Most of the major engines do tokenize Chinese, Korean and Japanese. That is, they split continuous text into smaller segments (tokens) for searching. I have no idea about Thai, however.

N-gram tokenization used to be the normal approach. Today most engines use morphological analysers, although n-grams are still useful in certain cases.
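
For example, a character bigram (n = 2) tokenizer splits continuous text into overlapping two-character tokens, roughly like this minimal Python sketch (an illustration only, not any real engine's code):

# Minimal sketch of character bigram (n=2) tokenization, the older
# n-gram approach mentioned above. Not any real engine's code.
def bigrams(text):
    # Produce overlapping two-character tokens from continuous text.
    return [text[i:i+2] for i in range(len(text) - 1)]

# "mysearchengine" -> ['my', 'ys', 'se', ...]; both the query and the
# indexed page get split the same way, so matching works without spaces.
print(bigrams("mysearchengine"))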

Some advice on searching:
- For complex languages (for instance Russian) or continuous-text languages such as Japanese: if there is a language selector on the search frontend, try setting it to the language you are searching in.

Even if this means that you limit your search to pages in that language, it also often means that you can be sure special processing for that language is enabled. For instance, this might be needed to make sure that Japanese tokenization is switched on.

This is an advantage of using local portals as these should typically have such processing enabled for all queries.

- Continuous text segments are handled in different ways by different search engines, even if they split the text the same way.

Some will handle "mysearchengine" as a phrase search for "my search engine". Others might handle it as an AND type search: my AND search AND engine.
Still others might handle it as an OR type search.

Each approach has its merits, but to be sure that you get the effect you want, splitting terms with spaces is indeed good advice if AND or OR type searches are what you want (a toy sketch of the three interpretations follows at the end of this post).

Phrase search might be a different issue. I would recommend testing both simply writing the text as a continuous segment, and writing it as a phrase with spaces between the words.

However, even if you split the various search terms with spaces, you should still set the language for these queries, as the search engine might do further processing on the terms you have entered.

For the best results, it is important that the query and the indexed data are processed in the same way.
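
Here is that sketch (a toy Python illustration of the three query interpretations above; hypothetical, not any real engine's API):

# Toy illustration of the three ways an engine might treat a continuous
# segment once it is split into tokens. Hypothetical example only.
def matches(page_tokens, query_tokens, mode):
    page = set(page_tokens)
    if mode == "AND":
        return all(t in page for t in query_tokens)
    if mode == "OR":
        return any(t in page for t in query_tokens)
    if mode == "phrase":  # tokens must appear adjacent and in order
        n = len(query_tokens)
        return any(page_tokens[i:i+n] == query_tokens
                   for i in range(len(page_tokens) - n + 1))

page = ["my", "favourite", "search", "engine"]
query = ["my", "search", "engine"]
print(matches(page, query, "AND"))     # True
print(matches(page, query, "OR"))      # True
print(matches(page, query, "phrase"))  # False - tokens are not adjacent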

hpche
msg:800528
6:00 pm on Oct 25, 2002 (gmt 0)

Thanks for the replies.

Okay, what I meant was: if I searched for the term 'World Cup' and the (theoretically) most relevant sites all had 'Fifa World Cup' or 'World Cup 2002' in the title, it wouldn't find them, as it would effectively be trying to match the word 'worldcup' against the word 'fifaworldcup'. Adding spaces between the keywords wouldn't help much, as it would then be trying to match 'world cup' against 'fifaworldcup', which would still fail. At least for Thai, putting spaces between the words is an unnatural writing style; it'd be like putting full stops after every word in English. If you forced a search engine to search in English for "World.Cup", including the full stop, you'd doubtless still get some relevant results, but they're not likely to be the best ones.

The results are even worse when searching for, say, 5 words at a time. Unless the words make an entire sentence by themselves, it almost always produces no results. If you put spaces between them, it basically looks for 5 one-word sentences, which also doesn't give good results.

I'm not sure I explained all that very well, but I hope you get what I mean.
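
To make that concrete (a toy Python sketch; the English strings are just placeholders for the unsegmented Thai text):

# Toy sketch of the problem above, using placeholder strings for the
# unsegmented Thai text: whole-token matching fails where substring
# matching would succeed.
page_title = "fifaworldcup"   # stands in for an unsegmented Thai title
query = "worldcup"

# Whole-token matching, as the engines do it: no hit.
print(query == page_title)          # False
print(query in page_title.split())  # False - no spaces to split on

# Substring matching, which is what I'm asking for: hit.
print(query in page_title)          # True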

It would be nice if there was a way for the search to match (from a search engine's point of view) just part of a single word instead of entire words, which would certainly make it easier. Yahoo does it for directory matches, e.g. [search.yahoo.com...], but a similar search on Google produces no results.

Thanks for the suggestions Terje, I'll try them out.

bill
msg:800529
7:00 am on Oct 26, 2002 (gmt 0)

Welcome to WebmasterWorld Terje, and thanks for that explanation of tokenization. I wasn't aware of the technical aspects of that part of the search process. As for searching for continuous text segments, I had assumed that most engines used an AND type method when returning results for searches with spaces added.

hpche, I think your problems may be specific to Thai, or to bi-directional languages like Arabic & Hebrew. From what I've read, Thai is a special case. I am unfamiliar with those types of language searches. With Japanese, Chinese and Korean language searches, an effective way to search for keywords is indeed to add spaces, even though this is an unnatural way to write these languages. I generally find phrase searching in Japanese ineffective on most engines. Maybe there are some aspects of Thai or the bi-directional languages that require searchers to enter something else that would return better keyword searches?

Spearmaster
msg:800530
12:40 pm on Nov 1, 2002 (gmt 0)

There's a significant difference between Chinese/Japanese/Korean and Thai.

Characters in the first three languages are words. They are easily "tokenized".

Characters in Thai are letters of the Thai alphabet. Because of the Thai writing style, these are sent out literally as letters in a single stream, soathaicanunderstandthisbutweforeignerscan't.

A search in Thai, therefore, would simply be pattern matching and would not take spacing into account. The only disadvantage I can see immediately is if for some reason the phrase was physically broken up, if only for line-spacing reasons.

A search in Thai, therefore, should be no different from an English-language search, except that you would not put spaces between words - short phrases recommended. In the other three languages, it is still pattern matching, but each word requires only two bytes.

Strangely enough, if you were to type in a series of keystrokes in upper ASCII that created a pattern match in more than one language, you'd get results for both (in engines like Google, that is), and there is no way to differentiate the two languages except by reading the characters in the description :)

Terje
msg:800531
5:49 pm on Nov 1, 2002 (gmt 0)

Easily? :)

Not sure what you mean, but none of Chinese, Korean or Japanese has spaces or other ways to separate words.

They all write "soathaicanunderstandthisbutweforeignerscan't".

Japanese is most certainly the most awkward of them all, due to the 4 scripts in use and the way all 4 scripts can be mixed practically as you wish, even within words (this means you can write practically any word in at least 4 ways), as well as the irregularity of the language itself.

The highest accuracy numbers from vendors(!) of Japanese analyzers are at about 96%, and you can bet they are most likely inflated.

The reason why simple substring matches are not so useful is that you get too many false positives.

Take English: I just went to brush my teeth and I looked at a Gillette shaving foam container to find some examples (they're never in your head when you need them...).

It says (braces {} group the possible substrings that would match):
Advanced lubric{ants} an{d emo}llients for unsur{passed} raz{or g}lide. Helps revitalize sensitive skin. Gives you a closer, smo{other} mor{e com}for{t{able}} s{have}.

(Remember the spaces don't mean anything. We are writing continuous text here without spaces, so they should really have been removed.)

As you can see, in a short, randomly picked text, a substring search would hit on "ants", "demo", "passed", "org", "other", "ecom", "table", "able" and "have", without any relevance to those search terms.

I probably missed a few there as well.
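
A quick way to check this (a toy Python sketch of naive substring search over that text, with the spaces stripped out):

# Quick check of the false-positive point: naive substring search over
# the shaving-foam text above, spaces removed to mimic continuous text.
text = ("Advanced lubricants and emollients for unsurpassed razor glide. "
        "Helps revitalize sensitive skin. Gives you a closer, smoother "
        "more comfortable shave.").lower().replace(" ", "")

for term in ["ants", "demo", "passed", "org", "other",
             "ecom", "table", "able", "have"]:
    print(term, term in text)  # every one hits, none is relevant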

That's why most search engines have gone from n-grams to morphological analyzers.

Spearmaster
msg:800532
6:51 pm on Nov 1, 2002 (gmt 0)

Like I mentioned - it's not the spacing.

All are double-byte. However, two keystrokes in Japanese, Chinese or Korean (CJK) equal a word.

Two or more keystrokes in Thai equal a letter, NOT a word. Otherwise we could simplify the input system to CJKT instead of CJK :) Vowels in Thai are modifiers of consonants, somewhat similar to French accents. So a single "letter" could conceivably take four keystrokes.

Simply put, both rely on patterns for searching. But it will actually be more difficult to match a Thai word - or part thereof - than a CJK two-byte word, simply because there are many more keystrokes involved.

Thai is like English without the spacing - a string of letters. CJK is a string of words.

CJK is wordwordword. Thai is abcdefgabcdefg etc.

bill
msg:800533
7:12 am on Nov 2, 2002 (gmt 0)

The highest accuracy numbers from vendors(!) of Japanese analyzers are at about 96%, and you can bet they are most likely inflated.

Terje, could you explain this statement a bit? Are Japanese text analyzers simply not that effective, due to the complexity of the language?

Thai is like English without the spacing - a string of letters. CJK is a string of words.

Spearmaster, your explanation clears a lot up for me. I had always read that Thai was a special case, but was never sure why.

Spearmaster
msg:800534
8:01 am on Nov 3, 2002 (gmt 0)

Having said that LOL... some people in Thailand are still working to make Thai-language searching better.

By the way - the problems that exist in searching also affect simple things such as display and printing! Printing in Thai (or CJK) without the correct printer driver is like trying to print PostScript without a PostScript driver... LOL...

bill
msg:800535
1:10 pm on Nov 3, 2002 (gmt 0)

I'm still not sure why they can't parse a word out of a string of text like they can in CJK. Logically it seems to me that it would work the same way, but I guess there are still a few things I don't know about the Thai language... understatement!

Spearmaster
msg:800536
3:30 pm on Nov 3, 2002 (gmt 0)

I haven't done my site in Thai yet, so I haven't had to do much testing. But strangely enough, while I was doing some checks in OpenFind (Taiwan) just now, I took a closer look at the source and discovered that a search was being sent to the script like this: %bd - where the % serves as an escape character and bd represents the actual keystrokes.

This would indicate that its results are all served up (or stored) in this format, thus eliminating the issue of pattern matching. I haven't checked Yahoo or Lycos or similar engines yet, but it wouldn't surprise me to find out that they were using a similar technique.

Thus, a string of characters (or keystrokes) would automatically eliminate any issues with spaces, since I presume spaces would be automatically discarded from the query term if they existed.

Now I wonder how that would work in Thai... I can't test that as my own machine does not have Thai installed, but it would be a simple task for me to find out elsewhere.

Terje
msg:800537
4:17 pm on Nov 3, 2002 (gmt 0)

Yes Bill, the irregularities in how Japanese is written do indeed reduce accuracy vs. Korean and Chinese.

Spearmaster, I have no idea what you are talking about. Two keystrokes make a word in CJK? Words in CJK most definitely aren't written with two keystrokes, and neither are all of them made from two characters.

I don't really see how the number of keystrokes has anything to do with tokenization or why Thai is really any more difficult to tokenize.

Examples please?

hpche
msg:800538
5:10 pm on Nov 3, 2002 (gmt 0)

Thai is like English without the spacing - a string of letters.

Yes, this is true: each keystroke in Thai makes a separate character, which is (usually) either a consonant, a vowel or a tone mark. A character by itself is not a word. I probably should have mentioned this in the first place :)

Two or more keystrokes in Thai equals a letter, and NOT a word.

Well, this is not really true; there are a lot of words made with only two keystrokes, and likewise letters made with only one keystroke.

I took a closer look at the source and discovered that a search was being sent to the script like this: %bd - where the % serves as an escape character and bd represents the actual keystrokes. This would indicate that its results are all served up (or stored) in this format, thus eliminating the issue of pattern matching


This is just URL encoding; it can be (and is) easily changed back to the unencoded form, and searches are matched on the unencoded form, so pattern matching and parsing a word from a string are definitely possible.
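
For example (a minimal Python sketch; UTF-8 is assumed here, though an engine of this era would more likely use a legacy encoding such as TIS-620 for Thai):

# URL percent-encoding is reversible, so it says nothing about how the
# engine matches text. UTF-8 assumed; the actual byte values depend on
# the encoding the engine uses.
from urllib.parse import quote, unquote

word = "ไทย"  # "Thai" in Thai script
encoded = quote(word)            # '%E0%B9%84%E0%B8%97%E0%B8%A2'
print(encoded)
print(unquote(encoded) == word)  # True - the original text comes straight back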

The reason why simple substring matches are not so useful is that you get too many false positives.


I guess, but the results without substring matching (as Google etc. do it presently) are also very poor. I'd also think that the kind of false-positive examples you mention are much more likely to occur in English, with 'only' 26 characters, than in Thai, with its 44 consonant characters, 15 vowel characters, 4 tone marks and a few other symbols besides.

Dedicated Thai search engines (thaiseek.com, siamguru.com) do use substring matching, and it would be nice if there was some way that Google and others could do the same for Thai searches, and for other languages that have the same problem (if there are any?). Maybe they're not even aware of it; I should send them a link to this thread :)

Spearmaster
msg:800539
6:43 pm on Nov 3, 2002 (gmt 0)

I should have said two bytes make a CJK word - not necessarily two keystrokes; that will depend on the input system you use. And by word, I mean a character. Some "words" have more than one character, but each character in the word is a word by itself - if that makes any sense LOL.

Tokenizing Thai can be done, I suppose, if someone wanted to come up with a standard for tokenization. But the language is extremely complex and many words can be written in more than one way. There isn't even a proper standard used by the government for romanization - I've seen names of places or streets written 3-4 different ways.

I have enough trouble trying to learn the alphabet - started three times, stopped three times - if only because I didn't really dedicate myself to the task. But as hpche points out, there are a LOT of consonants and vowels to learn, not to mention the tone symbols. A very small minority of these are considered obsolete or not commonly used - you'd think they would remove them from the official alphabet, but no... LOL... so I depend on my kids or my staff to do my translation for me :)

hpche, my mistake. Sometimes my fingers get ahead of my brain LOL. Two keystrokes can equal a word. One keystroke equals a letter, but one "position" in a display sequence may equal 2 letters (or 3, if I am not mistaken). This is what makes Thai printer drivers so essential.

You're also right about the URL encoding - it looked to me as if it represented a keystroke sequence instead of upper ASCII. I'm so used to seeing URL encoding like %20 (space) etc. that I completely forgot upper ASCII would start at %80 and above.

I would have thought some of the larger engines would have implemented some sort of substring matching by now. Thank goodness my keywords don't seem to be a major problem as far as false positives are concerned.

Diklee
msg:800540
3:15 am on Nov 6, 2002 (gmt 0)

In addition to the discussion of tokenization, here are two possible explanations for the observation that Google is not doing as well for Asian languages as it does for English:

1) Taking Chinese as an example, Google might have fewer pages in Chinese (same for Japanese, Thai, etc.) than in English. I haven't seen any figures about the breakdown of the number of pages in Google's index by language.

2) Google's ranking depends heavily on link analysis. In Chinese, for example, I don't think there are enough web sites with enough links to each other to form a trustworthy "reference network". My observation is that there are a large number of very large sites in China, but they rarely point to each other. Pages in the Western world, for obvious reasons, rarely link to Chinese pages. In this situation, Google's ranking would be largely dependent on keyword matching.

By the way, Google (as well as Fast) uses Basis Technology (Boston) to do linguistic analysis on web pages, and perhaps on queries too. It means that given a Chinese character string "abcde", it will try to separate it into words, such as "ab cde". As such, the accuracy of the search depends on how accurate the breakdown is.
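
To illustrate one common way such a breakdown can be done, here is a minimal Python sketch of greedy longest-match dictionary segmentation (an illustration only; this is not Basis Technology's actual analyzer):

# Minimal sketch of greedy longest-match dictionary segmentation, one
# common way a string like "abcde" gets split into words. Illustration
# only - not Basis Technology's actual method.
def segment(text, dictionary):
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone and move on.
            words.append(text[i])
            i += 1
    return words

# With a toy dictionary, "abcde" splits into "ab cde" as described above.
# A wrong or incomplete dictionary gives a wrong split - which is exactly
# why search accuracy depends on the accuracy of the breakdown.
print(segment("abcde", {"ab", "cde"}))  # ['ab', 'cde']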

Woz
msg:800541
3:58 am on Nov 6, 2002 (gmt 0)

Welcome to WebmasterWorld Diklee (Ley Ho)

You raise some interesting points on the ratio of Native Language pages to English pages.

>I haven't seen any figures about the breakdown of the number of pages in Google's index by language.

I did some research [webmasterworld.com] some time ago on Fast which may be of interest. It is more difficult to do the same research on Google.

>Google's ranking heavily depends on link analysis. ~~ Pages in the Western world, for obvious reasons, rarely link to Chinese pages.

Interesting observation, this would probably affect ranking to a certain degree as long as the search query could be correctly parsed.

Onya
Woz
