some results will be of the form
"keyword1. keyword2" or
"keyword1, keyword2" etc...
firstly those are NOT matches for the exact match phrase "keyword1 keyword2"
secondly they are not highlighted using the toolbar highlight, or found when you click the toolbar word to find the instance.
it's particularly prevalent for names:
"John Smith" will find 'Murphy, John; Smith, Mark;' which is obviously about John Murphy and Mark Smith, not John Smith at all.
an exact match for "lennon paul" would be the phrase "word lennon paul word" NOT "word lennon : paul word"... if i wanted "lennon : paul" i'd search for it.
see the point?
<added>trying not to seem argumentative... it really is a deficiency, especially when googling for people</added>
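To make the distinction concrete, here is a minimal Python sketch of the two behaviours being compared. It is only an illustration, not how Google matches phrases; the function names `strict_phrase_match` and `loose_phrase_match` are made up for this example. The strict version allows only whitespace between the words, while the loose version lets any punctuation sit between them, which is the behaviour complained about above.

```python
import re

def strict_phrase_match(phrase, text):
    """True only if the words of `phrase` appear in order with nothing but whitespace between them."""
    words = phrase.lower().split()
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text.lower()) is not None

def loose_phrase_match(phrase, text):
    """True if the words appear in order with any run of non-word characters between them."""
    words = phrase.lower().split()
    pattern = r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b"
    return re.search(pattern, text.lower()) is not None

if __name__ == "__main__":
    samples = [
        ("John Smith", "Murphy, John; Smith, Mark;"),
        ("lennon paul", "word lennon paul word"),
        ("lennon paul", "word lennon : paul word"),
    ]
    for phrase, text in samples:
        print(phrase, "|", text, "|",
              "strict:", strict_phrase_match(phrase, text),
              "loose:", loose_phrase_match(phrase, text))
```

With the strict version, 'Murphy, John; Smith, Mark;' is not a hit for "John Smith", while the loose version accepts it.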
keyword: 这是一个测试 (Chinese for "this is a test")
url: [google.com...]
only a few of the results are highlighted, the rest are not
and: "这是一个测试" (the same Chinese phrase, in quotes)
url: [google.com...]
not even a single highlight, every result is in plain black
[edited by: Xuefer at 2:45 pm (utc) on May 18, 2003]
1. Encoding.
For some pages in the search result, the encoding is not clear. If you take the results from the 'debian' site and enter the url in Server Header Check [webmasterworld.com] you can see there is no encoding specified on the line
Content-Type: text/html <!--X-Content-Type: text/plain -->
Google could have known the encoding if there had been something like
Content-Type: text/html; charset=GB2312
2. Optimizing search
Sometimes Google 'helps' the user when a search is entered. If you search for week-end, Google will highlight 'weekend' and 'week-end'. If you search for "week-end" (i.e. with the quotes), Google will highlight the word 'week' followed by 'end', no matter how these 2 words are separated: space, dash ('-'), slash ('/') etc. So Google makes some kind of interpretation of the search term.
Since Chinese has no spaces, finding words in a Chinese text is difficult for Google. If you click on the link after the URL ('Cached' in the English version) for msg00059.html of the debian site when entering the search you mentioned above, you can see different colors are used for groups of Chinese characters. These background colors can help you to understand how Google breaks up the search string. With knowledge of the Chinese language you could maybe explain why the second character is missing in the top part.
In an English search some words like 'the' or 'of' are ignored. For Chinese something similar could occur.
If you are in mainland China (PRC), Google's cache could be blocked. In that case you might need a proxy server.
Hope this helps.
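For the encoding point, a small Python sketch of why the missing charset matters. The helper name `declared_charset` is made up for this example, and the single character used is just a sample. It reads the charset out of a Content-Type header value with the standard library, then decodes the same two bytes under two different charset assumptions. This says nothing about what Google actually does with such pages; it only shows what a reader of the page has to guess when no charset is declared.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None if there is none."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# The same two bytes, read under two different assumptions about the charset.
raw = "是".encode("gb2312")
as_gb2312 = raw.decode("gb2312")                 # the intended character
as_big5 = raw.decode("big5", errors="replace")   # what a wrong guess produces

if __name__ == "__main__":
    print(declared_charset("text/html"))
    print(declared_charset("text/html; charset=GB2312"))
    print(as_gb2312, as_big5)
```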
1. Encoding:
in my example url: [google.com...]
ie=UTF-8 (input encoding)
oe=UTF-8 (output encoding)
and i think google works in utf-8 or ucs4 for its internal encoding
otherwise how could it find the results for me?
so i guess there isn't any encoding problem
2. Optimizing search:
google didn't notify me that any words were ignored
and what i said was: "in search result, a few are highlighted, but most are not"
i meant that a few of the search results (not keywords) are highlighted, while most of the results are not
if it can find the results, and i can ctrl+f to find the keyword in the result page, it means the keyword is exactly in the results
it shouldn't be that hard to highlight it
maybe the highlight "engine" was optimized and not thoroughly tested for all languages/charsets
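As a rough illustration of the "ctrl+f" argument, here is a minimal Python sketch of literal highlighting: every exact occurrence of the keyword in a snippet gets wrapped in <b> tags, with no word splitting at all. The function name `highlight` and the sample snippets are made up for this example, and this is certainly not Google's highlighter, only the simplest thing that would work when the keyword appears verbatim.

```python
import html
import re

def highlight(snippet, keyword):
    """Wrap every literal occurrence of `keyword` in <b>...</b>; escape everything for HTML."""
    if not keyword:
        return html.escape(snippet)
    # Splitting on a capturing group keeps the matches at the odd positions.
    parts = re.split("(" + re.escape(keyword) + ")", snippet)
    return "".join(
        "<b>" + html.escape(part) + "</b>" if index % 2 == 1 else html.escape(part)
        for index, part in enumerate(parts)
    )

if __name__ == "__main__":
    print(highlight("a test <page>", "test"))
    print(highlight("这是一个测试 page", "这是一个测试"))
```

Because it never tries to split the keyword into words, it behaves the same for Chinese and English text.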
Please look at the cached version [216.239.33.100] of the msg00059.html [lists.debian.org] file. The SERP (Search Engine Results Page) is in UTF-8 (Unicode) because of the 'oe=UTF-8' in the url. But when I click on 'cached' my Japanese PC changes the encoding to BIG5, and the first character of the search string changes and gets a light blue background color. The second character looks similar to the 2nd character in the search string, but could internally have a different code (sometimes 2 similar characters have a different hexadecimal code), and has a white background. The 3rd + 4th characters share a light green background color. The 5th + 6th characters share a light red background color.
Could you confirm this? Or does it show something different on your PC?
The fact that the 3rd + 4th characters have the same background color is for me an indication that they form one word (but as I wrote before, I cannot read Chinese), and the same goes for the 5th + 6th characters. No doubt you can tell us whether this guess is correct or not.
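One way such a grouping could come about is dictionary-based segmentation. The Python sketch below uses greedy longest-match against a word list; the two-entry dictionary is hypothetical and chosen only for this example (一个 means "a/one", 测试 means "test"), and nothing here is known to be Google's actual method. It does reproduce the grouping described above: characters 3 + 4 together and characters 5 + 6 together.

```python
def segment(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical two-entry dictionary, chosen only for this example.
TOY_DICTIONARY = {"一个", "测试"}

if __name__ == "__main__":
    print(segment("这是一个测试", TOY_DICTIONARY))
```

With this toy dictionary the first two characters come out as single-character words, and the remaining four as two 2-character words.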
why should i look at the "cached page", when i'm asking about a problem with the "search result" page?
I perfectly understand that the problem was the highlighting of the SERP. But by looking at the background colors of a cached page, I hoped to better understand the way the search string was divided.
I agree with whats_up_skip that the 'Asia Pacific forum' is a better place for the thread. Maybe the moderators can change it.