Google fails to highlight multibyte search results?

it's annoying

         

Xuefer

1:11 pm on May 18, 2003 (gmt 0)

10+ Year Member



in search result, a few is highlighted, but most not

Yidaki

1:14 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Where - in the cache version? Which words / phrases are not highlighted? Is it a random phenomenon or repeatable?

vincevincevince

1:58 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



what i find annoying is when searching for
"keyword1 keyword2"

some results will be of the form
"keyword1. keyword2" or
"keyword1, keyword2" etc...

firstly those are NOT matches for the exact match phrase "keyword1 keyword2"

secondly they are not highlighted using the toolbar highlight, or found when you click the toolbar word to find the instance.

it's particularly prevalent for names:
"John Smith" will find 'Murphy, John; Smith, Mark;' which is obviously about John Murphy and Mark Smith, not John Smith at all.

heini

2:01 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Vince, I don't see that, except for results where the original query only appears in links to the page.

vincevincevince

2:11 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



try a search for "lennon paul"

none of the top results are exact matches for the phrase "lennon paul" (with quotes!)

why should "john lennon : paul ..." come up for a search for "Lennon paul"?

heini

2:15 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>why should "john lennon : paul ..." come up for a search for "Lennon paul"?

Erm...because it is an exact match?

vincevincevince

2:20 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



heini, it's not an exact match

an exact match for "lennon paul" would be the phrase "word lennon paul word" NOT "word lennon : paul word"... if i wanted "lennon : paul" i'd search for it.

see the point?

<added>trying not to seem argumentative... it really is a deficiency, especially when googling for people</added>

Xuefer

2:24 pm on May 18, 2003 (gmt 0)

10+ Year Member



Not in the cached page, but in the search results page: the one you get when you enter keywords and press the search button.
It is repeatable.

keyword: 这是一个测试 (Chinese)
url: [google.com...]
A few results are highlighted, some are not.

and: "这是一个测试" (Chinese)
url: [google.com...]
Not even a single highlight; all results are in black.

[edited by: Xuefer at 2:45 pm (utc) on May 18, 2003]

heini

2:38 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Vince, that may be a limitation, but is not a highlight problem. It's the way Google handles queries.
I've not dug into it, but to my knowledge Google simply ignores punctuation in pages, when calculating the relevancy for a query.
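To make the disagreement above concrete, here is a minimal sketch of the two notions of "exact match". It assumes a matcher that drops punctuation before comparing word sequences. This is only an illustration of the behaviour described, not Google's actual implementation, and the function names are hypothetical.

```python
import re


def tokens(text):
    # Lowercase and keep runs of word characters; punctuation only separates words.
    return re.findall(r"\w+", text.lower())


def loose_phrase_match(query, page_text):
    # True if the query words appear consecutively once punctuation is ignored.
    q, t = tokens(query), tokens(page_text)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def strict_phrase_match(query, page_text):
    # True only if the literal string (case-insensitive) occurs in the page.
    return query.lower() in page_text.lower()


page = "Murphy, John; Smith, Mark;"
print(loose_phrase_match("John Smith", page))   # punctuation ignored
print(strict_phrase_match("John Smith", page))  # literal comparison
```

Under the loose rule "John Smith" matches the page above; under the strict rule it does not, which is the behaviour Vince is asking for.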

Xuefer, not sure I follow you?

Xuefer

2:42 pm on May 18, 2003 (gmt 0)

10+ Year Member



Are the replies mixed up?

To the repliers above, except Yidaki:
my topic is "highlight", not keywords.
Is there any relation between my topic and your replies?

Yidaki

2:48 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Xuefer, i guess you mean that "Paul" is highlighted and "Lennon" isn't? Or do you mean that "Paul" is highlighted in blue and "Lennon" in yellow (which is normal with phrase searches, i guess)? Don't you have a widget example?

heini

3:03 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Xuefer, you are right, two topics mixed in this thread.

I followed your examples, and was quite surprised, as this colored highlighting on serps is new to me. Seems to depend on the language setting.
So I have no idea why in some searches the highlighting does not happen.

vincevincevince

3:05 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I misunderstood your question...
It seems the query that is not being highlighted is the one with " " around it? And that causes the results to have no highlighting?

this is something to do with the chinese language search... not something i know too much about, sorry

Xuefer

3:12 pm on May 18, 2003 (gmt 0)

10+ Year Member



My topic isn't about the quotes :P
but about multibyte keywords.
In the first example I gave, some results are still shown in "black" without any highlight color,
even though those results contain the exact keyword(s).

Yidaki

3:26 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since your example uses a foreign language / special character set, i'd email it to search-quality@google.com. Could be a bug with their foreign highlighting feature!?

<edit>reason: dammn speeling</edit>

[edited by: Yidaki at 3:35 pm (utc) on May 18, 2003]

Xuefer

3:32 pm on May 18, 2003 (gmt 0)

10+ Year Member



thx

I only noticed the problem recently;
it was OK before.

takagi

4:41 pm on May 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Xuefer, I cannot read Chinese so it is hard for me to give the correct reply. I think I could reproduce the problem you mentioned. I can only offer you 2 remarks:

1. Encoding.
For some pages in the search result, the encoding is not clear. If you take the results from the 'debian' site and enter the url in Server Header Check [webmasterworld.com] you can see there is no encoding specified on the line

Content-Type: text/html

and the source contains
<!--X-Content-Type: text/plain -->

Google could have known the encoding if there had been something like

Content-Type: text/html; charset=GB2312

in either the header sent by the server or in the header of the HTML file.
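As a rough illustration of the difference between the two headers, the sketch below reads the charset parameter out of a Content-Type value using Python's standard library. The helper name `declared_charset` is hypothetical, and this only shows what a client can learn from the header itself, not how Google detects encodings.

```python
from email.message import Message


def declared_charset(content_type):
    # Parse a Content-Type value and return its charset parameter
    # (lowercased), or None when no charset is declared.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/html"))                  # no charset declared
print(declared_charset("text/html; charset=GB2312"))  # charset declared
```

When the first form is all a page provides, the reader of the page has to guess the encoding from the bytes.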

2. Optimizing search
Sometimes Google 'helps' the user when a search is entered. If you search for week-end, Google will highlight 'weekend' and 'week-end'. If you search for "week-end" (i.e. with the quotes), Google will highlight the word 'week' followed by 'end', no matter how these 2 words are separated: space, dash ('-'), slash ('/') etc. So Google makes some kind of interpretation of the search term.

Since Chinese has no spaces, finding words in a Chinese text is difficult for Google. If you click on the link after the URL ('Cached' in the English version) for msg00059.html on the debian site after entering the search you mentioned above, you can see that different colors are used for groups of Chinese characters. These background colors can help you understand how Google breaks up the search string. With knowledge of the Chinese language you could maybe explain why the second character is missing in the top part. In an English search some words like 'the' or 'of' are ignored. For Chinese something similar could occur.

If you are in mainland China (PRC), Google's cache could be blocked. In that case you might need a proxy server.
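To illustrate why text without spaces has to be broken up before it can be matched, here is a minimal sketch of greedy longest-match segmentation. The tiny vocabulary is a made-up assumption for this example, and nothing here is claimed to be the method Google uses.

```python
def segment(text, vocabulary, max_len=4):
    # Greedy forward segmentation: at each position take the longest
    # substring found in the vocabulary, else fall back to one character.
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical two-word vocabulary, just for this example.
VOCAB = {"一个", "测试"}
print(segment("这是一个测试", VOCAB))
```

With this vocabulary the six characters come out as four units, so a highlighter working on such units would colour groups of characters rather than the whole string.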

Hope this helps.

Xuefer

2:31 am on May 19, 2003 (gmt 0)

10+ Year Member



thanks for your kindness takagi

1. Encoding:
in my example url: [google.com...]

ie=UTF-8 (input encoding)
oe=UTF-8 (output encoding)
I think Google works in UTF-8 or UCS-4 internally;
otherwise, how could it find the results for me?
So I guess there isn't any encoding problem.
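For readers unfamiliar with these parameters, the sketch below shows how a keyword ends up percent-encoded as UTF-8 in a query string next to ie and oe. The helper name `build_query` is hypothetical, and only the query string is built, since the full URL is not given above.

```python
from urllib.parse import parse_qs, urlencode


def build_query(keyword):
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes by default.
    return urlencode({"q": keyword, "ie": "UTF-8", "oe": "UTF-8"})


query_string = build_query("这是一个测试")
print(query_string)

# Decoding the query string as UTF-8 gives back the original keyword.
print(parse_qs(query_string)["q"][0])
```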

2. Optimizing search:
Google didn't notify me that any words were ignored.

and what i said: "in search result, a few is highlighted, but most not"
I meant that among the search results, a few results (not keywords) are highlighted, while the others, most of them, are not.

If Google can find the result, and I can use Ctrl+F to find the keyword in the results page, then the keyword is literally present in the results.
It shouldn't be that hard to highlight it.

Maybe the highlight "engine" was optimized and not thoroughly tested for all languages/charsets.
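A minimal sketch of that argument, assuming the snippet is already decoded text: when the keyword is literally present, wrapping it in a marker is a plain substring operation that works the same for multibyte and ASCII text. The function name and the `<b>` markers are illustrative choices, not how Google renders its results.

```python
import re


def highlight(snippet, keywords, open_tag="<b>", close_tag="</b>"):
    # Wrap every literal occurrence of any keyword; longer keywords first
    # so that a short keyword does not split a longer one.
    terms = sorted((k for k in keywords if k), key=len, reverse=True)
    if not terms:
        return snippet
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, snippet)


print(highlight("这是一个测试页面", ["测试"]))
```

This treats the snippet as plain text; a real results page would also need HTML escaping.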

takagi

3:58 am on May 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Xuefer,

Please look at the cached version [216.239.33.100] of the msg00059.html [lists.debian.org] file. The SERP (Search Engine Results Page) is in UTF-8 (Unicode) because of the 'oe=UTF-8' in the url. But when I click on 'cached', my Japanese PC changes the encoding to BIG5, and the first character of the search string changes and gets a light blue background color. The second character looks similar to the 2nd character in the search string, but could internally have a different code (sometimes 2 similar characters have different hexadecimal codes), and has a white background. The 3rd + 4th characters share a light green background color. The 5th + 6th characters share a light red background color.

Could you confirm this? Or does it show something different on your PC?

The fact that the 3rd + 4th characters have the same background color is for me an indication that they form one word (but as I wrote before, I cannot read Chinese); same for the 5th + 6th characters. No doubt you can tell us whether this guess is correct or not.
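The two effects mentioned above (characters changing when the encoding switches, and look-alike characters having different codes) can be reproduced in a small sketch. It assumes the bytes are GB2312 and get read as BIG5, and the simplified/traditional pair is my own example, not taken from the cached page.

```python
text = "这是一个测试"

# Encode as GB2312, then read the same bytes back under two encodings.
gb_bytes = text.encode("gb2312")
reread = gb_bytes.decode("gb2312")                    # correct encoding
misread = gb_bytes.decode("big5", errors="replace")   # wrong encoding

print(reread == text)    # the right encoding restores the text
print(misread == text)   # the wrong one yields different characters

# Two characters that look alike can still be different code points.
simplified, traditional = "测", "測"
print(hex(ord(simplified)), hex(ord(traditional)))
```

A highlighter that compares code points would treat such a pair as two unrelated characters.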

Xuefer

4:21 am on May 19, 2003 (gmt 0)

10+ Year Member



I asked my friend to download that cached page for me; its highlighting is OK.

But I never meant the "cached page"; it's just the result page that fails to highlight.
and what reason make me look at the "cached page", while i'm asking about the problem of "search result" page?

*_*

whats up skip

7:29 am on May 19, 2003 (gmt 0)

10+ Year Member



Think you should have run this question in the Asia Pacific forum. Lots of people with double byte languages and google there.

takagi

8:44 am on May 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>what reason make me look at the "cached page", while i'm asking about the problem of "search result" page?

I perfectly understand that the problem was the highlighting of the SERP. But by looking at the background colors of a cached page, I hoped to better understand the way the search string was divided.

I agree with whats_up_skip that the 'Asia Pacific forum' is a better place for the thread. Maybe the moderators can change it.