Google news: related stories: how?

How do they decide which stories are related?

claus

12:46 pm on Dec 18, 2004 (gmt 0)

One thing that puzzles me a little bit when going to Google news is the bold green link "all 1,296 related". It's quite impressive, as i've yet to see a total miss.

How, exactly, do they determine which stories from around 4,500 different media are related, ie. the same story/topic with different wording/information? Near-duplicate pages? I mean, it's not as if they can rely on backlinking patterns or anything like that, it must be on-page factors/semantics or something.

Anyone seen any information on this somewhere? Anyone know anything? Theories?

---
Yes, i do believe that there are spill-over-effects from news to search (or the other way round), but let's keep this thread (if anyone replies) on news, okay?

ciml

6:02 pm on Dec 20, 2004 (gmt 0)

Great question.

Possibilities include:

Title
Text
Images
Link text (from within the news site)
?

hugo_guzman

10:06 pm on Dec 20, 2004 (gmt 0)

Google News results are sorted using a ranking algorithm similar (or possibly identical) to their regular organic search algorithm.

claus

11:53 pm on Dec 20, 2004 (gmt 0)

hugo_guzman, i'm not sure i understand what you're saying - are you suggesting that keywords in anchor text is what they use to group different stories?

I don't think so myself, as the different headlines within one topic do not always have the same words in them, and with news sites headlines and anchor text are often the same (if the anchor text is not simply one of "read more", or "full story").

Here are the headlines of the two top stories right now:

-----------------
STORY 1
a) Poll: Most Americans Think Iraq War Not Worth Fighting
b) Poll: President's Year-End Job Approval ABC News
c) Bush confident in Iraqi elections despite Iraqi troops' ...

-----------------
STORY 2
a) Bush Defends Rumsfeld, Notes Iraq Difficulties
b) Second term near, Bush takes stock Christian Science Monitor
c) Bush defends embattled Rumsfeld as ``a caring fellow''
-----------------

Notice just how different those headlines are in terms of wording? I mean, if wording in headlines (assuming headline = anchor) was key, wouldn't you think that eg. (1c) could just as well be grouped with (2a)? And what makes (2b) qualify for a group with (2a) and (2c)?

When i click on one of those "all 1,185 related »" links the thing that strikes me most is that there are quite some differences in wording among those (here 1,185) headlines and articles.
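To put a rough number on the point about headline wording, here is a minimal sketch that measures plain word overlap (Jaccard similarity) between the six headlines quoted above. The tokenizer and the choice of Jaccard are assumptions made for illustration, and the source names fused onto (1b) and (2b) are dropped; nothing here is known to be what Google News does.

```python
import re

# Headlines as quoted in the post above (trailing source names removed).
HEADLINES = {
    "1a": "Poll: Most Americans Think Iraq War Not Worth Fighting",
    "1b": "Poll: President's Year-End Job Approval",
    "1c": "Bush confident in Iraqi elections despite Iraqi troops' ...",
    "2a": "Bush Defends Rumsfeld, Notes Iraq Difficulties",
    "2b": "Second term near, Bush takes stock",
    "2c": "Bush defends embattled Rumsfeld as ``a caring fellow''",
}

def words(headline):
    """Lowercase a headline and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", headline.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words of two headlines."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def overlap(key_a, key_b):
    """Jaccard overlap of two headlines, looked up by their labels."""
    return jaccard(HEADLINES[key_a], HEADLINES[key_b])

if __name__ == "__main__":
    for pair in [("1a", "1c"), ("1a", "2a"), ("1c", "2a"), ("2a", "2b"), ("2a", "2c")]:
        print(pair, round(overlap(*pair), 3))
```

With this naive measure (1a) shares no word at all with (1c), which sits in the same group, yet shares "iraq" with (2a) from the other group. (1c) and (2b) each share only "bush" with (2a), so headline words alone give no reason to put (2b) in story 2 and (1c) in story 1.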

prairie

11:56 am on Dec 22, 2004 (gmt 0)

Perhaps because news sites are so well structured they are relying on those clues.

claus

6:33 pm on Dec 22, 2004 (gmt 0)

okay let's bump this one with a little twist to it...

Remember that before everybody went sandboxing, we discussed all kinds of interesting things like themes, semantics, latent semantic indexing, "related sites", near duplicate content, and so on a while back? Even hilltops and neighborhoods crept into the discussion from time to time...

It seems to me, that in this little corner of GOOG Corp it has actually been proven possible to identify, say, "themes" from a wide range of pages and sites that are not otherwise related. IOW it's a proof-of-concept.

So, without relying too much on backlinks (if at all, i assume), they can actually identify that two web pages are on the same subject, even though the pages are not related in other ways. Specifically, none of these pages (stories) are optimized for the "theme" they are grouped in by KW in title, KW in backlinks or anything like that.

The theme goes beyond the keywords, it seems. Ie. the article might be optimized for the keywords "Blue widgets" (headline, mostly) but what it is ranking for is something else - a more general theme (say, "recent US widgettery") in which "blue widgets" play a part sometimes but not always.

Now, anyone got any recent thoughts on themes/semantics?
Or perhaps somebody can link to the relevant literature (ie. papers)? - as i'm not sure i've read all i'd like to read.

(no need to link to this one [searchengineworld.com] - i've definitely read that ;))

claus

1:08 am on Dec 30, 2004 (gmt 0)

*bump*

i'm just bumping this a last time - this can't be true, surely somebody out there must have been reading something somewhere on Google News and how it is done?

I'd really like to do a very similar thing (with a totally different focus) on a non-web software project, but i really don't know where to start - i'm considering cluster analysis, factor analysis, mds, and so on, as it's obviously some of that stuff that's behind it, but i would like some pointers just to get started in the right direction.

- i sure feel like reinventing the wheel, and i hate that kind of thing... I'm sure it'll be a nice round one at some stage, though...
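As one possible starting point for the cluster-analysis route mentioned above, here is a minimal sketch: weight the words of each text with TF-IDF, compare texts by cosine similarity, and group them with a single-pass "leader" clustering. The snippets, the threshold of 0.2 and all function names are made up for illustration; this is a common textbook baseline, not a description of how Google News groups stories.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dict word -> weight), one per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: count * (math.log(n / df[w]) + 1.0) for w, count in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def leader_clusters(docs, threshold=0.2):
    """Put each doc in the first cluster whose leader is similar enough."""
    vectors = tfidf_vectors(docs)
    clusters = []  # lists of doc indices; the first index is the leader
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing close enough: start a new cluster
    return clusters

# Invented snippets: two about one story, two about another.
SNIPPETS = [
    "bush defends rumsfeld over iraq war",
    "rumsfeld defended by bush amid iraq criticism",
    "blue widgets prices rise in widget market",
    "widget market sees blue widgets prices climb",
]

if __name__ == "__main__":
    print(leader_clusters(SNIPPETS))
```

Leader clustering is order-dependent and compares only against the first member of each cluster, so it is cheap but crude; hierarchical clustering or k-means on the same vectors would be the obvious next things to try.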

Kirby

4:25 am on Dec 30, 2004 (gmt 0)

Newsknife asked a similar question and came up with this [searchenginewatch.com]. Can you tie this in with topic sensitive page rank?

sergi

10:13 am on Dec 30, 2004 (gmt 0)

I'd say that latent semantic indexing (http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm) seems the most likely candidate.
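For anyone who wants to see the idea in miniature, here is a rough sketch of LSI on a toy corpus. It builds raw term-count vectors, takes the top eigenvectors of the document-by-document Gram matrix (equivalent to the right singular vectors of the term-document matrix) by power iteration, and compares documents in the reduced space. The documents, the choice of k=2 and every function name are assumptions for illustration; whether Google News uses anything like this is pure speculation.

```python
import math
import random

# Invented toy corpus. Docs 0-2 are chained by shared words, but docs 0 and 2
# have no word in common. Docs 3-4 belong to a second, unrelated topic.
DOCS = [
    "bush iraq",
    "iraq rumsfeld",
    "rumsfeld pentagon",
    "widgets blue",
    "blue prices",
]

def term_doc_vectors(docs):
    """Return one raw term-count vector per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[d.split().count(w) for w in vocab] for d in docs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def top_eigenpairs(matrix, k, iters=500, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration + deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in m])
        pairs.append((lam, v))
        # Deflate: subtract the component just found before the next round.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def lsi_doc_coords(docs, k):
    """Coordinates of each document in a k-dimensional latent space."""
    vecs = term_doc_vectors(docs)
    gram = [[dot(a, b) for b in vecs] for a in vecs]  # equals A^T A
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(len(docs))]

if __name__ == "__main__":
    raw = term_doc_vectors(DOCS)
    latent = lsi_doc_coords(DOCS, k=2)
    print("raw    d0~d2:", round(cosine(raw[0], raw[2]), 3))
    print("latent d0~d2:", round(cosine(latent[0], latent[2]), 3))
    print("latent d0~d3:", round(cosine(latent[0], latent[3]), 3))
```

In the raw word space docs 0 and 2 look unrelated because they share no words; in the two-dimensional latent space they end up pointing the same way, while docs from the other topic stay orthogonal. That is the "same theme, different wording" effect in its simplest form. A real implementation would use a proper sparse SVD routine rather than hand-rolled power iteration.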

claus

2:33 pm on Dec 30, 2004 (gmt 0)

LOL @ LSI big time :) You know the feeling when you're looking for something and you find out that it's been sitting there on your desk right in front of you all the time?

I must admit that i never got to read that paper back when all those discussions were hot, as i supposed (one should never suppose) that it had to do with building vast taxonomies, word lists, and such, but now that i've read it i can see that it's just your basic MDS with another name and a bit of spice (spice being taxonomies, word lists, and such) :)

...only too bad that now i'll have to reinvent the wheel, ie. write my own algorithm, but never mind, it's just for fun so i don't mind if it takes a year or never gets finished.

>> Newsknife

Interesting, but that's basically taking the Google news "as is" and monitoring it, which is something else than what i had in mind.

>> Can you tie this in with topic sensitive page rank?

Tie G News with it? Dunno. At first i didn't really think Gnews would use PR at all, but perhaps it does anyway. This would mean that the most relevant news stories are the stories from sources with the highest PR, which does not seem like a fair assumption initially. OTOH, you have to make some kind of assumptions otherwise you'd just have a big list of more or less randomly sorted groups of stories, and the PR approach works for web search, so why not.

If that's true, then newsknife could cut a few corners by visiting the news sites they're ranking with the G toolbar instead of doing all that math. It wouldn't get that "scientific touch" though ;)

For the web, topic sensitive PR should be easy, btw. (ignoring scale for a moment). Just do like in the LocalRank patent, but select the BackSet from the same MDS/LSI/whatever cluster (instead of the same base SERP). It would probably give poor results on a broad term like "widgets" but it should give increasingly better SERPs the more words you use, ie. some kind of "authority score". Of course you could tie the weights somewhat to the number of terms in the query, but it's all speculation and i really haven't thought about all the details involved, so perhaps i'm totally wrong.
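A very loose sketch of that speculation, heavily simplified: score each page by the number of distinct pages in its own cluster that link to it. This is only inspired by the idea of restricting the link set to a cluster; it does not reproduce the LocalRank patent, and the page names, link graph and cluster labels below are hypothetical.

```python
def cluster_local_scores(links, cluster_of):
    """Count, for each page, inbound links coming from pages in the same cluster.

    links:      dict mapping a page to the list of pages it links to
    cluster_of: dict mapping a page to its cluster label (from MDS/LSI/whatever)
    """
    scores = {page: 0 for page in cluster_of}
    for src, targets in links.items():
        if src not in cluster_of:
            continue  # pages without a cluster label cannot vote
        for dst in set(targets):  # count each linking page at most once
            if dst == src or dst not in cluster_of:
                continue
            if cluster_of[src] == cluster_of[dst]:
                scores[dst] += 1
    return scores

# Hypothetical example: pages a, b, c share a cluster; x and y share another.
LINKS = {"a": ["b", "c"], "b": ["c"], "x": ["c", "y"], "c": [], "y": []}
CLUSTER_OF = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1}

if __name__ == "__main__":
    print(cluster_local_scores(LINKS, CLUSTER_OF))
```

Here the link from x to c is ignored because the two pages sit in different clusters, so c is credited only for the two links from within its own "theme". How such a score would be blended with a global one is left open, as in the post.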

hugo_guzman

5:30 pm on Dec 30, 2004 (gmt 0)

Here are a couple of excerpts from Google's "About Google News" page
(http://news.google.com/intl/en_us/about_google_news.html)

"Google News is highly unusual in that it offers a news service compiled solely by computer algorithms without human intervention. While the sources of the news vary in perspective and editorial approach, their selection for inclusion is done without regard to political viewpoint or ideology. While this may lead to some occasionally unusual and contradictory groupings, it is exactly this variety that makes Google News a valuable source of information on the important issues of the day."

"Question: Who edits the Google News homepage? One of the headlines is totally out of whack.
Answer: The headlines that appear on Google news are selected entirely by computer algorithms, based on how and where the stories appear elsewhere on the web. There are no human editors at Google selecting or grouping the headlines and no individual decides which stories get top placement. This occasionally results in some articles appearing to be out of context."