Kleinberg says "Word Bursts" define hot topics

Brett_Tabke

2:16 pm on Feb 19, 2003 (gmt 0)

...a new approach to sorting information that relies on scanning documents for abrupt "bursts" in the usage of particular words may help. Jon Kleinberg of Cornell University described the technique yesterday at the annual meeting of the American Association for the Advancement of Science in Denver, Colo.

Scientific American [sciam.com] article.

Kleinberg [cs.cornell.edu] is well known in the search engine research community for work on the famous Hubs and Authorities [www10.org] search engine index theory and algorithm.

vitaplease

2:32 pm on Feb 19, 2003 (gmt 0)

Interesting.

Find the word bursts in recent Blogs/forums and send out the Fresh bots to already indexed pages which themselves have the same words occurring frequently/relevantly.

Though I wonder how much difference there is between word bursts in recent web documents and Zeitgeist-type search query word bursts (search phrases suddenly occurring more frequently).

NFFC

3:45 pm on Feb 19, 2003 (gmt 0)

>Find the word bursts in recent Blogs

Or better still buy a blogger, move them to your servers and watch the web in real time. Hmmmmm :)

msgraph

6:14 pm on Feb 19, 2003 (gmt 0)

Interesting indeed.

This could help provide a better mix in results where a word has multiple meanings and interests, particularly at seasonal times.

>>>>Or better still buy a blogger, move them to your servers and watch the web in real time. Hmmmmm :)

hmmmm too!

NFFC

6:24 pm on Feb 19, 2003 (gmt 0)

>hmmmm too!

hehe, nice little transparent proxy server sitting on the front end, watching the traffic flow in and out and following the leads. Could be a killer app if combined with a search engine capable of updating almost on the fly.

I wonder if anyone has thought of that?

jeremy goodrich

6:30 pm on Feb 19, 2003 (gmt 0)

he he, combine that with something else - say, the ultimate in profiling - and you could really have something there...blog, profile, word burst, links, etc.

It's all coming together...hm, very interesting indeed.

msgraph

6:40 pm on Feb 19, 2003 (gmt 0)

>>>hehe, nice little transparent proxy server sitting on the front end, watching the traffic flow in and out and following the leads.

Yes, kinda like having your very own group of inner city youths who can tell you what's hot in fashion and music.

amoore

7:09 pm on Feb 19, 2003 (gmt 0)

I'll be danged. wordburst.com was registered yesterday. Someone always beats me to them.

currybet

9:58 pm on Feb 19, 2003 (gmt 0)

i think it sounds potentially great - but the examples given in that article were *so* lame - i'd want to see how it scaled up to deal with the whole of the web.

my feeling is that, as far as searching goes, it is a technique much more suited to small groups of documents with a controlled vocabulary, like intranets or large corporate sites, than trying to apply it to every document published on the web simultaneously.

Darkness

10:52 pm on Feb 19, 2003 (gmt 0)

Funny.. I was thinking of doing a similar thing for my site about a week ago. This thread got me inspired to get started on it and I've now got a 'Hot Words' section on the front page. Although it's a small data set with an obvious bias towards games and pc hardware, the results are still better than I expected :)

brotherhood of LAN

12:46 am on Feb 20, 2003 (gmt 0)

isn't this what g and others have called a "driving query"?

you start on a page that has a high frequency of a certain word/phrase combination, and pages are spidered until the frequency is below a certain point. Then another (if on topic) query is started from that page, and so on...

Would hot > topics > get > muffled > if the words are in a directory structure and repeated over loads of pages though... hope the concept doesn't need anything nasty like PR to help it float ;)
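
A rough sketch of the crawl described in this post, for illustration only: follow links outward from a starting page and stop expanding wherever the term's frequency drops below a cutoff. The in-memory `pages` and `links` dictionaries, the `focused_crawl` name and the 5% cutoff are all assumptions made up for the example; nothing here is a documented search engine method.

```python
from collections import deque

def focused_crawl(pages, links, start, term, min_density=0.05):
    """Breadth-first walk from `start`, keeping pages where `term` is dense
    enough and not following links out of pages where it is not."""
    seen = {start}
    queue = deque([start])
    kept = []
    while queue:
        url = queue.popleft()
        words = pages[url].lower().split()
        density = words.count(term) / len(words) if words else 0.0
        if density < min_density:
            continue  # frequency fell below the cutoff: stop expanding here
        kept.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen and nxt in pages:
                seen.add(nxt)
                queue.append(nxt)
    return kept

# Hypothetical four-page site; "d" is only reachable through off-topic "c".
pages = {
    "a": "widgets widgets blue",
    "b": "widgets red green",
    "c": "nothing relevant here",
    "d": "widgets widgets",
}
links = {"a": ["b", "c"], "c": ["d"]}
kept = focused_crawl(pages, links, "a", "widgets")
```

In this toy data the crawl keeps "a" and "b" and never reaches "d", because the only path to it runs through a page where the term does not appear.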

Winooski

3:15 am on Feb 20, 2003 (gmt 0)

Darkness, welcome to WebmasterWorld!

amoore wrote:

wordburst.com was registered yesterday. Someone always beats me to them.

Well, as of 9:48 EST, "jonkleinberg.com" is still available. ;)

Sarah Graham wrote in the article:

...He [Jon Kleinberg] posits that the new approach could help narrow web searches by better recognizing the time context of a query....

The problem is, I don't think word bursts or any similar time-dependent tool would be implemented with too much precision by an SE like Google. There's a disincentive for Google to make it easy for us to search based on precise date ranges: date-stamping a page in the index gives us another datapoint to use in reverse-engineering the ranking algo.

brotherhood of LAN

5:28 am on Feb 20, 2003 (gmt 0)

One of his works is here
[cs.cornell.edu...]

Interesting to note about halfway down there is a bit about measuring social networks (a la blogs?)

so a word burst from a blog community could be an "out burst" or an endorsement about a given subject.

maybe it's not long until we have a "ranters" algorithm... or a script that could weed out rants in the forums. Very interesting work, but lots of math.

ggrot

5:36 am on Feb 20, 2003 (gmt 0)

New? I admit I didn't read the paper, but isn't this just the same as an engine counting word frequency?

JustTrying

6:53 am on Feb 20, 2003 (gmt 0)

This would be a fun application to show up at Google Labs. I would love to see this "burst" concept "make sense" of some data sets that I was able to feed into it.

I also think that this concept could have some really interesting "social understanding" possibilities. This would probably be a very good tool for uncovering "bias" in such places as the media or within universities.

I wonder if it also works with synonyms...

Winooski

7:06 am on Feb 20, 2003 (gmt 0)

ggrot, judging by the brief article, I believe calculating word burst metrics is an attempt to determine keyword trends over time. It's not enough that a set of pages has similar KF or KD for a given term, the pages would also have to be sufficiently near each other in index date.
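
To make the difference between plain frequency and a burst concrete, here is a minimal sketch. It is not Kleinberg's model from the paper; it only compares a term's peak count in any one period with its average count per period. The function names, the toy documents and the integer "period" labels are assumptions for illustration.

```python
from collections import Counter

def counts_per_period(docs, term):
    """docs: list of (period, text). Returns {period: occurrences of term}."""
    counts = Counter()
    for period, text in docs:
        counts[period] += text.lower().split().count(term)
    return dict(counts)

def burst_ratio(period_counts, periods):
    """Peak per-period count divided by the mean per-period count."""
    values = [period_counts.get(p, 0) for p in periods]
    mean = sum(values) / len(values)
    return max(values) / mean if mean else 0.0

# Hypothetical documents, one per time period.
docs = [
    (1, "widgets news widgets"),
    (2, "widgets gadgets"),
    (3, "gadgets gadgets gadgets gadgets widgets"),
    (4, "widgets"),
]
periods = [1, 2, 3, 4]
widgets = counts_per_period(docs, "widgets")
gadgets = counts_per_period(docs, "gadgets")
```

Both terms occur five times overall, so a pure frequency count cannot tell them apart. The ratio can: "widgets" is spread evenly across the periods, while "gadgets" is concentrated in period 3 and scores twice as high.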

I don't have a clue about how to express it mathematically, but I'm thinking a word burst query might sound something like: "Show me a graph of the keyword densities of all non-typical words occurring on all pages indexed between October 31 and December 31, 2002, where the keyword density is at least 1%". The important factors here are:

(1) What do you consider a "non-typical word"? The fewer the stop words you use as a filter, the more inundated you get with data.

(2) At what rate does the keyword density of a given word or phrase cease to be "noise" and become meaningful? 1%? 5%?

(3) What's a meaningful period of time to consider? A week? A month?

Another way I can see word burst statistics used is to start out with a specific keyword, then try to determine the trend for it: "Show me all pages with a keyword density of 1% to 2.5% for the term 'green widgets' where the pages have an index date of November 1, 2002, plus or minus a month or 1%, whichever comes first".

In this case, if the KD trend fell below 1% before the specified range of +/- 1 month, that would mean a short-lived trend, and if it didn't fall below 0.5% for the time period, it would mean a longer-lived trend. Whether that was meaningful or not would depend on what's historically typical for a keyword, though certain keywords probably never fall below a certain density threshold.
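
The two queries above could be sketched roughly as follows. This is only an illustration under stated assumptions: pages are plain dictionaries with an `indexed` date and a `text` field, the stop-word list is a tiny made-up one, and the density function handles single words rather than phrases like 'green widgets'. None of the names come from any real search engine.

```python
from datetime import date

# Assumed, deliberately tiny stop-word filter (factor 1 in the post above).
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

def keyword_density(text, term):
    """Fraction of the words in `text` that equal `term` (single words only)."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

def dense_terms(pages, start, end, min_density=0.01):
    """For pages indexed within [start, end], return {term: peak density}
    for every non-stop word whose density reaches `min_density` on some page."""
    peaks = {}
    for page in pages:
        if not (start <= page["indexed"] <= end):
            continue
        for term in set(page["text"].lower().split()) - STOP_WORDS:
            kd = keyword_density(page["text"], term)
            if kd >= min_density:
                peaks[term] = max(peaks.get(term, 0.0), kd)
    return peaks

def term_trend(pages, term):
    """Density of one term on each page, ordered by index date."""
    ordered = sorted(pages, key=lambda p: p["indexed"])
    return [(p["indexed"], keyword_density(p["text"], term)) for p in ordered]

# Hypothetical index entries.
pages = [
    {"indexed": date(2002, 11, 5), "text": "green widgets and the green widgets sale"},
    {"indexed": date(2002, 12, 20), "text": "widgets in stock"},
    {"indexed": date(2003, 1, 15), "text": "green green green"},
]
hot = dense_terms(pages, date(2002, 10, 31), date(2002, 12, 31))
trend = term_trend(pages, "widgets")
```

Here `hot` covers only the two pages inside the October 31 to December 31 window, and `trend` gives the per-page density series that a threshold such as 1% could then be checked against. The three open questions in the post map directly onto `STOP_WORDS`, `min_density` and the `start`/`end` window.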

gethan

7:22 am on Feb 20, 2003 (gmt 0)

Analyze the news sites, certain newsgroups, forums and blogs. Apply a PR algo to give more authoritative resources greater weight, apply a freshness algorithm, and look at word bursts in a given week, day or hour. The hot stories and concepts organise themselves, and where they are mentioned classifies them... I like the concept.

Can't wait to see its implementation on a well known search engine in the near future.

Hot search engine stories in the last week:
1: word bursts
2: overture buys altavista
...
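
One way the weighting idea in this post might be sketched: each mention of a term adds the source's authority weight, discounted by how old the mention is. The `hot_terms` name, the 24-hour half-life, the hour-based timestamps and the sample numbers are all assumptions for illustration; the authority value is just a stand-in for whatever a PR-style algorithm would supply.

```python
from collections import defaultdict

def hot_terms(mentions, now, half_life_hours=24.0):
    """mentions: iterable of (term, authority, seen_at_hours).
    Each mention contributes authority * 0.5 ** (age / half_life_hours),
    so a mention loses half its weight every `half_life_hours`."""
    scores = defaultdict(float)
    for term, authority, seen_at in mentions:
        age = now - seen_at
        scores[term] += authority * 0.5 ** (age / half_life_hours)
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical mentions: (term, source authority, hour the mention was seen).
mentions = [
    ("word bursts", 3.0, 96.0),
    ("word bursts", 1.0, 90.0),
    ("overture altavista", 2.0, 48.0),
]
ranked = hot_terms(mentions, now=100.0)
```

With these made-up numbers the fresher, more heavily weighted "word bursts" mentions outrank the two-day-old "overture altavista" one. Shortening the half-life would move the list from "this week" towards "this hour".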

Winooski

7:56 am on Feb 20, 2003 (gmt 0)

I don't think we should get too psyched about the application of word burst techniques with our favorite "well known search engine", because I don't think Google would ever give us access to sufficient tools to figure out how such techniques were used in indexing and ranking.

A recent reply from GoogleGuy [webmasterworld.com] to a question from me about narrowing down queries by date range seems to imply that Google isn't eager to have its users do precise date-matching. For this reason, I'm skeptical that it would ever implement a precise end-user word burst tool, which might run a similar risk of exposing part of its algorithms. (Mind you, this is assuming they went ahead and started incorporating word burst techniques into their algos in the first place. OK, it's too late, gotta get some sleep!)

vitaplease

8:16 am on Feb 20, 2003 (gmt 0)

If this plan gets worked out properly [webmasterworld.com] then such a taste of the times would make very interesting searching.

rubble88

8:46 pm on Feb 21, 2003 (gmt 0)

Much more here [eurekalert.org] including "The 150 term bursts of highest weight in Presidential State of the Union Addresses, 1790-2002"

NFFC

7:25 pm on Feb 24, 2003 (gmt 0)

Daypop spots a bandwagon and climbs aboard;

[daypop.com...]

Winooski

7:42 pm on Feb 24, 2003 (gmt 0)

NFFC, nice find!

...And just think, they didn't have to buy Pyra to do it! ;)