...a new approach to sorting information that relies on scanning documents for abrupt "bursts" in the usage of particular words may help. Jon Kleinberg of Cornell University described the technique yesterday at the annual meeting of the American Association for the Advancement of Science in Denver, Colo.
Scientific American [sciam.com] article.
Kleinberg [cs.cornell.edu] is well known in the search engine research community for work on the famous Hubs and Authorities [www10.org] search engine index theory and algorithm.
Find the word bursts in recent blogs/forums, then send out the fresh bots to already-indexed pages in which the same words occur frequently/relevantly.
Though I wonder how much difference there is between word bursts in recent web documents and Zeitgeist-type search query word bursts (search phrases suddenly/abruptly occurring more frequently).
My feeling is that, as far as searching goes, it is a technique much more suited to small groups of documents with a controlled vocabulary, like intranets or large corporate sites, than to every document published on the web simultaneously.
You start on a page that has a high frequency of a certain word/phrase combination, and pages are spidered until the frequency falls below a certain point. Then another (if on-topic) query is started from that page, and so on...
Would hot > topics > get > muffled > if the words are in a directory structure and repeated over loads of pages, though? Hope the concept doesn't need anything nasty like PR to help it float ;)
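The spidering idea above could be sketched roughly like this. Everything here is an assumption for illustration: the pages and links are plain in-memory dicts rather than a real crawler, the frequency measure is a simple share of words, and the cutoff is arbitrary.

```python
from collections import deque

def phrase_frequency(text, phrase):
    """Share of the page's words taken up by occurrences of the phrase."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not words or not target:
        return 0.0
    n = len(target)
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return hits * n / len(words)

def burst_crawl(pages, links, start, phrase, threshold=0.05):
    """pages: URL -> text, links: URL -> list of URLs (both hypothetical).
    Follow links breadth-first, but stop expanding at any page where the
    phrase frequency has dropped below the threshold."""
    seen = {start}
    queue = deque([start])
    kept = []
    while queue:
        url = queue.popleft()
        if phrase_frequency(pages.get(url, ""), phrase) < threshold:
            continue  # frequency fell below the cutoff: don't spider onward
        kept.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return kept
```

Starting "another query from that page" would then just mean calling `burst_crawl` again with one of the kept URLs as `start` and a related phrase.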
amoore wrote:
wordburst.com was registered yesterday. Someone always beats me to them.
Sarah Graham wrote in the article:
...He [Jon Kleinberg] posits that the new approach could help narrow web searches by better recognizing the time context of a query....
The problem is, I don't think word bursts or any similar time-dependent tool would be implemented with too much precision by an SE like Google. There's a disincentive for Google to make it easy for us to search based on precise date ranges: date-stamping a page in the index gives us another datapoint to use in reverse-engineering the ranking algo.
Interesting to note that about halfway down there is a bit about measuring social networks (à la blogs?)
So a word burst from a blog community could be an "outburst" or an endorsement about a given subject.
Maybe it's not long until we have a "ranters" algorithm... or a script that could weed out rants in the forums. Very interesting work, but lots of math.
I also think that this concept could have some really interesting "social understanding" possibilities. This would probably be a very good tool for uncovering "bias" in such places as the media or within universities.
I wonder if it also works with synonyms...
I don't have a clue about how to express it mathematically, but I'm thinking a word burst query might sound something like: "Show me a graph of the keyword densities of all non-typical words occurring on all pages indexed between October 31 and December 31, 2002, where the keyword density is at least 1%". The important factors here are:
(1) What do you consider a "non-typical word"? The fewer stop words you filter out, the more inundated you get with data.
(2) At what rate does the keyword density of a given word or phrase cease to be "noise" and become meaningful? 1%? 5%?
(3) What's a meaningful period of time to consider? A week? A month?
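A rough sketch of what that first query might look like in code, under some loud assumptions: the index is just a list of (URL, index date, text) tuples, "non-typical" means "not in a small stop word set", and keyword density is a word's count divided by the total word count of the page. None of this is how any real search engine stores or exposes its data.

```python
from collections import Counter
from datetime import date

# Assumed, deliberately tiny stop word list; factor (1) is how big to make it.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def keyword_densities(text, stop_words=STOP_WORDS):
    """Map each non-stop word to its share of all words on the page."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items() if w not in stop_words}

def burst_query(pages, start, end, min_density=0.01, stop_words=STOP_WORDS):
    """pages: iterable of (url, indexed_on, text) tuples (hypothetical index).
    Return sorted (url, word, density) rows for pages indexed in [start, end]
    where the word's density meets the cutoff from factor (2)."""
    rows = []
    for url, indexed_on, text in pages:
        if not (start <= indexed_on <= end):
            continue  # outside the time period from factor (3)
        for word, density in keyword_densities(text, stop_words).items():
            if density >= min_density:
                rows.append((url, word, density))
    return sorted(rows)
```

The example query from the post would be `burst_query(pages, date(2002, 10, 31), date(2002, 12, 31), min_density=0.01)`, and the resulting rows are what you would feed into a graph.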
Another way I can see word burst statistics used is to start out with a specific keyword, then try to determine the trend for it: "Show me all pages with a keyword density of 1% to 2.5% for the term 'green widgets' where the pages have an index date of November 1, 2002, plus or minus a month or 1%, whichever comes first".
In this case, if the KD trend fell below 1% before the specified range of +/- 1 month, that would mean a short-lived trend, and if it didn't fall below 0.5% for the time period, it would mean a longer-lived trend. Whether that was meaningful or not would depend on what's historically typical for a keyword, though certain keywords probably never fall below a certain density threshold.
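The short-lived versus longer-lived distinction might be sketched like this. The post mentions both a 1% and a 0.5% floor, so this sketch takes a single `floor` parameter and leaves the choice to the caller. The function name, the list of (date, density) observations, and the 30-day window are assumptions for illustration.

```python
from datetime import date, timedelta

def classify_trend(series, center, floor=0.01, window_days=30):
    """series: list of (date, density) observations for one keyword
    (hypothetical data). Look at the window center +/- window_days and
    report whether the density ever dropped below the floor inside it."""
    lo = center - timedelta(days=window_days)
    hi = center + timedelta(days=window_days)
    in_window = sorted((d, kd) for d, kd in series if lo <= d <= hi)
    if not in_window:
        return "no data"
    for _, kd in in_window:
        if kd < floor:
            return "short-lived"  # fell below the floor before the window ended
    return "longer-lived"  # stayed at or above the floor for the whole window
```

Whether the label means anything still depends on what is historically typical for that keyword, so a real version would probably compare against a per-keyword baseline rather than a fixed floor.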
Can't wait to see its implementation on a well-known search engine in the near future.
Hot search engine stories in the last week:
1: word bursts
2: overture buys altavista
...
A recent reply from GoogleGuy [webmasterworld.com] to a question from me about narrowing down queries by date range seems to imply that Google isn't eager to have its users do precise date-matching. For this reason, I'm skeptical that it would ever implement a precise end-user word burst tool, which might run a similar risk of exposing part of its algorithms. (Mind you, this is assuming they went ahead and started incorporating word burst techniques into their algos in the first place. OK, it's too late, gotta get some sleep!)
[daypop.com...]