|Microsoft tries to one-up Google PageRank|
|Microsoft researchers and academic collaborators detailed an idea this week it calls BrowseRank that seeks to bring more of a human touch ... Essentially, the researchers tested out a system that replaces PageRank's link graph -- a mathematical model of the hyperlinked connections of the Internet -- with what they call a user browsing graph that ranks Web pages by people's behavior. |
Read CNET Story [news.cnet.com]
OK, an interesting idea. Now let's also get MSN/LIVE to properly crawl and index pages, then maybe we'll have something...
|In our experiment, such data was recorded and collected from an extremely large group of users under legal agreements with them. Information which could be used to recognize their identities was not included. By integrating the data from hundreds of millions of web users, we can build a user browsing graph |
"under legal agreements with them"
... this makes me nervous.
Because I'm one of those people that skim over all the Terms and Conditions and click "I Agree" at the end. Was I unknowingly one of the users in that "extremely large group of users"?
It's not clear whether their "extremely large" group of users is the same as the "hundreds of millions of web users" mentioned in the following sentence. "Hundreds of millions" kind of implies "everyone", like maybe everyone using IE, Hotmail, Windows Live Messenger, etc.
|... kind of implies "everyone", like maybe everyone using IE, Hotmail, Windows Live Messenger, etc. |
Discounting all Apple users really hampers their research.
Read through the actual white paper. The middle section is all about algorithms and equations, so unless you're interested in the math, you can skip it. But later on they do reveal that their data was gathered from users of "a commercial browser", and their data set was truly enormous.
I SMELL A SPYBOT AND I THINK ITS NAME IS INTERNET EXPLORER
I'd love to get some confirmation of this suspicion...
I finished reading through it - all of it. The BrowseRank algorithm is a thing of beauty, and their methods are brilliant. This may not rock the world, but it may finally give Microsoft a pretty decent search engine.
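For anyone who skipped the math, here is a rough Python sketch of what ranking pages from a "user browsing graph" could look like. It is a simplified illustration and not the paper's actual formulation. The function name, the (url, seconds) session format, the damping value, and the choice to weight each page by its average time on page are all assumptions made for this example.

```python
from collections import defaultdict


def browse_rank_sketch(sessions, damping=0.85, iterations=50):
    """Toy ranking from browsing sessions (hypothetical, simplified).

    sessions: list of sessions, each a list of (url, seconds_on_page).
    Returns a dict mapping url -> score, with scores summing to 1.
    """
    transitions = defaultdict(lambda: defaultdict(int))  # url -> next url -> count
    dwell_total = defaultdict(float)  # total seconds observed on each url
    visits = defaultdict(int)         # number of visits to each url
    starts = defaultdict(int)         # how often a session began at each url

    for session in sessions:
        if not session:
            continue
        starts[session[0][0]] += 1
        for i, (url, seconds) in enumerate(session):
            dwell_total[url] += seconds
            visits[url] += 1
            if i + 1 < len(session):
                transitions[url][session[i + 1][0]] += 1

    pages = sorted(visits)
    if not pages:
        return {}

    # Users "reset" to pages in proportion to how often sessions start there.
    total_starts = sum(starts.values())
    reset = {p: starts[p] / total_starts for p in pages}

    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = transitions.get(p)
            out_total = sum(out.values()) if out else 0
            # Pages where sessions always end send all their weight to the reset.
            follow = damping if out_total else 0.0
            if out_total:
                for q, count in out.items():
                    nxt[q] += follow * score[p] * count / out_total
            for q in pages:
                nxt[q] += (1.0 - follow) * score[p] * reset[q]
        score = nxt

    # Assumption: scale each page's visit share by its mean time on page.
    weighted = {p: score[p] * dwell_total[p] / visits[p] for p in pages}
    total = sum(weighted.values())
    if total == 0:
        return score
    return {p: w / total for p, w in weighted.items()}


# Made-up sessions: two users linger on "good", one bails out of "spam".
sessions = [
    [("hub", 10), ("good", 120)],
    [("hub", 10), ("good", 90)],
    [("hub", 10), ("spam", 2)],
]
ranks = browse_rank_sketch(sessions)
```

In this toy data "good" ends up ahead of "spam" both because more people follow the link to it and because they stay longer, which is the kind of signal a plain link graph cannot see.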
As for the spying, I suspected Google of doing this with their toolbar a couple of years ago, but I never found evidence. My reasoning was highly conspiratorial, in seven points:
1) it's possible
2) collectively, they are very smart
3) a smart person would figure this out
4) it would make their SERPs more relevant
5) they would benefit from it
6) they have the means to do it
7) no one would know
If my suspicions are correct, Microsoft has that IE browser doing their spying and sending session behaviour data back to their data centers, which gives them vastly more reach than the limited # of people running the Googlebar. (And significantly higher adoption than Alexa, Stumble, and other toolbars)
So where'd they get the data?
|We used a user behavior dataset, collected from the World Wide Web by a commercial search engine in the experiments. All possible privacy information was rigorously filtered out and the data was sampled and cleaned to remove bias as much as possible. There are in total over 3-billion records, and among them there are 950-million unique URLs. |
|we also obtained a large dataset from the same search engine, containing 8000 queries and their associated webpages. |
The data they use seems to consist of session requests, sort of like server log files. But if they are using IE to spy on people, they can get more than merely a log of HTTP requests. Once you start snooping in and recording people's browsing sessions, why stop there? Surely you'd glean interesting data from other browser behaviour, such as:
1) time spent with the browser window or tab focused
2) keystrokes per page
3) on-page interaction, like interaction with Flash or Media players
4) mouseovers, mouseouts, focuses and blurs
5) pages people put in their Favourites or Bookmarks
6) words people enter into forms
7) pages that are open simultaneously in tabs
8) sites that people tend to keep open in a tab all day
9) pages that do a lot of AJAX async requests
10) pages hiding behind authentication
11) names of people you know
12) your address, phone number, shoe size, bank account balance, sexual fantasy preferences...
need I go on?
|The BrowseRank algorithm is a thing of beauty, and their methods are brilliant. |
Human behaviour analysis, in my opinion, is the right way for SEs to go. Whilst SEs have some extremely bright people doing incredibly clever things with automated processes, it just isn't possible for those processes to get close to real human behaviour.
Will be a very interesting one to watch.
I can see the privacy concerns, but I do find the technology fascinating at the same time. The engineer in me wants to see it in action, another part of me finds it all a little scary...
|limited # of people running the Googlebar |
Google Search History- Preferences- Accounts- Gmail cookies, Analytics, AdSense, DoubleClick data, +stats of MySpace, YouTube, AOL, whoknowswhatelse... might be a colorful patchwork but their set of user behavior data is world class ( #1 as of current ). They're just not using it ( yet / to full potential ) on organic search.
Haven't they been using it for AdWords quality and relevancy checks with success? Basically, apart from phrase-based filtering and regionalization, that's all they do to rank ads: watch what users do, analyze, react ( unless your business model is unwanted, your bid price, placement, quality scores... all depend on historic user behavior data ).
Of course the dataset is nowhere near as large or as 'interesting' as it would be if *everyone* using IE were contributing to the database.
I like that list above... would be interesting if *I* knew all this.
Not sure if I want MS to know *heh*...
Either way, as long as MS keeps spying only on things they could learn / use legally 'if they owned every website on the net'... it's fine I guess. And for that purpose ... not sure about this but...
how much more data would they need apart from what their Phishing filters dial back home with...?
Outstanding analysis httpwebwitch. Regarding whether or not using human behavior is superior to PageRank, I can only hope that it is nothing more than one factor (significant though it may be) out of a huge number of factors. IMO Google for too long placed too much emphasis on PR, and I'd not want MS to do the same with BR.
And why is that? Because human behavior may not -- in and of itself -- be the best gauge as to what is best. Two old quotes come to mind:
"No one ever went broke underestimating the intelligence of the American people." ~ H. L. Mencken
"People can easily be persuaded to accept the most inferior ideas or useless products." ~ Bartleby
We'll see how it plays out. Given how poorly MSN/LIVE has performed so far, they HAVE to do something, and this seems at first glance like a positive step in the right direction.
Paranoia...thy name is Webmaster...KF
|And why is that? Because human behavior may not -- in and of itself -- be the best gauge as to what is best. |
I 100% agree with that comment, Reno. This could never be used as the only method to power a top search engine. New content is a good example - how do you get that to fit into the results? Just because no one's been there yet doesn't mean it's not good.
Microsoft may be onto a good thing, and I'm looking forward to seeing if this does deliver better results, but they still need to improve their general relevancy and linguistic intelligence to compete with Google.
I thought Microsoft had done something similar, since their results have tracked changes in G and appeared to be influenced by other changes in traffic flow. I figured they were mining MSN ISP data to accomplish this. Maybe now they're looking at ISP partner data and Hotmail data too.
Regardless, BrowseRank sounds like a nice refinement as it would implicitly include measures of visibility/prominence, pertinence and attractiveness of links, besides just their existence.
Hm, this is interesting indeed. I do some searches, click on a site and IMMEDIATELY push the back button on my browser when I realize the site is just junk/ads/spam...I could see how tracking that could help better the results. All of it would have to be ANONYMOUS for sure though (this is very good) because you would have all the idiot spammers/scammers (who sit all day spamming/scamming with nothing else to do) voting/ranking their garbage. I'm sure they will figure out a way to do this anonymously soon, no doubt. An interesting idea though to be sure. At LEAST MSN is thinking while Yahoo! is busy "digging for gold," I guess...LOL.
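The "click, then immediately hit back" signal described above is easy to picture as code. The Python sketch below is a guess at how such a signal could be computed from anonymous click records, not anything Microsoft has described. The function name, the (url, seconds) record format, and the 5-second threshold are all hypothetical.

```python
from collections import defaultdict


def quick_bounce_rates(clicks, threshold_seconds=5.0):
    """Share of result clicks that were followed by a fast "back".

    clicks: iterable of (url, seconds_until_back), where seconds_until_back
    is None if the user never returned to the results page.
    The threshold is an arbitrary assumption for illustration.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for url, seconds in clicks:
        totals[url] += 1
        if seconds is not None and seconds < threshold_seconds:
            bounces[url] += 1
    return {url: bounces[url] / totals[url] for url in totals}


# Made-up click records with no user identifiers attached.
clicks = [
    ("junk.example", 1.5),
    ("junk.example", 2.0),
    ("junk.example", 40.0),
    ("useful.example", 95.0),
    ("useful.example", None),
]
rates = quick_bounce_rates(clicks)
```

A page with a high quick-bounce rate could then be nudged down, although a record format this simple would be trivial for spammers to game, which is the worry raised in the post.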
It's a good principle. I know a few black hats that will love it - but not so many that will have the resources to get around it.
It's also something that other "would be" search engines would have trouble emulating, because not many organisations would be able to get enough link-walking data to do it.
I like it. Quite a lot.