Found this one this morning, and at the top of the page it says the author is now at Google...
I'm going to read it this afternoon...I'll post my thoughts on it after that. In the meantime, if anybody has anything on this one already, chime in.
Cheers,
Jeremy, the Artist Formerly known as Han Solo
Although, as we have stated many times, none of the research papers discussed here will give us a "blueprint" of the current algorithms, I believe that much of Mr. Bharat's work gives us a good handle on how information retrieval will evolve. Put plainly, I believe that through Mr. Bharat we can take a peek at the future and set in motion the processes that will ensure long-term success in our chosen field. With that in mind, let me pick up a couple of points contained within the Hilltop document:
"We felt that an expert page needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated pages on the subject. Therefore, in order to find the experts, we needed to detect when two sites belong to the same or related organizations.... We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same"
No reading between the lines required there!
"To locate expert pages that match user queries we create an inverted index to map keywords to experts on which they occur. In doing so we only index text contained within "key phrases" of the expert. A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text. URLs located within the scope of a phrase are said to be "qualified" by it. For example, the title, headings (e.g., text within a pair of <H1> </H1> tags) and anchor text within the expert page are considered key phrases. The title has a scope that qualifies all URLs in the document. A heading's scope qualifies all URLs until the next heading of the same or greater importance. An anchor's scope only extends over the URL it is associated with. "
The emphasis is mine, but I think that it clearly shows the move away from the "simple" indexing methods of recent times. In a perverse way, as the use of "off the page criteria" in ranking a page becomes increasingly important, the "on the page criteria" of those sites that link to you become more important.
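To make the scope rules in the quoted passage concrete, here is a minimal Python sketch. It is not taken from the paper. It assumes a page has already been reduced to a list of (kind, text, url) tuples in document order; that representation, the function names qualified_urls and build_index, and the example.* URLs are all made up for illustration.

```python
from collections import defaultdict

def qualified_urls(elements):
    """Return (kind, text, urls) for each key phrase on one expert page.

    elements: list of (kind, text, url) in document order, where kind is
    'title', 'h1'..'h6', or 'a' (hypothetical pre-parsed form of the page).
    """
    phrases = []
    title_urls = []      # the title qualifies every URL in the document
    open_headings = []   # (level, urls) for headings whose scope is still open
    for kind, text, url in elements:
        if kind == "title":
            phrases.append((kind, text, title_urls))
        elif kind == "a":
            title_urls.append(url)
            for _, urls in open_headings:
                urls.append(url)
            # an anchor only qualifies its own URL
            phrases.append((kind, text, [url]))
        else:
            level = int(kind[1:])
            # a heading of the same or greater importance closes earlier scopes
            open_headings = [(lvl, u) for lvl, u in open_headings if lvl < level]
            urls = []
            open_headings.append((level, urls))
            phrases.append((kind, text, urls))
    return phrases

def build_index(experts):
    """Inverted index: keyword -> list of (expert_url, phrase kind, qualified URLs)."""
    index = defaultdict(list)
    for expert_url, elements in experts.items():
        for kind, text, urls in qualified_urls(elements):
            for word in set(text.lower().split()):
                index[word].append((expert_url, kind, urls))
    return index

# Hypothetical expert page with two sections
page = [
    ("title", "Linux resources", None),
    ("h1", "Distributions", None),
    ("a", "Debian home", "http://debian.example/"),
    ("h1", "Editors", None),
    ("a", "Vim", "http://vim.example/"),
]
index = build_index({"http://expert.example/": page})
```

Looking up index["editors"] then gives the heading phrase together with only the URLs under that heading, while index["linux"] gives the title phrase with every URL on the page.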
More of Krishna Bharat's work can be found here [research.compaq.com].
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same
That gave me some pause, as how random are web pages, really? If you start thinking about the math you would need to describe the interrelations between various websites, and then start mapping things by IP address and comparing those sums to the class C blocks allocated to some companies... you could find some very, very interesting similarities, as I'm sure they did.
Chaos theory, for example, says that 95% of everything is predictable... and the other 5% is unpredictable by any laws of physics... at least, I believe I'm summarizing correctly. Given this, and the very nature of the human-designed aspect of mapping domain names to IP addresses, it makes perfect sense that some academic would come up and publicly decry this.
And it fits with the "Anatomy of a Large-Scale Hypertextual Web Search Engine" paper, where the Google founders expressly mention companies that try to manipulate search engines.
Now, though, if we all band together around topic A, interlink, and decide to design our pages in a similar fashion, we can cheat Google all we want.
Or am I reading that wrong? So what I do is host my sites for topic A on five different servers, with 5 different domains, and interlink them, designing all 5 with roughly the same layout, per Brett's excellent post on themes in sites. This bears some thought...I think I may have found a creative use for the freely available data dump I've been playing with :)
Thanks for the further link. I didn't see any topics there that immediately sparked my interest, but at some point, I'm sure I'll try and dig through them.
Cheers,
Jeremy, The Artist formerly known as Han Solo
This is useful - Topic distillation ... Inktomi right?
There is value for the individual or organization that creates resource lists on specific topics since this boosts their popularity and influence within the community interested in the topic. The authors of these lists thus have an incentive to make their lists as comprehensive and up to date as possible. We regard these links as recommendations, and the pages that contain them, as experts.
Analyze the links to the top sites and get links from those sources at whatever the cost I guess. Getting tougher to manipulate results...
This is based on the assumption that the title text is more useful than the heading text, which is more useful than an anchor text match in determining what the expert page is about.
--Seems to be very inconsistent on the WWW and maybe not a good assumption, but it seems to be working for them. Alta don't look so hot in those charts...
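As a rough illustration of that ordering assumption (title over heading over anchor text), here is a small Python sketch. The weights and the phrase_score function are placeholders invented for this example; they only respect the ordering described in the quote and are not the paper's actual values or formula.

```python
# Hypothetical weights: only the ordering title > heading > anchor comes from the quote.
LEVEL_WEIGHT = {"title": 3.0, "heading": 2.0, "anchor": 1.0}

def phrase_score(kind, phrase, query_terms):
    """Score one key phrase: level weight times the fraction of its words
    that appear in the (lowercased) set of query terms."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    matched = sum(1 for w in words if w in query_terms)
    return LEVEL_WEIGHT[kind] * matched / len(words)

query = {"linux"}
title_score = phrase_score("title", "Linux editors", query)
heading_score = phrase_score("heading", "Linux editors", query)
anchor_score = phrase_score("anchor", "Linux editors", query)
```

With identical text, the title match scores highest and the anchor match lowest, which is exactly the assumption being questioned above: if pages use titles and headings inconsistently, these weights reward the wrong phrases.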
This system makes it necessary to tightly screen who you solicit links from.
Or are you quoting from the hilltop paper something I didn't read yet?
Cheers,
Jeremy, aka, the Artist formerly known as Han Solo
If you've followed some of the things that Google does, as far as filtering especially... Hilltop makes sense, in a few ways, as far as how they are clustering their data.
PageRank is too limited by itself, and doesn't provide the algorithmic or mathematical basis by which to sort the data. It's just an amalgamation of hyperlinks, and then deleting those which are obvious spam...and throwing the text into word sorts, and using bolded, header, etc. text for extra point value.
That part of their algo is pretty easy, in my view. However, if you can recreate your own "hilltop", my thought is that you can pretty much go Google to your heart's content.
What I mean to say is that Hilltop is a good way of going beyond what PageRank does to achieve a more organized data set for user queries. PageRank could be reverse engineered with a simple goal of creating your own DMOZ, and then creating the mass of links towards it that would qualify it as a good random surfer starting point, and from your personally created DMOZ you could point your new found link weight towards the domains you want promoted, and there you go. "Instant" PageRank.
This is one of my favorite parts of this business. ;)
Cheers,
Jeremy, The artist formerly known as Han Solo
If you compare the methods referenced in section 1.1 Related Work with the graphs, you'll notice that the directories are missing, but you're left with 3 engines and 3 methods (besides hilltop). DirectHit and Google are covered in both sections. That would seem to imply that the two leftovers, AV and Topic Distillation (term vector database), would also be a match. Although talk about AV's use of the TVD is nothing new, this seems to add a little more validity to the theory.
2.1 Detecting Host Affiliation
We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
We consider tokens to be substrings of the hostname delimited by "." (period). A suffix of the hostname is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., ".com" and ".co.uk" are domain names that occur in a large number of hosts and are hence generic suffixes. Given two hosts, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.
This was the only part of that paper that caught my attention.
For example:
cheap-computers.com
discount-computers.com
If that's right, then it seems to mean that a site whose domain name matches yours in its 2nd half will be treated as affiliated with you, so links from it would count as affiliated links (and of course the same goes for your own multiple domains that are similar in their 2nd parts).
Or am I misreading?
...would all be considered affiliated, while...
....would not be considered affiliated. Hyphens don't count. They're talking about the domain name sections as separated by periods.
This also means that everyone with a regular free hosting account (www.myhost.com/~mysite/) or "subdomain hosting" website (mysite.myhost.com) would be considered affiliated with everyone else on that host's domain...
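Here is a minimal Python sketch of the two affiliation tests as quoted from section 2.1, just to check the examples in this thread. The GENERIC_SUFFIXES set is a hand-picked assumption (the paper derives generic suffixes from how many distinct hosts they occur in), and the function names and IP addresses are made up for illustration.

```python
# Assumption: a tiny hand-picked list; the paper finds generic suffixes statistically.
GENERIC_SUFFIXES = {"com", "net", "org", "co.uk"}

def rightmost_non_generic_token(hostname):
    """Strip the longest generic suffix, then return the right-most remaining token."""
    tokens = hostname.lower().strip(".").split(".")
    # try the longest candidate suffix first, always leaving at least one token
    for n in range(len(tokens) - 1, 0, -1):
        if ".".join(tokens[-n:]) in GENERIC_SUFFIXES:
            tokens = tokens[:-n]
            break
    return tokens[-1]

def same_first_three_octets(ip_a, ip_b):
    """True if two dotted-quad IPv4 addresses share their first 3 octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]

def affiliated(host_a, ip_a, host_b, ip_b):
    """Affiliated if one or both of the quoted conditions hold."""
    return (same_first_three_octets(ip_a, ip_b)
            or rightmost_non_generic_token(host_a) == rightmost_non_generic_token(host_b))

# Subdomain hosting: both hosts reduce to the token "myhost"
subdomain_case = affiliated("mysite.myhost.com", "192.0.2.10",
                            "othersite.myhost.com", "198.51.100.7")
# Hyphens are not delimiters, so these tokens differ
hyphen_case = affiliated("cheap-computers.com", "192.0.2.10",
                         "discount-computers.com", "198.51.100.7")
```

Under these assumptions subdomain_case is True and hyphen_case is False, which matches the reading above: hyphens don't split tokens, but everything on one subdomain host shares the same right-most token. Path-based free hosting (www.myhost.com/~mysite/) is the same hostname for every customer, so it trivially falls under the same rule.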
LM-"think like a school of fish".
The best profs I ever had would not tell you the answer, they would point you in the direction of the answer. The prof would also provide you with the knowledge to be able to find the answer.
Brett has given so many "clues" as to where to look, you should be able to say, I already know.
This is also why Google has a problem ranking "Free Sites"(hint).
Given that this paper is "at least 3 years old" they are a little slow implementing things by one of their better engineers, IMHO. Academic work is only important once there is a commercial application...till then, it gets dusty, moldy, and perhaps is part of a piece of paper one might hang in an office (think 'stanford degree' :) ).
So this being the case time wise, hmmm...
You might note one of the citations at the bottom:
"[Chakrabarti et al 99] S. Chakrabarti, M. van den Berg and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In the 8th World Wide Web Conference, Toronto, May 1999. [cs.berkeley.edu...]" source [cs.toronto.edu...]
That they mention '99 in the paper means it can't be "at least 3 years old", I think.
hmmm what do you mean?
for one of my keywords (100k results)
the no1 site url is a members.aol.com/xxxxxxx site
i was thinking of creating a hub site by setting up a load of sites on free isp space and pointing them to my hub... would i be wasting my time???? i thought they would inherit some of the isp's high rank, have i missed the point?