Anybody read about the hilltop algorithm before?


jeremy goodrich

3:42 pm on Mar 27, 2001 (gmt 0)

[cs.toronto.edu...]

Found this one this morning, and at the top of the page, it says the author is now at google...

I'm going to read it this afternoon...I'll post my thoughts on it after that. In the meantime, if anybody has anything on this one already, chime in.

Cheers,

Jeremy, the Artist Formerly known as Han Solo

JamesR

5:09 pm on Mar 27, 2001 (gmt 0)

Jeremy, you decloaked your nick! Thanks for the post, I'll try to get to that article a little later also.

NFFC

7:52 pm on Mar 27, 2001 (gmt 0)

Jeremy, nice link. In my book Krishna Bharat sits right at the leading edge of current ranking theory; he is a must read!

Although, as we have stated many times, none of the research papers discussed here will give us a "blueprint" of the current algorithms, I believe that much of Mr. Bharat's work gives us a good handle on how information retrieval will evolve. Put plainly, I believe that through Mr. Bharat we can take a peek at the future and set in motion the processes that will ensure long-term success in our chosen field. With that in mind, let me pick up a couple of points contained within the Hilltop document:

"We felt that an expert page needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated pages on the subject. Therefore, in order to find the experts, we needed to detect when two sites belong to the same or related organizations....We define two hosts as affiliated if one or both of the following is true:

They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same"

No reading between the lines required there!

"To locate expert pages that match user queries we create an inverted index to map keywords to experts on which they occur. In doing so we only index text contained within "key phrases" of the expert. A key phrase is a piece of text that qualifies one or more URLs in the page. Every key phrase has a scope within the document text. URLs located within the scope of a phrase are said to be "qualified" by it. For example, the title, headings (e.g., text within a pair of <H1> </H1> tags) and anchor text within the expert page are considered key phrases. The title has a scope that qualifies all URLs in the document. A heading's scope qualifies all URLs until the next heading of the same or greater importance. An anchor's scope only extends over the URL it is associated with. "

The emphasis is mine, but I think that it clearly shows the move away from the "simple" indexing methods of recent times. In a perverse way, as the use of "off the page criteria" in ranking a page becomes increasingly important, the "on the page criteria" of those sites that link to you become more important.
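To make the scoping rules in that quote concrete, here is a rough sketch of how title, heading and anchor text might be attached to the URLs they "qualify". It is only an illustration of the quoted description, not the paper's implementation: the class name KeyPhraseParser and the function qualified_urls are made up for this example, and nested key phrases (a link inside a heading) are not handled.

```python
from html.parser import HTMLParser


class KeyPhraseParser(HTMLParser):
    """Hypothetical helper: collects key phrases and the URLs in their scope."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.active_headings = []  # (level, text) pairs whose scope is still open
        self.links = []            # (url, [phrases]) in document order
        self._capture = None       # tag whose text is currently being collected
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._capture, self._text = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._capture, self._text, self._href = "a", [], href

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._text).split())
        self._capture = None
        if tag == "title":
            self.title = text
        elif tag in self.HEADINGS:
            level = self.HEADINGS[tag]
            # A heading's scope ends at the next heading of the same or
            # greater importance, so drop every open heading at this level
            # or deeper before opening the new one.
            self.active_headings = [h for h in self.active_headings if h[0] < level]
            self.active_headings.append((level, text))
        else:
            # Anchor text only qualifies the URL it is attached to.
            phrases = [h[1] for h in self.active_headings] + [text]
            self.links.append((self._href, phrases))


def qualified_urls(html):
    """Map each URL to the key phrases that qualify it (title first)."""
    parser = KeyPhraseParser()
    parser.feed(html)
    parser.close()
    # The title's scope covers every URL in the document.
    return {url: [parser.title] + phrases for url, phrases in parser.links}


page = """<html><head><title>Widget Resources</title></head><body>
<h1>Blue Widgets</h1>
<a href="http://a.example/">Acme blue</a>
<h2>Reviews</h2>
<a href="http://b.example/">Blue review</a>
<h1>Red Widgets</h1>
<a href="http://c.example/">Red shop</a>
</body></html>"""

for url, phrases in qualified_urls(page).items():
    print(url, phrases)
```

In this made-up page the second link picks up both "Blue Widgets" and "Reviews", while the third only picks up "Red Widgets", because the second H1 closes the scope of the earlier headings.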

More of Krishna Bharat's work can be found here [research.compaq.com].

jeremy goodrich

9:49 pm on Mar 27, 2001 (gmt 0)

Thanks for the detailed analysis. When I posted this, I emailed Brett a question, and ironically, you highlighted some of the same text:

They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same

That gave me some pause: how random are web pages, really? If you started thinking about the math you would need to describe the interrelations between various websites, then mapped things by IP address and compared those sums to the c classes allocated to some companies...you could find some very, very interesting similarities, as I'm sure they did.

Chaos theory, for example, says that 95% of everything is predictable...and the other 5% is unpredictable, by any laws of physics...at least, I believe I'm summarizing correctly. Given this, and the very nature of the human designed aspect of mapping domain names to ip addresses, it makes perfect sense that some academic would come along and publicly decry this.

And it fits with the anatomy of a large scale engine paper, where the google founders expressly mention companies that try to manipulate search engines.

Now, though, if we all band together around topic A, interlink, and decide to design our pages in a similar fashion, we can cheat google all we want.

Or am I reading that wrong? So what I do is host my sites for topic A on five different servers, with 5 different domains, and interlink them, designing all 5 with roughly the same layout, per Brett's excellent post on themes in sites. This bears some thought...I think I may have found a creative use for the freely available data dump I've been playing with :)

Thanks for the further link. I didn't see any topics there that immediately sparked my interest, but at some point, I'm sure I'll try and dig through them.

Cheers,

Jeremy, The Artist formerly known as Han Solo

JamesR

11:20 pm on Mar 27, 2001 (gmt 0)

Some thoughts...

This is useful - Topic distillation ... Inktomi right?

There is value for the individual or organization that creates resource lists on specific topics since this boosts their popularity and influence within the community interested in the topic. The authors of these lists thus have an incentive to make their lists as comprehensive and up to date as possible. We regard these links as recommendations, and the pages that contain them, as experts.

Analyze the links to the top sites and get links from those sources at whatever the cost I guess. Getting tougher to manipulate results...

This is based on the assumption that the title text is more useful than the heading text, which is more useful than an anchor text match in determining what the expert page is about.
--Seems to be very inconsistent on the WWW and maybe not a good assumption but it seems to be working for them. Alta don't look so hot in those charts...
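As a toy illustration of that ordering, here is one way a match could be weighted by the kind of key phrase it occurs in. The weights and the names LEVEL_WEIGHT and phrase_score are invented for this sketch; only the ordering (title over heading over anchor) comes from the quoted passage, and the paper's actual scoring is more involved than this.

```python
# Hypothetical weights: only the ordering title > heading > anchor is taken
# from the quoted passage, the numbers themselves are made up.
LEVEL_WEIGHT = {"title": 3, "heading": 2, "anchor": 1}


def phrase_score(query_terms, key_phrases):
    """Sum the weights of the key phrases that contain every query term."""
    terms = {t.lower() for t in query_terms}
    score = 0
    for kind, text in key_phrases:
        words = set(text.lower().split())
        if terms <= words:
            score += LEVEL_WEIGHT[kind]
    return score


expert_a = [("title", "Blue widgets guide"), ("anchor", "cheap blue widgets")]
expert_b = [("anchor", "blue widgets store")]

print(phrase_score(["blue", "widgets"], expert_a))
print(phrase_score(["blue", "widgets"], expert_b))
```

Under these assumed weights a page that matches in its title outranks one that only matches in a single anchor, which is the behaviour the assumption implies.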

This system makes it necessary to tightly screen who you solicit links from.

jeremy goodrich

1:58 pm on Mar 28, 2001 (gmt 0)

James, are you quoting the hilltop paper, or another one? I confess, talking about the google algorithm, as the hilltop paper does, and then jumping into Inktomi like that, I'm not following at all.

Or are you quoting from the hilltop paper something I didn't read yet?

Cheers,

Jeremy, aka, the Artist formerly known as Han Solo

JamesR

4:58 pm on Mar 28, 2001 (gmt 0)

Sorry Jeremy, it is the Hilltop paper, section 1.1 on current connectivity methods. The topic distillation and Kleinberg work mentioned there is, I believe, what Inktomi is currently using. Anyone please correct me if I am wrong. The rest of what I wrote had to do with the actual Hilltop algorithm. Also, as NFFC mentioned, the Hilltop algorithm isn't necessarily Google, just a theory paper by one of the engineers. Google currently uses PageRank, one of the connectivity methods the paper compares against Hilltop. Bharat is just trying to improve upon the current approaches.

jeremy goodrich

5:14 pm on Mar 28, 2001 (gmt 0)

Thanks for the clarification.

If you've followed some of the things that google does, as far as filtering especially...hilltop makes sense, in a few ways, as far as how they are clustering their data.

PageRank is too limited by itself, and doesn't provide the algorithmic or mathematical basis by which to sort the data. It's just an amalgamation of hyperlinks, and then deleting those which are obvious spam...and throwing the text into word sorts, and using bolded, header, etc. text for extra point value.

That part of their algo is pretty easy, in my view. However, if you can recreate your own "hilltop", my thought is that you can pretty much go google to your heart's content.

What I mean to say is that Hilltop is a good way of going beyond what PageRank does to achieve a more organized data set for user queries. PageRank could be reverse engineered with a simple goal of creating your own DMOZ, and then creating the mass of links towards it that would qualify it as a good random surfer starting point, and from your personally created DMOZ you could point your new found link weight towards the domains you want promoted, and there you go. "Instant" PageRank.

This is one of my favorite parts of this business. ;)

Cheers,

Jeremy, The artist formerly known as Han Solo

seth_wilde

6:02 am on Mar 29, 2001 (gmt 0)

Topic distillation is part of the term vector database (section 3.2) [www9.org] which is another paper Bharat co-wrote.

If you compare the methods referenced in section 1.1 Related Work with the graphs, you'll notice that the directories are missing, but you're left with 3 engines and 3 methods (besides hilltop). DirectHit and Google are covered in both sections. That would seem to imply that the two leftovers, AV and Topic Distillation (term vector database), would also be a match. Although talk about AV's use of the TVD is nothing new, this seems to add a little more validity to the theory.

Brett_Tabke

12:00 am on Jul 4, 2001 (gmt 0)

hmm, explain how this is any different than Hubs and Authorities, or different from what Google has been doing for three years already? Google has been spidering directories to death since day one. Talk to anyone who runs an independent directory and you'll find someone quite po'd at Google's link raiding. This is where the whole idea of Google as the strip miners of the web came from.

toolman

1:39 am on Jul 4, 2001 (gmt 0)

2.1 Detecting Host Affiliation
We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
We consider tokens to be substrings of the hostname delimited by "." (period). A suffix of the hostname is considered generic if it is a sequence of tokens that occur in a large number of distinct hosts. E.g., ".com" and ".co.uk" are domain names that occur in a large number of hosts and are hence generic suffixes. Given two hosts, if the generic suffix in each case is removed and the subsequent right-most token is the same, we consider them to be affiliated.

This was the only part of that paper that caught my attention.

jilla

12:14 am on Jul 7, 2001 (gmt 0)

In terms of what TOOLMAN quoted, in the 2nd point, does that mean if you have 2 different domains with unique IP numbers but IF they have the same word before the com that they are considered affiliated:

For example:

cheap-computers.com
discount-computers.com

If that's right, then it seems to mean that a site whose domain name has the same 2nd half as yours will be treated as affiliated when it links to you (and of course so will your own multiple domains that are similar in their 2nd parts).

Or am I misreading?

mivox

12:52 am on Jul 7, 2001 (gmt 0)

"Rightmost non-generic token" means the whole second-level domain. When you "register a domain name" at NetSol or somewhere, you're registering the "rightmost non-generic token" (.com, .org, etc., are all "generic top-level domains"), so...

el-cheapo.computers.com

www.computers.com

super-deluxe.computers.com

...would all be considered affiliated, while...

www.discount-computers.com and

www.cheap-computers.com

....would not be considered affiliated. Hyphens don't count. They're talking about the domain name sections as separated by periods.

This also means that everyone with a regular free hosting account (www.myhost.com/~mysite/) or "subdomain hosting" website (mysite.myhost.com) would be considered affiliated with everyone else on that host's domain...
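The two affiliation rules are simple enough to sketch in a few lines. This is a rough illustration and not the paper's code: the GENERIC_SUFFIXES list is hand-picked here (the paper derives generic suffixes from how many distinct hosts they occur in), the function names are made up, and the IP addresses are documentation-range placeholders.

```python
# Assumption: a small hand-picked set of generic suffixes. The paper treats a
# suffix as generic when it occurs in a large number of distinct hosts.
GENERIC_SUFFIXES = {"com", "org", "net", "co.uk", "ca"}


def rightmost_non_generic_token(hostname):
    """Strip the generic suffix, then return the right-most remaining token."""
    tokens = hostname.lower().split(".")
    # Try the longest candidate suffix first so "co.uk" is removed as a unit.
    for size in range(len(tokens) - 1, 0, -1):
        if ".".join(tokens[-size:]) in GENERIC_SUFFIXES:
            tokens = tokens[:-size]
            break
    return tokens[-1] if tokens else ""


def same_first_three_octets(ip_a, ip_b):
    """True when two dotted-quad IPv4 addresses share their first 3 octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]


def affiliated(host_a, ip_a, host_b, ip_b):
    """Two hosts are affiliated if one or both of the quoted rules hold."""
    return (
        same_first_three_octets(ip_a, ip_b)
        or rightmost_non_generic_token(host_a) == rightmost_non_generic_token(host_b)
    )


# Same second-level domain, unrelated IPs: affiliated by the hostname rule.
print(affiliated("el-cheapo.computers.com", "192.0.2.10",
                 "super-deluxe.computers.com", "198.51.100.7"))

# Different second-level domains, unrelated IPs: not affiliated.
print(affiliated("www.discount-computers.com", "192.0.2.10",
                 "www.cheap-computers.com", "198.51.100.7"))

# Different second-level domains on the same c class: affiliated by the IP rule.
print(affiliated("www.discount-computers.com", "192.0.2.10",
                 "www.cheap-computers.com", "192.0.2.99"))
```

Note that hyphens play no part: tokens are split on periods only, so "discount-computers" and "cheap-computers" are simply two different tokens.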

edited by: mivox

startup

3:12 am on Jul 7, 2001 (gmt 0)

Excuse Me!
"HillTop" is at least 3 years old.
The practical applications of this paper have been given to you in many responses.
EG:
-Separate domains
-Unique IPs
-.com, .co.uk, .ca, or whatever you can add

LM-"think like a school of fish".
The best profs I ever had would not tell you the answer, they would point you in the direction of the answer. The prof would also provide you with the knowledge to be able to find the answer.
Brett has given so many "clues" as to where to look, you should be able to say, I already know.

This is also why Google has a problem ranking "Free Sites"(hint).

jeremy goodrich

3:17 pm on Aug 8, 2001 (gmt 0)

Startup, it is interesting to note that, for example, the c class filtering was not fully in place at google until between Sept and Dec of 2000.

Given that this paper is "at least 3 years old" they are a little slow implementing things by one of their better engineers, IMHO. Academic work is only important once there is a commercial application...till then, it gets dusty, moldy, and perhaps is part of a piece of paper one might hang in an office (think 'stanford degree' :) ).

So this being the case time wise, hmmm...

You might note one of the citations at the bottom

"[Chakrabarti et al 99] S. Chakrabarti, M. van den Berg and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In the 8th World Wide Web Conference, Toronto, May 1999. [cs.berkeley.edu...] " source [cs.toronto.edu...]

That they mention '99 in the paper, means it can't be "at least 3 years old" I think.

toolman

4:31 pm on Aug 8, 2001 (gmt 0)

Speaking of Hilltop...wouldn't it be easy for an engine to categorize sites by dns servers at the host level?

ggrot

5:01 pm on Aug 8, 2001 (gmt 0)

Or administrative contacts in the whois lookup.

click watcher

4:37 pm on Aug 30, 2001 (gmt 0)

>This is also why Google has a problem ranking "Free Sites"(hint).

hmmm what do you mean?

for one of my keywords (100k results)

the no1 site url is a members.aol.com/xxxxxxx site

i was thinking of creating a hub site by setting up a load of sites on free isp space and pointing them to my hub... would i be wasting my time???? i thought they would inherit some of the isp's high rank, have i missed the point?

NFFC

9:46 pm on Mar 7, 2002 (gmt 0)

>wouldn't it be easy for an engine to categorize sites by dns servers at the host level?

Yes. :)

rcjordan

9:50 pm on Mar 7, 2002 (gmt 0)

(Heh! This thread's a year old) Point made.