Current Academic Link Analysis Research

What might they be working on at the 'Plex?

johnser

1:27 am on Oct 9, 2004 (gmt 0)

Been having a look to see what's the current state of research in the whole area of link analysis.

From my travels, it seems that analysis of links in a semantic environment is where it's going.

Other things include Markov chains, block level link analysis (semantic stuff), and Link Analysis Ranking (LAR).

I also found these particularly relevant:
[informatics.indiana.edu...]
[informatics.indiana.edu...]

Have you any thoughts on how links are going to affect SERPs in the future & how the SEs operate?

Oliver Henniges

11:03 am on Oct 9, 2004 (gmt 0)

Wow. I haven't found the time to read that stuff yet. But wow.
Just wanted to make sure this thread is not going to vanish into the "only-one-message" bin.

marin

2:14 pm on Oct 9, 2004 (gmt 0)

Hi Johnser,

Could "block level link analysis" be part of the new MSN preview?

Google seems to focus more on meaning:
"We want to be able to search and find these [entities] and the relationships between them, rather than you typing in the words specifically," Peter Norvig said recently [eweek.com]

BroadProspect

5:56 pm on Oct 9, 2004 (gmt 0)

Great findings. This clustering method (2nd article) can be used for the "related sites".
But we are really interested in how it can be used for determining the SERP ordering...
IMHO, it cannot be used for the first iteration of determining the SERP listings; that will still be done by the regular algo. BUT it may well be that the initial list (say, the first 1000 URLs) can be divided into clusters (a page can belong to multiple clusters). In each cluster the Local Rank can be calculated, and that combined with the PageRank will determine the SERP ordering. The last stage can be removal of pages with too-identical content (as described in the 1st paper).
As a result...
It will give a huge boost to good authority sites and will drop many of the hubs out of the SERPs, and the SERP will look VERY uniformly themed. The only concern is that if a term has multiple meanings, one of them may heavily dominate the initial SERP pages.
/BP
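
A rough sketch of the re-ranking idea described above, in Python. Everything here is an assumption made for illustration: the field names, the `rerank` function, the use of in-cluster inlink counts as a stand-in for "Local Rank", and the simple weighted sum used to combine it with PageRank. The duplicate-content removal stage is left out. It is not any search engine's actual method.

```python
from collections import defaultdict

def rerank(candidates, links, weight=0.5):
    """Re-order an initial result list using a per-cluster local score.

    candidates: list of dicts with 'url', 'pagerank' and 'clusters' (a set of labels).
    links: set of (source_url, target_url) pairs among the candidates.
    weight: hypothetical mixing factor between local score and PageRank.
    """
    # Group URLs by cluster; a page may sit in several clusters.
    members = defaultdict(set)
    for page in candidates:
        for label in page["clusters"]:
            members[label].add(page["url"])

    def local_rank(page):
        # Assumed stand-in for "Local Rank": inlinks from pages in the same
        # cluster, taking the best value over all clusters the page is in.
        best = 0
        for label in page["clusters"]:
            peers = members[label]
            inlinks = sum(
                1 for src, dst in links
                if dst == page["url"] and src in peers and src != dst
            )
            best = max(best, inlinks)
        return best

    local = {page["url"]: local_rank(page) for page in candidates}
    max_local = max(local.values(), default=0) or 1
    max_pr = max((page["pagerank"] for page in candidates), default=0) or 1

    def score(page):
        # Normalise both signals to 0..1 before mixing them.
        return (weight * local[page["url"]] / max_local
                + (1 - weight) * page["pagerank"] / max_pr)

    return sorted(candidates, key=score, reverse=True)

# Made-up example: a hub with the highest PageRank but no in-cluster inlinks.
candidates = [
    {"url": "hub.example", "pagerank": 6, "clusters": {"fruit"}},
    {"url": "authority.example", "pagerank": 5, "clusters": {"fruit"}},
    {"url": "other.example", "pagerank": 3, "clusters": {"fruit", "colour"}},
]
links = {
    ("hub.example", "authority.example"),
    ("other.example", "authority.example"),
}
order = [page["url"] for page in rerank(candidates, links)]
print(order)
```

In this toy data the page that its cluster peers link to moves ahead of the higher-PageRank hub, which is the effect the post predicts.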

cabbie

12:11 am on Oct 10, 2004 (gmt 0)

>>>It will give a huge boost to good authority sites and will drop many of the hubs out of the SERPs, and the SERP will look VERY uniformly themed. The only concern is that if a term has multiple meanings, one of them may heavily dominate the initial SERP pages.<<<

Call me crazy but this is what I think I am seeing.

Oliver Henniges

9:39 am on Oct 10, 2004 (gmt 0)

OK. I skimmed both papers quickly; correct me if I got anything wrong. Of the many things that came to mind while reading, let me point out only a few:

I was always wondering what the 'related:'-search command was good for, since it mainly seemed to mirror odp-structures. Now, with Menczer basing their "semantic-similarity"-coefficient on distance within the odp-tree (which is not a tree), I get the feeling that such issues have been incorporated in some way into the ranking-algos, thereby lowering the impact of backlinks and PR. This would explain why, as pointed out elsewhere, so many dmoz-dupes and pseudo-directories have polluted recent search results on less-competitive terms. If this is the case, I would regard the current state of Google's results as a first approach to such cluster-analysis, imperfect, and to be improved in the near future.

What I found unsatisfactory was that, in my opinion, it was not always clear whether Menczer's analysis refers to domains or single URLs, e.g. whether link-distance refers to domain or URL. For example, it sounded to me as if in the first paper they investigated only the starting index-page of the 100,000 URLs (domains?) found in the odp. But maybe I just didn't understand correctly.

Considering what has been observed so far in other threads about recent shifts in the serps, it seems quite likely that Menczer's content-similarity-coefficient, or a similar measurement presumably based on probabilistic lexical analysis of page-pairs, is now added to the ranking algos and perhaps even TPR. It seems as if content in some manner now influences the way backlinks are valued.

From a linguistic point of view I find it quite problematic that probabilistic issues creep in through the back door 50 years after Chomsky's attacks on Skinner. This can only be an intermediate state, and as soon as possible it should be replaced by integrating analysis of syntactic and semantic features of the natural language of the body-texts right into the html-parsing-algorithms.

Such a shift towards cluster-analysis of lexical distance, odp-structure-distance, link-distance (and maybe even other features), in combination with doc_z's hints on continuous crawls (see [webmasterworld.com...] ), might also indicate that it is a first workaround for the performance and capacity issues of crawling and backlink-evaluating the whole internet, which become more and more problematic.

If analysis of the relevance of a website is from now on performed on such a high level of cluster-analysis, I'd come to a conclusion, which goes well in accordance with other observations:

size DOES matter!

If you promote your own website, make sure you begin to cooperate as soon as possible with those you have so far called your "competitors". Build larger units until your cluster becomes an authority of its own or you will vanish. If you write or promote different websites for different customers make sure you concentrate on topic-related parts of your customers until you - as connecting them all - will be viewed as an authority in your region of the net.

synergy

3:40 pm on Oct 10, 2004 (gmt 0)

Here is an interesting read from Mike Grehan [e-marketing-news.co.uk]

claus

11:41 pm on Oct 10, 2004 (gmt 0)

>> it seems that analysis of links in a semantic environment is where it's going.

I think the good people @ GOOG are learning a lot from other fields than semantics, some very remotely connected to the concept of "search as we know it" eg. behavioral sciences, mechanics, etc. <pun>perhaps even rocket science</pun>

Let's not forget that it's a very academic environment they're working in, so they do actively pursue a lot of research, of which some proportion is applied. Still, it's also very much "hands on", so my best guess when confronted with two ranking methods is: Simplicity wins.

Semantics and linguistics are of course interesting when considering the topic of a page, and i'm sure you can also derive quite complicated algorithms to determine if a page is really about something or just mumbo-jumbo. Still, there are differences: Topic extraction should be orders of magnitude simpler than "natural text" validation, especially as what's natural text on the web would not be natural text in a textbook, newspaper, or post-it note. And then, there are natural differences between info sites, e-shops, artistic sites, entertainment sites, news sites, and the lot.

Where semantics fail is in the search box. Enter one to three words, some perhaps even ambiguous: I enter "orange widgets" meaning widgets for oranges, and i get widgets for apples that are orange coloured. Then again, i might be totally wrong here. I'm quite sure that "they're doing something" here as well. Without that, we would never have gotten the define tool or the calculator.

>> Have you any thoughts on how links are going to affect SERPs in the future & how the SEs operate?

That's a very very ....i'd say extremely broad question. Anyway, that's just semantics, as the saying goes. Personally, i've got a lot of thoughts about this, but most will be off topic for the thread headline, so i'll stick with links, SERPS and the present, as that's a more limited scope.

I think most readers of this forum can agree that "links are not just links", ie. some links are worth more than others. I'm not thinking about high-PR versus low-PR links here, rather it's more along the thread topics of (all fictional):

- "does links from links.html count?"
- "i see pr0n sites in my log stats"
- "does it pay to pay for links"
- "sitewide links vs. one high PR link"
- "are links s*ndb*xed?"
- "my 73,000 backlinks doesn't show anymore"
- "does Amaz*n benefit from aff links?"
- "anybody know a good recip/directory/GB/[insert word here] script?"
- "how to find good link partners?"
- "how to write a javascript link?"
- "what's the current rate for a PR "X" link?"

I'm sure you've seen something similar to the above somewhere close recently. If that massive interest in linking is not a (very very... i'd say extremely) big red flag to a business based on the value of links, well...

So, i think it would be very stupid for the good people @GOOG Corp not to consider making some differences to the way links are treated and assigned importance. I also think.. that is, i don't think, that these people are stupid.

I recall that some months ago i complained publicly in these forums about how easy "it" had become (and i wasn't the only one to do that). There was a period in which anchor text was essentially all it took to get rankings, and it was so obvious that in hindsight i think it was "too obvious" ie. something was brewing in the back office.

I'm probably totally wrong about this, as it would imply that they had a period where something else on the inside had very much focus, and demanded some attention shifts away from serps for a while.

So, to add insult to injury, here's some wild speculation: I think we see the first weak signs that links (and pages/sites that give them as well as receive them) are not treated totally equal these days. Also, i think this trend will continue.

---
Added: Markov chains? That reminds me so much of some of the best classes i had back in the school days. Haven't seen them applied anywhere since then, though.
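
For context, link analysis is one place where Markov chains do get applied: the random-surfer model behind PageRank treats pages as states and links as transitions, and the ranking is the chain's long-run visit probability. Below is a minimal Python sketch of that idea using plain power iteration. The function name, the toy link graph, the damping value of 0.85 and the fixed iteration count are assumptions for illustration, not a description of any production system.

```python
def stationary_distribution(links, damping=0.85, iterations=50):
    """Approximate long-run visit probabilities of a random surfer.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outlinks: spread its weight evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Toy graph: a and b both link to c, and c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = stationary_distribution(toy_links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The probabilities always sum to 1, and in the toy graph the page with the most inlinks ends up on top.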

cabbie

4:22 am on Oct 11, 2004 (gmt 0)

Thanks Claus and Oliver for your perspectives.
I am sure this is going to be mandatory understanding for anyone trying to take shortcuts to search engine success (also known as spam).
I fear that soon it is going to be rocket science to game Google, and I am dreading the day when I am going to have to do some hard legitimate website building to rank in Google. :(

robertito62

5:19 am on Oct 11, 2004 (gmt 0)

> I get the feeling that such issues have been incorporated in some way into the ranking-algos, thereby lowering the impact of backlinks and PR.

mmhhh.. don't think so. From that paper, there are now right backlinks and wrong backlinks. You either have them or you don't. pr is the icing...

Oliver Henniges

7:50 am on Oct 11, 2004 (gmt 0)

> Topic extraction should be orders of magnitude simpler than "natural text" validation,

FACK. Nevertheless, we should keep in the back of our minds that Google claims to work heavily on such issues as automated translation, and I suppose the insights gained in those fields will from now on continuously improve the ranking algos.

> I think we see the first weak signs that links ..are not treated totally equal these days. Also, i think this trend will continue.

Why are you calling this a "wild speculation?"

> simplicity wins

Yo. So I'd regard the "lexical similarity" - issue a good place to start with.

Does anyone know whether any source code implementing that vector space analysis and cosine-similarity calculation is available somewhere on the net? I don't think it would be wasted time to do some empirical research on whether such an algo might explain the reported shifts in ranking.
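
As a starting point for that kind of experiment, here is a minimal Python sketch of vector-space lexical similarity: each page text becomes a vector of term counts, and two pages are compared by the cosine of the angle between their vectors. The tokenizer, the raw term-count weighting and the function names are assumptions; the papers' exact weighting scheme is not reproduced here.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts: lowercase runs of letters and digits."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)

page_a = "Orange widgets for sale: cheap orange widgets"
page_b = "Buy orange widgets online"
page_c = "Apple pie recipes"
print(lexical_similarity(page_a, page_b))
print(lexical_similarity(page_a, page_c))
```

Pages sharing vocabulary score closer to 1, and pages with no terms in common score 0. A real experiment would need stop-word handling and some term weighting on top of this.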

As to the paper synergy pointed to:

Reading about the analogies between growing social networks and internet networks, I immediately had to think of what happens to the brain in the first two years of a child's life. It is a period of massive growth of synapses which almost comes to an end around the second birthday. From then on new links still emerge, but only at the cost of others. I think the internet-linkomania will also come to an end in the next few years.

If Google now switches over to an added lexical analysis, this would perfectly correspond to what in research on first language acquisition is called the "first word spurt", which is said to happen between the 18th and 24th month of life.

Another good place to start, to me, would thus be to swallow the results of the keyword-suggestion tool as long as we can, because I think in the near future it will be faked like the link:-command and TPR.

claus

9:29 pm on Oct 11, 2004 (gmt 0)

>> wild speculation

uhm.. perhaps i exaggerated a little, by adding "wild". Sometimes some people tend to accept speculation as truth because it sounds believable to them, regardless of whether there's proof or not. I try to admit being speculative when i have no proof, but unfortunately i don't always remember to make a point of it. However, speculation rarely occurs without reason - it's just that the reason doesn't always turn out to be what you initially think it is.

Oliver Henniges

9:38 am on Oct 12, 2004 (gmt 0)

To prove any of these issues would mean knowing someone who knows someone (who.. less than six degrees) who works at Google. Or disproving Goedel and reestablishing Hilbert's program. "Quite likely" is more than we can expect, and that's what it is to me.

> Markov chains? .. back in the school days. Haven't seen them applied anywhere since then, though.

I saw them applied in Shea/Wilson's "Illuminatus". Don't spit on the floor. lol. And reading that was the reason why I paid no attention to vector spaces and statistics back in those schooldays, so:

How the hell again do you calculate the cosine value between two vectors in more than three dimensions? I stared at these formulas and felt like an idiot. For dummies, please.
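
For reference, the calculation is the same in any number of dimensions: multiply the vectors component by component and add up the products (the dot product), then divide by the product of the two vector lengths, where a length is the square root of the sum of squared components. A minimal Python sketch follows; the function name and the sample vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors of any dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))      # a1*b1 + a2*b2 + ... + an*bn
    length_a = math.sqrt(sum(x * x for x in a)) # sqrt(a1^2 + ... + an^2)
    length_b = math.sqrt(sum(y * y for y in b))
    if length_a == 0 or length_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (length_a * length_b)

# Five dimensions: dot = 3*4 + 4*3 = 24, both lengths are 5, so 24 / 25.
value = cosine([3, 4, 0, 0, 0], [4, 3, 0, 0, 0])
print(value)

# Vectors with no shared non-zero component are at a right angle.
print(cosine([1, 0, 0, 0], [0, 1, 0, 0]))
```

A result of 1 means the vectors point the same way, and 0 means they have nothing in common.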

Seems as if this thread has been too academic from the start. PDFs are dead ends. What a pity.