Forum Moderators: open
What's left? DMOZ? An idea now fully controlled by Netscape and in turn by AOL. Consider a scenario in which AOL decides that the licence for use of ODP data does not extend to public companies, to companies exceeding a certain amount of commercial traffic, or to any organization which intends to use ODP data for any commercial purpose. Licences for such uses would need to be negotiated separately with AOL/Netscape.
If I were an AOL exec, I would be rubbing my hands with evil lust thinking about Google's exceedingly heavy dependence on ODP data. If I were a Microsoft exec, I would be clandestinely courting AOL to secure a more exclusive arrangement for DMOZ. Unfortunately AOL management seem to be a little "old school" these days. Oh, I am sure Google has something worked out with AOL, but I doubt AOL execs realize just how critical DMOZ is to Google's accuracy in the face of its never-ending anti-SEO war.
This is a huge opportunity for AOL. It would take years for Google to achieve the momentum of DMOZ if it started its own directory effort even if they bought a directory like Looksmart. I wonder how long it will take them to wake up from their slumber.....
What we discovered was that we could find billions of pages on the net by entering as few as 100 URLs.
It has very little to do with crawling. It has everything to do with ranking. There are many search engines out there that crawl hundreds of thousands of pages, but very few can claim the quality or popularity of Google.
I am not just saying DMOZ is important. I am saying it is now critical to Google's quality. Without it, Google's results would likely be very low quality particularly in commercial areas. Commercial searches constitute a very large fraction of all searches and are the foundation of Google's existing business model.
Essentially Google is now a business which is critically dependent on one supplier for providing value to its customers. Anyone who has studied a bit of business valuation theory knows that that puts their business model in a very shaky competitive position. It's a matter of time before AOL, Yahoo and MSN become less blase about that fact.
I actually agreed with most of your first post. You went and blew it a little on your second post, though. The Google filter is specifically targeted at commercial phrases and mom-and-pop commercial web sites. They couldn't care less if a commercial business did not show up in the natural results (unless they are a major advertiser, in which case they would perhaps by chance magically end up with a VERY GOOD DMOZ listing, IMO).
Here's an example. Let's say I wanted to rank the intellectual value of every poster on WebmasterWorld. One way I could do this is by simply drawing up a large network of directed connections between individuals where each connection represents a reply by one poster to another poster. I could then assign values to each connection based on the value of the poster who replied and say that a percentage of this value is transferred to the original poster who is being replied to. After applying some normalization at each step, I could build up a value rating for every poster on WebmasterWorld.
But there is a problem with this model. It is recursively defined and needs starting values. Do I simply give everybody the same initial starting value? If I did, it might turn out that a lot of dumb guys like to hang out with each other here, responding to each other's posts humorously or out of boredom, and members of this dumb-guy group would rise to the top as the most valuable posters. Alternatively, I could ask Brett Tabke and GoogleGuy to identify (in secret of course ;)) a small collection of posters they consider intelligent and knowledgeable on WebmasterWorld. We could assign these people an initial value rating to seed the algorithm described in the previous paragraph. From there we could recursively construct a value rating for all users.
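The seeded version of this model can be sketched in a few lines. This is a toy illustration only: all names (including the seed set) and the damping factor are invented, and the "value transfer" rule is a PageRank-style iteration whose teleportation vector is concentrated on the hand-picked seeds.

```python
# Toy "poster value" ranking: each reply transfers a share of the
# replier's value to the poster being replied to. A hand-picked seed
# set biases the teleportation step, so value flows outward from
# trusted members rather than from a uniform prior.

def seeded_rank(replies, seeds, d=0.85, iters=50):
    """replies: dict mapping replier -> list of posters they replied to."""
    posters = set(replies)
    for targets in replies.values():
        posters.update(targets)
    # Teleportation vector: uniform over the seed set, zero elsewhere.
    v = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in posters}
    rank = dict(v)  # start from the seed distribution
    for _ in range(iters):
        new = {p: (1 - d) * v[p] for p in posters}
        for replier, targets in replies.items():
            if targets:
                share = d * rank[replier] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Poster who replies to nobody: redistribute via seeds.
                for p in posters:
                    new[p] += d * rank[replier] * v[p]
        rank = new
    return rank

replies = {
    "brett": ["alice"],
    "googleguy": ["alice", "bob"],
    "alice": ["bob"],
    "bob": [],
    "dumbguy1": ["dumbguy2"],   # a clique replying only to itself
    "dumbguy2": ["dumbguy1"],
}
r = seeded_rank(replies, seeds={"brett", "googleguy"})
# The self-contained clique accumulates no value: no reply path
# from any seed ever reaches it.
```

The point of the sketch is exactly the one made above: without the seed bias (a uniform teleportation vector), the clique's mutual replies would earn it a respectable rating.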
Herein lies Google's problem. With the number of initial authorities like Yahoo, Looksmart and DMOZ diminishing for them, the likelihood of results being vastly artificially boosted by the popularity of the mere connectedness of the community in which they belong would be very high. Consider for example the large network of teenage gamers out there on the net who have websites and link to each other. Without the appropriate source values, our results could be dominated by the collective opinions of warez kiddies, free porn junkies and Britney Spears fans.
Intellectual property value sources are critical to Google's rankings and Google's serious options for seeding are running out.
The Google filter is specifically targeted at commercial phrases and mom-and-pop commercial web sites. They couldn't care less if a commercial business did not show up in the natural results
Deeper Inside PageRank [webmasterworld.com]
Chapter 8 discusses updating the PageRank vector.
also this recent paper: Combating Web Spam with TrustRank [webmasterworld.com] discusses the selection of "reputable seed pages".
If you skip the formulas there are some interesting things around.
DMOZ or no DMOZ, Google already has plenty of archived seed-selection options for crawling, IMO.
also this recent paper: Combating Web Spam with TrustRank discusses the selection of "reputable seed pages".
Furthermore, check this out: "In order to get rid of the spam quickly, we removed from our list of 25,000 sites all that were not listed in any of the major web directories". More evidence of their heavy reliance on leeching intellectual property value from the major directories.
As for seeds, in this paper the seeds they refer to are only potential page candidates that are likely not to be spam (based on silly assumptions such as the one above), used to kickstart an algorithm for spam detection based on propagation of "non-spaminess" measures. The paper does not discuss PageRank algorithm seeding.
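The propagation mechanism described in the TrustRank paper can be sketched roughly as follows. This is the "trust splitting" variant: each vetted seed starts with trust 1.0, and at every hop a page splits its trust among its out-links with an attenuation factor. The graph, seed choice, and factor 0.85 are all illustrative, not taken from the paper's data.

```python
# Rough sketch of TrustRank-style trust splitting: trust flows out
# from a hand-vetted seed set along links, attenuated per hop, and a
# page keeps the maximum trust offered to it. Pages never reached by
# any seed end with trust 0 and become spam suspects.

def split_trust(out_links, seeds, beta=0.85, hops=3):
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    trust = {p: (1.0 if p in seeds else 0.0) for p in pages}
    for _ in range(hops):
        offered = dict(trust)
        for page, targets in out_links.items():
            if trust[page] > 0 and targets:
                share = beta * trust[page] / len(targets)
                for t in targets:
                    offered[t] = max(offered[t], share)
        trust = offered
    return trust

links = {
    "seed.example": ["shop.example", "blog.example"],
    "shop.example": ["partner.example"],
    "blog.example": [],
    "spam-a.example": ["spam-b.example"],
    "spam-b.example": ["spam-a.example"],
}
t = split_trust(links, seeds={"seed.example"})
# The isolated link-exchange pair receives no trust from the seed.
```

Note this is a spam-detection measure propagated over the graph, not a seeding of the PageRank computation itself, which is exactly the distinction drawn above.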
Great papers though for finding out how they are thinking. Thanks.
[edited by: Bottler at 11:17 am (utc) on April 1, 2004]
DMOZ/ODP editors are unpaid. It is assumed that ALL organisations may utilise its data freely, commercial or otherwise.
Whilst one company may own the hardware, logos, domain name, etc, they do not own the contents. Banning Google from use of ODP data would certainly be immoral, it might even be illegal.
Also, my understanding of PageRank logic may be limited, but I very much doubt that the starting point is critical beyond the first few iterations. If Google dropped ODP data, the only sites to suffer would be those that have few other backlinks. The whole PR calculation would NOT fall apart.
Kaled.
As for seeds, in this paper the seeds they refer to are only potential page candidates that are likely not to be spam (based on silly assumptions such as above) to kickstart an algorithm for spam detection based on propagation of "non-spaminess" measures.
Fascinating. If so, algo changes could be caused by differing selections of 'seed' pages.
AOL is Tom Sawyer, DMOZ editors are the neighborhood kids painting the fence.
The only thing that AOL owes DMOZ editors is a big "Thanks for painting my ($200 million) fence"
but I very much doubt that the starting point is critical beyond the first few iterations.
It's the vector of probabilities of where a user jumps to if she is bored with the current page or there are no links on the page.
It's becoming very clear that Google is increasingly dependent on DMOZ as its intellectual-property leeching source. Remind yourself that Google's PageRank calculation has to start with initial rank vectors ...
Critical to the way PageRank is calculated are respected sources. The PageRank vectors are initially (and in some cases periodically) fed by the value of these initial sources.
If you're referring to the original PageRank algorithm, then the solution of the linear equations (for d<1; d = damping factor) is independent of the initial vector. Moreover, the solution is unique, and even normalization doesn't play a role. (Normalization would only be important for the case d=1, which corresponds to the calculation of eigenvectors.) In practice one uses the vector from the last calculation to speed up the calculations.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
But there is a problem with this model. It is recursively defined and needs starting values. Do I simply give everybody the same initial starting value?
There is no recursive definition. Of course, iteration schemes to calculate the solution of the linear equations are recursive. However, the definition of PR itself isn't recursive.
By the way, in principle you could also calculate PR in one step without an initial guess, as already mentioned here: Google meets Heisenberg [webmasterworld.com]. Of course, in practice this wouldn't work for such a large system.
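The one-step calculation just means solving the linear system (I - d*M) x = ((1-d)/n) * 1 directly instead of iterating. A minimal sketch on an invented 3-page graph, using hand-rolled Gaussian elimination and cross-checked against power iteration:

```python
# Direct (non-iterative) PageRank: solve (I - d*M) x = (1-d)/n * 1.
# No initial guess is involved; the answer matches power iteration.

d, n = 0.85, 3
# Column-stochastic link matrix M[i][j] = 1/outdeg(j) if j links to i.
# Graph: page 0 -> 1,2 ; page 1 -> 2 ; page 2 -> 0.
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

A = [[(1.0 if i == j else 0.0) - d * M[i][j] for j in range(n)] for i in range(n)]
b = [(1 - d) / n] * n

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]; b = b[:]
    m = len(b)
    for col in range(m):                 # forward elimination
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]; b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * m
    for r in reversed(range(m)):         # back substitution
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, m))) / A[r][r]
    return x

x = solve(A, b)

# Cross-check with power iteration: x_{k+1} = (1-d)/n + d*M x_k.
r = [1 / 3] * n
for _ in range(500):
    r = [(1 - d) / n + d * sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
```

For the real web graph (billions of pages) a dense direct solve is of course infeasible, which is the caveat made above.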
The ODP Social Contract forbids this. If AOL tried that, ODP editors could sue them for violating this.
the solution of the linear equations (for d<1; d = damping factor) is independent of the initial vector.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
The ODP Social Contract forbids this. If AOL tried that, ODP editors could sue them for violating this.
Herein lies Google's problem. With the number of initial authorities like Yahoo, Looksmart and DMOZ diminishing for them, the likelihood of results being vastly artificially boosted by the popularity of the mere connectedness of the community in which they belong would be very high. Consider for example the large network of teenage gamers out there on the net who have websites and link to each other. Without the appropriate source values, our results could be dominated by the collective opinions of warez kiddies, free porn junkies and Britney Spears fans
I'm not talking about the initial PageRank vector as used in the power method for the conditioned transition matrix. I am talking about the conditioning vector or what Google refers to lately as the "personalization vector" which is used in every iteration.
Google's personalized PageRank vectors are based on topic-related PR vectors. This model isn't based on authorities, i.e. the weight (the transition probability in the random-surfer model) is topic dependent, but there is no higher weight for qualitatively good pages. Therefore, there is no need for authorities and no dependency on directories. You just have to determine the topic of each web page.
Of course, even in this case there are no recursive definitions, just recursive iteration schemes (the same that can be used for the non-personalized case). And even in this case you can calculate the final PR within one step without iterations.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
Sorry, this is blatantly wrong, but I'm only prepared to debate it offline as this is not the research forum.
I was talking about the initial vector used for iteration schemes to calculate TrustRank/PR (!) such as Jacobian iteration. In this case (which I was referring to) the statement given above is obviously valid.
Google's personalized PageRank vectors are based on topic-related PR vectors
My primary motivation is the strong belief that a Google monopoly is unhealthy for everybody, just as much as a Microsoft one is. It is in all of our interests to see healthy competition in this field.
Google is actively undermining the commerciality of webmasters in order to boost their own corporate profits. Doesn't this begin to smell of misuse of monopolistic authority to anybody here?
Come on - let's not have a technical debate until people do their homework properly and argue honestly. I'm not here to give people math lessons.
LOL. I just find it hilarious that you quote doc_z and follow it up with this.
While I have had disagreements with doc_z, it has almost always been on semantic or opinion issues. He (or she) certainly knows their math, and seems to be quite familiar with all the appropriate papers.
I'm here to provoke discussion about an important issue that affects many people here.
So, is provoking discussion the same as discussing? It doesn't sound as friendly to me.
My primary motivation is the strong belief that a Google monopoly is unhealthy for everybody, just as much as a Microsoft one is. It is in all of our interests to see healthy competition in this field.
Well, that certainly is a "provoking" statement. Perhaps you should have just titled this thread more appropriately rather than getting upset when people actually try to discuss the technical merits of what you post. It would make things a lot less confusing.
If you don't want a technical discussion, then do not post your technical plan of attack as the main topic. Post your political position, which it sounds like what you really want to discuss.
Google is actively undermining the commerciality of webmasters in order to boost their own corporate profits.
And the other big online companies aren't? I think Amazon is a lot closer to monopoly power in a lot of online areas than Google is.
It would take years for Google to achieve the momentum of DMOZ if it started its own directory effort even if they bought a directory like Looksmart.
You can bet that if Google announced it wanted some editors to work on a new Google Directory that you would have 100,000 applications sent within the first 10 seconds even if they weren't willing to compensate you financially.
'Years'? It would take only a few months to get up to Dmoz's level.
I feel that Google would benefit from making its own directory. Looksmart was one of the most corrupt and easily fooled directories I'd ever come across. DMOZ is getting that way as well; I can't even count the number of bribes I was offered to pass duplicate inclusions or to put keywords in the linking text rather than the company name.
And these are based on DMOZ subtopic-based conditioning.
It seems that you're referring to Haveliwala's papers. His ODP-biasing ansatz deals not only with topic-sensitive PageRank but also with an authority system, i.e. most entries of the PageRank vector are zero independent of the topic. However, there is no need to mix these different issues. As long as one is only interested in topic-sensitive PageRank and not in an authority system, you can determine the topic from on-page factors. Therefore, ODP might be important for an authority system but not for purely topic-sensitive PageRank.
By the way, even in Haveliwala's and other Stanford papers there are no recursive definitions, just recursive methods to solve linear equations.
Here is a brief, rough overview for those who are interested in this topic:
- The standard (original) PR algorithm uses a single constant PageRank vector, i.e. the self-contribution of each page is the same. (In the random surfer model this corresponds to a uniformly distributed probability to be teleported to a page.)
- In the authority system only selected sources have a weight, i.e. most of the entries in the PR vector are zero. (Therefore, one is only teleported to a selected set of pages.)
- Topic sensitive PageRank uses a set of vectors. (The transition probability is topic dependent.)
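The three variants above differ only in the teleportation vector v used in x = (1-d)*v + d*M x. A small sketch comparing them on one invented 4-page graph ("dir" standing in for a directory page; all weights illustrative):

```python
# Same PageRank iteration, three teleportation vectors:
#   uniform   - classic PR, every page has the same self-contribution
#   authority - mass only on a selected source ("dir"), rest zero
#   topic     - mass spread over pages judged on-topic ("a", "b")

def pagerank_with_v(out_links, v, d=0.85, iters=200):
    pages = sorted(v)
    rank = dict(v)
    for _ in range(iters):
        new = {p: (1 - d) * v[p] for p in pages}
        for page, targets in out_links.items():
            if targets:
                share = d * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: redistribute via v.
                for p in pages:
                    new[p] += d * rank[page] * v[p]
        rank = new
    return rank

links = {"dir": ["a", "b"], "a": ["b"], "b": ["a"], "c": ["a"]}
pages = ["dir", "a", "b", "c"]
uniform   = {p: 0.25 for p in pages}
authority = {"dir": 1.0, "a": 0.0, "b": 0.0, "c": 0.0}
topic     = {"dir": 0.0, "a": 0.5, "b": 0.5, "c": 0.0}

r_uniform   = pagerank_with_v(links, uniform)
r_authority = pagerank_with_v(links, authority)
r_topic     = pagerank_with_v(links, topic)
# Page "c" (no inbound links) keeps baseline mass under the uniform
# vector but drops to exactly zero under the authority vector.
```

This makes the distinction above concrete: the authority variant needs a vetted source set, while the topic variant only needs per-page topic weights, however those are determined.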
I'm not here to give people math lessons
Before giving math lessons you should at least avoid such mistakes as calling a simple set of linear equations 'recursive'.
I think the idea of a manually reviewed search engine (such as the way Yahoo is heading with SiteMatch) is going to put them well ahead in terms of relevancy for commercial searches. Unfortunately, a large portion of the web is made up of informational sites which do not submit to search engines or to directories. Therefore Google's 'crawl-em-all-sort-em-out-later' approach will have to be adopted by MSN and AOL. Does anyone actually use directories for searching the web on a regular basis?
[edited by: WebGuerrilla at 5:23 pm (utc) on April 5, 2004]
[edit reason] TOS #16 [/edit]