Forum Moderators: open
What's left? DMOZ? An idea now fully controlled by Netscape and in turn by AOL. Consider a scenario in which AOL decides that the licence for use of ODP data does not extend to public companies, to companies exceeding a certain amount of commercial traffic, or to any organization which intends to use ODP data for any commercial purpose. Licences for such uses would need to be negotiated separately with AOL/Netscape.
If I were an AOL exec, I would be rubbing my hands with evil lust thinking about Google's exceedingly heavy dependence on ODP data. If I were a Microsoft exec, I would be clandestinely courting AOL to secure a more exclusive arrangement for DMOZ. Unfortunately AOL management seem to be a little "old school" these days. Oh, I am sure Google has something worked out with AOL, but I doubt AOL execs realize just how critical DMOZ is to Google's accuracy in the face of its never-ending anti-SEO war.
This is a huge opportunity for AOL. It would take years for Google to achieve the momentum of DMOZ if it started its own directory effort even if they bought a directory like Looksmart. I wonder how long it will take them to wake up from their slumber.....
What we discovered was that we could find billions of pages on the net by entering as few as 100 URLs.
It has very little to do with crawling. It has everything to do with ranking. There are many search engines out there that crawl hundreds of thousands of pages, but very few can claim the quality or popularity of Google.
I am not just saying DMOZ is important. I am saying it is now critical to Google's quality. Without it, Google's results would likely be very low quality particularly in commercial areas. Commercial searches constitute a very large fraction of all searches and are the foundation of Google's existing business model.
Essentially Google is now a business which is critically dependent on one supplier for providing value to its customers. Anyone who has studied a bit of business valuation theory knows that that puts their business model in a very shaky competitive position. It's a matter of time before AOL, Yahoo and MSN become less blase about that fact.
I actually agreed with most of your first post. You went and blew it a little on your second post, though. The Google filter is specifically targeted at commercial phrases and mom-and-pop commercial web sites. They couldn't care less if a commercial business did not show up in the natural results (unless they are a major advertiser, in which case they would perhaps by chance magically end up with a VERY GOOD DMOZ listing, IMO).
Here's an example. Let's say I wanted to rank the intellectual value of every poster on WebmasterWorld. One way I could do this is by simply drawing up a large network of directed connections between individuals where each connection represents a reply by one poster to another poster. I could then assign values to each connection based on the value of the poster who replied and say that a percentage of this value is transferred to the original poster who is being replied to. After applying some normalization at each step, I could build up a value rating for every poster on WebmasterWorld.
But there is a problem with this model. It is recursively defined and needs starting values. Do I simply give everybody the same initial starting value? If I did, it might turn out that a lot of dumb guys like to hang out with each other here, responding to each other's posts humorously or out of boredom, and members of this dumb-guy group would rise to the top as the most valuable posters. Alternatively, I could ask Brett Tabke and GoogleGuy to identify (in secret of course ;)) a small collection of posters they consider intelligent and knowledgeable on WebmasterWorld. We could assign these people an initial value rating to seed the algorithm described in the previous paragraph. From there we could recursively construct a value rating for all users.
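The seeded version of this model can be sketched in a few lines. This is a toy illustration only: all names (including the seed set) and the damping factor are invented, and the "value transfer" rule is a PageRank-style iteration whose teleportation vector is concentrated on the hand-picked seeds.

```python
# Toy "poster value" ranking: each reply transfers a share of the
# replier's value to the poster being replied to. A hand-picked seed
# set biases the teleportation step, so value flows outward from
# trusted members rather than from a uniform prior.

def seeded_rank(replies, seeds, d=0.85, iters=50):
    """replies: dict mapping replier -> list of posters they replied to."""
    posters = set(replies)
    for targets in replies.values():
        posters.update(targets)
    # Teleportation vector: uniform over the seed set, zero elsewhere.
    v = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in posters}
    rank = dict(v)  # start from the seed distribution
    for _ in range(iters):
        new = {p: (1 - d) * v[p] for p in posters}
        for replier, targets in replies.items():
            if targets:
                share = d * rank[replier] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Poster who replies to nobody: redistribute via seeds.
                for p in posters:
                    new[p] += d * rank[replier] * v[p]
        rank = new
    return rank

replies = {
    "brett": ["alice"],
    "googleguy": ["alice", "bob"],
    "alice": ["bob"],
    "bob": [],
    "dumbguy1": ["dumbguy2"],   # a clique replying only to itself
    "dumbguy2": ["dumbguy1"],
}
r = seeded_rank(replies, seeds={"brett", "googleguy"})
# The self-contained clique accumulates no value: no reply path
# from any seed ever reaches it.
```

The point of the sketch is exactly the one made above: without the seed bias (a uniform teleportation vector), the clique's mutual replies would earn it a respectable rating.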
Herein lies Google's problem. With the number of initial authorities like Yahoo, Looksmart and DMOZ diminishing for them, the likelihood of results being vastly artificially boosted by the popularity of the mere connectedness of the community in which they belong would be very high. Consider for example the large network of teenage gamers out there on the net who have websites and link to each other. Without the appropriate source values, our results could be dominated by the collective opinions of warez kiddies, free porn junkies and Britney Spears fans.
Intellectual property value sources are critical to Google's rankings and Google's serious options for seeding are running out.
The Google filter is specifically targeted at commercial phrases and mom-and-pop commercial web sites. They couldn't care less if a commercial business did not show up in the natural results
Deeper Inside PageRank [webmasterworld.com]
Chapter 8 discusses updating the PageRank vector.
also this recent paper: Combating Web Spam with TrustRank [webmasterworld.com] discusses the selection of "reputable seed pages".
If you skip the formulas there are some interesting things around.
DMOZ or no DMOZ, Google already has plenty of archived seed-selection options for crawling, IMO.
also this recent paper: Combating Web Spam with TrustRank discusses the selection of "reputable seed pages".
Furthermore, check this out: "In order to get rid of the spam quickly, we removed from our list of 25,000 sites all that were not listed in any of the major web directories". More evidence of their heavy reliance on leeching intellectual property value from the major directories.
As for seeds, in this paper the seeds they refer to are only potential page candidates that are likely not to be spam (based on silly assumptions such as the one above), used to kickstart an algorithm for spam detection based on propagation of "non-spaminess" measures. The paper does not discuss PageRank algorithm seeding.
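The propagation mechanism described in the TrustRank paper can be sketched roughly as follows. This is the "trust splitting" variant: each vetted seed starts with trust 1.0, and at every hop a page splits its trust among its out-links with an attenuation factor. The graph, seed choice, and factor 0.85 are all illustrative, not taken from the paper's data.

```python
# Rough sketch of TrustRank-style trust splitting: trust flows out
# from a hand-vetted seed set along links, attenuated per hop, and a
# page keeps the maximum trust offered to it. Pages never reached by
# any seed end with trust 0 and become spam suspects.

def split_trust(out_links, seeds, beta=0.85, hops=3):
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    trust = {p: (1.0 if p in seeds else 0.0) for p in pages}
    for _ in range(hops):
        offered = dict(trust)
        for page, targets in out_links.items():
            if trust[page] > 0 and targets:
                share = beta * trust[page] / len(targets)
                for t in targets:
                    offered[t] = max(offered[t], share)
        trust = offered
    return trust

links = {
    "seed.example": ["shop.example", "blog.example"],
    "shop.example": ["partner.example"],
    "blog.example": [],
    "spam-a.example": ["spam-b.example"],
    "spam-b.example": ["spam-a.example"],
}
t = split_trust(links, seeds={"seed.example"})
# The isolated link-exchange pair receives no trust from the seed.
```

Note this is a spam-detection measure propagated over the graph, not a seeding of the PageRank computation itself, which is exactly the distinction drawn above.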
Great papers though for finding out how they are thinking. Thanks.
[edited by: Bottler at 11:17 am (utc) on April 1, 2004]
DMOZ/ODP editors are unpaid. It is assumed that ALL organisations may utilise its data freely, commercial or otherwise.
Whilst one company may own the hardware, logos, domain name, etc, they do not own the contents. Banning Google from use of ODP data would certainly be immoral, it might even be illegal.
Also, my understanding of PageRank logic may be limited, but I very much doubt that the starting point is critical beyond the first few iterations. If Google dropped ODP data, the only sites to suffer would be those that have few other backlinks. The whole PR calculation would NOT fall apart.
Kaled.
As for seeds, in this paper the seeds they refer to are only potential page candidates that are likely not to be spam (based on silly assumptions such as above) to kickstart an algorithm for spam detection based on propagation of "non-spaminess" measures.
Fascinating. If so, algo changes could be caused by differing selections of 'seed' pages.
AOL is Tom Sawyer, DMOZ editors are the neighborhood kids painting the fence.
The only thing that AOL owes DMOZ editors is a big "Thanks for painting my ($200 million) fence"
but I very much doubt that the starting point is critical beyond the first few iterations.
It's the vector of probabilities of where a user jumps to if she is bored with the current page or there are no links on the page.
It's becoming very clear that Google is increasingly dependent on DMOZ as its intellectual-property leeching source. Remind yourself that Google's PageRank calculation has to start with initial rank vectors ...
Critical to the way PageRank is calculated are respected sources. The PageRank vectors are initially (and in some cases periodically) fed by the value of these initial sources.
If you're referring to the original PageRank algorithm, then the solution of the linear equations (for d<1; d = damping factor) is independent of the initial vector. Moreover, the solution is unique, and even normalization doesn't play a role. (Normalization would only be important for the case d=1, which corresponds to the calculation of eigenvectors.) In practice one uses the vector from the last calculation to speed up the calculations.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
But there is a problem with this model. It is recursively defined and needs starting values. Do I simply give everybody the same initial starting value?
There is no recursive definition. Of course, iteration schemes to calculate the solution of the linear equations are recursive. However, the definition of PR itself isn't recursive.
By the way, in principle you could also calculate PR in one step without an initial guess, as already mentioned here: Google meets Heisenberg [webmasterworld.com]. Of course, in practice this wouldn't work for such a large system.
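The one-step calculation just means solving the linear system (I - d*M) x = ((1-d)/n) * 1 directly instead of iterating. A minimal sketch on an invented 3-page graph, using hand-rolled Gaussian elimination and cross-checked against power iteration:

```python
# Direct (non-iterative) PageRank: solve (I - d*M) x = (1-d)/n * 1.
# No initial guess is involved; the answer matches power iteration.

d, n = 0.85, 3
# Column-stochastic link matrix M[i][j] = 1/outdeg(j) if j links to i.
# Graph: page 0 -> 1,2 ; page 1 -> 2 ; page 2 -> 0.
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]

A = [[(1.0 if i == j else 0.0) - d * M[i][j] for j in range(n)] for i in range(n)]
b = [(1 - d) / n] * n

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]; b = b[:]
    m = len(b)
    for col in range(m):                 # forward elimination
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]; b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * m
    for r in reversed(range(m)):         # back substitution
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, m))) / A[r][r]
    return x

x = solve(A, b)

# Cross-check with power iteration: x_{k+1} = (1-d)/n + d*M x_k.
r = [1 / 3] * n
for _ in range(500):
    r = [(1 - d) / n + d * sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
```

For the real web graph (billions of pages) a dense direct solve is of course infeasible, which is the caveat made above.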
The ODP Social Contract forbids this. If AOL tried that, ODP editors could sue them for violating this.
the solution of the linear equations (for d<1; d = damping factor) is independent of the initial vector.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
The ODP Social Contract forbids this. If AOL tried that, ODP editors could sue them for violating this.
Herein lies Google's problem. With the number of initial authorities like Yahoo, Looksmart and DMOZ diminishing for them, the likelihood of results being vastly artificially boosted by the popularity of the mere connectedness of the community in which they belong would be very high. Consider for example the large network of teenage gamers out there on the net who have websites and link to each other. Without the appropriate source values, our results could be dominated by the collective opinions of warez kiddies, free porn junkies and Britney Spears fans
I'm not talking about the initial PageRank vector as used in the power method for the conditioned transition matrix. I am talking about the conditioning vector or what Google refers to lately as the "personalization vector" which is used in every iteration.
Google's personalized PageRank vectors are based on topic-related PR vectors. This model isn't based on authorities, i.e. the weight (the transition probability in the random-surfer model) is topic dependent, but there is no higher weight for qualitatively good pages. Therefore, there is no need for authorities and no dependency on directories. You just have to determine the topic of each web page.
Of course, even in this case there are no recursive definitions, just recursive iteration schemes (the same that can be used for the non-personalized case). And even in this case you can calculate the final PR within one step without iterations.
Even in the TrustRank model with artificial sources initial conditions don't play a role.
Sorry, this is blatantly wrong, but I'm only prepared to debate it offline as this is not the research forum.
I was talking about the initial vector used for iteration schemes to calculate TrustRank/PR (!) such as Jacobian iteration. In this case (which I was referring to) the statement given above is obviously valid.
Google's personalized PageRank vectors are based on topic-related PR vectors
My primary motivation is the strong belief that a Google monopoly is unhealthy for everybody, just as much as a Microsoft one is. It is in all of our interests to see healthy competition in this field.
Google is actively undermining the commerciality of webmasters in order to boost their own corporate profits. Doesn't this begin to smell of misuse of monopolistic authority to anybody here?
Come on - let's not have a technical debate until people do their homework properly and argue honestly. I'm not here to give people math lessons.
LOL. I just find it hilarious that you quote doc_z and follow it up with this.
While I have had disagreements with doc_z, it has almost always been on semantic or opinion issues. He (or she) certainly knows their math, and seems to be quite familiar with all the appropriate papers.
I'm here to provoke discussion about an important issue that affects many people here.
So, is provoking discussion the same as discussing? It doesn't sound as friendly to me.
My primary motivation is the strong belief that a Google monopoly is unhealthy for everybody, just as much as a Microsoft one is. It is in all of our interests to see healthy competition in this field.
Well, that certainly is a "provoking" statement. Perhaps you should have just titled this thread more appropriately rather than getting upset when people actually try to discuss the technical merits of what you post. It would make things a lot less confusing.
If you don't want a technical discussion, then do not post your technical plan of attack as the main topic. Post your political position, which it sounds like what you really want to discuss.
Google is actively undermining the commerciality of webmasters in order to boost their own corporate profits.
And the other big online companies aren't? I think Amazon is a lot closer to monopoly power in a lot of online areas than Google is.
It would take years for Google to achieve the momentum of DMOZ if it started its own directory effort even if they bought a directory like Looksmart.
You can bet that if Google announced it wanted some editors to work on a new Google Directory that you would have 100,000 applications sent within the first 10 seconds even if they weren't willing to compensate you financially.
'Years'? It would take only a few months to get up to Dmoz's level.
I feel that Google would benefit from making its own directory. Looksmart was one of the most corrupt and easily fooled directories I'd ever come across. DMOZ is getting that way as well; I can't even count the number of bribes I was offered to pass duplicate inclusions or to put keywords in the linking text rather than the company name.
And these are based on DMOZ subtopic-based conditioning.
It seems that you're referring to Haveliwala's papers. His ODP-biasing ansatz deals not only with topic-sensitive PageRank but also with an authority system, i.e. most entries of the PageRank vector are zero independent of the topic. However, there is no need to mix these different issues. As long as one is only interested in topic-sensitive PageRank and not in an authority system, you can determine the topic from on-page factors. Therefore, ODP might be important for an authority system but not for purely topic-sensitive PageRank.
By the way, even in Haveliwala's and other Stanford papers there are no recursive definitions, just recursive methods to solve linear equations.
Here is a brief, rough overview for those who are interested in this topic:
- The standard (original) PR algorithm uses a single constant PageRank vector, i.e. the self-contribution of each page is the same. (In the random surfer model this corresponds to a uniformly distributed probability to be teleported to a page.)
- In the authority system only selected sources have a weight, i.e. most of the entries in the PR vector are zero. (Therefore, one is only teleported to a selected set of pages.)
- Topic sensitive PageRank uses a set of vectors. (The transition probability is topic dependent.)
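The three variants above differ only in the teleportation vector v used in x = (1-d)*v + d*M x. A small sketch comparing them on one invented 4-page graph ("dir" standing in for a directory page; all weights illustrative):

```python
# Same PageRank iteration, three teleportation vectors:
#   uniform   - classic PR, every page has the same self-contribution
#   authority - mass only on a selected source ("dir"), rest zero
#   topic     - mass spread over pages judged on-topic ("a", "b")

def pagerank_with_v(out_links, v, d=0.85, iters=200):
    pages = sorted(v)
    rank = dict(v)
    for _ in range(iters):
        new = {p: (1 - d) * v[p] for p in pages}
        for page, targets in out_links.items():
            if targets:
                share = d * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: redistribute via v.
                for p in pages:
                    new[p] += d * rank[page] * v[p]
        rank = new
    return rank

links = {"dir": ["a", "b"], "a": ["b"], "b": ["a"], "c": ["a"]}
pages = ["dir", "a", "b", "c"]
uniform   = {p: 0.25 for p in pages}
authority = {"dir": 1.0, "a": 0.0, "b": 0.0, "c": 0.0}
topic     = {"dir": 0.0, "a": 0.5, "b": 0.5, "c": 0.0}

r_uniform   = pagerank_with_v(links, uniform)
r_authority = pagerank_with_v(links, authority)
r_topic     = pagerank_with_v(links, topic)
# Page "c" (no inbound links) keeps baseline mass under the uniform
# vector but drops to exactly zero under the authority vector.
```

This makes the distinction above concrete: the authority variant needs a vetted source set, while the topic variant only needs per-page topic weights, however those are determined.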
I'm not here to give people math lessons
Before giving math lessons you should at least avoid such mistakes as calling a simple set of linear equations 'recursive'.
I think the idea of a manually reviewed search engine (such as the way Yahoo is heading with SiteMatch) is going to put them well ahead in terms of relevancy for commercial searches. Unfortunately, a large portion of the web is made up of informational sites which do not submit to search engines or to directories. Therefore Google's 'crawl-em-all-sort-em-out-later' approach will have to be adopted by MSN and AOL. Does anyone actually use directories for searching the web on a regular basis?
[edited by: WebGuerrilla at 5:23 pm (utc) on April 5, 2004]
[edit reason] TOS #16 [/edit]