|Google as a black box|
Let's face it: we're all blind guessing at the way it really works.
| 5:09 pm on Feb 2, 2002 (gmt 0)|
First of all, please let me spend a few words to thank the senior members for their often brilliant and sometimes passionate contributions. I think this forum really is an invaluable resource for anyone interested in the SE world and its current leader. Or at least, one of the very best Google-oriented discussion boards I've seen so far. So let's keep up the good work.
Now, back on topic.
I've spent a fair amount of time reading previous posts here, and one thing that really struck me is the fact that about 99% of the assertions about Google's ranking mechanism(s) in this forum are nothing more than realistic hypotheses. After all, the only publicly-available official documentation of the PageRank algorithm appears to be the famous recursive formula from the early paper The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu]. Just about everything else I've been reading so far about PageRank seems to be mere inference based on personal (=limited) experience, if not plain speculation.
About PageRank, the best independent analysis that I've been able to find is PageRank Explained [goodlookingcooking.co.uk] by Chris Ridings, who --while illustrating some really interesting theories indeed-- clearly states that "There is, at this point in time, not enough information for us to be 100% certain about anything. I am merely presenting theories, based upon the best information available, which seem to largely hold true".
What's perhaps even more important, from what I've been able to observe from several web site rankings (including my own), PageRank is obviously a single part of a wider, more complicated and partly obscure mechanism. As Chris intelligently argues, "PageRank has its place in the ranking process. That place is not as big as many might imagine. Its significance in the ranking algorithm is less than many other factors [...]".
So, let's be honest: what do we know about what's really going on inside the Google black box? Very little, IMHO. I find it really funny to read about some SEO geek hacking, reverse-engineering, spoofing or otherwise exploiting the Googlebar just to see the actual PR values associated with each URL. Besides, I'm pretty sure the guys at the GooglePlex have lots of fun reading our posts, too. :)
Google's engineers have always been very careful about giving away potential hints (let alone detailed explanations) about the way the SE really works. And they have justified their reserve by saying that such information may seriously harm Google's search quality (and thus its users' experience) if disclosed. I deeply respect and appreciate that position, which I think reveals a very mature and responsible corporate philosophy. One may argue that Google's secrets are mainly aimed at retaining their current leadership in the SE market, but then how many other SE's do you know of that are so constantly focused on providing outstanding quality search results, and so genuinely concerned about spamming, cloaking, and other unethical so-called "SEO" practices?
About self-appointed SEO professionals, I really wish those guys could learn something from the WebSeed incident [webmasterworld.com]. Leaving aside the questionable ethics of behaviours such as setting up a link farm in order to boost a single web site's ranking, and even looking at things from a strictly opportunistic point of view, following Google's tips on how to get a good ranking has proven to be the most effective web page optimization strategy by far. From a fresh interview with Google software engineer Matt Cutts [clickz.com]:
|When asked how to gain high rankings, Cutts replied, "The guidelines are pretty simple: Stay away from hidden text, hidden links, cloaking, sneaky redirects, lots of duplicate content on different domains, and doorway pages. [...] The best use of a Webmaster's time is building good content." |
So why not just stick to Paul Boutin's wise SEO guidelines [hotwired.lycos.com] instead?
One last thing: I'd like to be proven wrong about the lack of information regarding Google's actual ranking mechanisms, so if anyone knows of any official (or equally reliable) resource about obscure subjects such as theme assessment and the way Google extracts contextual information from web pages and hyperlink structures, please post it here. Although I have hardly any interest in SEO techniques, I am about to write my graduation thesis on Google, so any additional references would be welcome. Thanks.
| 5:24 pm on Feb 2, 2002 (gmt 0)|
about 99% of the assertions about Google's ranking mechanism(s) in this forum are nothing more than realistic hypotheses.
If you leave away the word "realistic", then I fully agree with you!
Not that there aren't any realistic hypotheses voiced here, but those certainly make only a fraction of your 99%. Some of the statements recently floating around were nothing but theories about speculations, based on unfounded assumptions derived from bad observation... ;)
| 5:28 pm on Feb 2, 2002 (gmt 0)|
A mystery wrapped inside a riddle covered by an enigma.
| 5:42 pm on Feb 2, 2002 (gmt 0)|
Any hypothesis even if it is accurate is fleeting, google is a moving target. That is why on every update you will see a flurry of posts about "hey my site got banned, and I didn't do anything".
The operative word is excess: it is very easy to spot patterns that exist in excess. So let's say you finally crack the nut, a handful of folks notice and copycat your effort, others copycat them, and, well, you get the picture. All of a sudden you have EXCESS, and the whole hypothesis is no longer valid.
Here is my off the wall hypothesis in regards to trying to "shortcut" the algo:
Success will eventually breed failure.
| 5:50 pm on Feb 2, 2002 (gmt 0)|
|about 99% of the assertions about Google's ranking mechanism(s) in this forum are nothing more than realistic hypotheses. |
If you leave away the word "realistic", then I fully agree with you!
Actually I said that 99% of them are nothing more than realistic hypotheses --which implies that some of that 99% may be less-than-realistic hypotheses. ;)
Anyway, I just wanted to stress the fact that the currently available "official" information about Google's inner mechanisms appears to be really scarce. Maybe I'm missing something, who knows?
| 6:06 pm on Feb 2, 2002 (gmt 0)|
>Maybe I'm missing something, who knows?
The concept of research. :)
|brotherhood of LAN|
| 6:21 pm on Feb 2, 2002 (gmt 0)|
The fact that all webmasters have a concern with Google (i.e. all of us) not only protects the integrity of their engine, but surely promotes it too!
I suppose you are right in a way about all of us guessing. It is worth pointing out, though, that the results of Google searches are there to be seen. If I took the first 100 pages for a particular keyword and studied them (and those who link to them) furiously, then I may not be 100% correct with my hypothesis on the way the search engine works, but maybe I can get CLOSE to 100%.
One thing I do know is that the QUANTITY of links pointing to your domain is nowhere near as important as the quality of the links. You can have 100 links pointing to you on Google and get PR6.
All in all, I would LIKE the inner workings of Google to remain secret, because I would hate to think that other webmasters would have the time to exploit it when I'm fine with the way Google works just now :)
Keep up the good work Google
| 6:55 pm on Feb 2, 2002 (gmt 0)|
|All in all, I would LIKE the inner workings of Google to remain secret, because I would hate to think that other webmasters would have the time to exploit it when I'm fine with the way Google works just now |
Hey, me too! ;)
Being a Communication Sciences student and part-time webmaster, I have an almost purely academic interest in Google's mechanics. So I was just wondering if anybody knew of any interesting research paper on that subject. Any pointers?
| 7:02 pm on Feb 2, 2002 (gmt 0)|
john316: "Any hypothesis even if it is accurate is fleeting, google is a moving target."
Yeah. Or maybe we're just running around in circles, with Google standing in the middle, watching us, and laughing. ;)
| 7:03 pm on Feb 2, 2002 (gmt 0)|
I already posted some interesting pointers to articles and theses from/about Google.
Remember, Google hires a huge number of PhDs, and PhDs have to write papers and publish them. Collect some names of people who work at Google from articles, search on those names, and restrict the domain to .EDU.
You will find a lot of good papers!
| 7:09 pm on Feb 2, 2002 (gmt 0)|
Giacomo, for a start I'd suggest using the sitesearch [searchengineworld.com], or reading back especially in the SEO Research Topics [webmasterworld.com].
If you are seriously interested you could try to trace the academic careers of the people involved.
It's all human made. Even Google.
| 7:29 pm on Feb 2, 2002 (gmt 0)|
Many thanks for the advice, ROLAND_F + heini: I will dig deeper into this forum's message base before I start searching the Web for that subject.
|brotherhood of LAN|
| 8:08 pm on Feb 2, 2002 (gmt 0)|
Perhaps archiving some of the older posts would make them easier to find and mean you experts don't have to answer the same question twice :)
| 8:48 pm on Feb 2, 2002 (gmt 0)|
Library link on the top menu bar.
| 9:51 pm on Feb 2, 2002 (gmt 0)|
Giacomo, I see where you're coming from, but not everything written here is blind. With our eyes half open people sometimes make false deductions, but these are not in themselves necessarily false assumptions. The more deductions we make and the more hypotheses we test, the more chance we have of understanding.
"maybe we're just running around in circles"
To some extent we are, but we get closer.
For example, we've known for a long time (from huge numbers of collective observations) that Google likes <TITLE>s; this was unsurprising as earlier engines did too, and the original Google (called 'Backrub' at the time) ordered results by TITLE. We also know how PageRank was originally designed to work (the exact formulae are in the papers by Mr Brin and Mr Page that you're about to find;)).
We live and learn. I think I can confidently say that I know one aspect about the latest round of zero PageRanks that I didn't know yesterday (although I suspected): "Poison words" on a page can prevent PageRank being passed on to other pages, even though the page with the poison word has the PageRank it deserves and ranks well.
As far as I'm concerned, the idea that Google have Duck taped the big sites (Yahoo, DMoz, About) is out the window. Even the Open Directory is not immune to poison words. I've nothing against the people mentioned, but I don't want to stick to someone's guidelines. Even the guidelines from Google don't tell us how they police the rules (or how to avoid being in trouble for looking like a rule-breaker).
The ideas about "cross linking" or "loose affiliation" are still just suspicions as far as I'm concerned, but they seem to be the hottest bet in town. When someone finds the one clear, simple example that differentiates an affected site from a similar but unaffected site, that person may be able to see the answers if he or she looks hard enough. (I write answers, as I no longer believe that there is a single process at work.)
| 11:38 pm on Feb 2, 2002 (gmt 0)|
Ok, so we might be running around in spirals, not circles. Well, may our path be centripetal then. ;)
ciml: I know Brin and Page's paper and its original definition of PR (I mentioned it in my first post). Of course, the vital importance of keywords in the <TITLE> tag is pretty self-evident each time you search with Google, and with other SE's as well. No one is going to have any doubts about that, and I guess the same could be said for the negative effect of widely-recognized "poison words". However, personally I would think twice before stripping relevant keywords off my web pages just because I found them on a blacklist somewhere. "PR0 paranoia" already seems to be a widespread disease in the webmaster community: let's just try not to make it epidemic. ;)
I can see your point when you say we have to make assumptions (which might as well prove themselves wrong in the long term) in order to test our hypotheses and eventually learn something. I totally agree. Besides, that's the way empirical research is done.
The problem is, most of the assumptions I've seen the SEO people make have more to do with superstition than science IMHO. :)
That said, I would like to thank the nice guys who run WebmasterWorld for the directions and pointers they gave me: digging deeper in the library, I found tons of interesting information about Google and related topics, as well as some really intriguing threads from the past. There's even a Google FAQ page! and a research summary! and a SE newsletter! and a SE glossary!... Awesome stuff that's going to be of great help for my thesis. Talk about content quality? This site deserves PR12. :)
| 12:18 am on Feb 3, 2002 (gmt 0)|
>mere inference based on personal (=limited) experience
SEO Methodologies : A brief history of our industry:
1995) The early days of Yahoo.
Optimization was born out of the roots of AAA, A#1, and Acme style yellow pages/white pages alphabetical optimizations.
1996) Blind luck and keyword seasoning to taste.
The early days were stabs in the dark using simple keyword seasoning. Poke it here, and look for a reaction there. The first concepts of density and location started to be used.
You could still get a site listed in Yahoo by merely submitting it. As long as it wasn't too gaudy, you were in within 72 hours.
Late 96) The first papers begin to appear on the web about text matching, data mining, and interviews with se programmers.
Light bulbs of understanding begin going off around the early seo community. People began to realize just how databases work to match text and how they would be applied to the greater database of the web.
1997) The first algo crackers appear.
If it's a machine, we don't need to test it by blind experimentation; we can decode the algo mechanically. The first algo crackers were quite rudimentary: by simply studying the makeup of pages in the results, many of the major clues to the algos could be understood.
More specifically, several seo's decoded all 35 parameters to Excite and were able to build pages precisely to the algo; thus, generating #1 pages at will.
The first major "page jacking" and "bait and switch" incidents begin to happen. SEO's get code stolen and copied.
Mid 97) Several se's begin using Yahoo as a QA check. Thus, getting into Yahoo became paramount. Yahoo is flooded with submissions. Best guess is they processed less than 5% of submissions in 97 and 98. Impromptu Yahoo flame clubs formed anywhere there was a discussion about promotion.
Se's begin waking up to the fact that their sites are "portals" (in one door and out the other). Se's begin their first attempts at keeping people on the site in various ways. Some were intentional algo manipulations designed to keep people around the se and searching longer than they should have. (There are some big time stories here if any se techs would like to talk)
Late 97) Along came Infoseek's daily refresh. Submit it by 8am and you were in the db and pulling referrals by late afternoon. It was the first time "joe optimizer" could play the game without being a programmer. SEO explodes as people began to see simple and easy results in 24hours on Infoseek.
Spam becomes a very serious problem for the SE's as unscrupulous spam sites began to understand algos and how to manipulate them. Hotbot and Altavista were next to useless in late 97 due to spam (last half of 97 and most of 98 were the dark ages for se's).
The first "clustering" of results appears and has a major effect on algo decoding.
More page jacking incidents happen regularly. Hardly any top SEO doesn't have top ranked pages stolen and copied. Often copied into foreign domains out of jurisdiction.
Algo crackers begin to talk about the first cloaked pages appearing in the insurance and auto sectors. I am captivated by it.
Referrals begin to skyrocket for SEO's. 1k, 2k, and even 5k per site per day is not uncommon.
98) Let's get serious.
After several papers were delivered at the WWW conferences, it became clear se's were going to move to Off-The-Page criteria. Prerequisites such as link pop, directory listings, and listings age were going to be main parts of the new algo's.
Decoding algo's became very sophisticated in mid 98 and 99. Several optimization firms hired programmers to write efficient algo crackers.
It is also the first time I know of where a search engine used multiple algos for different top ten positions. Just because you could figure out what makes a page #2 doesn't mean you have a clue about #3, which was positioned using different criteria.
The big push of "shop the competition" is born as several se's use the old "tell on your neighbor" ploy to clean up their results because their algos couldn't.
Page jacking and site theft is rampant. You can't put a top page on Altavista without it being stolen. Entire sites are mirrored as a means of "bumping off" the competition due to alta's horrible dupe page detector. Much the same occurred with Inktomi.
The big rounds of submission spamming wars begin as people spam the submit urls with your pages. Some say it worked for several years to get competition banned in the se's. Finally in late 98 se's begin to understand what is happening and put a stop to it by limiting submissions.
Se's begin to modernize with multi-languages, word lists (term vectors), and other language expertise - the era of the word guru is born.
Google hits the scene [web.archive.org] in earnest. Their first build of 25 million urls makes it clear they have a future. I review it and am the first (beep beep) to propose link programs. People begin thinking in earnest about link pop and how to affect it.
Spam page/doorway page auto generators show up on the web every where and some are very good.
Referrals hold steady for those that know the game and stay off the radar. Using quality seo - that doesn't look like seo - rules the day.
Hello ODP! The first independent, free, "open source" directory is born. They represent a huge threat to the traditional directories. Out of "no where" comes the first ODP flames at a time when everyone was in love with the ODP (was it an anti-odp plant by a competitor, or was it real? You make the call).
Late 98-early 99) Altavista fights back with "too many urls" and bans huge segments of sites and sites with auto doorway page generators. Other engines begin out-and-out wars against seo. If a site said "we optimize" or "we promote" anywhere on it, they were banned in massive quantities. Much of that same mentality still exists today in many search engine offices.
Many seo firms begin falling out of the search engines in record numbers. Hardly any seo firm isn't affected. Loss of rankings on entire client lists is common. This is why you find old pro's who never talk about clients or link their websites with clients and why those that now know algo's cold - rarely talk in those terms.
Although the algo crackers are at their peak of performance, their utility falls as off-the-page factors such as link popularity become mainstream in the se's. Decoding what makes a page top ten has never been more difficult. Those that know now spend 10 times (literally) as much time to achieve half the rankings they did in 98. Algo crackers are not much more than statistic generators now.
Google's PageRank begins to bear fruit while the other se's self destruct under management chaos and mountains of red ink.
The Hubs and Authorities model is clearly a winner at Google. It universally clears out junk from the bloated db's and identifies the core mega sites in each keyword sector.
Although there has never been more competition, referrals hold steady through 98 and into early 99 across most of the engines.
Cloaking becomes almost mandatory on many se's to protect rankings and code. It is unfortunately used by those not so interested in those factors and more interested in spamming for the sake of instant successes.
Late 99) The effects of the end of SEO begin to sink in.
Goto begins to make its major push. SEO's begin ppc'izing their billing with store-front redirect sites showing up everywhere.
Link pop schemes explode.
Other se's cut huge swaths out of their db's for unknown reasons. Part of it was size, some of it was spam, and some was just because they could.
SEO and traditional algo decoding techniques as we knew them are all but relegated to the ash heap of history.
Referrals begin to plummet as competition sky rockets and the web matures. I secretly think 99 was when people "settled in" to a daily routine and began using search engines less and less. It was no longer this huge mystery that needed to be explored - they now used it to do productive things. eg: sites such as news take off in record numbers.
2000) A fairly deep shudder goes through the remaining industry as what was left of Infoseek is gone.
The paid for play schemes and ppc schemes crank up in rapid succession in 2000. From Ink, to Alta, to even buying banners based on keywords - ppc and pfp is every where.
Meanwhile back in the real search industry, surfers look for an engine that actually works at finding them info - Google solidifies its position as the new de facto se.
The link pop craze of 99 begins to fade as it becomes very clear they are risky items - too easily tracked.
The last gasp for link pop programs is the building of fake awards programs, fake guestbooks, fake directories, and fake forum systems just to build fake link pop.
2001-) Bought and Paid For listings are everywhere. Goto is on all the major hubs from Yahoo, AOL, to even MSN. People abandon other se's such as Hotbot, Altavista, and Excite in record numbers. It's an exodus.
SEO as we knew it is all but over. We are down to talking about the few remaining free specific engines and their systems. There is now a major difference in how se's work and how to "work them".
Welcome to the era of "All Google All The Time".
Many seo's have sleepless nights as we realize it is "Google or Bust".
Through all those trials, tribulations, education, and experience, those that know how certain se's really work, are not eager to share those hard fought for lessons. Knowledge is power - knowledge is money - hang on to both.
It's been this way for years with every engine. Subjects talked about here are under a great deal of pressure from different angles and agendas. Yes you are correct, we are most often used as a sounding board for ideas and theories.
There are very definitely people reading this right now who know exactly what it takes to get a top ranking on Google. They may not be able to actually produce that ranking due to factors that are not easily controllable; such as directory listings or off the page context criteria.
(A second part is continued in a stand-alone post: 26 steps to a successful site [webmasterworld.com] with Google alone.)
| 3:23 pm on Feb 3, 2002 (gmt 0)|
Hey Brett, thanks a lot for your historical account of the SE vs. SEO war!
Ah, the good ol' times when a listing in Yahoo!'s business category was still FFA... ;)
I totally agree knowledge = power, especially in this field.
What you say about "secret" SEO strategies being copied as soon as they are deployed, and then rendered ineffective by subsequent algo adjustments on the SE part, is very interesting. I wonder if Chaos Theory [directory.google.com] might offer a good interpretative paradigm for SE's tendency to react to "perturbations" (e.g., algo crackers) by reorganizing themselves into something new and eventually returning to a stable state...
But I guess that would be a bit off-topic here, so I'm off to read your fantastic Successful Site in 12 Months with Google Alone [webmasterworld.com] tutorial. :)
Many thanks for sharing your knowledge with us, and keep up the good stuff!
| 11:54 am on Feb 4, 2002 (gmt 0)|
Giacomo, sorry I didn't spot your Anatomy of a Search Engine link.
I'm glad to read that you had already found "poison words" to be widely recognised; to be honest I only discovered Google's use of them since the January update. My point, though, was less about their presence than the way that small parts of DMoz.org seem to be affected. People deduced that Google has 'duck taped' "anti-penalties" to some high profile domains. I now disbelieve that, but I wouldn't say that those people had 'assumed' that Google was letting some sites off the hook; deductions were made from some evidence that points in that direction.
I guess I'm just trying to say that even though all of us are wrong sometimes, it's still worth trying. If we just follow Paul Boutin's wise SEO guidelines then we'll miss some important factors. Not because Paul Boutin's advice is bad, just because there's no one source of SEO information. Your idea that maybe chaos theory can describe the SEO's effect on search engines is interesting; if I hadn't done so badly with nonlinear dynamics as a student I'd have something to say on the matter.
"PR0 paranoia" is well founded, IMO. It seems widespread amongst people who use artificial PageRank inflation, but it's also affecting people who just happen to run a number of related services. As Brett says above, 'Many seo's have sleepless nights as we realize it is "Google or Bust"'.
I agree with your other comments. Webmaster World (along with the early Brin/Page papers and information from Google itself) is probably the best place to find information about Google. The people who I know have spent a lot of time researching the subject tend to end up posting here eventually.
I've had StickyMail about my "poison words" comment. I should have linked to Brett's article on poison words [searchengineworld.com]; anyone who wants to see it in action should check the Open Directory pages with 'guestbooks' in the TITLEs. (Those pages are OK, they just don't pass their PR on.)
Brett, nice history of SEO. I think that a large part of the reason for these forums' rise to popularity is your personal breadth of knowledge.
> 98) [...] "sites are mirrored as a means of "bumping off" the competition due to alta's horrible dupe page detector."
It seems funny that, for Google, this started to happen in earnest in late 2001.
| 12:00 am on Feb 5, 2002 (gmt 0)|
I only scanned through this thread, but would like to point out a couple things I have learned over the last year or so:
1) You have to have something to beat out your competition.
2) Most SEO experts are not experts at Google.
3) 90% of everything you need to do well in Google has been posted in this forum. Unfortunately 80% of the stuff that doesn't work has also been posted in this forum.
4) People put way too much emphasis on the small stuff.
5) Google isn't that much different than it was when it started.
There are only a few "tricks" I use that haven't been mentioned directly on this forum - all of which can be figured out (and no I am not telling you my "tricks").
This is the best place for information - people just need to apply it in a logical manner.
I think the number 1 mistake people make is about PageRank - they think it changes each month. It doesn't. YOURS might - but overall PageRank is the same.
I just see some really nice people banging their head against the table for no reason.
Title, PR, Anchor Text, Big Text, URL, and according to some people theme. Once you have those and it isn't working THEN worry about your perpetual motion pagerank machine.
Everything you need to know to do well in Google is already on here.
| 1:20 am on Feb 5, 2002 (gmt 0)|
Thanks for your wisdom, Chris.
Basically I agree on everything you said.
Just one question:
|I think the number 1 mistake people make is about PageRank - they think it changes each month. It doesn't. YOURS might - but overall PageRank is the same. |
I have to admit this confused me a bit. I mean, if we see PageRank as the probability that a "random surfer" may browse a given web page, then of course the sum of the PR values of all of the Web's pages must be 1 by definition. From The Anatomy of a S.E. [www-db.stanford.edu]: "Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one."
So the PR values we see in the Googlebar are probably a nice-looking "translation" of the actual PR values (which are likely to be some very ugly decimal numbers, something like 0.00002342233...).
But then, is it not the relative PR that counts for a given page? If so, we should not care much about the "overall PageRank" remaining the same, as long as our own PR is higher than our competitors'. Right?
I mean, the total amount of energy in a closed system remains unchanged, but there can be warmer and colder spots here and there, now and then. Besides, PageRank is a representation of the Web which bears more than one similarity with the dynamics of chaotic systems IMHO.
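To make the "closed system" picture concrete, here is a minimal Python sketch of the recursive calculation on a toy four-page web. It is only an illustration under stated assumptions: the graph, the function name `pagerank`, and the damping value d = 0.85 are made up for the example, and it uses the normalized variant in which each page receives (1 - d)/N from the random surfer, so that the values really do sum to one (the formula as printed in the paper uses 1 - d without the division by N). Nothing here claims to be what Google actually runs.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration over a dict {page: [pages it links to]}.

    Assumption: normalized variant, so the ranks form a probability
    distribution. Not Google's production algorithm.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # every page gets the same "random surfer" share
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # a page splits its damped rank evenly among its outlinks
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # a page with no outlinks spreads its rank over all pages
                share = d * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Hypothetical toy web: "d" has no inbound links at all.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
total = sum(ranks.values())

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
print("sum of all ranks:", round(total, 6))
```

The total stays at one on every pass while individual pages end up "warmer" or "colder", which is why only the relative values matter when comparing one page with another.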
| 11:46 am on Feb 5, 2002 (gmt 0)|
> The only accurate prediction, or information on Google is Google.
If you read back through Chris_R's posts you'll find plenty of useful information that you will find direct from Google. That said, there's plenty of useful information at google.com/webmasters/ that people don't take enough notice of, and much of the truly interesting information was written more than three years ago by Brin/Page.
> I mean, the total amount of energy in a closed system remains unchanged, but there can be warmer and colder spots here and there, now and then.
A very good analogy. Google evenly removes a proportion of PageRank, and then evenly distributes it (though the redistribution need not be even); rather like a perfectly conductive (and therefore uniformly hot) heat sink, slightly insulated (85%-ish) from the main body but completely insulated from the outside world. This converges into a steady state. It might make more sense to replace the heat sink analogy with QED, but I don't understand that. Anyway, the 'rank source' is exactly the reason that 'perpetual motion PageRank' machines don't work (nice term, Chris_R), but it's also how you can 'create' PageRank without inbound links.
To get hot you can collect enough items together to share their "random surfer" heat. As the total number of items is now over two billion, you'd need a huge (awesome) number of pages. Depending on the base of the logarithms in the ToolBar, it might take 10,000 pages to generate just PR5; it's easier to get a link from someone with PR6. I've noticed that some significant adult-related terms have quite low PageRank; this may have much to do with the difficulty in finding good inbounds.
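The "collect enough items together" idea can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a description of Google: it assumes an isolated cluster in which n pages each link only to one hub and the hub links back to all of them, it uses the formula as printed in the Brin/Page paper (each page gets a baseline of 1 - d, with d = 0.85), and the names `hub_rank`, `hub_rank_iterative` and `toolbar_guess` are invented for the example. The logarithm base of the ToolBar is unknown, so it is left as a parameter rather than guessed.

```python
import math


def hub_rank(n_pages, d=0.85):
    """Closed-form raw PageRank of the hub in the assumed cluster.

    Solving hub = (1-d) + d*n*leaf and leaf = (1-d) + d*hub/n
    gives hub = (1 + d*n) / (1 + d).
    """
    return (1.0 + d * n_pages) / (1.0 + d)


def hub_rank_iterative(n_pages, d=0.85, iterations=200):
    """Same quantity, obtained by iterating the recursive formula."""
    hub, leaf = 1.0, 1.0
    for _ in range(iterations):
        # all n leaves are identical, so one value stands for each of them
        hub, leaf = (1 - d) + d * n_pages * leaf, (1 - d) + d * hub / n_pages
    return hub


def toolbar_guess(raw_rank, base):
    """Hypothetical log-scale mapping; the real base is not public."""
    return math.log(raw_rank, base)


for n in (100, 1_000, 10_000):
    print(n, "pages -> hub raw PR of about", round(hub_rank(n), 1))
```

Under these assumptions the hub's raw value grows only linearly with the number of pages, while any logarithmic display scale grows by one step per multiple of the base, which is consistent with the point that a single good inbound link is the cheaper route.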
> 4) People put way too much emphasis on the small stuff.
> 5) Google isn't that much different than it was when it started.
> Everything you need to know to do well in Google is already on here.
I agree with everything else you wrote, but the differences between the early Google and the current Google can make a huge difference (eg. if you lose all your PageRank). It may well be the 'small stuff' that helps us find the key to avoid being the next victim of what most of us now believe to be an aggressive anti-spam policy.
As with most aggressive policing methods, a few innocents get rounded up and I'd rather not become one of them (assuming, of course, that I'm an innocent).
| 3:38 pm on Feb 5, 2002 (gmt 0)|
Nice analysis, ciml. However, as I said in my other post (PageRank Feedback Loops [webmasterworld.com]), this is just pure theoretical speculation which is probably not going to be of any help in the "real world" of a 2-billion-page Web which is constantly changing and growing at an impressive rate.
Still, I'd like to point out what seems to me a unique feature of PageRank: i.e., the fact that PR's accuracy or reliability appears to increase proportionally as the Web (the pool of available data for PR calculation) grows larger and larger. Or at least, that's what should be inferred from the original recursive formula. That's a really outstanding characteristic that is intrinsic to the PageRank algorithm, and probably one of the main reasons of Google's enormous success over its competitors, especially those not incorporating link structure analysis in their ranking mechanisms. It's not a coincidence, IMO, that Google's rise went hand in hand with an increase in the Web's growth rate: while the major SE's struggled to keep up with it, Google simply exploited it in a very smart way for its own success.
Therefore, I wasn't very surprised to read what Google's CEO Eric Schmidt said about growth being the biggest challenge for Google today. Personally I would add:
2) staying fresh and up-to-date;
3) discerning "real" content from spam.
Besides, that's just confirmed by what Google appears to be doing lately...
|To get hot you can collect enough items together to share their "random surfer" heat. As the total number of items is now over two billion, you'd need a huge (awesome) number of pages. Depending on the base of the logarithms in the ToolBar, it might take 10,000 pages to generate just PR5; |
You can easily reach that threshold (10,000 unique pages) with a database-driven, intensively-crosslinked dynamic web site (and good content) nowadays. However, I agree that quality inbound links and good directory listings are still essential (not only for PR, but also for your site's traffic in the first place). I mean, Google is currently the most important source of referred traffic, and will probably stay so for a long time, but (luckily enough) it's not the only one, especially for highly-specialized, "niche" web sites. So in the end I still agree with those (like Paul Boutin in his article [hotwired.lycos.com]) who have warned us against focusing exclusively on Google-specific SEO "tricks", often neglecting other factors which may prove even more important for a web site's success in the long term.
So I don't believe the "small stuff" is ever going to make a big difference after all.
| 5:24 pm on Feb 5, 2002 (gmt 0)|
> ...pure theoretical speculation which is probably not going to be of any help in the "real world" of...
This is the part I don't understand. Rank source, reaching a steady state (OK, "convergence", same thing), rank sinks ("PageRank perpetual motion"), normalisation and logarithmic graphs were all in the early Brin/Page papers.
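To make the 'steady state' point concrete, here is a small sketch using only the recursive formula from the early paper. The four-page link graph, the function name and the tolerance are assumptions of mine, picked purely for illustration; it says nothing about what Google actually runs today.

```python
def iterate_until_stable(links, start, d=0.85, tol=1e-9, max_iter=1000):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until values stop changing.

    Assumes every page has at least one outbound link.
    Returns the final values and the number of rounds used.
    """
    pr = dict(start)
    for step in range(1, max_iter + 1):
        new_pr = {page: 1.0 - d for page in links}  # the rank source
        for page, outlinks in links.items():
            for target in outlinks:
                new_pr[target] += d * pr[page] / len(outlinks)
        if max(abs(new_pr[p] - pr[p]) for p in links) < tol:
            return new_pr, step
        pr = new_pr
    return pr, max_iter

# Hypothetical mini-web: "d" has no inbound links at all.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

cold, cold_steps = iterate_until_stable(links, {p: 0.0 for p in links})
hot, hot_steps = iterate_until_stable(links, {p: 100.0 for p in links})

for page in sorted(links):
    print(page, round(cold[page], 4), round(hot[page], 4))
print("rounds needed:", cold_steps, "(cold start),", hot_steps, "(hot start)")
```

Whether you start every page at 0 or at 100, the values settle at the same place and the total ends up equal to the number of pages. Page "d", with no inbound links, still keeps its (1 - d) share from the rank source.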
I believe that an understanding of these things is of help. There's plenty of advice around about setting up umpteen domains to 'inflate' PageRank; how do we know that they are wrong and Paul Boutin is right?
Two quotes, one about creating fake domains:
|The Google guys giggle at this obvious scam: If you understand how vectors work, spreading your pages across multiple domains, or building duplicate sites, does no better than if you'd simply added those pages to your original domain. |
The other is about META keywords:
|Repeating the most important keyword twice seems to work with some search engines, but repeating more than that will cause some of them to ignore the whole page. |
You and I both know which of these makes sense, but they came from the same article. Also, some country-wide accommodation sites suffer from the 'grouping' (sorry, I forget the correct term) done by domain, while some of those using a different domain for each establishment are doing much better. Also it can be easier to get umpteen listings in Web directories for umpteen different domains (one for each establishment) than for one domain (with umpteen different establishment 'sites').
An important part of 'research' is finding which articles/people/viewpoints to believe, but also when to take a different approach.
> PR's accuracy or reliability appears to increase proportionally as the Web (the pool of available data for PR calculation) grows larger and larger
Indeed. "What can you do with a Web in your Pocket?" (by Brin, Page and others) tells us that "size does matter. The extraction experiment [Authors and Titles] would likely have failed if the WebBase had been one third of its current size".
Still, I'm not sure that Google's doubling in size from 1 billion to 2 billion documents will have made PageRank itself a proportionately better indicator of importance. The early PageRank (which wasn't only used as a Web search engine) used Stanford WebBase, which contained 'only' 25 million documents.
| 8:51 am on Feb 6, 2002 (gmt 0)|
One way to peek inside the Black Box would be to read the patents Google holds or bought from Outride.
Curiously, Google no longer refers to PageRank as patent-pending (they used to). That suggests they either withdrew the application (because they instead chose to make it a trade secret) or it was awarded but assigned to another entity.
AFAIK the Google founders themselves have only one US patent [patft.uspto.gov], but it was pre-Google.
| 12:15 pm on Feb 6, 2002 (gmt 0)|
>>That suggests they either withdrew the application (because they instead chose to make it a trade secret) or it was awarded but assigned to another entity.
Method for node ranking in a linked database [patft.uspto.gov]
| 2:34 pm on Feb 6, 2002 (gmt 0)|
Some day I am going to learn how to do quotes on this system - until then.
Yes, I agree with you. You got the gist of what I was saying - it is a closed system. Yes - there can be cold and warm spots as you put it, but what I am talking about is that I notice a lot of speculation about PR, with conclusions drawn when there is no evidence to support them:
1) My site went from 2 to 10
2) I didn't change anything
3) Therefore PageRank as a System has been devalued.
That is the type of conclusion I see a lot.
You are of course correct - there are some other issues that keep me coming back and trying to learn more. I guess my point was - I don't have any secrets - and I do very well with Google. Either I am incredibly lucky - or the basic stuff is enough for most people. I don't try for theme pyramids and all that other jazz. I know some that do and say they do well. I follow what is in the papers - and that seems to work for me. When I have the opportunity - I try things like themes, but I don't worry if I can't.
I have had a few of my sites get zeroed out. I don't know why. I don't see a common factor among them. Hopefully Google is backing off their anti-spam campaign. I still say that spam is not the problem. Spam bothers the webmasters more than the users.
I have a friend who sells - shall we say - widgets online. He gets furious when his competition advertises widgets - as he sells PURE widgets and they sell accessories for widgets - he writes to GoTo to complain and gets REALLY upset at something that seems like splitting hairs to me.
| 2:54 pm on Feb 6, 2002 (gmt 0)|
|There's plenty of advice around about setting up umpteen domains to 'inflate' PageRank; how do we know that they are wrong and Paul Boutin is right? [...] Also, some country-wide accommodation sites suffer from the 'grouping' (sorry, I forget the correct term) done by domain, while some of those using a different domain for each establishment are doing much better. Also it can be easier to get umpteen listings in Web directories for umpteen different domains (one for each establishment) than for one domain (with umpteen different establishment 'sites'). |
Sorry Calum, but I have to disagree on that one.
A company may of course have, e.g., multiple web sites in different languages, each one on a separate domain, but that's far different from setting up a link farm or multiple domains with duplicate content IMO, and Google should be able to understand that difference pretty well. In case it doesn't and someone gets unjustly penalized, they should contact Google and explain the situation providing enough detail for them to improve their spam filters. It's that simple.
| 10:05 am on Feb 7, 2002 (gmt 0)|
Chris, I also don't think I have any secrets that I haven't mentioned here, but I am concerned about spam penalties because PR0 would have serious effects for my customers. I tend to make related sites (the owners don't seem to mind me working for their direct competitors) and what I do (linking to relevant content) could look very similar to those multi-domain spam campaigns Google seeks to hunt down and destroy. So far I've had no sites PR0-penalised (just a few pages on one), but if I did, I might have half the sites I manage wiped out in one go.
The 'theme pyramids' approach just describes a sensible architecture for a large collection of Web content, I don't see Brett's article as implying the creation of 'empty' pages to fit the pyramid; rather that the pyramid is an effective way to structure your pages to help a search engine understand the relationship. At present, I don't believe that Google does understand it well, but I expect that either it or the next generation engines will.
Giacomo, because of my geographical location I have been asked to put up six different Web sites for six different hotels in "mycountry" over the past four years. Each project is stand-alone and 'bespoke'. The owners tend to like to own their domains, and I certainly advise them to. Is it wrong to use a separate domain for each?
Another local "Web designer" (I hate that term) has tended to use his own domain for similar projects (stand-alone sites for individual establishments), presumably because he doesn't run his own servers and hosting was quite expensive when he started.
He can get two listings on Google for "hotel in myarea", I can get twelve (just counting those six sites). It's much easier for me to get directory listings than him. Is he deserving of two listings and am I deserving of 12 just because of our URL choices? He links back to his tourism 'portal page', I link back to my regional 'portal page'. Is he right to link and am I wrong just because of our URL choices? These choices were made in 1997 (mine) and 1996 (his), so Google spamming wasn't on our minds.
Also, take national directories of hotels. Some use a directory (as in 'folder') for each entry, some use a domain for each entry. These also were going before Google started so the way that Google deals with duplication and spam can unintentionally hurt them, too. Understanding these factors is an important aspect of Internet marketing in 2002.
| 10:19 am on Feb 7, 2002 (gmt 0)|
I have recently had a decent amount of offers to do SEO. I have turned them down for the same reasons you mention. I just don't like dealing with having to worry about that stuff.
Having sites banned is of course the worst thing that can happen. Luckily, I have only had a very small percentage where this has occurred (I still do not know why for sure). They all belonged to me - so no biggie - I am sure if they belonged to a customer - things would be much different.
I guess my point WAS NOT that theme pyramids and the like are bad. I too think a general open logical strategy is good for the long term.
Just that - as of right now - they are not needed. Everything you need to do well in Google is known. There may be some other things that could give you an edge, but by following what was written years ago - and REALLY reading it CAREFULLY - you can do VERY WELL.
The people that run Google - for the most part - are very logical - so when in doubt, I ask myself - what is the most logical thing for Google to do in this situation?
This PR0 thing is relatively new. At least Google has been nice enough to let us know it IS A SPAM filter - they also were nice enough to let us know they backed off on one of their tests for spam.
I have lots of questions about this, but since we still don't know anything - the best course of action is to be careful. We must assume that Google will do this in an intelligent and logical manner. If not, what is the point anyway? I just do what I know - what has been proven to work, try not to break any rules, and pray around the ~25th that everything will be ok.