This 48 message thread spans 2 pages.
New Models for Thinking about the Google Algorithm
We don't KNOW how Google works internally. We try to understand it as a "black box" - that is, we look at what kinds of data are going in and then we look at the SERPs that come out. Then we build our own theories, our own mental models of what "might" be going on inside that black box to make such-and-such an input give us back such-and-such an output.
The more diverse our data samples, the more we stand a chance of catching some fringe behaviors that require us to revise our mental model. This can be one advantage of a forum community, as well as many kinds of networking.
The challenge here is that our entire tool-kit for building our black box models may be missing some important, or even essential, elements.
For example, many people seem to think of Google as a rather linear score-keeper. The central model here is that Google takes a url, runs it through different tests, and each test then gives that url either some plus points towards its score, or some minus points -- then all the points are added up and the url gets a final score to use in ranking.
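That additive, score-keeper model could be sketched roughly like this. All factor names and weights below are invented for illustration; nothing here is Google's actual scoring:

```python
# Toy sketch of the "linear score-keeper" mental model: each test adds
# or subtracts points, and the summed total decides the ranking.
# Factor names and weights are invented for illustration only.

def score_url(signals):
    """Sum independent plus/minus scores for one url."""
    tests = {
        "title_match":     10,   # points if the title matches the query
        "backlink_count":   1,   # points per backlink
        "keyword_stuffed": -25,  # penalty if stuffing is detected
    }
    return sum(weight * signals.get(name, 0) for name, weight in tests.items())

urls = {
    "a.example": {"title_match": 1, "backlink_count": 40},
    "b.example": {"title_match": 1, "backlink_count": 90, "keyword_stuffed": 1},
}
ranking = sorted(urls, key=lambda u: score_url(urls[u]), reverse=True)
```

In this model every factor is independent and the effects simply add up -- which is exactly the assumption the rest of this thread calls into question.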
This is not a very strong or predictive model these days. In fact, I think Google has moved pretty far away from any internal methods that could be well-modeled by this particular kind of thinking.
It may have been functional for pre-Google search engines -- and even for early Google, perhaps. But today, this kind of thinking creates more and more departure from the effects we see in the real SERPs.
The algo elements now are quite diverse and complex when compared to simple text-matching and yes/no scoring. Have you noticed that even as Google tells us they are moving away from using many penalties, we seem to need MORE "penalties" to describe what we notice?
My own cure for the "common model" involves study of the academic papers and patents, plus close listening to statements from Google reps -- not so much what they say as the word choices they make, the "how they say it". Whatever they are doing (and precise reverse-engineering of the algo is pretty hopeless today), it will color their style of communication; the internally used jargon and technology inevitably leave tell-tale signs.
My main point is that comprehending today's Google, beyond the most superficial level, requires some fresh tools and some re-programming of analysis habits that are now many years behind Mountain View.
How are you all coping with the "new Google"? What kind of thinking have you needed to put on the scrap heap?
Have you found something new that really helps?
I know it sounds obvious, but I think building for the user is becoming increasingly important. Google has access to a whole new raft of data on how users interact with a site. If users are turned off by your site, it stands to reason that future first-time visitors will be too. With that type of information, Google would be crazy not to use it to rank your site.
Hmmm .... I kinda agree with frakilk. Remove the complexity at the initial level and build from that. If you can't see it - you haven't got the answer IMO
To build the model, I believe you start with what the user would wish to see and how Google interprets it, so the two form a synergy. Then start building in key factors of importance as headlines, supporting these with factual comments [backed by evidence from the forums, patents, experimentation, Google guidelines] and, separately, intuitive/speculative assumptions [which carry high risk and require further experimentation].
I think a deep technical pursuit of the inner workings of Google is like trying to figure out how the universe operates and may not be effective beyond a certain point. But certainly continued discovery missions through analysis and communication with other site owners/ webmasters is essential.
[edited by: Whitey at 10:34 pm (utc) on Feb. 11, 2007]
WOW Ted – this thread is almost ‘Dave Baiting’…. Count me in.
It is far too common for people to simplify the SEO process to make it a malleable concept. As Mr Pasternack and others have assured us, it is not rocket science. It is, though, the science of document (information) indexation and retrieval. It is the art of search engineering, and it is certainly NOT a simplistic science by any means.
As with Ted, I am a major technical document hound. It is truly a core understanding for the professional SEO. Folks will tell you (cop out) "just because a patent was obtained doesn't mean it was used" -- agreed. That's not the point, though. It helps you, over time, begin to think like a search engineer. I dare say much of how I approach an SEO program is intuitive because I have begun to see a site as the algo would. The more you read... the easier they are to digest...
People also seem to think the algos are intermixed, swapped in and out and the like. It is more an instance of layering new methods into/onto the existing ones. With each layering and associated ‘dial turning’ the story changes and the process becomes more complex. As searchers evolve so do the search engines… so must we….
If it was ‘easy’ to be #1 – we’d all be there – and that can never happen.
No concrete information in this post, just some philosophical ramblings, so if that doesn't interest you, skip it.
These days, the problem with evaluating a "black box" model as large as the Google search algo is the lack of ability to do controlled experiments. It's really easy to confuse correlation and causality. While it is more or less clear where G is going (start with the original G patents and follow the trail, as well as chronological developments in related fields), it is not clear exactly where they are on that path, and hence how they are doing it. Part of the confusion in studying the "black box" with a lot of dials and indicators (to borrow terminology from DFSS and some statistical disciplines) is that most of us don't know when, and which, knob of evaluation criteria (some call it a filter, some a penalty...) gets turned. Today the values on the set of knobs may produce one set of results; by turning knob(s) tomorrow, another set of results is produced, which to us can look like a whole new penalty when in fact it's not. This is not to say that G isn't periodically adding new "knobs" to their control set.
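The knob-turning point can be illustrated with a toy example: nothing new is added to the system, yet retuning one existing weight reorders the results in a way that, from the outside, can look like a brand-new penalty. All names and numbers here are invented:

```python
# Toy illustration of the "knobs" idea: the same scoring function, with
# one existing weight retuned, reorders the results in a way that can
# look from outside like a brand-new penalty. All numbers are invented.

def rank(pages, w_links):
    return sorted(pages, key=lambda p: p["relevance"] + w_links * p["links"],
                  reverse=True)

pages = [
    {"url": "thin-but-linked.example", "relevance": 2, "links": 50},
    {"url": "strong-content.example",  "relevance": 9, "links": 5},
]

today    = rank(pages, w_links=0.2)   # link knob high: the thin page wins
tomorrow = rank(pages, w_links=0.05)  # knob turned down: it "drops"
```

From the outside, the thin page looks like it was hit by a new "penalty" overnight, when all that happened was a dial turn.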
Perhaps another useful insight, besides patents and reading between the lines of googlers' statements, is to look at the advanced features of their CSE offering.
For those who haven't stayed in "deep contact" with algo development from the beginning, or don't want to invest a considerable amount of time reading papers, guesstimating changes in G's algo might be futile, although it's always fun.
If the point of understanding the Google algorithm is to climb in the SERPs, what interests me at the moment is what SEO (and other) steps to take with the upcoming "personalization of Google search", where everyone will most likely get a different SERP page. For us, ready or not, it's coming, and I would rather be prepared -- but this is perhaps for a different thread.
I agree that the complexity of Google's algorithms means it's almost impossible for us to reverse engineer them. I'd go as far as saying the interactions between the different algorithms being used for a variety of tasks are too complex for even Google to fully predict their outcome.
I take this stance as I've been working on a few pieces of fairly heavy code recently, dealing with Information Retrieval tasks of the kind that will certainly be used in some of Google's algos. What have I found? Well, one thing's for sure: if you set an algorithm loose on massive amounts of data, it WILL still manage to throw up unexpected results. I know this to be so as the output of the work I've been doing has been hand reviewed, but Google has so much data flying about they can't possibly spot all of the anomalies that creep in.
Even if you take the stance that each algorithm is 99% effective, you can quickly get to a very unpredictable position if you have many teams working on many different projects. The more times you run data through the mill, the more unpredictable the results become.
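The compounding effect is easy to quantify: if each of n processing stages handles a document correctly with probability 0.99, the chance that every stage gets it right falls off quickly. A back-of-the-envelope sketch, not a claim about Google's actual stage count:

```python
# If each of n independent processing stages handles a document
# correctly with probability 0.99, the chance that ALL stages get it
# right is 0.99 ** n, which shrinks quickly as stages are layered on.

def all_stages_ok(per_stage, n):
    return per_stage ** n

for n in (1, 10, 30, 70):
    print(n, round(all_stages_ok(0.99, n), 3))
# by around 70 layered stages, fewer than half the documents come
# through with every stage having behaved as intended
```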
Many are guilty of taking a simplistic view precisely because there is too much variability. The problem we have is that we still see patterns and act on them, but now we are facing the ever-increasing likelihood that the root cause is two or three steps removed from the behaviour being exhibited in the SERPS.
I'd like to coin a phrase to describe the problem: there isn't just one algo anymore, and analysing multiple changes is exponentially more difficult than it used to be.
Is it really necessary to understand every nuance of the Google algorithm? Call me a pragmatist, but if you apply Pareto's Principle (80 percent of the effect comes from 20 percent of the factors) to the algorithm you'll probably never worry about the details in the black box. In fact, you could probably build about 10 major categories and only focus on 2 of them to achieve superior results over your competition. You could do this by simply focusing your own effort on the 20% that gives you 80% of the benefit while the competition is trying to figure out every detail.
In addition, by focusing on the 20% you'll most likely always stay within Google's guidelines and thus avoid filters and penalties.
|I think a deep technical pursuit of the inner workings of Google is like trying to figure out how the universe operates and may not be effective beyond a certain point |
Do I really need to understand the formation of stars, nebulae, and galaxies in order to plant my apple seed and grow apple trees?
Do I even have to understand photosynthesis to benefit from the process?
And continuing the metaphor, I disagree with the entire beginning premise.
Tedster "thinks" Google has changed so much. (in fact, this sounds similar to Google propaganda to me)
Really? I haven't noticed anything remarkable in the past 3 years.
Much like "primitive" farmers were able to produce results believing all kinds of fallacies about the sun, an SEO who observes RESULTS can make logical conclusions.
And the BASICS haven't changed.
The "whys" and "hows" are nice to know and impress other uber-geeks with in dinner conversation, but I don't notice a huge difference in how Google ranks sites from 2004 and now.
|These days, the problem with evaluating a "black box" model as large as the Google search algo is the lack of ability to do controlled experiments. |
I agree very much. That's why I thrive on finding those "fringe" situations, especially if they hang around long enough to feel that they are clearly not just some data burp. For example, on a SERP where a long-time client holds #1, we saw a #2 of 4 years drop to #4 in early January, and it's staying there. It's now outranked by two lower PR urls with fewer direct backlinks, and lots lower in on-page factors, too.
Now that makes me pay attention. Others report that SERPS which were relatively clean and relevant are showing more spam/useless urls.
Or, how (and why) did Google boost pure informational sites above commerce sites on some searches -- but not others?
The number of such anomalies since Nov/Dec feels very important - but I don't really see the pattern yet. Now I am almost hyper-alert, trying to understand what seems like something different going on.
|The "whys" and "hows" are nice to know and impress other uber-geeks with in dinner conversation, but I don't notice a huge difference in how Google ranks sites from 2004 and now. |
Indeed. Way back in 2004 it seemed that Google was trying to rank sites from top to bottom. Now Google seems to rotate the rankings, seemingly at random, so that one day you're at the top, another day near the top and yet another day you are nowhere to be seen. One could say that Google is doing this to thwart those trying to game the system.
You could also say that it seems as if Google no longer knows how to rank sites and just throws SERPs out there, hoping they're just good enough to fool the casual user into thinking they're relevant.
Maybe there are more options, but it seems to me as if Google isn't shooting for one perfect SERP anymore. They just hope it averages out to something sort of good, in general. It's as if they gave up and threw in the towel.
[edited by: Atomic at 12:01 am (utc) on Feb. 12, 2007]
|The number of such anomalies since Nov/Dec feels very important |
No offense Tedster, but where are your TESTS done on any of your "observational" anomalies?
Tell me you've been keeping track of 1,000 sites over a period of 3 years before you start saying "Google's different".
Of course, it's different. They tweak and change things all the time, but "noticing" a site here and there change rankings (from #2 to #4 doesn't even come close to qualifying as an anomaly) is not even close to a scientific method and creates a continuing environment of "throwing chicken bones" in an industry that's inherently based on LOGICAL, PROVABLE MATHEMATICS.
IMO a great deal has changed since 2004... look at last year's new infrastructure fiasco (ol' Big Daddy)...
I personally, following along, was not adversely affected on our sites nor our clients' sites... we enjoyed the commotion from the sidelines.
So do I believe involving technical search engineering in one's SEO studies helps? You bet. I am a better SEO for it (I just have to work on my link baiting abilities... strong on the technical, weak on the BS).
Around the office I like to call it - Predictive SEO
Doing SEO for a site with today AND tomorrow in mind (we have NEVER done recips in campaigns, for example... no pain last year anywhere).
So watch for the trends and try a little 'predictive SEO' of your own.. he he....
|look at last year's new infrastructure fiasco |
Exactly, it was a roll out of new infrastructure thereby explaining the issues.
But show me some SERPS (more than 100) where I will not be able to "explain" why the top 10 are ranking where they are.
Did I miss where Apple got booted from the "computers" SERPS?
Or are we talking about "how to buy mauve widgets in timbuktu" SERPS?
If you want to make the argument that Google's gotten worse at unpopular multi-phrase searches, then I might agree, but otherwise I don't see anything worth "re-evaluating"
[edited by: whitenight at 12:22 am (utc) on Feb. 12, 2007]
At the end of the day there has to be at least SOME meeting of the minds between search engine engineers and the people building the websites. Why? Because in our quest to "build websites for users," Google stands as the gatekeeper between us and the user. One could build the Taj Mahal of websites, but if it doesn't rank, no one will see it; no one will link to it.
Very few people seem yet to realize the monumental changes Google has made in recent months. They don't yet realize what's going to hit them soon. I personally think this is completely uncharted territory we are entering.
Buried somewhere in one of those "950" threads, steveb made the excellent observation that this latest change heavily affects "niche authority sites," and he is correct. Translated, that means "having a well established website built for users" no longer works in certain cases.
|For example, on a SERP where a long-time client holds #1, we saw a #2 of 4 years drop to #4 in early January, and it's staying there. It's now outranked by two lower PR urls with fewer direct backlinks, and lots lower in on-page factors, too. |
|Now that makes me pay attention. Others report that SERPS which were relatively clean and relevant are showing more spam/useless urls. |
I agree that something big in the SERPs is probably going on. I've seen a lot of the spam sites ranking and I've seen a big shift in several of my keywords. However, I still believe that the fundamentals are the same and that my site will be fine if I continue to focus on them rather than some fancy new SEO play. Google will work out the kinks, and in the meantime I'm going to write new content and get good quality backlinks.
My question to tedster is, "Did you use any fancy SEO plays (obviously purchased links, unrelated links, keyword stuffing) to get those sites to #1 and #2?".
Reading white papers and patents can go a long way and is well worth the effort.
IMHO the Google algo(s) is/are too complex to begin to grasp in their entirety, or even largely in part, because so many of the factors can interact with one another to produce any given effect.
What's plain with penalties or heavy-handed filtering, in some cases, is seeing a direct correlation with guidelines violation, like a PR0 penalty I've seen just recently. Yes, the PR0 penalty is still alive, well and kicking but there's nothing new with the cause - not in this case anyway.
I think it started out with IDF, probably around 2001-2, if I remember right, and for the most part, I've made a "hobby" of collecting and studying white papers and patents for several years, printing them out so it's easy to cross-reference elements and factors in common between some of them.
IR isn't the same thing as SEO; it's a science discipline, and in my mind the very foundation of learning how search works is to try to get into their heads to learn how they think, and what they look for when building algos.
|Did you use any fancy SEO plays (obviously purchased links, unrelated links, keyword stuffing) to get those sites to #1 and #2? |
I need to clarify. The #2 url in this case (that recently fell to #4) is not our client. It's their long time competitor, someone we thought was invincible at #2 for a number of reasons. Those reasons include strong, global name recognition that provides them with natural, ongoing growth in great backlinks.
Something about this particular fall from #2 to #4 still makes no "sense" by most SEO thinking that I'm familiar with, that's why I mentioned it. It's one of those "fringe cases" with something important to teach me about ranking, something that is not in my current black box model.
With regard to the #1 position, we did get pretty fancy with link building for this client, tapping into various communities directly and also through linkbaiting with some controversy. But we did nothing off-theme and nothing purchased - except for some new paid directories. We did make use of lots of supportive sites, but none of them are fluffy. They all make sense to visitors and serve a distinct business purpose.
This area of off-theme detection in backlinks is definitely one to watch for new analysis approaches by Google, I think. We can already do so much more with a handful of solid and on-theme links than we can with a truckload of "whatever", but sometimes the thematic connection can still be stretched pretty thin and be effective. I do expect that situation to shift even more.
|the very foundation of learning how search works is to try to get into their heads to learn how they think, and what they look for when building algos. |
That's how I see it. The first shift that this kind of study took me through was leaving behind the mind-set of "What can we get away with?" and moving over to "How can we send the strongest, clearest possible signals?" From there begins the study of what signals are being measured, and how.
I now feel that the head-on battle against spam probably means less to Google than finding improved ways to measure relevance. Ideal relevance detection would naturally eliminate spam, because it eventually becomes easier to create real relevance and quality than it is to fake it. Google's not there yet, but I think they're a lot closer.
|I now feel that the head-on battle against spam probably means less to Google than finding improved ways to measure relevance. Ideal relevance detection would naturally eliminate spam, because it eventually becomes easier to create real relevance and quality than it is to fake it. |
Hmm. Then this is a GOOD thing, no?
Many of us here have been yelling that this was the way to go, for years now.
As far as to your original point, I look at it like Counting Cards in BJ. It's obviously impossible to win every single hand...
But there are various systems to 'beat the house'. Some are complicated, some simple.
Using the simple systems of "counting cards" in SEO means there's going to be more variance, but it's easier to keep a handle on.
Using the complicated systems of SEO card counting means "more understanding" and less variance, but it's a lot harder to keep everything organized.
The question is: Which methods were you using to analyze the Google deck of cards?
I.e., when the algo changes it can result in dramatic shifts (a run of bad luck) for the simpler SEO counting systems.
So you can switch to the more complicated systems, or let the law of large numbers do its thing and let the odds work back into your favor.
But I think it's assuming a lot to say Google is now using a 60-card deck just because the numbers aren't going the way you expect them to.
(lol, any game theory people out there understand this analogy?) :P
One big thing a lot of people are missing is human editing. We spun off a website a few months ago, a small informational site that is basically a hobby site, with unique content that was written by the customer we built it for. It had a lot of text on each page and more information on the subject than sites like wiki have.
Now, a week after we submitted the site to Google, googlebot started to crawl it. The site had no backlinks at the time. Since the site had a lot of great information on it, I am more than sure the current Google algo flagged it as a great site worthy of the top ten for its respective keywords.
Lo and behold, someone from a Google IP address hand-typed in the site! I believe it was a human reviewer for the SERPs. The site did not have AdSense on it or anything else that Google needed to check it for. The reviewer surfed around on the site, added it to their favorites and left. Within two days it was ranking top ten. So overall the site made it to the top in less than a month.
So there is a human editing factor in the serps and I believe some penalties are placed on sites via humans. The algo brings up information, but I believe the top ten is controlled somewhat by humans.
|So there is a human editing factor in the serps and I believe some penalties are placed on sites via humans. The algo brings up information, but I believe the top ten is controlled somewhat by humans. |
Great. So my 4 year old site, updated daily with links from all over the world, has gone to the end of the results because someone at Google doesn't like it... but only on some days, because on others it works fine.
I want those 4 years of my life back.
|Or, how (and why) did Google boost pure informational sites above commerce sites on some searches -- but not others? |
These purely informational sites have AdSense ads on them.
Thx for sharing this information, trinorthilghtning. So the patent tedster pointed to in [webmasterworld.com] is definitely in use now?
What I'd like to know from those of you who have studied the patents in more depth and detail is: those figures: "network 100", "client 110", "servers 120-125": are these the same or similar throughout all the papers? Have you tried to draw a sort of flow-chart ACROSS all the papers? Maybe this is a way to get an idea of how the various modules of Google's search engine are organized internally. Could you alternatively tell me how the hell I can get access to the drawings in the appendices? ;)
I'd second whitenight insofar as the way we used to analyze Google's "behaviour" in the past is not very scientific, despite the fact that most of the SERP results are calculated purely automatically, and thus should lead to predictive theories much more easily than, e.g., psychological research.
It seems as if no one has ever tried to systematically summarize the "case studies" reported over and over in the postings. Would that make sense at all?
I think there definitely are some relatively new factors in play, or at least phasing in. We've discussed these two in particular: History and Age Data [webmasterworld.com] and Human Editorial Input [webmasterworld.com].
Recently Google reps and others have made remarks that keyword density is no longer a workable approach to understanding rankings. I've watched keyword density metrics since back in the mid 90's when I used WPG and first woke up to the metric.
It has seemed clear to me for a while that KWD for Google had gone the way of the dodo bird -- but again, what has taken its place? The demise of keyword density is one area where I think PaIR-based techniques are showing their muscle.
All I know is that counter spam is working better than ever. Google seems to have no solution for it.
>>Recently Google reps and others have made remarks that keyword density is no longer a workable approach to understanding rankings
IMHO it goes further back than just recently actually, and the concept of KWD being a significant factor has been seriously debunked for at least a few years that even I know about - not by SEO people, by IR people.
Checking for KWD still has value, with Brett's KWD tool anyway, and it may be OK to call it that for identification purposes, but not for the density itself. It was evident around the time of Florida that rather than density it was the number of occurrences; in fact, NFFC and I discussed it at length, and he even came up with a maximum number which, when checked, turned out to be as spot-on as it can get.
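The density-versus-occurrences distinction is easy to demonstrate: density is normalized by document length, so padding a page with filler lowers its "density" without removing a single repetition, while a raw occurrence count is unaffected. A quick sketch with hypothetical helper functions (this is not Brett's tool):

```python
# Sketch of density vs. raw occurrences (hypothetical helpers, not
# Brett's tool). Padding a page with filler lowers keyword "density"
# without removing a single repetition; an occurrence count is immune.

def keyword_density(text, keyword):
    words = text.lower().split()
    return words.count(keyword) / len(words)

def occurrences(text, keyword):
    return text.lower().split().count(keyword)

page   = "widgets " * 10 + "filler " * 90    # 100 words, density 0.10
padded = page + "more filler text " * 100    # 400 words, density 0.025

assert occurrences(page, "widgets") == occurrences(padded, "widgets") == 10
assert keyword_density(padded, "widgets") < keyword_density(page, "widgets")
```

Which is why an occurrence cap, rather than a density target, better matches what was observed around Florida.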
While "human factors" may provide data, examples and information to use, there's NO WAY they could or would be used to *fix* search results. Human evaluation can provide valuable input as data for statistical analysis, and the data derived from the analysis can be used for setting algo parameters and/or upper and/or lower limits.
IMHO it isn't even actually phrase analysis, though that's part of it; it's also about keyword relationships in document sets.
|Human evaluation can provide valuable input as data for statistical analysis, and the data derived from the analysis can be used for setting algo parameters and/or upper and/or lower limits. |
Are you suggesting they've taught the algo how to learn...WOPR [en.wikipedia.org]?
|IMHO it goes further back than just recently actually, and the concept of KWD being a significant factor has been seriously debunked for at least a few years that even I know about - not by SEO people, by IR people. |
Marcia makes a valuable point.
And in the hope of helping people understand, I will eliminate my usual sarcasm and hyperbole.
I would suggest EVERY SEO who is serious about "figuring out Google" take a 3-month period and
>> do NOT read any SEO boards
>> do NOT listen to any Google employees
>> Do nothing but pick 30 non-related (to your sites, that way you can be objective) popular phrases and 20-30 related "long tail" phrases and study the top 20 SERPS.
The algo is RIGHT THERE in the SERPS.
Using the 80-20 principle, simply forget about pinning down the 20% of the algo that takes too much time and changes too often.
Nail down the factors that make up 80% of the SERPS.
THROW OUT everything you've read, believed, or were "told by a trusted source" before.
IF you have a theory, test it by seeing if those factors are consistent in 90% of the sites ranking the top 20.
If they are not, you're still throwing chicken bones; refine your theory and test again.
I guarantee you will have a MUCH better understanding of the algo and what's going on than you do now.
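The testing loop described above can be reduced to a tiny harness: for hand-collected top-20 data, check whether a hypothesized factor holds for at least 90% of the ranking pages. The data and the factor check below are placeholders to fill in yourself:

```python
# Minimal harness for the test loop above: does a hypothesized ranking
# factor hold for at least 90% of the pages in your hand-collected
# top-20 data? The data and the check are placeholders to fill in.

def factor_holds(pages, has_factor, threshold=0.9):
    hits = sum(1 for p in pages if has_factor(p))
    return hits / len(pages) >= threshold

# e.g. theory: "top-ranking pages put the phrase in the <title>"
top20 = (
    [{"url": "a", "phrase_in_title": True}] * 12
    + [{"url": "b", "phrase_in_title": False}] * 8
)
ok = factor_holds(top20, lambda p: p["phrase_in_title"])
# only 12 of 20 pages fit -- this theory fails the 90% bar,
# so refine it and test again
```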
>>taught the algo how to learn
That would be how a neural network operates (like MSN Search), but my wildly theoretical guess would be that the human-reviewed sites might provide an accurate enough data set to use for testing. Coupled with that, using their own criteria, and maybe ODP data (which is arranged in a perfectly logical categorical taxonomy), would make a perfect data set in a controlled environment for semantic analysis of all sorts - and phrase composition, IDF, etc. Then, coupled with what data they have for clickthroughs and bounce rates in both the SERPs and advertising programs, there's enough for a reality check - especially if taxonomies are being created on the fly for pre-processing.
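The "reality check" idea, using a small human-reviewed set to sanity-check algorithmic output, could be sketched like this. Everything here is hypothetical, not Google's actual process; the names and labels are made up:

```python
# Hypothetical sketch of a human-reviewed "reality check": compare
# algorithmic output against a small set of hand labels. Nothing here
# is Google's actual process; names and labels are invented.

def top_agreement(algo_results, human_labels):
    """Fraction of the algo's results that reviewers rated 'good'."""
    good = sum(1 for url in algo_results if human_labels.get(url) == "good")
    return good / len(algo_results)

labels = {"a.example": "good", "b.example": "spam", "c.example": "good"}
agreement = top_agreement(["a.example", "b.example", "c.example"], labels)
# a low agreement score would flag the current parameter settings
# for another look, without humans "fixing" individual results
```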
Two things Google is fanatical about: programmatic solutions that scale, and statistics. But this thread is about models to try to figure out what makes Google tick:
|Have you found something new that really helps? |
So it's not new but what I've found really helps is reading and trying to absorb what's in papers and patents that are actually put out from inside the heads of the people who put web search together.
My own concept of the algo is far more intuitive than mathematical. I've muddled through reading the papers and patents and while I generally understand the narrative parts the math is often over my head.
That gives me a rather quirky picture of what Google "should" be doing, almost anthropomorphic, and while it would be too fuzzy to satisfy some of the really analytical types among us, it's enough to keep my own work aimed in directions that have been profitable.
My own experience would support Whitenight's suggestion that it's more productive to focus on the 80% we understand, rather than worrying too much about the last 20% which will change frequently anyhow.
Most important of all is to DO something with what we understand. I have lots of things on my to-do list that I know will make a positive difference, but they will only help if I actually get them done! :)
|Indeed. Way back in 2004 it seemed that Google was trying to rank sites from top to bottom. Now Google seems to rotate the rankings, seemingly at random, so that one day you're at the top, another day near the top and yet another day you are nowhere to be seen. One could say that Google is doing this to thwart those trying to game the system. |
For the SERPs I look at, I would have to disagree. Over the past year or so, it seems that the top 8 or 9 always remain, meanwhile an 'outsider' will show up every now and then... however, the #1 and #2 spots always remain the same, while 3-10 shuffle (always with the same sites + the odd-ball listing). I've seen this in multiple examples.