Google Algos for Dummies

A simplified explanation of how Google's algorithms work.

         

shafaki

10:23 am on Aug 3, 2005 (gmt 0)

10+ Year Member



Most publishers are not into programming, and many of those who do program don't work with AI, neural networks and similar stuff, so I thought I would give a simplified guide in the spirit of "For Dummies" to explain in simple terms how Google's algorithms work (be they for ranking, matching or just about anything).

What triggered this post, and kept me thinking about writing it for a while, were the repeated questions and statements along the lines of:
- what is the threshold for keyword density in a page above which Google penalizes the webpage?
- what is the CTR above which Google bans you or makes clicks invalid?
- a web site stays for 90 days in the sandbox before it gets ranked.

Such questions and statements asking for or giving FIXED numbers just miss the point behind how Google's algorithms work.

Google's algorithms do NOT use fixed values for determining stuff while ranking, matching, indexing or penalizing. The beauty of Google's algorithms, which makes them so scalable (i.e. able to work on a HUGE scale) and flexible (changing rapidly and adapting to changes in the webosphere), is that they use automatic learning. No fixed numerical values are hard-wired into the algos; all values are kept afloat, changing and adapting as new 'knowledge' is gained and as new changes take place in the webosphere. Not only are the values of the functions used in the algos kept afloat to vary automatically and be adjusted by the system, but the functions themselves are highly complex and are themselves constructed automatically. The functions can be too complex for even the human brain to understand (those who know about artificial neural networks know what I mean).

An interesting, clear example that showcases Google's ability to automatically learn relationships between concepts is Google Sets. Give it a spin:
[labs.google.com...]

Again, Google Sets learns the relationship between those concepts automatically from the vast text content present on the web. Yes, sure, it might not be perfect or comprehensive (same thing with all of Google's technologies and those of other search engines), but it does an interesting job, especially when you search for stuff like names of companies, sets of products or other stuff which takes advantage of Google's ability to do automatic learning.

Let me give another example, concrete yet hypothetical, to show more solidly how the Google algorithms might work. Let us ask ourselves this question: Does the name of a file (e.g. Blue_Widgets.html) have an effect on how Google matches the page? Or let's ask the question in a different way: To what extent does the name of the file affect Google's matching of it? The answer to this question cannot be a single fixed value like 5% or 15%. The reason behind this is twofold:
1- The weight Google would give to a file name might interact with other elements. For instance, Google might give the file name a higher weight for a certain page due to the presence (or lack) of some elements on that page, while giving the file name a low weight for another page.
2- For Google, the value of the file name in matching might change as the system automatically learns with time. For instance, let's say all webmasters of the world started using garbage names for their files and stopped using relevant names completely. The software will gradually learn that the file name has minimal or no value in determining the matching, so it will drop it out altogether from its matching (or give it a value of 0, which is equivalent to dropping it out).
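
To make point 2 a bit more tangible, here is a minimal toy sketch in Python. It is only an illustration of how a learned weight can shrink when a signal stops being useful, and it is not a claim about what Google actually runs. Everything in it is an assumption made up for the example: the two features (`text_match`, `name_match`), the 80% and 90% agreement rates, and the choice of logistic regression trained by batch gradient descent.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_pages(rng, n, names_are_relevant):
    """Return toy (features, label) pairs; features = [text_match, name_match]."""
    pages = []
    for _ in range(n):
        relevant = rng.random() < 0.5
        # Assumption: the body text agrees with the true label 80% of the time.
        text_match = relevant if rng.random() < 0.8 else not relevant
        if names_are_relevant:
            # Assumption: descriptive file names agree with the label 90% of the time.
            name_match = relevant if rng.random() < 0.9 else not relevant
        else:
            # Garbage file names: a coin flip, unrelated to the label.
            name_match = rng.random() < 0.5
        pages.append(([float(text_match), float(name_match)], float(relevant)))
    return pages


def train(pages, steps=500, lr=1.0):
    """Fit logistic-regression weights by plain batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(pages)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in pages:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)  # fixed seed so the run is repeatable
w_relevant, _ = train(make_pages(rng, 1000, names_are_relevant=True))
w_garbage, _ = train(make_pages(rng, 1000, names_are_relevant=False))
print("file-name weight, descriptive names:", round(w_relevant[1], 2))
print("file-name weight, garbage names:   ", round(w_garbage[1], 2))
```

In the first toy world the learned file-name weight comes out clearly positive; in the second it stays close to zero, which is the "drop it out" behaviour described above. Nobody typed in a threshold in either case.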

I hope that by showing this last (hypothetical) example and the other explanation in this post, I have made the point clear about how Google's algorithms work for those who are non-programmers or those who do programming but have not done neural networks or AI programming.

I also hope that this post will help reduce the number of questions (and statements) asking about or mentioning fixed values for anything related to Google's algos.

stuartc1

10:44 am on Aug 3, 2005 (gmt 0)

10+ Year Member



Interesting post shafaki.

I agree with what you have stated, and there are some very good points in there. However, I should add that what you have described is probably only a very small fraction of the algo (1%?); there are many other factors in there. For me an important one is the fact that the algo introduces other branches of factors based on current knowledge. An example is this: a new website is built with average keyword density on multiple topics, Google finds some good inbound links on sites covering multiple topics, and the PR of the linked pages is set high. There are no links in the DMOZ directory. At this point Google has learned about the site and indexed most of it; it also takes note of the sites that link to it and what their topics are (if known). So it guesses the topic/theme of the site based on the on-site keywords and the sites linking to it. This guess will probably be fairly poor at this point, so it gives the site some presence in the search results to test the water. Over time it takes note of the searches that led visitors to click through to the site, and it attempts to record whether the visitor found the site useful (i.e. whether they hit the back button, when, whether they did other searches, clicked on other results, etc.). It uses this data to build up a larger picture of the site.

The point made above moves me on to my main point. Based on the current knowledge Google has of a site, it will introduce new algo factors. For example, if it is 80% certain of the site's topic, it will give more weight to what visitors think of the site. Or if it doesn't know the topic, it will see which keywords work best for visitors to try to work out the topic. Another factor, which brings PR into the equation: if Google has a pretty good idea of the site's topic, it will give more weight to the PR passed from on-topic sites, and less weight to PR from off-topic ones. The list goes on and on.

Before I get flamed: these are only my thoughts. I obviously do not know the inner workings of Google's technology. I'm only speculating on what I have observed from looking at it from a black-box point of view. Hey, it might all be wrong :) perhaps they have a list of algo factors for me and have fooled me into observing this ;)

BTW... I think this should be moved out of the AdSense forum!

jcmiras

12:02 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



I had also been thinking that Google may be employing some kind of "learning process". Maybe they are using something like an artificial neural network or a genetic algorithm of fussy logic or something else. But the point is, the algorithm's behaviour is unpredictable but the output is concretely defined.

DamonHD

12:33 pm on Aug 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

I'm not taking the p**s, but I do like the idea of "fussy logic"... If only my brain would employ some of it more often! B^>

Rgds

Damon

John Carpenter

1:29 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



shafaki, I don't mean to be rude, but, unless you work for Google, how can you claim you know what algorithms Google uses and how they work? I definitely doubt they would be silly enough to reveal the technical details of their algorithms in publicly accessible documents.

jcmiras

2:36 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



Oops... sorry, it must be "or fuzzy logic"

shafaki

5:11 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



John Carpenter

From your response, it is crystal clear to me that you are not into such programming and this type of algorithm (or algorithms altogether). What I have mentioned above says NOTHING about how the exact algorithms work; I am only stating a FACT about the type of algorithms Google and other search engines use. It is NO secret that they use automatic learning, and they do mention that in their tech pages. My intention was just to make this piece of info, which is obvious and well known to the techies, also known to others who are not into this type of programming or are not into programming altogether.

John Carpenter

5:26 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



It is NO secret that they use automatic learning, and they do mention that in their tech pages.

Any references?

lammert

6:05 pm on Aug 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am only stating a FACT about the type of algorithms Google and other search engines use

The nice thing about facts is that there is some proof: a document, webpage, etc. from an authoritative source (Google) that describes these facts. Please tell us what reference material you used; otherwise I have to conclude that it is fiction in your memory rather than fact (although fiction and facts can be identical in some situations, or might look identical).

FYI, PageRank--the only part of the Google algorithm we know for sure because it has been well documented--has nothing to do with fuzzy logic but is a clean mathematical formula.
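
For reference, the formula as published in the original Brin and Page paper is PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of outbound links on T, and d is a damping factor usually set to 0.85. Below is a minimal Python sketch that iterates that formula on a made-up three-page web. The graph, the function name `pagerank` and the iteration count are assumptions chosen for illustration; it says nothing about how Google computes PageRank in production.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate the published PageRank formula.

    links maps each page to the list of pages it links to.
    Every page in this toy version must have at least one outbound link.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            inbound = sum(pr[src] / len(links[src])
                          for src in pages if page in links[src])
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

On this toy graph C ends up with the highest score and B with the lowest, since C collects links from both other pages while B only gets half of A's vote.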

Good observation of a black box--as we all do with Google--doesn't always reveal the truth. For example, by examining what he observed, Isaac Newton developed classical mechanics. He thought that his theory described all mechanics, but some centuries later Albert Einstein developed the theory of relativity. Classical mechanics then proved to be just a special case of the theory of relativity, valid when the speeds of objects are much smaller than the speed of light.

So IMHO, without having inside information from Google directly, or using a testable hypothesis, you should talk about thoughts rather than facts.
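
The "special case" relationship can be seen in the standard textbook expression for kinetic energy; the symbols (m for mass, v for speed, c for the speed of light) are the usual ones and are not taken from this thread:

```latex
E_k = (\gamma - 1)\, m c^2, \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}}

\gamma \approx 1 + \frac{v^2}{2c^2} \quad (v \ll c)
\quad\Longrightarrow\quad
E_k \approx \tfrac{1}{2}\, m v^2
```

At everyday speeds the relativistic formula collapses to Newton's, so a model that fits every observation available at the time can still be only part of the picture.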

<added>
FWIW: I studied both Physics and Artificial Intelligence at Groningen University in the Netherlands.
</added>

wyweb

6:58 pm on Aug 3, 2005 (gmt 0)



Any references?

I'd like to see this too..

Kendal

8:12 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



It is NO secret that they use automatic learning, and they do mention that in their tech pages. My intention was just to make this piece of info, which is obvious and well known to the techies, also known to others who are not into this type of programming or are not into programming altogether.

If it's no secret, then provide the source from Google that supports your post.

shafaki

10:02 pm on Aug 3, 2005 (gmt 0)

10+ Year Member



stuartc1

All that you have mentioned is only a fraction (maybe less than 1%) of the methods Google uses in their algos, which I hinted at in my first post.

The reason behind my post was NOT to mention any discovery of mine! It was just to make known to the non-(AI) programmers what is already well known to those who do such types of programming. My intention was not to deliver any new information, but simply to carry the information over from the domain of the AI-and-similar-stuff programmers to the domain of daily publishers who might not even know programming.

lammert

12:04 am on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



domain of the AI-and-similar-stuff programmers

www.AI-and-similar-stuff.com? ;)

What I tried to explain in my previous post is that what might look like the right model to describe a black box is not necessarily the right model, or does not fit all situations. Newton had a good model, but only for "slow" mechanics.

There are two methods to prove how the Google algorithm works.

  • The first one is by an authorized document from Google.
  • The second method is a scientific approach with the hypothetico-deductive method [google.com].

All other methods are not facts as you call them, but speculation.

shafaki

1:21 am on Aug 4, 2005 (gmt 0)

10+ Year Member



There is a difference between trying to understand natural laws, such as those of physics, and understanding man-made machines, which are composed of elements put together (such as bits of code in the case of Google). In the former case you keep on devising new models and you never reach a stable one; you just use the models to try to understand, predict and deal with the natural phenomena. You will never be able to model them perfectly (nor to simulate them completely), simply because they are not 'made' by putting elements together; we break them down into such elements so that our analytical minds can comprehend them. As for man-made stuff, such methods are not used. It seems you have given the wrong example by being completely blind (sorry, my linguistic ability is not providing me with a lighter word right now, though I tried hard to find one) to the difference between the two: naturally occurring phenomena and man-made stuff.

Visi

1:53 am on Aug 4, 2005 (gmt 0)

10+ Year Member



So let me comprehend this discussion. The unnatural development of man-made algos is behind the fact that I cannot get listed on Google?

And all this time I thought that my creative mind....some good content and a few links would resolve the problem:):)

Enough of the "intermission"....gentlemen continue your discussion:)

shafaki

2:16 am on Aug 4, 2005 (gmt 0)

TheGuyAboveYou

2:20 am on Aug 4, 2005 (gmt 0)

10+ Year Member



Very good post!

I will look at this link and be back in a couple weeks :).

shafaki

2:25 am on Aug 4, 2005 (gmt 0)

Kendal

4:03 am on Aug 4, 2005 (gmt 0)

10+ Year Member



http://www-db.stanford.edu/~backrub/google.html

Nowhere in this paper does it state that fixed values are not used, nor does it talk about using AI, nor does it mention artificial learning.

As for your other reference, which is a collection of articles by Google employees, in which specific article does it state that there are NO fixed values used in Google's current algo, as you assert?

TheGuyAboveYou

4:32 am on Aug 4, 2005 (gmt 0)

10+ Year Member



The answer, most likely, is that the algorithm uses fixed methods along with some type of learning or periodic calibration of constants.
I have seen papers in the past on neural networks applied to ranking documents. There are many papers published but how many of these ideas from papers actually get implemented and to what degree is not known. Unless you work on the code. If you post here you probably don't work on the code. I would imagine :).

nancyb

5:05 am on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey guys, take it easy. shafaki didn't say anything about google's algorithm, s/he only offered some insight into what search engine algorithms do and don't do in general.

Take it as a brief (I'm sure that is an understatement of some monumental proportions) explanation of artificial neural networks. His intent was to offer an explanation why asking for specific numerical keyword density figures was fruitless, not to suggest they were non-existent.

TheGuyAboveYou

5:11 am on Aug 4, 2005 (gmt 0)

10+ Year Member



I agree. I thought it was a good observation. Around here your posts get a lot of scrutiny which is good.

moneyraker

5:59 am on Aug 4, 2005 (gmt 0)

10+ Year Member



[shafaki didn't say anything about google's algorithm]

I think Shafaki did, as early as in the first sentence of his first post.

I personally do appreciate Shafaki sharing his deep thoughts with us, but I think the others are simply reacting to the way he writes these ideas. He is stating everything as 'facts' without establishing where he obtained those facts. If only he inserted a phrase such as 'In my opinion,' or 'Based on what I read in' or 'I infer that...', the others would be less reactive.

lammert

6:30 am on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but I think the others are simply reacting to the way he writes these ideas.

Agreed. In my opinion, the factors he mentioned can be a substantial part of Google's algorithm. But as a programmer in the field he is talking about, I know that there are several ways to reach the goal, and it is not easy to know what is inside a black box by observing it from the outside.

Look at it like the discussion of the Google patent [webmasterworld.com] some months ago. We know the factors mentioned there are valuable, yet we do not know which of these factors are implemented, or if they are implemented at all.

So my reactions are more about the language he used than a denial of the information.

shafaki

6:43 am on Aug 4, 2005 (gmt 0)

10+ Year Member



No, they don't use fuzzy logic in Google's algos, and I've never mentioned that. It was a comment on my post that referred to fuzzy logic, then another one after it refuting it, then another making fun of it, but as for me, I never mentioned it in my previous posts.

Fuzzy logic is the foundation of fuzzy algebra and creates a whole new way in which calculations are done and solutions for 'unconventional' problems are reached. In traditional logic (Boolean logic, zero and one), all there is is True and False, with nothing in between. In fuzzy logic this dichotomy is broken, and there is instead a spectrum of values between zero and one, such as 0.5, 0.9, 0.3, etc. This has created a whole new algebra built on it, and hence new methods for solving problems too.

Just mentioning that as a disclaimer.
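
For anyone who wants to see the "spectrum between zero and one" in action, here is a minimal Python sketch using the common textbook (Zadeh) operators: min for AND, max for OR, and 1 - x for NOT. The `warm` membership function and its 10 C / 25 C breakpoints are hypothetical values picked only for the example, and none of this is meant to describe anything inside Google.

```python
def fuzzy_and(a, b):
    """Zadeh AND: the weaker of the two truth degrees."""
    return min(a, b)


def fuzzy_or(a, b):
    """Zadeh OR: the stronger of the two truth degrees."""
    return max(a, b)


def fuzzy_not(a):
    """Zadeh NOT: the complement of the truth degree."""
    return 1.0 - a


def warm(temp_c):
    """Hypothetical membership function for 'warm'.

    0.0 at or below 10 C, 1.0 at or above 25 C, linear in between.
    """
    return min(1.0, max(0.0, (temp_c - 10.0) / 15.0))


# With only 0.0 and 1.0 as inputs the operators behave like Boolean logic.
print(fuzzy_and(1.0, 0.0), fuzzy_or(1.0, 0.0), fuzzy_not(1.0))

# With in-between inputs they return in-between degrees of truth.
today, tomorrow = warm(17.5), warm(22.0)
print("warm today AND tomorrow:", fuzzy_and(today, tomorrow))
print("warm today OR tomorrow: ", fuzzy_or(today, tomorrow))
```

The first print shows that ordinary True/False logic is just the special case where every value is exactly 0 or 1.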

John Carpenter

11:03 am on Aug 4, 2005 (gmt 0)

10+ Year Member



Hey guys, take it easy. shafaki didn't say anything about google's algorithm, s/he only offered some insight into what search engine algorithms do and don't do in general.

The problem is that he stated he knows how Google's algorithms work without giving any proper references (note the topic of this thread -- Google Algos for Dummies). Another problem is that even if there were some publicly accessible documents containing such technical details, nobody except _some_ Google employees can tell you what they actually use and whether they have some secret algorithms (I believe they do -- otherwise spammers would bring their SE to its knees). That's the problem I had with his post.