Forum Moderators: Robert Charlton & goodroi
Google happens in the space between browser and search engine and destination content server, as an enabler or middleman between the user and his or her online experience. source [oreilly.com]
I type a query from any part of the world and within the blink of an eye the results appear. What is the mystery behind this? I am not planning to build one more Google, but I am a bit curious to get some insight.
Regards,
getxb
[url=mms://videosrv6.cs.washington.edu/talks/Colloquia/JDean_041021_OnDemand_100_256K_320x240.wmv]Jeff Dean of Google - free online video of a behind the scenes look at Google[/url]
Found that link on seobook [seobook.com]
Also, they really only have to grab 10 pages at a time, or however many results you want displayed (up to 100). The number of pages found could be an estimate based on some other nifty algorithms.
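To illustrate the "only grab 10 at a time" point, here is a minimal sketch in Python. It assumes each candidate page already has a relevance score; the scores, URLs and the `top_results` name are made up for illustration and say nothing about how Google actually ranks. The idea is simply that picking the best k items is cheaper than fully sorting every match.

```python
import heapq

def top_results(scored_pages, k=10):
    """Return the k highest-scoring (score, url) pairs without sorting everything."""
    # heapq.nlargest keeps only k items in memory while scanning the input once.
    return heapq.nlargest(k, scored_pages)

# Hypothetical (score, url) pairs, for illustration only.
pages = [
    (0.12, "a.example"),
    (0.93, "b.example"),
    (0.55, "c.example"),
    (0.78, "d.example"),
]

best = top_results(pages, k=2)
print(best)
```

With millions of matches and k=10, this does far less work than ordering the whole list, which fits the idea that the total count shown can be an estimate.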
Dunno ... what I do know is that Google isn't as accurate as it used to be, but perhaps that's to be expected as the web gets larger and larger.
We basically don't know exactly how they do it, and search engines vary in their approach to information retrieval. The only thing that's solid (though dated) is the famous "Anatomy" article: [infolab.stanford.edu...]
This article is very good, but it is also very basic. Scaling a search engine way past 10 bln pages requires a hell of a lot of additional things that they did not cover back then.
Definitely true... but I don't think they've formally revealed exactly what those additional things are, so unfortunately we're somewhat left guessing.
There are some good notes on scalability in that article too though.
That article was written by someone who really knows - Anna Patterson. Patterson is affiliated with Stanford, and has worked at archive.org, then at Google (see the phrase-related patents), and now she's with the would-be search engine contender Cuill [webmasterworld.com].
Every step she details in the article is part of the chain that needs to be optimized to give the end user a speedy result for their search.
Let me share some info as well. Here's what Google has to say ..
By collecting flocks of pigeons in dense clusters, Google is able to process search queries at speeds superior to traditional search engines, which typically rely on birds of prey, brooding hens or slow-moving waterfowl to do their relevance rankings.
When a search query is submitted to Google, it is routed to a data coop where monitors flash result pages at blazing speeds. When a relevant result is observed by one of the pigeons in the cluster, it strikes a rubber-coated steel bar with its beak, which assigns the page a PigeonRank value of one...
source [google.com]
Interesting, and it closely matches some of the detailed explanations cited so far. But isn't it amazing that even searches conducted inside my reputed org's network take more time than Google! Even editing a small html file in my notepad and previewing it in a browser takes more than one sec!
Sometimes it seems I am talking/chatting with Google. The lightning speed of Big G is still a mystery to me.
Regards,
getxb
[edited by: tedster at 6:17 pm (utc) on Nov. 3, 2007]
[edit reason] reduced length of the quote [/edit]
Even editing a small html file in my notepad and previewing it in a browser takes more than one sec!
Enter the cloud. :-)
caching
Organic SEs don't update their live search databases in anything approaching real time. So it's a fair bet that the results for an even vaguely popular search are already sitting in RAM, having been cached for a previous searcher in your language.
Or if not in RAM, in pre-compiled form distributed over an array of disks, which would be almost as fast.
In addition to that, there are a lot of smaller parts of the search process that can be cached and/or pre-computed and stored, even when the query in question hasn't been.
I of course don't know that Google is doing this, but I can say without much doubt that the results of any given search are cached at least temporarily so that they don't have to be re-computed as the user navigates around within them.
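A minimal sketch of that kind of result cache, in Python. Everything here is an assumption for illustration: the `ResultCache` class, the (query, language) cache key, and the `slow_search` stand-in for the real index lookup. It only shows the general technique (a small least-recently-used cache in RAM), not what any search engine actually runs.

```python
from collections import OrderedDict

class ResultCache:
    """Tiny LRU cache keyed by (normalized query, language)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, language, compute):
        key = (query.strip().lower(), language)
        if key in self._store:
            # Mark as recently used and serve straight from memory.
            self._store.move_to_end(key)
            self.hits += 1
            return self._store[key]
        self.misses += 1
        results = compute(query)
        self._store[key] = results
        if len(self._store) > self.capacity:
            # Evict the least recently used entry.
            self._store.popitem(last=False)
        return results

def slow_search(query):
    """Hypothetical stand-in for the expensive index lookup."""
    return [f"result for {query} #{i}" for i in range(3)]

cache = ResultCache(capacity=2)
first = cache.get("Blue Widgets", "en", slow_search)    # computed
second = cache.get("blue widgets ", "en", slow_search)  # served from cache
print(cache.hits, cache.misses)
```

The second searcher gets the list the first searcher paid for, which is the whole point for any vaguely popular query.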
Out of curiosity I just tried a G search for a long (definitely not cached) phrase and the results took .23 seconds. The same search at Y! search took .30 seconds. So really the difference isn't that large.
At least not always... the next search I tried was .20 compared to .86.
Considering the many parameters one can set, and also that a lot of the search-quality technology they added over the last three years seems to apply its changes after the system retrieves the data... ( right? )
Is it safe to say that what they store in their cache is more like a full list of candidate results, to which they can then apply whatever filters, regional and personal search preferences, etc. ... everything we know is done on the fly?
It would take a lot of resources to store different results for every classic, day-one Google parameter such as language, country or offensive material filter settings. Most likely they store a raw list with all of these cached together...
...and so, the actual result list cached for even the most popular searches could be ( much ) more extensive. It could include data for every option, then let the real-time filters pick what's needed, or perhaps just drop whatever is not needed, instead of retrieving the same data with but a single parameter changed and caching it separately. It's interesting to think that whenever you do a search, you're *almost* served some data, with only specific parameters keeping you from seeing it.
Like the &filter parameter. When added, it's not actually a different search, or is it? More like, if you don't add it, you see an excerpt from the full list.
Or was this something obvious to you all...?
Or is it completely wrong?
Like the &filter parameter. When added, it's not actually a different search, or is it? More like, if you don't add it, you see an excerpt from the full list.
I'm sure you're right. Normally you see a filtered version of the full list that the original query returned. Wouldn't surprise me at all if the original "raw" results remain in some kind of cache for a while so that they are more quickly retrievable for a modified result - especially the query for the next, deeper group of pages.
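Here is a rough Python sketch of that idea: keep one raw candidate list per query and apply the filters only when a page of results is rendered. The `Candidate` fields and the `language`, `safe` and `dedupe` switches are hypothetical, loosely modeled on the settings mentioned above; this is speculation made concrete, not a description of Google's cache.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    url: str
    language: str
    adult: bool
    near_duplicate: bool

def render_page(raw, language=None, safe=True, dedupe=True, start=0, count=10):
    """Filter the cached raw list on the fly, then slice out one page."""
    visible = [
        c for c in raw
        if (language is None or c.language == language)
        and not (safe and c.adult)
        and not (dedupe and c.near_duplicate)
    ]
    return [c.url for c in visible[start:start + count]]

# Hypothetical raw list, cached once for the query.
RAW = [
    Candidate("a.example", "en", False, False),
    Candidate("a.example/print", "en", False, True),
    Candidate("b.example", "de", False, False),
    Candidate("c.example", "en", True, False),
]

default_view = render_page(RAW, language="en")
unfiltered_view = render_page(RAW, language="en", dedupe=False)
print(default_view, unfiltered_view)
```

Both views come from the same stored list, so switching a parameter or asking for the next page only costs a re-filter and a slice rather than a fresh search.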
Try this experiment, if you haven't already. Do a Google search in Firefox with the ShowIP extension. Then query that IP address directly rather than using google.com. Many times, these results do not match. Some processes happen AFTER the basic data is returned.
[blog.searchenginewatch.com...]
where the headline suggests that SHARDS are responsible for the data delivered and that this delivery is optimized by topic.
Still, I understand the amazement even more since I started my own cluster, now 15 nodes, to run some file hosting and shopping from it.
The bigger the stuff gets, the more important your basic architecture becomes. One little error in storage assignment or basic data flow makes trouble months later, when you grow.
Considering the speed of the SERPs, it is very interesting that they have had so few problems delivering results (even if the results go up and down in quality).
The caching is a good idea, but that must be some well designed cache!
P!
Within an hour or two Google was returning accurate results on the news event in its blog and news searches.
Within 6 hours it was returning accurate results on the news event in the main index.
This gentleman is really good at something we did together back in high school (in the '80s): we rewrote code after it was compiled to make it even faster. We competed with each other all through junior and senior year (this was on the 6502 and the 8086).
I know he works for Google since he asked me what I thought of working for them. I told him straight out, "Work for them, but I guess I'll never know what you did for them."
That has turned out to be true, but my guess is he's rewriting compiled code or building special compilers and optimisers for them.
When we used to compete against each other, we had it down to cycles, so I can only imagine what sort of heaven he is in right now (heck, I would be in heaven too with all those toys at Google).
One thing I am sure of: he's helping them, and I like to think I get to see his work every day in faster results.
The next layer builds off that foundation and shrinks down the index by analyzing the data and applying ranking methods to pull up the highest-ranking material.
The next layer does the same and the next layer does the same and so on and so forth.
Soon you know what your data set is. Of the 10 billion pages you have indexed, you may only search the top 500 million, because at some point they simply drop off. By dropping off I mean you only show the first 1000 result pages, and as you scale through the result pages you simply hand the query off further towards the bottom of your pyramid (which would result in longer query times, or simply the end of results with any relationship to the query).
The same logic is used to keep the "freshness" current. Sites that have a high share (some x% threshold) of high-ranking results get spidered more frequently, and then they simply re-balance the new pages by applying PageRank and data derived from the existing aged index until the next major refresh happens (the Google dance).
The top x percent is distributed across geographical regions for best performance, and they keep the main "corpus" at fewer, larger data centers.
The index itself is scalar so you may also have indexes that index the index so that your top x searches happen quickly and depending on if any semantics are incorporated they could cull out stop words, foreign languages, spam, duplicates and apply logic to do synonyms, spell checks and whatever other magic they wish to incorporate.
Say on servers 1-100 you have the top 1%, on servers 101-250 you have the top 2 through 9%, and on servers 251-1000 you have the top 10-25%, and the rest of it goes off to la-la land because it's almost irrelevant.
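The pyramid idea above can be sketched in a few lines of Python. This is only my reading of the description: the tiers, terms and URLs are invented, and `tiered_search` is a hypothetical helper. The top tier is asked first, and deeper (slower) tiers are touched only when it cannot fill the requested number of results.

```python
def tiered_search(term, tiers, wanted=10):
    """Query tiers from the top down, stopping once enough results are found."""
    hits = []
    tiers_touched = 0
    for tier in tiers:
        tiers_touched += 1
        hits.extend(tier.get(term, []))
        if len(hits) >= wanted:
            break
    return hits[:wanted], tiers_touched

# Hypothetical tiers: index 0 is the small, fast "top 1%" tier.
TIERS = [
    {"widget": ["top1.example", "top2.example"]},
    {"widget": ["mid1.example"], "gadget": ["mid2.example"]},
    {"widget": ["deep1.example"], "gadget": ["deep2.example"]},
]

shallow, touched_shallow = tiered_search("widget", TIERS, wanted=2)
deep, touched_deep = tiered_search("widget", TIERS, wanted=4)
print(touched_shallow, touched_deep)
```

A first page of results is answered by the top tier alone, while paging deep into the results is what drags the lower tiers in, which would explain the longer query times further down.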