| 9:42 pm on Feb 23, 2010 (gmt 0)|
I read the whole article, although most of what it says is already well-known. I only saw one mention of Caffeine. It said:
"The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals."
| 9:49 pm on Feb 23, 2010 (gmt 0)|
|"The most recent major change, codenamed Caffeine, revamped the entire indexing system to make it even easier for engineers to add signals." |
Does this mean that Caffeine is now live:)?
| 10:04 pm on Feb 23, 2010 (gmt 0)|
If it's live and has been for a while, Google's got bigger problems than I realized.
| 10:12 pm on Feb 23, 2010 (gmt 0)|
The statement in the article appears to imply that Caffeine is live, although it's odd that the author didn't say more about it. I'm wondering if he correctly understood what he was told.
| 10:40 pm on Feb 23, 2010 (gmt 0)|
I told you guys a while ago, I think Caffeine has been live for a very, very long time.
| 11:38 pm on Feb 23, 2010 (gmt 0)|
I appreciated the writer's ability to explain Google's semantic advances in a way the average person can appreciate. Anyone who has wrestled with site search for, say, a million or more pages has got to be a bit awestruck at the huge job Google has taken on -- and why their successes in this area help them dominate the current market.
|Google's synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein's theories about how words are defined by context. |
|Sometime in 2001, Singhal learned of poor results when people typed the name "audrey fino" into the search box. Google kept returning Italian sites praising Audrey Hepburn... "We realized that this is actually a person's name," Singhal says. "But we didn't have the smarts in the system." |
|...he had to master the black art of "bi-gram breakage" — that is, separating multiple words into discrete units. For instance, "new york" represents two words that go together (a bi-gram). But so would the three words in "new york times," which clearly indicate a different kind of search. And everything changes when the query is "new york times square." |
That talk about bi-grams (n-grams in general) reminds me - is there anyone here who has played with Google's publicly released 1-trillion-word n-gram data set [googleresearch.blogspot.com], often called the "1T corpus"?
That would take some serious computing power to deal with, but I'd love to have a go at it.
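To make the "bi-gram breakage" idea concrete, here is a minimal sketch of how phrase counts could be used to split a query into units. The counts in TOY_COUNTS are invented for illustration (they are not taken from the real data set), the scoring rule is just one simple heuristic I picked, and the names score and segment_query are hypothetical. Nothing here is meant to describe how Google actually does it.

```python
from functools import lru_cache

# Toy phrase counts, invented for illustration only.
# They are NOT real values from Google's n-gram data set.
TOY_COUNTS = {
    ("new", "york"): 350,
    ("york", "times"): 5,
    ("times", "square"): 1200,
    ("new", "york", "times"): 200,
}


def score(segment):
    """Score one candidate segment (assumed heuristic, not Google's).

    A single word gets a small baseline of 1. A multi-word phrase gets
    count * n**n, so longer phrases that were actually seen are favored,
    and unseen phrases score 0.
    """
    if len(segment) == 1:
        return 1
    n = len(segment)
    return TOY_COUNTS.get(segment, 0) * n ** n


def segment_query(query, max_len=3):
    """Split a query into the highest-scoring sequence of segments."""
    words = tuple(query.lower().split())

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (total score, tuple of segments) for words[start:].
        if start == len(words):
            return 0, ()
        candidates = []
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            seg = words[start:end]
            rest_score, rest_segs = best(end)
            candidates.append((score(seg) + rest_score, (seg,) + rest_segs))
        return max(candidates)

    return [" ".join(seg) for seg in best(0)[1]]


for q in ("new york", "new york times", "new york times square"):
    print(q, "->", segment_query(q))
```

With these made-up counts, "new york times" stays together as one unit, but adding "square" flips the split to "new york" plus "times square", which is the kind of change the article describes. Doing this against the real corpus would mostly be a data-handling problem: the lookup table would no longer fit in a small in-memory dict.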
[edited by: tedster at 11:58 pm (utc) on Feb 23, 2010]
| 11:48 pm on Feb 23, 2010 (gmt 0)|
When Caffeine was originally announced, I understood that it would lead to a big improvement in Google's ability to collect and handle data. The announcement also mentioned deeper and more extensive crawling. All of this led me to expect that the number of indexed pages would be greatly expanded after Caffeine was implemented. But so far I haven't seen any signs of this. So now I don't know what to expect.
| 12:40 am on Feb 24, 2010 (gmt 0)|
Really? Why would Google start wanting to index all these pages - didn't they a few years ago toss all the second class pages into the supplemental index? Now they're changing their mind on that?
| 12:59 am on Feb 24, 2010 (gmt 0)|
Well Wheel, I based it on their statements about deeper and more extensive crawling, and the projected improvement in their data handling capacity. Also, in the past they've talked about being able to catalog all the world's information.
| 1:09 am on Feb 24, 2010 (gmt 0)|
|All of this led me to expect that the number of indexed pages would be greatly expanded after Caffeine was implemented. |
By index (technically the SERPs are the index) do you mean show them to people, or by index, do you mean spider them and use them as part of their indexing calculations?
I ask because I think there is a common misunderstanding of their use of the word index, which technically refers to the results not the larger underlying dataset used for computational purposes. Personally, I would think they would want more data to compute from with less but more accurate pages returned in the index (SERPs).
| 2:09 am on Feb 24, 2010 (gmt 0)|
Thanks Mad Scientist. I think your explanation is probably right -- Google wants to collect and use more data, yet at the same time could still restrict the number of pages they include in the SERPs. That would explain why they felt a need to improve their data handling capacity. Another reason would be the expected continuing growth of the internet. So I probably jumped to the wrong conclusion when I thought that the index would be expanded.
| 3:24 pm on Feb 24, 2010 (gmt 0)|
|Personally, I would think they would want more data to compute from with less but more accurate pages returned in the index (SERPs). |
This is what I have always understood, and it is also why it is more important than ever to get one's on-page data/facts/text correct. Generalisations will most probably start to fall in the SERPs, while accurate pages rise to the top and, most probably, become even harder to dislodge than ever.
| 4:15 pm on Feb 24, 2010 (gmt 0)|
The sentence about Caffeine in the Wired article says that it "revamped the entire indexing system."
So maybe Caffeine has led to a change in the indexing process, although it isn't clear what "revamped" means.
| 6:48 pm on Feb 25, 2010 (gmt 0)|
According to Search Engine Land, Google admitted that Caffeine is NOT live yet, except on one DC.
| 11:39 pm on Feb 26, 2010 (gmt 0)|
I see different rankings for some of our websites when we search on a purported Caffeine IP. The results are better for us compared to searching from within our country.
| 11:44 pm on Feb 26, 2010 (gmt 0)|
Here is a new twist:
Google on Thursday mounted a renewed defence of the way it ranks search results, as fresh questions emerged about its practice of sometimes manually intervening to override its automated ranking system.
| 12:03 am on Feb 27, 2010 (gmt 0)|
That FT article seems to be stretching things a bit, trying to create a negative impression. Google counsel said "We don't whitelist or blacklist anyone," talking about organic search. But then the article jumps to an Adwords discussion. I don't think there's ever been any doubt that Adwords does use blacklisting.
| 1:21 am on Feb 27, 2010 (gmt 0)|
"We don't whitelist or blacklist anyone"
What about beefing up adsense partner/content farm sites when they can so they do perform well algorithmically? For example, dmoz listings. Dmoz is littered with plenty of main & single page listings for a certain demanding "media" company and its offshoot sites. Then check out the other big adsense partner/content farms (including article promotion sites). How is it they get so many listings (to both main and single pages), yet so many of their serps competitors are denied a single one to their home page?
| 1:39 pm on Feb 27, 2010 (gmt 0)|
Here is another quote from the article:
"A Google spokesperson refused to comment on the Foundem claims, but acknowledged Google sometimes manually alters rankings in its search engine to counter distortions that might arise from algorithms. The spokesperson did not disclose how often such changes were made, but said they were rare."
Does anyone know of any examples of "distortions that might arise from algorithms"?
I'm just curious about how this could happen. Could it refer to possible flaws in the algorithms? Or what about undeserved high rankings achieved with auto-linkbuilding methods?
| 8:15 pm on Feb 27, 2010 (gmt 0)|
My take is that the stored data that creates the final SERP is sharded into many many bits and pieces. The final SERP is built by a kind of layering process. It sequentially combines lists of various URLs, segmented by some metric or other (trust, semantics, PR, historical factors, etc).
An example of this process that we discussed here was whitenight's ghost data-set [google.com], but there are others.
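Purely to illustrate the layering idea, here is a toy sketch of combining several per-signal URL lists into one final list. Everything in it is an assumption made for the example: the signal names, the scores, the weights, and the weighted-sum rule in the hypothetical build_serp function. It is a picture of my guess at the general shape, not a description of Google's actual system.

```python
from collections import defaultdict


def build_serp(layers, weights, top_n=3):
    """Combine per-signal URL lists, one layer at a time, into a ranking.

    layers:  {signal name: [(url, score), ...]}  (hypothetical data)
    weights: {signal name: weight}               (hypothetical weights)
    """
    combined = defaultdict(float)
    for name, scored_urls in layers.items():
        weight = weights.get(name, 1.0)
        for url, score in scored_urls:
            # Each layer adds its weighted contribution to the running total.
            combined[url] += weight * score
    # Highest combined score first; ties broken alphabetically by URL.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:top_n]]


# Made-up layers, each covering only some of the URLs.
layers = {
    "trust": [("a.example", 0.9), ("b.example", 0.4)],
    "semantics": [("b.example", 0.8), ("c.example", 0.7)],
    "history": [("a.example", 0.2), ("c.example", 0.5)],
}
weights = {"trust": 2.0, "semantics": 1.0, "history": 0.5}

print(build_serp(layers, weights))
```

Even in a toy like this, a URL missing from one layer, or a layer arriving with stale data, changes the final order, which is why a rewrite of the underlying storage could have visible side effects.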
Caffeine involves a rewrite of the way that most basic page information is stored, as well as the way it all gets layered together to make a final SERP. To get more of a handle on this, consider this description written in 2008:
|Today it's estimated that [a single Google query] travels across 700-1000 machines, a figure that has nearly doubled since 2006 perhaps due in part to the introduction of Google Universal. |
The opportunities for a big disconnect between what Google intends to happen and what really happens in a brand new infrastructure with a rewritten file system would be extreme.
Here are a few more references:
1. Our short discussion The Google Search Query - a technical look [webmasterworld.com]
2. The new domain Google began using in Q4 of 2009 = 1e100.net [webmasterworld.com]