Forum Moderators: Robert Charlton & goodroi
For Sat 7th June
Google.com 25,630,000,000
Google.co.uk (The Web) 25,270,000,000
Google.co.uk (UK Pages) 68,900,000
Google.ca (The Web) 25,360,000,000
Google.ca (CA Pages) 45,200,000
It is completely IP specific too. I had to log in through a proxy server for each TLD to get different results. If I logged in to .com from a UK IP address I'd get the exact same total figures as .co.uk even though the results shown for any given search were completely different. It is my guess that these figures do NOT include supplemental pages and that every page outside these boxes is considered a supplemental by the search programming. These are, in my opinion, the figures relating to the bulk of pages returned for main searches.
What use is this information? Well, for me it shows maturity on Google's part. I remember the days when the big guns would display the number of indexed pages on the front page and it was big news when each broke a new barrier. This maturity can only increase the quality of their search results and these small boxes of data may really help because, as many have pointed out in other threads, Google's spam removal methods tend to push spam to the top for some period before it disappears into oblivion. Smaller data sets to work from will make that easier. It also goes a long way to explaining the huge surge in threads about "two months and Google still hasn't indexed me?" and "how long before Google updates the information they have on my pages?".
I don't want to post the source of the data as I'm worried Google will block access before I have a chance to collect enough info. I will post it eventually though.
Last August Google filed a patent application for selectively searching partitions of a database [webmasterworld.com]. This was right around the time that the Supplemental Results tags were removed and there was talk from the Google staff about the Supplemental Index evolving into some kind of different critter.
One key fact I notice in that patent title is the word "partitions" - that's plural as in more than one, not just the Supplemental Index.
I can't grasp why you think there are small pools of quality pages that Google is selecting from. I see the stats you quote but I miss the reasons for the conclusions. Sorry to be a bit dumb but can you be more specific? I ask because if you are correct then it's important for us all. Thanks.
Nomis5
I can't grasp why you think there are small pools of quality pages that Google is selecting from.
It's worth getting a handle on the supplemental index [google.co.uk] to understand the idea internetheaven mentions.
My take is that Google desires to be both relevant and comprehensive. It could be that there's an element of mutual exclusivity there. Indeed, most searches are just looking for a quick result that's relevant - not to be able to review all of the available information.
So, it makes sense for Google to restrict the data it searches through in order to satisfy the majority of searches, without wading through billions of URLs that it doesn't consider to be especially high quality.
So, it makes sense for Google to restrict the data it searches through in order to satisfy the majority of searches, without wading through billions of URLs that it doesn't consider to be especially high quality.
Aye, that's my line of thinking exactly. Sorry if that didn't seem clear to begin with?
It seems as though each datacenter holds between 20 and 25 billion pages. Having such common numbers would suggest that this is the optimal point between enough data to mine and relevancy.
As requested, some datacenter checks:
Datacenter 66.249.93.104 - 25,270,000,000
Datacenter 64.233.179.104 - 20,090,000,000
Datacenter 216.239.51.104 - 20,090,000,000
Datacenter 66.102.9.99 - 25,360,000,000
Datacenter 66.102.9.147 - 25,350,000,000
Datacenter 66.102.9.104 - 25,360,000,000
Datacenter 64.233.161.83 - 20,090,000,000
Datacenter 64.233.183.103 - 23,790,000,000
Datacenter 64.233.189.104 - 25,360,000,000
The searcher's IP address does not seem to affect direct datacenter searches; it only matters when you access a Google.tld domain.
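For anyone who wants to see how the datacenter totals above cluster, here is a rough Python sketch. It only regroups the figures already listed in this post; the dictionary and the `group_by_total` helper are names made up for illustration, and nothing here queries Google.

```python
from collections import defaultdict

# Totals copied by hand from the datacenter checks above (IP -> reported total).
DATACENTER_TOTALS = {
    "66.249.93.104": 25_270_000_000,
    "64.233.179.104": 20_090_000_000,
    "216.239.51.104": 20_090_000_000,
    "66.102.9.99": 25_360_000_000,
    "66.102.9.147": 25_350_000_000,
    "66.102.9.104": 25_360_000_000,
    "64.233.161.83": 20_090_000_000,
    "64.233.183.103": 23_790_000_000,
    "64.233.189.104": 25_360_000_000,
}


def group_by_total(totals):
    """Map each distinct reported total to the list of datacenter IPs showing it."""
    groups = defaultdict(list)
    for ip, count in totals.items():
        groups[count].append(ip)
    return dict(groups)


groups = group_by_total(DATACENTER_TOTALS)

# Print the clusters from smallest total to largest.
for count in sorted(groups):
    print(f"{count:,}: {', '.join(groups[count])}")
```

Grouped this way, the nine IPs fall into five distinct totals, with three datacenters sharing 20,090,000,000 and three sharing 25,360,000,000.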
There's also a new Google datacentre list [webmasterworld.com] that may help.
Today's was a drop from 25,350,000,000 at 19:59 to 19,300,000,000 at 20:00
Did some search terms too and the drop was huge. For a competitive search term the results dropped from:
710,000 results at 19:59 to just 364,000 at 20:00
I think there was a thread about how Google was returning different results for morning/night users. Looks like night users in the UK get far fewer results to choose from.
I only checked Google.co.uk for the size and for the competitive term time changes. I'll work on Google.com from a US IP address next.
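To put the 19:59 to 20:00 drops quoted above in proportion, a quick Python sketch of the arithmetic follows. The `percent_drop` helper is just an illustrative name, and the inputs are the figures from this post, nothing more.

```python
def percent_drop(before, after):
    """Return the fall from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# Figures quoted above for Google.co.uk at 19:59 and 20:00.
index_drop = percent_drop(25_350_000_000, 19_300_000_000)
term_drop = percent_drop(710_000, 364_000)

print(f"Total index figure: {index_drop:.1f}% drop")
print(f"Competitive search term: {term_drop:.1f}% drop")
```

On those numbers the total figure fell by about 24% while the competitive term fell by about 49%, so the term lost roughly twice the share that the overall total did.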
710,000 results at 19:59 to just 364,000 at 20:00
I wonder what it would be at about 8am, 3pm, 8pm?
If "more refined" at that point in time, is it the same for the USA as for the UK?
What about USA EST vs USA PST?
So many variables!
ps. internetheaven -- the thoroughness of your methodical research is impressive ... thanks for bringing it here.