|Think I've found how small the indexes are for current searches|
I've read a few theories in threads that Google is using small boxes of "quality" pages to produce searches now instead of slicing through all data they have available for each and every search. Many have pointed out that the data used might change based on geo-location, time of day and whether the user is logged in to their G-Accounts. Thought I'd try and get Google to show me just how many web pages were actually available for searching at any given time. I'll update the thread each time the index changes drastically:
For Sat 7th June
Google.co.uk (The Web) 25,270,000,000
Google.co.uk (UK Pages) 68,900,000
Google.ca (The Web) 25,360,000,000
Google.ca (CA Pages) 45,200,000
It is completely IP specific too. I had to go through a proxy server for each TLD to get different results. If I accessed .com from a UK IP address I'd get the exact same total figures as .co.uk, even though the results shown for any given search were completely different. My guess is that these figures do NOT include supplemental pages, and that every page outside these boxes is treated as supplemental by the search programming. These are, in my opinion, the figures relating to the bulk of pages returned for main searches.
What use is this information? Well, for me it shows maturity on Google's part. I remember the days when the big guns would display the number of indexed pages on the front page and it was big news when each broke a new barrier. This maturity can only increase the quality of their search results, and these small boxes of data may really help because, as many have pointed out in other threads, Google's spam removal methods tend to push spam to the top for some period before it disappears into oblivion. Smaller data sets to work from will make that easier. It also goes a long way to explaining the huge surge in threads about "two months and Google still hasn't indexed me?" and "how long before Google updates the information they have on my pages?".
I don't want to post the source of the data as I'm worried Google will block access before I have a chance to collect enough info. I will post it eventually though.
I think you may be onto something here.
Last August Google filed a patent application for selectively searching partitions of a database [webmasterworld.com]. This was right around the time that the Supplemental Results tags were removed and there was talk from the Google staff about the Supplemental Index evolving into some kind of different critter.
One key fact I notice in that patent title is the word "partitions" - that's plural as in more than one, not just the Supplemental Index.
When you search at, say, google.com, make a note of the IP address that those results come from (the ShowIP extension for Mozilla is the ONLY reliable way to do that), because some of the different IPs have different datasets and/or algorithms. You might also notice some patterns as to which IP you get depending on time of day and/or day of week.
I can't grasp why you think there are small pools of quality pages that Google is selecting from. I see the stats you quote but I miss the reasons for the conclusions. Sorry to be a bit dumb but can you be more specific? I ask because if you are correct then it's important for us all. Thanks.
I see very similar numbers for my "show me everything" searches - currently running at 25,350,000,000 (85,600,000 UK only).
|I see the stats you quote but I miss the reasons for the conclusions. Sorry to be a bit dumb |
Make that 2 dummies. I don't understand this either. Can you help clarify per above?
|I can't grasp why you think there are small pools of quality pages that Google is selecting from. |
It's worth getting a handle on the supplemental index [google.co.uk] to understand the idea internetheaven mentions.
My take is that Google desires to be both relevant and comprehensive. It could be that there's an element of mutual exclusivity there. Indeed, most searches are just looking for a quick result that's relevant - not to be able to review all of the available information.
So, it makes sense for Google to restrict the data it searches through in order to satisfy the majority of searches, without wading through billions of URLs that it doesn't consider to be especially high quality.
|So, it makes sense for Google to restrict the data it searches through in order to satisfy the majority of searches, without wading through billions of URLs that it doesn't consider to be especially high quality. |
Aye, that's my line of thinking exactly. Sorry if that wasn't clear to begin with.
It seems as though each datacenter holds between 20 and 25 billion pages. Seeing such similar numbers across datacenters suggests that this is the optimal balance between having enough data to mine and keeping results relevant.
As requested, some datacenter checks:
Datacenter 220.127.116.11 - 25,270,000,000
Datacenter 18.104.22.168 - 20,090,000,000
Datacenter 22.214.171.124 - 20,090,000,000
Datacenter 126.96.36.199 - 25,360,000,000
Datacenter 188.8.131.52 - 25,350,000,000
Datacenter 184.108.40.206 - 25,360,000,000
Datacenter 220.127.116.11 - 20,090,000,000
Datacenter 18.104.22.168 - 23,790,000,000
Datacenter 22.214.171.124 - 25,360,000,000
The searcher's IP address does not seem to affect searches made directly against a datacenter IP; it only matters when you go through a Google.tld domain.
Be aware that everything on the same Class C Block should be identical.
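If anyone wants to test the Class C point for themselves, here is a minimal Python sketch that groups readings by their /24 block and flags any block whose IPs disagree. The IP addresses are placeholders from the reserved documentation ranges, not real Google datacenters, and the pairing of IPs to counts is made up for illustration; `class_c_block` and `group_by_block` are just names I picked.

```python
import ipaddress
from collections import defaultdict


def class_c_block(ip: str) -> str:
    """Return the /24 network (the old "Class C" block) that contains `ip`."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def group_by_block(readings: dict[str, int]) -> dict[str, set[int]]:
    """Map each /24 block to the set of distinct index sizes seen on it."""
    blocks: dict[str, set[int]] = defaultdict(set)
    for ip, count in readings.items():
        blocks[class_c_block(ip)].add(count)
    return dict(blocks)


# Hypothetical readings: placeholder IPs paired with totals of the kind
# reported in this thread. Swap in your own datacenter checks.
readings = {
    "192.0.2.10": 25_270_000_000,
    "192.0.2.99": 25_270_000_000,
    "198.51.100.7": 20_090_000_000,
    "198.51.100.200": 20_090_000_000,
    "203.0.113.5": 25_360_000_000,
}

blocks = group_by_block(readings)
for block, counts in blocks.items():
    # A block with more than one distinct total contradicts the
    # "same Class C, same data" assumption.
    status = "consistent" if len(counts) == 1 else "MISMATCH"
    print(block, sorted(counts), status)
```

Any block reported as MISMATCH would be worth re-checking a few minutes apart before reading too much into it, since the totals also move over time.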
There's also a new Google datacentre list [webmasterworld.com] that may help.
Okay, for three nights now the index has changed values at 8pm GMT.
Today's was a drop from 25,350,000,000 at 19:59 to 19,300,000,000 at 20:00
Did some search terms too and the drop was huge. For a competitive search term the results dropped from:
710,000 results at 19:59 to just 364,000 at 20:00
I think there was a thread about how Google was returning different results for morning/night users. Looks like night users in the UK get far fewer results to choose from.
I only checked Google.co.uk for the size and for the competitive term time changes. I'll work on Google.com from a US IP address next.
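To put the two 8pm drops on the same scale, here is a quick Python sketch that works out each one as a percentage. It uses only the four figures quoted above; `percent_drop` is just a helper name chosen for the example.

```python
def percent_drop(before: int, after: int) -> float:
    """Return the fall from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# Figures reported for Google.co.uk at 19:59 and 20:00 GMT.
index_drop = percent_drop(25_350_000_000, 19_300_000_000)
term_drop = percent_drop(710_000, 364_000)

print(f"Total index drop:      {index_drop:.1f}%")
print(f"Competitive term drop: {term_drop:.1f}%")
```

On these numbers the total falls by roughly 24% while the competitive term falls by roughly 49%, so the term lost about twice as much, proportionally, as the overall figure. That is one data point for one term, so it may not hold for other searches.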
How odd that the results drop at 8pm. Any speculation on the drop? Servers going offline for maintenance? End of shift for the Google employees returning our results in .01 seconds? Resources being spent calculating the day's crawl and crunching it into the index?
I see now: 25,340,000,000.
But how can you be sure these numbers are real and not estimates, like the result count shown for every search? Those can differ wildly every time you hit the search button or go deeper into the SERPs.
|710,000 results at 19:59 to just 364,000 at 20:00 |
Very interesting quirk. An anomaly? or does the top of the 8pm hour trigger a more refined search? (for your time zone)
I wonder what it would be at about 8am, 3pm, 8pm?
If "more refined" at that point in time, is it the same for the USA as for the UK?
What about USA EST vs USA PST?
So many variables!
ps. internetheaven -- the thoroughness of your methodical research is impressive ... thanks for bringing it here.