Forum Moderators: Robert Charlton & goodroi
We've known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!
We knew the web was big... [googleblog.blogspot.com]
Indeed, in that context it's not all that meaningful a statement - there are 1 trillion URLs that Google is aware of and that made it past initial filtering, most of which they admit won't even be spidered.
More interesting is the comment that they use these trillion URLs to determine PageRank, and claim to recalculate PR more than once a day:
Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections...
...
Let's assume these numbers are accurate (which they're not).
That'd mean that Google indexed only a single URL out of every 381 discovered.
Which would seem a surprisingly ideal number to me...
even though my little experiment was flawed to its core
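The arithmetic behind that 381:1 figure can be sketched quickly. The index-size estimate below is an assumption: roughly what the result-estimate test queries mentioned later in the thread returned at the time, not an official number.

```python
# Back-of-the-envelope ratio of URLs discovered to URLs indexed.
# The indexed_estimate is an assumption (a rough result-estimate from
# test queries), not a figure Google has published.
discovered = 1_000_000_000_000     # the 1 trillion URLs Google says it knows about
indexed_estimate = 2_625_000_000   # assumed index-size estimate from test queries

ratio = discovered / indexed_estimate
print(f"about 1 indexed URL per {ratio:.0f} discovered")  # ≈ 381
```

Any of the later estimates (~25.5 billion) would shrink the ratio by a factor of ten, which is the "380 to 1 or 38 to 1" spread discussed further down.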
Now - on the other hand - CUIL *hehe*
"has ~121,617,892,992 web pages indexed" (quoted from the site)
Mind you, searching for 'a' (the letter) shows an estimate of only ~1,899,710,588 web pages.
The Internet has grown exponentially in the last fifteen years but search engines have not kept up - until now. Cuil searches more pages on the Web than anyone else - three times as many as Google and ten times as many as Microsoft.
Actually it's not THAT hard to beat Google's index size.
All you've got to do is scrape all day and night.
What's hard to do... is decide what to show, when and in what order.
Then to keep it fresh.
And post on your blog 3 days before the competition launches.
...Have you ever let XENU check your site with a calendar / looped dynamic pages... and forget to filter out the URLs for them? ... I think I'm sitting on at least 10,000,000 'possible' URLs.
Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections...
So they're traversing and computing a trillion intersections/nodes on a webmap to recalc PR on a continual basis (of course, why else?), and the computations include all of those, whether or not they're included in their index?
Is there something in this message that we're not grasping?
Right now I am assuming that the blog article "misspoke" but I will certainly file this idea with my "vigilance department"!
Google re-process[es] the entire web-link graph several times per day.
ENTIRE. Nothing deceptive about that, it's about as clear as plate glass.
[webmasterworld.com...]
It seems they have answered some of the concerns.
We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers.
It would have been great if they'd told us how they plan to keep pace with the notoriously fast-growing web in the years to come.
I wonder how much less of a number that would be if they stopped following forms.
Not much IMHO.
Spam sites (including many millions of parked-domain junk pages) are responsible for a big part of that 1 trillion figure. Sites that use non-obvious session IDs are bad offenders too - it's amazing that nobody thought of this when developing the HTTP/URL specs - session IDs should never have become part of URLs :(
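One mitigation a crawler can apply is canonicalizing URLs before counting them. A minimal sketch, assuming session IDs show up as query parameters with conventional names (the parameter list below is an assumption, not a standard):

```python
# Hypothetical URL canonicalizer: strip session-ID-style query parameters
# so the same page isn't counted as many distinct URLs. The parameter
# names are assumptions based on common conventions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    # Keep only query parameters that don't look like session IDs.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonicalize("http://example.com/page?id=7&PHPSESSID=abc123"))
# -> http://example.com/page?id=7
```

Of course, the whole complaint is that real session IDs are often *non-obvious*, so name-based filtering like this only catches the easy cases.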
I can get a result estimate of ~25,500,000,000 for my 'try to get as many results as possible' testing queries, so the number does seem very low.
that's why I said 'today'.
and mentioned that it's not accurate.
I guess I'm using the same method as you.
for TODAY ( 29th ) I got a number of 25,600,000,000.
my *other* method shows 25,430,000,000.
Both queries showed 10% of this figure when I made the post.
Hence me warning against the numbers.
pretty reliable estimates I guess
*pfft*
Either way, the point was to show the huge difference between what Google finds / knows about and what it indexes. 380 to 1 or 38 to 1 would both be amazing in terms of the necessary computation and processing power. Imagine the volumes of unwanted pages banging the gates.
...
I think Google isn't saying that the whole process is finished every day, but that there are several iterations every day. And I suspect that they don't run the process until it converges; they just keep running it, updating the connectivity graph on the fly. (Thus it never does converge, it just keeps chasing reality. But it chases reality as fast as it can.)
And the page rank you'd see if you could see page rank, isn't the converged number, it's just the results of the most recent iteration. Don't like it? just wait around a few hours and peek at the next iteration.
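That "never converges, just keeps chasing reality" idea maps onto the classic power-iteration formulation of PageRank. A minimal sketch on a toy three-page graph, assuming the published algorithm (Google's actual pipeline is not public); each call to `iterate` is one pass over the link graph, and whatever pass finished last is the rank you'd see:

```python
# Minimal PageRank power-iteration sketch on a toy link graph.
# This is the textbook algorithm, not Google's actual implementation.
DAMPING = 0.85

def iterate(ranks, outlinks):
    """One PageRank pass: redistribute each page's rank to its outlinks."""
    n = len(ranks)
    new = {u: (1 - DAMPING) / n for u in ranks}   # teleport share
    for u, targets in outlinks.items():
        if targets:
            share = DAMPING * ranks[u] / len(targets)
            for v in targets:
                new[v] += share
    return new

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {u: 1 / 3 for u in graph}
for _ in range(5):        # a few passes, deliberately NOT run to convergence
    ranks = iterate(ranks, graph)
print(ranks)
```

After five passes the ranks are close to, but not exactly, the converged values - which is the point: each iteration is a usable snapshot.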
updating the connectivity graph on the fly
It's hard to do it on the fly, and it's not necessary - if you use lots of servers you can iterate pretty quickly (say, one hour per pass) and use the updated index for the next run.
In theory the PR calculation should only use nodes that connect to other nodes - of the 1 trillion URLs, at least 80% were never visited, and those URLs are excluded from the PR calculation because they can't pass PR out (they have not yet been crawled).
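The pruning step described above can be sketched as follows, under the assumption that uncrawled URLs have no known outlinks and so can be dropped both as nodes and as link targets before running PageRank (the function name is hypothetical):

```python
# Sketch of pruning uncrawled URLs from the link graph before a PR run.
# Assumption: an uncrawled URL has no known outlinks, so it only soaks
# up rank without passing any out, and can be removed.
def prune_uncrawled(outlinks, crawled):
    """Keep only crawled URLs as graph nodes and as link targets."""
    return {u: [v for v in targets if v in crawled]
            for u, targets in outlinks.items() if u in crawled}

graph = {"a": ["b", "x"],   # "x" was discovered but never fetched
         "b": ["a"]}
print(prune_uncrawled(graph, crawled={"a", "b"}))
# -> {'a': ['b'], 'b': ['a']}
```

Dropping ~80% of a trillion nodes this way would shrink the per-iteration graph to something far more tractable.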
I've been assuming that the published iterative method is not what's in play today, although Google still runs it from time to time - as a checkup, to fix any long-term divergence between the two similar (but not identical) types of math.