Forum Moderators: Robert Charlton & goodroi
We've known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we've seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days -- when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!
We knew the web was big... [googleblog.blogspot.com]
Indeed, in that context it's not all that meaningful a statement - there are 1 trillion URLs that Google is aware of and that made it past initial filtering, most of which they admit won't even be spidered.
More interesting is the comment that they use these trillion URLs to determine PageRank, and claim to recalculate PR more than once a day:
Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections...
...
Let's assume these numbers are accurate (which they're not).
That'd mean that Google indexed only a single URL out of every 381 discovered.
Which would seem a surprisingly ideal number to me...
even though my little experiment was flawed to its core
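The arithmetic behind that 381:1 figure can be sketched quickly. The index-size estimate below is an assumption: roughly what the result-estimate test queries mentioned later in the thread returned at the time, not an official number.

```python
# Back-of-the-envelope ratio of URLs discovered to URLs indexed.
# The indexed_estimate is an assumption (a rough result-estimate from
# test queries), not a figure Google has published.
discovered = 1_000_000_000_000     # the 1 trillion URLs Google says it knows about
indexed_estimate = 2_625_000_000   # assumed index-size estimate from test queries

ratio = discovered / indexed_estimate
print(f"about 1 indexed URL per {ratio:.0f} discovered")  # ≈ 381
```

Any of the later estimates (~25.5 billion) would shrink the ratio by a factor of ten, which is the "380 to 1 or 38 to 1" spread discussed further down.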
Now - on the other hand - CUIL *hehe*
"has ~121,617,892,992 web pages indexed" (quoted from the site)
Mind you, searching for 'a' (the letter) shows an estimate of only ~1,899,710,588 web pages.
The Internet has grown exponentially in the last fifteen years but search engines have not kept up - until now. Cuil searches more pages on the Web than anyone else - three times as many as Google and ten times as many as Microsoft.
Actually it's not THAT hard to beat Google's index size.
All you've got to do is scrape all day and night.
What's hard to do... is decide what to show, when and in what order.
Then to keep it fresh.
And post on your blog 3 days before the competition launches.
...Have you ever let XENU check your site with a calendar / looped dynamic pages... and forget to filter out the URLs for them? ... I think I'm sitting on at least 10,000,000 'possible' URLs.
Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections...
So they're traversing and computing a trillion intersections/nodes on a webmap to recalc PR on a continual basis (of course, why else?), and the computations include all of those, whether or not they're included in their index?
Is there something in this message that we're not grasping?
Right now I am assuming that the blog article "misspoke" but I will certainly file this idea with my "vigilance department"!
Google re-process[es] the entire web-link graph several times per day.
ENTIRE. Nothing deceptive about that, it's about as clear as plate glass.
[webmasterworld.com...]
It seems they have answered some of the concerns.
We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers.
It would have been great if they'd told us how they plan to keep pace with the notoriously fast-growing web in the years to come.
I wonder how much less of a number that would be if they stopped following forms.
Not much IMHO.
Spam sites (including many millions of parked-domain junk pages) are responsible for a big part of that 1 trillion figure. Sites that use non-obvious session IDs are bad offenders too - it's amazing that nobody thought of this when developing the HTTP/URL specs - session IDs should never have become part of URLs :(
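One mitigation a crawler can apply is canonicalizing URLs before counting them. A minimal sketch, assuming session IDs show up as query parameters with conventional names (the parameter list below is an assumption, not a standard):

```python
# Hypothetical URL canonicalizer: strip session-ID-style query parameters
# so the same page isn't counted as many distinct URLs. The parameter
# names are assumptions based on common conventions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    # Keep only query parameters that don't look like session IDs.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonicalize("http://example.com/page?id=7&PHPSESSID=abc123"))
# -> http://example.com/page?id=7
```

Of course, the whole complaint is that real session IDs are often *non-obvious*, so name-based filtering like this only catches the easy cases.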
I can get a result estimate of ~25,500,000,000 for my 'try to get as many results as possible' testing queries, so the number does seem very low.
that's why I said 'today'.
and mentioned that it's not accurate.
I guess I'm using the same method as you.
for TODAY ( 29th ) I got a number of 25,600,000,000.
my *other* method shows 25,430,000,000.
Both queries showed 10% of this figure when I made the post.
Hence me warning against the numbers.
pretty reliable estimates I guess
*pfft*
Either way, the point was to show the huge difference between what Google finds / knows about and what it indexes. 380 to 1 or 38 to 1 would both be amazing in terms of the necessary computation and processing power. Imagine the volumes of unwanted pages banging the gates.
...
I think Google isn't saying that the whole process is finished every day, but that there are several iterations every day. And I suspect that they don't run the process until it converges; they just keep running it, updating the connectivity graph on the fly. (Thus it never does converge, it just keeps chasing reality. But it chases reality as fast as it can.)
And the page rank you'd see if you could see page rank, isn't the converged number, it's just the results of the most recent iteration. Don't like it? just wait around a few hours and peek at the next iteration.
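That "never converges, just keeps chasing reality" idea maps onto the classic power-iteration formulation of PageRank. A minimal sketch on a toy three-page graph, assuming the published algorithm (Google's actual pipeline is not public); each call to `iterate` is one pass over the link graph, and whatever pass finished last is the rank you'd see:

```python
# Minimal PageRank power-iteration sketch on a toy link graph.
# This is the textbook algorithm, not Google's actual implementation.
DAMPING = 0.85

def iterate(ranks, outlinks):
    """One PageRank pass: redistribute each page's rank to its outlinks."""
    n = len(ranks)
    new = {u: (1 - DAMPING) / n for u in ranks}   # teleport share
    for u, targets in outlinks.items():
        if targets:
            share = DAMPING * ranks[u] / len(targets)
            for v in targets:
                new[v] += share
    return new

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {u: 1 / 3 for u in graph}
for _ in range(5):        # a few passes, deliberately NOT run to convergence
    ranks = iterate(ranks, graph)
print(ranks)
```

After five passes the ranks are close to, but not exactly, the converged values - which is the point: each iteration is a usable snapshot.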
updating the connectivity graph on the fly
It's hard to do it on the fly, and it's not necessary - if you use lots of servers you can iterate pretty quickly (say, one hour per pass) and use the updated index for the next run.
In theory the PR calculation should only use nodes that connect to other nodes - of the 1 trillion URLs, at least 80% were never visited, and those URLs are excluded from the PR calculation because they can't pass PR out (they have not yet been crawled).
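The pruning step described above can be sketched as follows, under the assumption that uncrawled URLs have no known outlinks and so can be dropped both as nodes and as link targets before running PageRank (the function name is hypothetical):

```python
# Sketch of pruning uncrawled URLs from the link graph before a PR run.
# Assumption: an uncrawled URL has no known outlinks, so it only soaks
# up rank without passing any out, and can be removed.
def prune_uncrawled(outlinks, crawled):
    """Keep only crawled URLs as graph nodes and as link targets."""
    return {u: [v for v in targets if v in crawled]
            for u, targets in outlinks.items() if u in crawled}

graph = {"a": ["b", "x"],   # "x" was discovered but never fetched
         "b": ["a"]}
print(prune_uncrawled(graph, crawled={"a", "b"}))
# -> {'a': ['b'], 'b': ['a']}
```

Dropping ~80% of a trillion nodes this way would shrink the per-iteration graph to something far more tractable.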
I've been assuming that the published iterative method is not what's in play today, although Google still runs it from time to time - as a checkup, to fix any long-term divergence between the two similar (but not identical) types of math.