g1smd

msg:3708067 | 11:58 am on Jul 26, 2008 (gmt 0) |
I'm guessing that a large percentage is duplication from session IDs, and alternative navigation paths generating duplicate content URLs. A large chunk will be auto-page-generating sites that return a page of junk whatever input URL you try. Many will be login and stats pages etc.
|
Receptional Andy

msg:3708294 | 8:38 pm on Jul 26, 2008 (gmt 0) |
Just as a clarification: they mean a trillion 'discovered' URLs, not a trillion URLs actually indexed. And as the blog entry points out, the internet is pretty much infinite URLs if you think about various types of dynamic page out there. Indeed in that context, it's not all that meaningful a statement - there are 1 trillion URLs that Google is aware of that made it past initial filtering, most of which they admit won't even be spidered. More interesting is the comment that they use these trillion URLs to determine PageRank, and claim to recalculate PR more than once a day: | Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections... |
|
|
trinorthlighting

msg:3709775 | 10:10 pm on Jul 28, 2008 (gmt 0) |
I wonder how many billions of pages are worthless spam.
|
Miamacs

msg:3709803 | 10:54 pm on Jul 28, 2008 (gmt 0) |
Today Google.com estimated the number of its indexed pages at ~2,620,000,000 (out of the ~1,000,000,000,000 it claims to have found)... Let's assume these numbers are accurate (which they're not). That'd mean that Google indexed only a single URL out of every 381 discovered, which would seem a surprisingly ideal number to me... even though my little experiment was flawed to its core. Now, on the other hand, CUIL *hehe* "has ~121,617,892,992 web pages indexed" (quoted from site). Mind you, searching for 'a' (the letter) shows an estimate of only ~1,899,710,588 web pages. | The Internet has grown exponentially in the last fifteen years but search engines have not kept up - until now. Cuil searches more pages on the Web than anyone else - three times as many as Google and ten times as many as Microsoft. |
| ( from cuil.com [cuil.com] ) Actually it's not THAT hard to beat Google's index size. All you've got to do is scrape all day and night. What's hard to do... is decide what to show, when and in what order. Then to keep it fresh. And post on your blog 3 days before the competition launches. ...Have you ever let XENU check your site with a calendar / looped dynamic pages... and forget to filter out the URLs for them? ... I think I'm sitting on at least 10,000,000 'possible' URLs.
|
malcolmcroucher

msg:3710042 | 7:58 am on Jul 29, 2008 (gmt 0) |
wonder how much money is out there then.
|
Marcia

msg:3710050 | 8:14 am on Jul 29, 2008 (gmt 0) |
BTW, certain reputable news publications have indicated in articles that they think the blog post is misleading. | Google re-process[es] the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections... |
| OK now, hold the phone. So they're traversing and computing a trillion intersections/nodes on a webmap to recalc PR on a continual basis (of course, why else?), and the computations include all of those, whether or not they're included in their index? Is there something in this message that we're not grasping?
|
tedster

msg:3710059 | 8:29 am on Jul 29, 2008 (gmt 0) |
That's one heck of a question: Do links on URLs that are not in Google's visible index still have an effect on PageRank? I've always assumed that they don't, as I'm sure many of us do. And yet there could be a strange logic here. Right now I am assuming that the blog article "misspoke" but I will certainly file this idea with my "vigilance department"!
|
Marcia

msg:3710091 | 9:39 am on Jul 29, 2008 (gmt 0) |
Ted, I don't think the blog article necessarily misspoke, but I do believe that the Washington Post reporter/columnist hasn't done enough IR theory homework to properly interpret what I usually fondly refer to as "GoogleSpeak." | Google re-process[es] the entire web-link graph several times per day. |
| ENTIRE. Nothing deceptive about that, it's about as clear as plate glass.
|
KFish

msg:3710104 | 10:21 am on Jul 29, 2008 (gmt 0) |
I had posted a similar question here: [webmasterworld.com...] It seems they have answered some of the concerns. | We don't index every one of those trillion pages -- many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn't very useful to searchers. |
| It would have been great had they told us how they are going to keep pace with the ever-growing web in the time to come.
|
Marcia

msg:3710117 | 10:54 am on Jul 29, 2008 (gmt 0) |
That's what filters are for.
|
Murdoch

msg:3710264 | 2:25 pm on Jul 29, 2008 (gmt 0) |
I wonder how much less of a number that would be if they stopped following forms. That's why I disable robots on anything with POST data. I don't need my search results to bring me to a page with search results. The whole thing feels like they just do it to have the bigger number to brag about.
|
ecmedia

msg:3710297 | 3:01 pm on Jul 29, 2008 (gmt 0) |
It has been known for a while that Google adds many pages and then drops many more on an almost regular basis. I think it kind of makes sense when you hear all those people complaining that their pages are dropping: G found out that the info was already there on some other website that has been around a while and has higher PageRank. I guess unique content is the mantra.
|
jimbeetle

msg:3710336 | 3:43 pm on Jul 29, 2008 (gmt 0) |
| today Google.com estimated the number of its indexed pages at ~2,620,000,000 |
| I'm confused. Where did this number come from? Certainly Google didn't lose 6 billion or so pages over the past three years.
|
Receptional Andy

msg:3710347 | 3:52 pm on Jul 29, 2008 (gmt 0) |
I can get a result estimate of ~25,500,000,000 for my 'try to get as many results as possible' testing queries, so the number does seem very low.
|
Dabrowski

msg:3710417 | 5:03 pm on Jul 29, 2008 (gmt 0) |
They probably thought that if the information you wanted isn't in the top 25.5 billion, you won't look any further! :D
|
Lord Majestic

msg:3710429 | 5:12 pm on Jul 29, 2008 (gmt 0) |
| I wonder how much less of a number that would be if they stopped following forms. |
| Not much IMHO. Spam sites (including many millions of junk parked domains) are responsible for a big part of that 1 trillion figure. Sites that use non-obvious session IDs are bad offenders too - it's amazing that nobody thought of it when developing the HTTP/URL specs - session IDs should never have become part of URLs :(
|
Miamacs

msg:3710439 | 5:30 pm on Jul 29, 2008 (gmt 0) |
| I can get a result estimate of ~25,500,000,000 for my 'try to get as many results as possible' testing queries, so the number does seem very low. |
| That's why I said 'today', and mentioned that it's not accurate. I guess I'm using the same method as you do: for TODAY (29th) I got a number of 25,600,000,000. My *other* method shows 25,430,000,000. Both queries showed 10% of this figure when I made the post, hence me warning against the numbers. Pretty reliable estimates I guess *pfft*. Either way, the point was to show the huge difference between what Google finds / knows about and what it indexes. 380 to 1 or 38 to 1 would both be amazing in terms of necessary computation and processing power. Imagine the volumes of unwanted pages banging the gates. ...
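For anyone who wants to redo the arithmetic, here is a rough sketch using only the figures quoted in this thread. The estimates are result counts from test queries, not confirmed index sizes, and the function name is just for illustration. Note that with the 25.6 billion figure the ratio comes out nearer 39 to 1 than 38 to 1.

```python
# URLs Google says it has discovered (from the blog post)
DISCOVERED = 1_000_000_000_000

# Result-count estimates quoted in this thread (not verified index sizes)
estimates = {
    "Jul 28 estimate": 2_620_000_000,
    "Jul 29 estimate": 25_600_000_000,
}

def discovered_per_indexed(discovered_urls, indexed_pages):
    """How many discovered URLs there are for each indexed page."""
    return discovered_urls / indexed_pages

ratios = {
    label: discovered_per_indexed(DISCOVERED, pages)
    for label, pages in estimates.items()
}

for label, ratio in ratios.items():
    print(f"{label}: roughly 1 indexed page per {ratio:.1f} discovered URLs")
```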
|
joelgreen

msg:3710635 | 8:43 pm on Jul 29, 2008 (gmt 0) |
| and the computations include all of those, whether or not they're included in their index? |
| Could be. Every link is a vote, and when you compare with people, even those in prison can vote. Google could assign very low importance to such votes from "spammy pages".
|
hutcheson

msg:3710649 | 9:12 pm on Jul 29, 2008 (gmt 0) |
Remember that pagerank calculation involves an _iterative_ process, and it isn't "finished" until continued iteration doesn't result in any significant change. I think Google isn't saying that the whole process is finished every day, but that there are several iterations every day. And I suspect that they don't run the process until it converges; they just keep running it, updating the connectivity graph on the fly. (Thus it never does converge, it just keeps chasing reality. But it chases reality as fast as it can.) And the page rank you'd see if you could see page rank, isn't the converged number, it's just the results of the most recent iteration. Don't like it? just wait around a few hours and peek at the next iteration.
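To make the "iterate until nothing changes much" idea concrete, here is a minimal sketch of the published iterative method on a toy graph. It is not a claim about what Google actually runs; the damping value of 0.85 comes from the original PageRank paper, and the graph, function names, and tolerance are assumptions for illustration.

```python
def pagerank_step(links, ranks, damping=0.85):
    """One iteration: every page passes its rank along its outlinks."""
    n = len(ranks)
    # Rank held by pages with no outlinks is spread evenly over all pages
    dangling = sum(ranks[p] for p in ranks if not links.get(p))
    new = {p: (1 - damping) / n + damping * dangling / n for p in ranks}
    for page, outs in links.items():
        for target in outs:
            new[target] += damping * ranks[page] / len(outs)
    return new

def pagerank(links, tol=1e-9, max_iter=1000):
    """Repeat steps until the total change drops below tol."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    ranks = {p: 1 / len(pages) for p in pages}
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new = pagerank_step(links, ranks)
        change = sum(abs(new[p] - ranks[p]) for p in pages)
        ranks = new
        if change < tol:
            break
    return ranks, iterations

# Hypothetical four-page site: "d" links out but nothing links to it
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_ranks, toy_iterations = pagerank(toy_graph)
```

The "never converges, just keeps chasing reality" variant would amount to calling `pagerank_step` over and over while the `links` dictionary is being updated between calls, and reading off whatever the latest `ranks` happen to be.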
|
webfoo

msg:3710694 | 10:25 pm on Jul 29, 2008 (gmt 0) |
and cuil claims to be the biggest on the 'net with only 125 bil
|
Lord Majestic

msg:3710719 | 11:36 pm on Jul 29, 2008 (gmt 0) |
| updating the connectivity graph on the fly |
| It's hard to do it on the fly and not necessary - if you use lots of servers then you can iterate pretty quickly (say 1 hour), and use updated index for the next run. In theory PR calculation should only use nodes that connect to other nodes - 1 trln urls includes at least 80% of urls they never visited, these urls are removed from PR calculation because they can't pass out PR (as they have not yet been crawled).
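A small sketch of that pruning step, assuming the link graph is held as a dictionary of page to outlinks and the crawler keeps a set of URLs it has actually fetched. Both structures and the function name are hypothetical; nothing here is known about how Google stores its graph.

```python
def crawled_subgraph(links, crawled):
    """Keep only crawled pages, and only links that point at crawled pages."""
    return {
        page: [target for target in outs if target in crawled]
        for page, outs in links.items()
        if page in crawled
    }

# "x" and "y" have been discovered through links but never fetched
discovered_links = {"a": ["b", "x"], "b": ["a", "y"]}
crawled_pages = {"a", "b"}

pruned = crawled_subgraph(discovered_links, crawled_pages)
```

Whatever is left after pruning is the graph the next iteration would run over; the dropped URLs come back in once they have been crawled and their outlinks are known.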
|
tedster

msg:3710737 | 12:07 am on Jul 30, 2008 (gmt 0) |
Google never told us the exact method, but back when they moved away from monthly updates, they also changed the method of calculating PR to some other kind of math they had uncovered that gave similar results. This allowed them to move into continual calculation of PR, and they've found even more shortcuts for PR calculation since then. I've been assuming that the published iterative method is not what's in play today, although Google still runs it from time to time - as a checkup, to fix any long-term divergence between the two similar (but not identical) types of math.
|
Lord Majestic

msg:3710752 | 12:48 am on Jul 30, 2008 (gmt 0) |
Well, I think what they did was parallelise the calculations, which allowed them to run it on thousands of computers; this is a big speed-up. I don't think it is possible to give up the iterative nature of PR, but we probably won't know for sure for some time.
|
airpal

msg:3711759 | 2:43 am on Jul 31, 2008 (gmt 0) |
As previously mentioned, they're probably not counting the 90% of the pages that are considered spam.
|
Brett_Tabke

msg:3712924 | 12:56 pm on Aug 1, 2008 (gmt 0) |
| and cuil claims to be the biggest on the 'net with only 125 bil |
| Which is true. Google is only claiming to have found 1 trillion URLs. Cuil has not said how many URLs it has "found" either.
|
Lord Majestic

msg:3712938 | 1:12 pm on Aug 1, 2008 (gmt 0) |
| Cuil has not said how many urls it has "found" either. |
| Looking at results that they return one might think that they did say it. I'd estimate that 125 bln unique crawled pages would produce around 600 bln unique urls. [edited by: Lord_Majestic at 1:13 pm (utc) on Aug. 1, 2008]
|
brotherhood of LAN

msg:3715644 | 1:35 pm on Aug 5, 2008 (gmt 0) |
| duplication from session ids |
| I see that Google can ignore session IDs quite easily (the 32 char hexadecimal ones anyway), as far as toolbar PR goes.
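As a guess at how that kind of filtering could work, here is a sketch that drops any query parameter whose value is exactly 32 hexadecimal characters, so that URLs differing only by session ID collapse to one. This is an assumed heuristic, not Google's documented behaviour, and it would also strip legitimate 32-char hex values such as MD5 hashes.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Matches values made of exactly 32 hex digits (typical of PHPSESSID and similar)
HEX32 = re.compile(r"[0-9a-fA-F]{32}")

def strip_session_ids(url):
    """Return the URL without query parameters that look like session IDs."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not HEX32.fullmatch(value)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

first = strip_session_ids(
    "http://example.com/page.php?id=7&PHPSESSID=0123456789abcdef0123456789abcdef"
)
second = strip_session_ids(
    "http://example.com/page.php?id=7&PHPSESSID=fedcba9876543210fedcba9876543210"
)
```

Both example URLs reduce to the same address, which is the point: one page, one URL, however many sessions were handed out.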
|
|