Google News Archive Forum

    
3+ billion pages indexed - why can't we see them all?
Lots0




msg:51266
 10:28 pm on Jan 11, 2003 (gmt 0)

3+ billion pages indexed - why do we only get to see a small percent of those?

When you do a search on Google for any term that returns over 1,000 results, why does Google only let you see seven or eight hundred results?

It does not matter if the term you searched for returns 1,000 results or 10,000,000 results, Google only lets you see a few hundred... Why?

 

nativenewyorker




msg:51267
 10:39 pm on Jan 11, 2003 (gmt 0)

There used to be an option at the bottom of the search results to view the page with omitted results included.

Ted

bcc1234




msg:51268
 10:43 pm on Jan 11, 2003 (gmt 0)

For one thing, in any ordered dataset, if you want to get the N-th element, the system has to go through the N-1 elements before it just to figure out that the N-th element is in fact the N-th.

In other words, Google's software would have to go through 10M records every time you click refresh on the page that displays results 10,000,000 through 10,000,010.

That's just one "final" step, required after the whole index has been searched and the pages have been ordered (by relevance or anything else, for that matter). And this step alone would consume all of Google's resources.
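To make the point about the "final" step concrete, here is a minimal sketch in Python. It is only an illustration of skipping through an ordered list, not a description of Google's software; the names `ranked_stream` and `fetch_page`, and the toy scores, are made up for the example.

```python
from itertools import islice

def ranked_stream(scores, counter):
    """Yield page ids best-first, counting every record that gets visited."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for page_id, _score in ordered:
        counter["visited"] += 1
        yield page_id

def fetch_page(scores, offset, page_size=10):
    """Return the results at positions offset..offset+page_size-1,
    plus how many records had to be walked to get there."""
    counter = {"visited": 0}
    stream = ranked_stream(scores, counter)
    page = list(islice(stream, offset, offset + page_size))
    return page, counter["visited"]

# Toy data (an assumption): 1,000 pages, page 0 is the most relevant.
scores = {page_id: 1000 - page_id for page_id in range(1000)}
page, visited = fetch_page(scores, offset=500)
print(page[0], visited)
```

Asking for ten results starting at position 500 still walks 510 records, so the cost grows with how deep into the list you page, and that is on top of the sorting itself.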

[edited by: bcc1234 at 10:45 pm (utc) on Jan. 11, 2003]

Lots0




msg:51269
 10:45 pm on Jan 11, 2003 (gmt 0)

That button is there for some specialized searches (never for normal searches), and it usually only returns pages in the same domain that have the search term on them, such as links:yourdomain.com.

No, I am talking about normal searches. Why can't we see all the results? Why do we only get to see a very small percent of the sites returned?

<added>Bcc1234 if what you said were accurate, how do we get any search results at all?</added>

heini




msg:51270
 10:52 pm on Jan 11, 2003 (gmt 0)

I believe the cutoff is ~1000 results.

Why? Server load? I've never handled a database of 3 billion pages, so no idea really.

bcc1234




msg:51271
 11:03 pm on Jan 11, 2003 (gmt 0)

I have no way of knowing Google's internals.
If I did - I would be rich :)

But I assume they generate SERPs by sending the query to many parallel boxes and then combining the results.
Let's say there are 5 parallel boxes that together contain the whole index (the 3B pages), and each box has 1/5 of the index.

The query goes to all 5 of them and each returns the 1,000 (or any other preset number) most relevant results from ITS OWN index.
So we get 5 lists from 5 different indices.
After that, all 5 lists are combined and the final 1k results are sorted out from those records.

Why is there a limit? Well, it's easier to allocate memory for a list with an "at most" set limit of records.

That way, if some of the 5 boxes' indices did not have a single relevant page, it's just 4x1,000 or 3x1,000 etc.

On really specific terms it might be:
box 1 - 25 results
box 2 - 0 results
box 3 - 150 results
box 4 - 2 results
box 5 - 0 results

And the final list has 177 results.

But if the list is larger than the limit, it's truncated, with the least relevant entries being left out.

I can't even imagine an efficient architecture that would allow retrieving it all. After all, you would have to store it somewhere while it's being merged and served.
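The merge described above could be sketched roughly like this in Python. This is a guess at the general shape, not Google's actual architecture; `LIMIT`, `box_top_k`, `combine`, `fake_box` and the scores are all hypothetical names and values invented for the illustration.

```python
import heapq
from itertools import islice

LIMIT = 1000  # assumed cap, used both per box and for the final list

def box_top_k(matches, k=LIMIT):
    """matches: (score, url) pairs one box found for the query.
    Returns at most k of them, most relevant first."""
    return heapq.nlargest(k, matches)

def combine(boxes, k=LIMIT):
    """Merge the per-box lists and keep the k most relevant overall."""
    partial_lists = [box_top_k(matches, k) for matches in boxes]
    # Every partial list is already sorted best-first, so a k-way merge is enough.
    merged = heapq.merge(*partial_lists, reverse=True)
    return list(islice(merged, k))

def fake_box(name, count):
    """Hypothetical box holding `count` matching pages with made-up scores."""
    return [((i * 37 + len(name)) % 101, f"{name}/page{i}") for i in range(count)]

# The "really specific term" case from the post: 25, 0, 150, 2 and 0 results.
boxes = [fake_box("box1", 25), fake_box("box2", 0), fake_box("box3", 150),
         fake_box("box4", 2), fake_box("box5", 0)]
results = combine(boxes)
print(len(results))  # 177, under the limit, so nothing is truncated

top_100 = combine(boxes, k=100)
print(len(top_100))  # 100, the least relevant entries are left out
```

No box ever ships more than `k` records and the combiner never holds more than 5 x `k`, which is the memory argument for having a fixed limit.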

Lots0




msg:51272
 11:20 pm on Jan 11, 2003 (gmt 0)

Thanks bcc1234
That makes it a little more understandable :-)

heini




msg:51273
 11:27 pm on Jan 11, 2003 (gmt 0)

AV cuts off at 1K too; the ATW cutoff is at 4K.

Lots0




msg:51274
 11:30 pm on Jan 11, 2003 (gmt 0)

Yup, I was just checking that out myself.
You're right, the cutoff is 1000 for Google and AV - I hadn't gotten to the end on FAST yet :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved