
Google SEO News and Discussion Forum

    
Google Word Count Stats
Brett_Tabke




msg:722388
 2:17 pm on Mar 10, 2005 (gmt 0)

A very nice piece of analysis over on Jean Véronis's Aix-en-Provence [aixtal.blogspot.com] site about stop words and Google rounding search count numbers.

 

Liane




msg:722389
 2:48 pm on Mar 10, 2005 (gmt 0)

Ever since Google started reporting over 8 billion pages indexed which represented a massive (overnight) increase from their previous index counts ... I haven't trusted any of Google's numbers for anything.

I've long questioned the number of search results returned for any given search and I know that the number of pages returned when doing a site:mysite.com search returns highly inflated results.

If Google can't correctly report that a site with 154 pages has 154 pages and incorrectly reports that the site has 211 ... then it only stands to reason that something, somewhere is messed up in their counting matrix. The phantom "supplemental page" result is (I suspect) at the root of the problem.

It also makes the reported 8,058,044,651 pages indexed very suspect to my way of thinking.

My math skills are sadly lacking, but I would be interested to know whether Google's apparent inability to count reflects random inconsistencies, or whether there is any correlation (percentage-wise) between Jean Véronis's "the" theory and the other irregularities noted by various others.

Everyone assumes that Google is purposely not showing us all our backlinks because they don't want webmasters to target competitors' links and go after them for themselves. Well, is that true, or are the link results just completely messed up, like the page count for individual web sites and the "the" theory?

mrMister




msg:722390
 2:53 pm on Mar 10, 2005 (gmt 0)

Ever since Google started reporting over 8 billion pages indexed which represented a massive (overnight) increase from their previous index counts ... I haven't trusted any of Google's numbers for anything.

It wasn't overnight. That figure wasn't updated for a number of years.

In case you haven't figured it out, Google likes to play games with other search engines.

Every so often, a rival search engine sends out a big press release claiming they have the largest index. They make a big song and dance about it and get all boastful, because their index is bigger than Google's.

When that happens, Google updates the number on their home page to reflect the number of pages they currently have. It basically makes a mockery of the other search engine.

The number on their homepage stays the same until a rival search engine starts to get cocky... then Google updates the number (they always have the largest index at any one time). The number is accurate, it just isn't updated frequently.

If Google can't correctly report that a site with 154 pages has 154 pages and incorrectly reports that the site has 211 ... then it only stands to reason that something, somewhere is messed up in their counting matrix

Either that, or they think there's no point spending excess processor time for a statistic that doesn't need to be deadly accurate.

I know I prefer faster results to an accurate count of the total number of pages. The figure Google gives is accurate enough for most purposes.

mrMister




msg:722391
 3:01 pm on Mar 10, 2005 (gmt 0)

I've always assumed that the "the" count has been hard coded in recent years.

The main reason people search for that word is to see how many pages are listed in Google. Why waste processor time when you can hard code it?

Bear in mind, when thinking that Google is lying, that Google has a supplemental index which wouldn't normally be referenced in queries with lots of results. I believe they also have a separate index for one-word queries.

Liane




msg:722392
 3:11 pm on Mar 10, 2005 (gmt 0)

when thinking that Google is lying

Please note that you have used that word and not I. I happen to think calling anyone or any entity a liar is pretty serious stuff.

I think they are having computing problems which have affected their counts for almost everything. As a result, I don't trust their reported counts. Period.

ciml




msg:722393
 3:14 pm on Mar 10, 2005 (gmt 0)

A most interesting experiment. Mr Véronis goes on to suggest [aixtal.blogspot.com] an index of two parts.

It seems highly likely that Google would search a subset of the results for searches that would otherwise return very large numbers of results.

This makes sense, as you'd expect to have the top results for "the" after looking at the 80 million most important pages. The rest of the eight billion would seem somewhat unnecessary.

Also, it would help with daft searches such as +the OR +www OR +a OR 1 OR +com OR html OR 2005 OR htm [google.com] that would otherwise require the processing of some rather large lists.

I don't see the figures as at all dishonest, as Google still uses the full index when needed, e.g. "+the +www +a 1 +com html 2005 htm blue fuzzy widget".

Lord Majestic




msg:722394
 3:26 pm on Mar 10, 2005 (gmt 0)

It seems highly likely that Google would search a subset of the results for searches that would otherwise return very large numbers of results.

It is certain that they have sub-indexes, at least one of which is built on what they used to (and probably still) call "fancy" hits, i.e. bolded items, in-title items, anchor text etc. This sub-index is 10 or more times smaller than the main full index. The first search is made on that index, since it is reasonable to assume that good "fancy" hits will be better pages than hits from the general index.

Now, having said that, there is still no reason why the total number of matches could not be calculated on the basis of the main index. The figure will be approximate, since they (or anyone else who knows this stuff) would not actually run a full search to calculate the number of matches -- they would just merge the sub-indexes and get an approximate number in no time.
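Roughly speaking -- and this is only a sketch of the general technique, not anything Google has documented -- you can get a usable count without running the full search by probing just a small prefix of the shortest posting list and scaling the hit rate up:

import bisect

def contains(sorted_ids, doc_id):
    """Membership test on a doc-ID-sorted posting list."""
    i = bisect.bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

def estimate_matches(short_list, long_list, sample_size=1000):
    """Approximate the size of the intersection of two posting lists by
    probing only a prefix of the shorter list, then extrapolating."""
    sample = short_list[:sample_size]
    if not sample:
        return 0
    hits = sum(1 for doc_id in sample if contains(long_list, doc_id))
    return int(round(hits / len(sample) * len(short_list)))

The estimate is crude (a prefix is not a random sample), but it costs a thousand lookups instead of tens of millions, which is the whole point.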

BReflection




msg:722395
 4:08 pm on Mar 10, 2005 (gmt 0)

Now, having said that, there is still no reason why the total number of matches could not be calculated on the basis of the main index

There is a reason. Google's index has become largely dynamic, with pages from high-PR sites such as Slashdot showing up at #2 in the main index in a matter of four days (or less. I am talking specifically about Jef Raskin [google.com]. Slashdot was #2 four days after their article and is now #10).

Every time you query Google it hits ~1000 machines. In striving for a completely dynamic index, they have indicated that their biggest limitation is the complaints of smaller webmasters that their spiders are hitting too hard. There is no reason they wouldn't have a few (tens of) machines calculating a rough, rounded estimate of the number of occurrences in the index being searched at that time.

This explains the anomaly in the OP.

Lord Majestic




msg:722396
 4:19 pm on Mar 10, 2005 (gmt 0)

(or less. I am talking specifically about Jef Raskin. Slashdot was #2 four days after their article and is now #10)

I think this is related to ranking that changes depending on how fresh the page is, rather than to the index per se -- the way they work is to have dynamic ranking formulae that can change (easy) without having to redo the whole index (hard).

There is no reason they wouldn't have a few (tens of) machines calculating a rough, rounded estimate of the number of occurrences in the index being searched at that time.

There are many reasons not to do it:
a) queries are very fragmented (no 80/20 rule) -- you can't precompute everything perfectly, and b) shows it makes no sense anyway;
b) the computational power required to make an estimate is low -- who cares whether there are 80 mln or 800 mln pages for your term, so long as you get a decent first 30-50 matches?

Having an imperfect but very fast estimate calculation is one thing; having smaller indexes than claimed is another thing entirely.

Lorel




msg:722397
 4:30 pm on Mar 10, 2005 (gmt 0)

If Google can't correctly report that a site with 154 pages has 154 pages and incorrectly reports that the site has 211 ... then it only stands to reason that something, somewhere is messed up in their counting matrix.

I have also noticed that Google reports more pages than I actually have, but then I realized it is also counting the small pop-up windows I set up for people to click on for more info, which I didn't build as regular HTML pages. I have since disallowed those pages in robots.txt, so they should disappear from the listing soon. This could be why you're getting more pages listed than you actually have.
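For anyone wanting to do the same, the rule I mean looks something like this (the /infowindows/ path is only an example; use whatever directory those little windows actually live in):

User-agent: *
Disallow: /infowindows/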

scenpro




msg:722398
 4:37 pm on Mar 10, 2005 (gmt 0)

Isn't the real question "how many pages in Google are available for searching?"
For example: on Monday I get 1,000 hits on 'X'; on Tuesday I get 0 hits for 'X' but 500 for 'Y'; on Wednesday, 0 for 'X' and 'Y' but 750 for 'Z' -- and this goes on every day.

And then let's say I get a search for "A C DF". I check my tracker to see what page it's on in the search results, and it's gone.

8 billion pages, maybe, but how many are available to be returned as a result at any one time? Isn't that the real question?

randle




msg:722399
 4:56 pm on Mar 10, 2005 (gmt 0)

Google has consistently demonstrated themselves as masters of spin, deflection and disinformation. Their ability to carry this persona well into their emergence as a public company is amazing really. I don’t blame them one bit, but any information gleaned other than where your site sits in the results should be looked upon with extreme skepticism.

ciml




msg:722400
 5:45 pm on Mar 10, 2005 (gmt 0)

Lord Majestic, I think fancy hits are to do with the inclusion of the words within the page. But you got me thinking...

The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu], section 4.5:
To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

So in the days of Backrub, they used the first 40,000 results that matched all the words (after the union). Back then, the index was in the region of 25 million pages.

The index is ~8 billion pages. My suggestion is that Google might nowadays use the first 80 million results for each of the words, and then look for the intersection.
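To make that concrete, here is a rough sketch of what such a cut-off looks like (my own toy version -- the 40,000 figure is from the paper, everything else is invented for illustration):

def intersect_with_cutoff(posting_lists, limit=40_000):
    """Intersect doc-ID posting lists, but stop scanning once `limit`
    matching documents have been found -- the 'go to step 8' short cut."""
    posting_lists = sorted(posting_lists, key=len)  # drive from the shortest list
    shortest, rest = posting_lists[0], [set(p) for p in posting_lists[1:]]
    matches = []
    for doc_id in shortest:
        if all(doc_id in s for s in rest):
            matches.append(doc_id)
            if len(matches) >= limit:
                break  # this is where sub-optimal results can creep in
    return matches

If Google now does something similar per word -- say the first 80 million entries of each posting list -- before intersecting, the reported totals would be estimates over that subset rather than counts over the full index.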

Liane




msg:722401
 5:56 pm on Mar 10, 2005 (gmt 0)

This could be why you're getting more pages listed than you actually have.

Nope ... no small windows, no pop ups, no pop unders. Nothing but straight html pages.

The index is ~8 billion pages. My suggestion is that Google might nowadays use the first 80 million results for each of the words, and then look for the intersection.

That seems logical, which might also partially explain the so called "sandbox".

ciml




msg:722402
 6:41 pm on Mar 10, 2005 (gmt 0)

> might also partially explain the so called "sandbox"

I had discounted that on the assumption that the sandbox had to apply to phrases with no words under 80 million results. Now I'm not sure, let's run with that idea [webmasterworld.com].

Lord Majestic




msg:722403
 6:47 pm on Mar 10, 2005 (gmt 0)

The index is ~8 billion pages. My suggestion is that Google might nowadays use the first 80 million results for each of the words, and then look for the intersection.

It's not impossible that they check (rank) only the first X matches; however, the estimate they show does not depend on them actually matching and calculating ranks for all Y > X matches. Estimation can be done based on available index data -- this is very fast, but it won't be very reliable, which is not that important when you deal with tens of millions of matches.

It's sure a lot more complicated than in the Backrub days ;)

claus




msg:722404
 7:20 pm on Mar 10, 2005 (gmt 0)

About time somebody paid some attention to those numbers. In light of this thread, the "dual index" thing seems much more rational than in earlier threads (I must have overlooked or forgotten that sentence in the Anatomy paper somehow -- it's been ages since I read it). Still, I somewhat object to that term (dual index), as it's still one big index imho; some parts just get treated differently than others.

Very interesting behavior with that Jef Raskin search -- it's #12 for me now. I wonder if sites like /. are flagged for "very dynamic content" and ranked accordingly. Front-page links disappear quickly on a newsworthy day, and then the story slips to the next page, and the next page, and... (lower PR levels).

This would require continuous PR calculation / ranking, which has been hinted at before, but Slashdot speed is fast. So here's another way to divide the index: some pages might require more frequent ranking calculations than others ;)

lego_maniac




msg:722405
 12:35 am on Mar 11, 2005 (gmt 0)

The index is ~8 billion pages. My suggestion is that Google might nowadays use the first 80 million results for each of the words, and then look for the intersection.

How would they handle instances where the two sets don't intersect within the first 80 million?

Hypothetical:
"widgets" - competitive term, use first 80 million out of 2 billion pages
"rareword" - non-competitive term, use 26 out of 26 pages

Johnny's web page ranks 90 millionth for "widgets".
Johnny's same page ranks 23rd for "rareword".

In this hypothetical, let's pretend the "rareword widgets" word combination can only be found on Johnny's website.

Would Google have to perform a lookup for "widgets" a second time since the first run-through used only 80 million in the set?

I may be inclined to fabricate some "rare" words (XAISHXNIYIWLP) and group them with very competitive terms on a page just to see how they're treated in Google.

Lord Majestic




msg:722406
 1:37 am on Mar 11, 2005 (gmt 0)

Hypothetical:
"widgets" - competitive term, use first 80 million out of 2 billion pages
"rareword" - non-competitive term, use 26 out of 26 pages

This type of query is very easy, since you only need to check 26 candidate intersections (rareword AND widgets) -- this sort of low match count should allow for full evaluation and exact counting of matches. The hard work is when you get common words with frequencies of (say) 100 mln and 300 mln.
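For the hard case, an estimate can be pulled straight from the index statistics. Assuming -- purely for illustration -- that the two words are spread independently across the index, the expected overlap is just the product of the two frequencies divided by the index size:

def independence_estimate(df_a, df_b, index_size):
    """Expected intersection size if the two words were distributed
    independently across the index -- no posting lists touched at all."""
    return int(df_a * df_b / index_size)

# Toy numbers from above: 100 mln and 300 mln matches out of 8 bln pages
print(f"{independence_estimate(100_000_000, 300_000_000, 8_000_000_000):,}")
# -> 3,750,000

Real words are nothing like independent, of course, which is one more reason to read the displayed totals as rough figures rather than counts.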

Marval




msg:722407
 1:40 am on Mar 11, 2005 (gmt 0)

Liane - your last sentence/question was brought up a long time ago. The "messed up" backlinks were actually something Google publicly stated here was done as the result of a suggestion made at a conference by a WebmasterWorld member, who thought it would be a good way to cut down on the use of that data for your stated purpose. It had been discussed at the conference, and Google implemented it the following month with full disclosure right here -- although I can't find the actual thread where we discussed it with their rep at the time.

danny




msg:722408
 1:57 am on Mar 11, 2005 (gmt 0)

For most purposes, it hardly matters whether the Google "page count" numbers are accurate or not. But linguists have been using Google's index as a corpus for linguistic analysis, and they've discussed some of the problems this raises at length. Someone has even suggested a new unit: whG/Gp would be "Google web hits per gigapage":

[itre.cis.upenn.edu...]
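As I understand it, the unit simply normalises raw hit counts by the size of the index, so that figures stay comparable as the index grows. With made-up numbers: a word reported at 403,000,000 hits against the advertised 8.058-gigapage index works out to 403,000,000 / 8.058 ≈ 50,000,000 whG/Gp, i.e. roughly 50 million hits per gigapage.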

webnewton




msg:722409
 11:29 am on Mar 11, 2005 (gmt 0)

The index is ~8 billion pages. My suggestion is that Google might nowadays use the first 80 million results for each of the words, and then look for the intersection.

This makes sense, ciml. I believe Google operates with, say, a sort of backend and a frontend. The so-called 8 billion page index resides at the backend, shared among the various datacenters (visible and invisible), whereas the SERPs are served from the frontend. A site moves from the backend to the frontend based on some criteria: importance, or the age of its index entry.
This also makes sense when you say that sites which were indexed some 5-6 months back have still not been included in the SERPs.

Kirby




msg:722410
 3:32 pm on Mar 11, 2005 (gmt 0)

If Google can't correctly report that a site with 154 pages has 154 pages and incorrectly reports that the site has 211 ... then it only stands to reason that something, somewhere is messed up in their counting matrix

Your site may only have 154 pages, but Google may have indexed 211 different URLs for those 154 pages, which Google then counts as different pages.
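To illustrate (these URLs are hypothetical, but they are the usual suspects), all of the following can be indexed as separate "pages" even though they serve the same document:

http://example.com/widgets
http://www.example.com/widgets
http://www.example.com/widgets/
http://www.example.com/widgets/index.html
http://www.example.com/widgets?sessionid=abc123

Five listed URLs, one actual page -- enough of those and 154 pages becomes 211 entries.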

Lord Majestic




msg:722411
 3:36 pm on Mar 11, 2005 (gmt 0)

Your site may only have 154 pages, but Google may have indexed 211 different URLs for those 154 pages, which Google then counts as different pages.

This brings up another question -- how many of the 8 bln "pages" are mere URL-only entries? It seems to me that the number of those in the "supplemental results" is much higher than in the main index.

Just Guessing




msg:722412
 4:58 pm on Mar 11, 2005 (gmt 0)

Search for:

google +the 60,900,000 results
google -the 44,300,000 results

Total 105,200,000 results by my reckoning.

Search for:

google 215,000,000 results

Huh?

How many pages are there, Google?

Worse still, search for:

google the
"the" is a very common word and was not included in your search.
80,200,000 results
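If the counts were exact, the first two figures would have to add up to the third, since every page containing "google" either contains "the" or it doesn't. A quick sanity check with the figures above:

with_the    = 60_900_000    # google +the
without_the = 44_300_000    # google -the
combined    = 215_000_000   # google

print(f"{with_the + without_the:,}")              # 105,200,000
print(f"{combined - with_the - without_the:,}")   # 109,800,000 unaccounted for

And if "the" really is "not included in your search", the last query should come back with the same 215,000,000 as plain "google", not 80,200,000.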

Liane




msg:722413
 5:44 pm on Mar 11, 2005 (gmt 0)

The "messed up" backlinks were actually something Google publicly stated here was done as the result of a suggestion made at a conference by a WebmasterWorld member, who thought it would be a good way to cut down on the use of that data for your stated purpose.

Yes, I recall that and it was put into practice shortly after the Boston PubCon in May 2003. However, that was then and this is now. I just wonder if it still remains the case?

Marval




msg:722414
 10:32 pm on Mar 11, 2005 (gmt 0)

Liane - as far as I can see in the backlinks I watch somewhat carefully (which aren't too many, so this may not be a good statistical sample), I haven't seen much change in the "types" of backlinks I see - mostly low-grade, low-importance links with very little PR transfer. The ordering of the links doesn't seem to have changed much either, but again that may just be limited to the type of sites I watch.

pontifex




msg:722415
 11:26 pm on Mar 11, 2005 (gmt 0)

Let's say I have 8 large boxes of sweets, roughly 64 billion sweets all together: reds, greens, blues, yellows, etc., all colors wildly mixed...

I take my blade and shovel approximately the same amount into 8 different buckets.

Now I sell each bucket to children on the street for 0.05 cents each, right in front of a large school with around 200 million pupils.

I am quite sure none of the pupils would care about the amount of sweets in each bucket, or whether they are distributed right and in the right colors... just the teachers would come running and look closely at what I am doing, and maybe the ice cream man would try to get me off the school yard...

Google is a good-looking sweet, not the "better" product anymore.

2 cents for my thoughts, 5 cents for my clicks ;-)
P!

PS: nevertheless, the article is very interesting, thanks Brett!

BillyS




msg:722416
 1:41 am on Mar 12, 2005 (gmt 0)

Yes, thanks Brett, good stuff. Makes you think a little bit about search terms. This has probably been reported like a thousand times but it was insightful to me.

I did the search query:

best of the web

then tried:

best to the web
best to for web
best of for web

all with the same result. So I tried:

best of web
best a web
best to web

Same result again. So I tried:

best to of the web
best for of the web
best of to the web

Different, but consistent results! So I tried:

best of to the for web
best of to the a web

They were consistent but changed again. Now this might seem useless to some, but as someone who writes articles a lot, I just learned something about spacing my keywords and writing more naturally.

As a follow-up, this does not work the same way on Yahoo. Playing around with other combinations, Yahoo seems to handle these word groupings in a more logical manner.
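My guess at what's going on (the stop-word list below is mine, not Google's actual list): if the engine drops the filler words but keeps track of where they sat, then queries that differ only in which filler words they use collapse to the same thing, while adding or removing one changes the pattern:

STOP_WORDS = {"the", "a", "of", "to", "for"}  # guessed list, for illustration

def normalise(query):
    """Replace stop words with a placeholder: only their number and
    position survive, not the words themselves."""
    return ["*" if w in STOP_WORDS else w for w in query.lower().split()]

print(normalise("best of the web"))     # ['best', '*', '*', 'web']
print(normalise("best to for web"))     # ['best', '*', '*', 'web']  same group
print(normalise("best of web"))         # ['best', '*', 'web']       new group
print(normalise("best of to the web"))  # ['best', '*', '*', '*', 'web']

That would explain why the variations with the same number of filler words matched each other, but changed as soon as I added or removed one.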
