Forum Moderators: Robert Charlton & goodroi


inflated page count

10x real count

         

nickied

1:30 pm on Nov 18, 2005 (gmt 0)

10+ Year Member



Most of the focus recently has been on Jagger, as it probably should be. But there is another topic that is, and has been, problematic - the inflated page count.

The topic was covered, though only briefly, back in August in this thread: [webmasterworld.com...]

On my small site, I figure there are around 1,000 pages. G thinks around 10k. The yourcache tool shows the problem starting around July 4. The correct count came back around July 10, then took off again on July 16 and is now at 10k+.

I do have a 301 redirect in place which took care of the non-www problem. I also see at least 4 different cache dates going back to around last January.

In the old cache I've got old pages which I can't get rid of (using console). Long story. My programmer will attempt to fix these through htaccess returning a 404.

Surely this must be as important as the Jagger threads, owing to possible duplicate content?

Comments?

twebdonny

8:05 pm on Nov 27, 2005 (gmt 0)



Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page. ....Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index

That is a bunch of BS, there are many pages that use &id= and ARE included in the index.

and as far as what Googleguy says...

Google needs to just fix its own problems with indexing instead of trying to get each individual webmaster to cut and hack at these issues, hoping that they will somehow be miraculously remedied 6 months down the line.

All this could be fixed with one single command in Google's current algorithm.

daveVk

3:21 am on Nov 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




Now look at the numbers Google reports. I'm not kidding here, and this has been going on for around nine months or more. This is a nonprofit, noncommercial, educational site.

site:www.mydomain.org 5,350,000

site:www.mydomain.org reserved 390,000

site:www.mydomain.org -reserved 1,320,000


Clearly the first figure should be the sum of the other two, as there is no third option regardless of the keyword used. The integrity test is A = B + C; in this case, 5,350,000 = 390,000 + 1,320,000. Failed.

So either:
- the index is broken
- the inflation factor is different for A than for B and C
- the integrity test is wrong

B + C = 1,710,000; adjusting for the x10 factor gives 171,000, which is in the ballpark. But A looks more like x40. Does the x40 apply over some limit? I can't see another obvious trigger.

Or is the high percentage of URL-only entries for "-reserved" an indication that the index is in a bad state?
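
For anyone who wants to run the same integrity check against their own counts, here is a rough throwaway sketch in Python (the figures are just the ones quoted above - substitute your own three query totals):

a = 5350000  # site:www.mydomain.org
b = 390000   # site:www.mydomain.org reserved
c = 1320000  # site:www.mydomain.org -reserved

print("B + C =", b + c)                           # 1,710,000
print("Integrity test passed:", a == b + c)       # False - A should equal B + C
print("A / (B + C) = %.1f" % (a / float(b + c)))  # how far out the 'about' figure is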

Rollo

6:18 pm on Nov 30, 2005 (gmt 0)

10+ Year Member



That's a pretty common problem with content management systems; they have a propensity to confuse the bots. I have a site where users can rate, recommend, and post comments on a page, causing Google to think one page is six distinct pages. I use a noindex,follow on these to avoid potential dupe content penalties and don't seem to have had problems (knock on wood).

BillyS

7:22 pm on Nov 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



BillyS - As your site has just over 1000 pages, try this test: choose a keyword likely to be on many but not all pages, say it's 'widget', and do "site:yoursite widget" and "site:yoursite -widget". The total of both should give the total page count, assuming each is under 1000. Hence my assertion that the 1000 limit is per search rather than per site. Assume the search cuts out at 1000 and some guess is used if the cutoff is reached.

word site:foo.com = 863
-word site:foo.com = 349
site:foo.com = 10,300

10,300 - 1,185 = 9,115
Inflation Factor = 9

secondword site:foo.com = 107
-secondword site:foo.com = 9,700

1,185 - 107 = 1,078
Inflation Factor = 9

This particular example provides pretty good evidence that passing the 1,000-page mark kicks in some kind of estimation by Google instead of an actual count.

FromRocky

8:01 pm on Nov 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Note that 1,000 is the limit on the number of results you can actually see.

BillyS

8:02 pm on Nov 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My last post might have been a bit confusing as written (and contained an addition error), so let me explain the observation. In msg #25, daveVk asked me an interesting question. Since I was close to the 1,000-page mark, he wanted me to perform a test. In the first part I did the following queries using a word I suspected would appear on roughly half of my pages:

word site:foo.com = 863
-word site:foo.com = 349

Here Google returned what I would have expected. Both values are under the 1,000 page mark and Google appears to return an actual count. Based on this information, I conclude my actual page count in Google is really 1,212 (863 + 349).

The simple site: query, however, returned:

site:foo.com = 10,300

Next I performed a test where I found a query that would split the results less evenly:

secondword site:foo.com = 107
-secondword site:foo.com = 9,700

Here you can see that the second query should have returned 1,212 – 107 = 1,105. Instead it returned a value of 9,700. This demonstrates that an estimate appears to kick in at the 1,000 mark. The “inflation factor” for my entire site appears to be:

10,300 / 1,212 = 8.5

You might also look at the inflation factor for the second query:

9,700 / 1,105 = 8.8

My guess is that Google returns an estimate to save time. But the question remains: where else does Google use this estimate? Overnight I went from 985 pages to over 10,000 and dropped in the Google SERPs. Does Google think I am spamming their search engine?
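
For anyone repeating the test, the arithmetic is simple enough to script. A minimal sketch in Python, using only the query counts from this post (purely illustrative):

word_count = 863        # word site:foo.com
not_word_count = 349    # -word site:foo.com
site_count = 10300      # site:foo.com

actual_pages = word_count + not_word_count     # 1,212 - both queries stayed under 1,000
inflation = site_count / float(actual_pages)
print("Actual pages indexed (best estimate):", actual_pages)
print("Reported site: count:", site_count)
print("Inflation factor: %.1f" % inflation)    # about 8.5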

daveVk

10:21 am on Dec 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Overnight I went from 985 pages to over 10,000 and dropped in Google SERPs

Thanks for the test results; this is in line with what I am seeing. Google's use of an estimate may affect websites of over 1000 pages, presumably affecting all sites of similar size to the same degree. What was the extent of "dropped in Google SERPs"? I'd be interested to see the effect if the total happens to drift below 1000 again.

BillyS

5:19 pm on Dec 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just got rid of some more pages, but the total count is still over 1,000 - and it grows by one page a day. Not sure I'm ever going to drop back down below 1,000 pages.

girish

6:48 pm on Dec 1, 2005 (gmt 0)

10+ Year Member



Billy - I have a similar problem and would very much like to know if reducing the actual pages below 1,000 improves ranking. My site dropped like an anchor in Jagger.

2by4

11:47 pm on Dec 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



this question was covered pretty well by steveb in the previous thread, the bug triggers when > 1000 urls [not pages, note] are indexed within a given folder.

It's totally unrelated to session ids; I've seen this bug trigger on sites without any type of session id, just static urls. Sometimes I have to wonder about Matt Cutts and his statements. This one is very easy to see: go to any site with > 1000 urls indexed in any folder and you'll see this bug get triggered. At least I haven't found any sites that don't trigger it.

I would assume that some sites might have this type of url structure: site.com/dynamic-page.php?page=1235, which would look like > 1000 urls indexed in the / folder.

But I've seen this on one hundred percent static pages. It has, as far as I can tell, no impact on rankings.

I'm somewhat inclined to believe that this is just another case of Google clamping down on its webmaster tools. If it isn't that, then it's a bug, and it should have been fixed a long time ago. If it is a bug and still isn't fixed, that's not very encouraging as far as I'm concerned; if Google can't fix a simple 'if > 1000 then the bug triggers' condition, there's not a lot of hope for the more complex issues people are reporting.

BillyS

1:05 am on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



when > 1000 urls [not pages, note] are indexed within a given folder

That's not the case with my site. None of the information you're sharing applies to my site.

steveb

1:53 am on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not sure why you are saying it doesn't apply as you seem to be saying above that it does.

After Sept 22 I 301ed a section of pages to take a site from about 1015 pages to about 950. Doing this took Google from showing 9760 pages for the site to showing the exact correct number (plus a handful of miscellaneous supplementals and URL-only duplicates dropping a trailing slash). On the 64.233.179.104 test datacenter a couple of days ago, the dramatic increase in supplementals that was briefly there pushed this site back over 1000 pages, and again it showed as 9000+ pages... and the rankings for the site were down a lot, although that obviously could be because the test datacenter was using a different algo at the time.

2by4

2:09 am on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



steveb, interesting test; that's what I've found too. So it really is exactly 1000 - that kind of figures, I guess.

I saw pretty significantly different results on those test data centers too; I'd have to agree with your guess here. If this poster would give some meaningful details about why his case is unrelated, it would help, obviously. I could see some cases; for example, if you do use session ids, you could easily trigger the > 1000 condition after enough bot visits, since the pages would be different urls each time.

From what I can see it's simply how many urls Google thinks that folder contains; it's not pages that decide if the bug activates.

Any guesses on if it's a bug or deliberate messing with webmasters, sort of like link:, allinurl: etc?

I can see why someone might think > 1000 urls indexed doesn't apply to their site if they are in fact using session ids, though; they'll only see a few files/urls and not realize that Googlebot has created a new url on each visit.

but more details would be helpful

steveb

3:31 am on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In my case those urls/supplementals that pushed me over the 1000 threshold are not "pages" and never have been pages.

Pauloogle

1:14 pm on Dec 2, 2005 (gmt 0)

10+ Year Member



I am having the same inflated count - just over 1000 articles (URLs) are in a certain directory.

Also, out of the reported 9000 indexed by Google, only about 50 are deep indexed; the rest are URL-only.

BillyS

2:35 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



2by4 and steveb:

I just replied to a message from 2by4 giving him (?) the exact example I used in my post. The numbers have changed around a bit from my example (Googlebot started a deep crawl yesterday), but it still works on the data center I'm using.

This one is very easy to see: go to any site with > 1000 urls indexed in any folder and you'll see this bug get triggered

I also apologize for my comment to 2by4, but this is the part I'm saying does not apply to me. The most pages in any one folder is ~260.

It's a bug in the calculation, no doubt. What confuses me is why some webmasters with very large sites state Google's count is accurate. I believe them. Since Google is just a machine, there must be something that triggers this inaccurate estimate. In addition, through my example (which I sent to 2by4) it's pretty clear that this bug is unrelated to the site's structure (folders); rather, the bug applies to the website itself.

nickied

3:15 pm on Dec 2, 2005 (gmt 0)

10+ Year Member



Ok on what's been posted so far.

But what about all the pages with old cache dates still showing up? i.e. last January, June, etc. Surely these are contributing to the problem?

girish

4:15 pm on Dec 2, 2005 (gmt 0)

10+ Year Member



"Steveb -- and the rankings for the site were down a lot, although that obviously could be because the test datacenter was using a different algo at the time. "

2by4 / Steve -- in your later posts you seem to say that the inflated counts are NOT impacting rankings. What is your evidence? My site and others reported in the Jagger threads seem to have been penalized by the over-counting.

FrogOfPower

4:41 pm on Dec 2, 2005 (gmt 0)

10+ Year Member



Just thought I'd add that if results over 1,000 are multiplied by a factor of 10, then there should be no sites with a page count listed between 1,000 and 10,000. My site is currently showing near 5,000 pages, so G can't be multiplying them all.

BillyS

6:02 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



FrogOfPower:

What confuses me is why some webmasters with very large sites state Google's count is accurate. I believe them. Since Google is just a machine, there must be something that triggers this inaccurate estimate.

You're exactly right with your comment; that's what I was talking about above. How come you've got 5,000 pages (a correct count) and not 50,000? Is it an indication of some kind of penalty?

PurpleHaze

8:34 pm on Dec 2, 2005 (gmt 0)

10+ Year Member



I would suggest that the mechanism behind the page count discrepancy is, more or less, staring us in the face. Whenever anyone performs a search, Google only ever snatches at most 1000 matching results from its database. These results are held in a temporary cache while Google is sorting them, paginating them, and otherwise processing them for display to the user as serps pages 1 to 100.

The results are physically counted into the cache while it is collecting them; when it's got 1000 results it stops collecting and processes the results for display. If it finds fewer than 1000 results, it knows how many there are because it has counted them into the cache. If it has to stop collecting results because the cache is full, it really has no idea how many results there are - it can't waste processing resources counting 5 million results of which only 1000 will ever be displayed - so it has to resort to a guesstimate. It is this guesstimate that is wildly out.
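
Purely to illustrate that hypothesis (this is not Google's actual code, and the x10 fallback below is just made up to mirror the factor people are reporting), the logic would be something like this in Python:

CACHE_LIMIT = 1000

def reported_count(results):
    # Toy model of the hypothesis above: count matches into a fixed-size
    # cache; if the cache fills up, stop counting and return a guesstimate.
    cache = []
    for doc in results:
        cache.append(doc)
        if len(cache) == CACHE_LIMIT:
            return CACHE_LIMIT * 10   # hypothetical cheap guess, not a real count
    return len(cache)                 # under 1000: an exact count

print(reported_count(range(349)))     # 349 - exact
print(reported_count(range(1212)))    # 10000 - wildly out, as described above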

I would also suggest that the guesstimate is only designed for normal searches, not searches of the type site:www.somesite.com. It happens to be in the user interface for normal searches, so some figure is displayed for site: searches, but it might be completely spurious. However, unlike normal searches, where the guesstimate is difficult and might be expected to be wildly out, with site: searches their database should know exactly how many urls from a particular site are indexed. I suspect their URL database is in a mess - it seems that a lot of Google databases are in a mess at the moment.

The results count is superfluous anyway. What's the point of saying "I've found 600,000 results" if it's only going to display 1,000 of them? It was never meant to be accurate. It was there to demonstrate to Joe Public how mighty Google is/was, but now it just haunts them. I predict that that figure will suddenly disappear one day, along with PR on the toolbar.

What we are seeing here, and in many of the threads on this site, is Google struggling, really struggling, with scalability (and concept) issues.

BillyS

9:04 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is this guesstimate that is wildly out.

But my concern is that this estimate is used in other calculations. That's the point I was trying to make earlier. If Google knew the exact number (if it were stored somewhere in the index), then it should return that value - because that's more efficient.

If it doesn't know the real value and is calculating it on the fly, then this calculation is flawed. And if there is a bug in Google, then where else does it appear? Am I being penalized for spamming Google? My site grew by 9,000 pages overnight according to this calculation.

2by4

10:00 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BillyS was nice enough to send me a sample url, and I can confirm that this is due to > 1000 urls indexed per folder. I went far enough back in the serps to find url-only results, where I found the offending component: an old page.php?content=12344 type url. Also other old urls. When I tested these, no 301 was in place, which means Google has simply kept these urls long enough to allow them to trigger the bug.

While the theory of 1000 pages returned and then a guess is interesting, it has one major problem: the count returned is in some proportion to the urls actually indexed, roughly by a factor of 10. So a folder with 2000 urls indexed will show as site:sample.com/folder/ = 20,000 pages. It's not always exactly a factor of 10, but it seems to be about that from what I can see.

BillyS, this might also be useful information: when was the url structure of the site redone? There's been some question about just how long Google is keeping old urls, and this would be good information. It looks to me like it may have been redone twice, once from the index.php?... stuff, and then once from the /section/ type stuff. When was each change done?

Girish, if your sites got hit in Jagger, then it's best to look at the full Jagger sequence, starting in roughly August, to detect the causes. This page count bug has been around longer than that.

Since I'm ranking very high on a site with these inflated counts, it's hard to really see them as an issue; whatever is added ranks, as a rule. This really just looks like yet another component of Google's indexing bugs, which are starting to look like they might be more serious than they are letting on. It would definitely account for the www/non-www issues as well: if Google simply will not let go of their old data, and if their current indexers have these types of bugs, there's a problem and Google isn't admitting it.

steveb

10:40 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"The most pages in any one folder is ~260."

folder = website. I don't know why 2by4 says folder.

"in your later posts you seem to say that the inflated counts are NOT impacting ratings"

Didn't say that at all. It probably does affect rankings, but I'm not going to put on a tin hat and say it absolutely does; that would be stupid.

2by4

10:44 pm on Dec 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



directory = /, the root folder. In this case, the directory containing > 1000 urls indexed is /, at least from what I can see.

So, for example: site.com/page.php?content=1234 seems to have had > 1000 urls indexed in it. Then the site urls were seoed to make them search engine friendly, but the old urls were never 301ed to the new url scheme. There appear to have been at least two of these changes - one major, from page.php?content=23432, and one minor, with the removal of one layer of directories. Neither was 301ed to the newest scheme.

In other words, you're right, still, steveb.

BillyS

12:56 am on Dec 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



2by4 -

Please tell me if you can find more than 20 of those index? urls. Doubtful you can, because those particular examples leaked out during a test. You can even look at the date on those urls; all this happened less than 2 weeks ago.

Sorry, just because you found one example of the 20 doesn't mean my entire site's been indexed that way. Look at my robots.txt file; they are blocked to prevent this from happening in the future.

Even Steveb is saying you're misunderstanding him.

Billys, this might also be useful information: when was the url structure of the site redone?

Again, sorry it's always been that way.

The Section pages were removed recently.

2by4

1:03 am on Dec 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



oh, I'm not saying this is the only possible cause, it's just the one I've seen the most often. All other possible causes are worth looking at, I just haven't seen them yet.

I've found that some posters are assuming that it's the site itself that is overcounted - that's what I assumed at first too - but when you go back to the other thread, you'll find that it was the actual folder that contained > 1000. I'm not saying there is no other cause, but that's the cause I've seen.

Sorry for not looking deeper into your site; it gets pretty boring digging through the lower levels, > 900+. You'll find it though, keep looking. What about /section/? How much of the site's content was in that? My guess is > 500 urls, and if there was no www/non-www rewrite, there could be > 1000 urls indexed for that old directory.

This appears to be a url issue, not a page issue. Like so many other bugs, I'd think this would ring steveb's bell; he's pretty fond of supplementals if I remember right, and this might just be one more manifestation of the supplemental issues.

What I found was that when faced with an overcount, once I found the folder with > 1000, I could localize the issue. However, your case is slightly different; it's a case of Google keeping old urls, and it may not be possible to determine which old folder/directory actually had the overcount applied. I went through each major directory on your site and couldn't find any new one that triggered the condition.

Keep in mind though that, as far as I can tell, this has no real impact on serp positions. Or if it does, Google would have to invent serp positions < 1 for one site I'm looking at with that overcounting bug in it.

Your site, however, is not a great textbook example since the urls were redone - or maybe it is? Hard to say for sure.

If you keep looking you'll find where the actual overcount happened, though; from what I've seen it's localized, though maybe not always.

<added> Keep in mind though, there is no argument about the overall problem: 8.5 times too many pages on certain sites.

Personally, I've ignored worrying about supplementals; my sites have mostly been 301ed to site.com or www.site.com for years, and I just don't see many problems, so I spend my time creating content etc. That seems to be a more effective strategy long term - just my 1.02 cents.

twebdonny

4:15 pm on Dec 4, 2005 (gmt 0)



Something new for the mix...
when cache doesn't = true cache

Dec 2, 2005 - Cached

but not really....

daveVk

10:45 am on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It appears that in cases where the search results include supplementals, the 'about' figure (A) is correct. The following rules seem to apply; stop at the first rule that applies.

1 - SERPs include supplementals: total (including sups.) = A.
2 - A < 1001: total (excluding sups.) = A.
3 - A < about 100,000: total (excluding sups.) = approx. A/10.
4 - For large A, the inflation factor seems to vary.

Is the inflation factor an attempt to account for supplementals?

Could the supplemental index actually be (factor - 1) times the size of the main index?

Rule 1 was observed in only two cases, one from my own site and one thanks to FrogOfPower, so this is hardly proof; further observations appreciated.

The rules seem to apply regardless of the nature of the search; that is, total = pages meeting the search criteria.
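
To make the rule ladder concrete, here is a small Python sketch of how you might apply it to a reported 'about' figure. The thresholds and the /10 factor are just the guesses above, taken from observation and not confirmed by Google:

def interpret_about_figure(a, serps_include_supplementals):
    # Rough guess at the real total from the reported 'about' figure A,
    # following the rules above (all of them observational guesses).
    if serps_include_supplementals:
        return a          # Rule 1: total including supplementals = A
    if a <= 1000:
        return a          # Rule 2: small counts appear to be exact
    if a < 100000:
        return a // 10    # Rule 3: roughly A / 10
    return None           # Rule 4: inflation factor varies, no reliable estimate

print(interpret_about_figure(863, False))     # 863 - taken at face value
print(interpret_about_figure(10300, False))   # 1030 - in the ballpark of the 1,212 found earlier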

twebdonny

2:48 pm on Dec 12, 2005 (gmt 0)



Something is afoot, at least with our site...

Page totals dropped from over 19K to just over 500 today
with backward links to the homepage actually rising.
