Forum Moderators: Robert Charlton & goodroi
The topic was covered, though only briefly, in an August thread: [webmasterworld.com...]
On my small site I figure around 1,000 pages; G thinks around 10k. The yourcache tool shows the problem starting around July 4. The correct count came back around July 10, then took off again on July 16 and has now climbed past 10k.
I do have a 301 redirect in place which took care of the non-www problem. I also see at least 4 different cache dates going back to around last January.
In the old cache I've got old pages which I can't get rid of (using the console). Long story. My programmer will attempt to fix these through .htaccess, returning a 404.
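For anyone who wants to double-check that kind of fix, a rough sketch along these lines (the domain and page names are placeholders, and it assumes the Python requests library) shows what status codes are actually coming back:

```python
# Rough check of the 301 / 404 fixes; example.com and the page names are placeholders.
import requests

def check(url, expected_status):
    # Don't follow redirects, so we see the first status a bot would see.
    r = requests.get(url, allow_redirects=False, timeout=10)
    print(url, "->", r.status_code, r.headers.get("Location", ""))
    return r.status_code == expected_status

# The non-www host should answer with a 301 pointing at the www version.
check("http://example.com/", 301)

# Old pages that were cleaned out should now come back as 404 (or 410).
check("http://www.example.com/some-old-page.html", 404)
```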
Surely this must be as important as the Jagger threads, given the possible dup content?
Comments?
I also posted on this topic back in late September:
[webmasterworld.com...]
Right now, Google says I have 9,790 pages. The site only has about 1,100 pages. Even at this inflated number I get this:
"In order to show you the most relevant results, we have omitted some entries very similar to the 995 already displayed."
Around this same time my rankings in Google started to drop considerably and have not regained any ground in Jagger 3. I suspect this incorrect page count bug is causing a penalty. In other words, if Google thinks I've added 8,000 pages in the last two months (actually, this problem appeared overnight once the site hit around 1,000 pages), they might be suspicious of the site.
Not a good thing, almost like a return to the sandbox. I've done all the suggested fixes (301, 404...) but in my case I think there is a bug in Google's algo when a site hits 1,000 pages indexed.
>I also posted on this topic back in late September:
yes, saw it, forgot link.
I'm using the yourcache tool. I've run it every day since spring, so it has a history. Right now I'm at -1. This happens every so often; I don't see a pattern. Then it comes back to an inflated number. It just added 200 more, bringing it to 10,600.
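For anyone who wants to keep their own history of these counts rather than rely on the tool, a bare-bones sketch (the file name is a placeholder; the count is simply whatever number you read off that day):

```python
# Minimal sketch: append today's reported site: count to a CSV so a history
# builds up over time, similar to what the yourcache tool keeps.
import csv
from datetime import date

def log_count(reported_count, path="site_count_history.csv"):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), reported_count])

# Example: Google showed 10,600 today.
log_count(10600)
```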
I'm at around 1,000 real pages, pretty sure just under 1,000, so I'm not sure your theory about crossing 1,000 is part of the problem. Dup content probably is, though.
I've got many cache dates going back to January.
I'm surprised there isn't more comment on this.
Thanks.
Never an answer from G either about this.
I'm wondering if a sandbox-like effect (penalty) is hitting me because of the jump in pages. As mentioned earlier, I would think Google's calculations are using the inflated number.
>>I was just about to ask the same question - Does having a sitemap help.<<
A little off topic from the inflated counts, but this was my experience with the XML sitemap.
I could not get the non-www URLs to go away. I'd had the 301 in place since last December/January, and the non-www pages just would not drop out for months.
When the sitemap program was introduced (last spring/summer?), I made a sitemap. Shortly thereafter, the non-www URLs went away. Coincidence? I don't know. I haven't seen any other benefit from it.
My count via the generator was 1,028 pages. This morning I jumped from 9,800 pages to 10,200. I'll have to wait and see if the sitemap helps at all.
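For what it's worth, the sitemap file itself is simple enough to roll by hand if a generator ever chokes; a bare-bones sketch following the sitemaps.org format (the URLs below are placeholders, not anyone's real pages):

```python
# Bare-bones XML sitemap writer following the sitemaps.org protocol.
# The URL list is a placeholder; a real site would walk its own page list.
from datetime import date

def write_sitemap(urls, path="sitemap.xml"):
    today = date.today().isoformat()
    with open(path, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{url}</loc><lastmod>{today}</lastmod></url>\n")
        f.write("</urlset>\n")

write_sitemap([
    "http://www.example.com/",
    "http://www.example.com/glossary/term-a.html",
])
```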
I am definitely suffering some kind of penalty - whether it's related to the inflated page count or some kind of duplicate content, I'm not sure.
Oh well, there is always next year.
The weird thing is that only our interior pages got penalized. Our homepage still ranks fairly well. I hope Google fixes this soon!
About 90% of the pages are glossary pages with one or two unique sentences per page. Previously the definitions were surrounded by a long templated nav bar. Two weeks ago I changed these pages (the change has not yet improved rankings) by eliminating the common site-wide left nav bar, with its 40+ keyword-rich internal links, and the common footer with its additional internal links, substituting a breadcrumb nav instead. My thinking is that the identical (duplicate) content of the old nav bar on every glossary page overwhelmed the "unique" definition on each glossary page, possibly tripping a dup filter, since Googlebot may have seen these 2,000+ pages as largely identical (except for one sentence per page). However, I have not seen any benefit from this change yet.
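A rough way to sanity-check that theory is to measure how much of two glossary pages is literally the same text once the template is counted. A sketch along these lines (the file names are placeholders for two saved pages):

```python
# Rough check of how "duplicate" two glossary pages look when the shared
# template text is included. File names are placeholders for saved copies.
import difflib

def shared_ratio(path_a, path_b):
    a = open(path_a, encoding="utf-8").read()
    b = open(path_b, encoding="utf-8").read()
    # A ratio near 1.0 means the pages are almost entirely identical text,
    # i.e. the template dwarfs the one-sentence definition on each page.
    return difflib.SequenceMatcher(None, a, b).ratio()

print(shared_ratio("glossary_term_a.html", "glossary_term_b.html"))
```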
As far as I am aware, this has nothing to do with penalties or aspects of your site.
Let's think this one through...
My site has 1,020 pages, yet Google's site:www.domain.com command shows 10,400. So why the difference? Let's walk through some scenarios:
1. - Google intentionally gives the wrong count for this command. This could be a deceptive move to inflate the index's size. Not sure what other motive they would have to purposefully show the wrong number. I don't subscribe to this explanation.
2. - When a website reaches a certain size (1,000 pages), Google switches from an actual count to an estimate. But the estimating algo has significant problems (we all agree).
If Google had a more accurate count, they would likely show that number. So my assumption is that they also use this incorrect value (10x too high) in downstream calculations. One technique used by spammers is to use a database to generate lots of content. Let's say that too much content added too fast trips a spam penalty. Guess what happens to sites like mine? Back to the sandbox...
I actually posted the question here, many months ago, on the day it first happened to me; my initial concern was a pending duplicate content problem.
Initially it was about 2x inflation and just kept going up.
In a more recent thread I posted about how potentially embarrassing it could be for G were it more widely known that they were inflating their index claims. Not long after, they removed the index page count from their homepage. Clearly someone at G had similar concerns.
First, you should know that the absolute maximum number of pages available to spiders on my site is 132,000. These are all static pages. The file-naming conventions have been stable for three years now. The site has been online for ten years. The updates from one year to the next are less than seven percent of the total.
Using sampling techniques with a special cross-sectional keyword, so that the reported count stays below 1,000 (all numbers over 1,000 are worthless because they are unverifiable), I estimate that Google's actual coverage of my 132,000 available pages is 66 percent. That would make a realistic number for Google to report for the entire site closer to 87,000 than to 132,000.
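For anyone who wants to reproduce that kind of estimate, the arithmetic looks roughly like this (the sample figures below are made up purely for illustration, not my actual queries):

```python
# Sketch of the sampling arithmetic: for several narrow keyword queries,
# each restricted so the reported count stays under 1,000 and is verifiable,
# compare how many of the pages known to contain that keyword actually appear.
samples = [
    # (pages on the site containing the keyword, pages Google returns for site: + keyword)
    (450, 300),
    (800, 540),
    (620, 400),
]

covered = sum(found for _, found in samples)
available = sum(have for have, _ in samples)
coverage = covered / available

print(f"estimated coverage: {coverage:.0%}")                   # roughly 66% with these made-up samples
print(f"realistic index size: {int(132_000 * coverage):,}")    # ~87,000 of the 132,000 pages
```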
Now look at the numbers Google reports. I'm not kidding here, and this has been going on for around nine months or more. This is a nonprofit, noncommercial, educational site.
site:www.mydomain.org 5,350,000
site:www.mydomain.org reserved 390,000
site:www.mydomain.org -reserved 1,320,000
Forty times the absolute maximum possible count. Can anyone beat this? If you corrected Google's market cap for the same rate of inflation, then instead of $126 billion it would be closer to $3 billion. I think that's a rather good way to estimate Google's actual worth.
>> One way to compare the URL-only to the fully indexed pages is to use the "site:" command in conjunction with a word or phrase that is present on every page. Something like the word "reserved" <<
Not sure I understand this fully but by doing the above this is what I get:
site:mysite.com reserved 741
site:mysite.com -reserved 1,310
The -reserved search gives me URL-only listings and also supplementals with a cache. Checking the cache, the dates on the supplementals go back at least to last January, and some run up to this November.
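Putting those two numbers side by side with the plain site: count from earlier in the thread (treat this as a back-of-the-envelope reading, nothing more):

```python
# Back-of-the-envelope reading of the reserved / -reserved split above.
fully_indexed = 741        # site:mysite.com reserved  -> pages whose text Google has
url_only_or_supp = 1310    # site:mysite.com -reserved -> URL-only and supplemental entries
real_pages = 1100          # rough real page count mentioned earlier
plain_site_count = 9790    # what the bare site: command was reporting

print(fully_indexed + url_only_or_supp)   # 2,051 entries between the two queries
print(plain_site_count)                   # vs. 9,790 from the plain site: query
```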
This has been discussed before, but any further comments on the old cache dates?
So what's the problem with a session id, and why doesn't Googlebot crawl them? Well, we don't just have one machine for crawling. Instead, there are lots of bot machines fetching pages in parallel. For a really large site, it's easily possible to have many different machines at Google fetch a page from that site. The problem is that the web server would serve up a different session id to each machine! That means that you'd get the exact same page multiple times--only the URL would be different. It's things like that which keep some search engines from crawling dynamic pages, and especially pages with session ids.
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page. ... Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.
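The practical fix is to stop handing bots tracked URLs in the first place, but for normalizing existing URLs, something like this rough sketch could work (the parameter names are just common examples, not an official list from the guidelines):

```python
# Rough sketch: normalize a URL by dropping session/tracking parameters so
# the same page no longer shows up under many different URLs.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMS = {"id", "sid", "sessionid", "phpsessid", "jsessionid"}

def canonical(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two fetches of the "same" page with different session IDs collapse to one URL.
print(canonical("http://www.example.com/page?cat=5&sid=ABC123"))
print(canonical("http://www.example.com/page?cat=5&sid=XYZ789"))
# both -> http://www.example.com/page?cat=5
```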