Forum Moderators: Robert Charlton & goodroi
The topic was covered, though only briefly, in an August thread: [webmasterworld.com...]
On my small site I figure around 1,000 pages; G thinks around 10k. The yourcache tool shows the problem starting around July 4. The correct count came back around July 10, then took off again on July 16 and has now climbed past 10k.
I do have a 301 redirect in place which took care of the non-www problem. I also see at least 4 different cache dates going back to around last January.
In the old cache I've got old pages which I can't get rid of (using the console). Long story. My programmer will attempt to fix these through .htaccess, returning a 404.
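For anyone who wants to double-check that kind of fix, a rough sketch along these lines (the domain and page names are placeholders, and it assumes the Python requests library) shows what status codes are actually coming back:

```python
# Rough check of the 301 / 404 fixes; example.com and the page names are placeholders.
import requests

def check(url, expected_status):
    # Don't follow redirects, so we see the first status a bot would see.
    r = requests.get(url, allow_redirects=False, timeout=10)
    print(url, "->", r.status_code, r.headers.get("Location", ""))
    return r.status_code == expected_status

# The non-www host should answer with a 301 pointing at the www version.
check("http://example.com/", 301)

# Old pages that were cleaned out should now come back as 404 (or 410).
check("http://www.example.com/some-old-page.html", 404)
```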
Surely this must be as important as the Jagger threads, given the possible dup content?
Comments?
I also posted on this topic back in late September:
[webmasterworld.com...]
Right now, Google says I have 9,790 pages. The site only has about 1,100 pages. Even at this inflated number I get this:
"In order to show you the most relevant results, we have omitted some entries very similar to the 995 already displayed."
Around this same time my rankings in Google started to drop considerably and have not regained any ground in Jagger 3. I suspect this incorrect page count bug is causing a penalty. In other words, if Google thinks I've added 8,000 pages in the last two months (actually, this problem appeared overnight once the site hit around 1,000 pages), they might be suspicious of the site.
Not a good thing, almost like a return to the sandbox. I've done all the suggested fixes (301, 404...) but in my case I think there is a bug in Google's algo when a site hits 1,000 pages indexed.
>I also posted on this topic back in late September:
yes, saw it, forgot link.
I'm using the yourcache tool. I've run it every day since spring, so it has a history. Right now I'm at -1. This happens every so often; I don't see a pattern. Then it comes back to an inflated number. It just added 200 more, bringing it to 10,600.
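For anyone who wants to keep their own history of these counts rather than rely on the tool, a bare-bones sketch (the file name is a placeholder; the count is simply whatever number you read off that day):

```python
# Minimal sketch: append today's reported site: count to a CSV so a history
# builds up over time, similar to what the yourcache tool keeps.
import csv
from datetime import date

def log_count(reported_count, path="site_count_history.csv"):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), reported_count])

# Example: Google showed 10,600 today.
log_count(10600)
```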
I'm at around 1,000 real pages, pretty sure just under 1,000, so I'm not sure your theory about crossing 1,000 is part of the problem. Dup content probably is, though.
I've got many cache dates going back to January.
I'm surprised there isn't more comment on this.
Thanks.
Never an answer from G either about this.
I'm wondering if a sandbox-like effect (penalty) is hitting me because of the jump in pages. As mentioned earlier, I would think Google's calculations are using the inflated number.
>>I was just about to ask the same question - Does having a sitemap help.<<
A little off topic from the inflated counts, but this was my experience with the XML sitemap.
I could not get the non-www URLs to go away. I'd had the 301 in place since last December/January, and the non-www pages just would not drop out for months.
When the sitemap program was introduced (last spring/summer?), I made a sitemap. Shortly thereafter, the non-www URLs went away. Coincidence? I don't know. I haven't seen any other benefit from it.
My count via the generator was 1,028 pages. This morning I jumped from 9,800 pages to 10,200. I'll have to wait and see if the sitemap helps at all.
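For what it's worth, the sitemap file itself is simple enough to roll by hand if a generator ever chokes; a bare-bones sketch following the sitemaps.org format (the URLs below are placeholders, not anyone's real pages):

```python
# Bare-bones XML sitemap writer following the sitemaps.org protocol.
# The URL list is a placeholder; a real site would walk its own page list.
from datetime import date

def write_sitemap(urls, path="sitemap.xml"):
    today = date.today().isoformat()
    with open(path, "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{url}</loc><lastmod>{today}</lastmod></url>\n")
        f.write("</urlset>\n")

write_sitemap([
    "http://www.example.com/",
    "http://www.example.com/glossary/term-a.html",
])
```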
I am definitely suffering some kind of penalty - whether it's related to the inflated page count or some kind of duplicate content, I'm not sure.
Oh well, there is always next year.
The weird thing is that only our interior pages got penalized. Our homepage still ranks fairly well. I hope Google fixes this soon!
About 90% of the pages are glossary pages with one or two unique sentences per page. Previously the definitions were surrounded by a long templated nav bar. Two weeks ago I changed these pages (the change has not yet improved rankings) by eliminating the common site-wide left nav bar, with its 40+ keyword-rich internal links, and the common footer with its additional internal links, substituting a breadcrumb nav instead. My thinking is that the identical (duplicate) content of the old nav bar on every glossary page overwhelmed the "unique" definition on each glossary page, possibly tripping a dup filter, since Googlebot may have seen these 2,000+ pages as largely identical (except for one sentence per page). However, I have not seen any benefit from this change yet.
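A rough way to sanity-check that theory is to measure how much of two glossary pages is literally the same text once the template is counted. A sketch along these lines (the file names are placeholders for two saved pages):

```python
# Rough check of how "duplicate" two glossary pages look when the shared
# template text is included. File names are placeholders for saved copies.
import difflib

def shared_ratio(path_a, path_b):
    a = open(path_a, encoding="utf-8").read()
    b = open(path_b, encoding="utf-8").read()
    # A ratio near 1.0 means the pages are almost entirely identical text,
    # i.e. the template dwarfs the one-sentence definition on each page.
    return difflib.SequenceMatcher(None, a, b).ratio()

print(shared_ratio("glossary_term_a.html", "glossary_term_b.html"))
```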
As far as I am aware, this has nothing to do with penalties or aspects of your site.
Let's think this one through...
My site has 1,020 pages, yet Google's site:www.domain.com command shows 10,400. So why the difference? Let's walk through some scenarios:
1. - Google intentionally gives the wrong count for this command. This could be a deceptive move to inflate the index's size. Not sure what other motive they would have to purposefully show the wrong number. I don't subscribe to this explanation.
2. - When a website reaches a certain size (1,000 pages), Google switches from an actual count to an estimate. But the estimating algo has significant problems (we all agree).
If Google had a more accurate count, they would likely show that number. So my assumption is that they also use this incorrect value (10x too high) in downstream calculations. One technique used by spammers is to use a database to generate lots of content. Let's say that too much content added too fast trips a spam penalty. Guess what happens to sites like mine? Back to the sandbox...
I actually posted the question here, many months ago, on the day it first happened to me; my initial concern was a pending duplicate content problem.
Initially it was about 2x inflation and just kept going up.
In a more recent thread I posted about how potentially embarrassing it could be for G were it more widely known that they were inflating their index claims. Not long after, they removed the index page count from their homepage. Clearly someone at G had similar concerns.
First, you should know that the absolute maximum number of pages available to spiders on my site is 132,000. These are all static pages. The file-naming conventions have been stable for three years now. The site has been online for ten years. The updates from one year to the next are less than seven percent of the total.
Using sampling techniques with a special cross-sectional keyword, so that the reported count stays below 1,000 (all numbers over 1,000 are worthless because they are unverifiable), I estimate that Google's actual coverage of my 132,000 available pages is 66 percent. That would make a realistic number for Google to report for the entire site closer to 87,000 than to 132,000.
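For anyone who wants to reproduce that kind of estimate, the arithmetic looks roughly like this (the sample figures below are made up purely for illustration, not my actual queries):

```python
# Sketch of the sampling arithmetic: for several narrow keyword queries,
# each restricted so the reported count stays under 1,000 and is verifiable,
# compare how many of the pages known to contain that keyword actually appear.
samples = [
    # (pages on the site containing the keyword, pages Google returns for site: + keyword)
    (450, 300),
    (800, 540),
    (620, 400),
]

covered = sum(found for _, found in samples)
available = sum(have for have, _ in samples)
coverage = covered / available

print(f"estimated coverage: {coverage:.0%}")                   # roughly 66% with these made-up samples
print(f"realistic index size: {int(132_000 * coverage):,}")    # ~87,000 of the 132,000 pages
```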
Now look at the numbers Google reports. I'm not kidding here, and this has been going on for around nine months or more. This is a nonprofit, noncommercial, educational site.
site:www.mydomain.org 5,350,000
site:www.mydomain.org reserved 390,000
site:www.mydomain.org -reserved 1,320,000
Forty times the absolute maximum possible count. Can anyone beat this? If you corrected Google's market cap for the same rate of inflation, then instead of $126 billion it would be closer to $3 billion. I think that's a rather good way to estimate Google's actual worth.
>> One way to compare the URL-only to the fully indexed pages is to use the "site:" command in conjunction with a word or phrase that is present on every page. Something like the word "reserved" <<
Not sure I understand this fully but by doing the above this is what I get:
site:mysite.com reserved 741
site:mysite.com -reserved 1,310
The -reserved search gives me URL-only listings and also supplementals with a cache. Checking the cache, the dates on the supplementals go back at least to last January, and some run up to this November.
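Putting those two numbers side by side with the plain site: count from earlier in the thread (treat this as a back-of-the-envelope reading, nothing more):

```python
# Back-of-the-envelope reading of the reserved / -reserved split above.
fully_indexed = 741        # site:mysite.com reserved  -> pages whose text Google has
url_only_or_supp = 1310    # site:mysite.com -reserved -> URL-only and supplemental entries
real_pages = 1100          # rough real page count mentioned earlier
plain_site_count = 9790    # what the bare site: command was reporting

print(fully_indexed + url_only_or_supp)   # 2,051 entries between the two queries
print(plain_site_count)                   # vs. 9,790 from the plain site: query
```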
This has been discussed before, but any further comments on the old cache dates?
So what's the problem with a session id, and why doesn't Googlebot crawl them? Well, we don't just have one machine for crawling. Instead, there are lots of bot machines fetching pages in parallel. For a really large site, it's easily possible to have many different machines at Google fetch a page from that site. The problem is that the web server would serve up a different session id to each machine! That means that you'd get the exact same page multiple times--only the URL would be different. It's things like that which keep some search engines from crawling dynamic pages, and especially pages with session ids.
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page. ... Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.
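The practical fix is to stop handing bots tracked URLs in the first place, but for normalizing existing URLs, something like this rough sketch could work (the parameter names are just common examples, not an official list from the guidelines):

```python
# Rough sketch: normalize a URL by dropping session/tracking parameters so
# the same page no longer shows up under many different URLs.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

TRACKING_PARAMS = {"id", "sid", "sessionid", "phpsessid", "jsessionid"}

def canonical(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two fetches of the "same" page with different session IDs collapse to one URL.
print(canonical("http://www.example.com/page?cat=5&sid=ABC123"))
print(canonical("http://www.example.com/page?cat=5&sid=XYZ789"))
# both -> http://www.example.com/page?cat=5
```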