Over the past couple of days, the number of indexed pages stood at 999 using the site:www.foo.com command. Today I noticed something interesting: the same command that returned 999 pages over the last several days now shows 9,140 pages. Quite a jump. Now I understand why everyone's been talking about inflated page counts. I was wondering if this happens only for sites in excess of 1,000 pages in Google's index. As someone who just crossed that milestone, that was my experience.
I see this a lot. There are a great many irrelevant Supplemental pages with cache dates from a year ago all over the SERPs. It's been like this for several months.
>> By the way, at least HotBot can count. Checking in Hotbot indicates that Google has exactly 1,000 pages for my website - not 9,100. <<
No idea about HotBot, but the page counts in DogPile do NOT include any supplemental results - they only include normal pages.
If you do a site: command in Google, the count returned seems to be "inflated" whenever the server the site is hosted on sends the Last-Modified date as part of the response headers.
Paste your home page into a header checker and take a looksee.
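If you don't have a header checker bookmarked, a few lines of Python will do the same job. A minimal sketch, assuming your site is at www.example.com (a placeholder):

```python
# Minimal DIY header check: request only the headers for the home page
# and report whether the server sends a Last-Modified date.
# "www.example.com" is a placeholder; substitute your own hostname.
from http.client import HTTPConnection

conn = HTTPConnection("www.example.com")
conn.request("HEAD", "/")            # HEAD fetches headers only, no body
response = conn.getresponse()
last_modified = response.getheader("Last-Modified")
conn.close()

if last_modified:
    print("Server sends Last-Modified:", last_modified)
else:
    print("No Last-Modified header sent.")
```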
Take your site and do a site: command in each of Y!, Google, and MSN.
A very good example is the Apache website.
This was mentioned by a couple of others in stickies.
I just thought you might want to know what appears to cause wildly "inflated" page counts.
There are also supplemental pages and, in the case of Windows servers, mixed-case duplicate page names, etc.
In short, it's a combination of factors.
Now what Google does with the "pages" with Last-Modified dates on them, other than add them to a counter, I haven't a clue.
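For the mixed-case duplicates, here's a rough sketch of how you might spot them, assuming you can export the list of URLs Google has indexed for your site (the sample list below is made up):

```python
# Group URLs case-insensitively; on a Windows/IIS server, /Page.html and
# /page.html serve the same file, but Google can index both as separate URLs.
from collections import defaultdict

indexed_urls = [                      # made-up sample data
    "http://www.example.com/Page.html",
    "http://www.example.com/page.html",
    "http://www.example.com/other.html",
]

groups = defaultdict(list)
for url in indexed_urls:
    groups[url.lower()].append(url)   # case-fold to find collisions

for variants in groups.values():
    if len(variants) > 1:
        print("Possible mixed-case duplicates:", variants)
```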
A dynamic site that does send Last-Modified headers, however, does appear to trigger this bug. I have enough sites done both ways to determine this without much doubt, and the apache.org site thebear mentioned is an excellent example of it.
Also, there appear to be other factors that contribute to the inflated page count; steveb notes that sites with more than 1,000 pages also seem to trigger this error, although it would be nice to know how many of those are sending Last-Modified headers as well.
So we have a new indexing system in Google, and that new system has bugs in it. From now on I'm assuming I'm looking at a new algo that was implemented around last December, which means I'm going to be spending my time learning how this new beast works. The bugs I'm seeing all point to a new system; these are not mature bugs. It's something new, and this stuff is too basic.
This will make it a bit harder to know what's causing events. For example, yesterday I saw another new bug in action: we made a tweak, the bug was revealed, something that shouldn't happen happened, and the site rose. All very interesting.
I haven't played with anything enough to say for certain under what conditions this appears; however, one static site that serves that data in the headers is also showing wildly inflated counts.
In any event, the counts are, shall we say, not 'xactly spot on?
If you forget the trailing /, then your link to www.domain.com/folder will first be redirected to domain.com/folder/ {without www!} before arriving at the required www.domain.com/folder/ page.
The intermediate step, at domain.com/folder/, will kill your listings. Luckily, this effect is very easy to see if you use Xenu LinkSleuth to check your site: when you generate the sitemap it reports double the number of pages you actually have, with half of them having a title of "301 Moved".
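If you want to see the intermediate hop directly rather than infer it from Xenu's report, here's a rough Python sketch (www.example.com stands in for your domain) that walks the redirect chain one hop at a time:

```python
# Follow a redirect chain hop by hop and print each status and URL,
# so the intermediate domain.com/folder/ (no www) step becomes visible.
# Assumes absolute Location headers, which is typical for 301s like these.
from http.client import HTTPConnection
from urllib.parse import urlsplit

url = "http://www.example.com/folder"   # note: trailing slash omitted
for _ in range(5):                      # cap the hops to avoid loops
    parts = urlsplit(url)
    conn = HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    resp = conn.getresponse()
    print(resp.status, url)
    location = resp.getheader("Location")
    conn.close()
    if resp.status in (301, 302) and location:
        url = location                  # follow to the next hop
    else:
        break
```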
To check this for yourself, type this into the Google search box:
site:www.yourdomain.tld/folder/
All my folders had the correct number of pages indexed apart from one - which was the only folder with 1000+ real pages. So there may be some kind of 1K problem.
Still a long way to go, but there's hope.
Concerning the 1,000-page barrier: maybe Google considers dupe content in smaller sites no problem and applies the dupe filter only to sites with more than 1,000 pages.
Where it has happened to me, there are no extra or duplicate pages. Do the site: search and click through the results (if viable to do so); in every case I find the results actually stop at an accurate page count!
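Clicking through by hand gets tedious past a few pages; a small sketch like this just prints the result-page URLs for a site: query so you can step through them and see where the results really end (example.com is a placeholder):

```python
# Build the URLs for successive pages of a site: query. Google's q, num,
# and start parameters are real; the 1,000-result ceiling is Google's own cap.
from urllib.parse import quote

query = quote("site:www.example.com")
for start in range(0, 1000, 100):
    print(f"http://www.google.com/search?q={query}&num=100&start={start}")
```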
It is simply a case of Google claiming the index is bigger than it really is - no technical mystery.
I mentioned, a while ago, in another thread that this was such an obvious ruse to see through that it may well backfire in bad publicity. Now we see the index size claim has gone from G's homepage. Perhaps this is a vindication?
[Have a site with about 12,000 pages. Google for 4-5 months reported 24,000 and was sending visitors. Beginning of this month it thinks we have 72,000 and stopped sending visitors.]
My Statement
I have a site with over 100,000 (1 lakh) pages, yet until last month Google had spidered only around 22,500 of them, and visitors from Google numbered around 2,500 to 3,000 per day. Now the GREAT NEWS: recently Google spidered our site again in depth, and we have more than 60,000 pages spidered in Google. But the BAD NEWS: for the last 2 weeks we have been able to receive no more than 400 Google visitors per day. What do you all have to say about that? Does Google spider more pages and then stop sending visitors? Funny, huh... GGGGRRRRRR
Results 1 - 10 of about 15,300,000 from dmoz.org
Anyone from the ODP who cares to comment?
Maybe Google is cleaning up its collection of oddball pages and that is what we are seeing.
Half a million category pages, each with a link to the "suggest a site" page, the "update listing" page, the "category description" page, the "category edit" page, the "apply to be an editor" page, the "edit description" page, the "report abuse" page, etc, makes about 4 to 5 million "real" pages. Then there are less than half a million other pages (mostly informational) on this site, and other dmoz.org sub-domains. The true total should be well under 6 million.
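A quick back-of-the-envelope check of that estimate (the per-category figure of eight or so utility pages is a rough reading of the list above, not an exact count):

```python
categories = 500_000        # "half a million category pages"
utility_per_category = 8    # suggest/update/edit/abuse/etc. pages, roughly
other_pages = 500_000       # informational pages and other sub-domains

total = categories * (1 + utility_per_category) + other_pages
print(total)                # 5,000,000 -- "well under 6 million", as stated
```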
Notice that site:www.dmoz.org returns zero results. A year ago that result showed several million of the "302 hijack" URLs. Google has since filtered them out of the results.
For site:dmoz.org I guess there are still millions more 302 hijack pages - Google are not showing them, but they are still counting them.
Even for a term that no one optimizes for (my name), I continue to sink in Jagger2. At the same time, my site (which has around 1,025 pages now) continues to rise in page count using the site:www.sitename.com query. Right now the count stands at 9,700 pages.
Is anyone else suffering from a recent run up in page numbers (as reported by Google) also sinking in Jagger2?
As mentioned, I'm afraid Google is using this incorrect count in their calculations and that makes it look like the site has grown from 1,000 pages to nearly 10,000 overnight.