|Why are "Site:" command page counts inflated?|
Just curious... How many people have accurate page counts when using the "Site:" command at Google?
If you do not, how far is it off from the actual number of pages you have?
If it is off, is your site in penalty mode?
What do you think is causing your site to have an incorrect count in the number of pages?
Has anyone corrected the page count and got it to reflect the right value? What technique did you use?
(Bump) Am I the only one?
I think my largest site has an inflated count of around 3X the actual number of pages indexed. However, no site under 100 pages seems to have an inaccurate count, as far as I can tell.
Google has some trouble with old pages in the index, but that doesn't account for this anomaly.
The site: command counts all URLs associated with the site, which is not the same as all pages of that site, or all indexed URLs. Some examples of URLs which are counted in the site: command:
- URLs temporarily deleted with the URL removal tool
- URLs from other sites doing a 302 hijack of your site (should be fixed by now)
- Obsolete URLs which still have links to them from other sites and which Google visits now and then just to see if they are still active
- Links to your site with typos in them, e.g. www.yourdomain.com/fiel.html instead of www.yourdomain.com/file.html. At one time I had many copies of my sitemap in the SERPs because I used the sitemap as my 404 page. Except for the original sitemap, they have all gone supplemental, but Google still counts them.
- URLs that have been marked with "noindex,follow".
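For reference, "noindex,follow" is set with a robots meta tag in the page's head; a minimal sketch (the attribute values are the standard ones, nothing site-specific):

```html
<!-- Tells search engines not to index this page, but still to follow its links -->
<meta name="robots" content="noindex,follow">
```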
Google keeps track of many more URLs of your site, but I don't know if these are counted in the site: result. For example, if you have a 301 redirect from domain.com to www.domain.com, then Google must know that domain.com/file.html exists, but is equivalent to www.domain.com/file.html. So there has to be some database record or field somewhere with information about domain.com/file.html, but I don't know if this one inflates the number in the site: command.
Most inflated sites that I have seen have been serving both the www and non-www versions without a redirect. This is duplicate content.
Add a 301 redirect to fix that problem.
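On Apache, a minimal .htaccess sketch for that redirect (assuming mod_rewrite is enabled and www.example.com stands in for your canonical hostname):

```apache
# Permanently (301) redirect non-www requests to the www hostname
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

With this in place, a request for example.com/page.html returns a 301 pointing at www.example.com/page.html, so Google should consolidate the two versions over time.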
The 301 redirect would be the logical way to get things back in line... However, what happens when Google grabbed things once and has never updated them since 2004? If they don't revisit, they will never see the 301, so the stuff stays in the index.
Also include: pages crawled only by the Mozilla Googlebot. I can verify this on one of my domains.
My client's site is also affected by that. It shows only the URL filename, with no title or description as it had before. When will it recover?
site: is fine for me, but link: is screwed up completely