Google Page Count More Than the number of Pages in the Site

zeus

12:21 pm on Jun 16, 2005 (gmt 0)

Many times a day I'm watching the Google SERPs to see if more of my site has been indexed after the hijacking and the Google bug. The last 1-2 I have seen more pages indexed, mostly as supplemental results.

Now every time I push the search button, more pages are listed; "it says" many times the number of pages that really exist.

nancyb

11:09 pm on Jun 16, 2005 (gmt 0)

This just happened to my site as well. In fact, G is now showing more than 3 times the total number of pages I have. I checked the 1000 pages that G lists for site:mydomain.com and they are all correct, so I have not a clue what the additional pages are. Very strange and a little bit worrisome.

sit2510

10:08 am on Jun 17, 2005 (gmt 0)

I had a quick look at some of our sites. It appears that the smaller the website, the more accurately Google reports the number of pages, especially for sites below 100 or 1,000 pages (websites without any dynamic URLs). The inflation appears to occur when the site is big, with thousands of pages. My smallest medium-size website contains about 8,000 pages, and Google reports over 27,600 pages. The bigger ones show the same symptom, i.e. 3-4 times the actual number of pages.

I don't have any idea why the page counts are inflated. Phantom pages are impossible because we don't have those pages on our sites. I also don't think it is a Google bug.

The only possibility I can think of is that we interpret the number differently from what Google actually means. For example, with site:mydomain.com, Google reports: Results 1 - 10 of about 27,600 from mydomain for (0.13 seconds). We interpret that as meaning we have 27,600 pages on mydomain.com, but actually we have only about 8,000 pages. In our view, this is not correct. But for Google, it means that there are 27,600 pages from mydomain.com in the Google index. For Google, it is correct!

If this is true, then it means that Google has 3 or 4 copies of a given page of the big website in its index (but in different databases). Perhaps these copies were fetched by different Googlebots and they are "unprocessed" and kept in a queue for iteration. Only after the copies are processed is a single copy kept.

For smaller websites (fewer than 100 or 1,000 pages), Google can complete its processing or iteration of the different copies quite easily, so it reports the actual number of pages quite accurately. For larger ones it is different: Google never completes its iteration of every copy, which is why we always see the inflated page count that we are seeing now.

Smashing Young Man

12:21 pm on Jun 17, 2005 (gmt 0)

If you run forums, Google will often index several different versions of the same thread: "view single post", "print thread", "reply to thread" and the archived version if you are using vBulletin.
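
To see how much of a forum's count comes from these thread variants, one rough approach is to group the listed URLs by thread id. This is only a sketch: the URL shapes below (showthread.php, printthread.php, newreply.php, archive/index.php/t-N.html) and the example.com host are assumptions loosely modelled on vBulletin, so adjust them to whatever your own forum produces. "View single post" URLs are left out because they carry a post id rather than a thread id.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Assumed archive URL shape; change the pattern to match your own forum.
ARCHIVE_RE = re.compile(r"/archive/index\.php/t-(\d+)\.html$")

def thread_id(url):
    """Return the thread id a URL refers to, or None if it cannot be determined."""
    parts = urlparse(url)
    match = ARCHIVE_RE.search(parts.path)
    if match:
        return match.group(1)
    # Assumed convention: the thread id is passed as the "t" query parameter.
    values = parse_qs(parts.query).get("t")
    return values[0] if values else None

def group_by_thread(urls):
    """Map each thread id to the list of URLs that point at that thread."""
    groups = defaultdict(list)
    for url in urls:
        groups[thread_id(url)].append(url)
    return dict(groups)

# Hypothetical sample of URLs as they might appear in a site: listing.
urls = [
    "http://www.example.com/showthread.php?t=42",
    "http://www.example.com/printthread.php?t=42",
    "http://www.example.com/newreply.php?do=newreply&t=42",
    "http://www.example.com/archive/index.php/t-42.html",
    "http://www.example.com/showthread.php?t=43",
]
groups = group_by_thread(urls)
print(len(urls), "indexed URLs,", len(groups), "actual threads")
```

If most threads show up under three or four URLs each, that alone would account for a count several times the real size.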

Dayo_UK

12:36 pm on Jun 17, 2005 (gmt 0)

Yes - I have this too - and it goes back to around November.

Google says I have about four times the number of pages I actually have. (This includes a lot of link-only listings for URLs that are denied in robots.txt.)

nancyb

8:12 pm on Jun 17, 2005 (gmt 0)

My site isn't a forum, just a plain ecom site with plain vanilla HTML. It is under 2000 pages. I have always used absolute links to www.mydomain.com.

site:mydomain and site:www.mydomain return the same number of pages.

Prior to this past update, the total showed slightly fewer than the actual number of pages, which I thought was due to some not being indexed for various, and valid, reasons (little textual content and/or noindex tags).

Using Copyscape I have found hundreds of scraper sites with snippets of my code, plus a number of eBay sellers that copy some content and/or images, but not a single page that was copied in total lately.

The site has been around since 1999 and there have been many changes to file names over the years. These were either allowed to go 404 or are 301'd in .htaccess to the new file name.

However, since Y! can't seem to understand 301s, they still have some old file names (from as far back as 2001!) listed in their index, and these are sometimes picked up by the smaller engines/directories.

What I've noticed over the last couple of months is that G'bot has repeatedly been trying to crawl many of those old 404/301'd files.

I wonder if that has something to do with the discrepancy in the total number of pages, although there were never that many renamed files.

g1smd

10:24 pm on Jun 17, 2005 (gmt 0)

Some of that inflation is caused by Google knowing about both the www and non-www version of your pages. Add a 301 redirect to fix that; allow many months for the actual numbers to change.

Some of it is over-estimation of the number of pages. I have a site where page 1 of the SERPs says 1 to 10 of 615, but page 2 says 11 to 20 of 455, and when you get to the end you only get as far as 271 to 278 of 278.

Some of it is Google knowing about sites that redirect to you using a 302 redirect, and that have hijacked, or are attempting to hijack, your listings. Google is including their URLs as being a part of your own site. Google has tried to hide these from the search results, but the data is still there in their database.

Many sites are showing an inflation of double to quadruple the real size of the site.
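
As a rough way to estimate how much of the inflation comes from www/non-www (and http/https) duplicates, you can collapse the listed URLs to one key per page and compare the counts. This is only a sketch under assumptions: the example.com host, the `canonical_key` helper, and the sample URLs are all made up for illustration, and it says nothing about how Google itself counts.

```python
from urllib.parse import urlsplit

def canonical_key(url, preferred_host="www.example.com"):
    """Collapse http/https and www/non-www variants of a URL into one key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Work out the bare (non-www) form of the preferred host.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host
    # Treat an empty path as the root page; the scheme is dropped on purpose.
    path = parts.path or "/"
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key

# Hypothetical URLs as they might appear in a site: listing.
listed = [
    "http://example.com/widgets.html",
    "http://www.example.com/widgets.html",
    "https://www.example.com/widgets.html",
    "http://www.example.com/",
    "http://example.com",
]
unique = {canonical_key(u) for u in listed}
print(f"{len(listed)} listed URLs, {len(unique)} distinct pages")
```

A large gap between the two numbers suggests a redirect to one hostname is missing or not yet picked up. URLs on other domains are left untouched, so 302 hijack entries would still count separately.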

zeus

10:52 pm on Jun 17, 2005 (gmt 0)

My situation was different. When I made the first search for site:mydomain it gave me the right number; then I went to the next pages and the number of pages rose by 10%. From there, every time I moved to the next page it showed more pages.

I know Google only has 40% of our site included, so about 1,000 pages, but it said 16,400 pages, and every time I pushed the search button the number increased.

Today it's normal again.

What you are talking about is that Google indexes everything from a site, even if it isn't a page, just a pop-up, ad management software and so on; sometimes it also includes what Google thinks are duplicate pages on other domains.

nancyb

12:07 am on Jun 18, 2005 (gmt 0)

Thanks, g1smd, for the explanation of possibilities ;) I've used the redirect for non-www to www for several years though.

I never noticed the difference in the number of pages on succeeding SERPs before, but it only shows a difference of 10 out of over 3000 pages for this site.

I also never found any results for a listing of mydomain.com that showed otherdomain.com when hovering over the link, but since this site didn't appear to be affected by the 302 "bug", I never spent a lot of time looking.

Hope it's just some new(er) anomaly and I don't have to worry.

Vimes

4:34 am on Jun 20, 2005 (gmt 0)

Hi,

Nancyb, if you have secure pages, make sure Gbot cannot index them; you might find that there's a link out there to the root https:// page.

Vimes.