Forum Moderators: Robert Charlton & goodroi
Now every time I push the search button, more pages are listed; it reports many times the number of pages that really exist.
I have no idea why the page counts are inflated. Phantom pages are impossible because we don't have those pages on our sites. I also don't think it is a Google bug.
The only possibility I can think of is that we interpret the number differently from what Google actually means. For example, with site:mydomain.com, Google reports: Results 1 - 10 of about 27,600 from mydomain for (0.13 seconds). We interpret that as having 27,600 pages on mydomain.com, but we actually have only about 8,000 pages. In our view, this is not correct. But for Google, it means that there are 27,600 pages from mydomain.com in the Google index. For Google, it is correct!
If this is true, it means that Google has 3 or 4 copies of any given page of a big website in its index (but in different databases). Perhaps these copies were fetched by different Googlebots and are still "unprocessed", waiting in a queue for a later iteration. Once the copies are processed, only 1 copy is kept.
For a smaller website (fewer than 100 or 1,000 pages), Google can finish processing the different copies quite easily, so it reports the actual number of pages fairly accurately. For larger ones it is different: Google never finishes iterating over every copy, which is why we keep seeing the inflated page counts we are seeing now.
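For what it's worth, the arithmetic behind the "3 or 4 copies" guess can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 8,000 figure is the poster's own estimate of the real page count.

```python
import re

def reported_total(results_line):
    """Pull the 'of about N' figure out of a results line."""
    match = re.search(r"of about ([\d,]+)", results_line)
    if match is None:
        raise ValueError("no result count found in line")
    return int(match.group(1).replace(",", ""))

def inflation_factor(reported, actual):
    """How many times larger the reported count is than the real one."""
    return reported / actual

line = "Results 1 - 10 of about 27,600 from mydomain for (0.13 seconds)"
reported = reported_total(line)
factor = inflation_factor(reported, 8000)  # 8000 = pages we believe we have
print(f"{reported} reported / 8000 actual = {factor:.2f}x")
```

A factor of roughly 3.45 is at least consistent with the idea of 3 or 4 copies per page, though it does not prove that is what Google is doing.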
Google says I have about four times the number of pages I actually have. (The count includes a lot of link-only listings for URLs that are disallowed in robots.txt.)
site:mydomain and site:www.mydomain return the same number of pages.
Prior to this past update, the total showed slightly fewer than the actual number of pages, which I thought was due to some pages not being indexed for various, and valid, reasons (little textual content and/or noindex tags).
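If you want to see how much of the extra count could come from robots.txt-blocked URLs, one rough way is to split a list of listed URLs with Python's standard `urllib.robotparser`. The robots.txt rules and URLs below are invented placeholders, not anyone's real file, and the helper name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

def split_by_robots(urls, robots_txt, user_agent="Googlebot"):
    """Return (allowed, blocked) lists according to the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, blocked = [], []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)
        else:
            blocked.append(url)
    return allowed, blocked

listed = [
    "http://www.mydomain.com/index.html",
    "http://www.mydomain.com/cgi-bin/search.cgi",
    "http://www.mydomain.com/private/notes.html",
]
allowed, blocked = split_by_robots(listed, ROBOTS_TXT)
print(len(allowed), "crawlable,", len(blocked), "blocked by robots.txt")
```

Subtracting the blocked URLs from the reported total gives a rough idea of how much of the gap is left to explain.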
Using Copyscape I have found hundreds of scraper sites with snippets of my code, plus a number of eBay sellers that copy some content and/or images, but not a single page that was copied in full lately.
The site has been around since 1999, and there have been many changes to file names over the years. These were either allowed to go 404 or are 301'd in .htaccess to the new file name.
However, since Y! can't seem to understand 301s, they still have some old file names (from as far back as 2001!) listed in their index, and these are sometimes picked up by the smaller engines/directories.
What I've noticed over the last couple of months is that g'bot has repeatedly been trying to crawl many of those old 404/301'd files.
I wonder if that has something to do with the discrepancy in the total number of pages - although there were never that many renamed files.
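To confirm what those old file names actually return to a crawler, a small script can request each one without following redirects and report the raw status. This is a minimal sketch using Python's standard `http.client`; the helper names are hypothetical and the URL in the usage note is a placeholder.

```python
import http.client
from urllib.parse import urlsplit

def head_status(url, timeout=10):
    """Send one HEAD request; return (status, Location) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def describe(status):
    """Map a status code to a short label for the report."""
    if status == 200:
        return "live page"
    if status == 301:
        return "permanent redirect"
    if status == 302:
        return "temporary redirect"
    if status in (404, 410):
        return "not found / gone"
    return "other"
```

Usage would look something like `status, location = head_status("http://www.mydomain.com/old-file.html")` followed by `print(status, describe(status), location)`. An old file that answers 200 or 302 instead of the expected 301 or 404 would be a candidate for the extra pages in the count.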
Some of it is over-estimation of the number of pages. I have a site where page 1 of the SERPs says 1 to 10 of 615, but page 2 says 11 to 20 of 455, and when you get to the end you only get as far as 271 to 278 of 278.
Some of it is Google knowing about sites that redirect to you using a 302 redirect and have hijacked, or are attempting to hijack, your listings. Google is including their URLs as being part of your own site. Google has tried to hide these from the search results, but the data is still there in their database.
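If you have already collected a list of outside URLs and what they return (from logs or manual checks, for instance), a short filter can pick out the ones that 302 to your domain. This is only a sketch: the observation tuples below are invented sample data, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

def host_of(url):
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()

def same_site(host, my_domain):
    """True if host is my_domain itself or one of its subdomains."""
    return host == my_domain or host.endswith("." + my_domain)

def foreign_302_sources(observations, my_domain):
    """observations: iterable of (source_url, status, location_header)."""
    suspects = []
    for source_url, status, location in observations:
        if status != 302 or not location:
            continue  # only temporary redirects are of interest here
        if same_site(host_of(source_url), my_domain):
            continue  # our own internal redirects are not hijacks
        if same_site(host_of(location), my_domain):
            suspects.append(source_url)
    return suspects

# Invented sample data, for illustration only.
observations = [
    ("http://otherdomain.com/go?id=1", 302, "http://www.mydomain.com/page.html"),
    ("http://otherdomain.com/about", 200, None),
    ("http://mydomain.com/old.html", 302, "http://www.mydomain.com/new.html"),
    ("http://example.org/out", 301, "http://www.mydomain.com/"),
]
suspects = foreign_302_sources(observations, "mydomain.com")
print(suspects)
```

Only the first sample entry is flagged; whether any such URL is actually being counted under your site: results is something only Google's data can confirm.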
Many sites are showing an inflation of double to quadruple the real size of the site.
I know Google only has 40% of our site included, so about 1,000 pages, but it said 16,400 pages, and every time I pushed the search button the number increased.
Today it's normal again.
What you are talking about is that Google indexed everything from a site, even things that aren't really pages, such as a pop-up, ad management software, and so on. Sometimes it also includes what Google thinks are duplicate sites on other domains.
I never noticed the difference in the number of pages on succeeding SERPs before, but it only shows a difference of 10 out of over 3,000 pages for this site.
I also never found any results for a listing of mydomain.com that showed otherdomain.com when hovering over the link, but since this site didn't appear to be affected by the 302 "bug", I never spent a lot of time looking.
Hope it's just some new(er) anomaly and I don't have to worry.