Forum Moderators: Robert Charlton & goodroi
Oh yeah! There are thousands of them. Now that is a programming bug.
Obviously, when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered but not indexed. But if Google kept no data about that URL, they would never "remember" anything about it, and would return every few minutes to "discover" the content again and again.
So, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, where the page links to, and what the index/noindex status of the page is.
Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.
It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.
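For what it's worth, the check that seems to be getting skipped is not a complicated one. Here is a minimal sketch in Python of reading the robots meta tag from a stored copy of a page. `RobotsMetaParser` and `is_indexable` are names made up for illustration; this is not how Google does it, just the general idea.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_indexable(html):
    """Return False if the page asks not to be indexed, True otherwise."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & parser.directives)
```

A results pipeline that ran something like `is_indexable()` against its stored copy before showing a URL would drop these pages, Supplemental or not.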
I first noticed this yesterday, but on some searches that I had not run for several months, so I have no idea how long this bug has been showing up. It could be several months.
G* still delivers some annoying SUPPLEMENTAL RESULTS for some searches but that does seem to be getting better.
</Added>
URLs restricted by robots.txt
Below are URLs we tried to crawl (found either through links from your Sitemaps file or from other pages) that we didn't crawl because they are listed in your robots.txt file. You may have specifically set up a robots.txt file to prevent us from crawling this URL. If that is the case, there's no need to fix this; we will continue to respect robots.txt for this file. If you want us to crawl these pages, make sure that your robots.txt file doesn't restrict our access.
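If you want to check for yourself which URLs your robots.txt restricts, Python's standard `urllib.robotparser` can do it offline. This is only a sketch: the rules and the `example.com` URLs are placeholders, and `blocked_urls` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_lines, user_agent, urls):
    """Return the URLs that the given robots.txt lines disallow for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return [url for url in urls if not rules.can_fetch(user_agent, url)]


# Placeholder rules and URLs, for illustration only.
robots_txt = [
    "User-agent: *",
    "Disallow: /private/",
]
candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/page.html",
]
restricted = blocked_urls(robots_txt, "Googlebot", candidates)
```

Keep in mind this only tells you what a well-behaved crawler should skip; it says nothing about what actually turns up in the index.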
Our site has approximately 20,000 pages. Using the "site:" command, we have gone from around 260,000 pages down to about 145,000 in a week or so. Not sure what is up with this. Not sure what the other 120,000 pages are/were? It would be nice to see an accurate page count.
Thanks.
Our site has approximately 20,000 pages. Using the "site:" command, we have gone from around 260,000 pages down to about 145,000 in a week or so. Not sure what is up with this.
I'm watching one site now go from 12,000,000 to 32,000,000 to 18,000,000 to 34,000,000 day in, day out. The fluctuations in page counts are a sure sign that something is broken. ;)
Not sure what the other 120,000 pages are/were? It would be nice to see an accurate page count.
This may not apply to you, but many dynamic sites have URI paths to the same content through many different queries. At some point, Googlebot got smart and started generating and indexing those queries. So if your page count is at some point 12 times what it should be, it may be that Googlebot has found 12 different ways to reach the same content. :(
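One rough way to see whether this is happening is to collapse the URLs from your logs down to a canonical form and count how many variants map to each one. A sketch in Python, assuming that parameters like `sessionid`, `sort` and `ref` do not change the content on your site; that list is hypothetical and you would need to substitute your own.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the content served.
IGNORED_PARAMS = frozenset({"sessionid", "sort", "ref"})


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters, sort the rest, and strip the fragment."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    params.sort()
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(params), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the list of crawled variants that reach it."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return dict(groups)
```

If `group_duplicates()` shows a dozen variants per canonical URL, that would go a long way towards explaining an inflated "site:" count.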
After I had disallowed G, MSN and Y access to a folder on the server, G promptly went and indexed every page within it.
I then changed the robots file, after advice received here, and put noindex or none etc. into the head of each page.
Lo and behold, the pages were recrawled and indexed, with the various commands displayed in the cached pages.
Seem to have now sorted it, however, by using rewrite to show a 403 to any robot in the list that tries to access these pages.
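For anyone whose pages are served by a Python application rather than through server rewrite rules, roughly the same idea can be sketched as WSGI middleware. To be clear, this is a substitute technique and not the rewrite setup described above; the user-agent substrings and the `/private/` prefix are assumptions, and `block_robots` is a made-up name.

```python
# Assumed user-agent substrings for the robots to refuse (matched case-insensitively).
BLOCKED_AGENTS = ("googlebot", "msnbot", "slurp")
# Hypothetical folder to protect.
PROTECTED_PREFIX = "/private/"


def block_robots(app, agents=BLOCKED_AGENTS, prefix=PROTECTED_PREFIX):
    """Wrap a WSGI app so listed robots get a 403 for paths under prefix."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if path.startswith(prefix) and any(name in agent for name in agents):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everyone else, and every other path, goes through untouched.
        return app(environ, start_response)

    return middleware
```

Matching on the user-agent string only stops robots that identify themselves honestly, which the big three do.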