| This 70 message thread spans 3 pages |
|Google ignores the meta robots noindex tag.|
Thousands of pages show that tag in the Google cache!
How many people have noticed the many thousands of Supplemental pages with a June or July 2005 cache date that have been indexed, show in the SERPs with a full title and description, rank for searches, and carry a <meta name="robots" content="noindex"> tag both on the live page and in the old cached copy linked from the SERPs?
Oh yeah! There are thousands of them. Now that is a programming bug.
Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.
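The crawl-but-don't-index rule described above can be sketched in a few lines of Python. The HTML parsing uses only the standard library; the `may_index` function and the surrounding pipeline are illustrative assumptions, not Google's actual code:

```python
# Hypothetical sketch of the "spidered but not indexed" rule: parse a
# fetched page's robots meta tag and refuse to index it on "noindex".
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content values of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

def may_index(html: str) -> bool:
    """Return False if the page carries a robots noindex directive."""
    p = RobotsMetaParser()
    p.feed(html)
    return "noindex" not in p.directives

page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(may_index(page))  # → False: crawl it, remember it, never serve it
```

The point of the sketch is the order of operations: the page must be fetched and parsed (so the engine can remember its links and content) before the index/noindex decision is made, which is exactly the check the bug described in this thread appears to be skipping.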
However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.
Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.
It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs regardless of that status.
I first noticed this yesterday, but on searches that I had not run for several months. I have no idea how long this bug has been showing up... it could be several months.
In the two hours since I started the note above, the number of my pages Google reported knowing dropped from "about 8640" to "about 297" which is about what it should be. Now if it will just STAY that way...
G* still delivers some annoying SUPPLEMENTAL RESULTS for some searches but that does seem to be getting better.
I had a set of results last week that implied Google was ignoring noindex and treating nocache as noindex.
I wonder if it's a parser bug.
Also, when I exclude a certain portion of my site from being indexed using robots.txt, it goes supplemental.
Here is what Google Sitemap Diagnostic says about ROBOTS.TXT
|URLs restricted by robots.txt [?] |
Below are URLs we tried to crawl (found either through links from your Sitemaps file or from other pages) that we didn't crawl because they are listed in your robots.txt file. You may have specifically set up a robots.txt file to prevent us from crawling this URL. If that is the case, there's no need to fix this; we will continue to respect robots.txt for this file. If you want us to crawl these pages, make sure that your robots.txt file doesn't restrict our access.
G* doesn't say it won't display them, only that it didn't crawl them. :-(
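That distinction is worth spelling out. A robots.txt rule like the hypothetical fragment below (the /private/ path is an assumption for illustration) only refuses *crawling*; a blocked URL can still show up in results as a bare listing if other pages link to it, and the crawler can never even see a noindex meta tag on a page it is forbidden to fetch:

```
# Hypothetical robots.txt fragment of the kind described above.
# Crawling of /private/ is refused, but a URL in that folder can
# still appear in results if other pages link to it.
User-agent: *
Disallow: /private/
```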
Our page count is slowly moving down also... We have a couple of sections which are noindex, nofollow. Somehow, Google ended up with these in the index. They were all in the supplemental and dated back to around June-July 2005.
Our site has approximately 20,000 pages. Using the "site:" command, we have gone from around 260,000 pages down to about 145,000 in a week or so. Not sure what is up with this. Not sure what the other 120,000 pages are/were? It would be nice to see an accurate page count.
Yes, Supplemental Results with a 2005 June or July cache date (maybe others too) and a noindex meta tag, are shown in the index, rank for search terms related to content on that page, and have a Google cache that shows the noindex tag in place on the page.
|Our site has approximately 20,000 pages. Using the "site:" command, we have gone from around 260,000 pages down to about 145,000 in a week or so. Not sure what is up with this. |
I'm watching one site now go from 12,000,000 to 32,000,000 to 18,000,000 to 34,000,000 day in, day out. The fluctuations in page counts are a sure sign that something is broken. ;)
|Not sure what the other 120,000 pages are/were? It would be nice to see an accurate page count. |
This may not apply to you, but many dynamic sites have URI paths to the same content through many different query strings. At some point, Googlebot got smart and started generating and indexing those queries. So your page count may be 12 times what it should be simply because Googlebot has found 12 different ways to reach the same content. :(
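The duplicate-query problem above is easy to demonstrate with a small sketch (the URLs are made up). Sorting the query parameters collapses the many crawlable variants into one canonical key, which is roughly the deduplication a site owner would want the crawler to do:

```python
# Illustrative sketch: several query-string orderings, one piece of
# content. Canonicalising the query collapses them to a single key.
from urllib.parse import urlsplit, urlencode, parse_qsl

def canonical(url: str) -> str:
    """Rebuild the URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"

duplicates = [
    "http://example.com/item?id=7&sort=asc",
    "http://example.com/item?sort=asc&id=7",
]
print(len({canonical(u) for u in duplicates}))  # → 1
```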
I had a similar issue with robots commands recently.
After I disallowed G, MSN and Y access to a folder on the server, G promptly went and indexed every page within it.
I then changed the robots file, after advice received here, and put noindex or none etc. into the head of each page.
Lo and behold, the pages were recrawled and indexed, with the various commands displayed in the cached pages.
I seem to have now sorted it, however, by using a rewrite rule to serve a 403 to any robot on the list that tries to access these pages.
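A mod_rewrite rule of the kind described might look like the sketch below; the folder name and the bot list are assumptions for illustration, not the poster's actual config:

```
# Hypothetical .htaccess sketch: answer 403 Forbidden when a listed
# crawler user agent requests anything under /private/.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|msnbot|Slurp) [NC]
RewriteRule ^private/ - [F]
```

Unlike a robots.txt Disallow, a 403 is an actual server response on the URL itself, which gives the crawler something concrete to act on.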
Being tired of GooGoo showing year old SUPPLEMENTAL RESULTS in some searches, I decided to limit the search to "Within the Last Three Months" in Advanced Search. Nope! Those pesky year old Cached pages still show up.
The connection with the logic of the Supplemental Database is broken. It serves whatever it feels like when a supplemental result is available for the query.