Oh yeah! There are thousands of them. Now that is a programming bug.
Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.
However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.
Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.
It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.
I first noticed this yesterday, but only on some searches that I had not run for several months. I have no idea how long this bug has been showing up... it could be several months.
Google has been ignoring the robots noindex meta tag. I know. I designed that site. I am the only one with FTP access and the files have not been altered since 2003, and have always had the robots noindex on them since that date. The whole site was and still is disallowed, as it is the development copy of the site. The live site is elsewhere, on some other server.
In looking further I have found many other pages on many other sites that have exactly the same problem. You can see the noindex tag on the cached copies!
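For anyone who wants to check a batch of saved pages (or cached copies) for the tag rather than eyeballing each one, here is a rough Python sketch. It uses only the standard library; the class and function names are my own, and it only looks at the generic name="robots" tag, not bot-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # content="NOINDEX, FOLLOW" -> {"noindex", "follow"}
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_noindex(html):
    """True if the page asks not to be indexed."""
    return "noindex" in robots_directives(html)
```

Feed it the HTML of each page you expect to be hidden. Any page where is_noindex() is true but which still shows up in the SERPs is an example of the problem described here.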
I wonder if that will bring a dup content penalty. Man, things are a mess. G is deindexing pages I want indexed and sticking in pages you DON'T want indexed. Cheese!
When I do a site:domain.com search I do see the pages that I should see, but when I do a site:domain.com -inurl:www search, I can see a load of www pages that are disallowed. These all have an old cache and are marked as Supplemental Results.
The strange thing is that not all of my websites seem to suffer from the ignored noindex tag problem. It only appears on one site, which has a bunch of supplemental results. The other sites have loads of noindex pages which are still hidden from the index.
Just as with the "-inurl:www bug" that g1smd discovered a week ago, this bug also seems to be a supplemental index problem. It seems Google rewrote the supplemental index software and forgot some basic functionality which has been there for years.
I was reading here about the advantages of blue-widget-parts over bluewidgetparts or blue_widget_parts as filenames. I'm starting to go through my sites and change the filenames over.
Unfortunately I'm using FrontPage (yeah I know..). FP doesn't play well with htaccess. The solution I'm using is to rename blue_widgets.htm to blue-widgets.htm etc. Then I take the original page, strip out everything but a 'this page has moved' notice, and add a meta refresh to redirect to the new page. The final part is to add a robots tag of noindex, follow.
I was hoping that Googlebot would heed the robots tag with noindex, follow. However, it now appears that it may not heed this directive.
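If there are a lot of files to convert, the stub pages could be generated rather than edited by hand. A minimal Python sketch of that idea follows. The template wording and the function names are made up for illustration, and it assumes the old and new files sit in the same folder so a relative URL works.

```python
from html import escape

# Hypothetical template for the stripped-down "moved" page.
STUB = """<html>
<head>
<title>This page has moved</title>
<meta name="robots" content="noindex, follow">
<meta http-equiv="refresh" content="0; url={target}">
</head>
<body>
<p>This page has moved to <a href="{target}">{target}</a>.</p>
</body>
</html>
"""


def new_name(old_name):
    """blue_widgets.htm -> blue-widgets.htm"""
    return old_name.replace("_", "-")


def stub_page(old_name):
    """Build the HTML to leave behind at the old filename."""
    target = escape(new_name(old_name), quote=True)
    return STUB.format(target=target)
```

Writing stub_page("blue_widgets.htm") over the old file leaves a page that redirects to blue-widgets.htm and carries the noindex, follow tag. Whether Googlebot honours that tag is of course the open question in this thread.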
I think I'll be OK since the blank pages probably won't be indexed anyway. As a safeguard I'll probably have to add the old pages to robots.txt which will be a little tedious. If I don't add them to robots.txt then I will have a lot of blank pages which might cause a penalty.
I want to thank everyone here for helping keep me up to date on things. It's invaluable! This thread saved me from creating a major headache for myself.
I had a site with a directory that had been blocked with robots.txt since February. The robots.txt was submitted to G via the URL removal tool in Feb.
Pages almost immediately fell out of the index & weren't requested by Googlebot AT ALL in March or April. Come May 8th, Gbot requested over 6000 of these pages that were (and still are) blocked w/ robots.txt.
Shortly after, the supp index for this site disappeared.
So yeah, I've had a lot of fun with Gbot COMPLETELY ignoring a robots.txt exclusion.
This is a cache issue for us. The page does not even exist anymore. But there it is, full of inappropriate links in the google results.
When you click on the link in the results (the description is all about p0rn) you are taken to the new version of the blog. Very frustrating.
Did you implement a robots.txt file? I always pair one with the ROBOTS meta tag and have had no issues so far!
The robots.txt file Disallow will produce URI only listings with Google.
Based on experience and discussions here at WebmasterWorld, you would remove the Disallow: from robots.txt and put the Robots META Tag on the pages you don't want indexed.
When the bot requests the robots.txt file and there is a Disallow for a page you don't want indexed, Googlebot will index the URI only.
If you are disallowing the bot from visiting the page, it won't see the Robots META Tag to follow whatever directives you have in there.
So, the best method as described here on WebmasterWorld is to use the Robots META Tag to keep stuff out of the index and not the robots.txt file.
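To spot pages where the two methods are fighting each other, something like the Python sketch below could help. It uses the standard library's urllib.robotparser to test whether a URL is disallowed. The example.com URLs and the helper name are placeholders, and the standard-library parser is only an approximation of how Googlebot itself reads robots.txt.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots_txt(robots_txt, url, agent="Googlebot"):
    """True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Placeholder robots.txt and URLs for illustration only.
ROBOTS_TXT = """User-agent: *
Disallow: /dev/
"""

for url in ("http://www.example.com/dev/page.htm",
            "http://www.example.com/live/page.htm"):
    if blocked_by_robots_txt(ROBOTS_TXT, url):
        # The bot never fetches the page, so any Robots META Tag on it is unseen.
        print(url, "is disallowed: a noindex tag on it will never be read")
    else:
        print(url, "is crawlable: a noindex tag on it can be read")
```

Any page that is both disallowed and carries a noindex tag is a candidate for the URI only listings mentioned above, since the bot is never allowed to fetch the page and see the tag.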
I've only seen the disallowed URI only listings when performing advanced search queries.
I see many results like this all over the place. They are all (so far) pages that are being shown as Supplemental Results.