I see it too -- this certainly is an error/bug and not just some intentional change. And unfortunately, it has the effect of taking away a valuable tool for keeping duplicate content and URLs out of the index.
I see some pages, last modified in 2003 May, which have had the <meta name="robots" content="noindex"> tag on them since at least that time, yet which are listed in Google as Supplemental Results with a full title and snippet, and with a full cache from 2005 June. The Google cache shows the <meta name="robots" content="noindex"> tag.
Google has been ignoring the robots noindex meta tag. I know. I designed that site. I am the only one with FTP access and the files have not been altered since 2003, and have always had the robots noindex on them since that date. The whole site was and still is disallowed, as it is the development copy of the site. The live site is elsewhere, on some other server.
In looking further I have found many other pages on many other sites that have exactly the same problem. You can see the noindex tag on the cached copies!
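For anyone who wants to check their own pages the same way, here is a minimal sketch (standard library only; the sample markup is made up for illustration) that scans saved page source, such as a copy of the Google cache, for a robots noindex meta tag:

```python
# Minimal sketch: scan saved page source (e.g. a copy of the Google
# cache) for a robots "noindex" meta tag.  Standard library only; the
# sample markup below is an illustrative example.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Sets self.noindex to True if a robots meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = (d.get("name") or "").lower()
        content = (d.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True

page = '<html><head><meta name="robots" content="noindex"></head></html>'
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.noindex)  # True
```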
>>>>The whole site was and still is disallowed, as it is the development copy of the site. The live site is elsewhere, on some other server. <<<
I wonder if that will bring a dup content penalty. Man, things are a mess. G is deindexing pages I want indexed and sticking in pages I DON'T want indexed. Cheese!
There was an issue with the adsense bot indexing pages and not following meta tags. Are these adsense or adwords sites?
No adsense or adwords. The site has been out of the SERPs for three years - or at least it should have been. I do know that it was definitely not showing in the SERPs in 2003 and 2004. I haven't really checked since then; I had no reason to. I suspect that it has recently reappeared in the SERPs, perhaps just in the last few weeks or months, even though the cache is from a year ago.
It's not just noindex. I have a site that uses noarchive,nofollow, and its pages are showing up in the index too. I never had a problem using those tags before. I just noticed this issue a few days ago.
The cached dates listed are also from mid-2005.
That would be a nasty bug. Well, robots.txt has always worked for me. Anyone seen a violation of robots.txt by G?
Sort of. On another site (a forum) that has various pages disallowed (pages that are not threads), and where all pages are www pages, I see an error.
When I do a site:domain.com search I do see the pages that I should see, but when I do a site:domain.com -inurl:www search, I can see a load of www pages that are disallowed. These all have an old cache and are marked as Supplemental Results.
I wrote about that over at: [webmasterworld.com...] just a few days ago, specifically msg#75: [webmasterworld.com ]
Seriously, I'm getting tired of all their bugs now.
Copyright infringement at its finest.
Of course, google wants to do good -- so we should let it steal our work.
When I posted this last year, comments just suggested I had the tag wrong... now at least we can agree that Google ignores the noindex tag. If you don't want to be spidered, you must exclude the directory in robots.txt.
Google has been ignoring the noindex tag for me for the last two or three weeks.
The strange thing is that not all of my websites seem to suffer from the ignored-noindex problem. It only appears on one site, which has a bunch of supplemental results. The other sites have loads of noindex pages which are still hidden from the index.
Just like the "-inurl:www bug" that g1smd discovered a week ago, this bug also seems to be a supplemental index problem. It seems Google rewrote the supplemental index software and forgot some basic functionality which had been there for years.
This is bad timing for me.
I was reading here about the advantages of blue-widget-parts over bluewidgetparts or blue_widget_parts as filenames. I'm starting to go through my sites and change the filenames over.
Unfortunately I'm using FrontPage (yeah, I know...). FP doesn't play well with .htaccess. The solution I'm using is to rename blue_widgets.htm to blue-widgets.htm etc. Then I take the original page, add a meta refresh redirecting to the new page, and strip out everything but a 'this page has moved' notice. The final part is to add a robots tag of noindex,follow.
I was hoping that Googlebot would heed the robots tag with noindex,follow. However, it now appears that it may not heed this directive.
I think I'll be OK since the blank pages probably won't be indexed anyway. As a safeguard I'll probably have to add the old pages to robots.txt which will be a little tedious. If I don't add them to robots.txt then I will have a lot of blank pages which might cause a penalty.
I want to thank everyone here for helping keep me up to date on things..It's invaluable! This thread saved me from creating a major headache for myself.
>> Anyone seen a violation of robots.txt by G?
I had a site with a directory that had been blocked with robots.txt since February. The robots.txt was submitted to G via the URL removal tool in Feb.
Pages almost immediately fell out of the index and weren't requested by Googlebot AT ALL in March or April. Come May 8th, Gbot requested over 6000 of these pages that were (and still are) blocked with robots.txt.
Shortly after, the supp index for this site disappeared.
So yeah, I've had a lot of fun with Gbot COMPLETELY ignoring a robots.txt exclusion.
All of my NOINDEX pages are correctly not indexed, across 3 completely different sites where I use it (to prevent "printer friendly" pages getting into the index).
Any other common denominators that you can spot?
What makes NOINDEX work for some, but not for others?
I have a directory that was supplemental and blocked by the robots.txt
My pages are not showing up.
I would venture to say this has something to do with the crawling of the supplemental index.
We had a comment that linked to multiple pOrn sites on one of our corporate blogs -- we deleted the comment months ago. In fact, the blog itself has a completely new URL. The comment is back in the supplementals with a cache date of July 2005. Found it in a sitesearch for a keyword related to the blog.
This is a cache issue for us. The page does not even exist anymore. But there it is, full of inappropriate links in the google results.
When you click on the link in the results (the description is all about p0rn) you are taken to the new version of the blog. Very frustrating.
Use the google url removal tool and 404 the pages.
I have seen Google ignoring my robots.txt file as well. In my situation, I have a page on my old site with a 301 redirect to the same page on my new site, and I excluded this page in the robots.txt file on the new site. For some reason Google is completely ignoring the robots.txt file, and this page is indexed and ranking on the first page of results.
Why should google have to obey standards? They ARE the web. :)
|Why should google have to obey standards? They ARE the web. :) |
That would be MS rather than G.
>>>That would be MS rather than G. <<<
MS thinks they are all things pc.
G is not alone. We have a site we have been developing for a couple of months. Every page has:
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Yahoo has listed 66 of the pages. Every cached page source reveals the meta tag.
<meta name="robots" content="none">
Been using this for years and I can't find any of my pages that utilize it in the index. I just spot checked 10 of them from different domains and they are nowhere to be found so it appears to be working on my end.
Did you implement a robots.txt file? I always pair it with the ROBOTS meta tag and have had no issues so far!
|Did you implement a robots.txt file? I always pair it with the ROBOTS meta tag and have had no issues so far! |
I have several pages on several different sites where I did this when the pages were added and not afterwards. The pages still appear in the index. It's a great feeling when you see sale items from Nov, 2005 preserved forever despite doing everything you are supposed to do.
|Did you implement a robots.txt file? I always pair it with the ROBOTS meta tag and have had no issues so far! |
The robots.txt file Disallow will produce URI only listings with Google.
Based on experience and discussions here at WebmasterWorld, you would remove the Disallow: line and instead put the Robots META Tag on the pages you don't want indexed.
When the bot requests the robots.txt file and there is a Disallow for a page you don't want indexed, Googlebot will index the URI only.
If you are disallowing the bot from visiting the page, it won't see the Robots META Tag to follow whatever directives you have in there.
So, the best method as described here on WebmasterWorld is to use the Robots META Tag to keep stuff out of the index and not the robots.txt file.
I've only seen the disallowed URI only listings when performing advanced search queries.
My original observation is that pages containing the <meta name="robots" content="noindex"> tag (for the last three years) are now showing in SERPs with a full title and snippet, and have a link to a cached page from 2005 July, and that the cached page clearly shows the meta noindex tag too.
I see many results like this all over the place. They are all (so far) pages that are being shown as Supplemental Results.
I think we've concluded that Google is broken right now for many, so whatever is occurring now is outside the norm.