Google ignores the meta robots noindex tag. - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Google ignores the meta robots noindex tag.

Thousands of pages show that tag in the Google cache!

g1smd

12:27 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How many people have noticed the many thousands of Supplemental pages with a 2005 June or July cache date that have been indexed, show in the SERPs with a full title and description, rank, and have a <meta name="robots" content="noindex"> tag both on the live page and in the old cached copy linked from the SERPs?

Oh yeah! There are thousands of them. Now that is a programming bug.

.

Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.

However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.

Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.

It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.

I first noticed this yesterday; but found it on some searches that I have not done for several months. I have no idea how long this bug has been showing up... it could be several months.
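For anyone who wants to check their own pages or cached copies, here is a minimal sketch of reading the robots meta directives out of a page's HTML. It uses Python's standard `html.parser`; the names `RobotsMetaParser`, `robots_directives` and `is_noindex` are made up for this illustration, and this is only a guess at a sensible check, not a description of how Google processes the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # Directives are comma-separated and case-insensitive.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of robots meta directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


def is_noindex(html_text):
    """True if the page asks not to be indexed."""
    directives = robots_directives(html_text)
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(is_noindex(page))
```

Running the same check against the live page and against the source of the cached copy would show whether the tag really was present at the time of the crawl.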

tedster

1:05 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I see it too -- this certainly is an error/bug and not just some intentional change. And unfortunately, it has the effect of taking away a valuable tool for keeping duplicate content and URLs out of the index.

g1smd

1:06 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I see some pages that were last modified in 2003 May and which have had the <meta name="robots" content="noindex"> tag on each one since at least that time, that are listed in Google as Supplemental Results with a full title and snippet, and with a full cache from 2005 June. The Google cache shows the <meta name="robots" content="noindex"> tag.

Google has been ignoring the robots noindex meta tag. I know. I designed that site. I am the only one with FTP access and the files have not been altered since 2003, and have always had the robots noindex on them since that date. The whole site was and still is disallowed, as it is the development copy of the site. The live site is elsewhere, on some other server.

In looking further I have found many other pages on many other sites that have exactly the same problem. You can see the noindex tag on the cached copies!

texasville

2:07 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>>>The whole site was and still is disallowed, as it is the development copy of the site. The live site is elsewhere, on some other server. <<<

I wonder if that will bring a dup content penalty. Man, things are a mess. G is deindexing pages I want indexed and sticking in pages you DON'T want indexed. Cheese!

trinorthlighting

2:29 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There was an issue with the adsense bot indexing pages and not following meta tags. Are these adsense or adwords sites?

g1smd

11:51 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

No adsense or adwords. Site has been out of the SERPs for three years - or at least it should have been. I do know that it was definitely not showing in the SERPs in 2003 and 2004. I haven't really checked since then. I had no reason to. I suspect that it has recently reappeared in the SERPs, perhaps just the last few weeks or months, even though the cache is from a year ago.

Key_Master

12:18 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

It's not just noindex. I have a site whose pages use noarchive,nofollow, and they are showing up in the index too. Never had a problem using it before. I just noticed this issue a few days ago.

The cached dates listed are also from mid-2005.

John Carpenter

12:56 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

That would be a nasty bug. Well, robots.txt has always worked for me. Anyone seen a violation of robots.txt by G?

g1smd

1:18 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Sort of. For some other site (a forum) that has various pages (pages that are not threads) disallowed, and all pages are www pages, I see an error.

When I do a site:domain.com search I do see the pages that I should see, but when I do a site:domain.com -inurl:www search, I can see a load of www pages that are disallowed. These all have an old cache and are marked as Supplemental Results.

I wrote about that over at: [webmasterworld.com...] just a few days ago, specifically msg#75: [webmasterworld.com ]

zeus

1:32 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Seriously, I'm getting tired of all these bugs now.

boplicity

1:48 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

Copyright infringement at its finest.

Of course, google wants to do good -- so we should let it steal our work.

soapystar

1:49 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

When I posted this last year, the comments just suggested I had the tag wrong... now at least we can agree that Google ignores the noindex tag. If you don't want to be spidered, you must exclude the directory in robots.txt.

lammert

1:55 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Google has been ignoring the noindex tag for me for the last two or three weeks.

The strange thing is that not all of my websites seem to suffer from the ignored noindex tag problem. It only appears on one site, which has a bunch of supplemental results. The other sites have loads of noindex pages which are still hidden from the index.

Just like the "-inurl:www bug" that g1smd discovered a week ago, this bug also seems to be a supplemental index problem. It seems Google rewrote the supplemental index software and forgot some basic functionality which has been there for years.

cmendla

2:18 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This is bad timing for me.

I was reading here about the advantages of blue-widget-parts over bluewidgetparts or blue_widget_parts as filenames. I'm starting to go through my sites and change the filenames over.

Unfortunately I'm using FrontPage (yeah I know..). FP doesn't play well with .htaccess. The solution I'm using is to rename blue_widgets.htm to blue-widgets.htm etc. Then I take the original page, strip out everything but a 'this page has moved' notice, and add a meta refresh to redirect to the new page. The final part is to add a robots tag of noindex, follow.

I was hoping that Googlebot would heed the robots tag with noindex, follow. However, it now appears that it may not heed this directive.

I think I'll be OK since the blank pages probably won't be indexed anyway. As a safeguard I'll probably have to add the old pages to robots.txt which will be a little tedious. If I don't add them to robots.txt then I will have a lot of blank pages which might cause a penalty.
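The tedious part of that safeguard could be scripted. Below is a rough sketch, assuming the old pages all sit in the site root and only differ from the new ones by underscores versus hyphens. The function names `hyphenated_name` and `robots_txt_for_old_pages` and the sample filename are just for illustration.

```python
def hyphenated_name(filename):
    """Map an old underscore filename to its new hyphenated name."""
    return filename.replace("_", "-")


def robots_txt_for_old_pages(old_filenames, user_agent="*"):
    """Build robots.txt text that disallows each old (renamed) page.

    Assumes every file lives in the site root; adjust the path prefix
    if the pages are in subdirectories.
    """
    lines = ["User-agent: " + user_agent]
    for name in sorted(old_filenames):
        lines.append("Disallow: /" + name)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    old_pages = ["blue_widgets.htm"]
    for old in old_pages:
        print(old, "->", hyphenated_name(old))
    print(robots_txt_for_old_pages(old_pages))
```

Note that a Disallow line stops the bot from requesting the old page at all, so it would no longer see the meta refresh or the robots tag on that page.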

I want to thank everyone here for helping keep me up to date on things..It's invaluable! This thread saved me from creating a major headache for myself.

circuitspore

2:25 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

>> Anyone seen a violation of robots.txt by G?

I had a site with a directory that had been blocked with robots.txt since February. The robots.txt was submitted to G via the URL removal tool in Feb.

Pages almost immediately fell out of the index & weren't requested by Googlebot AT ALL in March or April. Come May 8th, Gbot requested over 6000 of these pages that were (and still are) blocked w/ robots.txt.

Shortly after, the supp index for this site disappeared.

So yeah, I've had a lot of fun with Gbot COMPLETELY ignoring a robots.txt exclusion.

trillianjedi

2:34 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

All of my NOINDEX pages are correctly not indexed, across 3 completely different sites where I use it (to prevent "printer friendly" pages getting into the index).

Any other common denominators that you can spot?

What makes NOINDEX work for some, but not for others?

TJ

trinorthlighting

2:41 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I have a directory that was supplemental and blocked by the robots.txt

My pages are not showing up.

I would venture to say this has something to do with the crawling of the supplemental index.

LJCoolB

2:54 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

We had a comment that linked to multiple pOrn sites on one of our corporate blogs -- we deleted the comment months ago. In fact, the blog itself has a completely new URL. The comment is back in the supplementals with a cache date of July 2005. Found it in a sitesearch for a keyword related to the blog.

This is a cache issue for us. The page does not even exist anymore. But there it is, full of inappropriate links in the google results.

When you click on the link in the results (the description is all about p0rn) you are taken to the new version of the blog. Very frustrating.

trinorthlighting

2:56 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Use the google url removal tool and 404 the pages.

gdawg

2:59 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

I have seen Google ignoring my robots.txt file as well. In my situation I have a page on my old site that I 301 redirected to the same page on my new site, and I excluded this page in the robots.txt file on the new site. For some reason Google is completely ignoring the robots.txt file and this page is indexed and ranking on the 1st page of results.

stuntdubl

3:01 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Why should google have to obey standards? They ARE the web. :)

John Carpenter

3:28 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

>> Why should google have to obey standards? They ARE the web. :)

That would be MS rather than G.

texasville

3:55 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>>That would be MS rather than G. <<<

Nope..that's G.
MS thinks they are all things pc.

herb

5:13 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

G is not alone. We have a site we have been developing for a couple of months. Every page has:
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

Yahoo has listed 66 of the pages. Every cached page source reveals the meta tag.

pageoneresults

7:10 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

<meta name="robots" content="none">

Been using this for years and I can't find any of my pages that utilize it in the index. I just spot checked 10 of them from different domains and they are nowhere to be found so it appears to be working on my end.

oxbaker

8:20 pm on Jun 14, 2006 (gmt 0)

10+ Year Member

Did you implement a robots.txt file? I always pair them with the ROBOTS meta tag and have no issues so far!

hth,
mcm

Atomic

8:40 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

>> did you implement a robots.txt file, i always pair them with the ROBOTS meta tag and have no issues so far!

I have several pages on several different sites where I did this when the pages were added and not afterwards. The pages still appear in the index. It's a great feeling when you see sale items from Nov, 2005 preserved forever despite doing everything you are supposed to do.

pageoneresults

9:12 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

>> Did you implement a robots.txt file, i always pair them with the ROBOTS meta tag and have no issues so far!

The robots.txt file Disallow will produce URI only listings with Google.

Based on experience and discussions here at WebmasterWorld, you would remove the Disallow: and place the Robots META Tag on the pages you don't want indexed.

When the bot requests the robots.txt file and there is a Disallow for a page you don't want indexed, Googlebot will index the URI only.

If you are disallowing the bot from visiting the page, it won't see the Robots META Tag to follow whatever directives you have in there.

So, the best method as described here on WebmasterWorld is to use the Robots META Tag to keep stuff out of the index and not the robots.txt file.

I've only seen the disallowed URI only listings when performing advanced search queries.
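The reasoning above (a disallowed page is never fetched, so its meta tag is never read) can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the /dev/ path and the example.com URLs are assumptions for illustration, and `can_crawl` and `meta_tag_reachable` are hypothetical helper names. It only models a well-behaved crawler, not what Googlebot is actually doing at the moment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the /dev/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev/
"""


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots.txt lets user_agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def meta_tag_reachable(robots_txt, url, user_agent="Googlebot"):
    """A robots meta tag can only be read if the page itself may be fetched."""
    return can_crawl(robots_txt, url, user_agent)


if __name__ == "__main__":
    for url in ("http://example.com/dev/page.html",
                "http://example.com/index.html"):
        print(url, "meta tag reachable:", meta_tag_reachable(ROBOTS_TXT, url))
```

With a Disallow in place the bot only knows the URI, which fits the URI only listings; without it, the bot can fetch the page and act on the noindex tag.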

g1smd

9:28 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

My original observation is that pages containing the <meta name="robots" content="noindex"> tag (for the last three years) are now showing in SERPs with a full title and snippet, and have a link to a cached page from 2005 July, and that the cached page clearly shows the meta noindex tag too.

I see many results like this all over the place. They are all (so far) pages that are being shown as Supplemental Results.

pageoneresults

9:54 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I think we've concluded that Google is broken right now for many, so whatever is occurring now is outside the norm.

This 70 message thread spans 3 pages.
