Forum Moderators: Robert Charlton & goodroi

Google ignores the meta robots noindex tag.

Thousands of pages show that tag in the Google cache!

     
12:27 am on Jun 14, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


How many people have noticed the many thousands of Supplemental pages, with a June or July 2005 cache date, that have been indexed, show in the SERPs with a full title and description, rank, and carry a <meta name="robots" content="noindex"> tag both on the live page and in the old cached copy linked from the SERPs?

Oh yeah! There are thousands of them. Now that is a programming bug.

.

Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.
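
For reference, the tag under discussion sits in each page's <head>. A minimal sketch, with purely illustrative example content:

<html>
<head>
<title>Example page that should stay out of the index</title>
<meta name="robots" content="noindex">
</head>
<body> ... </body>
</html>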

However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.

Such pages should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.

It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.

I first noticed this yesterday, but found it on some searches that I have not done for several months. I have no idea how long this bug has been showing up; it could be several months.

11:25 am on June 15, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Oct 27, 2004
posts:201
votes: 0


Fine. If Google is ignoring the noindex tag, we can still make it ignore pages through robots.txt.

Correct me if I am mistaken?

1:24 pm on June 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2005
posts:621
votes: 0


I hope those "secret" pictures don't appear in the index...

:)

8:17 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


The noindex sites now indexed and cached by Google that I have found have the meta robots noindex tag on every page of the site and do NOT have a robots.txt file at all.

Traditionally, a robots.txt disallow instruction for a URL simply led to the page appearing as a URL-only entry in Google's index if they ever saw a link to the disallowed page. The meta noindex tag ensured that nothing appeared for the page at all. That is now completely broken.
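
To make the traditional distinction concrete, a minimal sketch of the two mechanisms (example.com and /private/ are purely illustrative). Note that they do not combine well: a URL disallowed in robots.txt is never fetched, so a meta noindex on that page is never seen by the crawler.

# robots.txt at http://www.example.com/robots.txt -- stops crawling,
# but the URL can still appear as a URL-only entry if something links to it
User-agent: *
Disallow: /private/

<!-- meta tag in each page's <head> -- the page is crawled,
     but should never appear in the index at all -->
<meta name="robots" content="noindex">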

For Yahoo, when they find the meta noindex tag, they do include the page as a URL-only entry, but often they also try to build an entry for the page, by using the anchor text of one of the links pointing to the page (from an external site) as the SERPs entry title!

9:01 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2095
votes: 2


I have also been finding that those sites do not have their robots.txt file set up right. Different versions:

Robot.txt
robot.txt
Robots.txt
robots.txt
ROBOTS.TXT

etc...

9:01 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Where are you, Google? This is too big an issue to be quiet about.

Anyhow, while we have your attention, I would like to suggest a robots.txt-style protocol that utilizes server headers. For example, the following could be used to prevent Googlebot from indexing any .jpg file on an Apache server.

<Files ~ "\.jpg$">
Header append Robots "noindex"
</Files>

This way, we can block specific file types on the server side and keep the main robots.txt file compliant for bots that choke on the proprietary directives or special characters that some bots use. Besides, it's a much more flexible solution than a standard robots.txt file or meta tag, and it would save bandwidth for both parties.
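
A minimal .htaccess sketch of that idea, assuming Apache with mod_headers enabled. The "Robots" response header is only this poster's proposal; it is not something search engines were documented to honour.

<IfModule mod_headers.c>
  # Attach the proposed header to every .jpg / .jpeg response
  <FilesMatch "\.jpe?g$">
    Header append Robots "noindex"
  </FilesMatch>
</IfModule>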

9:11 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2006
posts:666
votes: 0


Wouldn't something like this do the trick?

User-agent: Googlebot
Disallow: *.jpg
Disallow: *.jpeg
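
Google's own documentation anchors such patterns to the path root and supports $ to match the end of the URL, so a form along these lines may be safer; this is still a Googlebot extension, not part of the standard robots.txt protocol:

User-agent: Googlebot
Disallow: /*.jpg$
Disallow: /*.jpeg$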

9:19 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Some bots choke on * in disallow directives. It's not supposed to be used in that fashion.

[robotstxt.org...]

9:22 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 10, 2006
posts:666
votes: 0


Indeed. But you referred to Google, so I replied with that in mind.

But, as you say, it probably won't work with Yahoo or MSN.

9:25 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Well, you get Google on board first and the other major search engines will follow.

9:55 pm on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 9, 2005
posts:80
votes: 0


Yeah, Google is indexing all my popup "enlarge image" pages, even though I told the robot not to.

Also, when I checked Google Sitemaps it said there was an error trying to crawl fusjkmgvhxkcx.html.
That is a non-existent page!

10:15 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


The disallow notation with a * in the URL is for Googlebot only.

.

The bot asking for URLs that do not exist is trying to see what your real "404 error page" really looks like. Many brain-dead webmasters serve a 302 redirect for "page not found", and their bot gets confused by it. They now test for that condition.
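
A minimal sketch of the server-side fix for that, assuming Apache; the /404.html path is just an example:

# In httpd.conf or .htaccess: serve a custom error page with a genuine 404 status.
# Do not point ErrorDocument at a full http:// URL -- Apache would then answer
# with a 302 redirect instead of a 404, which is exactly the confusion described above.
ErrorDocument 404 /404.html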

10:44 pm on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:June 6, 2005
posts:56
votes: 0


I have it set so a directory is ignored via robots.txt. I recently started to get 404s for links within this directory (it has since been deleted). It is still listed as a disallow in robots.txt, yet a quick search shows hundreds of supplemental results, all within this directory, cached from 10 June 2005.

[EDIT]
After playing with the robots.txt analysis tool on the Google Sitemaps stats page, I came across the following.

In robots.txt I have the following line:

Disallow: /dirabc/

Now if I use the "Test URLs against your robots.txt file" tool with http://www.site.com/dirabc, it comes up with "Blocked by line 63: Disallow: /dirabc/".

But if I try it with http://site.com/dirabc, then I get "Allowed Not in domain".

Guess it's time to go play with .htaccess to get the http://site.com URLs to go to www.site.com. Still, I don't think that the above should be happening.
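
A minimal .htaccess sketch for that canonical-host redirect, assuming Apache with mod_rewrite (site.com standing in for the real domain):

RewriteEngine On
# Send any request for the bare domain to the www host with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^site\.com$ [NC]
RewriteRule ^(.*)$ http://www.site.com/$1 [R=301,L]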

[edited by: lawman at 11:04 pm (utc) on June 15, 2006]

11:26 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If you Disallow: /dirabc/ then you have only disallowed http://site.com/dirabc/ and http://www.site.com/dirabc/ and their files, but you have NOT disallowed http://site.com/dirabc and http://www.site.com/dirabc, which, if accessed, can still return a valid DirectoryIndex.

Disallow does not disallow folders or files, but URLs that start with the partial URL mentioned in the exclusion. The disallow is on a per-domain basis, with the robots.txt file being found in the root of the same domain as the URLs that it can exclude.
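
A small robots.txt sketch of that prefix matching, reusing the /dirabc/ path from the post above:

User-agent: *
# Blocks /dirabc/, /dirabc/index.html, /dirabc/sub/anything ...
# but NOT /dirabc (no trailing slash) and NOT /dirabc.html
Disallow: /dirabc/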

4:24 am on June 16, 2006 (gmt 0)

New User

10+ Year Member

joined:Jan 14, 2006
posts:22
votes: 0


I recently set up a bunch of pages that duplicate content on my site, because I wanted to have a separate page for each widget to help my visitors locate what they are looking for once they are on my site.

I put the tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
on each page, thinking that would be enough to stop Google indexing them and penalising me for duplicate content.

Today I noticed my site is disappearing intermittently from the Google listings for its most popular phrase. However, when I check its rankings at the Big Daddy Dance Tool site I see that my site is still ranking number 3 for each datacentre.

Could it be that Google is now penalising me for duplicate content? I don't have a robots.txt file; should I have one, and would it help? Or should I just remove the duplicate content altogether? When I say duplicate, it's not really duplicate, but it is similar. Kind of like a summary.
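
If you do add a robots.txt, a minimal sketch for keeping crawlers away from a set of summary pages might look like this (the /summaries/ path is purely hypothetical). Bear in mind, though, that a URL blocked by robots.txt can still show up as a URL-only entry, and the crawler will never get to see a meta noindex on a page it is not allowed to fetch.

User-agent: *
Disallow: /summaries/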

Is anyone else noticing anything like this? Or is this intermittent dropping in and out just normal and related to regular index movements?

6:26 am on June 16, 2006 (gmt 0)

New User from IN 

10+ Year Member

joined:Mar 3, 2006
posts:4
votes: 0


Hi,

My site is more than a year old, but just one page was crawled (the home page), and that was indexed in the top SERPs; for a few keywords it was in the top ranks. But yesterday all my pages got crawled and now I am nowhere in the SERPs. Can anybody suggest what the reason behind that might be?

11:52 am on June 17, 2006 (gmt 0)

New User

10+ Year Member

joined:Mar 1, 2005
posts:3
votes: 0


I have also seen a website which is indexed by Google even though the site owner was using the noindex tag.
Googlebot is not following that tag nowadays.

9:33 pm on June 17, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2005
posts:677
votes: 0


I don't see it at my end.

10:55 pm on June 18, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 26, 2005
posts:66
votes: 0


Looks like the bad cache is disappearing; 64.233.183.104 is clean. Any relationship to the bad data push?

7:24 am on June 19, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2047
votes: 1


Aside to those concerned about unauthorized caching: I don't know if Google is cleaning up its act re ignoring NOINDEX, but Alexa and MSN are making a mess of it now --

Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches
NOARCHIVE and robots.txt instructions meaningless. Three strikes?
[webmasterworld.com...]

5:22 am on June 20, 2006 (gmt 0)

New User

10+ Year Member

joined:June 20, 2006
posts:4
votes: 0


A month ago, Google thought it had 19,000 of my pages. That's waaaay more than makes sense. None of my pages had robots meta tags then, but I added <meta name="robots" content="follow, noindex, noarchive"> to my biggest category of pages about May 23. Google stopped replacing those pages but continued showing results along with a CACHED copy from just before I made the change.

Over time, G* forgot about the month-old CACHED copies and started using copies from a year or more ago and called them SUPPLEMENTAL. Those obsolete pages seem to be (seem to be) going away a few at a time. Google now thinks it has under 9,000 of my pages. That's still too many but it keeps getting better.

My guess (just a guess) is that the 19,000 counted multiple archived copies of my pages from different dates. Now each crawl of a page with that follow/noindex/noarchive meta tag seems to kill off a few offending obsolete copies.

[NOTE: I am forcing one specific server to remove that as a variable.]
