
Google ignores the meta robots noindex tag.

Thousands of pages show that tag in the Google cache!

     
12:27 am on Jun 14, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


How many people have noticed the many thousands of Supplemental pages, with a June or July 2005 cache date, that have been indexed, show in the SERPs with a full title and description, rank, and carry a <meta name="robots" content="noindex"> tag both on the live page and in the old cached copy linked from the SERPs?

Oh yeah! There are thousands of them. Now that is a programming bug.


Obviously, when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered but not indexed. But if Google kept no data about that URL, they would never "remember" anything about it, and would return every few minutes to "discover" the content, again and again.

However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.

Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.

It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.

I first noticed this yesterday, but on some searches that I had not run for several months. I have no idea how long this bug has been showing up... it could be several months.

10:40 pm on June 14, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Aug 22, 2003
posts:333
votes: 0


Great, so in addition to all the threads complaining that Googlebot is ignoring robots.txt, it's also ignoring meta tags? Are any pages with the noarchive meta tag showing up as Supplemental Results with caches?

Looks like the only sure way to stop Google from indexing a page is to serve Googlebot a 403 error for that page.
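For what it's worth, a minimal sketch of that (assuming Apache with mod_rewrite enabled; "private-page.html" is just a placeholder filename):

# Untested sketch: return 403 Forbidden to Googlebot for one page.
# Assumes Apache with mod_rewrite; "private-page.html" is a placeholder.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^private-page\.html$ - [F]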

12:08 am on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 4, 2002
posts:1785
votes: 2


I just found a benefit to the noindex meta tag being ignored by Google.

I found a scraper directory with noindex tags on its "details" pages. Each details page copies a site's description/title/keywords six times, repeats them in the alt text of transparent images, and lists multiple keywords in the URL (which means they can easily outrank any site in their directory), and they do this to every site in the directory.

These scrapers will show up in an "intitle:yourdomain.com" search in Google even though the noindex tag is in place.

This site is in a foreign country and hosts its own pages, so contacting the host is useless. I reported the site to Google AdSense and Google's spam team, and I have also been writing to all the web design sites listed in the directory, asking them to do the same.

3:18 am on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 28, 2002
posts:1324
votes: 0


If I were a clever chap with an office in Mountain View, CA, I might get the Sitemaps people working on implementing some tools to identify technical problems that could be causing "crawling and indexing" issues. I'd then provide some nice instructions that were NOT written by programmers, but by someone who could channel Denzel (explain this to me like I'm a six-year-old) while doing it.

Some of the more cynical of us might notice that we had to sign up for a solution to a problem they created in the first place, but since they are the web: when Google ain't happy, ain't nobody happy.

3:21 am on June 15, 2006 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11313
votes: 165


My original observation is that pages that have carried the <meta name="robots" content="noindex"> tag for the last three years are now showing in SERPs with a full title and snippet, have a link to a cached page from July 2005, and that cached page clearly shows the meta noindex tag too.

What with Google using a "crawl caching proxy" shared by various bots, including the AdWords bot and Googlebot, and with the AdWords bot ignoring robots.txt, I've been anticipating that problems might happen.

Some questions/thoughts...

While Google has talked publicly about the AdWords bot ignoring traditional robots.txt prohibitions, it hasn't mentioned anything about the robots meta tag. For a number of reasons, it would be nice to have an official word on how the AdWords bot regards this tag. I'm also wondering whether the AdWords bot ever goes beyond specified landing pages.

Additionally, with regard to g1smd's situation, and that of others whose <meta name="robots" content="noindex"> tagged pages are being indexed, I'm wondering if those affected might simultaneously be using robots.txt to disallow bots from these pages.

I'm also wondering whether anyone who has disallowed the AdWords bot specifically has seen their blocked pages in the index.

3:26 am on June 15, 2006 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11313
votes: 165


I'd then provide some nice instructions that were NOT written by programmers, but by someone who could channel Denzel (explain this to me like I'm a six-year-old) while doing it.

<offtopic> graywolf - I've long felt that this should apply to all software documentation and interfaces... even pitched it to some companies... but the idea hasn't caught on.</offtopic>

4:49 am on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


noarchive example [72.14.209.104]

Don't know how long that link will last.

7:57 am on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 17, 2005
posts:79
votes: 0


Has anyone with the Google or Yahoo toolbar been looking at your noindex pages?

One thing I've noticed is that the toolbar seems to get those pages indexed. My site is half-designed, with a robots.txt on it and no links pointing to it, and it has been indexed by Google... it has had the robots.txt since the day I bought the domain.

8:32 am on June 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 30, 2005
posts:456
votes: 0


>>> That would be MS rather than G. <<<
Nope.. that's G.
MS thinks they are all things PC.

You are wrong. MS does not dominate only the PC sphere. They also dominate the web/HTML. They have always enforced their own "web standards" or "extensions of standards" via their Internet Explorer dominance, without any prior public discussion.

10:01 am on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:179
votes: 1


From [google.com]:
Google automatically takes a "snapshot" of each page it crawls and archives it.
(...)
Users can access the cached version by choosing the "Cached" link on the search results page.
(...)
Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.

The meta tags only prevent the "Cached" link, or the cached content, from being displayed on SERPs (as long as the program that checks the display flag runs perfectly).
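That is, the tag the quote refers to is the noarchive one:

<meta name="robots" content="noarchive">

It removes only the "Cached" link; the listing, title, and snippet remain.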

11:19 am on June 15, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38047
votes: 11


I can confirm that scraper spiders are running in massive numbers against Google's cached pages, ripping content that was not supposed to be rippable...

11:25 am on June 15, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Oct 27, 2004
posts:201
votes: 0


Fine. If Google is ignoring the noindex tag, we can still make it ignore pages through robots.txt.

Correct me if I am mistaken?

1:24 pm on June 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2005
posts:621
votes: 0


I hope those "secret" pictures don't appear in the index...

:)

8:17 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


The noindexed sites that I have found, now indexed and cached by Google, have the meta robots noindex tag on every page of the site and do NOT have a robots.txt file at all.

Traditionally, a robots.txt disallow instruction for a URL simply led to the page appearing as a URL-only entry in Google's index if they ever saw a link to the disallowed page. The meta noindex tag ensured that nothing appeared for the page at all. That is now completely broken.

For Yahoo: when they find the meta noindex tag, they do include the page as a URL-only entry, but often they also try to build an entry for the page by using the anchor text of one of the links pointing to it (from an external site) as the title of the SERPs entry!

9:01 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


I have also been finding that those sites do not have their robots.txt file set up correctly. I've seen different versions:

Robot.txt
robot.txt
Robots.txt
robots.txt
ROBOTS.TXT

etc...

9:01 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Where are you, Google? This is too big an issue to be quiet about.

Anyhow, while we have your attention, I would like to suggest a robots.txt protocol that utilizes server headers. For example, the following could be used to prevent Googlebot from indexing any .jpg file on an Apache server.

<Files ~ "\.jpg$">
    Header append Robots "noindex"
</Files>

This way, we can block specific file types on the server side and keep the main robots.txt file compliant for bots that choke on the proprietary directives or special characters that some bots use. Besides, it's a much more flexible solution than a standard robots.txt file or a meta tag, and it would save bandwidth for both parties.
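For illustration, under that proposal (and assuming Apache's mod_headers module is loaded), the response for any .jpg would then carry the extra header, along the lines of:

HTTP/1.1 200 OK
Content-Type: image/jpeg
Robots: noindex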

9:11 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 10, 2006
posts:661
votes: 0


Wouldn't something like this do the trick?

User-agent: Googlebot
Disallow: *.jpg
Disallow: *.jpeg

9:19 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Some bots choke on * in disallow directives. It's not supposed to be used in that fashion.

[robotstxt.org...]
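One sketch that stays closer to the standard is to scope the non-standard wildcard to Googlebot's own section and give every other bot a plain prefix rule (here assuming, purely as an example, that the images live under /images/):

# Non-standard wildcard syntax, scoped to Googlebot only
User-agent: Googlebot
Disallow: /*.jpg

# Standard prefix rule for all other bots
# (assumes the images are kept under /images/)
User-agent: *
Disallow: /images/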

9:22 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 10, 2006
posts:661
votes: 0


Indeed. But you referred to Google, so I replied with that in mind.

But, as you say, it probably won't work with Yahoo or MSN.

9:25 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Well, you get Google on board first, and the other major search engines will follow.

9:55 pm on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 9, 2005
posts:80
votes: 0


Yeah, Google is indexing all my popup "enlarge image" pages, even though I told the robot not to.

Also, when I checked Google Sitemaps, it said there was an error trying to crawl fusjkmgvhxkcx.html. That is a non-existent page!

10:15 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


The disallow notation with a * in the URL is for Googlebot only.


The bot asking for URLs that do not exist is trying to see what your real "404 error page" looks like. Many brain-dead webmasters serve a 302 redirect for "page not found", and the bot gets confused by it. They now test for that condition.
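As an aside, if you want to make sure your own setup passes that test, a local ErrorDocument on Apache serves a custom page while still returning a true 404 status (sketch only; /404.html is a placeholder path). Pointing ErrorDocument at a full http:// URL instead makes Apache issue a redirect, which is exactly the broken condition being tested for.

# Serve a custom error page while keeping the real 404 status
ErrorDocument 404 /404.html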

10:44 pm on June 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:June 6, 2005
posts:56
votes: 0


I have a directory that is set to be ignored via robots.txt. I recently started to get 404s for links within this directory (it has since been deleted), and it is still listed as disallowed in robots.txt. A quick search shows hundreds of Supplemental Results, all within this directory, cached from 10 June 2005.

[EDIT]
After playing with the robots.txt analysis tool on the Google Sitemaps stats page, I came across the following.

My robots.txt contains the following line:

Disallow: /dirabc/

Now, if I use "Test URLs against your robots.txt file" with http://www.site.com/dirabc, it comes up with "Blocked by line 63: Disallow: /dirabc/".

But if I try it with http://site.com/dirabc, I get "Allowed: Not in domain".

Guess it's time to go play with .htaccess to get the http://site.com URLs to go to www.site.com. Still, I don't think the above should be happening.
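A common .htaccess sketch for that canonical redirect (assuming Apache with mod_rewrite; site.com stands in for the real domain, as above):

# 301-redirect bare-domain requests to the www host
RewriteEngine On
RewriteCond %{HTTP_HOST} ^site\.com$ [NC]
RewriteRule ^(.*)$ http://www.site.com/$1 [R=301,L]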

[edited by: lawman at 11:04 pm (utc) on June 15, 2006]

11:26 pm on June 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If you Disallow: /dirabc/ then you have only disallowed http://site.com/dirabc/ and http://www.site.com/dirabc/ and their files; you have NOT disallowed http://site.com/dirabc and http://www.site.com/dirabc, which, if accessed, can still return a valid DirectoryIndex.

Disallow does not disallow folders or files, but URLs that start with the partial URL given in the exclusion. The disallow works on a per-domain basis, with the robots.txt file found in the root folder of the same domain as the files it can exclude.
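For example:

# Blocks /dirabc/ and everything below it,
# but NOT the slash-less URL /dirabc itself:
Disallow: /dirabc/

# Blocks /dirabc, /dirabc/, and anything else
# beginning with /dirabc (pure prefix match):
Disallow: /dirabc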

4:24 am on June 16, 2006 (gmt 0)

New User

10+ Year Member

joined:Jan 14, 2006
posts:22
votes: 0


I recently set up a bunch of pages that duplicate content on my site, but I wanted to have a separate page for each widget to help my visitors locate what they are looking for once they are on my site.

I put the tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
on each page, thinking that would be enough to stop Google from indexing them and penalising me for duplicate content.

Today I noticed my site is disappearing intermittently from the Google listings for its most popular phrase. However, when I check its rankings at the Big Daddy Dance Tool site, I see that my site is still ranking number 3 for each datacentre.

Could it be that Google is now penalising me for duplicate content? I don't have a robots.txt file; should I have one, and will it help? Or should I just remove the duplicate content altogether? When I say duplicate, it's not really; it's similar. Kind of like a summary.

Is anyone else noticing anything like this? Or is this intermittent dropping in and out business just normal and related to regular index movements?

6:26 am on June 16, 2006 (gmt 0)

New User

5+ Year Member

joined:Mar 3, 2006
posts:4
votes: 0


Hi,

My site is more than a year old, but only one page, the home page, had been crawled; it was indexed in top SERPs and ranked at the top for a few keywords. But yesterday all my pages got crawled, and now I am nowhere in the SERPs. Can anybody suggest what the reason behind that might be?

11:52 am on June 17, 2006 (gmt 0)

New User

10+ Year Member

joined:Mar 1, 2005
posts:3
votes: 0


I have also seen a website that is indexed by Google even though the site owner was using the noindex tag. Googlebot is not following that tag nowadays.

9:33 pm on June 17, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2005
posts:677
votes: 0


I don't see it at my end.

10:55 pm on June 18, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 26, 2005
posts:66
votes: 0


Looks like the bad cache is disappearing; 64.233.183.104 is clean. Any relationship to the bad data push?

7:24 am on June 19, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts:2038
votes: 1


Aside to those concerned about unauthorized caching: I don't know if Google is cleaning up its act regarding ignoring NOINDEX, but Alexa and MSN are making a mess of it now:

Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches
NOARCHIVE and robots.txt instructions meaningless. Three strikes?
[webmasterworld.com...]

5:22 am on June 20, 2006 (gmt 0)

New User

5+ Year Member

joined:June 20, 2006
posts:4
votes: 0


A month ago, Google thought it had 19,000 of my pages. That's waaaay more than makes sense. None of my pages had robots meta tags then, but around May 23 I added <meta name="robots" content="follow,noindex,noarchive"> to my biggest category of pages. Google stopped refreshing those pages but continued showing results, along with a CACHED copy from just before I made the change.

Over time, G* forgot about the month-old CACHED copies, started using copies from a year or more ago, and called them SUPPLEMENTAL. Those obsolete pages seem to be (seem to be) going away a few at a time. Google now thinks it has under 9,000 of my pages. That's still too many, but it keeps getting better.

My guess (just a guess) is that the 19,000 counted multiple archived copies of my pages from different dates. Now each crawl of a page with the follow/noindex/noarchive tag seems to kill off a few offending obsolete copies.

[NOTE: I am forcing one specific server to remove that as a variable.]
