Forum Moderators: Robert Charlton & goodroi


Site Indexed with meta robots - noindex, nofollow

         

sunnyujjawal

11:54 am on May 28, 2012 (gmt 0)

10+ Year Member



A few days ago I started a new site and set meta robots to NOINDEX, NOFOLLOW, but today I found it indexed in Google with the site URL as the title and nothing else.
Status of robots.txt is:
User-agent: *
Disallow: /
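A minimal sketch of why this combination backfires, using the standard-library robots.txt parser: with `Disallow: /` in place, a compliant crawler may not fetch any page at all, so it never gets to read the on-page meta robots directive. The domain and path are placeholders.

```python
import urllib.robotparser

# The same robots.txt as above, parsed locally for illustration.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot is covered by the wildcard rule, so the fetch is refused
# and the noindex tag on the page is never seen.
print(parser.can_fetch("Googlebot", "https://example.com/any-page.html"))  # → False
```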

sunnyujjawal

12:02 pm on May 28, 2012 (gmt 0)

10+ Year Member



< moved from another location >


If a URL has meta robots noindex,nofollow and the same URL is allowed in robots.txt, which one will Google give preference to: meta robots or robots.txt?

[edited by: Robert_Charlton at 5:06 pm (utc) on May 28, 2012]

Sand

5:19 pm on May 28, 2012 (gmt 0)

10+ Year Member



If you forbid Google from crawling a page via robots.txt, how are they supposed to read the noindex tag on the page? Short answer: they can't, so they will index it.

If you don't want any content indexed, remove the robots.txt restrictions and let them crawl so they can see the noindex commands.

Additionally, you need to understand a little bit about how robots.txt works. When you restrict a file or directory, you're telling Google not to *crawl* the page. You aren't telling them to ignore it.

Google can still do things like checking response headers without crawling the page, and they will index a page without crawling it (especially in this post-Caffeine world).
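The fix Sand describes can be sketched as follows: leave the page crawlable in robots.txt and put the directive in the page itself. The small parser below just shows how a crawler would discover the tag once it is allowed to fetch the HTML; the page content is illustrative, not taken from the site in question.

```python
from html.parser import HTMLParser

class MetaRobotsFinder(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if any."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives = [d.strip().lower()
                               for d in a.get("content", "").split(",")]

# A crawlable page carrying the directive on-page.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
finder = MetaRobotsFinder()
finder.feed(page)
print(finder.directives)  # → ['noindex', 'nofollow']
```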

g1smd

5:25 pm on May 28, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The robots.txt disallow directive stops Google from crawling the page, so they don't get to see the on-page directives, or the title or page content.

They will, however, list the page as a URL-only entry in the SERPs and may even try to construct a title for the page using the anchor text in any incoming links.

If you want the page completely out of the SERPs, you need to not disallow the page in robots.txt and then use the meta robots noindex directive on the page itself.
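The interaction g1smd describes can be reduced to a simplified decision model, under the assumption that a disallowed page can still surface as a URL-only entry. This is an illustration of the logic, not Google's actual behavior in every case.

```python
def serp_outcome(disallowed_in_robots_txt: bool, has_meta_noindex: bool) -> str:
    """Simplified model of how robots.txt and meta noindex interact."""
    if disallowed_in_robots_txt:
        # The crawler never fetches the page, so any meta noindex on it
        # is invisible; the bare URL can still be listed.
        return "may appear as URL-only entry"
    if has_meta_noindex:
        # Crawlable, so the directive is seen and honored.
        return "kept out of the SERPs"
    return "crawled and indexed normally"

print(serp_outcome(True, True))   # → may appear as URL-only entry
print(serp_outcome(False, True))  # → kept out of the SERPs
```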

lucy24

9:04 pm on May 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So in order to prevent the world at large from knowing that a given page exists, you have to let google run rampant over the entire page, trusting it to obey your "noindex" wishes? This is not a pretty choice.

Serious question: If g### knows nothing about a page beyond the fact that it exists, under what circumstances would it turn up in SERPs? Exclude cases where people are searching for a page by name; you are not telling them anything they did not already know.

tedster

9:17 pm on May 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



under what circumstances would it turn up in SERPs?

When information and content Google gathers from backlinked pages make it a good candidate for the query - and in most cases, that should mean there aren't many good candidates.

netmeg

2:03 am on May 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So in order to prevent the world at large from knowing that a given page exists, you have to let google run rampant over the entire page, trusting it to obey your "noindex" wishes? This is not a pretty choice.


If you don't want the world at large to know it exists, you pretty much need to put it behind a login and/or password. I mean, it *is* sitting on the biggest network in the world; eventually some person and/or bot will find it otherwise. And if they link to it, or they find it via the toolbar, it could well show up in the SERPs.

lucy24

9:08 am on May 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Matter of fact, every single one of my noindex or roboted-out pages is readily accessible to humans by following ordinary links. The difference is that I don't want them to be entry pages.

netmeg

3:25 pm on May 30, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, I understand. For the most part, NOINDEX seems to work for me for that, but I'm not gonna exclude the possibility that one or two strays might make it into the index for some reason.

Sgt_Kickaxe

10:13 pm on May 30, 2012 (gmt 0)



If you want the page completely out of the SERPs, you need to not disallow the page in robots.txt and then use the meta robots noindex directive on the page itself.


Robots.txt is the suggested method of handling redirect pages. Any suggestions on getting those to remain out of the index? (They show up as 404s in GWT when removed, even though they don't show up in the SERPs.)

lucy24

4:02 am on May 31, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt is the suggested method of handling redirect pages, any suggestion on getting those to remain out of the index?

Bingo. I knew I had a "But, but, but..." nagging at my memory.

The horse's mouth [support.google.com] says (emphasis mine)


If you wish to remove your content using the URL removal request tool in our Google Webmaster Tools, you must first meet the criteria listed below.

To remove a page or image, you must do one of the following:

* Make sure the content is no longer live on the web. Requests for the page must return an HTTP 404 (not found) or 410 status code.
* Block the content using a robots.txt file.
* Block the content using a meta noindex tag.

To remove a directory and its contents, or your whole site, you must ensure that the pages you want to remove have been blocked using a robots.txt file. Returning a 404 isn't enough, because it's possible for a directory to return a 404 status code, but still serve out files underneath it. Using robots.txt to block a directory ensures that all of its children are disallowed as well.


Seems to me an ordinary person of ordinary intelligence* would take this to mean that once you've removed a page from the Index, blocking its directory in robots.txt is enough to ensure that it stays removed.

I detoured here for some trial-and-error involving a one-word search, constrained to my site. Took a couple of tries... but yes indeed, what should meet my gaze but the bare URL of a file from a roboted-out directory, linked to a public page via the word I searched for. (Without the site constraint, the same word gets close to a billion** hits, so nobody will ever land on the page by accident.)

So who you gonna believe-- google or your lyin' eyes?


* Insert boilerplate about "assuming for the sake of discussion" etc.
** 10^9, not 10^12.
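The directory-blocking behavior in the quoted Google help text can be checked with the standard-library parser, which agrees that disallowing a directory disallows all of its children as well. The paths below are placeholders.

```python
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

# The directory itself and everything beneath it are blocked...
print(parser.can_fetch("*", "https://example.com/private/"))           # → False
print(parser.can_fetch("*", "https://example.com/private/deep/file"))  # → False
# ...while unrelated paths remain crawlable.
print(parser.can_fetch("*", "https://example.com/public/page"))        # → True
```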

sunnyujjawal

12:35 pm on May 31, 2012 (gmt 0)

10+ Year Member



Whatever... I am sure that these days Google search is in testing mode. Even after disallowing /wp-admin/ in robots.txt, it got indexed, and after some time it disappeared.

not2easy

1:16 pm on May 31, 2012 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I always need to remind myself when I noindex a page to get it out of the sitemap too and search the site for forgotten links to it, to nofollow them. Basic stuff, but sometimes it slips past.
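The cleanup step not2easy mentions can be sketched as a quick check: after noindexing a page, confirm it is no longer listed in the XML sitemap. The sitemap content and URLs below are illustrative.

```python
import xml.etree.ElementTree as ET

# A toy sitemap that should no longer mention the noindexed page.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
listed = [loc.text for loc in ET.fromstring(sitemap).findall(".//sm:loc", NS)]

noindexed_page = "https://example.com/private-page"
print(noindexed_page in listed)  # → False
```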

lucy24

5:50 pm on May 31, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm told that "nofollow" has the same unintended effect as roboting-out a directory: g### can't get to the page so they can't see its "noindex" tag.