Forum Moderators: Robert Charlton & goodroi
use a noindex robots meta tag instead of robots.txt rules
so it helps to "nofollow" links to pages you do not want crawled
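A minimal sketch of the two mechanisms being discussed (the page path and link target below are hypothetical):

```html
<!-- On the page itself: ask engines not to index it.
     The page must be crawlable for this tag to ever be read. -->
<meta name="robots" content="noindex">

<!-- On pages linking to it: hint that the link should not be followed -->
<a href="/private-page.html" rel="nofollow">private page</a>
```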
Example: I have a website without a sitemap. I have a directory that is disallowed in robots.txt, all links to the pages in that directory are nofollowed, and there are no external links to those pages.
Yet, one of them made its way into the index.
Google then constructs a title and snippet for the URL from references alone rather than by crawling the page directly. But it doesn't look like a snippet; it just looks the way other pages are displayed in Google's search results. I have come across this snippet behaviour on other websites, though.
And remember to change your robots.txt file so you now ALLOW Googlebot to crawl the page. Unless they crawl, they won't ever read the robots meta tag.

Oops, I wouldn't have done that if you hadn't informed me. Thanks a ton!
If the URL is in your sitemap, the page will be crawled.

Are you sure that even though I may block a web page using noindex meta tags, the page will still be indexed if the URL has been included in the sitemap? I have never heard of this before. Can you give me some references or share your personal experiences? Thanks
1. If I block a page as 'do not crawl', how can the spiders still index it? If they don't crawl a page, how can they index it? Crawling is the very first step to indexing, right?
2. Do the SE spiders actually care about what is in robots.txt?
There is no problem with my robots.txt, though!
It starts accumulating PageRank, and all the other externally defined factors that exist in Google's world
Could you please explain this to me?
Also: Once Googlebot finds its name in robots.txt, it ignores all other sections. So if you want to block some areas from Googlebot, and some areas from all robots, you'll have to say those parts twice.

I have seen this on many websites and had wondered why they repeat all the rules for the different spiders, such as Googlebot, Yahoo's, Alexa's, Ask's, etc. So, from my robots.txt above, I am just going to remove the 'Noindex' section, which, as many of you have told me, is of no use. If I remove that section, then the 'User-agent: Googlebot' line will also be removed and there will be only one section for all the crawlers, 'User-agent: *'. That is enough, right?
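A sketch of what "saying those parts twice" looks like in practice (the paths are hypothetical):

```
# Googlebot reads only its own section and ignores the rest,
# so shared rules must be repeated here.
User-agent: Googlebot
Disallow: /private/
Disallow: /google-only-block/

# Every other crawler falls through to this section.
User-agent: *
Disallow: /private/
```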
Still, if it shows up in your sitemap they may index it anyway. That is because, if you read about the purpose of the sitemap, it is to list the pages you want to have indexed. I found out the hard way a long time ago that you should only have pages in the sitemap that you do want indexed, because a noindex metatag on the page gets ignored when they find the URL in the sitemap.

This is what makes me learn more about SEO. Thanks for letting me know of that, bud!
I am reminded of it again whenever I try to do away with an old page and forget to remove it from the sitemap after I put a noindex metatag on the page.

Out of curiosity, why don't you put a redirect in place?
I submit new sitemaps and still see 404s from pages that have not existed for two years and are not in any current sitemap. I appreciate that I can now mark them as "Fixed", but I know they will be back.

I have the same problem. My website has more than 600,000 pages and I am getting 18k server errors in the GWT crawl errors section. It shows pages that never existed on my website, and whenever I mark them as fixed they show up again. I am fed up with the 'mark as fixed' process.
Thanks for that! That snippet - yes, it has been long discussed. I don't know why G should even index the URL-only version when we have blocked it. Only G knows!
If you use both a robots.txt file and robots meta tags
If the robots.txt and meta tag instructions for a page conflict, Googlebot follows the most restrictive. More specifically:
• If you block a page with robots.txt, Googlebot will never crawl the page and will never read any meta tags on the page.
• If you allow a page with robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.
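The first bullet is the crux of the whole thread, and it can be demonstrated with Python's standard-library robots.txt parser. This is only a sketch: the rules and URLs are made up, and it models what a robots.txt-honoring crawler would do, not Googlebot itself.

```python
import urllib.robotparser

# Parse hypothetical rules directly, so the example is self-contained
# and does not fetch anything over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler never downloads a disallowed page, so any
# <meta name="robots" content="noindex"> on it is never seen --
# which is why the URL can still get indexed from links alone.
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public.html"))        # True
```

Note that `can_fetch()` answers only "may I download this URL?"; indexing a URL from external references requires no download at all.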
You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file, not even an empty one.
To remove a page or image, you must do one of the following:
* Make sure the content is no longer live on the web. Requests for the page must return an HTTP 404 (not found) or 410 status code.
* Block the content using a robots.txt file.
* Block the content using a meta noindex tag.
To remove a directory and its contents, or your whole site, you must ensure that the pages you want to remove have been blocked using a robots.txt file. Returning a 404 isn't enough, because it's possible for a directory to return a 404 status code, but still serve out files underneath it. Using robots.txt to block a directory ensures that all of its children are disallowed as well.
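A sketch of such a directory-level block, using a hypothetical /old-section/ directory:

```
User-agent: *
# Disallows /old-section/ and everything underneath it,
# e.g. /old-section/page.html and /old-section/images/logo.png,
# regardless of what status codes those URLs return.
Disallow: /old-section/
```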
<snip>
Content removed with this tool will be excluded from the Google index for a minimum of 90 days.
So, I can easily exclude various parameters that might lead googlebot into a major duplicate content area, such as:
Disallow: /category?sort=
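For reference, Googlebot also supports the `*` wildcard in robots.txt paths, which makes parameter exclusions like this easier to generalize (the /category path is hypothetical):

```
User-agent: Googlebot
# Block the specific category listing sorted views...
Disallow: /category?sort=
# ...or block a sort= parameter at the start of any query string.
Disallow: /*?sort=
```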
Would not a person of ordinary intelligence interpret this to mean that a file in a roboted-out directory will stay out of the index, once removed?