Robots.txt blocking and Google's behavior

Forum Moderators: Robert Charlton & goodroi


martinacastro

7:55 pm on Mar 20, 2011 (gmt 0)

I want to share my experience, which seems strange to me.

I blocked a directory of my site in robots.txt with Disallow: /directory/. When I search site:mydomain.com in Google, I don't see the pages of /directory/, but when I search site:mydomain.com/directory/, Google lists the pages of that directory. So it seems that Google does not read or obey the Disallow rule?

Has anyone experienced the same?

Robert Charlton

10:00 pm on Mar 20, 2011 (gmt 0)

What you're probably seeing is a list of urls that were posted on one (or several) of your pages not blocked by robots.txt

Dan01

11:25 pm on Mar 20, 2011 (gmt 0)

The same thing happened to me. I have noindex and disallow on one of my forums. Google still indexes it.

martinacastro

11:32 pm on Mar 20, 2011 (gmt 0)

Robert, thanks for your comments.

The pages/urls that Google lists are all in the directory that I blocked with robots.txt.

So I don't understand your comment "What you're probably seeing is a list of urls that were posted on one (or several) of your pages not blocked by robots.txt"

Can you explain a bit, please?

By the way now I put the noindex meta...

tedster

12:25 am on Mar 21, 2011 (gmt 0)

Are the URLs listed with titles and descriptions as they appear on the page? That would mean Google is crawling the page. In my experience, that is not the case when a URL is blocked in robots.txt.

Instead, those blocked URLs are still sometimes shown when there are other pages linking to them, but either as "URL-only" or sometimes with a title and description that Google cobbles together based on other information, such as anchor text on the linking pages.

When this occurs, you can use the URL removal tool to have Google remove even this kind of minimal mention. In my experience it is rare that these URLs show up in ordinary search results (not site: operator lists) so they often aren't a big concern.

Robert Charlton

2:07 am on Mar 21, 2011 (gmt 0)

I've seen URL-only listings show up either in site: operator results (fairly common) or, very occasionally but much more prominently, in general search results, where there's a very high PageRank link on a spiderable page pointing to something like a "blocked" mirror site of syndicated content. It can be ugly when it happens.

"By the way now I put the noindex meta..."

If you have URLs that you want to keep out of site: operator results for whatever reason, then you need to use the noindex meta, but you then need to drop the robots.txt block for those URLs, as robots.txt effectively "conflicts" with the meta robots noindex.

This is something that has been discussed periodically here for the past 8 years. Here's one of my more succinct statements of what's happening, from a thread that covers crawl budget as a factor in choosing robots.txt vs the noindex meta tag...

Consequences of blocking robots or just using noindex,follow
http://www.webmasterworld.com/google/3978798.htm

Note that while robots.txt will keep Google from spidering a page, it will not prevent Google from indexing other references to that page if they appear on pages which Google does spider. This is how those URI listings can end up in the serps.

If you want to keep both the page and references to the page from being indexed, then use the noindex,follow robots meta tag on the page (at least for Google).

There's a twist to this, though. If you use the robots meta tag on a page, don't also use robots.txt to block the spidering of the page. The reason?... if Google doesn't spider the page, it won't see the robots noindex meta tag.
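The difference between crawling and indexing described above can be illustrated with a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and both URLs are assumptions based on the example in the first post (mydomain.com and /directory/); page.html and index.html are hypothetical. The sketch only shows what the Disallow rule permits a crawler to fetch. It says nothing about what Google chooses to list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the rule described in the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs: one inside the blocked directory, one outside it.
blocked_url = "http://mydomain.com/directory/page.html"
linking_url = "http://mydomain.com/index.html"

# The rule only controls fetching. The blocked page cannot be crawled,
# but a crawlable page that links to it can still expose its URL.
blocked_can_fetch = parser.can_fetch("Googlebot", blocked_url)
linking_can_fetch = parser.can_fetch("Googlebot", linking_url)

print("blocked page crawlable:", blocked_can_fetch)
print("linking page crawlable:", linking_can_fetch)
```

Since the linking page stays crawlable, any link on it to the blocked URL can still be picked up as a reference, which is consistent with the URL-only listings discussed earlier in the thread.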

And here's one of our more thorough discussions of robots.txt and the noindex meta in relation to indexing...

robots.txt - Google's JohnMu Tweets a tip
http://www.webmasterworld.com/google/4143083.htm

For pages whose urls I absolutely don't want to show up as references, I strongly lean toward the meta robots noindex tag. For something like keeping site search results out of the Google serps, robots.txt is just fine.

martinacastro

4:54 am on Mar 21, 2011 (gmt 0)

Thanks for the responses and ideas

Looking also at this post [webmasterworld.com...], I decided to delete the URLs and create new URLs (with content similar to those I deleted), which I will block using robots.txt because I don't want them in Google's index.

What do you think about this workaround?

martinacastro

6:20 am on Mar 21, 2011 (gmt 0)

@tedster Google shows only the titles for these pages, as they appear on my site...

Robert Charlton

6:24 am on Mar 21, 2011 (gmt 0)

If you don't want the urls to show up in a site: directory search, then don't use robots.txt, as you have no guarantee it will block them.

In this case, I would use the robots meta noindex tag, and I would not use robots.txt.
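A rough way to check a page against this advice is sketched below, again in Python with only the standard library. It tests two things: that robots.txt lets the crawler fetch the URL, and that the page's HTML carries a robots meta tag with noindex. The function name noindex_is_visible, the class RobotsMetaFinder, and all sample data are hypothetical, and the check is a simplification. It does not model how Google actually processes pages.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def noindex_is_visible(robots_txt, url, html, agent="Googlebot"):
    """True only if the page is crawlable AND carries a noindex meta tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return rules.can_fetch(agent, url) and "noindex" in finder.directives


# Hypothetical sample data based on the example in the first post.
URL = "http://mydomain.com/directory/page.html"
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body></body></html>"
)
BLOCKING_ROBOTS = "User-agent: *\nDisallow: /directory/\n"
OPEN_ROBOTS = "User-agent: *\nDisallow:\n"

# With the Disallow rule in place the crawler never sees the meta tag.
with_block = noindex_is_visible(BLOCKING_ROBOTS, URL, PAGE_HTML)
# With the rule dropped the page is fetched and the noindex can be read.
without_block = noindex_is_visible(OPEN_ROBOTS, URL, PAGE_HTML)

print("noindex visible with robots.txt block:", with_block)
print("noindex visible without the block:", without_block)
```

The first case is the conflict described in this thread: the noindex tag is on the page, but the robots.txt block keeps it from ever being read.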