Forum Moderators: Robert Charlton & goodroi
I wanted to mention one more way to block Googlebot, using wildcards in robots.txt (Google supports wildcards like '*' in robots.txt). Here's how:
1. Add a parameter like [mattcutts.com...] to the pages that you don't want fetched by Googlebot.
2. Add the following to your robots.txt:
User-agent: Googlebot
Disallow: *googlebot=nocrawl
[mattcutts.com...]
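As an aside, Google-style wildcard matching is easy to approximate. The following is an illustrative sketch, not Google's actual implementation: it converts a robots.txt pattern into a regular expression where '*' matches any run of characters and a trailing '$' anchors the end of the URL:

```python
import re

def robots_pattern_matches(pattern: str, url_path: str) -> bool:
    """Approximate Google-style wildcard matching for a robots.txt rule.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match to the end of the URL. Everything else is matched literally,
    starting from the beginning of the path.
    """
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # trailing '$' anchors the end of the URL
    return re.match(regex, url_path) is not None

# The rule above blocks any URL containing the nocrawl parameter:
print(robots_pattern_matches("*googlebot=nocrawl", "/page.html?googlebot=nocrawl"))  # True
print(robots_pattern_matches("*googlebot=nocrawl", "/page.html"))                    # False
```

The same sketch shows why a rule like "Disallow: /*?" catches every URL with a query string.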
I strictly avoid the "nofollow" attribute; one reason, for example, is that I would like to preserve the option to tell other search engines to follow certain internal and external links while telling Google not to.
I have already implemented the above method, and I can see in my Google Webmaster Tools that those links were forbidden. So far so good. It works.
In addition, I added the meta tag directives "noindex,nofollow,noarchive,nosnippet" to the link destination pages, or in some cases took care of that with the X-Robots-Tag header.
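For reference, the on-page and header variants mentioned here look like this (a sketch; the header form assumes a server, such as Apache with mod_headers, that can set response headers):

```
<!-- in the <head> of the destination page: -->
<meta name="robots" content="noindex,nofollow,noarchive,nosnippet">

# or sent as an HTTP response header instead:
X-Robots-Tag: noindex,nofollow,noarchive,nosnippet
```

The header form is useful for non-HTML resources, where a meta tag is not an option.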
Again I must repeat what Matt Cutts said:
"We may see links to the pages with the nocrawl parameter, but we won't crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn't ever fetch the page."
In that case, Google showing the url reference is not possible, since I have additionally implemented the "noindex" directive in the robots.txt like this:
Disallow: *googlebot=nocrawl
So here is my question:
Do those links accumulate PR? Do I leak PR?
Any thoughts?
[edited by: tedster at 8:13 am (utc) on Mar. 7, 2009]
[edit reason] shorter quote - fair use copyright rule [/edit]
Do those links accumulate PR? Do I leak PR?
My guess is that you should see this as a stricter meta noindex: the page gets picked up and receives PR, but whereas a meta noindex page can still pass PR, a 'nocrawl'ed page cannot.
Be wary of duplicate content though. Query strings can be a death trap if you don't have a good plan.
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file[...]
I don't know whether this would also apply to nofollow-links, but my gut feeling says it doesn't (anybody with some experience on this matter?).
I said:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
And I meant:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
Noindex: *googlebot=nocrawl
Noindex: *googlebot=nocrawl
That's called wishful thinking.
There is no such robots.txt directive as "Noindex:" that I'm aware of.
Also note that while Google may have wildcarding implemented, if you decide to expand this technique, the original robots.txt exclusion protocol only permits a wildcard in this specific usage:
User-agent: *
The robots.txt directive "Noindex" exists, but is only supported by Google.
Wildcards are supported by Google, Yahoo, MSN and Ask.
Just to avoid any misunderstandings.
[edited by: Webnauts at 2:00 am (utc) on Mar. 8, 2009]
User-agent: Googlebot
Disallow: /*?
Disallow: *googlebot=nocrawl
Noindex: /*?
Noindex: *googlebot=nocrawl
In addition, I have implemented on those pages the robots meta tag with the directives "noindex,nofollow,noarchive,nosnippet", and where that was not applicable I have set the same directives for those pages via the X-Robots-Tag header.
How can those URL references show up in the index after all?
That means that you are far out-of-date.
The robots.txt directive "Noindex" exists, but is only supported by Google
Wildcards are supported by Google, Yahoo, MSN and Ask.
"You're right, Live Search does not support wildcards in robots.txt today; we are thinking about it. [webmasterworld.com]" (Nov 1, 2007)
As for the REP of the last millennium, I am very aware of it; that is why I assign rules to the bots individually.
[edited by: Webnauts at 12:11 pm (utc) on Mar. 8, 2009]
[edited by: Robert_Charlton at 4:44 pm (utc) on Mar. 8, 2009]
[edit reason] removed specifics [/edit]
User-Agent: *
Disallow: /
Noindex: /
[pagead2.googlesyndication.com...]
Still, to my knowledge there has never been a public statement of Google's support for this directive. In December 2007 Yahoo did announce support for X-Robots-Tag: NOINDEX (Yahoo reference [ysearchblog.com]) but that's not the same thing.
Without official documentation on this, the working assumption has been that it is experimental. If you add it and then use Google's robots.txt validator tool, you will see that the tool gives it the same status as Disallow.
20 November, 2007
"Good catch, Sebastian. How is your experiment going? At the moment we will usually accept the "noindex" directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support."
Sebastian's Pamphlets [sebastians-pamphlets.com]
Since then I have used it, and it works perfectly. And by the way, I obviously also use X-Robots-Tag with the "noindex" directive.
That is all I can tell about that.
[edited by: tedster at 12:02 am (utc) on Mar. 9, 2009]
[edit reason] I added a link to attribute the quote [/edit]
When you say that a Noindex: directive in robots.txt "works" - do you mean that even with backlinks, the url never gets shown in Google search results, not even as a url-only listing? That would make its action different (and more far-reaching) than a Disallow: directive, which disallows crawling but not appearance in the SERPs.
Now what's next? Just another example:
I have a page linking to a page called example.html
The link looks like this:
http://www.example.com/example.html?bots=nocrawl
In the robots.txt I have this:
User-agent: Googlebot
Disallow: *bots=nocrawl
Noindex: *bots=nocrawl
Then I add in the .htaccess file X-Robots this:
<FilesMatch "\.(txt)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
That is set up so the robots.txt file itself cannot be indexed, followed, etc.
If I take this to the next level, let's say I add another X-Robots rule for that single file, like:
<FilesMatch "example\.html">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
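One caveat worth noting on these FilesMatch rules: Apache treats the pattern as an unanchored regular expression search, so `example\.html` also matches any filename that merely contains that string. A quick sketch (the extra filenames here are hypothetical):

```python
import re

# The same pattern used in the FilesMatch block above; Apache applies
# it as an unanchored search against the requested filename.
rule = re.compile(r"example\.html")

print(bool(rule.search("example.html")))          # True  (intended)
print(bool(rule.search("another-example.html")))  # True  (also matched!)
print(bool(rule.search("example.html.bak")))      # True  (also matched!)

# Anchoring the pattern limits it to the exact filename:
anchored = re.compile(r"^example\.html$")
print(bool(anchored.search("another-example.html")))  # False
```

The first rule, `\.(txt)$`, is already anchored at the end, so it catches exactly the files ending in ".txt" (robots.txt among them).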
What do you think? Or am I repeating myself?
[edited by: tedster at 7:48 pm (utc) on Mar. 9, 2009]
[edit reason] switch to example.com - it can never be owned [/edit]