Forum Moderators: Robert Charlton & goodroi
I wanted to mention one more way to block Googlebot, using wildcards in robots.txt (Google supports wildcards like '*' in robots.txt). Here's how:
1. Add a parameter like [mattcutts.com...] to the pages that you don't want fetched by Googlebot.
2. Add the following to your robots.txt:
User-agent: Googlebot
Disallow: *googlebot=nocrawl
[mattcutts.com...]
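As an aside, Google-style wildcard matching is easy to approximate. The following is an illustrative sketch, not Google's actual implementation: it converts a robots.txt pattern into a regular expression where '*' matches any run of characters and a trailing '$' anchors the end of the URL:

```python
import re

def robots_pattern_matches(pattern: str, url_path: str) -> bool:
    """Approximate Google-style wildcard matching for a robots.txt rule.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match to the end of the URL. Everything else is matched literally,
    starting from the beginning of the path.
    """
    # Escape regex metacharacters, then restore '*' as '.*'
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # trailing '$' anchors the end of the URL
    return re.match(regex, url_path) is not None

# The rule above blocks any URL containing the nocrawl parameter:
print(robots_pattern_matches("*googlebot=nocrawl", "/page.html?googlebot=nocrawl"))  # True
print(robots_pattern_matches("*googlebot=nocrawl", "/page.html"))                    # False
```

The same sketch shows why a rule like "Disallow: /*?" catches every URL with a query string.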
I strictly avoid the "nofollow" attribute; one reason, for example, is that I would like to preserve the option to tell other search engines to follow certain internal and external links while telling Google not to.
I have already implemented the above method, and I can see in my Google Webmaster Tools that those links were forbidden. So far so good. It works.
In addition, I added the meta tag directives "noindex,nofollow,noarchive,nosnippet" to the link destination pages, or in some cases took care of that with the X-Robots-Tag header.
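For reference, the on-page and header variants mentioned here look like this (a sketch; the header form assumes a server, such as Apache with mod_headers, that can set response headers):

```
<!-- in the <head> of the destination page: -->
<meta name="robots" content="noindex,nofollow,noarchive,nosnippet">

# or sent as an HTTP response header instead:
X-Robots-Tag: noindex,nofollow,noarchive,nosnippet
```

The header form is useful for non-HTML resources, where a meta tag is not an option.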
Again I must repeat what Matt Cutts said:
"We may see links to the pages with the nocrawl parameter, but we won't crawl them. At most, we would show the url reference (the uncrawled link), but we wouldn't ever fetch the page."
In that case, Google showing the url reference is not possible, since I have additionally implemented the "noindex" directive in the robots.txt like this:
Disallow: *googlebot=nocrawl
So here is my question:
Do those links accumulate PR? Do I leak PR?
Any thoughts?
[edited by: tedster at 8:13 am (utc) on Mar. 7, 2009]
[edit reason] shorter quote - fair use copyright rule [/edit]
Do those links accumulate PR? Do I leak PR?
My guess is that you should see this as a stricter meta noindex: the page gets picked up and receives PR, but whereas a meta noindex page can still pass PR, a 'nocrawl'ed page cannot.
Be wary of duplicate content though. Query strings can be a death trap if you don't have a good plan.
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file[...]
I don't know whether this would also apply to nofollow-links, but my gut feeling says it doesn't (anybody with some experience on this matter?).
I said:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
And I meant:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
Noindex: *googlebot=nocrawl
Noindex: *googlebot=nocrawl
That's called wishful thinking.
There is no such robots.txt directive as "Noindex:" that I'm aware of.
Also note that while Google may have wildcarding implemented, if you decide to expand this technique, the original robots.txt exclusion protocol only permits a wildcard in this specific usage:
User-agent: *
The robots.txt directive "Noindex" exists, but is only supported by Google.
Wildcards are supported by Google, Yahoo, MSN and Ask.
Just to avoid any misunderstandings.
[edited by: Webnauts at 2:00 am (utc) on Mar. 8, 2009]
User-agent: Googlebot
Disallow: /*?
Disallow: *googlebot=nocrawl
Noindex: /*?
Noindex: *googlebot=nocrawl
In addition, I have implemented on those pages the robots meta tag with the directives "noindex,nofollow,noarchive,nosnippet", and where that was not applicable I have set the same directives for those pages via the X-Robots-Tag header.
How can those URL references show up in the index after all?
That means that you are far out-of-date.
The robots.txt directive "Noindex" exists, but is only supported by Google
Wildcards are supported by Google, Yahoo, MSN and Ask.
"You're right, Live Search does not support wildcards in robots.txt today; we are thinking about it. [webmasterworld.com]" (Nov 1, 2007)
As for the REP of the last millennium, I am very aware of it; that is why I assign rules to the bots individually.
[edited by: Webnauts at 12:11 pm (utc) on Mar. 8, 2009]
[edited by: Robert_Charlton at 4:44 pm (utc) on Mar. 8, 2009]
[edit reason] removed specifics [/edit]
User-Agent: *
Disallow: /
Noindex: /
[pagead2.googlesyndication.com...]
Still, to my knowledge there has never been a public statement of Google's support for this directive. In December 2007 Yahoo did announce support for X-Robots-Tag: NOINDEX (Yahoo reference [ysearchblog.com]) but that's not the same thing.
Without official documentation on this, the working assumption has been that it is experimental. If you add it and then use Google's robots.txt validator tool, you will see that the tool gives it the same status as Disallow.
20 November, 2007
"Good catch, Sebastian. How is your experiment going? At the moment we will usually accept the "noindex" directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support."
Sebastian's Pamphlets [sebastians-pamphlets.com]
Since then I have used it, and it works perfectly. And by the way, I obviously also use X-Robots-Tag with the "noindex" directive.
That is all I can tell about that.
[edited by: tedster at 12:02 am (utc) on Mar. 9, 2009]
[edit reason] I added a link to attribute the quote [/edit]
When you say that a Noindex: directive in robots.txt "works" - do you mean that even with backlinks, the url never gets shown in Google search results, not even as a url-only listing? That would make its action different (and more far-reaching) than a Disallow: directive, which disallows crawling but not appearance in the SERPs.
Now what's next? Just another example:
I have a page linking to a page called example.html
The link looks like this:
http://www.example.com/example.html?bots=nocrawl
In the robots.txt I have this:
User-agent: Googlebot
Disallow: *bots=nocrawl
Noindex: *bots=nocrawl
Then I add in the .htaccess file X-Robots this:
<FilesMatch "\.(txt)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
That is set up so the robots.txt file itself cannot be indexed, followed, etc.
If I take this to the next level, let's say I add another X-Robots rule for that single file, like:
<FilesMatch "example\.html">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
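One caveat worth noting on these FilesMatch rules: Apache treats the pattern as an unanchored regular expression search, so `example\.html` also matches any filename that merely contains that string. A quick sketch (the extra filenames here are hypothetical):

```python
import re

# The same pattern used in the FilesMatch block above; Apache applies
# it as an unanchored search against the requested filename.
rule = re.compile(r"example\.html")

print(bool(rule.search("example.html")))          # True  (intended)
print(bool(rule.search("another-example.html")))  # True  (also matched!)
print(bool(rule.search("example.html.bak")))      # True  (also matched!)

# Anchoring the pattern limits it to the exact filename:
anchored = re.compile(r"^example\.html$")
print(bool(anchored.search("another-example.html")))  # False
```

The first rule, `\.(txt)$`, is already anchored at the end, so it catches exactly the files ending in ".txt" (robots.txt among them).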
What do you think? Or am I repeating myself?
[edited by: tedster at 7:48 pm (utc) on Mar. 9, 2009]
[edit reason] switch to example.com - it can never be owned [/edit]