tedster

msg:3864965 | 9:51 am on Mar 7, 2009 (gmt 0) |
Yes, a url that is not indexed still accumulates PageRank -- unless the links that point to it all use a nofollow attribute. If by "leaking" PR you mean these urls can accumulate PR but they can't circulate it back into the site, that is true. They become a kind of black hole for PR ;)
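To make the "black hole" picture concrete, here is a minimal sketch using the simplified textbook PageRank iteration. It is not Google's actual computation; the three-page graph, the page names and the damping factor of 0.85 are all assumptions chosen for illustration. The blocked page receives a link but, because it is never crawled, has no known outlinks, so whatever it accumulates is not passed on.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: rank held by pages with no known outlinks is not redistributed."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # uncrawled page: its outlinks are unknown, nothing flows out
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical site: "blocked" is disallowed in robots.txt, so its links are never seen.
closed_ranks = pagerank({"home": ["article", "blocked"], "article": ["home"], "blocked": []})

# Same site if the page were crawlable and linked back to the home page.
open_ranks = pagerank({"home": ["article", "blocked"], "article": ["home"], "blocked": ["home"]})

print(closed_ranks)
print(open_ranks)
```

In this toy model the blocked page still ends up with a non-zero score, and the home page scores lower than it does when the page can pass rank back.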
|
johnnie

msg:3864971 | 10:19 am on Mar 7, 2009 (gmt 0) |
| Do those links accumulate PR? Do I leak PR? |
| My guess is that you should see this as a stricter meta noindex: it gets picked up and receives PR, but whereas a meta noindex page can still pass PR, a 'nocrawled' page cannot. Be wary of duplicate content though. Query strings can be a death trap if you don't have a good plan.
|
Webnauts

msg:3864986 | 11:21 am on Mar 7, 2009 (gmt 0) |
So what is the difference between the "googlebot=nocrawl" practice and the "nofollow" attribute? Can you explain? [edited by: Webnauts at 11:25 am (utc) on Mar. 7, 2009]
|
johnnie

msg:3865091 | 3:32 pm on Mar 7, 2009 (gmt 0) |
The googlebot=nocrawl is specific to google and needs to be applied consistently throughout your site in order to prevent duplicate content problems. Regardless, both can be used to prevent discovery, although Matt Cutts says: | Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file[...] |
| I don't know whether this would also apply to nofollow-links, but my gut feeling says it doesn't (anybody with some experience on this matter?).
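On applying the parameter consistently: one way to avoid hand-editing links is to generate them through a single helper. The sketch below is only an illustration; the function name `add_nocrawl_marker` and the example URLs are made up, and it uses nothing beyond Python's standard `urllib.parse`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The query parameter discussed in this thread, as a (name, value) pair.
MARKER = ("googlebot", "nocrawl")

def add_nocrawl_marker(url):
    """Return url with googlebot=nocrawl appended, unless it is already present."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if MARKER not in query:
        query.append(MARKER)
    return urlunsplit(parts._replace(query=urlencode(query)))

print(add_nocrawl_marker("http://www.example.com/page.html"))
print(add_nocrawl_marker("http://www.example.com/page.html?id=5"))
```

Calling the helper twice on the same URL leaves it unchanged, so internal links do not end up carrying the parameter more than once.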
|
Webnauts

msg:3865151 | 5:58 pm on Mar 7, 2009 (gmt 0) |
I have to correct my OP. I said:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
And I meant:
...about Google showing the url reference is not possible, since I have additionally implemented in the robots.txt the "noindex" directive like this:
Disallow: *googlebot=nocrawl
Noindex: *googlebot=nocrawl
|
phranque

msg:3865447 | 12:45 am on Mar 8, 2009 (gmt 0) |
| Noindex: *googlebot=nocrawl |
| that's called wishful thinking. there is no such robots.txt directive as "Noindex:" that i'm aware of. also note that while google may have wildcarding implemented, if you decide to expand this technique the robots.txt exclusion protocol only permits wildcarding for this specific usage:
|
Webnauts

msg:3865477 | 1:50 am on Mar 8, 2009 (gmt 0) |
Phranque, that is not called wishful thinking. That means that you are far out-of-date. The robots.txt directive "Noindex" exists, but is only supported by Google. Wildcards are supported by Google, Yahoo, MSN and Ask. Just to avoid any misunderstandings. [edited by: Webnauts at 2:00 am (utc) on Mar. 8, 2009]
|
wheel

msg:3865491 | 2:21 am on Mar 8, 2009 (gmt 0) |
the disallow googlebot=nocrawl posted in the OP in this thread should work for all bots that support robots.txt. It's not some secret Google thing that's added, it's just MC showing us an innovative way to use what we already have - if I'm not mistaken. He's saying to just brand all nocrawl pages with something that becomes part of the URL; and then of course anything with that URL can be blocked in robots.txt just like any other pattern can be blocked in robots.txt.
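As a rough illustration of how such a pattern rule picks out URLs, here is a small matcher. It assumes the Google-style wildcard extension ("*" for any run of characters, a trailing "$" for end of URL), which is not part of the original robots.txt protocol, and it is not how any real crawler is implemented. The function names are hypothetical.

```python
import re

def rule_to_regex(rule):
    """Translate a wildcard Disallow value into a regex matched from the start of the path."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' stand for any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_disallowed(path, rules):
    """True if any rule matches the path (including its query string)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

RULES = ["*googlebot=nocrawl"]

print(is_disallowed("/example.html?googlebot=nocrawl", RULES))  # True
print(is_disallowed("/example.html", RULES))                    # False
```

Only URLs that actually carry the marker are caught; the same page requested without the parameter does not match the rule.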
|
g1smd

msg:3865504 | 2:43 am on Mar 8, 2009 (gmt 0) |
Of course, should anyone link to you without that parameter included, the pages will be indexed under whatever URL was included in that link. I'd never use a technique like this. It's not robust in any way.
|
Webnauts

msg:3865507 | 2:53 am on Mar 8, 2009 (gmt 0) |
I have the robots.txt:
User-agent: Googlebot
Disallow: /*?
Disallow: *googlebot=nocrawl
Noindex: /*?
Noindex: *googlebot=nocrawl
In addition I have implemented in those pages the robots meta tag with the directives "noindex,nofollow,noarchive,nosnippet" and where not applicable I have implemented for those pages X-Robots with the directives "noindex,nofollow,noarchive,nosnippet". How can those URL references show up in the index after all?
|
tedster

msg:3865535 | 4:09 am on Mar 8, 2009 (gmt 0) |
They can show up as a "url-only" reference if some other link to them exists anywhere. You can request their removal if you want, through your Webmaster Tools account.
|
phranque

msg:3865622 | 10:52 am on Mar 8, 2009 (gmt 0) |
| That means that you are far out-of-date. The robots.txt directive "Noindex" exists, but is only supported by Google |
| i would be happy to believe you. i've simply never seen a reference to it before yours. maybe i missed it: Creating a robots.txt file - Webmasters/Site owners Help [google.com] | Wildcards are supported by Google, Yahoo, MSN and Ask. |
| i guess it's been almost a year now for MSN: i was merely pointing out that it is not part of the 1996 REP so if you want all well-behaved bots to respect your intentions you should be careful with your usage of extensions. lest there be any misunderstanding...
|
Webnauts

msg:3865638 | 12:09 pm on Mar 8, 2009 (gmt 0) |
Phranque, there is no official statement from Google about the robots.txt "noindex" directive, but some of us SEOs know that Google was experimenting with it a long time ago, and it works 100%. I have tested it on countless sites so far, with full support and success! I am very aware of the REP of the last millennium; that is why I assign rules for the bots individually. [edited by: Webnauts at 12:11 pm (utc) on Mar. 8, 2009] [edited by: Robert_Charlton at 4:44 pm (utc) on Mar. 8, 2009] [edit reason] removed specifics [/edit]
|
tedster

msg:3865780 | 5:51 pm on Mar 8, 2009 (gmt 0) |
A note about the noindex: directive in robots.txt. There is speculation about this because it appears in Google's own Adserver robots.txt. Still, to my knowledge there has never been a public statement of Google's support for this directive. In December 2007 Yahoo did announce support for X-Robots-Tag: NOINDEX (Yahoo reference [ysearchblog.com]) but that's not the same thing. Without official documentation on this, the working assumption has been that it is experimental. If you add it and then use Google's robots.txt validator tool, you will see that the tool gives it the same status as Disallow.
|
Webnauts

msg:3865859 | 8:54 pm on Mar 8, 2009 (gmt 0) |
Tedster, on the blog Sebastian's Pamphlets, Google employee John Mu commented: 20 November, 2007 | "Good catch, Sebastian. How is your experiment going? At the moment we will usually accept the "noindex" directive in the robots.txt, but we are not yet at a point where we are willing to set it into stone and announce full support." Sebastian's Pamphlets [sebastians-pamphlets.com] |
| Since then I use it and it works perfectly. And by the way, I obviously also use X-Robots with the "noindex" directives. That is all I can tell about that. [edited by: tedster at 12:02 am (utc) on Mar. 9, 2009] [edit reason] I added a link to attribute the quote [/edit]
|
tedster

msg:3865959 | 12:04 am on Mar 9, 2009 (gmt 0) |
Thanks for that comment from John Mueller. When you say that a Noindex: directive in robots.txt "works" - do you mean that even with backlinks, the url never gets shown in Google search results, not even as a url-only listing? That would make its action different (and more far-reaching) than a Disallow: directive, which disallows crawling but not appearance in the SERPs.
|
Shaddows

msg:3866130 | 10:06 am on Mar 9, 2009 (gmt 0) |
Sorry, but I fail to see how this is in any way helpful. OK, you stop crawl referral from your own site, but unless you are 301-ing to a canonical URL that includes the robots.txt-excluded string, it does not stop discovery from elsewhere. As PR is also 'lost', the only discernible benefit is to preserve crawl budget, and surely that is going to be marginal at best.
|
Webnauts

msg:3866443 | 5:28 pm on Mar 9, 2009 (gmt 0) |
Ted, the Noindex: directive works as far as I can tell from several projects I have already worked on. Now what's next? Just another example: I have a page linking to a page called example.html. The link looks like this:
http://www.example.com/example.html?bots=nocrawl
In the robots.txt I have this:
User-agent: Googlebot
Disallow: *bots=nocrawl
Noindex: *bots=nocrawl
Then I add this X-Robots rule to the .htaccess file:
<FilesMatch "\.(txt)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
That is set up so that the robots.txt itself cannot be indexed, followed, etc. If I take this to the next level, let's say I add another X-Robots rule for that single file, like:
<FilesMatch "example\.html">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
What do you think? Or am I repeating myself? [edited by: tedster at 7:48 pm (utc) on Mar. 9, 2009] [edit reason] switch to example.com - it can never be owned [/edit]
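A quick way to see which files the two FilesMatch patterns above would select is to try the regular expressions against a list of filenames. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine; for patterns this simple the two should agree, but that is an assumption, and the filenames are invented for the example.

```python
import re

def matching_files(pattern, filenames):
    """Return the filenames in which the pattern is found anywhere (unanchored search)."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.search(name)]

# Hypothetical files on the site.
FILES = ["robots.txt", "notes.txt", "example.html", "my-example.html", "index.html"]

print(matching_files(r"\.(txt)$", FILES))       # ['robots.txt', 'notes.txt']
print(matching_files(r"example\.html", FILES))  # ['example.html', 'my-example.html']
```

Under this reading, the first pattern sends the header for every .txt file and not only robots.txt, and the second, being unanchored, also catches names that merely contain "example.html". An anchored pattern such as `^robots\.txt$` would be one way to narrow the match, if that is the intent.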
|
tedster

msg:3866586 | 7:57 pm on Mar 9, 2009 (gmt 0) |
What is the issue you are trying to resolve? Are these urls still showing up in search results?
|
Webnauts

msg:3866629 | 8:38 pm on Mar 9, 2009 (gmt 0) |
The links do not show up in the search results. I am just wondering if there is a sort of PR pruning.
|