|Robots.txt vs. meta robots noindex?|
I'm currently working with a site that has a large quantity of duplicate content. A canonical tag has been implemented on the original page, and Meta noindex tags have been implemented on the duplicates. Would robots.txt noindex be a better option? I'm thinking that using pattern matching in robots.txt may just be more efficient - i.e. Googlebot won't even look at those pages from the get-go - whereas with Meta robots, those pages will still be crawled, even if they're ultimately not indexed.
I prefer robots.txt for blocking crawlers.
Uggur - why?
Google downloads robots.txt about once a day on average. If I put a noindex meta tag on pages, I have to wait for Googlebot to recrawl them. I remember that John Mu (a Google employee) said something like: "robots.txt is cached by crawlers, so if you want to block some pages, you should disallow them in robots.txt and wait for Google to cache it. You can see when Google last downloaded robots.txt."
I think robots.txt is easier to manage than noindex meta tags.
i would tag any page which is a duplicate (or of little value for any other reason) with a 'noindex' tag. this allows google to crawl the page anyway.
i am not a big fan of the robots.txt as a way of sorting out duplicate issues.
So, you have 2 problems:
1) You already have the duplicate content in Google, which you want to get rid of (and presumably: transfer all juice to a single page).
2) Afterwards, you do not want the duplicate pages to be picked up by Google again.
I've seen great results when implementing the canonical tag.
Put it on all the duplicate pages, pointing to the one page you do want in the SERPs.
I would only use the robots.txt option afterwards. I don't know whether putting the duplicate pages in your robots.txt now will simply delete them from the index, including their PR, but I believe so.
This way you give google a chance to first transfer all link juice / pr to the important page through the canonical tags.
lostdreamer - I kind of came to the site with the canonical tag AND meta robots noindex already implemented. so the question is where do I go from here - do I implement robots.txt? Frankly, I think in terms of maximizing how efficiently the search engine bots can crawl my site, it may be the way to go. also, SE bots can then auto-discover XML sitemaps...
and gn_wendy - I'll ask you the same question I asked ugger: Why? it's all well and good to say "you're not a fan"...what does that mean?
If you do not want something in Google, you NOINDEX it. The robots.txt file is not the best way to tell Google not to index something. You could still end up with urls in the index - only they'd be urls without titles or snippets.
I control indexing with the NOINDEX tag, and crawling with the robots.txt file. They're not the same thing.
|You could still end up with urls in the index - only they'd be urls without titles or snippets. |
...the exact same thing happens when you use the 'noindex' tag.
|I'll ask you the same question I asked ugger: Why? |
For several reasons. The first being I try and not limit Google's access to any pages a user or myself may link to. If a link is in the section of my website blocked by the robots.txt the PR will not flow back into the site from there.
The second reason I don't like to use it, is because it kills crawl-rate for some reason. I have tested this extensively and pre-caffeine blocking a larger portion of a site with robots.txt helped the actual crawl-rate for indexable pages, but the domain got crawled a lot less. This saved a ton of server costs for the client and did not have any noticeable impact on rankings. However, since then we have unblocked, because Google is crawling the site like crazy, which brings me to my third reason.
I do not like blocking anything with the robots.txt unless it is also blocked for users. ie. sites or directories still in ALPHA/BETA, admin pages and so on. There is no reason for anybody but a developer to be poking around in there so I'll block it. This is not something I have tested, but I am happy to let Google run wild and "look under the hood" of my website, as it were. The only reason to block portions of a website with robots.txt is because you don't want google-bot crawling around in there - which brings me to the next question of "why not"? Do you have something to hide? If so what and why? I just don't like having to answer those questions. If you are all white-hat then you shouldn't have a problem sticking with just the 'noindex' tag for pages.
The fourth reason is that it just doesn't work as well. A lot of the time I want to block parts of a directory and not a whole directory. robots.txt is great for limited stuff - but as soon as you go outside the box, you're stuck having to dig out a work-around.
|...the exact same thing happens when you use the 'noindex' tag. |
Really? Across several hundred sites, I've never seen that.
|The only reason to block portions of a website with robots.txt is because you don't want google-bot crawling around in there - which brings me to the next question of "why not"? Do you have something to hide? |
Because there are some places Google-bot doesn't need to be ("why not" is not relevant - if I say I don't want Google in portions of my site, that's the final answer) Shopping cart pages, for example - no reason at all for Google bot to be crawling those.
|Shopping cart pages, for example - no reason at all for Google bot to be crawling those. |
I agree ;) - there is no need for Google to be poking around in there. But nobody would link there either. I was referring to pages with duplicate content.
|Really? Across several hundred sites, I've never seen that. |
Just confirmed it to make sure I hadn't messed it up. The pages found were nowhere in the robots.txt and only had the robots 'noindex' tag in place.
The URLs showed up at the very end of the site:- search.
Redid the search using the inurl:- operator and found loads more. Always at the end though, except for URLs unique to the 'noindex'-ed content.
The functions of robots.txt and of the <meta name="robots" content="noindex"> tag in an HTML page are not at all the same:
A Disallow entry in robots.txt says "Do not fetch URL-paths beginning with this prefix." The purpose, as stated in "A Standard for Robot Exclusion," is to reduce the amount of server bandwidth consumed by Web robots/spiders/crawlers fetching resources from your server. The fetching of any kind of resource -- HTML pages, documents, images, multimedia files, etc. can be controlled with robots.txt exclusions.
A <meta name="robots" content="noindex"> tag in an HTML page says, "Do not include this page in your index or search results." The de-facto purpose of this tag is to prevent the URL for this HTML resource from appearing in search results. Because <meta> tags only have meaning in the context of HTML pages, this tag is only useful for controlling the indexing and listing of HTML documents.
In order for the <meta name="robots" content="noindex"> tag in an HTML page to be processed, the page must be fetched; the robots.txt file must not Disallow any URL-path prefix that matches that page's URL.
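To make that interaction concrete, here is a minimal Python sketch using only the standard library. The robots.txt content, the /dupes/ path, and the helper name crawler_sees_noindex are made up for illustration; it models a well-behaved crawler in general, not Googlebot's actual implementation.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything under /dupes/ is off-limits to crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dupes/
"""

# Hypothetical duplicate page carrying a noindex meta tag.
PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def crawler_sees_noindex(robots_txt, url_path, html):
    """True only if a compliant crawler may fetch the page AND the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("Googlebot", url_path):
        # The page is never fetched, so its meta tag is never read.
        return False
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(crawler_sees_noindex(ROBOTS_TXT, "/dupes/page.html", PAGE))     # blocked path
print(crawler_sees_noindex(ROBOTS_TXT, "/articles/page.html", PAGE))  # crawlable path
```

The same page gives a different answer depending only on whether its path falls under a Disallow prefix, which is why combining the two methods on one URL defeats the noindex tag.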
The interaction of these two very different functions -- and the failure to recognize them as different functions used for different purposes -- tends to lead to a lot of confusion.
Throw in the semi-proprietary HTTP response header "X-Robots-Tag: noindex" recognized by some robots, and it only gets more confusing. The server can be configured to send this header instead of including the <meta name="robots" content="noindex"> tag on HTML pages, and this is particularly useful when the page or object is not an HTML document -- PDF, MS Word and Excel documents, images, and multimedia files, for example.
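As a rough sketch of the header approach, the Python function below picks extra response headers by file extension. The function name extra_headers and the extension list are assumptions for illustration; in practice this is usually done in the web server's own configuration rather than in application code.

```python
import os

# Hypothetical set of non-HTML file types that should stay out of the index.
NOINDEX_EXTENSIONS = {".pdf", ".doc", ".xls"}


def extra_headers(path):
    """Return extra HTTP response headers for the resource at `path`."""
    extension = os.path.splitext(path)[1].lower()
    if extension in NOINDEX_EXTENSIONS:
        # Same effect as the meta tag, but usable for non-HTML resources.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(extra_headers("/docs/report.pdf"))
print(extra_headers("/index.html"))
```

As with the meta tag, the header is only seen if the URL is actually fetched, so the resource must not be disallowed in robots.txt.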
Thanks; very helpful.
< moved from another location >
I rarely if ever use the robots file. I have just started looking at them.
If you Disallow a page in robots it would seem that page could still acquire pagerank? Will that page be able to pass pagerank to pages it links to? I have looked around the web and have seen arguments for both sides.
[edited by: Robert_Charlton at 9:33 pm (utc) on Aug 29, 2010]
A page blocked by robots.txt can neither acquire PageRank nor pass it on, since it's not spidered by Google.
If you want to keep a page out of the index but allow it to acquire and pass on PageRank, use the robots meta tag, on a per page basis, in this format...
<meta name="robots" content="noindex,follow">
Even though follow is the default behavior, I'm including it above just to remove ambiguity.
As jdMorgan points out, do not use both the meta robots tag and robots.txt, for reasons that he explains.
I guess for the most part I just don't want to waste pagerank. Let's say I have a homepage with some PR (page A), and it has two outgoing links, to page B and page C. If page C is disallowed in robots.txt, does this mean page B is getting less juice than if page A only linked to page B?
Page C is an "unsubscribe" page. While it doesn't need to be in the index, losing pagerank on it is a bigger issue.
|A page blocked by robots.txt can neither acquire PageRank nor pass it on, since it's not spidered by Google. |
Some time ago, in an interview with Eric Enge, Matt Cutts said the following:
"Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results."
My understanding is that:
a) A page blocked by robots.txt can acquire PR but cannot pass it, as it is not crawled (hence its outbound links are not seen by Google). Such a page therefore stops PR flow.
b) A page with meta noindex can both acquire and pass PageRank, but the page itself will not be shown in Google's index. PR still flows.
c) For a page not to acquire PR, use "nofollow" on links to that page, in which case the PR that would have gone to that page goes into a "black hole". The page does not acquire PR from the originating page, but PR can still flow from it if it acquires PR from some OTHER linking page where the link was not "nofollow".
So to answer romerome's question, my understanding is that the PR juice to page B will be unchanged whether you disallow page C in robots.txt or not.
Please someone correct me if my understanding is wrong.
[edit] Corrected point c) above - believe PR can be passed FROM page that was nofollowed [/edit]
[edited by: aakk9999 at 11:24 pm (utc) on Aug 29, 2010]
|I'm currently working with a site that has a large quantity of duplicate content. A canonical tag has been implemented on the original page, and Meta noindex tags have been implemented on the duplicates. |
My opinion is that the above is not quite right. If the original page has a canonical tag, but the duplicate pages do not have a canonical tag (pointing to the original page) and instead only have meta noindex, then you are not consolidating PR between the original page with the canonical URL and its duplicates.
The duplicate pages should have a canonical tag pointing to the original page's URL. The original pages may or may not have a canonical tag (I usually put one on original pages too, pointing to their own URL, and it seems to do no harm).
Personally I do not think you need noindex on duplicate pages if you have implemented the canonical tag on these pages correctly, as they will drop from the index anyway when the canonical is processed by Google.
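For illustration, a small Python sketch of what each duplicate page would carry in its <head>. The helper name canonical_link and the example.com URLs are hypothetical; only the link rel="canonical" markup itself is the standard part.

```python
from html import escape


def canonical_link(original_url):
    """Build the <link> element a duplicate page puts in its <head>."""
    return '<link rel="canonical" href="%s">' % escape(original_url, quote=True)


# Hypothetical duplicates (sorted or tracked variants) of a single original page.
ORIGINAL = "https://www.example.com/widgets"
DUPLICATES = [
    "https://www.example.com/widgets?sort=price",
    "https://www.example.com/widgets?ref=newsletter",
]

for duplicate in DUPLICATES:
    # Every duplicate points at the same original URL.
    print(duplicate, "->", canonical_link(ORIGINAL))
```

The original page can emit the same tag pointing at its own URL, which matches the self-referencing setup described above.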
|If you Disallow a page in robots it would seem that page could still acquire pagerank? Will that page be able to pass pagerank to pages it links to? I have looked around the web and have seen arguments for both sides. |
(This is what I'd typed up before I saw that aakk9999 had posted. I'm going to go ahead and post as is.)
Assuming that the links to B & C are the only two links from A...
No matter what you do, Google will split the PageRank between those two links... in this case sending 50% to each.
Taking it further... assuming plain vanilla links, if you had 3 links from A to 3 pages, B, C, and D, each would get 1/3 of the top down PageRank from A. If you had "n" links from A, the PageRank would be divided up n ways, and each page would receive 1/n of the PageRank distributed from A.
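The 1/n split can be written down in a couple of lines of Python. This is only the simplified model described above (plain links, damping ignored); the function name share_per_link is made up for the example.

```python
from fractions import Fraction


def share_per_link(outgoing_links):
    """Fraction of a page's distributable PageRank that each of its links receives."""
    if outgoing_links < 1:
        raise ValueError("page must have at least one outgoing link")
    return Fraction(1, outgoing_links)


print(share_per_link(2))  # links to B and C
print(share_per_link(3))  # links to B, C and D
```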
But, looking at the example of just B & C...
...if you disallow C in robots.txt, that PageRank will go no further, no matter what kind of linking you set up from C to the rest of the site. You will have lost the use of that PageRank, because Google will not be spidering C, and will not therefore know of any links from C in order to follow them.
Similarly, if you used the rel="nofollow" attribute on the link from A to C, page C will also lose the PageRank as distributed from A. The PageRank will be divided up as before, ie, between B and C, but rel="nofollow" effectively creates a PageRank "black hole" on the link to C. The PageRank goes into the black hole to C, but it doesn't come out.
Now, other links to C from other pages on the web, or from other pages on your site, might also transmit PageRank to C. But if those links are also nofollowed on your site, then those too would create PageRank black holes, and the fractional PageRank from each source page would be thrown away. If you do that very many times, you will have lost PageRank that possibly could have been helpful elsewhere.
Suppose, though, that you did nothing, and just let page C be crawled and indexed. B and C would still be splitting the PageRank from A... but under this scenario a link or links from C could transmit the PageRank that C accumulates to other pages. You could recirculate the PageRank throughout your site. So, you might choose to link back to A from C, or you might choose to link to other pages within the site. Some PageRank will be lost due to a damping factor inherent in the PageRank algorithm, but most will be transmitted through links from C and then throughout the site, depending on navigation setup. While page C would be spidered and it would be in the index, it's easy enough to set it up so that it won't rank for anything likely to be searched.
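A toy Python model can show the difference between the two scenarios. It uses the simplified textbook formula PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)) with d = 0.85 on a made-up three-page site where B links back to A. All of that is an assumption for illustration, not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate the simplified PageRank formula over a dict of page -> outgoing links."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each source page splits its rank evenly across its outgoing links.
            inbound = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - damping) + damping * inbound
        ranks = new_ranks
    return ranks


# C is disallowed in robots.txt: its outgoing links are never seen.
blocked = pagerank({"A": ["B", "C"], "B": ["A"], "C": []})
# C is crawlable and links back to A: its PageRank recirculates.
crawled = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})

for page in "ABC":
    print(page, round(blocked[page], 3), round(crawled[page], 3))
```

In both runs B and C receive equal shares from A, so blocking C does not hand anything extra to B. What changes is that, in the crawled case, C's link back to A feeds PageRank back into the site, so A and B end up with more than in the blocked case.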
If you absolutely don't want page C or any reference to page C to appear in the serps, and there are sometimes reasons for this, you should use the noindex robots meta tag...
<meta name="robots" content="noindex,follow">
As I noted, the "follow" attribute, which is default behavior, allows PageRank to recirculate from C to other pages throughout your site. Note that the noindex,follow robots meta tag does not create a PageRank black hole.
The noindex meta tag is my method of choice for keeping user accessible pages out of the index... but I use noindex only if I want to hide a page from searchers. Otherwise, I let Google crawl and index the page. Given that Google has set things up so that there are PageRank black holes, there's no PageRank loss caused by letting Google crawl the page. As noted above, never use the robots meta tag in combination with robots.txt.
The challenge, of course, is to avoid very many unimportant links from home or from high up in your nav hierarchy, because homepage links to unimportant pages will be diverting PageRank from higher priority areas.
You can use iframes for some of your footer links, as has often been discussed here, or you can take it on some degree of faith that Google has figured it out enough that Google understands that these links are unimportant and doesn't drain PageRank off from more important navigation.