Forum Moderators: Robert Charlton & goodroi
If there's a way to disallow those "sort" URL patterns in robots.txt, I prefer that approach even more. Googlebot sees that disallow rule and doesn't even request the URL. With a noindex meta tag, it still needs to spider the URL just to read it, and that can be a drain on the crawl budget they've allocated to your domain.
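If the sort links append a query parameter, a wildcard Disallow can cover all of them in one rule. A sketch, assuming the sort URLs use a `?sort=` parameter (your parameter name may differ):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=
```

The second rule catches URLs where `sort` isn't the first parameter in the query string. Google supports the `*` wildcard in robots.txt, though not every crawler does.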
crawl budget
Never heard that term before. Nice one though!
I used it and google crawled my pages anyway. Use noindex. It is better.
Aye, this is where it gets silly. I've always had problems with google indexing robots.txt excluded files so went with noindex meta tags ... but then Bing started indexing all my noindexed pages!
I've added sorting capability to my site which has created many new URLs, but I've noindex'ed all of them.
I have a site with around 550 pages and only 13 of them are indexed by Google. The others are all noindexed and I have not seen any problems from that. The 13 pages that are indexed rank fine.
Due to the number of variants and customisations websites have these days, I doubt google would be stupid enough to think there is something wrong with a site that noindexes even 90% of its content. Even if you only have a "list alphabetically" and a "list by price" button on your product pages, you've just tripled your URL count, and the extra two-thirds are all, essentially, duplicates. noindex is really the best thing.
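For anyone new to this: the standard robots meta tag goes in the `<head>` of each sort-variant page, e.g.:

```html
<meta name="robots" content="noindex">
```

Googlebot still has to fetch the page to see the tag, but the URL gets dropped from the index.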
I used it and google crawled my pages anyway. Use noindex. It is better.
Did G' actually crawl the pages, or did you see the URL in the SERPs?
There is a major difference between these two. If G' finds the URL as a link from somewhere (which is usually the case), then G' will return that URL if your search is specific enough - but without any page title or snippet.
If you see that G' has actually crawled a page (through your server stats), it may well be a problem with your robots.txt.
Robots.txt has been a great help in steering G' away from large portions of 'noindex' sites in order to get more out of the crawl budget.
The "detected" date next to each disallowed URL is updated frequently so Google is requesting these URLs again and again. But I think they just request the URL without reading the content.
What I'm not sure about is whether thousands of URLs disallowed in robots.txt will have a negative effect on your crawling or indexing budget. It might, since the URLs are still being "detected".
I've been using "parameter handling" in WMT to stop Google visiting the disallowed URLs over and over but it did not help.
Disallowing URLs in robots.txt should keep you from spending your crawl budget on pages not in the index, but I personally use noindex,follow (follow being the default, so unnecessary in the actual robots meta tag) because I want credit for any links pointing to those pages and want the content to be known to exist on the site. I usually do this for content that is similar but not identical, rather than for sort pages.
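Worth noting that the same directive can also be sent as an HTTP response header instead of a meta tag, which is handy for non-HTML files (PDFs etc.) where you can't add a tag to the markup. The response would carry something like:

```http
HTTP/1.1 200 OK
X-Robots-Tag: noindex, follow
```

Google documents `X-Robots-Tag` as equivalent to the robots meta tag; how you set the header depends on your server or application.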
I think it comes down to which is more valuable in the situation: crawl budget or inbound links to the pages. Personally, in my situation(s) the pages do attract links and I want the content to be known by G, so I use noindex, which gives them access to the content and keeps some of the link weight being passed into the site.
Noindex or robots.txt are really the best answers to this question.