Forum Moderators: Robert Charlton & goodroi
Robots-tip: crawlers cache your robots.txt; update it at least a day before adding content that is disallowed.
[twitter.com...]
Most major SEs do not use HEAD very often in my experience (that's another conversation), but the reason I've read them give is that requesting the entire resource does not save as much, resource-wise, as it would seem, especially since the requests are only for the HTML, not all the graphics and associated resources. So instead of a HEAD query, they use a conditional GET and request the whole thing.
Matt Cutts: In terms of crawling the web and text content and HTML, we'll typically just use a GET and not run a HEAD query first. We still use things like If-Modified-Since, where the web server can tell us if the page has changed or not. There are still smart ways that you can crawl the web, but HEAD requests have not actually saved that much bandwidth in terms of crawling HTML content, although we do use it for image content.
Webmaster Guidelines: Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.
[Google.com...]
^ Using 304 to manage crawl activities.
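To make the conditional-GET mechanism above concrete, here's a minimal sketch of the server-side logic: if the crawler's If-Modified-Since date is on or after the page's last change, answer 304 with no body, so the unchanged HTML is never re-sent. The timestamp and return shape are illustrative assumptions, not any particular server's API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical last-modified timestamp for the page being served.
PAGE_LAST_MODIFIED = datetime(2024, 1, 15, tzinfo=timezone.utc)

def respond(if_modified_since=None):
    """Return (status, body) for a GET that may be conditional.

    When the client sends If-Modified-Since and the page has not
    changed since that date, answer 304 Not Modified with an empty
    body: the crawler keeps its cached copy and no HTML is re-sent.
    """
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full 200
        if since is not None and PAGE_LAST_MODIFIED <= since:
            return 304, ""
    return 200, "<html>...full page...</html>"

# A crawler that last fetched the page on Feb 1 gets a bodyless 304:
status, body = respond("Thu, 01 Feb 2024 00:00:00 GMT")
```

This is what the Webmaster Guidelines quote means by "saves you bandwidth and overhead": the 200 branch ships the whole page, the 304 branch ships only headers.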
You do understand that noindex, even at the header level, does not change the crawling of the pages, right? Those pages are still crawled; they have to be. So it actually changes the bots' behavior to the opposite of what you seem to be saying it does.
I should probably stay away from these discussions as the improper use of terminology can sure change the meaning of things.
I understand the pages are crawled. My point is that using noindex pulls all references to that page out of the index, unlike robots.txt entries, which may still show up as a URI-only listing (in most instances they do).
Robots.txt keeps the bots totally off the pages and leaves them uncrawled, so there are no 'wasted' resources there (crawl or server). Noindex does the opposite: it makes sure the pages are crawled but not shown in the results.
In the context of keeping document references out of the index, I believe noindex would be the preferred method?
I should note that my solutions may not be the best option when micro-managing bandwidth. I'm looking at this more from a "what's in the index" perspective. I want the bots crawling those pages so they know exactly what to do at each and every level of the site: they'll get either no directives, noindex, or noindex, nofollow. Since the bulk of the sites I work with are under 100,000 documents, this has worked out well so far and produced the desired results.
1. Didn't want to use a robots.txt file and expose the entire structure/dynamics of the site.
2. Didn't want to find URI only listings to documents that shouldn't be easily available to prying eyes.
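The per-level directive scheme described above (no directives, noindex, or noindex, nofollow depending on the section) can be sketched as a header-building helper. The X-Robots-Tag directive strings are real; the section paths here are hypothetical examples, not from the original posts.

```python
# Hypothetical site sections; adjust to the site's actual structure.
NOINDEX_NOFOLLOW_SECTIONS = ("/internal/",)
NOINDEX_SECTIONS = ("/account/", "/cart/")

def robots_header(path):
    """Return the X-Robots-Tag response header (if any) for a URL path.

    Pages still get crawled either way, which is the point: the bot
    sees the directive at every level, completes the link picture,
    and simply keeps the marked pages out of the index.
    """
    if any(path.startswith(p) for p in NOINDEX_NOFOLLOW_SECTIONS):
        return {"X-Robots-Tag": "noindex, nofollow"}
    if any(path.startswith(p) for p in NOINDEX_SECTIONS):
        return {"X-Robots-Tag": "noindex"}
    return {}  # no directives: index and follow normally
```

Because the directive travels in the response header rather than robots.txt, nothing about the site's structure is published in one easily readable file, which addresses point 1 above.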
The distinctions in this type of conversation (resource savings, bandwidth, crawl rate, pages crawled, etc.) are important to draw, IMO. Unless people understand how things work, and the fact that the contents of the resources are actually 'sent' to the requester rather than being 'visited', they really can't figure out what saves resources and what does not.
Interesting! So it works the other way around? All these freakin years and I still have a lot more to learn. This isn't my job! ;)
The noindex tag does not really save any resources or even change crawl frequency in my experience, but it changes what the SEs show in the results, even for a site: search. So it might appear to 'save resources' (crawl budget, bandwidth, server use) unless people understand what is actually happening and why the displayed results change with its use.
I might have to disagree with the "change crawl frequency" part, although it could be a misinterpretation on my part. I recently worked with a WebmasterWorld member, and we saw some unusual after-effects in the scenario above: removing robots.txt entries and implementing noindex, nofollow at the document level. Unfortunately that wasn't the only thing done, so it is very difficult to pinpoint any one cause. I do know there has been an overall positive effect for their website: crawl patterns have normalized, with documented improvements across the board from changing the crawling directives.
IOW: By allowing the pages to be crawled (using noindex rather than disallow) you allow the link, hierarchy and site structure picture to be completed and also 'capture' link weight from any inbound links to those pages from other sites and it gets passed around too, so with robots.txt disallow you have a 'link weight black hole' and with the noindex you complete the picture of the site structure and pass weight from links to those pages back to pages in the index.
Regarding the bandwidth savings discussion, I am wondering what the effects would be if noindexed/nofollowed pages were user-agent-cloaked so that the document contained a head with the necessary meta element but essentially no body content.
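The idea above could be sketched as follows. Everything here is hypothetical: the UA substrings, the page strings, and the approach itself, since serving different content by user agent is cloaking, real crawler verification requires reverse-DNS checks rather than a substring match, and the SEs may treat it as such.

```python
import re

# Hypothetical bot user-agent fragments; a substring match only
# illustrates the idea and is trivially spoofed.
BOT_UA = re.compile(r"Googlebot|bingbot", re.I)

FULL_PAGE = (
    "<html><head><meta name='robots' content='noindex, nofollow'>"
    "</head><body>...lots of markup...</body></html>"
)
STRIPPED_PAGE = (
    "<html><head><meta name='robots' content='noindex, nofollow'>"
    "</head><body></body></html>"
)

def page_for(user_agent):
    """Serve bots only the head (keeping the meta robots element),
    saving the body's bandwidth on pages that won't be indexed anyway."""
    return STRIPPED_PAGE if BOT_UA.search(user_agent) else FULL_PAGE
```

Note that with the X-Robots-Tag response header discussed earlier in the thread, the same noindex, nofollow directive could be delivered with no HTML body at all, without varying the document by user agent.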