TheMadScientist - 2:24 am on May 31, 2010 (gmt 0)
What exactly happens during the crawl routines of this website?
I think those URI only entries are black holes for crawl equity. I don't want the bot wasting its resources on referencing 60,000 URIs, I really don't. I don't even want the bots to know that those URIs exist. No, I want to grab that bot by the balls and send them on a pre-planned crawling adventure.
You do understand noindex, even at the header level, does not change the crawling of the pages, right? Those pages are still crawled. They have to be. So it actually changes the bot's behavior to the opposite of what it seems like you're saying it does.
Robots.txt keeps the bots totally off the pages and leaves them uncrawled*, so there are no 'wasted' resources there (crawl or server), but noindex does the opposite: it makes sure the pages are still crawled, just not shown in the results.
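To make the distinction concrete, here's a rough sketch using Python's standard-library robots.txt parser. The domain, rules, and URLs below are made-up examples, not from any real site:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Pretend this is what example.com/robots.txt returns:
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved bot checks the rules BEFORE requesting a URL, so a disallowed
# URL costs the server nothing: no request, no response, no bandwidth.
print(rp.can_fetch("Googlebot", "https://www.example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/public/page.html"))   # True

# A noindex directive (meta tag or X-Robots-Tag header), by contrast, can only
# be seen AFTER the bot has requested and received the page, so the request and
# the server work still happen; only the indexing changes.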
* IOW: unrequested... 'Crawled' can be very misleading, so see below; I think there are some misconceptions about what actually happens when a page is 'crawled', since bots do not actually visit a page, but rather request and receive it.
Maybe you're saying something different and I'm not understanding, but I think it's important to point out that, in my experience, page(s) with noindex on them are still crawled often (at exactly the same frequency as pages of equivalent 'importance').
If you think noindex saves 'crawl budget', could you please check your server logs and confirm whether the crawl frequency of the pages in question differs from comparable pages without noindex? I have a script that logs bot visits to pages, and in my experience bots crawl noindex pages at the same frequency as they crawl indexed pages, so I've seen no 'crawl budget' or server resource savings from the noindex directive in any manner of implementation. The 'presentation' of indexed pages under the site: operator does look much 'cleaner' and better organized, though.
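For anyone who wants to run the same kind of check, here's a rough sketch of the idea; the log path, the combined log format, and the two URL lists are all assumptions you'd swap for your own:

import re
from collections import Counter

NOINDEX_PAGES = {"/tag/widgets", "/print/article-123"}   # hypothetical noindexed URLs
INDEXED_PAGES = {"/article-123", "/category/widgets"}    # hypothetical indexed URLs

# Matches the request, status, and user-agent fields of a combined-format log line.
line_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

hits = Counter()
with open("access.log") as log:               # log location is an assumption
    for line in log:
        m = line_re.search(line)
        if m and "Googlebot" in m.group("ua"):
            hits[m.group("path")] += 1

for label, pages in (("noindex", NOINDEX_PAGES), ("indexed", INDEXED_PAGES)):
    total = sum(hits[p] for p in pages)
    print(f"{label}: {total} Googlebot requests across {len(pages)} pages")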
Anyway, sorry if we're saying the same thing, but I think it should be pointed out that noindex absolutely does not keep bots from crawling the pages it's used on, so it does not save any resources, even when used in the header of the page, mainly because once the bot requests the page it gets sent, header and all... The server does not stop serving the page just because the X-Robots-Tag header is being sent. It sends the page normally; browsers ignore the header because it means nothing to them, and the full page is served to bots when they request it too.
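A toy example of what I mean, as a minimal Python WSGI app (the page content and port are made up): the X-Robots-Tag header goes out with the response, but the full body still goes out right behind it.

from wsgiref.simple_server import make_server

PAGE = b"<html><body>Full page content, sent to browsers and bots alike.</body></html>"

def app(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(PAGE))),
        ("X-Robots-Tag", "noindex"),   # only means something to search engine bots
    ])
    return [PAGE]   # the whole body still goes out with every request

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()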
I suppose the serving of the pages could be adjusted to stop when a bot makes the request, and a 206 Partial Content header or something similar could be served instead, but for the bot to even see the header it must request the page, and once the page is requested the content is served, so there are no resource savings from using noindex.
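For reference, a 206 Partial Content is normally what comes back when the client itself asks for only part of the resource with a Range header, so a sketch like this (host and path are placeholders) shows the mechanism I'm talking about:

import http.client

conn = http.client.HTTPSConnection("www.example.com")
conn.request("GET", "/big-page.html", headers={"Range": "bytes=0-1023"})
resp = conn.getresponse()
print(resp.status)        # 206 if the server honors the range, otherwise 200
print(len(resp.read()))   # at most 1024 bytes when a 206 comes back
conn.close()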
I think it's also important to note: Bots do not actually 'go out and crawl the web'; they do the exact opposite. They stay where they are and cycle through URLs, making requests to servers for the web to be sent to them in whatever order they request the pages. So they can't just 'leave and stop visiting' a page halfway through once they request it, or 'stop the visit to save resources' once they see a noindex tag. When they request a page, the normal action of the server is to send them the whole thing, and that's what it does in most cases.*
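Strip away the details and the 'crawl' is basically this loop: a queue of URLs and ordinary GET requests, with the server sending the whole page back each time. The URLs and user-agent here are placeholders, and a real crawler adds politeness delays, robots.txt checks, link extraction, and so on:

from collections import deque
from urllib.request import Request, urlopen

queue = deque([
    "https://www.example.com/",
    "https://www.example.com/page-1.html",
])

while queue:
    url = queue.popleft()
    req = Request(url, headers={"User-Agent": "ExampleBot/1.0"})
    with urlopen(req) as resp:
        body = resp.read()                    # the server sends the whole thing
        print(url, resp.status, len(body))
    # (link extraction would push newly discovered URLs back onto the queue)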
* There are some exceptions to the 'send them the whole thing' rule (e.g. a 304 Not Modified in some cases for a conditional GET, or when a bot makes a HEAD request rather than a GET and gets back only the response headers**), but for the most part the bots on the web request the page and, just like for you or me using a browser, our servers are kind enough to send them the whole thing.
** Most major SEs do not use HEAD very often in my experience, and that's another conversation, but the reason I've read them state is that skipping the body doesn't save as much, resource-wise, as it would seem, especially since the requests are only for the HTML, not all the graphics and associated resources, so they use a conditional GET and request the whole thing.
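To illustrate the two footnotes, here's a rough sketch of both request styles (host, path, and date are placeholders): a HEAD request returns headers only, while a conditional GET returns a 304 with no body if the page hasn't changed and the full page if it has.

import http.client

conn = http.client.HTTPSConnection("www.example.com")

# HEAD: response headers only, no body comes back.
conn.request("HEAD", "/article-123.html")
head_resp = conn.getresponse()
print("HEAD:", head_resp.status, head_resp.getheader("Last-Modified"))
head_resp.read()   # drain the (empty) body so the connection can be reused

# Conditional GET: the full page only if it changed since the given date.
conn.request("GET", "/article-123.html",
             headers={"If-Modified-Since": "Mon, 31 May 2010 00:00:00 GMT"})
get_resp = conn.getresponse()
print("GET:", get_resp.status, len(get_resp.read()))   # 304 and 0 bytes if unchanged

conn.close()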