TheMadScientist - 2:22 pm on Jun 2, 2010 (gmt 0)
In the context of keeping document references out of the index, I believe noindex would be the preferred method?
Just to keep them out of the index?
I think without too much question, yes.
ADDED NOTE: I do use robots.txt disallow for JS and PHP files that only do processing, but I rarely disallow individual pages and give away where the actual files are that way... I usually disallow directories and have my options set to -Indexes in the .htaccess (turning off the listings the server generates when /dir/ is requested), so there is no reference in robots.txt to the actual location or name of any file in the directory being used for processing.
The JS files are fairly easy to find by viewing the source code of the pages, but the PHP files are not, because I can use random names for them, and without the server-generated index of the directory to give away the actual file names they are very difficult to 'guess' since they can be called anything I feel like.
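If it helps anyone picture it, here's a rough sketch of that setup (the /processing/ directory name is just an example, not my actual layout):

# robots.txt -- disallow the whole directory, never the individual file names
User-agent: *
Disallow: /processing/

# .htaccess -- turn off the server-generated listing for /processing/
Options -Indexes

With the listing off, a request for /processing/ returns a 403 instead of a directory index, so nothing ever reveals the randomly named PHP files inside.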
I understand the pages are crawled. My point is that using noindex pulls all references to that page out of the index, unlike robots.txt entries, which may still show up as URI-only listings (in most instances they do).
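Just so we're all talking about the same thing, by document-level noindex I mean the standard meta tag:

<meta name="robots" content="noindex">

Or, since I mentioned PHP, the header equivalent (Google calls it X-Robots-Tag, and it works for non-HTML documents too; it has to be sent before any output):

<?php header('X-Robots-Tag: noindex'); ?>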
Okay, I thought you probably did, but wanted to make sure other readers did too, because of the terminology differences you point out.
I might have to disagree with the 'change crawl frequency' part, although it could be a misinterpretation on my part. I recently worked with a WebmasterWorld member and we had some unusual after-effects when dealing with the above type of scenario: removing robots.txt entries and implementing noindex,nofollow at the document level. Unfortunately that wasn't the only thing done, so it is very difficult to pinpoint any one change. I do know that there has been an overall positive effect for their website: crawl patterns have normalized, with all sorts of documented improvements across the board from changing crawling directives.
By 'crawl frequency' I mean how often the bot visits those pages...
With robots.txt disallow, bot visits to the disallowed pages = 0.
With noindex, bot visits to the pages not in the index (results) = the same as visits to pages in the index (results), in my experience.
I think this is important to point out because you have noted 'stabilization' and 'improved results', or something to that effect, a couple of times, and I think you're correct in your conclusion that it helps. Here's why...
In a robots.txt 'disallow' situation you (and probably others) have links to those pages, and they accumulate PR. There are links to them somewhere (generally), and since the bots cannot (don't) visit those pages, the links turn into the type of 'black hole' you mention. With noindex the links go somewhere, and what's more important, the links from the noindexed pages 'point back' and complete the site structure and hierarchy picture.
With robots.txt disallow, links on the pages pointing to the disallowed pages still pass PR (AFAIK).
With noindex, links on the pages pointing to the noindexed pages still pass PR, but links on the noindexed pages also pass PR to the pages they link to (possibly at a discounted weight, but it still passes), because even though the noindexed pages are not shown in the results they are present for the calculations.
So, if you have 10 links to 'disallowed' pages, you have 10 links passing PR 'out' to other pages and nothing passed back 'in' to the 'allowed' pages, because the bots cannot get the information from those pages. That means the links on those pages are not present for the calculations, so there is no way to know where to pass the weight back from them.
If you have 10 links to 'noindex' pages, the bots still have access to the information on the noindexed pages and the structure of the links on those pages, which means so do the calculation processes, and when the calculations are performed PR is passed back 'in' through each link pointing to an 'indexed' page.
IOW: By allowing the pages to be crawled (using noindex rather than disallow) you allow the link, hierarchy, and site structure picture to be completed, and you also 'capture' link weight from any inbound links to those pages from other sites, which gets passed around too. With robots.txt disallow you have a 'link weight black hole'; with noindex you complete the picture of the site structure and pass weight from links to those pages back to pages in the index.
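To put some rough numbers on it, here's a toy power-iteration sketch in PHP. It's my own simplification, not Google's actual algorithm: the 0.85 damping factor is just the textbook value, and real systems redistribute dangling-page weight rather than letting it vanish. Pages A and B are the 'indexed' pages, each linking to the other and to page X (the page we hide); X links back to both:

<?php
// Toy PageRank iteration to illustrate the 'black hole' point.
function pagerank(array $links, array $pages, $d = 0.85, $iters = 50) {
    $n  = count($pages);
    $pr = array_fill_keys($pages, 1 / $n);
    for ($i = 0; $i < $iters; $i++) {
        $next = array_fill_keys($pages, (1 - $d) / $n);
        foreach ($links as $from => $tos) {
            if (count($tos) === 0) continue; // dangling page: weight is lost
            foreach ($tos as $to) {
                $next[$to] += $d * $pr[$from] / count($tos);
            }
        }
        $pr = $next;
    }
    return $pr;
}

$pages = ['A', 'B', 'X'];

// robots.txt disallow: X is linked to but never crawled, so its
// outbound links are unknown -- it behaves like a dead end.
$disallowed = ['A' => ['B', 'X'], 'B' => ['A', 'X'], 'X' => []];

// noindex: X is crawled (just not shown in results), so its links
// back to A and B are part of the calculation.
$noindexed  = ['A' => ['B', 'X'], 'B' => ['A', 'X'], 'X' => ['A', 'B']];

foreach (['disallow' => $disallowed, 'noindex' => $noindexed] as $label => $graph) {
    $pr = pagerank($graph, $pages);
    printf("%-8s A=%.3f B=%.3f X=%.3f (A+B=%.3f)\n",
           $label, $pr['A'], $pr['B'], $pr['X'], $pr['A'] + $pr['B']);
}
?>

Running it, A+B comes out around 0.67 in the noindex case versus roughly 0.17 in the disallow case, because with disallow X swallows weight and never passes anything back. That's the 'black hole' in action.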