pageoneresults - 10:23 am on Mar 16, 2010 (gmt 0)
Excellent interview with Matt Cutts by Eric Enge!
"... if you are trying to block something out from robots.txt, often times we'll still see that URL and keep a reference to it in our index. So it doesn't necessarily save your crawl budget"
Some may not have caught that little tidbit that floats off to the right of the discussion about KML files. I've been involved in some recent robots.txt discussions and my stance is that you SHOULD NOT use robots.txt to block indexing of content. Google broke the protocol when they decided to show URI-only listings. I've been reading that protocol top to bottom, left to right, etc. to see where it states that a UA can index a URI and display it for specific queries. IT DOESN'T!
So, folks are left with the BEST option, which is to control the indexing and following of content at the page level via either META Robots or X-Robots-Tag (or whatever other methods you've conjured up). X-Robots-Tag seems to be the preferred method amongst some of my high-tech peers; we're also using it for global NoArchive directives.
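For anyone who hasn't set up X-Robots-Tag before, here is a rough sketch of the idea in Python. It assumes a WSGI-style site; the names add_robots_header and demo_app are made up for illustration, and on a real site you would more likely set the header in the web server config. All it shows is that the directive travels as an HTTP response header, so it also works for non-HTML files.

```python
def add_robots_header(app, directives="noindex, noarchive"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header.

    Hypothetical helper for illustration; the directive string is an assumption.
    """
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so only one value is sent.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", directives))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    # Hypothetical stand-in for the real site.
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>thin page</body></html>"]


# The wrapped app is what the server would actually run.
application = add_robots_header(demo_app)
```

The page-level equivalent is a META Robots element in the document head; the header version is handy when you want one rule applied site-wide or to files that have no head to put a tag in.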
Back to this robots.txt and crawl equity. I'm working with a real-world example now. I'll generalize it a bit but it goes like this. Site should ONLY have about 10k pages indexed; these are the final destination pages that have the meat. There is a Disallow for sub-directories which contain content that SHOULD NOT be crawled.
The internal linking structure of the website points to those Disallowed directories. Googlebot crawls the site and continually runs into a large group of URIs that are Disallowed via robots.txt. How many URIs? Oh, about 40k+ that are now URI-only listings.
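To make the mechanics concrete, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are invented for illustration, not taken from the site in question. It only shows the fetch decision a compliant bot makes: a Disallowed URI is never fetched, so the bot never gets to see any page-level NoIndex on it, even though it can still discover the URI through internal links.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, similar in spirit to the setup described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_crawlable(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example URLs are assumptions for the demo.
blocked = "http://example.com/search/widgets"
allowed = "http://example.com/products/widget-1"

print(is_crawlable(ROBOTS_TXT, "Googlebot", blocked))  # False
print(is_crawlable(ROBOTS_TXT, "Googlebot", allowed))  # True
```

Nothing in that check says anything about listing the URI, which is the gap the URI-only results fall through.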
Question: what do you think that does to Crawl Equity? I'd be interested to know your thoughts.
We're going to find out. The Disallows are coming out and we will be implementing page level directives to block indexing. My experience over the years shows me that the SE bots obey META Robots NoIndex, or NoFollow, or both NoIndex, NoFollow. NoIndex removes the page from the index - period. There appear to be no questions there. I see some folks stating otherwise but I've yet to see a real world example.
Note: I see a lot of people who Disallow: /search/ in their robots.txt files; that's like a Milk-Bone® for Googlebot and others. Do a site:example.com/search/ and expand the results. How many URI-only listings do you have? Do a site:example.com/****** for any items listed in robots.txt, expand the results. :(
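If you have a long robots.txt, a throwaway script can build those site: queries for you. This is just a sketch: site_queries is a made-up helper, the sample robots.txt is invented, and it ignores User-agent grouping entirely. You still paste each query into the search box by hand and expand the results yourself.

```python
def site_queries(robots_txt, host):
    """Build one site: query per non-empty Disallow path in robots_txt."""
    queries = []
    for line in robots_txt.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        path = value.strip()
        if field.strip().lower() == "disallow" and path:
            query = "site:" + host + path
            if query not in queries:  # skip duplicates across groups
                queries.append(query)
    return queries


# Hypothetical robots.txt for the demo.
SAMPLE = """\
User-agent: *
Disallow: /search/
Disallow: /tmp/  # scratch area
Disallow:
"""

for q in site_queries(SAMPLE, "example.com"):
    print(q)
```

Disallow paths that use wildcards such as * or $ won't translate cleanly into a site: query, so treat those by hand.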