Forum Moderators: Robert Charlton & goodroi
Robots-tip: crawlers cache your robots.txt; update it at least a day before adding content that is disallowed.
[twitter.com...]
Most major SEs do not use HEAD very often in my experience (that's another conversation), but the reason I've read them give is that requesting the entire resource does not save as much, resource-wise, as it would seem, especially since the requests are only for the HTML, not all the graphics and associated resources. So instead of a HEAD query, they use a conditional GET and request the whole thing.
Matt Cutts: In terms of crawling the web and text content and HTML, we'll typically just use a GET and not run a HEAD query first. We still use things like If-Modified-Since, where the web server can tell us if the page has changed or not. There are still smart ways that you can crawl the web, but HEAD requests have not actually saved that much bandwidth in terms of crawling HTML content, although we do use it for image content.
Webmaster Guidelines: Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.
[Google.com...]
^ Using 304 to manage crawl activities.
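To make the conditional-GET mechanism above concrete, here's a minimal sketch of the server-side logic: if the crawler's If-Modified-Since date is on or after the page's last change, answer 304 with no body, so the unchanged HTML is never re-sent. The timestamp and return shape are illustrative assumptions, not any particular server's API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical last-modified timestamp for the page being served.
PAGE_LAST_MODIFIED = datetime(2024, 1, 15, tzinfo=timezone.utc)

def respond(if_modified_since=None):
    """Return (status, body) for a GET that may be conditional.

    When the client sends If-Modified-Since and the page has not
    changed since that date, answer 304 Not Modified with an empty
    body: the crawler keeps its cached copy and no HTML is re-sent.
    """
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full 200
        if since is not None and PAGE_LAST_MODIFIED <= since:
            return 304, ""
    return 200, "<html>...full page...</html>"

# A crawler that last fetched the page on Feb 1 gets a bodyless 304:
status, body = respond("Thu, 01 Feb 2024 00:00:00 GMT")
```

This is what the Webmaster Guidelines quote means by "saves you bandwidth and overhead": the 200 branch ships the whole page, the 304 branch ships only headers.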
You do understand that noindex, even at the header level, does not change the crawling of the pages, right? Those pages are still crawled; they have to be. So it actually changes the bots' behavior to the opposite of what you seem to be saying it does.
I should probably stay away from these discussions as the improper use of terminology can sure change the meaning of things.
I understand the pages are crawled. My point is that using noindex pulls all references to that page out of the index, unlike robots.txt entries, which may still show up as a URI-only listing (in most instances they do).
Robots.txt keeps the bots totally off the pages and leaves them uncrawled, so there are no 'wasted' resources there (crawl or server). Noindex does the opposite: it makes sure the pages are crawled but not shown in the results.
In the context of keeping document references out of the index, I believe noindex would be the preferred method?
I should note that my solutions may not be the best option when micro-managing bandwidth. I'm looking at this more from a "what's in the index" perspective. I want the bots crawling those pages so they know exactly what to do at each and every level of the site: they'll get either no directives, noindex, or noindex, nofollow. Since the bulk of the sites I work with are under 100,000 documents, this has worked out well so far and produced the desired results.
1. Didn't want to use a robots.txt file and expose the entire structure/dynamics of the site.
2. Didn't want to find URI only listings to documents that shouldn't be easily available to prying eyes.
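The per-level directive scheme described above (no directives, noindex, or noindex, nofollow depending on the section) can be sketched as a header-building helper. The X-Robots-Tag directive strings are real; the section paths here are hypothetical examples, not from the original posts.

```python
# Hypothetical site sections; adjust to the site's actual structure.
NOINDEX_NOFOLLOW_SECTIONS = ("/internal/",)
NOINDEX_SECTIONS = ("/account/", "/cart/")

def robots_header(path):
    """Return the X-Robots-Tag response header (if any) for a URL path.

    Pages still get crawled either way, which is the point: the bot
    sees the directive at every level, completes the link picture,
    and simply keeps the marked pages out of the index.
    """
    if any(path.startswith(p) for p in NOINDEX_NOFOLLOW_SECTIONS):
        return {"X-Robots-Tag": "noindex, nofollow"}
    if any(path.startswith(p) for p in NOINDEX_SECTIONS):
        return {"X-Robots-Tag": "noindex"}
    return {}  # no directives: index and follow normally
```

Because the directive travels in the response header rather than robots.txt, nothing about the site's structure is published in one easily readable file, which addresses point 1 above.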
The distinctions in this type of conversation (resource savings, bandwidth, crawl rate, pages crawled, etc.) are important to draw, IMO. Unless people understand how things work, and the fact that the contents of the resources are actually 'sent' to the requester rather than being 'visited', they really can't figure out what saves resources and what does not.
Interesting! So it works the other way around? All these freakin years and I still have a lot more to learn. This isn't my job! ;)
The noindex tag does not really save any resources or even change crawl frequency in my experience, but it changes what the SEs show in the results, even for a site: search. So it might appear to 'save resources' (crawl budget, bandwidth, server use) unless people understand what is actually happening and why the displayed results change with its use.
I might have to disagree with the "change crawl frequency" part, although it could be a misinterpretation on my part. I recently worked with a WebmasterWorld member, and we saw some unusual after-effects in the scenario above: removing robots.txt entries and implementing noindex, nofollow at the document level. Unfortunately that wasn't the only thing done, so it is very difficult to pinpoint any one cause. I do know there has been an overall positive effect for their website: crawl patterns have normalized, with documented improvements across the board from changing the crawling directives.
IOW: By allowing the pages to be crawled (using noindex rather than disallow) you allow the link, hierarchy and site structure picture to be completed and also 'capture' link weight from any inbound links to those pages from other sites and it gets passed around too, so with robots.txt disallow you have a 'link weight black hole' and with the noindex you complete the picture of the site structure and pass weight from links to those pages back to pages in the index.
Regarding the bandwidth savings discussion, I am wondering what the effects would be if noindexed/nofollowed pages were user-agent-cloaked so that the document contained a head with the necessary meta element but essentially no body content.
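The idea above could be sketched as follows. Everything here is hypothetical: the UA substrings, the page strings, and the approach itself, since serving different content by user agent is cloaking, real crawler verification requires reverse-DNS checks rather than a substring match, and the SEs may treat it as such.

```python
import re

# Hypothetical bot user-agent fragments; a substring match only
# illustrates the idea and is trivially spoofed.
BOT_UA = re.compile(r"Googlebot|bingbot", re.I)

FULL_PAGE = (
    "<html><head><meta name='robots' content='noindex, nofollow'>"
    "</head><body>...lots of markup...</body></html>"
)
STRIPPED_PAGE = (
    "<html><head><meta name='robots' content='noindex, nofollow'>"
    "</head><body></body></html>"
)

def page_for(user_agent):
    """Serve bots only the head (keeping the meta robots element),
    saving the body's bandwidth on pages that won't be indexed anyway."""
    return STRIPPED_PAGE if BOT_UA.search(user_agent) else FULL_PAGE
```

Note that with the X-Robots-Tag response header discussed earlier in the thread, the same noindex, nofollow directive could be delivered with no HTML body at all, without varying the document by user agent.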