Robots tip: crawlers cache your robots.txt, so update it at least a day before adding content that it disallows.
[twitter.com...]
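To see what that caching means in practice, here is a minimal sketch of a polite crawler that fetches robots.txt once and reuses the cached directives, using Python's standard urllib.robotparser. The example.com URL and the one-day TTL are illustrative assumptions, not anyone's actual setup:

```python
# A minimal sketch of polite robots.txt caching (assumed URL and TTL).
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 60 * 60   # Google documents caching robots.txt for up to ~24h

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()                   # one HTTP fetch; directives now cached in memory
fetched_at = time.time()

def can_fetch(url, agent="Googlebot"):
    """Check the cached rules, re-fetching only when the copy is stale."""
    global fetched_at
    if time.time() - fetched_at > ROBOTS_TTL:
        rp.read()
        fetched_at = time.time()
    return rp.can_fetch(agent, url)

print(can_fetch("https://www.example.com/private/report.pdf"))
```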
Do you really want Googlebot crawling all of those documents and displaying URI-only listings?
I have a Disallow prefix in my robots.txt.
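As an aside for readers following along: a Disallow entry is a plain URL-prefix match. A minimal sketch with Python's stdlib parser; the /docs/ rule and paths are made up, not the poster's actual file:

```python
# Sketch of Disallow prefix matching (the rule and paths are illustrative).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /docs/",
])

# Disallow is a simple URL-prefix match: everything under /docs/ is blocked.
print(rp.can_fetch("Googlebot", "/docs/report-2010.pdf"))    # False
print(rp.can_fetch("Googlebot", "/docs-public/index.html"))  # True: prefix differs
```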
I don't see Googlebot crawling documents that are disallowed in robots.txt.
If they don't crawl them, why are there so many URI-only listings when performing site: searches?
No. The URLs disallowed in robots.txt are not crawled.
Definition: crawl = request the file from the server. Only server logs can tell you what files were crawled.
URI-only listings are not evidence that the document was crawled, only that the existence of the URL is known to Google.
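That definition is testable against your own access log. A rough sketch, assuming the common combined log format; the log filename and the /docs/ prefix are placeholders:

```python
# Sketch: scan a combined-format access log for Googlebot requests to a
# disallowed prefix. Only a hit here proves the URL was actually crawled.
import re

DISALLOWED_PREFIX = "/docs/"
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" .* "([^"]*)"$')

with open("access.log") as log:
    for line in log:
        m = REQUEST.search(line.rstrip())
        if not m:
            continue
        path, user_agent = m.groups()
        if "Googlebot" in user_agent and path.startswith(DISALLOWED_PREFIX):
            print("disallowed URL was actually crawled:", path)
```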
If they are not crawled, what is the proper terminology for what Googlebot does when it requests the robots.txt file and acts on its directives?
I need more literal definitions of crawling, indexing, and parsing.
When a bot crawls a robots.txt file, particularly Googlebot, what is it doing with the Disallow entries?
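For a well-behaved crawler, the Disallow entries become a filter that sits in front of the fetch queue: disallowed URLs are never requested at all. A minimal sketch, with made-up rules and URLs:

```python
# Sketch: the Disallow entries compiled into a pre-fetch filter. URLs that
# fail the check are never requested, though they can still surface as
# URI-only listings if Google finds links to them elsewhere.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

frontier = ["/index.html", "/private/salary.pdf", "/about.html"]
for url in frontier:
    if rp.can_fetch("Googlebot", url):
        print("fetch:", url)   # crawled: actually requested from the server
    else:
        print("skip: ", url)   # never requested; the URL is merely known
```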
URI-only listings...
If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.
GoogleGuy: Since you've asked in the past for suggestions for improving Google's SERPs, I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.
Do you ever use "noindex, follow"?
I agree. A robots.txt file is like leaving the curtains open: anyone can read it and see exactly which URLs you are trying to hide.
John Mueller: It’s always a good idea for your XML Sitemap file to include all pages which you want to have indexed. If you have pages such as tag or archive pages which you prefer not to have indexed, it’s recommended to add a “noindex” robots meta tag to the pages (and of course, not to include them in the Sitemap file).
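A rough sketch of that advice in code, building a Sitemap only from pages meant for the index; the page list and noindex flags are invented for illustration:

```python
# Sketch of Mueller's advice: the Sitemap lists indexable pages only, and
# tag/archive pages flagged noindex are left out entirely.
from xml.sax.saxutils import escape

pages = [
    ("https://www.example.com/", False),
    ("https://www.example.com/article-1", False),
    ("https://www.example.com/tag/widgets", True),      # noindex: omit
    ("https://www.example.com/archive/2010-05", True),  # noindex: omit
]

entries = "\n".join(
    f"  <url><loc>{escape(loc)}</loc></url>"
    for loc, noindex in pages if not noindex
)
print('<?xml version="1.0" encoding="UTF-8"?>\n'
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
      f"{entries}\n"
      "</urlset>")
```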
What exactly happens during a crawl of this website?
I think those URI-only entries are black holes for crawl equity. I don't want the bot wasting its resources referencing 60,000 URIs, I really don't. I don't even want the bots to know those URIs exist. No, I want to grab that bot by the balls and send it on a pre-planned crawling adventure.
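One hedged sketch of regaining that control: let the bot fetch each unwanted URL once so it can see a noindex, served here as an X-Robots-Tag header from a plain stdlib WSGI app (the /docs/ prefix is an assumption). Note this only works if the URLs are not also disallowed in robots.txt, because a blocked URL is never fetched and the noindex is never seen, which is exactly why blocked pages linger as URI-only listings:

```python
# Sketch: serve "noindex" via the X-Robots-Tag header so the URLs drop out
# of the index after one fetch. Plain stdlib WSGI; /docs/ prefix assumed.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if environ["PATH_INFO"].startswith("/docs/"):
        # Crucially, these URLs must NOT be disallowed in robots.txt:
        # a blocked URL is never requested, so the noindex is never seen.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<html><body>document body</body></html>"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```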