|Do search engines cache robots.txt?|
Don't worry about the welcome message - I'm not new here, I just changed my name (my previous login name became a keyword for a site of mine!). In fact, if one of the admins could send me a sticky about deleting my old posts (only about four), I'd be grateful.
I'm building a new commercial site (my first, actually) which will launch in about six months' time. I don't want it indexed by any search engines during the build (and material will have to be posted to test it out). If I block everything in my robots.txt, can I just remove the block a month or so before we're ready to go, and it's then SE-friendly? If any of them cache the robots file, or make a note for future reference not to bother, that could be a problem!
The longest time I've ever heard of a 'bot saving robots.txt is a day. So yes, your plan should work.
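For reference, the block-everything robots.txt you'd use during the build is just two lines:

```
User-agent: *
Disallow: /
```

When you're ready to launch, either delete the file or swap `Disallow: /` for an empty `Disallow:` line, which allows everything.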
This query brings up some robots.txt files in the index: allinurl: robots.txt
It seems they are cached by Google.
I've never had a problem swapping in a new robots.txt file, regardless of the old cached version.
Thanks for the response.
Even if they are cached, presumably that doesn't last forever and it's a case of us timing it right? Maybe we remove the robots file a couple of months before launch and point to a holding page of some sort? Or is that bad form?
I can't think of any other way of stalling a listing. We have a handful of good PR sites who want to link to us when we're operational. I suspect we will get fresh-botted quite quickly (maybe within a week) and indexed.
I can always ask the sites not to do the link until we're ready, but you know what people are like. I'd rather have control over the search engines instead.
Google gets a new version of robots.txt every time it requests a number of pages from your site:
|In order to save bandwidth Googlebot only downloads the robots.txt file once a day or whenever we have fetched many pages from the server. |
|Maybe we remove the robots file a couple of months before launch |
I would keep an eye on the deep crawl schedule and remove robots.txt in the crawl before your launch.
A list of robots.txt files indexed by Google [google.com] in PR order - CNN has some funny messages on their banned pages ;)
I typically put a password on my news sites while I'm developing them. That's because not only do I not want robots crawling it, but I also don't want potential customers/users getting confused or turned off and I don't want competitors to know what I'm up to.
It's a pretty simple .htaccess file, typically, so I don't see it as being all that much work.
Perhaps that would better suit your needs and you wouldn't have to worry about your robots.txt file being cached.
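For anyone who hasn't set one up, a minimal sketch of that .htaccess approach, assuming Apache with basic auth and a password file created with the htpasswd utility (the file path here is just an example):

```
# Require a login for the whole directory while the site is in development
AuthType Basic
AuthName "Site under construction"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Bots and casual visitors get a 401 instead of your pages, so there's nothing to crawl, cache, or index.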
Even if a robots.txt file is indexed and cached by Google, I doubt that Googlebot would use it. It would be in the "wrong space" for Googlebot to use. I'd be willing to bet that the 'bot references robots.txt by appending that filename to the domain name being spidered, rather than using any data from Google's index - it just wouldn't make any sense to do so.
It's strange that anyone would link to a robots.txt file, but I guess it could happen occasionally on a site like WebmasterWorld where people cite the robots.txt file here as an example.
TJ, if you remove your disallows from robots.txt a week before you want to go live, you should be fine. You should time that to coincide with the deep crawl. A single "holding page" for test purposes would be a great idea, though, just to boost your confidence.
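The point above about the 'bot building the robots.txt URL itself - domain plus filename, never the indexed copy - can be sketched with Python's standard-library robot parser (the domain and the block-all rules are just assumptions for illustration):

```python
from urllib.robotparser import RobotFileParser

# A crawler derives the robots.txt URL from the domain it is visiting,
# not from any cached/indexed copy (example.com is hypothetical):
robots_url = "https://example.com" + "/robots.txt"

rp = RobotFileParser(robots_url)
# Normally rp.read() would fetch the file over HTTP; here we parse a
# block-all policy directly to keep the sketch self-contained:
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "https://example.com/index.html"))  # prints: False
```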
Just a touch off topic here, but after doing an allinurl: robots.txt Google search, I spotted WW's bot.txt. Terrific listing of all our friendly neighborhood spiders/bots.
Thanks JD, I think I'll do just that. If we're a month late getting decent SERPs, I can live with that...
|It's strange that anyone would link to a robots.txt file |
22 people link to google's robots.txt
One of the sites linking to them proposes a 'forbidden web' of only content banned by robots.txt ;)
A cool idea, but one I feel would not be very popular around here :)