Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Do search engines cache robots.txt?
trillianjedi

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 163 posted 3:04 pm on Apr 16, 2003 (gmt 0)

Hi,

Don't worry about the welcome message. I'm not new here, I just changed my name (my previous login name became a keyword for a site of mine!). In fact, if one of the admins could send me a sticky about deleting my old posts (only about 4), I'd be grateful.

Question:-

I'm building a new commercial site (my first, actually) which will launch in about six months' time. I do not want it indexed by any search engines during the build (and content will have to be posted to test it out). If I block all robots in my robots.txt, can I just remove the block a month or so before we're ready to go, and will the site then be SE friendly? If any of them cache the robots file, or make a note for future reference not to bother, then that could be a problem!

Thanks,

TJ
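
For reference, the "block all" rule TJ describes, and the open rule that would replace it before launch, can be sketched and checked with Python's standard robots.txt parser. This is only an illustration: the example.com URL and the can_crawl helper are hypothetical, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# During the build: every compliant robot is told to stay out.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Before launch: an empty Disallow value means nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"  # hypothetical URL
    print("during build:", can_crawl(BLOCK_ALL, page))
    print("before launch:", can_crawl(ALLOW_ALL, page))
```

Swapping one file for the other is the whole change; nothing else on the site needs to move.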

 

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 163 posted 3:17 pm on Apr 16, 2003 (gmt 0)

TJ,

The longest time I've ever heard of a 'bot saving robots.txt is a day. So yes, your plan should work.

Jim

Macguru

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 163 posted 3:19 pm on Apr 16, 2003 (gmt 0)

This query brings up some robots.txt files in the index: allinurl: robots.txt

It seems they are cached by Google.

I have never had problems with an old cached version after swapping the robots.txt file.

trillianjedi

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 163 posted 4:32 pm on Apr 16, 2003 (gmt 0)

Thanks for the response.

Even if they are cached, presumably that doesn't last forever and it's a case of us timing it right? Maybe we remove the robots file a couple of months before launch and point to a holding page of some sort? Or is that bad form?

I can't think of any other way of stalling a listing. We have a handful of good PR sites who want to link to us when we're operational. I suspect we will get fresh-botted quite quickly (maybe within a week) and indexed.

I can always ask the sites not to do the link until we're ready, but you know what people are like. I'd rather have control over the search engines instead.

TJ

pixel_juice

10+ Year Member



 
Msg#: 163 posted 4:46 pm on Apr 16, 2003 (gmt 0)

Google gets a new version of robots.txt every time it requests a number of pages from your site:

"In order to save bandwidth Googlebot only downloads the robots.txt file once a day or whenever we have fetched many pages from the server."

- [google.com...]

"Maybe we remove the robots file a couple of months before launch"

I would keep an eye on the deep crawl schedule and remove robots.txt in the crawl before your launch.

List of robots.txt files indexed by Google [google.com] in 'pr' order. CNN has some funny messages on banned pages ;)
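
The refresh rule in the Google quote above (re-download once a day, or after many pages have been fetched) can be sketched as a small crawler-side cache. This is a guess at the general shape, not Google's implementation: the RobotsCache class, the 500-page threshold, and the injected fetch and clock callables are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class RobotsCache:
    """Holds one site's robots.txt and decides when to re-download it."""

    def __init__(
        self,
        fetch: Callable[[], str],          # hypothetical: downloads robots.txt text
        max_age_seconds: float = 86400.0,  # "once a day"
        max_pages: int = 500,              # assumed value for "many pages"
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._max_pages = max_pages
        self._clock = clock
        self._body: Optional[str] = None
        self._fetched_at = 0.0
        self._pages_since_fetch = 0

    def record_page_fetch(self) -> None:
        """Call once for every page downloaded from the site."""
        self._pages_since_fetch += 1

    def get(self) -> str:
        """Return robots.txt, re-downloading if the cached copy is stale."""
        now = self._clock()
        stale = (
            self._body is None
            or now - self._fetched_at >= self._max_age
            or self._pages_since_fetch >= self._max_pages
        )
        if stale:
            self._body = self._fetch()
            self._fetched_at = now
            self._pages_since_fetch = 0
        return self._body
```

Under a rule like this, a changed robots.txt is picked up within about a day of the next visit, which matches jdMorgan's experience earlier in the thread.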

amoore

10+ Year Member



 
Msg#: 163 posted 4:52 pm on Apr 16, 2003 (gmt 0)

I typically put a password on my news sites while I'm developing them. That's because not only do I not want robots crawling it, but I also don't want potential customers/users getting confused or turned off and I don't want competitors to know what I'm up to.

It's a pretty simple .htaccess file, typically, so I don't see it as being all that much work.

Perhaps that would better suit your needs and you wouldn't have to worry about your robots.txt file being cached.
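
To show why this keeps robots out: assuming the .htaccess file enables HTTP Basic authentication (amoore does not say which scheme is used), every request must carry an Authorization header, and a crawler that sends none is refused. A minimal sketch of how a test script could build that header follows; the function name and the credentials are made up for the example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Hypothetical credentials for a development site.
    print({"Authorization": basic_auth_header("devuser", "not-a-real-password")})
```

Because the protection is enforced by the server rather than requested in robots.txt, it does not depend on how long any robot caches that file.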

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 163 posted 4:57 pm on Apr 16, 2003 (gmt 0)

Even if a robots.txt file is indexed and cached by Google, I doubt that Googlebot would use it. It would be in the "wrong space" for Googlebot to use. I'd be willing to bet that the 'bot references robots.txt by appending that filename to the domain name being spidered, rather than using any data from Google's index - it just wouldn't make any sense to do so.

It's strange that anyone would link to a robots.txt file, but I guess it could happen occasionally on a site like WebmasterWorld where people cite the robots.txt file here as an example.

TJ, if you remove your disallows from robots.txt a week before you want to go live, you should be fine. You should time that to coincide with the deep crawl. A single "holding page" for test purposes would be a great idea, though, just to boost your confidence.

Jim
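
Jim's guess that a 'bot finds robots.txt by appending the filename to the domain being spidered, rather than consulting any index, can be sketched as follows. This is an illustration only: robots_url_for is a hypothetical helper and the example URL is invented.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Derive the robots.txt location for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url_for("http://www.example.com/forum/thread.htm?id=163#top"))
```

A crawler working this way always asks the live server for the file, so a copy that happens to sit in a search index plays no part in what gets crawled.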

juniperwasting

10+ Year Member



 
Msg#: 163 posted 7:22 pm on Apr 16, 2003 (gmt 0)

Just a touch off topic here, but after doing an allinurl: robots.txt Google search, I spotted WW's bot.txt. Terrific listing of all our friendly neighborhood spiders/bots.

446user

10+ Year Member



 
Msg#: 163 posted 10:42 pm on Apr 16, 2003 (gmt 0)

Thanks JD, I think I'll do just that. If we're late in getting decent SERPs by a month, I can live with that....

TJ

pixel_juice

10+ Year Member



 
Msg#: 163 posted 11:18 pm on Apr 16, 2003 (gmt 0)

"It's strange that anyone would link to a robots.txt file"

22 people link to google's robots.txt

One of the sites linking to them proposes a 'forbidden web' of only content banned by robots.txt ;)

A cool idea, but one I feel would not be very popular around here :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved