| 5:46 pm on Mar 5, 2012 (gmt 0)|
Assuming for the sake of discussion that robots read and honor the robots.txt directive? Yes, the wording is correct, and yes, they'll be back tomorrow.
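For anyone who wants to check the wording of a robots.txt file before uploading it, here is a rough sketch using Python's standard `urllib.robotparser`. The thread never quotes the actual file, so the two-line "block everything" rule set and the example.com URLs below are assumptions, not the poster's file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents; the thread does not quote the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    for path in ("/", "/sample-post/", "/wp-admin/"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com" + path))
```

This only tells you what a robot that honors the file would do. It says nothing about robots that ignore it.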
But if you're talking about keeping people out for a day or so while you do maintenance on the site, you might be better off sending a 500-class response. Is the site open to humans?
| 5:52 pm on Mar 5, 2012 (gmt 0)|
I only have one domain right now, and I need a domain for Wordpress Multisite installation, so I am using it for a while.
I've seen search engines indexing the site (Google / Bing Webmaster Tools), and I just don't want to end up stuck in some Google Supplemental Index for some strange terms etc. while I work on this site. I would like to make the site what I want it to be first, and then open it up / submit it to the search engines.
(I just copied and pasted some articles from Yahoo, because I need to have some text in posts in order to get something to work etc...)
| 6:19 pm on Mar 5, 2012 (gmt 0)|
I'd put the site behind a password until it is ready to go live.
Use an .htpasswd file if you are on an Apache server.
| 7:02 pm on Mar 5, 2012 (gmt 0)|
Is this something that would be better than the robots.txt method?
I have an empty .htpasswds folder when I log in to the server through ftp. Is this something that can be done through cPanel's "Password Protect Directories"?
| 2:38 am on Mar 6, 2012 (gmt 0)|
robots.txt = "I would appreciate it if you would be so good as to wait politely outside, thank you kindly."
htaccess / htpasswd = "Let me introduce you to Mongo the doorman and his brother Gonzo. If they don't like your face, you're not getting in."
If you are mainly concerned about search engines going haywire, there's another thing to consider: The major search engines tend to cache robots.txt rather than checking it at the beginning of each and every visit. That means it can take up to 24 hours for them to assimilate a change. (I found one German site that claimed to take up to several weeks, but I think they're exaggerating on the "Allow 6-8 weeks for delivery" principle.) This can be a little inefficient if what you want to do is lock the door right now.
| 5:57 am on Mar 6, 2012 (gmt 0)|
your best option is to use Basic HTTP Authentication.
it appears that cPanel's "Password Protect Directories" can be used to configure this.
this challenges the bot's request with a 401 Unauthorized status code.
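to make the 401 challenge concrete, here is a minimal sketch in Python's standard `http.server`. it is not how Apache or cPanel implement it, only an illustration of the exchange. the username, password, realm text and port are made up for the example.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials, for illustration only.
USERNAME = "editor"
PASSWORD = "change-me"


def credentials_ok(auth_header, username=USERNAME, password=PASSWORD):
    """Check an Authorization header value against the expected Basic credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    expected = f"{username}:{password}"
    # Constant-time comparison of the decoded "user:password" pair.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not credentials_ok(self.headers.get("Authorization")):
            # Anyone without credentials, robots included, gets the challenge.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Site under construction"')
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"Work in progress\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 on localhost is an arbitrary choice for local testing.
    HTTPServer(("127.0.0.1", 8000), ProtectedHandler).serve_forever()
```

a crawler never sends the Authorization header, so every request it makes ends at the 401 branch and it has no page content to index.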
the next best option is to respond to bot requests with a 503 Service Unavailable status code.
this is perhaps the only case in which a Retry-After header is relevant.
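a 503 with Retry-After might look something like the sketch below, again using Python's standard `http.server` purely as an illustration. the one-hour retry value, the message text and the port are assumptions. on a real site you would more likely configure this in the web server itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 3600  # assumed value: ask clients to come back in an hour


def maintenance_headers(retry_after_seconds=RETRY_AFTER_SECONDS):
    """Build the status code and headers for a temporary-outage response."""
    headers = {
        "Retry-After": str(int(retry_after_seconds)),
        "Content-Type": "text/plain; charset=utf-8",
    }
    return 503, headers


class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = maintenance_headers()
        body = b"Down for maintenance. Please try again later.\n"
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 on localhost is an arbitrary choice for local testing.
    HTTPServer(("127.0.0.1", 8000), MaintenanceHandler).serve_forever()
```

Retry-After is only a hint. how closely any given crawler follows it is up to the crawler.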
if you rely on the robots.txt to exclude crawlers then any urls discovered by search crawlers may be indexed with a url-only snippet because the crawler won't make the request and therefore won't see the 401 or 503 (or other) response.
"A robots.txt request is generally cached for up to one day, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors)."
while some people prefer a more frequent fetch, others complain that too many robots.txt requests waste resources.
some crawlers fetch robots.txt prior to every request.
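the caching behaviour is easy to picture with a toy model. the sketch below is a hypothetical crawler-side cache, not any search engine's real code: the `RobotsCache` class, its parameters and the fake clock are all invented for the example, and the one-day lifetime is taken from the quoted "up to one day".

```python
import time

ONE_DAY = 24 * 60 * 60  # cache lifetime in seconds


class RobotsCache:
    """Hypothetical crawler-side cache: refetch robots.txt only after max_age."""

    def __init__(self, fetch, max_age=ONE_DAY, clock=time.time):
        self._fetch = fetch        # callable returning the current robots.txt text
        self._max_age = max_age
        self._clock = clock        # injectable so the demo can fake the time
        self._text = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._text = self._fetch()
            self._fetched_at = now
        return self._text


if __name__ == "__main__":
    site = {"robots": "User-agent: *\nDisallow: /private/\n"}
    now = [0.0]
    cache = RobotsCache(lambda: site["robots"], clock=lambda: now[0])

    cache.get()                                    # crawler visits and caches the file
    site["robots"] = "User-agent: *\nDisallow:\n"  # owner edits the file afterwards
    now[0] = 5 * 60                                # five minutes later
    print("fresh after 5 minutes:", cache.get() == site["robots"])
    now[0] = ONE_DAY                               # a day later the cache expires
    print("fresh after one day:", cache.get() == site["robots"])
```

a crawler that fetches robots.txt before every request is the same thing with `max_age=0`.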
| 7:42 am on Mar 6, 2012 (gmt 0)|
I accidentally blocked a folder using robots.txt for less than an hour. Of course, Google came by less than 5 minutes before I fixed it.
Even after I fixed it on the server, Google continued to treat the version with the error as gospel for the next 24 to 36 hours. During that time, the WMT Fetch as Googlebot function could not be used, as it reported pages as blocked.
Several months later the WMT Crawl Error reports are still listing hundreds of "denied by robots.txt" entries. The password challenge or 503 method is much better.
| 9:04 am on Mar 6, 2012 (gmt 0)|
Gosh, how familiar that sounds. If I misspell a link and discover the error five seconds later, some major search engine will have been there after three seconds. One time I goofed on a set of :: cough, cough :: relative links. Threw the robots into such confusion that I ended up shoving a permanent redirect on the nonexistent URLs. ("Permanent" = I'm leaving it there indefinitely.)
I thought they were going to sulk for years because I accidentally left off the .html from one link in one place for one day. But luckily they went off and found a batch of grossly malformed URLs from some fly-by-night site's search-results page and are now fussing about those instead. Also about the pages that are no longer where the links they discovered in January 2011 say they are. Neither are the pages with the links. Funny how that works. (Bing is much better at this. They only mention links that they've seen on your site within the present geological age.)
But anyway... If a robot picks up even five to ten pages on one visit, how big a strain on your resources is the detour to robots.txt? Mine is a portly 546 bytes. Smaller than, ahem, the 403 page. I think robots just make up that reasoning because they're too lazy to read.