
Webmaster General Forum

    
robots.txt
tnet21



 
Msg#: 4425093 posted 4:02 pm on Mar 5, 2012 (gmt 0)

I want to temporarily block all spiders from indexing my page, I created a robots.txt file in root directory and added this:

User-agent: *
Disallow: /

Just want to make sure it is correct, and that I can just delete the file and everything will come back to normal etc (without any negative effects in the future etc)

Thank you.

 

lucy24




 
Msg#: 4425093 posted 5:46 pm on Mar 5, 2012 (gmt 0)

Assuming for the sake of discussion that robots read and honor the robots.txt directive? Yes, the wording is correct, and yes, they'll be back tomorrow.

But if you're talking about keeping people out for a day or so while you do maintenance on the site, you might be better off sending a 500-class response. Is the site open to humans?
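For what it's worth, here is a minimal sketch of that approach, assuming a reasonably recent Apache (2.4) with mod_rewrite enabled; the IP address is only a placeholder for your own, so you can still reach the site while everyone else, bots included, gets a 503:

RewriteEngine On
# Placeholder address: replace with your own IP so you can still get in
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.10$
# Everyone else gets a 503 Service Unavailable and rewriting stops
RewriteRule .* - [R=503,L]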

tnet21



 
Msg#: 4425093 posted 5:52 pm on Mar 5, 2012 (gmt 0)

I only have one domain right now, and I need a domain for a WordPress Multisite installation, so I am using it for a while.

I've seen search engines indexing the site (in Google / Bing Webmaster Tools), and I just don't want to end up stuck in some Google Supplemental Index for some strange terms while I work on this site. I would like to make the site what I want it to be first, and then open it up / submit it to the search engines.

(I just copied and pasted some articles from Yahoo, because I need to have some text in the posts in order to get things working.)

g1smd




 
Msg#: 4425093 posted 6:19 pm on Mar 5, 2012 (gmt 0)

I'd put the site behind a password until it is ready to go live.

Look at using an .htpasswd file if you are on an Apache server.
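If it helps, here's a minimal sketch of what that looks like on Apache; the file path and username below are placeholders, not anything specific to your server:

# Create the password file once, e.g. from a shell (placeholder path and user):
#   htpasswd -c /home/example/.htpasswd someuser
# Then put this in the site's .htaccess:
AuthType Basic
AuthName "Private - work in progress"
AuthUserFile /home/example/.htpasswd
Require valid-user
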
tnet21



 
Msg#: 4425093 posted 7:02 pm on Mar 5, 2012 (gmt 0)

Is this something that would be better than the robots.txt method?

I have an empty .htpasswds folder when I log in to the server through FTP. Is this something that can be done through cPanel's "Password Protect Directories"?

Thank you.

lucy24




 
Msg#: 4425093 posted 2:38 am on Mar 6, 2012 (gmt 0)

robots.txt = "I would appreciate it if you would be so good as to wait politely outside, thank you kindly."
htaccess / htpasswd = "Let me introduce you to Mongo the doorman and his brother Gonzo. If they don't like your face, you're not getting in."

If you are mainly concerned about search engines going haywire, there's another thing to consider: the major search engines tend to cache robots.txt rather than checking it at the beginning of each and every visit. That means it can take up to 24 hours for them to assimilate a change. (I found one German site that claimed it can take up to several weeks, but I think they're exaggerating on the "Allow 6-8 weeks for delivery" principle.) This can be a little inefficient if what you want to do is lock the door right now.

phranque




 
Msg#: 4425093 posted 5:57 am on Mar 6, 2012 (gmt 0)

your best option is to use Basic HTTP Authentication:
http://www.w3.org/Protocols/HTTP/1.0/spec.html#AA
it appears that cPanel's "Password Protect Directories" can be used to configure this.
this challenges the bot's request with a 401 Unauthorized status code:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.2

the next best option is to respond to bot requests with a 503 Service Unavailable status code:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.4
this is perhaps the only case in which a Retry-After header is relevant.
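a rough sketch of sending that 503 with a Retry-After header, assuming Apache with mod_headers and mod_rewrite enabled (the 86400-second / one-day value is only an example):

# "always" is needed so the header is also attached to error responses such as the 503
Header always set Retry-After "86400"
RewriteEngine On
# every request gets a 503 Service Unavailable
RewriteRule .* - [R=503,L]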

if you rely on the robots.txt to exclude crawlers then any urls discovered by search crawlers may be indexed with a url-only snippet because the crawler won't make the request and therefore won't see the 401 or 503 (or other) response.

http://developers.google.com/webmasters/control-crawl-index/docs/robots_txt :
A robots.txt request is generally cached for up to one day, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors).

while some people prefer a more frequent fetch, others complain that too many robots.txt requests waste resources.
some crawlers fetch robots.txt prior to every request.

g1smd




 
Msg#: 4425093 posted 7:42 am on Mar 6, 2012 (gmt 0)

I accidentally blocked a folder using robots.txt for less than an hour. Of course, Google came by less than 5 minutes before I fixed it.

Even after I had fixed it on the server, they continued to treat the version with the error as Gospel for the next 24 to 36 hours. During that time, the WMT Fetch as Googlebot function could not be used, as it reported the pages as blocked.

Several months later the WMT Crawl Error reports are still listing hundreds of "denied by robots.txt" entries. The password challenge or 503 method is much better.

lucy24




 
Msg#: 4425093 posted 9:04 am on Mar 6, 2012 (gmt 0)

Gosh, how familiar that sounds. If I misspell a link and discover the error five seconds later, some major search engine will have been there after three seconds. One time I goofed on a set of :: cough, cough :: relative links. Threw the robots into such confusion that I ended up shoving a permanent redirect on the nonexistent URLs. ("Permanent" = I'm leaving it there indefinitely.)

I thought they were going to sulk for years because I accidentally left off the .html from one link in one place for one day. But luckily they went off and found a batch of grossly malformed URLs from some fly-by-night site's search-results page and are now fussing about those instead. Also about the pages that are no longer where the links they discovered in January 2011 say they are. Neither are the pages with the links. Funny how that works. (Bing is much better at this. They only mention links that they've seen on your site within the present geological age.)

But anyway... If a robot picks up even five to ten pages on one visit, how big a strain on your resources is the detour to robots.txt? Mine is a portly 546 bytes. Smaller than, ahem, the 403 page. I think robots just make up that reasoning because they're too lazy to read.
