how to block ALL bots?
Makaveli2007
2:44 pm on Oct 19, 2008 (gmt 0)

Hello,

I'm currently using a robots.txt file with

user-agent: *
disallow: /

to block all robots (this was suggested to me - I'm a complete beginner at this!) from viewing my site (I'm just testing a design and asking other people for feedback, but I don't want the current content to be indexed!)

Is this enough or do I need to create an .htaccess file to be on the safe side?

Where exactly does the robots.txt file have to be located? I put one in the root of my account (/myusername/home) and one in the public_html folder.

I assume the root (the one from where all other folders start) is the only place where I need this file, and the public_html folder doesn't need to contain it? I was a bit confused, because I also have an add-on domain and was wondering: if I put the robots.txt in the root, wouldn't that block bots from viewing the add-on domain, too?

(It's not a problem if bots cannot view the add-on domain, because I don't have a website up at it... but it just confused me a lot!)

thanks for helping me learn about this tech stuff :)

 

jdMorgan
3:15 pm on Oct 19, 2008 (gmt 0)

Your robots.txt file must be a plain-text file located at http://www.example.com/robots.txt, and to suit your stated needs, must contain exactly this, including the blank line after each policy record:
User-agent: *
Disallow: /


Note that it is a risk to change the casing or spacing, or anything else in such a file -- robots vary widely in their "flexibility" at reading and interpreting robots.txt files, and you'll do best to stick exactly to the specified format.

This is enough to disallow all robots that respect robots.txt, but there are an awful lot of bad (i.e. malicious) robots which won't pay any attention to your robots.txt file. Some won't fetch it, some will fetch it (so as to look "good" in your log file) and then disregard it, while others will fetch it and use any specifically-disallowed URLs as a "shopping list" to try to grab restricted files.

So yes, you need stronger measures such as .htaccess- or script-based restrictions, because robots.txt does not "block" anything: robots.txt is only a polite request to polite robots to not fetch certain URLs, and has no "force" whatsoever.
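
For illustration only, here is a minimal .htaccess sketch of one such restriction (this assumes Apache with mod_setenvif enabled and .htaccess overrides allowed; "BadBot" and "EvilScraper" are placeholder user-agent strings, not a real block list):

# Deny any request whose User-Agent header matches a placeholder "bad bot" string
SetEnvIfNoCase User-Agent "BadBot" block_bot
SetEnvIfNoCase User-Agent "EvilScraper" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot

A real list has to be built and maintained from your own log files, since bad robots change their user-agent strings constantly.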

If you want to learn quickly, find and read the source information on all Web-related issues, rather than relying exclusively on forums. In this case, the source is the Standard for Robot Exclusion [robotstxt.org] by Martijn Koster, June 1994.

Jim

Makaveli2007
7:52 am on Oct 20, 2008 (gmt 0)

Thanks for the input. I just read the document you linked to.

I'm still a little bit confused by all this server stuff, but I assume that if I type in www.mydomain.com/robots.txt and this shows:

User-agent: *
Disallow: /

then I have done it correctly (put it at the right location on my server) if I want no robot to visit any page of my site, right?

Would it not work if I had spelled it this way (see below) or wouldn't it matter?

user-agent: *
disallow: /

You said there were a lot of malicious robots that I can't really block with this, because it only works with bots that respect it (makes sense). However, I assume Googlebot (and Yahoo's/MSN's bots) does respect this (if I remember correctly, I read a Google spokesperson stating that they do), right?

I'm really just doing this because of the search engines... as I have some random content up there to make the design look somewhat normal (containing some phrases I repeated multiple times, which I don't want to be mistaken for keyword spamming). Will this do in this case, or would I have to do something else (.htaccess?), too?

Do I have to be worried about those bad/malicious robots? If I hadn't put this design (with random content) up, I wouldn't have been worried about it either. I assume I don't have to be worried about it if I make sure that I don't have any files on my server that I don't want to be stolen/made public (such as baby photos of mine or whatever ;))?

thanks for the help!

g1smd
1:16 pm on Oct 21, 2008 (gmt 0)

Stick with the exact spelling, case and syntax, including the blank line after the last record:

User-agent: *
Disallow: /

The other problem with using robots.txt is that if anyone else links to the site, then URLs from the site can show up as URL-only entries in the SERPs.

For test sites I always set up a password using the features in .htaccess combined with a .htpasswd file. That method stops any and all access from everything and everyone, unless you have shared the password with them.
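
As a rough sketch only (the path and realm name are just examples for a typical Apache shared-hosting account), the .htaccess side of that looks something like this:

AuthType Basic
AuthName "Test site - password required"
AuthUserFile /home/myusername/.htpasswd
Require valid-user

The matching .htpasswd file is created with the htpasswd utility (for example: htpasswd -c /home/myusername/.htpasswd someuser) and is best kept outside public_html so it can never be fetched over the web.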

Makaveli2007
1:25 pm on Oct 21, 2008 (gmt 0)

Ah sorry, jdMorgan already told me this, but I didn't really get what "blank line" meant when I read it (English is not my first language, I guess that has something to do with it). I do not have to "create" a blank line, though, or do I? If I do not have anything else in the robots.txt, just the

User-agent: *
Disallow: /

then there's a "blank line" underneath automatically, right? :-) or do I have to "create" that blank line?

I assume what you do for your test sites with .htaccess and .htpasswd isn't something you could explain in just a few lines here, but something I'd have to read up on? Is it something I could learn to do quickly, though, or would it take at least multiple hours? (Not that multiple hours is a lot, but my time is very limited right now. If it's as easy as a few lines of code, I'd do it, of course.)

thanks

jdMorgan
1:54 pm on Oct 21, 2008 (gmt 0)

You must create the blank line after each policy record in robots.txt, even if there is only one policy record, as shown in your example.

Setting up password-protection is not a trivial task, but consider this: Your server configuration is the basis for the success or failure of your site, and spending time reading the documentation at apache.org is an excellent investment in your future...

Jim

g1smd
2:28 pm on Oct 21, 2008 (gmt 0)

Yes. After the last visible character in the file, you need to hit "Enter" at least twice.

pageoneresults
3:39 pm on Oct 21, 2008 (gmt 0)

Yes. After the last visible character in the file, you need to hit "Enter" at least twice.

Wait, y'all are confusing me, and we know how easy that is. ;)

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).

I've never heard of or seen the claim that you must have two blank lines after the last entry in your robots.txt file. I understand that is the syntax for separating multiple records. I cannot locate any specific reference to this "two blank lines" after the last visible character in the file. I'm confused!
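
For reference, the record-separator rule quoted above applies when a file contains more than one record; a sketch of a two-record file (the bot name and path are placeholders, not a recommendation) would read:

User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /

The blank line between the two records is what that sentence of the Standard describes; whether a trailing blank line is also needed after the final record is the point being debated here.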

[edited by: pageoneresults at 3:50 pm (utc) on Oct. 21, 2008]

g1smd
3:48 pm on Oct 21, 2008 (gmt 0)

Hitting Enter Twice after the last visible character doesn't make two blank lines.

It makes one blank line.

Hitting Enter Once after the last visible character makes no blank lines.

jdMorgan
3:55 pm on Oct 21, 2008 (gmt 0)

There aren't two blank lines, just one. g1smd was very specific about hitting 'enter' twice after the last visible character. So the first 'enter' ends the last visible line, and the second one ends a blank line.

The "blank line after the last policy record" requirement is one that I found from experience, and is not explicitly stated in the robots.txt Standard. This was from a mishap I had with (as I recall) a French robot that got confused because the final blank line was not present in my robots.txt file. After corresponding with the robot's administrator, he agreed that the trailing blank line should not be required, but I added one just in case another robot came along that also required it.

For the same reason, I don't put comments on the same lines as directives, and don't fiddle with deviating from the casing or spacing shown in the examples included in the Standard. I like to keep the file as easy to parse as possible because robot authors, like everyone else, make mistakes in both coding and in interpreting the Standard. As a current example, we've got Twiceler from cuil.com, which apparently doesn't correctly implement this requirement, and gets confused and won't spider:
The record starts with one or more User-agent lines...

Jim

pageoneresults
3:57 pm on Oct 21, 2008 (gmt 0)

You must create the blank line after each policy record in robots.txt, even if there is only one policy record, as shown in your example.

I'm still a bit confused after all this time working with robots.txt. I just read the protocols again for about the umpteenth time. I cannot find a single reference to forcing a blank line at the end of the robots.txt file. Why don't they put this stuff in writing for us common folk? You know, like Joe the SEO? :)

<added> Thank you jdMorgan. We were posting at the same time.

Hitting Enter Twice after the last visible character doesn't make two blank lines.

For Joe the SEO, hitting the Enter Key twice creates two blank lines. Or two <p></p> elements. ;)

jdMorgan
4:10 pm on Oct 21, 2008 (gmt 0)

But if Joe the SEO has just finished typing a printable text character and hits Enter twice, the first Enter adds a </p> and only the second one creates a blank line. By being very specific about "after the last visible character," g1smd's post leaves no wiggle-room on this... ;)

Jim

g1smd
4:17 pm on Oct 21, 2008 (gmt 0)

I don't wiggle, not even just a little bit. :-)

pageoneresults
8:16 pm on Oct 21, 2008 (gmt 0)

I don't wiggle, not even just a little bit.

Okay, so y'all have me wiggling a little bit right now. All this time, 10+ years, and I've never seen a reference to having this blank line at the end of the robots.txt file, not once.

So, I've spent some time going through various robots.txt files and can say that I've not found one that is doing this. Why is that? Why does this need to be a secret? ;)

I'm not going to go out and add all those blank lines real soon, as things seem to be working just fine with the way my robots.txt files have been for years. And yes, they do validate. I mean, you guys are throwing something into the mix here that I think few even know about. There is nothing written about it in any of the official documentation either.

jdMorgan
9:02 pm on Oct 21, 2008 (gmt 0)

Like I said above, I ran into a (one, singular) bot that required it, and might do so again. Cheap insurance.

Jim

Makaveli2007
11:43 am on Oct 25, 2008 (gmt 0)

Strange, I just tried hitting enter twice (hehe) to create one blank line, but when I save it that way, the blank line isn't there anymore the next time I open the robots.txt file. Any idea what I could do to prevent this?

I assume that it isn't really necessary for what I'm trying to do anyway, though: I'm really just trying to make sure Googlebot (and possibly Yahoo's and Microsoft's bots) do not crawl my website yet. And I assume that if they had caused that blank line problem before, more people would know about it (as it would affect a lot of guys)... If I understood correctly, this is really just an issue with a few (maybe just one) bots, but not with the bots of the big SEs, right?

thanks

rajkhatri
10:36 am on Nov 9, 2008 (gmt 0)

Add NOINDEX, NOFOLLOW in the meta of each page, or disallow all search engines in robots.txt.
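
That meta tag belongs in the <head> of each page and, for this purpose, would read:

<meta name="robots" content="noindex, nofollow">

Bear in mind that a robot has to be allowed to fetch the page before it can see the meta tag, so combining it with a blanket robots.txt Disallow works somewhat at cross purposes.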

[edited by: jatar_k at 2:54 pm (utc) on Nov. 9, 2008]
[edit reason] no sigs thanks [/edit]

tangor
10:50 am on Nov 9, 2008 (gmt 0)

Strange, I just tried hitting enter twice (hehe) to create one blank line, but when I save it that way, the blank line isn't there anymore the next time I open the robots.txt file. Any idea what I could do to prevent this?

What editor are you using? Notepad or any other ASCII-level editor will insert a hard carriage return with each press of the Enter key. Also make sure your robots.txt is saved as plain ASCII. Makes a difference.
