I'm currently using a robots.txt file with
to block all robots (this was suggested to me; I'm a complete beginner at this!) from viewing my site. I'm just testing a design and asking other people for feedback, but I don't want the current content to be indexed!
Is this enough or do I need to create an .htaccess file to be on the safe side?
Where exactly does the robots.txt file have to be located? I put one in the root of my account (/myusername/home) and one in the public_html folder.
I assume the root (the one from where all other folders start) is the only place where I need this file, and the public_html folder doesn't need to contain it? I was a bit confused, because I also have an add-on domain and was wondering: if I put the robots.txt in the root, wouldn't that block bots from viewing the add-on domain, too?
(It's not a problem if bots cannot view the add-on domain, because I don't have a website up on it, but it just confused me a lot!)
thanks for helping me learn about this tech stuff :)
This is enough to disallow all robots that respect robots.txt, but there are an awful lot of bad (i.e. malicious) robots which won't pay any attention to your robots.txt file. Some won't fetch it, some will fetch it (so as to look "good" in your log file) and then disregard it, while others will fetch it and use any specifically-disallowed URLs as a "shopping list" to try to grab restricted files.
So yes, you need stronger measures such as .htaccess- or script-based restrictions, because robots.txt does not "block" anything: robots.txt is only a polite request to polite robots to not fetch certain URLs, and has no "force" whatsoever.
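To illustrate what "a polite request to polite robots" means in practice, here is a minimal Python sketch using the standard library's urllib.robotparser, which behaves like a compliant robot. The thread does not show the poster's actual file, so the deny-all record below is an assumption (the conventional "User-agent: *" / "Disallow: /" form), and www.example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# Assumed deny-all file; the thread does not show the poster's actual contents.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Answer the question a well-behaved robot asks itself before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant robot declines to fetch; nothing on the server enforces this.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/index.html"))
```

The check runs entirely on the robot's side. A robot that never performs it can still request any URL, which is why server-side restrictions are needed for real protection.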
If you want to learn quickly, find and read the source information on all Web-related issues, rather than relying exclusively on forums. In this case, the source is the Standard for Robot Exclusion [robotstxt.org] by Martijn Koster, June 1994.
I'm still a little bit confused with all this server stuff, but I assume that if I type in www mydomain com/robots.txt and this shows:
then I have done it correctly (put it at the right location on my server) if I want no robot to visit any page of my site, right?
Would it not work if I had spelled it this way (see below) or wouldn't it matter?
You said there were a lot of malicious robots that I can't really block with this, because it only works with bots that respect it (makes sense). However, I assume Googlebot (and Yahoo's/MSN's bots) does respect this (if I remember correctly, I read a Google spokesperson stating that they did), right?
I'm really just doing this because of the search engines, as I have some random content up there to make the design look somewhat normal (containing some phrases I repeated multiple times, which I don't want to be mistaken for keyword spamming). Will this do in this case, or would I have to do something else (.htaccess?), too?
Do I have to be worried about those bad/malicious robots? If I hadn't put this design (with random content) up, I wouldn't have been worried about it either. I assume I don't have to be worried about it if I make sure that I don't have any files on my server that I don't want to be stolen/made public (such as baby photos of mine or whatever ;))?
thanks for the help!
The other problem with using robots.txt is that if anyone else links to the site, then URLs from the site can show up as URL-only entries in the SERPs.
For test sites I always set up a password using the features in .htaccess combined with a .htpasswd file. Using that method stops any and all access from everything, unless you have shared the password with them.
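As a rough idea of what that involves, here is a small Python sketch that builds the text of a basic-authentication .htaccess file. The four directives are standard Apache ones; the .htpasswd path and the realm name are hypothetical placeholders, and the sketch assumes the host allows authentication directives in .htaccess (AllowOverride AuthConfig).

```python
def htaccess_for_basic_auth(htpasswd_path: str, realm: str = "Test site") -> str:
    """Return .htaccess text that requires a valid login for the whole directory."""
    lines = [
        "AuthType Basic",
        f'AuthName "{realm}"',              # label shown in the browser's login prompt
        f"AuthUserFile {htpasswd_path}",    # absolute server path to the password file
        "Require valid-user",               # any user listed in that file may log in
    ]
    return "\n".join(lines) + "\n"


# Hypothetical path: ideally somewhere outside the web-accessible folder.
print(htaccess_for_basic_auth("/home/myusername/.htpasswd"))
```

The .htpasswd file itself is normally created with Apache's htpasswd utility (for example "htpasswd -c /home/myusername/.htpasswd someuser"), or with the password-protection tool in the hosting control panel if one is provided.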
then there's a "blank line" underneath automatically, right? :-) or do I have to "create" that blank line?
I assume that what you do for your test sites using .htaccess and .htpasswd isn't something you could explain in just a few lines here, but something I'd have to read up on? Is that something I could learn to do quickly, though, or would it take at least multiple hours? (Not to say that multiple hours is a lot, but my time is very limited right now. However, if it's as easy as a few lines of code, I'd do it of course.)
Setting up password-protection is not a trivial task, but consider this: Your server configuration is the basis for the success or failure of your site, and spending time reading the documentation at apache.org is an excellent investment in your future...
Yes. After the last visible character in the file, you need to hit "Enter" at least twice.
Wait, y'all are confusing me, and we know how easy that is. ;)
The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).
I've never heard of or seen the claim that you must have two blank lines after the last entry in your robots.txt file. I understand that is the syntax for separating multiple records. I cannot locate any specific reference to this "two blank lines" after the last visible character in the file. I'm confused!
[edited by: pageoneresults at 3:50 pm (utc) on Oct. 21, 2008]
The "blank line after the last policy record" requirement is one that I found from experience, and is not explicitly stated in the robots.txt Standard. This was from a mishap I had with (as I recall) a French robot that got confused because the final blank line was not present in my robots.txt file. After corresponding with the robot's administrator, he agreed that the trailing blank line should not be required, but I added one just in case another robot came along that also required it.
For the same reason, I don't put comments on the same lines as directives, and don't fiddle with deviating from the casing or spacing shown in the examples included in the Standard. I like to keep the file as easy to parse as possible because robot authors, like everyone else, make mistakes in both coding and in interpreting the Standard. As a current example, we've got Twiceler from cuil.com, which apparently doesn't correctly implement this requirement, and gets confused and won't spider:
The record starts with one or more User-agent lines...
You must create the blank line after each policy record in robots.txt, even if there is only one policy record, as shown in your example.
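One way to take the editor out of the equation is to normalise the file with a few lines of Python. This is only a sketch of the defensive habit described above, not something the Standard requires; the function name is made up for illustration, and the sample record is the conventional deny-all form, assumed here because the example is not shown.

```python
def ensure_trailing_blank_line(text: str) -> str:
    """Normalise line endings to LF and end the text with exactly one blank line."""
    body = text.replace("\r\n", "\n").replace("\r", "\n").rstrip("\n")
    # First "\n" terminates the last directive; second "\n" is the blank line itself.
    return body + "\n\n"


record = "User-agent: *\nDisallow: /"      # assumed example record
fixed = ensure_trailing_blank_line(record)
print(repr(fixed))
```

Note that the result ends in two newline characters but contains only one blank line: the first newline ends the "Disallow" line, and the second is the empty line after the record.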
I'm still a bit confused after all this time working with robots.txt. I just read the protocols again for about the umpteenth time. I cannot find a single reference to forcing a blank line at the end of the robots.txt file. Why don't they put this stuff in writing for us common folk? You know, like Joe the SEO? :)
<added> Thank you jdMorgan. We were posting at the same time.
Hitting Enter Twice after the last visible character doesn't make two blank lines.
For Joe the SEO, hitting the Enter Key twice creates two blank lines. Or two <p></p> elements. ;)
I don't wiggle, not even just a little bit.
Okay, so y'all have me wiggling a little bit right now. All this time, 10+ years, and I've never seen a reference to having this blank line at the end of the robots.txt file, not once.
So, I've spent some time going through various robots.txt files and can say that I've not found one that is doing this. Why is that? Why does this need to be a secret? ;)
I'm not going out to add all those blank lines real soon as things seem to be working just fine with the way my robots.txt files have been for years. And yes they do validate. I mean, you guys are throwing something into the mix here that I think few even know. There is nothing written about it in any of the official documentation either.
I assume that it isn't really necessary for what I'm trying to do, though: I'm really just trying to make sure Googlebot (and possibly Yahoo's and Microsoft's bots) do not crawl my website yet. And I assume that if they had had that blank-line problem before, more people would know about it, as it would affect a lot of sites. If I understood it correctly, this is really just an issue with a few bots (maybe just one), but not with the bots of the big SEs, right?
Strange, I just tried hitting enter twice (hehe) to create one blank line, but when I save it that way, the blank line isn't there anymore the next time I open the robots.txt file. Any idea what I could do to prevent this?
What editor are you using? Notepad or any other plain-text (ASCII) editor will insert a hard carriage return each time you press the Enter key. Also make sure your robots.txt is saved as ASCII. It makes a difference.
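If you are unsure what your editor actually saved, you can look at the raw bytes. The following Python sketch reports whether the data is pure ASCII, whether it starts with a UTF-8 byte-order mark (which some editors prepend when saving as UTF-8), and whether it ends with a blank line. The function name and the sample bytes are illustrative assumptions; for a real file you would pass in the bytes read from it in binary mode.

```python
def inspect_robots_bytes(data: bytes) -> dict:
    """Report encoding and trailing-blank-line facts about raw robots.txt bytes."""
    return {
        "is_ascii": data.isascii(),                          # requires Python 3.7+
        "has_utf8_bom": data.startswith(b"\xef\xbb\xbf"),
        # A blank line at the end means two line terminators in a row.
        "ends_with_blank_line": data.endswith((b"\n\n", b"\r\n\r\n", b"\r\r")),
    }


# Assumed sample: a deny-all record saved with Windows (CR/LF) line endings.
sample = b"User-agent: *\r\nDisallow: /\r\n\r\n"
print(inspect_robots_bytes(sample))
```

If the last check comes back false right after you added a blank line, the editor (or an upload tool) is probably stripping trailing whitespace when it saves.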