>>>Am I missing out on a lot of traffic by having NO /robots.txt file?
I don't think so. robots.txt is more about trying to PREVENT spiders from crawling.
I'm sure Brett won't mind if you check the robots.txt for WebmasterWorld [webmasterworld.com].
As you can see, it's full of prohibits (Disallow lines).
>>there seems to be a lot of concern in the generation of the robot.txt file.
You can use Notepad; it's pretty straightforward. Just copy the format of the WebmasterWorld one.
Using a robots.txt file I think is really up to the individual site. I use it to disallow images and certain files as well as the ia_archiver.
You can use FP to create the file. Just open a new page, go to the HTML view, and delete everything. Type in your robots.txt information, then save it as robots.txt. When you go to save it, there should be a drop-down menu that by default says HTML file. Select All Files from that menu and type in the full file name and extension: robots.txt.
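For anyone wondering what a file like the one described above (disallowing images and ia_archiver) might look like, here is a minimal sketch. The /images/ path is an assumption for illustration, not anyone's actual setup, and the check uses Python's standard urllib.robotparser, which may not behave exactly like any particular spider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the /images/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /images/
"""


def allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("ia_archiver", "/index.html"),
        ("Googlebot", "/index.html"),
        ("Googlebot", "/images/logo.gif"),
    ]:
        print(agent, path, allowed(agent, path))
```

The first record shuts ia_archiver out of the whole site; the second keeps every other well-behaved robot out of the image directory only.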
So the FP portion of my question has been answered, but that leaves the all-inclusive question:
Is it NEEDED?
I see at alltheweb, only my homepage is indexed, and I'm wondering if this is the reason?
Am I OK without ANY robots.txt?
I looked at the robots.txt for this site....I'm tempted to copy and paste it onto my site for the sheer wisdom of, "He's GOT to know what HE's doing...."
Or is that a dumb thing to do?
I've heard there are a few minor spiders who take the absence of a robots.txt as a universal "disallow," but I don't know if that's still the case.
At any rate, none of the major spiders will let a little thing like a missing robots.txt stand in their way, AFAIK.
>>>Is it NEEDED?
One way of checking is to go to Google, type in a search term,
and for any of the highly placed sites see if they have a robots.txt
by typing www.domainname.com/robots.txt into the address bar.
You'll find loads of top-ranking sites don't have one.
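If you'd rather not type each address by hand, the same check can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes that any HTTP error (such as a 404) means the site has no robots.txt. Other network failures are left to raise.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def robots_url(site: str) -> str:
    """Build the /robots.txt address for a site given as a host name or URL."""
    if "://" not in site:
        site = "http://" + site
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def has_robots_txt(site: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers /robots.txt with HTTP 200."""
    try:
        with urllib.request.urlopen(robots_url(site), timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # 404 and other HTTP errors: treated here as "no robots.txt".
        return False


if __name__ == "__main__":
    # www.domainname.com is the placeholder from the post, not a real target.
    print(robots_url("www.domainname.com"))
```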
There was some debate some time ago as to whether some search engines would fail to index a site if there was no robots text file present. I would speculate that any decent spider would not be thus affected; however, it would be better to include it to be safe.
Try using the search [searchengineworld.com] function of the site and search for "robots text file" without the quotes and you should find a number of threads discussing this very topic, along with the original versions of Brett's Super robots file.
Also, check out the Robots Text Validator [searchengineworld.com].
I did that....found it to be VERY informational. But I was trying to figure out if it was needed now, as in, "2002-robots.txt update".
I get the impression the robots.txt is more for DISallowing any spiders...than for allowing.
(Still doesn't 'splain the fast/alltheweb problem with only my home/index page being included, but perhaps that is a post for another thread...)
Thanks folks for the help! You guys are so fast! I'm still smiling that I found this website! (thanks Brett..!)
It is your choice, but I would include a robots file just to be sure and to protect any parts of your web that need protecting.
I suspect your fast/alltheweb problem is something else which, as you say, is a topic for another thread.
BTW, Welcome to WebmasterWorld.
It is my understanding that you need a robots.txt, even if you only disallow an empty folder (you must have some text in your robots.txt). Most (all) SEs ask for the robots.txt and will get a 404 error if you don't have it.
In my mind a missing robots.txt could be interpreted as a sign of amateurism, but I don't know for sure.
>missing out on a lot of traffic
None. Robots.txt is about preventing spiders, not enabling them.
If you have a large site, you may want to start banning spiders that are abusive.
<Marshall ...I use it to disallow images and certain files as well as the ia_archiver>
Why do you want to disallow the ia_archiver?
What traffic could ia_archiver/alexa possibly send you?
Then why allow them to use your data and content? Why spend your bandwidth and server resources to fuel their business?
I have ia_archiver disallowed as well.
I can't see any reason for letting it waste my bandwidth :)
but hey, if there is let me know :)
Aside from what Brett said about bandwidth, I don't want my sites archived. Some of this is out of copyright concerns, but most of it has to do with why should they benefit from my work. They're not doing me any favors.
Just because you have a robots.txt file doesn't mean robots will obey it. It's like a "keep off the grass" sign - to some it's an invitation.
Here are two sure ways to stop abusive robots:
1) Get the IP addresses of the bad bots and block them at the router or firewall (or with ipchains) - not an option on a hosting package.
2) mod_rewrite - block the user agents. See toolman's close to perfect badbot blocker here [webmasterworld.com]
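To show the idea behind option 2, here is a rough sketch of user-agent matching in Python. To be clear, this is not mod_rewrite and not toolman's rule set: it only illustrates the same pattern-match-and-refuse logic, and the bot names in it are invented placeholders.

```python
import re

# Hypothetical user-agent fragments; NOT taken from the linked thread.
BAD_AGENT_PATTERN = re.compile(r"ExampleBadBot|SiteGrabberX", re.IGNORECASE)


def status_for(user_agent: str) -> int:
    """Return 403 (Forbidden) for blocked agents, 200 otherwise."""
    if BAD_AGENT_PATTERN.search(user_agent or ""):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for("ExampleBadBot/1.0"))
    print(status_for("Mozilla/4.0 (compatible; MSIE 6.0)"))
```

Unlike robots.txt, this kind of check is enforced by the server, so it does not depend on the robot choosing to behave. It is only as good as the user-agent string, though, which a robot can fake.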
Then there is the Scooter issue: it's a Jekyll & Hyde bot, so maybe not a candidate for a permanent ban - here's [webmasterworld.com] a recent example. (I almost fell off my chair hearing that Alta's crawler support team actually did something ;))
Thanks everyone, I guess I need to look into what the different spiders are spidering for, and maybe disallow a few of them.
The one problem with banning alexa/ia_archiver from an htaccess file is this: if you choose to get your site removed from their db, you have to have a robots.txt in place that they can read. If they can't read it, they won't remove you. It's a catch-22.
Just for the record, ia_archiver seems to obey my robots.txt.
There is no reason not to have a very simple robots.txt file. Depending on your web server, if the file does not exist, a spider may get an error code that will prevent it from going further into the site.
I noticed above that some said to use Notepad or FP to set up the robots.txt file. Unfortunately you cannot do this and have a file that validates 100%. You will end up with a warning that reads...
Robots.txt contains DOS or Mac line enders. Although it is so common most spiders will deal correctly with the file, it is not a valid format.
You can use Notepad to set it up, but you will need to upload via FTP using the ASCII mode so that the above DOS or Mac line enders are not present. I learned this after numerous test runs using Brett's robots.txt validator.
You can also use a number of other programs that support UNIX mode.
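If you don't have an editor with a UNIX mode handy, the conversion itself is simple enough to script. The following is a minimal sketch, with function names made up for this example, that turns DOS (CRLF) and old Mac (CR) line enders into Unix (LF) ones.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF (DOS) and bare CR (old Mac) line endings to LF."""
    # CRLF must be handled first so it does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_file(path: str) -> bool:
    """Rewrite the file at path with LF endings; return True if it changed."""
    file = Path(path)
    original = file.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        file.write_bytes(converted)
    return converted != original


if __name__ == "__main__":
    sample = b"User-agent: *\r\nDisallow:\r\n"
    print(to_unix_line_endings(sample))
```

Run normalize_file("robots.txt") on the local copy, then upload it in a way that does not convert the line endings again.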
I've read many topics on this during the past couple of years, and I've seen too many comments about a spider calling the robots.txt file, not finding it, and then moving on its merry way never to return. It's a simple file to set up, and I think all web sites should have one present in their root directory, even if it's just the standard...
(edited by: pageoneresults at 4:36 pm (utc) on Jan. 25, 2002)
One reason to have a short robots.txt is that it may actually take less bandwidth than the 404 page that you send when it doesn't find it. The web spider asks for the robots.txt. When it doesn't find it, your web server sends a 404 error, that actually has a web page attached with error text. That 404 eats bandwidth. It also takes more server processing time, I think, to process a 404 than to send a file that is actually there.
So you are probably better off just having a blank robots.txt than not having one at all.
ScottM, in regard to your Fast issues, have you tried Brett's SIM Spider to make sure that all of your links are indexable? I've found this to be a very useful tool when investigating problems relating to single page indexing and a spider not following links.
Robots.txt won't hurt either. It works for me. Since I have a frames site, I use it to prevent certain pages from being indexed in the SEs. So it can be quite useful.
I have heard that disallowed directories in a robots.txt file will sometimes help bad bots or hackers find the files you don't want public. It's like an open invitation to hack that particular directory, whereas if you don't have anything there, the hacker doesn't know which directory name to hack into.
>One reason to have a short robots.txt is that it may actually take less bandwidth than the 404 page that you send when it doesn't find it.
What Xoc said! When I first experimented with toolman's barbed-wire htaccess™ I also added the guardian cgi to email me 404s. That lasted about a weekend. Indexing issues aside, the robots.txt/htaccess combo is a good site management mechanism.
Some points about robots.txt, with reference to [robotstxt.org...]
1. "[an empty robots.txt] ...will be treated as if it was not present, i.e. all robots will consider themselves welcome"
If robots.txt is not present, then it's pretty much up to the robot to make its own decision on its next action. Supply a robots.txt if you think that a search engine is bypassing your site due to a 404 for /robots.txt, but don't assume a missing robots.txt will stop robots coming in.
2. If the robot does find /robots.txt, then it's important that it is correctly formatted. Again, it's up to the robot to decide what to do next if the robots.txt file contains a syntax error. The author of the robot might give the benefit of the doubt and try to work out the meaning, but more than likely will ignore an incorrect robots.txt and carry on as if it wasn't there. [searchengineworld.com...] shows some common syntax errors.
3. Some syntax points from the earlier posts:
“The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).”
CR (Mac), CRLF (Windows/MS-DOS) and LF (Unix) are all valid line terminators, so Notepad and FP 2002 are valid editors for creating robots.txt with respect to end of line characters.
“The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome”
You are allowed a robots.txt that has no contents.
So, as for the original questions, I would agree with jlara that a blank robots.txt is the thing to do if you are worried that robots might be ignoring you, and you should have no problem creating robots.txt in FP 2002.
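Both points can be tried out with Python's standard urllib.robotparser. This is only a sketch: it shows how that one parser behaves, which is not proof of how any given search engine's spider behaves, and the /private/ path and bot name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots_txt and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    # str.splitlines() accepts CR, CRLF and LF as line terminators.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical rules used only for this demonstration.
RULES = ["User-agent: *", "Disallow: /private/"]
VARIANTS = {
    "LF (Unix)": "\n".join(RULES) + "\n",
    "CRLF (Windows/MS-DOS)": "\r\n".join(RULES) + "\r\n",
    "CR (Mac)": "\r".join(RULES) + "\r",
}

if __name__ == "__main__":
    # An empty file: every robot is welcome everywhere.
    print("empty file:", can_fetch("", "AnyBot", "/page.html"))
    # The same rules with each kind of line terminator.
    for name, text in VARIANTS.items():
        print(name, can_fetch(text, "AnyBot", "/private/page.html"))
```

With this parser, the empty file allows everything, and all three line-terminator variants give the same answer for the disallowed path.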
I wouldn't want to take the chance of having a robots.txt file that wasn't 100% validated...
|DOS Line Enders: |
Another common mistake, is editing your robots.txt in DOS mode. Although it is such a common problem, that we are sure search engines account for it, it is bad practice. Always edit your robots.txt in UNIX mode and upload in ASCII. Many FTP clients will make the transformation to Unix line enders for you seamlessly, but obviously some will not. Kick your text editor into Unix mode before editing a robots.txt file.
It's a very simple step to make sure that the file validates, and there is no reason to neglect having it validate 100%!
|[DOS mode robots.txt] … is bad practice. Always edit your robots.txt in UNIX mode and upload in ASCII.. Many FTP clients will make the transformation to Unix line enders for you seamlessly |
Perhaps I’m being picky again, but I think I disagree with this. RFC 959, in section 3.1.1.1, states
|The sender converts the data from an internal character representation to the standard 8-bit NVT-ASCII representation (see the Telnet specification). The receiver will convert the data from the standard form to his own internal form. |
So, under ASCII mode, the end of lines are converted by the client from CR (Mac) or LF (Unix) to CRLF. Windows/DOS clients need make no conversion, as they already use the Telnet standard end of lines. The file is transferred across the network, and the FTP server makes the translation from CRLF to CR (Mac) or LF (Unix) when it stores it. Again, Windows/DOS servers make no translation.
If you want to guarantee LF for end of line in the uploaded robots.txt you should edit the file locally in Unix mode and upload in binary mode. This will stop both the client and the server making any transformations.
DOS end of lines are 100% valid for robots.txt, but if you feel that some search engines don’t follow the spec and are confused by them, then there is no harm in adding a few extra rules of your own to the robots.txt spec.
I would be surprised, however, if search engines cannot handle DOS end of lines, since a lot of internet protocols use CRLF as the end of line (TELNET, FTP, HTTP).
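The ASCII versus binary behaviour described above can be modelled in a few lines of Python. This is an idealized sketch of the translation steps as explained in this post, not a real FTP client: the function names are made up, and actual clients and servers may differ in the details.

```python
CRLF, LF, CR = b"\r\n", b"\n", b"\r"


def local_to_network(data: bytes, local_eol: bytes) -> bytes:
    """Sender side of ASCII mode: local end of lines become CRLF."""
    if local_eol == CRLF:
        return data  # Windows/DOS already uses the network form
    return data.replace(local_eol, CRLF)


def network_to_local(data: bytes, local_eol: bytes) -> bytes:
    """Receiver side of ASCII mode: CRLF becomes the local end of line."""
    if local_eol == CRLF:
        return data
    return data.replace(CRLF, local_eol)


def ascii_transfer(data: bytes, client_eol: bytes, server_eol: bytes) -> bytes:
    """Model an ASCII-mode upload from client to server."""
    return network_to_local(local_to_network(data, client_eol), server_eol)


def binary_transfer(data: bytes) -> bytes:
    """Binary mode: the bytes arrive exactly as they were sent."""
    return data


if __name__ == "__main__":
    notepad_file = b"User-agent: *\r\nDisallow:\r\n"
    # Windows client to Unix server, ASCII mode
    print(ascii_transfer(notepad_file, CRLF, LF))
    # Same file in binary mode
    print(binary_transfer(notepad_file))
```

In this model a Notepad file sent in ASCII mode from a Windows client ends up with LF on a Unix server, while a file edited in Unix mode and sent in binary mode keeps its LF whatever the server is.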