Am I missing out on a lot of traffic by having NO /robots.txt file?
Or is that a thing of the past?
Also, there seems to be a lot of concern about generating the robots.txt file. Can I use FP 2002 to generate the file/page, or am I asking for trouble?
Thanks for the help in advance.
I don't think so; robots.txt is more about trying to PREVENT spiders.
I'm sure Brett won't mind if you check the robots.txt for WebmasterWorld [webmasterworld.com].
As you can see, it's full of prohibitions.
>>there seems to be a lot of concern in the generation of the robot.txt file.
You can use Notepad; it's pretty straightforward. Copy the format of the WebmasterWorld one.
Using a robots.txt file I think is really up to the individual site. I use it to disallow images and certain files as well as the ia_archiver.
You can use FP to create the file. Just open a new page, go to the HTML view and delete everything. Type in your robots.txt information then save it as robots.txt. To do so, when you go to save it, there should be a drop down menu that by default says html file. Select All Files from that menu and type in the full file name and extension: robots.txt.
Is it NEEDED?
I see at alltheweb, only my homepage is indexed, and I'm wondering if this is the reason?
Am I OK without ANY robots.txt?
I looked at the robots.txt for this site... I'm tempted to copy and paste it onto my site on the sheer wisdom of, "He's GOT to know what HE's doing..."
Or is that a dumb thing to do?
No.
One way to check is to go to Google, type in a search term,
and for any of the highly placed sites see if they have a robots.txt
by typing www.domainname.com/robots.txt into the address bar.
You'll find loads of top-ranking sites don't have one.
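For anyone who would rather script that check than type addresses by hand, here is a minimal Python sketch. The helper names `robots_url` and `has_robots` are made up for illustration, and it assumes the site answers plain HTTP on the bare domain; treat it as a rough check, not a definitive test.

```python
import urllib.error
import urllib.request
from urllib.parse import urlunsplit


def robots_url(domain: str) -> str:
    """Build the address you would type into the browser bar."""
    return urlunsplit(("http", domain, "/robots.txt", "", ""))


def has_robots(domain: str, timeout: float = 10.0) -> bool:
    """Return True if /robots.txt is served with HTTP 200, False on an HTTP error such as 404."""
    try:
        with urllib.request.urlopen(robots_url(domain), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # The server answered, but not with a usable robots.txt (e.g. 404).
        return False
```

Calling `has_robots("www.domainname.com")` for each highly placed site gives the same answer as the manual check, as long as the server is reachable.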
Try using the search [searchengineworld.com] function of the site and search for "robots text file" without the quotes, and you should find a number of threads discussing this very topic along with the original versions of Brett's Super robots file.
Also, check out the Robots Text Validator [searchengineworld.com].
Onya
Woz
I did that... found it to be VERY informative. But I was trying to figure out if it was needed now, as in, "2002 robots.txt update".
I get the impression robots.txt is more for DISallowing spiders than for allowing them.
(Still doesn't 'splain the fast/alltheweb problem with only my home/index page being included, but perhaps that is a post for another thread...)
Thanks folks for the help! You guys are so fast! I'm still smiling that I found this website! (thanks Brett..!)
In my mind a missing robots.txt could be interpreted as a sign of amateurism, but I don't know for sure.
Which leads to the issue about what is the point of having a site Terms Of Use which states "can't download, store, save, redistribute..." if these outfits don't obey it. I didn't ask them to spider my site. Aren't they bound by the Terms Of Use the same as a visitor? As far as I am concerned, they are, whether they visit in person or via spider/robot.
Here are two sure ways to stop abusive robots:
1) Get the IP addresses of the bad bots and block them at the router or firewall (or ipchains). This is not an option on a hosting package.
2) mod_rewrite: block the user agents. See toolman's close-to-perfect badbot blocker here [webmasterworld.com].
Then there is the Scooter issue: it is a Jekyll & Hyde bot, so maybe not a candidate for a permanent ban. Here's [webmasterworld.com] a recent example. (I almost fell off my chair hearing that Alta's crawler support team actually did something ;))
Robots.txt contains DOS or Mac line enders. Although it is so common most spiders will deal correctly with the file, it is not a valid format.
You can use Notepad to set it up, but you will need to upload via FTP in ASCII mode so that the DOS or Mac line enders mentioned above are not present. I learned this after numerous test runs with Brett's robots.txt validator.
You can also use a number of other programs that support UNIX mode.
I'm not sure if having a robots.txt file in place gives you any advantage other than blocking those spiders that obey the directives. I keep one present in all web site directories that I manage just to minimize the 404 errors and to keep them out of certain sub directories like css and javascript.
I've read many topics on this during the past couple of years and I've seen too many comments about a spider calling the robots.txt file, not finding it, and then moving on its merry way never to return. It's a simple file to set up and I think all web sites should have one present in their root directory, even if it's just the standard...
User-agent: *
Disallow:
(edited by: pageoneresults at 4:36 pm (utc) on Jan. 25, 2002)
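To see how a parser reads that standard two-line file, here is a small sketch using Python's standard `urllib.robotparser`. The helper name `allowed`, the agent name `ExampleBot`, and the path are made up for illustration, and real spiders may of course behave differently from this one parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Ask Python's robots.txt parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The "standard" file: an empty Disallow means nothing is off limits.
standard = "User-agent: *\nDisallow:\n"

# A completely empty file, for comparison.
blank = ""

# A file that shuts every obedient robot out of the whole site.
closed = "User-agent: *\nDisallow: /\n"

for name, text in [("standard", standard), ("blank", blank), ("closed", closed)]:
    print(name, allowed(text, "ExampleBot", "/page.html"))
```

With this parser the standard file and the blank file both let the robot in, and only the `Disallow: /` file keeps it out.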
So you are probably better off just having a blank robots.txt than not having one at all.
What Xoc said! When I first experimented with toolman's barbed-wire htaccess™ I also added the guardian cgi to email me 404s. That lasted about a weekend. Indexing issues aside, the robots.txt/htaccess combo is a good site management mechanism.
1. "[an empty robots.txt] ...will be treated as if it was not present, i.e. all robots will consider themselves welcome"
If robots.txt is not present then it's pretty much up to the robot to make its own decision on its next action. Supply a robots.txt if you think that a search engine is bypassing your site due to a 404 for /robots.txt, but don't assume a missing robots.txt will stop robots coming in.
2. If the robot does find /robots.txt then it's important that it is correctly formatted. Again, it's up to the robot to decide what to do next if the robots.txt file contains a syntax error. The author of the robot might give the benefit of the doubt and try to work out the meaning, but more than likely the robot will ignore an incorrect robots.txt and carry on as if it wasn't there. [searchengineworld.com...] shows some common syntax errors.
3. Some syntax points from the earlier posts:
“The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).”
CR (Mac), CRLF (Windows/MS-DOS) and LF (Unix) are all valid line terminators, so Notepad and FP 2002 are valid editors for creating robots.txt as far as end-of-line characters are concerned.
“The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome”
You are allowed a robots.txt that has no contents.
So, as for the original questions, I would agree with jlara that a blank robots.txt is the thing to do if you are worried that robots might be ignoring you, and you should have no problem creating robots.txt in FP 2002.
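If you want to know which terminators your editor actually wrote, a few lines of Python will count them. This is only a sketch; the function name `line_endings` is made up for illustration, and it works on the raw bytes of the file.

```python
def line_endings(data: bytes) -> dict:
    """Count CRLF (Windows/MS-DOS), bare CR (Mac) and bare LF (Unix) terminators."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the terminators that stand on their own.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


# Example: a file saved by a Windows editor.
print(line_endings(b"User-agent: *\r\nDisallow:\r\n"))
```

To check a real file, read it in binary mode first, for example `line_endings(open("robots.txt", "rb").read())`, so that Python does not translate the line endings before you count them.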
DOS Line Enders:
Another common mistake, is editing your robots.txt in DOS mode. Although it is such a common problem, that we are sure search engines account for it, it is bad practice. Always edit your robots.txt in UNIX mode and upload in ASCII. Many FTP clients will make the transformation to Unix line enders for you seamlessly, but obviously some will not. Kick your text editor into Unix mode before editing a robots.txt file.
It's a very simple step to make sure that the file validates, and there is no reason to neglect having it validate 100%!
[DOS mode robots.txt] … is bad practice. Always edit your robots.txt in UNIX mode and upload in ASCII. Many FTP clients will make the transformation to Unix line enders for you seamlessly
Perhaps I’m being picky again, but I think I disagree with this. RFC 959, section 3.1.1.1, states:
The sender converts the data from an internal character representation to the standard 8-bit NVT-ASCII representation (see the Telnet specification). The receiver will convert the data from the standard form to his own internal form.
So, under ASCII mode, the end of lines are converted by the client from CR (Mac) or LF (Unix) to CRLF. Windows/DOS clients need make no conversion as they already use the Telnet standard end of lines. The file is transferred across the network and the FTP server makes the translation from CRLF to CR (Mac) or LF (Unix) when it stores it. Again, Windows/DOS servers make no translation.
If you want to guarantee LF for end of line in the uploaded robots.txt you should edit the file locally in Unix mode and upload in binary mode. This will stop both the client and the server making any transformations.
DOS end of lines are 100% valid for robots.txt, but if you feel that some search engines don’t follow the spec and are confused by them, then there is no harm in adding a few extra rules of your own to the robots.txt spec.
I would be surprised, however, if search engines cannot handle DOS end of lines, since a lot of internet protocols use CRLF as the end of line (TELNET, FTP, HTTP).
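For those who do want to guarantee LF in the uploaded file, the "Unix mode locally, binary mode on the wire" route can be sketched in Python with the standard `ftplib`. The helper names `to_unix` and `upload_robots` are made up for illustration, the host, user and password are placeholders you would supply, and it assumes robots.txt belongs in the directory you land in after login.

```python
import io
from ftplib import FTP


def to_unix(data: bytes) -> bytes:
    """Rewrite CRLF (DOS) and bare CR (Mac) terminators as LF (Unix)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def upload_robots(host: str, user: str, password: str,
                  local_path: str = "robots.txt") -> None:
    """Upload robots.txt in binary mode so neither end touches the line endings."""
    with open(local_path, "rb") as f:
        payload = to_unix(f.read())
    with FTP(host) as ftp:
        ftp.login(user, password)
        # storbinary transfers the bytes exactly as given, with no ASCII translation.
        ftp.storbinary("STOR robots.txt", io.BytesIO(payload))
```

Because the conversion happens locally and the transfer is binary, what lands on the server is byte for byte what `to_unix` produced.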