
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Content of Robots.txt files
You wouldn't believe what some hosts think should be in a Robots.txt file!
AlucardSpyderWriter




msg:1526868
 10:42 pm on Nov 27, 2004 (gmt 0)

As a spider writer I'm very aware of the Robots.txt file: its syntax, how to parse it, and how to follow it. So I might have a better grasp on it than most. But would you believe:

No Robots.txt at all on 25% of all hosts.

30% of the time, when a Robots.txt file was found, it contained nothing but HTML code. No Robots.txt syntax at all.

Nearly 40% of those that had the file, where the file didn't contain HTML, were still so badly written they couldn't be followed.

This is over a few years now, across thousands of hosts. So please, make and check your Robots.txt file.

 

larryhatch




msg:1526869
 10:56 pm on Nov 27, 2004 (gmt 0)

Hello Alu:

Are you referring to 'hosts' (ISPs) or webmasters?
The webmaster determines the contents of robots.txt, and
the host simply stores whatever is sent up, if I have that right.

Before I learned much of anything, I had a robots.txt
file that read something like

disallow:
flakes,weirdos, new-age types ..

I've since put up a better one thanks to the good advice on this forum.

- Larry

AlucardSpyderWriter




msg:1526870
 11:20 pm on Nov 27, 2004 (gmt 0)

I was referring to the webmasters more than the physical hosts, really. I just can NOT believe how many people, servers, hosts, or webmasters think full HTML belongs in the file.

Symbios




msg:1526871
 11:25 pm on Nov 27, 2004 (gmt 0)

I never use them myself unless I want to exclude a folder. If in doubt, leave it out, as it can do more harm than good if you get it wrong.

ps: AlucardSpyderWriter, like the handle. Just keep your eye out for rellikeripmaveht lurking around here.

AlucardSpyderWriter




msg:1526872
 7:13 am on Nov 28, 2004 (gmt 0)

My new favorite is also people using something like -

Disallow: Googlebot
-OR-
Disallow: *

Can we all agree that the asterisk (*) never belongs in this file other than in a "User-agent: *" line?

Also, I feel like I'm getting picky here, what with everything else webmasters have to worry about. But this file is just about all a good spider (and good spider writer) has to go by, other than the robot METAs.
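For anyone who wants to check their own file for the two mistakes above, here is a rough sketch in Python. It only assumes what the original Standard says: a Disallow value is a URL path prefix (so it should start with "/"), an empty value means nothing is off limits, and a robot is named on the User-agent line, not the Disallow line. The function name `check_disallow_values` and the BAD and GOOD samples are made up for illustration. Some crawlers document their own wildcard extensions, so treat the "*" warning as a hint rather than a verdict.

```python
def check_disallow_values(robots_txt):
    """Return (line_number, message) pairs for suspicious Disallow values."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() != "disallow":
            continue
        if value == "":
            continue  # an empty Disallow means "nothing is off limits"
        if not value.startswith("/"):
            problems.append((number, "value is not a URL path: %r" % value))
        elif "*" in value:
            problems.append((number, "'*' is not a wildcard in the original Standard"))
    return problems


# Hypothetical samples: the first repeats the mistakes quoted above,
# the second says the same things the way the Standard expects.
BAD = "User-agent: *\nDisallow: Googlebot\nDisallow: *\n"
GOOD = "User-agent: Googlebot\nDisallow: /\n\nUser-agent: *\nDisallow: /private/\n"

print(check_disallow_values(BAD))   # two problems, on lines 2 and 3
print(check_disallow_values(GOOD))  # []
```

In the GOOD sample, Googlebot is shut out of the whole site with "Disallow: /" under its own User-agent line, and every other robot is only kept out of /private/.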

ncw164x




msg:1526873
 9:17 am on Nov 28, 2004 (gmt 0)

Also, I feel like I'm getting picky here, what with everything else webmasters have to worry about

Yes, you are being picky. You may understand how it works, but a lot of others don't, especially the newbies, and maybe that's why you have seen so many mistakes in the coding of this file.

You don't have to have the file, and you can have just a blank robots.txt file. The reason for the blank file is to eliminate the 404s you would otherwise get in your log file whenever the file is requested but isn't there to be served.

The robots.txt file only works on well-behaved spiders. Rogue spiders just walk all over it and don't obey, so you have to implement other means to stop those.

AlucardSpyderWriter




msg:1526874
 5:59 pm on Nov 28, 2004 (gmt 0)

Umm, yes, I know, I said that, and that I got.

It's not the lack of it I don't get, and I understand all kinds of simple mistakes..

But when my spiders get more links out of the HTML in a Robots.txt than the HTML of a page itself, I just remember how silly I first thought it was to include code to check Robots.txt for HREFs.

jdMorgan




msg:1526875
 6:12 pm on Nov 28, 2004 (gmt 0)

I get fairly annoyed at some spider writers, too. It's unbelievable how many spiders from well-known organizations can't handle the multiple-user-agent policy records clearly described by the Standard. Most just pack up and go home, failing to find their user-agent string, despite the fact that it is included. Others decide they can do what they like, since they deem the file invalid.
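For spider writers who haven't handled this case, a minimal sketch in Python of reading records that carry more than one User-agent line might look like the following. It assumes the record rules as I read them in the Standard: a record is one or more User-agent lines followed by Disallow lines, blank lines separate records, comment-only lines do not end a record, and the robot's name is matched as a case-insensitive substring. The function names and the ExampleBot and OtherBot agents are invented for the example, and only Disallow is handled.

```python
def parse_records(robots_txt):
    """Split robots.txt text into (agents, disallow_paths) records."""
    records = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        if not raw.strip():
            # A truly blank line ends the current record.
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # comment-only or unrecognised line: skip, keep the record open
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agents.append(value.lower())  # several agents may share one record
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def rules_for(robots_txt, agent_name):
    """Return the Disallow list that applies to agent_name."""
    name = agent_name.lower()
    fallback = None
    for agents, rules in parse_records(robots_txt):
        if any(a and a != "*" and a in name for a in agents):
            return rules  # a record naming this robot wins over "*"
        if "*" in agents and fallback is None:
            fallback = rules
    return fallback if fallback is not None else []


SAMPLE = """\
User-agent: ExampleBot
User-agent: OtherBot
Disallow: /cgi-bin/

User-agent: *
Disallow: /
"""

print(rules_for(SAMPLE, "OtherBot/1.0"))   # ['/cgi-bin/']
print(rules_for(SAMPLE, "SomethingElse"))  # ['/']
```

The point is that OtherBot finds itself on the second User-agent line of the first record and gets that record's rules, instead of giving up or falling through to the "*" record.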

So, there is frustration on both ends, and the solution is not to complain, but to inform.

Standard for Robot Exclusion [robotstxt.org]

robots.txt syntax checker [searchengineworld.com]

Jim

PCInk




msg:1526876
 8:17 pm on Nov 28, 2004 (gmt 0)

30% of the time, when a Robots.txt file was found, it contained nothing but HTML code.

Possibly some of these are caused by missing robots.txt files. If the robots file is missing, some servers return their 404 error page, but with a 200 status code.
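A spider could guard against that case with something like the sketch below, written in Python. It is only a heuristic under my own assumptions: the function name `classify_robots_response` and its labels are made up, and it takes the status code, Content-Type header and body as plain arguments so it does not depend on any particular HTTP library. A 200 response with no robots.txt directives that looks like HTML is treated as a missing file rather than a broken one.

```python
import re

# A line that starts with a known robots.txt field name.
DIRECTIVE = re.compile(r"^\s*(user-agent|disallow)\s*:", re.IGNORECASE | re.MULTILINE)
# Tags that suggest the body is really an HTML page.
HTML_HINT = re.compile(r"<\s*(!doctype|html|head|body|title)\b", re.IGNORECASE)


def classify_robots_response(status, content_type, body):
    """Guess what a /robots.txt response really is."""
    if status == 404:
        return "missing"
    if status != 200:
        return "unknown"
    if DIRECTIVE.search(body):
        return "robots"  # has at least one usable directive line
    if "html" in content_type.lower() or HTML_HINT.search(body):
        return "probably-missing"  # error page served with a 200 code
    return "no-rules"  # blank or unrecognised text: nothing to obey


page = "<!DOCTYPE html><html><title>Not Found</title></html>"
print(classify_robots_response(200, "text/html; charset=utf-8", page))  # probably-missing
print(classify_robots_response(200, "text/plain", "User-agent: *\nDisallow:\n"))  # robots
```

Counting "probably-missing" separately from genuinely bad files would show how much of that 30% is really just servers answering 200 for a page that isn't there.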


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved