
Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt Validator
Froggyman
8:45 pm on May 22, 2001 (gmt 0)

Got hit from this one today:

IP: (omitted)
Robots.txt Validator [searchengineworld.com...]

Did it validate??? Oh that's right, I don't have a robots.txt file. Actually, I do; I call it a .htaccess file. :)

 

toolman
8:49 pm on May 22, 2001 (gmt 0)

It really is amazing how many people don't use a robots.txt file. I think it's a worthwhile asset just to reduce the 404s.
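Even if you don't want to block anything, a minimal allow-everything file is enough to stop those 404s:

User-agent: *
Disallow:

(An empty Disallow value means nothing is disallowed.)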

Froggyman
9:00 pm on May 22, 2001 (gmt 0)

Reduce the 404s? Come on now, Toolman. When a spider hits my nonexistent robots.txt file, it is redirected to my custom 404 error page, which contains lots and lots of links to other pages on my site. In fact, I haven't resubmitted to a search engine once, and I am very well indexed by all.

Brett_Tabke
9:17 am on May 26, 2001 (gmt 0)

The data I gathered from that crawl is very interesting to me. OK, I'm an SEO wonk, and anything related is interesting to me.

Collating that many files takes a very long time, but the last batch is all done now: I ended up with 180-210k robots.txt files (exact count still pending). Those are just the semi-valid ones. Of those I have collated data on, here are some highlights:

About half of all robots.txt files are in MS-DOS format (they should use Unix line endings).
About 60% of all requests for robots.txt ended up as redirects to an HTML page. That is poor server configuration; SEs do have to deal with it, but it is really bad style.
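If you want to check your own site for these two problems, here is a minimal sketch in modern Python (the hostname is a placeholder, not a real target):

import urllib.request

url = "http://example.com/robots.txt"
resp = urllib.request.urlopen(url)
body = resp.read()

# urlopen follows redirects silently, so compare the final URL.
if resp.geturl() != url:
    print("redirected to:", resp.geturl())

# A robots.txt should be served as text/plain, not HTML.
if "html" in resp.headers.get("Content-Type", "").lower():
    print("served as HTML, not plain text")

# Flag MS-DOS (CRLF) line endings.
if b"\r\n" in body:
    print("uses MS-DOS (CRLF) line endings")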

About 6% of all robots.txt files are not valid, and many search engines will ignore them.

Common fatal and near-fatal errors: these are errors that would give a spider cause for concern about the validity of the file:

Multiple disallows per line. Only one Disallow is acceptable per line; you can't combine disallows.
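For example (the paths are just illustrative), this is wrong:

User-agent: *
Disallow: /cgi-bin/ /tmp/

and this is the correct form:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/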

Use of an ALLOW tag. There is no ALLOW in the robots.txt standard.

Wild gyrations in formatting. From structured formatting with spaces at the beginning of lines, to attempts at multi-line comments, there are so many variations that it makes me wonder what SEs do with them.
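For the record, the standard only defines shell-style comments: a # makes the rest of that line a comment, one line at a time; there is no multi-line comment syntax. For example (path is illustrative):

# Keep crawlers out of the scripts directory
User-agent: *
Disallow: /cgi-bin/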

Size. There were hundreds of robots.txt files near or over a megabyte in size. I simply can't imagine a search engine treating a file that size as valid. When you consider the overhead involved in parsing a file that size, some SE boxes would literally run out of memory. I don't know if there is a "safe size", but a meg is in the questionable range. It represents bad server/directory setup: ban whole directories, not 10k individual files IN the directory, as in the example below.
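For example (illustrative path), one directory rule:

User-agent: *
Disallow: /images/

covers every file under /images/ and replaces thousands of per-file lines like Disallow: /images/photo0001.jpg.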

Doc format. Yes, we ran into 50+ robots.txt files that were in Microsoft Word format. No kidding: loaded some of them up in Word, and there was a very pretty-looking robots.txt.

HTTP redirects. Ran into many robots.txt files that were valid, but they were parked behind an HTTP redirect. It's questionable whether the SEs would treat that as valid. (ex: foo.com/robots.txt redirected to foo.com/bar/robots.txt or foo.com/robots2.txt)

Bogus txt files: hit a huge server farm that was loading robots.txt with keyword lists. Why? Who knows.

Red Flags:
We identified over 200 "server farms" or "domain farms" simply by the identical nature of their robots.txt files (keep that in mind, cloakers). The largest we found was a robots.txt duplicated across over 800 domains.

Early Phase One:
[searchengineworld.com...]
