Forum Moderators: goodroi
If you use the OPT-IN method, any bot you haven't explicitly listed is banned by default!
Example:
# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
I'll just keep preaching OPT-IN until people pay attention ;)
Feed all others a robots.txt file which disallows them.
Use the dynamic file to log accesses, user agents, etc., so you can monitor what hits your site and decide what to allow or disallow. I've written an extremely detailed, feature-rich dynamic program for this; it also monitors bot traps and automatically updates the .htaccess file to block bots that ignore robots.txt directives.
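Boiled down, the core idea is only a few lines. Here's a minimal sketch in Python CGI (the allow list, log path, and output below are illustrative placeholders, not my actual program):

# robots.py - minimal sketch of a dynamic, opt-in robots.txt handler (CGI).
# The allow list and log path below are made-up examples.
import os, sys, time

ALLOWED = ("googlebot", "slurp", "teoma")   # opt-in list
LOG = "/var/log/robots_access.log"          # hypothetical log file

ua = os.environ.get("HTTP_USER_AGENT", "-")
ip = os.environ.get("REMOTE_ADDR", "-")

# Log every robots.txt fetch so you can review later what to allow or block.
with open(LOG, "a") as log:
    log.write("%s %s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), ip, ua))

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
if any(bot in ua.lower() for bot in ALLOWED):
    # Recognized bot: allow the site, but keep it out of /cgi-bin.
    sys.stdout.write("User-agent: *\nCrawl-delay: 2\nDisallow: /cgi-bin\n")
else:
    # Everyone else is banned by default.
    sys.stdout.write("User-agent: *\nDisallow: /\n")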
If you use the OPT-IN method, any bot you haven't explicitly listed is banned by default!
A well-known site (a few million visitors monthly) uses this method. It allows access to perhaps 25 robots, explicitly disallows 25 others, and at the end it even has
User-Agent: *
Disallow: /
This site probably doesn't need a robots.txt at all; it is a well-known site. What a stupid idea to allow Googlebot and disallow Teleport! Teleport users can simply change the robot signature to User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4 GoogleToolbarFF 3.0.20070420
And what about this entry, isn't it just as stupid:
User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
Disallow: /
Wow, Internet Explorer does not check robots.txt before browsing! And Teleport has a setting "do not honor robots.txt".
What about other not-so-well-known robots which may appear on the market and generate good traffic for you?
1. It IS very risky.
2. It IS cloaking.
Having a static file in the file system is the preferable way.
What about returning a 404 if your 'dynamic service' is temporarily broken? Have you tested it? Are you sure it won't reply with a 50x? A 403? Are you sure your dynamic content correctly handles HTTP HEAD requests? What about 304? Are you sure it generates the correct version of robots.txt when there's a redirection from another site?...
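If you insist on a dynamic file, at least test those cases. A quick sketch in Python (the URL is a placeholder for your own site):

# check_robots.py - sanity checks for a dynamic robots.txt endpoint.
import urllib.request, urllib.error

URL = "http://www.example.com/robots.txt"   # placeholder

# Plain GET: should be 200 with text/plain, never a 403, 404, or 50x.
resp = urllib.request.urlopen(URL)
print("GET:", resp.getcode(), resp.headers.get("Content-Type"))

# HEAD: a correct handler must answer HEAD requests too.
head = urllib.request.Request(URL, method="HEAD")
print("HEAD:", urllib.request.urlopen(head).getcode())

# Conditional GET: urllib surfaces a 304 reply as an HTTPError.
cond = urllib.request.Request(
    URL, headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"})
try:
    print("Conditional GET:", urllib.request.urlopen(cond).getcode())
except urllib.error.HTTPError as err:
    print("Conditional GET:", err.code)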
You have a lot of dependencies... I'd suggest treating all robots the same in our world of democracy, and disallowing access to the 'shopping cart', for instance:
User-Agent: *
Disallow: /addToCart
Disallow: /sendFeedback
Disallow: /search
The dynamically generated robots.txt solution is also bad in another, more important respect: scalability. robots.txt is just plain static text!
If you want to cloak against unwelcome User-Agents, including those that don't honor robots.txt, just do it explicitly, without piling meaningless extra behavior onto the robots.txt conventions.
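For example, refuse them at the server level. A sketch as WSGI middleware in Python (the blacklist is illustrative; the same rule belongs in .htaccess or your server config):

# block_agents.py - sketch of explicit User-Agent blocking as WSGI middleware.
# The blacklist below is illustrative only.
BLOCKED = ("teleport", "webcopier", "httrack")

def block_bad_agents(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bad in ua for bad in BLOCKED):
            # Unwelcome agent: refuse outright instead of bending robots.txt.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware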
For those who use dynamically generated content for robots.txt, with dependencies on IP, User-Agent, etc.:
1. It IS very risky.
2. It IS cloaking.
Having a static file in the file system is the preferable way.
Cloaking means showing the USER something different from what you show the SEARCH ENGINE, and the USER should never see robots.txt in the first place, so it's not cloaking.
It's also not risky, no more risky than running a database-driven website such as a forum, blog, or anything else that uses MySQL on the backend. Actually, the cloaked robots.txt is probably less risky: most people just serve up a series of static files dynamically based on the user agent, so it's far less likely to fail than the site itself, which is prone to a MySQL database crash.
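In other words, the 'dynamic' part is usually nothing more than this kind of dispatch (a sketch; the file names are made up):

# Sketch: a "dynamic" robots.txt that merely picks one of two pre-written
# static files by User-Agent. No database anywhere, so nothing to crash.
def pick_robots_file(user_agent):
    ua = user_agent.lower()
    if "googlebot" in ua or "slurp" in ua or "teoma" in ua:
        return "robots-allow.txt"    # pre-written allow version
    return "robots-denyall.txt"      # pre-written deny-all version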
FWIW, you're arguing the wrong point on the wrong website as WebmasterWorld itself uses a dynamic robots.txt file:
[webmasterworld.com...]
Sorry, but all that fear, uncertainty and doubt about dynamic robots.txt doesn't fly here.
White Hat knows it even has 'synchronize' functionality. Suppose I need to read WebmasterWorld up to five levels deep during a long trip when the Internet is not available... Can I generate a locally cached copy of your site before such a trip?
Yes, I am creating a lot of work for everyone at this Forum. Please remove my profile!
P.S.
Mozilla also has a cache! Mozilla does not honor robots.txt.
WebmasterWorld itself uses a dynamic robots.txt
There is a big difference between a browser, a robot, a cache, and a crawl.
Cache = keeping something you already downloaded
Browser = direct, human-operated viewing
Robot = automatic downloader of content
Crawl = following links automatically to discover pages and download them
Browsers are not robots, so they do not need to respect robots.txt. Disallowing well-known browser User-Agents in the robots.txt file trips up poorly written robots: no genuine browser will ever check the robots.txt file, so anything claiming to be IE4 that reads robots.txt deserves to see a Disallow.
Robots.txt, however, is only one part of a bigger picture; good management also includes honey pots, IP-based blocks, and the like.
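For reference, this is all a well-behaved robot does before fetching a page (Python's standard urllib.robotparser; the URLs are placeholders):

# A genuine crawler consults robots.txt before every fetch; a browser never does.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# True only if the rules allow this agent to fetch this URL.
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"))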
You probably published it after checking your server logs. I accessed it yesterday. Very nice promotion! Especially when someone replies to me via private email using a real name.
I'm not sure what you mean as I have no affiliation with WebmasterWorld whatsoever other than being a member, so I couldn't check any logs at all.
The information I posted is all public knowledge, which you can find by viewing the WebmasterWorld blog, which *IS* a robots.txt file; see the info at the top of the document:
[webmasterworld.com...]
Now, any questions?
Can you add real value to your script to prevent content stealing?
Robots.txt isn't designed to stop content stealing; it's designed to tell well-behaved robots how to crawl your site. If your goal is to stop content stealing using robots.txt, then you have lost the battle already.
BTW, many SEs still have redirect-related bugs around robots.txt: which version of the file should they use, the one at the initial URL or the one at the final URL? What to do with a 302? With a 301?
Dynamic robots.txt uses no redirects, at least nothing I do is redirected, therefore those issues aren't a problem.
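For crawler authors who do have to deal with redirects, one defensible policy is to follow the redirect first and obey the robots.txt of the final host. A sketch in Python (the URLs are placeholders):

# Sketch: resolve redirects first, then check robots.txt on the final host.
import urllib.parse, urllib.request, urllib.robotparser

resp = urllib.request.urlopen("http://www.example.com/old-page")  # follows 301/302
final_url = resp.geturl()                # URL after all redirects
parts = urllib.parse.urlsplit(final_url)

rp = urllib.robotparser.RobotFileParser()
rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
rp.read()
print(rp.can_fetch("MyCrawler", final_url))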