Forum Moderators: goodroi
What your competitors don't know can't hurt you - but if they get too snoopy, it can hurt them.
Why would you want to hide it?
Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. On several occasions, I have seen a user fetch the robots.txt, read off exactly which directories are off-limits, and then go straight after them.
Since the robots.txt is, after all, intended for respectful robots, I feel it serves no constructive purpose for general accessibility.
> I understand the last line, but would you explain the first two?
# Browsers send a User-Agent beginning with "Mozilla"...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla
# ...but exempt Slurp and surfsafely, whose UAs also contain "Mozilla"
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
# Serve a substitute file in place of the real robots.txt
RewriteRule ^robots\.txt$ /someotherfile [L]
(Note: the first RewriteCond corrects the version in my first post.)
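The decision those two conditions make can be sketched in Python (the function name here is illustrative, not part of mod_rewrite): serve the substitute robots.txt only to clients whose User-Agent starts with "Mozilla" and does not contain "Slurp" or "surfsafely".

```python
import re

# Sketch of the rewrite conditions above: "^Mozilla" is anchored at the
# start, "!(Slurp|surfsafely)" is an unanchored negative match.
def serves_fake_robots(user_agent):
    return (user_agent.startswith("Mozilla")
            and not re.search(r"Slurp|surfsafely", user_agent))

print(serves_fake_robots("Mozilla/5.0 (Windows NT 10.0)"))           # True: a browser
print(serves_fake_robots("Mozilla/5.0 (compatible; Yahoo! Slurp)"))  # False: exempted bot
print(serves_fake_robots("Googlebot/2.1"))                           # False: no "Mozilla" prefix
```

Bots that don't claim to be Mozilla fall through both conditions and still get the real robots.txt.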
Jim
Does this all only work on Linux/Unix hosting?
It's meant for Apache. There's also a Win32 build, but I don't know whether it's suitable for production servers.
Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads
I have experienced both sides: the bot programmer and the site owner. A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obey a robots.txt. It's quite easy to modify existing Perl and Python bots to do so, and just as easy to write your own.
On the other side of the fence, as a site owner I strongly recommend real-time traffic shaping based on "offending" IPs, independent of the User-Agent and the robots.txt machinery. It works like a firewall detecting intrusion attempts and DoS attacks, only at a higher level (HTTP).
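One way to do that kind of IP-based shaping is a sliding-window counter per client IP. This is a minimal sketch, not a production design - the window and limit values are made up, and a real setup would also need eviction of idle IPs and a shared store behind multiple server processes:

```python
import time
from collections import defaultdict, deque

WINDOW = 60    # seconds to look back (illustrative value)
LIMIT = 120    # max requests per IP within the window (illustrative value)

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this request from `ip` is within the rate limit."""
    now = time.time() if now is None else now
    q = _hits[ip]
    # Drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False  # leecher-like burst: deny, delay, or tarpit
    q.append(now)
    return True
```

A disguised UA doesn't help the leecher here, since the decision never looks at it.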
Btw, from my experience a lot of leechers use sophisticated Perl/Python solutions. Sometimes I feel like telling them about wget and its mirroring options. Leecher's life could be so simple :)
(since when was there a robots.txt forum? didn't even realize that 'till now)
Say you have the following files and directories:
/index.html
/my_sekrit_files/
/pages/
/sitemap.html
/stuff_that_i_dont_want_to_be_spidered/
then you only need to specify
User-agent: *
Disallow: /m
Disallow: /st
to disallow those two directories, and leave snoopers in the dark about the names of the excluded directories.
A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obey a robots.txt.
And the best have vast networks, so they can appear to come from hundreds of different IPs with different User-Agents every time, much like regular visitors.
I'm not sure if there's a viable way to block this sort of activity, as it's very hard to track.
This is going to disallow
/mofo.html
/moritis.htm
/strange.htm
/stupi.html
Better use prefixes long enough to match only the directories you mean:
Disallow: /my_
Disallow: /stuff_
(Matching is by literal path prefix, so "Disallow: /m/" would only block a directory actually named /m/ - it wouldn't match /my_sekrit_files/ at all.)
How do you implement this on Windows-based web servers?
Open Notepad* (or SimpleText for you Macintosh** users), type any of the examples given here (using the directory/file names on your domain), save it as "robots.txt" and upload it to the root folder.
If this isn't what you are asking, please be a little bit more specific.
* Notepad should automatically add ".txt" to the file name upon saving, so you only need to save it as "robots" with "Text Documents" selected as the type.
** Users of MacOS 9 (and earlier) will need to add ".txt" to any text file they save before uploading to the web. HTML files are also plain text, but they need a ".htm" or ".html" extension instead before being uploaded.
MacOS X's text editor automatically adds ".txt" to the file name upon saving, so you only need to save it as "robots".