jdMorgan

msg:1528643 | 2:31 am on Sep 27, 2003 (gmt 0) |
keyplyr, Sure:
RewriteCond {HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp¦surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]
"someotherfile" could be blank, or it could be a fake robots.txt.
|
Brett_Tabke

msg:1528644 | 8:41 pm on Oct 28, 2003 (gmt 0) |
Brilliant thread. You could also do a mod_rewrite on the robots.txt to a CGI file and serve it up dynamically :-)
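A minimal sketch of the dynamic idea, in Python rather than as a finished CGI program. The crawler names in KNOWN_CRAWLERS and both rule sets are placeholders chosen for illustration; nothing here was posted in the thread.

```python
import os

# Hypothetical values: which User-Agent substrings count as "known" crawlers,
# and what each group is shown. Adjust to taste.
KNOWN_CRAWLERS = ("googlebot", "slurp")
RULES_FOR_CRAWLERS = "User-agent: *\nDisallow: /private/\n"
RULES_FOR_OTHERS = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt text to serve for this User-Agent string."""
    ua = user_agent.lower()
    if any(name in ua for name in KNOWN_CRAWLERS):
        return RULES_FOR_CRAWLERS
    return RULES_FOR_OTHERS


if __name__ == "__main__":
    # Under CGI the server exposes the request header as HTTP_USER_AGENT.
    body = robots_txt_for(os.environ.get("HTTP_USER_AGENT", ""))
    print("Content-Type: text/plain")
    print()
    print(body, end="")
```

A User-Agent header is trivially forged, so this only changes what casual visitors see.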
|
bcolflesh

msg:1528645 | 8:43 pm on Oct 28, 2003 (gmt 0) |
Why would you want to hide it? Non-conformists are going to be taken care of in your rewrite rules, hopefully.
|
Brett_Tabke

msg:1528646 | 9:23 pm on Oct 28, 2003 (gmt 0) |
Let's be honest: the robots.txt standard is useless at stopping rogue bots. I want to do a complete ban on all bots but the good search engine bots. How can you do that with thousands of bot names you don't know? Ban 'em all, and then let the good bots in via JD's script. What your competitors don't know - can't hurt you, but if they get too snoopy, it can hurt them.
|
keyplyr

msg:1528647 | 9:23 pm on Oct 28, 2003 (gmt 0) |
| Why would you want to hide it? |
| Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. On several occasions, I have seen a user get the robots.txt, view the directory hierarchy and which directories are off-limits, then go after them. Since the robots.txt is, after all, intended for respectful robots, I feel it serves no constructive purpose for general accessibility.
|
PhraSEOlogy

msg:1528648 | 1:02 am on Oct 29, 2003 (gmt 0) |
Brett, YIKES. Perhaps you could serve up a robots.txt based upon the requesting robot and feed them what they are looking for. Directory structure, big pages, small pages, keywords in URLs, keyword density, etc, etc. Opens up a world of possibilities... Don't try this at home folks!
|
BlueSky

msg:1528649 | 1:12 am on Oct 29, 2003 (gmt 0) |
Very cool idea. I've also seen someone check out robots.txt via browser before releasing his misbehaving bots on my site.
| Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. |
| Huh, only some? Gee, I'd love to be in that other group. lol
|
jdMorgan

msg:1528650 | 5:45 am on Oct 29, 2003 (gmt 0) |
keyplyr,
> I understand the last line, but would you explain the first two?
RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp¦surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]
Line 1: IF the User-agent string starts with "Mozilla" (most browsers)
Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (two 'bots w/UAs that start with "Mozilla")
Line 3: THEN do the rewrite of robots.txt to some other file.
(Note correction of code line 1 from first post. Also, replace "¦" with a solid vertical pipe character from your keyboard.)
Jim
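The three-line logic above can be mirrored in Python to check which User-Agent strings would trigger the rewrite. This is only an illustrative model of the two conditions, not a replacement for mod_rewrite; the function name should_rewrite is made up for the example, and the patterns use a real "|" for alternation.

```python
import re

# The same two patterns as the RewriteCond lines (case-sensitive, like the originals).
STARTS_WITH_MOZILLA = re.compile(r"^Mozilla")
EXEMPT_BOTS = re.compile(r"(Slurp|surfsafely)")


def should_rewrite(user_agent: str) -> bool:
    """True when a request for robots.txt would be sent to /someotherfile."""
    # Line 1: UA starts with "Mozilla". Line 2: AND UA contains neither bot name.
    starts_like_browser = bool(STARTS_WITH_MOZILLA.search(user_agent))
    is_exempt_bot = bool(EXEMPT_BOTS.search(user_agent))
    return starts_like_browser and not is_exempt_bot
```

A bot whose User-Agent does not start with "Mozilla" fails the first condition, so it still receives the real robots.txt.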
|
keyplyr

msg:1528651 | 6:38 am on Oct 29, 2003 (gmt 0) |
Thanks, it sure looks tempting ;)
|
dillonstars

msg:1528652 | 9:58 am on Oct 29, 2003 (gmt 0) |
Does this all only work on Linux/Unix hosting? If so, are there any alternatives for Windows hosting?
|
dirkz

msg:1528653 | 11:51 am on Oct 29, 2003 (gmt 0) |
| Does this all only work on Linux/Unix hosting? |
| It's meant for Apache. There's also a Win32 version of it, but I don't know whether it's suitable for production servers.
| Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads |
| I have experienced both sides: the bot programmer and the site owner. A "good" leeching bot (in the eyes of the leecher) will disguise its UA and never obey a robots.txt. It's quite easy to modify existing bots in Perl and Python to do so. It's also very easy to write your own. On the other side of the fence, as a site owner I strongly recommend "traffic-shaping methods" in real time, independent of UA and robots stuff, based on "offending" IPs. It works like firewalls detecting intrusion attempts and DoS attacks, only on a higher level (HTTP). Btw, from my experience a lot of leechers use sophisticated Perl/Python solutions. Sometimes I feel like telling them about wget and its mirroring options. Leecher's life could be so simple :)
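One way to picture IP-based traffic shaping is a sliding-window counter per client address. The sketch below is an assumption about what such a method might look like, not the poster's actual setup; the class name, the limits, and the in-memory storage are all hypothetical.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window request counter keyed by client IP (User-Agent is ignored)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent allowed requests

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now`; return False once the IP is over the limit."""
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

For example, IpRateLimiter(3, 10.0) lets an address make three requests in any ten-second span and refuses the fourth, whatever User-Agent it claims.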
|
ritch_b

msg:1528654 | 12:06 pm on Oct 29, 2003 (gmt 0) |
Excellent stuff - works like a charm! I modded jdMorgan's code a little to exclude Opera:
RewriteCond %{HTTP_USER_AGENT} ^(Mozilla¦Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp¦surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]
Already added to my collection of useful stuff! Cheers. R.
|
amznVibe

msg:1528655 | 1:46 am on Oct 30, 2003 (gmt 0) |
Maybe someone should let the whitehouse know about this idea ;) [webmasterworld.com...] (since when was there a robots.txt forum? didn't even realize that 'till now)
|
Bloggerheads

msg:1528656 | 9:27 am on Oct 30, 2003 (gmt 0) |
Oops. My bad. OK, here's a generic example for you:
# /robots.txt file for http://www.example.com
User-agent: *
Allow: /games
Allow: /forum
Allow: /tutorials
Disallow: /
[edited by: engine at 5:19 pm (utc) on Oct. 30, 2003]
[edit reason] examplified & de-linked [/edit]
|
tschild

msg:1528657 | 10:05 am on Oct 30, 2003 (gmt 0) |
Another option would be to specify disallowed directories with a partial path which is only long enough to not include allowed directories. Say you have the following files and directories:
/index.html
/my_sekrit_files/
/pages/
/sitemap.html
/stuff_that_i_dont_want_to_be_spidered/
then you only need to specify
User-agent: *
Disallow: /m
Disallow: /st
to disallow those two directories, and leave snoopers in the dark about the names of the excluded directories.
|
wackybrit

msg:1528658 | 10:51 pm on Oct 30, 2003 (gmt 0) |
| A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obey a robots.txt. |
| And the best have vast networks so they can appear to be coming from 100s of different IPs, with different User-Agents every time, much like regular visitors. I'm not sure if there's a viable way to block this sort of activity, as it's very hard to track.
|
ThierryZoller

msg:1528659 | 10:39 pm on Nov 1, 2003 (gmt 0) |
User-agent: *
Disallow: /m
Disallow: /st
This is going to disallow
/mofo.html
/moritis.htm
/strange.htm
/stupi.html
Better use:
Disallow: /m/
Disallow: /st/
for directories.
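To see how the two variants behave, here is a small sketch that treats each Disallow value as a plain path prefix, which is how the original robots.txt convention describes matching. The helper is_disallowed and the sample paths are invented for the example; real crawlers may add their own extensions.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_prefixes: Iterable[str]) -> bool:
    """Simplified model: a path is blocked if it starts with any non-empty Disallow value."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)


paths = ["/my_sekrit_files/a.html", "/mofo.html", "/pages/index.html"]

short_prefixes = ["/m", "/st"]    # partial paths without a trailing slash
slash_prefixes = ["/m/", "/st/"]  # the same values with a trailing slash

blocked_short = [p for p in paths if is_disallowed(p, short_prefixes)]
blocked_slash = [p for p in paths if is_disallowed(p, slash_prefixes)]
```

Under this reading, blocked_short contains both /my_sekrit_files/a.html and /mofo.html, while blocked_slash is empty: the trailing-slash forms only match directories literally named "m" and "st", so they no longer cover /my_sekrit_files/ either.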
|
spud01

msg:1528660 | 2:15 pm on Nov 12, 2003 (gmt 0) |
How do u implement this on windows based webservers?
|
nakulgoyal

msg:1528661 | 7:14 am on Nov 17, 2003 (gmt 0) |
That was wonderful. I would also like to know how to implement this on Windows.
|
lightsup55

msg:1528662 | 1:45 am on Dec 2, 2003 (gmt 0) |
| How do u implement this on windows based webservers? |
| Doesn't matter what software your server is running, as it is just a text file. Open Notepad* (or SimpleText for you Macintosh** users), type any of the examples given here (using the directory/file names on your domain), save it as "robots.txt" and upload it to the root folder. If this isn't what you are asking, please be a little bit more specific.
* Notepad should automatically add ".txt" to the file name upon saving, thus you will only need to save it as "robots", choosing "Text Documents" as the type.
** Users of MacOS 9 (and earlier) will need to add ".txt" to any text file they save before uploading to the web. While HTML files are still plain text, they will require either ".htm" or ".html" before being uploaded to the web. MacOS X's text editor automatically adds ".txt" to the file name upon saving, thus you will only need to save it as "robots".
|
|