
Hiding robots.txt

   
2:12 am on Sep 27, 2003 (gmt 0)

keyplyr (Senior Member)



Is there a way to hide robots.txt from browsers, but not impede robots? Thanks.

2:31 am on Sep 27, 2003 (gmt 0)

jdmorgan (Senior Member)



keyplyr,

Sure:


RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

"someotherfile" could be blank, or a it could be a fake robots.txt.
8:41 pm on Oct 28, 2003 (gmt 0)

brett_tabke (Administrator)



Brilliant thread.

You could also do a mod_rewrite on the robots.txt to a CGI file and serve it up dynamically :-)
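
A rough sketch of that idea, assuming a Python CGI at /cgi-bin/robots.py (the script path, the "Googlebot" check and the /private/ directory are only placeholders):

RewriteRule ^robots\.txt$ /cgi-bin/robots.py [L]

#!/usr/bin/env python3
# robots.py - illustrative CGI: real rules for known crawlers,
# a bland decoy for everything else.
import os

ua = os.environ.get("HTTP_USER_AGENT", "")

print("Content-Type: text/plain")
print()

if "Slurp" in ua or "Googlebot" in ua:
    # rules meant for the crawlers we trust
    print("User-agent: *")
    print("Disallow: /private/")
else:
    # everyone else learns nothing
    print("User-agent: *")
    print("Disallow:")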

8:43 pm on Oct 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why would you want to hide it? Non-conforming bots should be taken care of by your rewrite rules anyway, hopefully.
9:23 pm on Oct 28, 2003 (gmt 0)

brett_tabke (Administrator)



Let's be honest: the robots.txt standard is useless at stopping rogue bots. I want to do a complete ban on all bots but the good search engine bots. How can you do that with thousands of bot names you don't know? Ban 'em all, and then let the good bots in via JD's script.
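
On the robots.txt side (for whatever bots still bother to read it), the whitelist approach looks something like this; the crawler names below are just the obvious examples:

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /

Everything not named gets a blanket ban, and JD's rewrite keeps even that file away from casual snoopers.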

What your competitors don't know can't hurt you, but if they get too snoopy, it can hurt them.

9:23 pm on Oct 28, 2003 (gmt 0)

keyplyr (Senior Member)



Why would you want to hide it?

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. On several occasions, I have seen a user get the robots.txt, view the directory hierarchy and which directories are off-limits, then go after them.

Since robots.txt is, after all, intended for respectful robots, I see no constructive purpose in making it generally accessible.

1:02 am on Oct 29, 2003 (gmt 0)

10+ Year Member



Brett,

YIKES.

Perhaps you could serve up a robots.txt based upon the requesting robot and feed them what they are looking for: directory structure, big pages, small pages, keywords in URLs, keyword density, etc.

Opens up a world of possibilities...

Don't try this at home, folks!

1:12 am on Oct 29, 2003 (gmt 0)

10+ Year Member



Very cool idea. I've also seen someone check out robots.txt via browser before releasing his misbehaving bots on my site.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads.

Huh, only some? Gee, I'd love to be in that other group. lol
5:45 am on Oct 29, 2003 (gmt 0)

jdmorgan (Senior Member)



keyplyr,

> I understand the last line, but would you explain the first two?


RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Line 1: IF the User-agent string starts with "Mozilla" (most browsers)
Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (two bots whose UAs start with "Mozilla")
Line 3: THEN do the rewrite of robots.txt to some other file.

(Note: make sure the "%" before {HTTP_USER_AGENT} is present on the first line, and that the OR separator between the bot names is typed as a solid vertical pipe "|".)

Jim

6:38 am on Oct 29, 2003 (gmt 0)

keyplyr (Senior Member)



Thanks, it sure looks tempting ;)

9:58 am on Oct 29, 2003 (gmt 0)

10+ Year Member



Does this all only work on Linux/Unix hosting? If so, are there any alternatives for Windows hosting?
11:51 am on Oct 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does this all only work on Linux/Unix hosting?

It's meant for Apache. There's also a Win32 version of it, but I don't know whether it's suitable for production servers.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads

I have experienced both sides: the bot programmer and the site owner. A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obey robots.txt. It's quite easy to modify existing bots in Perl and Python to do so, and very easy to write your own.

On the other side of the fence, as a site owner I strongly recommend real-time "traffic-shaping" based on offending IPs, independent of user-agents and robots.txt. It works like a firewall detecting intrusion attempts and DoS attacks, only at a higher level (HTTP).
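
A toy sketch of that kind of per-IP throttling, in Python; the window and threshold are made-up numbers, and you would feed it from whatever sees your requests (a log tail, a request handler, etc.):

#!/usr/bin/env python3
# Sliding-window request counter: flag IPs that ask for too much, too fast.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # length of the sliding window
MAX_REQUESTS = 120     # requests per window before an IP looks abusive

recent = defaultdict(deque)   # ip -> timestamps of requests still in the window

def is_abusive(ip):
    """Record one request from ip and return True once it exceeds the limit."""
    now = time.time()
    q = recent[ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS

# When is_abusive() returns True, answer with 403s or drop the IP into a
# firewall deny list for a cool-down period.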

Btw, from my experience a lot of leechers use sophisticated Perl/Python solutions. Sometimes I feel like telling them about wget and its mirroring options. Leecher's life could be so simple :)

12:06 pm on Oct 29, 2003 (gmt 0)

10+ Year Member



Excellent stuff - works like a charm!

I modded jdMorgan's code a little so that Opera is also kept away from the real robots.txt:

RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Already added to my collection of useful stuff!

Cheers.

R.

1:46 am on Oct 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe someone should let the whitehouse know about this idea ;)
[webmasterworld.com...]

(since when was there a robots.txt forum? didn't even realize that 'till now)

9:27 am on Oct 30, 2003 (gmt 0)

10+ Year Member



Oops. My bad.

OK, here's a generic example for you:

# /robots.txt file for http://www.example.com

User-agent: *
Allow: /games
Allow: /forum
Allow: /tutorials
Disallow: /


10:05 am on Oct 30, 2003 (gmt 0)

10+ Year Member



Another option is to specify disallowed directories with a partial path, just long enough that it doesn't match any allowed directories.

Say you have the following files and directories:

/index.html
/my_sekrit_files/
/pages/
/sitemap.html
/stuff_that_i_dont_want_to_be_spidered/

then you only need to specify

User-agent: *
Disallow: /m
Disallow: /st

to disallow those two directories, and leave snoopers in the dark about the names of the excluded directories.

10:51 pm on Oct 30, 2003 (gmt 0)

10+ Year Member



A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obeye a robots.txt.

And the best have vast networks, so they can appear to be coming from hundreds of different IPs with different user-agents every time, much like regular visitors.

I'm not sure if there's a viable way to block this sort of activity, as it's very hard to track.

10:39 pm on Nov 1, 2003 (gmt 0)

10+ Year Member



User-agent: *
Disallow: /m
Disallow: /st

This is going to disallow
/mofo.html
/moritis.htm
/strange.htm
/stupi.html

Better to use trailing slashes for directories:

Disallow: /m/
Disallow: /st/

2:15 pm on Nov 12, 2003 (gmt 0)

10+ Year Member



How do you implement this on Windows-based web servers?
7:14 am on Nov 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That was wonderful. I would also like to know how to implement this on Windows.
1:45 am on Dec 2, 2003 (gmt 0)

10+ Year Member



How do you implement this on Windows-based web servers?

It doesn't matter what software your server is running, as robots.txt is just a text file.

Open Notepad* (or SimpleText for you Macintosh** users), type in any of the examples given here (using the directory/file names on your domain), save it as "robots.txt" and upload it to the root folder.

If this isn't what you are asking, please be a little bit more specific.

* Notepad should automatically add ".txt" to the file name upon saving, so you only need to save it as "robots" with "Text Documents" selected as the type.

** Users of Mac OS 9 (and earlier) will need to add ".txt" to any text file they save before uploading it to the web. HTML files, while still plain text, need either ".htm" or ".html" instead.
Mac OS X's text editor automatically adds ".txt" to the file name upon saving, so you only need to save it as "robots".

 
