
Hiding robots.txt

     
2:12 am on Sept 27, 2003 (gmt 0)

keyplyr (Moderator, US)


Is there a way to hide robots.txt from browsers, but not impede robots? Thanks.

2:31 am on Sept 27, 2003 (gmt 0)

jdmorgan (Senior Member)


keyplyr,

Sure:


RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

"someotherfile" could be blank, or a it could be a fake robots.txt.
8:41 pm on Oct 28, 2003 (gmt 0)

brett_tabke (Administrator, US)


Brilliant thread.

You could also do a mod_rewrite on the robots.txt to a CGI file and serve it up dynamically :-)
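
A minimal sketch of that idea, assuming CGI is enabled; the script path (/cgi-bin/robots.py) and the bot names are hypothetical:

RewriteRule ^robots\.txt$ /cgi-bin/robots.py [L]

#!/usr/bin/env python
# robots.py - a hypothetical CGI script that serves robots.txt
# dynamically, based on the requesting User-Agent.
import os

ua = os.environ.get("HTTP_USER_AGENT", "")

# CGI response: header, blank line, then the body
print("Content-Type: text/plain")
print("")

if "Googlebot" in ua or "Slurp" in ua:
    # Known search engine bots get the real rules
    print("User-agent: *")
    print("Disallow: /private/")
else:
    # Everyone else gets a blanket ban
    print("User-agent: *")
    print("Disallow: /")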

8:43 pm on Oct 28, 2003 (gmt 0)

Senior Member


Why would you want to hide it? Non-conformist bots should be taken care of by your rewrite rules, hopefully.
9:23 pm on Oct 28, 2003 (gmt 0)

brett_tabke (Administrator, US)


Let's be honest: the robots.txt standard is useless at stopping rogue bots. I want a complete ban on all bots except the good search engine bots. How can you do that with thousands of bot names you don't know? Ban 'em all, and then let the good bots in via JD's script.

What your competitors don't know can't hurt you, but if they get too snoopy, it can hurt them.
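
For the well-behaved crowd, that policy fits in a single static robots.txt (the bot names below are just examples); rogue bots that ignore it are the rewrite rules' job:

# Let the good search engine bots in...
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

# ...and ban everything else
User-agent: *
Disallow: /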

9:23 pm on Oct 28, 2003 (gmt 0)

keyplyr (Moderator, US)


Why would you want to hide it?

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. On several occasions, I have seen a user fetch robots.txt, read the directory hierarchy to see which directories are off-limits, and then go straight after them.

Since robots.txt is, after all, intended for respectful robots, I feel no constructive purpose is served by making it generally accessible.

1:02 am on Oct 29, 2003 (gmt 0)

Preferred Member


Brett,

YIKES.

Perhaps you could serve up a robots.txt based upon the requesting robot and feed each one exactly what it is looking for: directory structure, big pages, small pages, keywords in URLs, keyword density, etc.

Opens up a world of possibilities...
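
A sketch of that with plain mod_rewrite, serving a different static file per bot (the bot names and filenames here are hypothetical):

RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]

RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^robots\.txt$ /robots-slurp.txt [L]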

Don't try this at home, folks!

1:12 am on Oct 29, 2003 (gmt 0)

Preferred Member


Very cool idea. I've also seen someone check out robots.txt in a browser before releasing his misbehaving bots on my site.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads.

Huh, only some? Gee, I'd love to be in that other group. lol
5:45 am on Oct 29, 2003 (gmt 0)

jdmorgan (Senior Member)


keyplyr,

> I understand the last line, but would you explain the first two?


RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Line 1: IF the User-agent string starts with "Mozilla" (most browsers)
Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (two bots whose UAs start with "Mozilla")
Line 3: THEN rewrite the request for robots.txt to some other file.

(Note: the alternation character in line 2 must be a solid vertical pipe "|"; some forum pages display it as a broken bar "¦".)

Jim

6:38 am on Oct 29, 2003 (gmt 0)

keyplyr (Moderator, US)


Thanks, it sure looks tempting ;)

9:58 am on Oct 29, 2003 (gmt 0)

Preferred Member


Does this all only work on Linux/Unix hosting? If so, are there any alternatives for Windows hosting?
11:51 am on Oct 29, 2003 (gmt 0)

Senior Member


Does this all only work on Linux/Unix hosting?

It's meant for Apache. There is also a Win32 version of Apache, but I don't know whether it's suitable for production servers.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads

I have experienced both sides: the bot programmer and the site owner. A "good" leeching bot (in the eyes of the leecher) will disguise its UA and never obey a robots.txt. It's quite easy to modify existing bots in Perl and Python to do so. It's also very easy to write your own.

On the other side of the fence, as a site owner I strongly recommend real-time "traffic-shaping" based on offending IPs, independent of UAs and robots.txt. It works like a firewall detecting intrusion attempts and DoS attacks, only at a higher level (HTTP).
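
A toy sketch of such an HTTP-level throttle, assuming you can hook each request in your own application code (the window and limit values are arbitrary):

import time
from collections import defaultdict, deque

WINDOW = 60   # seconds
LIMIT = 100   # max requests per IP within the window

_hits = defaultdict(deque)

def allow_request(ip):
    """Return False once an IP exceeds LIMIT requests in WINDOW seconds."""
    now = time.time()
    q = _hits[ip]
    q.append(now)
    # Discard timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) <= LIMIT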

Btw, in my experience a lot of leechers use sophisticated Perl/Python solutions. Sometimes I feel like telling them about wget and its mirroring options. A leecher's life could be so simple :)

12:06 pm on Oct 29, 2003 (gmt 0)

Preferred Member


Excellent stuff - works like a charm!

I modded jdMorgan's code a little to also catch Opera:

RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Already added to my collection of useful stuff!

Cheers.

R.

1:46 am on Oct 30, 2003 (gmt 0)

Senior Member


Maybe someone should let the White House know about this idea ;)
[webmasterworld.com...]

(since when was there a robots.txt forum? didn't even realize that 'till now)

9:27 am on Oct 30, 2003 (gmt 0)

New User


Oops. My bad.

OK, here's a generic example for you:

# /robots.txt file for http://www.example.com

User-agent: *
Allow: /games
Allow: /forum
Allow: /tutorials
Disallow: /

[edited by: engine at 5:19 pm (utc) on Oct. 30, 2003]
[edit reason] examplified & de-linked [/edit]

10:05 am on Oct 30, 2003 (gmt 0)

Junior Member


Another option is to specify disallowed directories with a partial path, just long enough not to match any allowed directories.

Say you have the following files and directories:

/index.html
/my_sekrit_files/
/pages/
/sitemap.html
/stuff_that_i_dont_want_to_be_spidered/

then you only need to specify

User-agent: *
Disallow: /m
Disallow: /st

to disallow those two directories, and leave snoopers in the dark about the names of the excluded directories.

10:51 pm on Oct 30, 2003 (gmt 0)

Full Member


A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obeye a robots.txt.

And the best have vast networks, so they can appear to be coming from hundreds of different IPs with different user-agents every time, much like regular visitors.

I'm not sure if there's a viable way to block this sort of activity, as it's very hard to track.

ThierryZoller (Inactive Member)

10:39 pm on Nov 1, 2003 (gmt 0)


User-agent: *
Disallow: /m
Disallow: /st

This is going to disallow:

/mofo.html
/moritis.htm
/strange.htm
/stupi.html

Better use:

Disallow: /m/
Disallow: /st/

for directories.

2:15 pm on Nov 12, 2003 (gmt 0)

Junior Member


How do you implement this on Windows-based web servers?
7:14 am on Nov 17, 2003 (gmt 0)

Senior Member from US


That was wonderful. I would also like to know how to implement this on Windows.

lightsup55 (Inactive Member)

1:45 am on Dec 2, 2003 (gmt 0)


How do you implement this on Windows-based web servers?

It doesn't matter what software your server is running, as robots.txt is just a text file.

Open Notepad* (or SimpleText for you Macintosh** users), type any of the examples given here (using the directory/file names on your domain), save it as "robots.txt", and upload it to the root folder.

If this isn't what you are asking, please be a little bit more specific.

* Notepad should automatically add ".txt" to the file name upon saving, so you only need to save it as "robots", choosing "Text Documents" as the file type.

** Users of Mac OS 9 (and earlier) will need to add ".txt" to any text file they save before uploading it to the web. HTML files are also plain text, but they require a ".htm" or ".html" extension before being uploaded.
Mac OS X's text editor automatically adds ".txt" to the file name upon saving, so you only need to save it as "robots".

 
