
Sitemaps, Meta Data, and robots.txt Forum

    
Hiding robots.txt
keyplyr

posted 2:12 am on Sep 27, 2003 (gmt 0)

Is there a way to hide robots.txt from browsers, but not impede robots? Thanks.

 

jdMorgan

posted 2:31 am on Sep 27, 2003 (gmt 0)

keyplyr,

Sure:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

"someotherfile" could be blank, or a it could be a fake robots.txt.

Brett_Tabke

posted 8:41 pm on Oct 28, 2003 (gmt 0)

Brilliant thread.

You could also do a mod_rewrite on the robots.txt to a CGI file and serve it up dynamically :-)
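A minimal sketch of that idea, assuming a CGI-enabled Apache (the script path, bot names and rules below are illustrative only, not a recommendation):

RewriteEngine On
RewriteRule ^robots\.txt$ /cgi-bin/robots.py [L]

#!/usr/bin/env python3
# /cgi-bin/robots.py - hypothetical CGI that builds robots.txt on the fly
import os

ua = os.environ.get("HTTP_USER_AGENT", "")

print("Content-Type: text/plain")
print()                                  # blank line ends the CGI headers

if "Googlebot" in ua or "Slurp" in ua:
    # known crawlers get the real rules
    print("User-agent: *")
    print("Disallow: /private/")
else:
    # everyone else sees a harmless decoy
    print("User-agent: *")
    print("Disallow:")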

bcolflesh

posted 8:43 pm on Oct 28, 2003 (gmt 0)

Why would you want to hide it? Non-conformist bots should already be taken care of in your rewrite rules, hopefully.

Brett_Tabke

posted 9:23 pm on Oct 28, 2003 (gmt 0)

Let's be honest: the robots.txt standard is useless at stopping rogue bots. I want to do a complete ban on all bots but the good search engine bots. How can you do that with thousands of bot names you don't know? Ban 'em all, and then let the good bots in via JD's script.
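The robots.txt half of "ban 'em all" could look like this (bot names illustrative; a crawler obeys the most specific record matching its name, and an empty Disallow means "allow everything"):

# the good bots get full access
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

# everyone else gets nothing
User-agent: *
Disallow: /

Rogue bots will ignore it, of course, which is where the rewrite rules come in.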

What your competitors don't know can't hurt you, but if they get too snoopy, it can hurt them.

keyplyr

posted 9:23 pm on Oct 28, 2003 (gmt 0)

Why would you want to hide it?

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads. On several occasions I have seen a user fetch the robots.txt, see which directories are off-limits, and then go after exactly those.

Since the robots.txt is, after all, intended for respectful robots, I feel there is no constructive purpose in making it generally accessible.

PhraSEOlogy

posted 1:02 am on Oct 29, 2003 (gmt 0)

Brett,

YIKES.

Perhaps you could serve up a robots.txt based on the requesting robot and feed each bot what it is looking for: directory structure, big pages, small pages, keywords in URLs, keyword density, and so on.
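For instance, a per-bot mapping with plain mod_rewrite might look like this (the file names are illustrative):

RewriteEngine On
# each known crawler gets its own tailored file
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule ^robots\.txt$ /robots-google.txt [L]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^robots\.txt$ /robots-slurp.txt [L]
# anything that falls through gets the ordinary /robots.txt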

Opens up a world of possibilities...

Don't try this at home, folks!

BlueSky

posted 1:12 am on Oct 29, 2003 (gmt 0)

Very cool idea. I've also seen someone check out robots.txt via browser before releasing his misbehaving bots on my site.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads.

Huh, only some? Gee, I'd love to be in that other group. lol

jdMorgan

posted 5:45 am on Oct 29, 2003 (gmt 0)

keyplyr,

> I understand the last line, but would you explain the first two?

RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Line 1: IF the User-agent string starts with "Mozilla" (most browsers)
Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (Two 'bots w/UAs that start with "Mozilla")
Line 3: THEN do the rewrite of robots.txt to some other file.
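An easy way to verify the behaviour is to fetch robots.txt with different User-Agent strings, e.g. with curl (www.example.com stands in for your own domain):

curl -A "Mozilla/4.0 (compatible; MSIE 6.0)" http://www.example.com/robots.txt
(should return the decoy file)

curl -A "Googlebot/2.1" http://www.example.com/robots.txt
(should return the real robots.txt, since that UA doesn't start with "Mozilla")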


Jim

keyplyr

posted 6:38 am on Oct 29, 2003 (gmt 0)

Thanks, it sure looks tempting ;)

dillonstars

posted 9:58 am on Oct 29, 2003 (gmt 0)

Does this all work only on Linux/Unix hosting? If so, are there any alternatives for Windows hosting?

dirkz

posted 11:51 am on Oct 29, 2003 (gmt 0)

Does this all work only on Linux/Unix hosting?

It's meant for Apache. There's also a Win32 build of Apache, but I don't know whether it's suitable for production servers.

Some sites fall prey to constant file pilfering, leeching and unwanted mass downloads

I have experienced both sides: the bot programmer and the site owner. A "good" leeching bot (in the eyes of the leecher) will disguise its UA and never obey a robots.txt. It's quite easy to modify existing bots in Perl or Python to do so. It's also very easy to write your own.

On the other side of the fence, as a site owner I strongly recommend real-time "traffic shaping" based on "offending" IPs, independent of UA strings and robots.txt entirely. It works like a firewall detecting intrusion attempts and DoS attacks, only at a higher level (HTTP).
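As a sketch of that idea (the thresholds, in-memory storage, and the hook into the server are all illustrative; a real deployment would use a firewall or server module):

# count requests per IP in a sliding window; throttle the heavy hitters
import time
from collections import defaultdict, deque

WINDOW = 60      # seconds
MAX_HITS = 120   # requests allowed per IP inside the window

hits = defaultdict(deque)

def allow_request(ip):
    now = time.time()
    q = hits[ip]
    while q and now - q[0] > WINDOW:   # expire timestamps outside the window
        q.popleft()
    if len(q) >= MAX_HITS:
        return False                   # looks like a leecher: 403, 503 or tarpit
    q.append(now)
    return True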

By the way, in my experience a lot of leechers use sophisticated Perl/Python solutions. Sometimes I feel like telling them about wget and its mirroring options. A leecher's life could be so simple :)
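(For the record, the wget version really is a one-liner, using documented wget options, which is exactly why UA- and robots-independent throttling matters:

wget --mirror -e robots=off --user-agent="Mozilla/5.0" http://www.example.com/ )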

ritch_b

posted 12:06 pm on Oct 29, 2003 (gmt 0)

Excellent stuff - works like a charm!

I modded jdMorgan's code a little to catch Opera as well, since its UA string may start with "Opera" rather than "Mozilla":

RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]

Already added to my collection of useful stuff!

Cheers.

R.

amznVibe

posted 1:46 am on Oct 30, 2003 (gmt 0)

Maybe someone should let the White House know about this idea ;)
[webmasterworld.com...]

(since when was there a robots.txt forum? didn't even realize that 'till now)

Bloggerheads

posted 9:27 am on Oct 30, 2003 (gmt 0)

Oops. My bad.

OK, here's a generic example for you:

# /robots.txt file for http://www.example.com

User-agent: *
Allow: /games
Allow: /forum
Allow: /tutorials
Disallow: /

(Note that "Allow" is an extension to the original robots.txt standard: the major engines' crawlers honor it, but many lesser bots only understand "Disallow".)


tschild

posted 10:05 am on Oct 30, 2003 (gmt 0)

Another option would be to specify disallowed directories with a partial path that is just long enough not to match any of the allowed directories.

Say you have the following files and directories:

/index.html
/my_sekrit_files/
/pages/
/sitemap.html
/stuff_that_i_dont_want_to_be_spidered/

then you only need to specify

User-agent: *
Disallow: /m
Disallow: /st

to disallow those two directories while leaving snoopers in the dark about their full names.

wackybrit

posted 10:51 pm on Oct 30, 2003 (gmt 0)

A "good" leeching bot (in the eye of the leecher) will disguise its UA and never obeye a robots.txt.

And the best ones have vast networks, so they can appear to come from hundreds of different IPs with a different User-Agent every time... much like regular visitors.

I'm not sure if there's a viable way to block this sort of activity, as it's very hard to track.

ThierryZoller

posted 10:39 pm on Nov 1, 2003 (gmt 0)

User-agent: *
Disallow: /m
Disallow: /st

robots.txt matching is by prefix, so this is also going to disallow any file whose path starts with those letters, e.g.:
/mofo.html
/moritis.htm
/strange.htm
/stupi.html

If you want to match only directories, better to use trailing slashes:
Disallow: /m/
Disallow: /st/

Note, though, that /m/ matches only a directory literally named "m" (it would not cover /my_sekrit_files/), so the trailing slash trades the obscurity of the partial-path trick for precision.

spud01

posted 2:15 pm on Nov 12, 2003 (gmt 0)

How do you implement this on Windows-based web servers?

nakulgoyal

posted 7:14 am on Nov 17, 2003 (gmt 0)

That was wonderful. I would also like to know how to implement this on Windows.

lightsup55

posted 1:45 am on Dec 2, 2003 (gmt 0)

How do you implement this on Windows-based web servers?

It doesn't matter what software your server is running, as robots.txt is just a text file.

Open Notepad* (or SimpleText for you Macintosh** users), type any of the examples given here (using the directory/file names on your domain), save it as "robots.txt" and upload it to the root folder.

If this isn't what you are asking, please be a little bit more specific.

* Notepad should automatically add ".txt" to the file name upon saving, so you only need to save it as "robots", choosing "Text Documents" as the type.

** Users of Mac OS 9 (and earlier) will need to add ".txt" to any text file they save before uploading it to the web. HTML files are also plain text, but they need a ".htm" or ".html" extension before being uploaded.
Mac OS X's text editor automatically adds ".txt" to the file name upon saving, so you only need to save it as "robots".
