
Forum Moderators: goodroi


files requested by robots.txt

I would like to find out about the files requested by robots

   
7:45 pm on Jun 5, 2003 (gmt 0)

10+ Year Member



Hello

I am just about to write a robots.txt file which I have never done before.

I know that search engine spiders retrieve a page from a web site, find the links on that page, and then follow those links. What I want to know is whether robots will request any other files from my server that aren't linked to. I wasn't sure this was possible, because the spiders obviously don't know the names of files that aren't linked to, so how could they request them?

I ask because I would like to save myself the trouble of writing a robots.txt file that disallows access to all the other files on my server which aren't actually linked to. Sadly, there are many of them all over the place!

Many thanks
Andrea

12:46 am on Jun 6, 2003 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Andrea
Welcome to WebmasterWorld.

If you have pages on your website(s) that have no links pointing to them anywhere on the internet, then I would suggest NOT listing them as disallowed in robots.txt.
Why point a devious bot or person to something you would rather keep out of public view? Your robots.txt is, for the most part, a public file.

I'll give you a couple of examples.

At one point I was courageous enough to use my standard email and a two-line signature in my Usenet posts.
I was looking for a way to develop a six-generation family database structure using MS Access. In my inquiry I provided a link to an example file, displayed on a page in a folder which is basically private and used sparingly for a few friends.
Some four years later, I still get an occasional referral from the Google Groups archive from people looking for that no-longer-existent example. :(

On another occasion somebody on Usenet was asking about the free web pages provided by a discount registrar.
To show that person that a pop-up was added, I provided that domain name in a Usenet post.
Again, four years later, I get referrals from that post.

Initially, when I created the folder, my sole intent was to keep it private from the mainstream internet. These two slips taught me valuable lessons.

Don

2:35 pm on Jun 7, 2003 (gmt 0)

10+ Year Member



Hi Don

Thanks very much for the reply.

If you are still around I would like to try and clarify your answer because I am new to all this and still find it a bit confusing.

Are you saying that if I disallow my private/development files in robots.txt, I will make the existence of these files known? And that if I don't put these directories in the robots.txt file, the spiders won't know they exist and can't request them (unless of course they are linked to, which they aren't)?

Many thanks again
Andrea

3:53 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Andrea

Good robots will honor your robots.txt in most instances.
The ONLY way that spiders/bots can become aware of a page is if a link to that page exists somewhere on the web. If you are the only one aware of the existence of these files/folders, there is no way a bot can find them. (If you sticky me, I'll provide an example.)
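
To illustrate the point about link discovery, here is a minimal Python sketch of how a spider might collect the URLs it will request next from a page it has already fetched. The names `LinkCollector` and `discover_links`, the example.com URLs, and the sample markup are assumptions made up for this illustration; real crawlers are far more elaborate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs, as a crawler would.
                self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return the absolute URLs linked from the given HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content: two linked files, nothing else.
page = '<a href="/about.html">About</a> <a href="products/list.html">Products</a>'
found = discover_links("http://www.example.com/index.html", page)
```

A file such as `/private/notes.html` that appears in no `href` anywhere never ends up in `found`, so a crawler working this way has no URL to request for it.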

Are you saying that if I disallow my private/development files in robots.txt, I will make the existence of these files known? And that if I don't put these directories in the robots.txt file, the spiders won't know they exist and can't request them (unless of course they are linked to, which they aren't)?

Yes.

DEVIOUS robots in most instances don't even bother with robots.txt. However, there is no need to take a chance on pointing devious bots to folders or files that neither your website nor any other website links to.

So yes, I'm suggesting that if you have pages which are either in the works or used as "private/development files", you do NOT list them in your robots.txt.

A much safer solution would be to dump all those files in a folder which denies access to most everybody except desired IP ranges.
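
The logic of such an IP restriction can be sketched in a few lines of Python. This is only an illustration of the idea, not the rewrite mentioned in this thread; the function name `is_allowed` and the address ranges (reserved documentation ranges) are assumptions.

```python
import ipaddress

# Hypothetical allow list: one small network and one single address.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


inside = is_allowed("192.0.2.44")    # within the first allowed range
outside = is_allowed("203.0.113.9")  # not in any allowed range
```

On a real server this check would normally live in the web server configuration rather than in a script, and every request from an address outside the list would be refused before any file is served.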

Jim provided me with a nice rewrite a short while back when I began working on the Oceanic IP ranges. Although it requires some caution, it has been effective.
If you're interested, sticky me and I'll attempt to assist you in setting it up.

Don

9:37 pm on Jun 7, 2003 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Andrea,

Welcome to WebmasterWorld [webmasterworld.com]!

It has been rumored that visiting a page with the Google Toolbar installed can cause a visit from the Googlebot. Because of this, I'd recommend you add the <meta name="robots" content="noindex"> tag to those pages you absolutely don't want disclosed.
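
As a rough way to check that a page really carries that tag, the following Python sketch reads the robots meta directives out of a page's HTML. The names `RobotsMetaParser` and `robots_directives` and the sample page are assumptions for illustration only.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers the directives found in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives present in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical private page.
page = '<html><head><meta name="robots" content="noindex"></head><body>Draft</body></html>'
directives = robots_directives(page)
has_noindex = "noindex" in directives
```

Keep in mind that a robot only sees this tag after it has fetched the page, so it asks well-behaved robots not to index the page but does not stop the request itself.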

If you'd like to write a more compact robots.txt, I posted some suggestions here [webmasterworld.com].

Jim

10:22 pm on Jun 7, 2003 (gmt 0)

10+ Year Member




DEVIOUS robots in most instances don't even bother with robots.txt. However, there is no need to take a chance on pointing devious bots to folders or files that neither your website nor any other website links to.

And FWIW, I have on occasion seen entries in my server log indicative of real humans viewing my robots.txt files and then deliberately checking out the "disallowed" directories.
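
To show how little effort that takes, here is a minimal Python sketch that lists every path a robots.txt file advertises through its Disallow lines. The function name `disallowed_paths` and the sample file contents are assumptions made up for this example.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty path named in a Disallow line."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "nothing is disallowed"
                paths.append(path)
    return paths


# Hypothetical robots.txt contents.
sample = """User-agent: *
Disallow: /cgi-bin/
Disallow: /private/   # folder nobody links to
Disallow:
"""
exposed = disallowed_paths(sample)
```

Anyone who can fetch the file, human or bot, gets the same list, which is why unlinked private folders are better left out of it.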

11:33 am on Jun 8, 2003 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Google Toolbar

Cannot install the thing Jim.
I have Active X turned off, which it requires.

7:53 pm on Jun 8, 2003 (gmt 0)

10+ Year Member



Hello

Thank you to Don and Jim for your replies.

I am definitely wiser than before. And thanks for the link to the great page about reducing the robots.txt file.

Don, I am interested in finding out about the script to restrict access to a page. I will mail you about this.

Many thanks
Andrea
