Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
files requested by robots.txt
I would like to find out about the files requested by robots
andrea edwards




msg:1529602
 7:45 pm on Jun 5, 2003 (gmt 0)

Hello

I am just about to write a robots.txt file which I have never done before.

I know that search engine spiders retrieve a page from a web site, find the links on that page, and then follow those links. What I wanted to know is whether robots will request any other files from my server that aren't linked to. I wasn't sure this was possible, because the spiders obviously don't know the names of files that aren't linked to, and if they don't know the names, how can they request them?

I ask because I would like to save myself the trouble of writing a robots.txt file that disallows access to all the other files on my server which aren't actually linked to. Sadly, there are many of them, all over the place!

Many thanks
Andrea

 

wilderness




msg:1529603
 12:46 am on Jun 6, 2003 (gmt 0)

Andrea
Welcome to Webmaster World.

If you have pages on your website(s) which do not have links pointing to them anyplace on the internet, then I would suggest NOT listing these pages as disallowed in robots.txt.
Why point a devious bot or person to something you do not want to exist in public? For the most part, your robots.txt is a public forum.

I'll give you a couple of examples.

At one point I was courageous enough to use my standard email and a two-line signature in my Usenet participation mail submissions.
I was looking for a way to develop a six-generation family database structure using MS Access. In my inquiry I provided a link to an example file, which was displayed on a page in a folder that is basically private and used sparingly for a few friends.
Some four years later, I still get an occasional referral from the Google Groups archive from people looking for that no-longer-existent example. :(

On another occasion somebody in Usenet was inquiring about the free web pages which are provided with a discount registrar.
To show that person that a pop-up was added, I provided that domain name in a Usenet mail.
Again, four years later, I get referrals from that mail in Usenet.

Initially, when I created the folder, my sole intent was to keep it private from the mainstream internet. These two slips taught me valuable lessons.

Don

andrea edwards




msg:1529604
 2:35 pm on Jun 7, 2003 (gmt 0)

Hi Don

Thanks very much for the reply.

If you are still around I would like to try and clarify your answer because I am new to all this and still find it a bit confusing.

Are you saying that if I disallow my private/development files in the robots.txt I will make the existence of these files known? And that if I don't put these directories in the robots.txt file, the spiders won't know they exist and can't request them (unless of course they are linked to, which they aren't)?

Many thanks again
Andrea

wilderness




msg:1529605
 3:53 pm on Jun 7, 2003 (gmt 0)

Andrea

Good robots would honor your robots.txt in most instances.
The ONLY way that spiders/bots can become aware of a page is if a link to that page exists some other place on the web. If you are the only one aware of the existence of these files/folders, there is no way a bot can find them. (If you sticky me, I'll provide an example.)
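To make "honoring robots.txt" concrete, here is a minimal sketch of the check a well-behaved robot performs before requesting a URL. It uses Python's standard `urllib.robotparser`; the robots.txt content, the bot name `ExampleBot`, and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robot that honors robots_txt may request url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/cgi-bin/form.pl",
    ):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that this check is entirely voluntary on the robot's side: nothing on the server enforces it.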

Are you saying that if I disallow my private/development files in the robots.txt I will make the existence of these files known? And that if I don't put these directories in the robots.txt file, the spiders won't know they exist and can't request them (unless of course they are linked to, which they aren't)?

Yes.

DEVIOUS robots in most instances don't even bother with robots.txt. However, there is no need to take a chance on pointing devious bots to folders or files that neither your websites nor any other websites link to.

So yes, I'm suggesting that if you have pages which are either in the works or used as "private/development files", you do NOT list them in your robots.txt.
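As a rough illustration of why listing private folders is risky, the sketch below shows how little effort it takes to pull every Disallow path out of a robots.txt file. The file content and folder names are hypothetical, and the parsing is deliberately simplified (it ignores which User-agent group a rule belongs to).

```python
from typing import List

# Hypothetical robots.txt that "hides" two unlinked folders.
ROBOTS_TXT = """\
# keep robots out of work in progress
User-agent: *
Disallow: /private/
Disallow: /dev/new-site/   # not ready yet
Disallow:
"""


def disclosed_paths(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow path, in the order it appears."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


if __name__ == "__main__":
    for path in disclosed_paths(ROBOTS_TXT):
        print("robots.txt reveals:", path)
```

Anyone, robot or human, can fetch the file and run the same kind of extraction, so each listed path is effectively published.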

A much safer solution would be to dump all those files in a folder which denies access to most everybody except desired IP ranges.
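The idea of admitting only desired IP ranges can be sketched as follows. This is not the actual rule set used on any server here; in practice the restriction would normally live in the web server's own configuration. The Python below only illustrates the logic, and the network ranges are reserved documentation addresses chosen as placeholders.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks), not real visitor ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.20", "203.0.113.5"):
        print(ip, "allowed" if is_allowed(ip) else "denied")
```

Unlike a robots.txt entry, a check like this is enforced by the server, so it does not depend on the visitor's good behavior and it does not advertise the folder's name.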

Jim provided me with a nice rewrite a short while back when I began working on the Oceanic IP ranges. Although it requires some caution, it has been effective.
If you're interested, sticky me and I'll attempt to assist you in setting it up.

Don

jdMorgan




msg:1529606
 9:37 pm on Jun 7, 2003 (gmt 0)

Andrea,

Welcome to WebmasterWorld [webmasterworld.com]!

It has been rumored that visiting a page with the Google Toolbar installed can cause a visit from the Googlebot. Because of this, I'd recommend you add the <meta name="robots" content="noindex"> tag to those pages you absolutely don't want disclosed.
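If you want to confirm that a page really carries the tag, a small check along these lines may help. It is only a sketch using Python's standard `html.parser`; the class and function names are made up for this example, and it looks only at `<meta name="robots">`, not at engine-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
    print(has_noindex(page))
```

Keep in mind that a robot has to request the page in order to see this tag, so it limits indexing rather than access.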

If you'd like to write a more compact robots.txt, I posted some suggestions here [webmasterworld.com].

Jim

rbs10025




msg:1529607
 10:22 pm on Jun 7, 2003 (gmt 0)


DEVIOUS robots in most instances don't even bother with robots.txt. However, there is no need to take a chance on pointing devious bots to folders or files that neither your websites nor any other websites link to.

And FWIW, I have on occasion seen entries in my server log indicative of real humans viewing my robots.txt files and then deliberately checking out the "disallowed" directories.

wilderness




msg:1529608
 11:33 am on Jun 8, 2003 (gmt 0)

Google Toolbar

Cannot install the thing, Jim.
I have ActiveX turned off, which it requires.

andrea edwards




msg:1529609
 7:53 pm on Jun 8, 2003 (gmt 0)

Hello

Thank you to Don and Jim for your replies.

I am definitely wiser than before. And thanks for the link to the great page about reducing the robots.txt file.

Don, I am interested in finding out about the script to restrict access to a page. I will mail you about this.

Many thanks
Andrea


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved