Can people or spiders get to pages I haven't linked to?

Wondering about new content not ready for public consumption

stlouislouis

6:43 am on Nov 22, 2002 (gmt 0)

Hi,

I have a site with a homepage that does not have links to any other pages on my site. My site is under development, and thus not ready for public consumption yet. I'm hosted on a name-based "shared server" web hosting account. It's a Linux/Apache setup.

Can I be sure that -- unless I tell someone or someone guesses my directory structure and file names -- I don't have to worry about anyone browsing web pages I upload to my domain that I haven't linked to? That neither person nor spider would have any way of finding the web pages that I've uploaded, but haven't linked to?

Really wondering about this. What's the definitive answer?

Thank you very much!

Louis

andreasfriedrich

6:55 am on Nov 22, 2002 (gmt 0)

If there is really no link and nobody submits any of your secret pages to any search engine, then nobody should find them.

To be sure you could just use basic HTTP authentication.

Andreas
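To illustrate the basic HTTP authentication suggestion above, here is a minimal Python sketch of the check a server performs on the Authorization header. The function names and the "louis" / "s3cret" credentials are made up for illustration; on Apache the real check is done by the server itself, not by code like this.

```python
import base64
import binascii


def make_basic_header(user, password):
    """Build the Authorization header value a browser sends for Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def check_basic_auth(authorization_header, expected_user, expected_password):
    """Return the HTTP status a protected page would answer with (200 or 401)."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return 401  # no credentials sent, e.g. a spider or a casual visitor
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return 401  # malformed header
    user, sep, password = decoded.partition(":")
    if sep and user == expected_user and password == expected_password:
        return 200
    return 401


# Hypothetical credentials, for illustration only.
print(check_basic_auth(None, "louis", "s3cret"))
print(check_basic_auth(make_basic_header("louis", "s3cret"), "louis", "s3cret"))
```

The point of the sketch: a request that carries no credentials is refused whether or not the visitor knows the URL, so the protection does not depend on the page staying unlinked.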

fathom

6:57 am on Nov 22, 2002 (gmt 0)

Spiders can, relatively easily; for people using browsers it is a bit more difficult.

Spiders, bots, crawlers -- for example: if a new page is in a directory or sub-directory where other pages are currently linked to, and your robots.txt allows spiders, bots, and crawlers access to that directory (rather than disallowing it), they can gobble up anything and everything in that directory.

A user could try to view the "parent directory" where these new pages reside (e.g. www.example.com/direct/) without adding a specific page (pagename.html) at the end. Depending on your host's server configuration and/or your use of .htaccess, they may see a list of everything in that directory and then access those pages directly from there.

andreasfriedrich

7:07 am on Nov 22, 2002 (gmt 0)

That would still be a link though.

To view the "parent directory", somebody must already know something about the directory structure. That person must have arrived at the server somehow, after all. However, Louis ruled out any knowledge about the site and any successful guesses.

I still maintain that if nobody knows, nobody guesses and there are no links there is nothing to fear.

Andreas

Sinner_G

7:19 am on Nov 22, 2002 (gmt 0)

It is also possible for spiders to read your log files (for example, if your logs or stats pages are publicly reachable). If you have viewed your pages, their URLs will be recorded there, so spiders could find them.

fathom

7:38 am on Nov 22, 2002 (gmt 0)

That would still be a link though.

No it wouldn't.

If I have a page called me.html in directory /two/ of my site design and then save a copy as you.html in the same directory, I am the only one who knows you.html exists

...and there are no links to you.html whatsoever.

By default, a request for the directory www.domain.com/two/ will open index.html, but if no index page exists (and the server allows directory listings), a list of all files in that directory will be shown, exposing all your new "secret" pages, such as:

xxx.xxx.xxx.xxx - /c/
--------------------------------------------------------------------------------
[To Parent Directory]
Wednesday, June 12, 2002 10:45 AM <dir> 1
Tuesday, November 06, 2001 10:32 PM 376720 1.zip
Wednesday, October 10, 2001 2:26 PM 32652 103-9497563-8811068
Tuesday, August 13, 2002 6:43 AM 110422 Academic_Brochure.pdf
Wednesday, August 21, 2002 12:23 AM 22016 aju9wbe0.doc
Friday, August 02, 2002 11:13 AM 99652 Astronomy manual v2 07 02.pdf
Tuesday, October 30, 2001 4:26 PM 2847 Atmosphere.gif
Monday, May 28, 2001 11:35 PM 275 BFFTP.REG

andreasfriedrich

7:51 am on Nov 22, 2002 (gmt 0)

I guess we just have a different understanding of the term link.

I just viewed an index page produced by Apache, and the source looked like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /root</TITLE>
</HEAD>
<BODY>
<H1>Index of /root</H1>
<PRE><IMG SRC="/icons/blank.gif" ALT=" "> <A HREF="?N=D">Name</A> <A HREF="?M=A">Last modified</A> <A HREF="?S=A">Size</A> <A HREF="?D=A">Description</A>
<HR>
<IMG SRC="/icons/back.gif" ALT="[DIR]"> <A HREF="/">Parent Directory</A> 22-Nov-2002 02:33 -
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="aaron_cursor.gif">aaron_cursor.gif</A> 30-Mai-2002 02:54 9k

<A HREF="aaron_cursor.gif">aaron_cursor.gif</A> looks like a link to me.

Andreas
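To make the point about auto-generated index pages concrete, here is a rough Python sketch of how a crawler could harvest file URLs from such a listing. The sample HTML is a trimmed copy of the Apache output quoted above; the www.example.com base URL and the function names are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_files(index_html, base_url):
    """Return absolute URLs linked from a directory index page."""
    parser = LinkCollector()
    parser.feed(index_html)
    urls = []
    for href in parser.hrefs:
        if href.startswith("?"):
            continue  # skip the column-sort links (?N=D, ?M=A, ...)
        urls.append(urljoin(base_url, href))
    return urls


SAMPLE_INDEX = """<H1>Index of /root</H1>
<PRE><A HREF="?N=D">Name</A> <A HREF="?M=A">Last modified</A>
<HR>
<A HREF="/">Parent Directory</A>
<A HREF="aaron_cursor.gif">aaron_cursor.gif</A>
</PRE>"""

# Assumed base URL; any host would do.
print(discover_files(SAMPLE_INDEX, "http://www.example.com/root/"))
```

Every file in the directory shows up as an ordinary anchor, so a spider that reaches the listing needs no guessing at all.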

fathom

8:12 am on Nov 22, 2002 (gmt 0)

Sorry Andreas, we agree here.

If a domain is public, then all of its pages are public, regardless of whether the main page links to them or not, as you have correctly shown with the root index.

stlouislouis wrote: Can I be sure

No...

You should not place under-construction designs on a public server under a currently public domain.

Develop the site offline and, when the design is ready, upload it to the server without linking to it from the main page.

When you have previewed it for browser variations etc., link accordingly.

In practice, you should not design or develop live, to avoid embarrassment.

andreasfriedrich

8:22 am on Nov 22, 2002 (gmt 0)

I agree entirely with fathom.

There are just too many ifs in "if nobody knows, nobody guesses and there are no links, there is nothing to fear". Chances are your site will be discovered.

Andreas

Mohamed_E

12:12 pm on Nov 22, 2002 (gmt 0)

From the Google Webmaster FAQ [google.com]:

Why is Googlebot downloading information from our "secret" web server?
It is almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, it is likely that your "secret" URL is in the referer tag, and it can be stored and possibly published by the other web server in its referer log. So, if there is a link to your "secret" web server or page on the web anywhere, it is likely that Googlebot and other "web crawlers" will find it.

A third vote for keeping "not ready for prime time" stuff off the web server.
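The referer leak described in the FAQ quote can be sketched in Python as follows: the other site's access log is scanned for referer values that point back at the "secret" host. The two log lines are invented samples in Apache's "combined" log format, and dev.example.com is a placeholder host name.

```python
import re
from urllib.parse import urlsplit

# Apache "combined" format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def leaked_urls(log_lines, secret_host):
    """Return the URLs on secret_host that appear as referers in the log."""
    found = set()
    for line in log_lines:
        match = COMBINED.match(line.strip())
        if not match:
            continue  # not a combined-format line
        referer = match.group("referer")
        if urlsplit(referer).hostname == secret_host:
            found.add(referer)
    return sorted(found)


# Invented sample lines, as they might appear in the OTHER site's log.
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Nov/2002:07:19:00 +0000] "GET /index.html HTTP/1.0" '
    '200 2326 "http://dev.example.com/new/draft.html" "Mozilla/4.0"',
    '192.0.2.11 - - [22/Nov/2002:07:20:00 +0000] "GET /about.html HTTP/1.0" '
    '200 1024 "-" "Mozilla/4.0"',
]

print(leaked_urls(SAMPLE_LOG, "dev.example.com"))
```

If that other site publishes its referer statistics, the draft URL becomes an ordinary link that any crawler can follow.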

toadhall

3:09 pm on Nov 22, 2002 (gmt 0)

How about directories protected by .htaccess username/password challenge? Are these not inaccessible to spiders?

graywolf

3:47 pm on Nov 22, 2002 (gmt 0)

We keep two sites running: the live site and the development site, under a different URL. On the development site I have a disallow-all in the robots.txt. Robots do scan it from time to time, but the "big" search engines don't list it. Yes, if someone does figure it out they can see the content, but there isn't any "top secret" content there.
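For reference, this is roughly how a well-behaved crawler interprets a disallow-all robots.txt, sketched with Python's standard urllib.robotparser module. The dev.example.com URL and the "Googlebot" user-agent string are just example inputs.

```python
from urllib.robotparser import RobotFileParser

# A disallow-all robots.txt, as described for the development site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL on an assumed development host.
print(allowed(ROBOTS_TXT, "Googlebot", "http://dev.example.com/new/page.html"))
```

Keep in mind that robots.txt is only a request: it keeps compliant spiders out, but nothing stops a robot or a person from ignoring it, which fits the observation that robots still scan the site from time to time.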

stlouislouis

4:07 pm on Nov 22, 2002 (gmt 0)

Hi all,

Thanks for helping on this. I understand that it's best practice not to develop live, but offline instead; to only upload the index.html homepage that links to the rest of the site when the site is ready for public consumption.

But it seems to me that it's important for (at least commercial) webmasters to know "for sure" what ways folks can find inner pages not linked to on their domains. I'm really curious about this. Surely all methods are knowable.

So besides the point made in Mohamed_E's post, I'm still unclear how else people or spiders could find the interior pages of a domain that have not been linked to, at least when anyone who guesses a directory name and tries to browse it gets a "forbidden -- access not allowed" page back instead of a file listing. I haven't a clue if or how spiders could get to the log files on my host's servers. How can one be sure to keep spiders away from one's log files?

Is my logic flawed? What other ways exist besides what's been pointed out? Surely there are not an infinite number of ways or "unknowable" methods.

Thanks a lot for contributing on this!

Louis

Travoli

4:17 pm on Nov 22, 2002 (gmt 0)

I don't think it has been mentioned (apologies if it has), but:

Don't view the pages with the Google Toolbar installed if you don't want them found yet.

Mohamed_E

4:30 pm on Nov 22, 2002 (gmt 0)

About the toolbar, I remember Googleguy saying (in essence; I failed to find the post):

We do not currently use the toolbar "phone home" feature to find new pages, but we believe that nothing in the terms of service prohibits it.

fathom

5:47 pm on Nov 22, 2002 (gmt 0)

I'm still unclear how else people

This isn't that easy to do. I can, though: given your domain, I can see every file, page and image in every directory, as long as there are no permission-based restrictions.

spiders could find the interior pages of a domain that have not been linked to.

For example: if you have another page in the root directory and the bot has access to your index page (which normally means it has access to the whole root directory), then it has access to that other page too.

If this other page links to the rest of the site under construction, the bot now has access to every page (unless you disallow it from everything but the root and place no additional pages in the root).

In addition, you initially said that you are hosted on a name-based shared-server web hosting account, which also means that bots and spiders more likely than not already visit that server, including your host's site.

Just because a bot submission wasn't requested doesn't mean they don't actively seek out new pages and sites in and around the ones they currently visit; as long as they are allowed to visit, they will.

You also wrote "unless I tell someone".

If you told me, I could tell you exactly what directories and pages you have, and I could download your entire unfinished site as well.

Can a random web user do this? No, probably not.

As I originally stated:

Spiders can, relatively easily; for people using browsers it is a bit more difficult.