Forum Moderators: phranque
I have a site whose homepage does not link to any other pages on my site. The site is under development and not ready for public consumption yet. I'm hosted on a name-based "shared server" web hosting account. It's a Linux/Apache setup.
Can I be sure that -- unless I tell someone or someone guesses my directory structure and file names -- I don't have to worry about anyone browsing web pages I upload to my domain that I haven't linked to? That neither person nor spider would have any way of finding the web pages that I've uploaded, but haven't linked to?
Really wondering about this. What's the definitive answer?
Thank you very much!
Louis
Spiders, bots, crawlers -- example: if a new page is in a directory or sub-directory where other pages are currently linked to, and your robots.txt allows spiders, bots and crawlers access to pages in this directory (rather than disallowing it), they can gobble up anything and everything in that directory.
A user could attempt to view the "parent directory" where these new pages reside (e.g. www.example.com/direct/) without adding a specific page (pagename.html) at the end. Depending on your host's server configuration and/or your use of .htaccess, they may see a list of all content within that directory and then access these pages directly from there.
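To make the robots.txt point concrete, here is a minimal Python sketch of how a well-behaved crawler would interpret a Disallow rule. The robots.txt content, the /dev/ directory and the bot name "ExampleBot" are all made up for illustration; only the standard library's urllib.robotparser is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /dev/ directory name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a placeholder user agent, not a real crawler.
    for url in ("http://www.example.com/dev/you.html",
                "http://www.example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Keep in mind that robots.txt is only a request to polite crawlers, not access control, and that the file itself is publicly readable, so any directory named in it is no longer a secret.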
To view the "parent directory", somebody must already know something about the directory structure. That person must have arrived at the server somehow, after all. However, Louis ruled out any knowledge about the site and any successful guesses.
I still maintain that if nobody knows, nobody guesses and there are no links, there is nothing to fear.
Andreas
That would still be a link though.
No it wouldn't.
If I have a page called me.html in directory /two/ in my site design and then save a copy as you.html in the same directory, I am the only one who knows you.html exists
...and there are no links to you.html whatsoever.
By default, a request for the directory www.domain.com/two/ will open index.html, but if no index page exists, a listing of all files in that directory will be shown, exposing all your new "secret" pages, such as:
xxx.xxx.xxx.xxx - /c/
--------------------------------------------------------------------------------
[To Parent Directory]
Wednesday, June 12, 2002 10:45 AM <dir> 1
Tuesday, November 06, 2001 10:32 PM 376720 1.zip
Wednesday, October 10, 2001 2:26 PM 32652 103-9497563-8811068
Tuesday, August 13, 2002 6:43 AM 110422 Academic_Brochure.pdf
Wednesday, August 21, 2002 12:23 AM 22016 aju9wbe0.doc
Friday, August 02, 2002 11:13 AM 99652 Astronomy manual v2 07 02.pdf
Tuesday, October 30, 2001 4:26 PM 2847 Atmosphere.gif
Monday, May 28, 2001 11:35 PM 275 BFFTP.REG
I just viewed an index page produced by Apache, and the source looked like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /root</TITLE>
</HEAD>
<BODY>
<H1>Index of /root</H1>
<PRE><IMG SRC="/icons/blank.gif" ALT=" "> <A HREF="?N=D">Name</A> <A HREF="?M=A">Last modified</A> <A HREF="?S=A">Size</A> <A HREF="?D=A">Description</A>
<HR>
<IMG SRC="/icons/back.gif" ALT="[DIR]"> <A HREF="/">Parent Directory</A> 22-Nov-2002 02:33 -
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="aaron_cursor.gif">aaron_cursor.gif</A> 30-Mai-2002 02:54 9k
<A HREF="aaron_cursor.gif">aaron_cursor.gif</A> looks like a link to me.
Andreas
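To illustrate the point about the generated index page, here is a rough Python sketch that pulls the file links out of such a listing, much as a crawler would. The SAMPLE string is a shortened version of the markup quoted above; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def listed_files(index_html):
    """Return hrefs that look like files in the listed directory."""
    collector = LinkCollector()
    collector.feed(index_html)
    # Skip the column-sort links (?N=D etc.) and absolute paths such as
    # the "Parent Directory" link.
    return [h for h in collector.hrefs
            if not h.startswith("?") and not h.startswith("/")]


# Shortened copy of the listing markup quoted in this thread.
SAMPLE = '''<PRE><A HREF="?N=D">Name</A> <A HREF="?M=A">Last modified</A>
<HR>
<A HREF="/">Parent Directory</A> 22-Nov-2002 02:33 -
<A HREF="aaron_cursor.gif">aaron_cursor.gif</A> 30-Mai-2002 02:54 9k
</PRE>'''

if __name__ == "__main__":
    print(listed_files(SAMPLE))
```

Anything that can request the directory URL and parse HTML gets the same list, which is why an exposed listing effectively links to every file in the directory.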
If a domain is public, then all pages are public, regardless of whether the main page links to them or not, as you have correctly stated (according to the root index).
stlouislouis wrote: Can I be sure
No...
You should not place under-construction designs on a public server under a current public domain.
Develop the site off-line and, when the design is ready, upload it to the server without linking it from the main page.
When you have previewed it for browser variations etc., link accordingly.
In practice, you should not design or develop live, to avoid embarrassment.
Why is Googlebot downloading information from our "secret" web server?
It is almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, it is likely that your "secret" URL is in the referer tag, and it can be stored and possibly published by the other web server in its referer log. So, if there is a link to your "secret" web server or page on the web anywhere, it is likely that Googlebot and other "web crawlers" will find it.
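As a rough illustration of the referer leak described above, the Python sketch below extracts the Referer field from one access-log line on the *other* web server. It assumes the log uses Apache's "combined" format; the sample line, IP address and URLs are invented for the example.

```python
import re

# Pattern for one line of Apache's "combined" log format (assumed here):
# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referer_of(line):
    """Return the referring URL recorded in a log line, or None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    referer = match.group("referer")
    # A lone "-" means the browser sent no Referer header.
    return None if referer in ("", "-") else referer


# Invented example: a visitor clicked a link on the unlinked page
# /two/you.html and landed on another site's /news.html.
SAMPLE_LINE = (
    '192.0.2.10 - - [22/Nov/2002:02:33:00 +0000] '
    '"GET /news.html HTTP/1.0" 200 5120 '
    '"http://www.example.com/two/you.html" "Mozilla/4.0"'
)

if __name__ == "__main__":
    print(referer_of(SAMPLE_LINE))
```

If that other site publishes its referer statistics, the "secret" URL ends up on a public page where any crawler can pick it up.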
A third vote for keeping "not ready for prime time" stuff off the web server.
Thanks for helping on this. I understand that it's best practice not to develop live, but offline instead; to only upload the index.html homepage that links to the rest of the site when the site is ready for public consumption.
But it seems to me that it's important for (at least commercial) webmasters to know "for sure" what ways folks can find inner pages not linked to on their domains. I'm really curious about this. Surely all methods are knowable.
So besides the point made in Mohamed_E's post, I'm still unclear how else people or spiders could find the interior pages of a domain that have not been linked to. At least when folks who might guess a directory name get a "forbidden -- access not allowed" page back instead of a file listing when they try to browse a directory. I haven't a clue if or how spiders could get to the log files on my host's servers. How can one be sure to exclude them from one's log files?
Is my logic flawed? What other ways exist besides what's been pointed out? Surely there are not an infinite number of ways or "unknowable" methods.
Thanks a lot for contributing on this!
Louis
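For the "forbidden page instead of a file listing" case, one way to check your own directories is sketched below in Python. It is only a heuristic: it assumes the server's auto-generated listings carry a title starting with "Index of /", as in the Apache page source quoted earlier in this thread, and the function names are made up for the example.

```python
import urllib.error
import urllib.request


def classify(status, body):
    """Classify the response to a directory URL from its status and body."""
    if status == 403:
        return "forbidden"
    if status == 404:
        return "not found"
    # Heuristic: Apache-style auto-index pages are titled "Index of /...".
    if status == 200 and "<title>index of /" in body.lower():
        return "listing exposed"
    return "other"


def probe(url):
    """Fetch url (a directory on your own site) and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(65536).decode("latin-1", errors="replace")
            return classify(resp.status, body)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return classify(err.code, "")


if __name__ == "__main__":
    print(classify(200, "<HTML><HEAD><TITLE>Index of /root</TITLE>"))
    print(classify(403, ""))
```

Calling probe() on each of your own directory URLs tells you which ones answer with a listing, but a "forbidden" result only closes that one route; it says nothing about referer logs or guessed file names.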
I'm still unclear how else people
This isn't that easy to do. But if I have your domain name, I can see every file, page and image in every directory, as long as there are no permission-based restrictions.
spiders could find the interior pages of a domain that have not been linked to.
For example: if you have another page in the root directory and the bot has access to your index page (which normally means it has access to the whole root directory), it has access to the other page as well.
If this other page is linked to the rest of the site under construction, the bot now has access to every page (unless you disallow it from everything but the root and place no additional pages in the root).
In addition, you initially said "I'm hosted on a name-based "shared server" web hosting account", which also means that bots and spiders more likely than not already visit that server, including your host's site.
Just because a bot submission wasn't requested doesn't mean they don't actively seek out new pages and sites in and around the ones they currently visit, and as long as they are allowed to visit, they will.
You also wrote "unless I tell someone".
If you told me, I could tell you exactly what directories and pages you have, and I could download your entire unfinished site as well.
Can a random web user do this... no probably not.
As I originally stated:
Spiders can do it relatively easily, but for browsers (people) it is a bit more difficult.