Forum Moderators: mack

Message Too Old, No Replies

Getting Indexed by Spider

         

Bud_Bundy

4:00 am on Nov 4, 2003 (gmt 0)

10+ Year Member



If I make a page and leave it in the root directory without linking to it from an index.html or any other page, would it get indexed by SE spiders?

deejay

4:18 am on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Generally, no.

Spiders don't fly. Gotta give 'em a link to crawl.

On the other hand... I'm too old'n'cynical to say a firm 'no'... cos sure as eggs there'll be an exception.

mattglet

1:33 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



well, i'm young: NO.

the only way a SE can browse your site is via links to your files. if there's no link to the file anywhere, it's considered non-existent by the engines. the only other way to find your files would be to scan the directory for every possible file name. that would consume a TON of resources for every engine, which would be totally impractical. i think it would also raise some security concerns.

-Matt

Sinner_G

1:42 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I'm not old, BUT there are a couple of possibilities.

AFAIK it is not certain whether surfing to a page with the Google toolbar installed prompts Googlebot to visit it or not.

If that page has external links to sites that make their log files public, its URL can show up as a referrer there and thus get spidered.

Web Footed Newbie

1:50 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



Okay, so you want an orphan page? Depending on your purpose, that may make sense. BUT, if you don't want it to be crawled, don't publish it on the web at all. Eventually, someone you don't know will place a link to it, and it will probably get crawled.

If you do want it crawled, put some internal and external links to it.

If it's a page just for your friends, it can still eventually get crawled, what I refer to as "engine magic."

What is the purpose of the orphan page? If it is "for your eyes only," put a sign-in function on your site to control access to the page.
WFN:)

Mohamed_E

2:10 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WFN has hit the nail on the head.

It is unlikely that a file with no links will get crawled, but if you want to make it unavailable to spiders (or anyone unauthorized) the only safe way is to protect it with a password.

Bud_Bundy

2:10 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



When I built shtml pages, I inserted a <!--#include virtual="note.html" --> line on every page. That "note.html" was a link directory to every other page. Within a couple of months, all of the pages had been indexed by Google's bots. I didn't put any links to "note.html"; I just dropped it in the root along with the other shtml files. (Note: the "<!--#include virtual="note.html" -->" directive is not a URL link.)
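For reference, the setup described here looks roughly like this (the page names are just examples, and the server must have SSI enabled for .shtml files):

```html
<!-- somepage.shtml (hypothetical): the directive below is processed
     on the server, so a spider sees note.html's markup spliced in,
     not a link pointing at note.html -->
<!--#include virtual="note.html" -->

<!-- note.html itself: ordinary anchor links, which spiders CAN follow -->
<a href="page1.shtml">Page 1</a>
<a href="page2.shtml">Page 2</a>
```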

So, I was assuming that if I made a page and put it in the root along with the other files, it would still get indexed by the SE bots/spiders, since the bots would first hit the root directory and go from there to every other page. If I didn't want the file indexed, couldn't I still add a line or two of some sort of command in that file to keep the bots from crawling it?

Reflect

6:29 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



Yes you can add the META...

<meta name="robots" content="noindex,nofollow">

You can also add a Disallow rule to your robots.txt file.
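For instance, here's a robots.txt rule for a hypothetical note.html, plus a quick check with Python's standard urllib.robotparser showing how a compliant crawler would interpret it (the file name and domain are made-up examples):

```python
import urllib.robotparser

# Hypothetical robots.txt asking all bots to skip note.html
rules = """\
User-agent: *
Disallow: /note.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL
print(rp.can_fetch("*", "http://example.com/note.html"))   # False
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
```

Remember this only works on bots that choose to obey robots.txt; rogue spiders ignore it entirely.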

In the end, though, with rogue spiders/bots and curious people (like myself) around, I would recommend the same as above: remove IUSR permissions (if on a Windows server) or modify your .htaccess (if on Apache). That way read access is actually restricted.
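On Apache, the .htaccess approach might look like this (the AuthUserFile path is a placeholder; the .htpasswd file is created separately with the htpasswd utility):

```apache
# .htaccess -- require a valid login before the server will serve
# anything in this directory; this blocks spiders AND casual visitors
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```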

Brian

Ambergreen

5:13 pm on Nov 5, 2003 (gmt 0)

10+ Year Member



When I built shtml pages, I inserted a <!--#include virtual="note.html" --> line on every page.

This is what's called a server side include (SSI). It basically means that the contents of note.html are included in every page. This happens at the server level, so as far as any spider can tell, note.html is just another part of the original file, and it follows all the links from it.

As for the original question, WFN and Mohamed_E were spot on: password protecting the page and robots.txt exclusion will significantly reduce the risk of it being spidered, but the only sure way is never to publish it online in the first place.