Forum Moderators: mack

Message Too Old, No Replies

Getting Indexed by Spider

         

Bud_Bundy

4:00 am on Nov 4, 2003 (gmt 0)

10+ Year Member



If I make a page and leave it in the root directory without linking to it from an index.html or any other page, would it get indexed by SE spiders?

deejay

4:18 am on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Generally, no.

Spiders don't fly. Gotta give 'em a link to crawl.

On the other hand... I'm too old'n'cynical to say a firm 'no'... cos sure as eggs there'll be an exception.

mattglet

1:33 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



well, i'm young: NO.

the only way a SE can browse your site is via links to your files. if there's no link to the file anywhere, it's considered non-existent by the engines. the only other way to find your files would be to scan the directory for every possible file name. that would consume a TON of resources for every engine, which would be totally impractical. i think it would also raise some security concerns.

-Matt

Sinner_G

1:42 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I'm not old, BUT there are a couple of possibilities.

AFAIK it is not certain whether surfing to a page with the Google toolbar installed prompts Googlebot to visit it or not.

If that page has external links to sites that make their log files public, its URL can show up as a referrer there and thus get spidered.

Web Footed Newbie

1:50 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



Okay, so you want an orphan page? Depending on your purpose, that may make sense. BUT, if you don't want it to be crawled, don't publish it on the web at all. Eventually, someone you don't know will place a link to it, and it will probably get crawled.

If you do want it crawled, put some internal and external links to it.

If it's a page just for your friends, it can still eventually get crawled, what I refer to as "engine magic."

What is the purpose of the orphan page? If it is "for your eyes only," put a sign-in function on your site to control access to the page.
WFN:)

Mohamed_E

2:10 pm on Nov 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



WFN has hit the nail on the head.

It is unlikely that a file with no links will get crawled, but if you want to make it unavailable to spiders (or anyone unauthorized) the only safe way is to protect it with a password.

Bud_Bundy

2:10 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



When I built shtml pages, I inserted a <!--#include virtual="note.html" --> line on every page. That "note.html" was a link directory to every other page. Within a couple of months, all of the pages had been indexed by Google's bots. I didn't put any links to "note.html"; I just dropped it in the root along with the other shtml files. (Note: the "<!--#include virtual="note.html" -->" directive is not a URL link.)
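For reference, the setup described here looks roughly like this (the page names are just examples, and the server must have SSI enabled for .shtml files):

```html
<!-- somepage.shtml (hypothetical): the directive below is processed
     on the server, so a spider sees note.html's markup spliced in,
     not a link pointing at note.html -->
<!--#include virtual="note.html" -->

<!-- note.html itself: ordinary anchor links, which spiders CAN follow -->
<a href="page1.shtml">Page 1</a>
<a href="page2.shtml">Page 2</a>
```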

So, I was assuming that if I made a page and put it in the root along with the other files, it would still get indexed by the SE bots/spiders, since the bots would first hit the root directory and go from there to every other page. If I didn't want the file indexed, couldn't I still add a line or two of some sort of command in that file to keep the bots from crawling it?

Reflect

6:29 pm on Nov 4, 2003 (gmt 0)

10+ Year Member



Yes you can add the META...

<meta name="robots" content="noindex,nofollow">

You can also add a Disallow rule to your robots.txt file.
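For instance, here's a robots.txt rule for a hypothetical note.html, plus a quick check with Python's standard urllib.robotparser showing how a compliant crawler would interpret it (the file name and domain are made-up examples):

```python
import urllib.robotparser

# Hypothetical robots.txt asking all bots to skip note.html
rules = """\
User-agent: *
Disallow: /note.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL
print(rp.can_fetch("*", "http://example.com/note.html"))   # False
print(rp.can_fetch("*", "http://example.com/index.html"))  # True
```

Remember this only works on bots that choose to obey robots.txt; rogue spiders ignore it entirely.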

In the end, though, with rogue spiders/bots and curious people (like myself) around, I would recommend the same as above: remove IUSR permissions (if on a Windows server) or modify your .htaccess (if on Apache). That way read access is actually restricted.
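On Apache, the .htaccess approach might look like this (the AuthUserFile path is a placeholder; the .htpasswd file is created separately with the htpasswd utility):

```apache
# .htaccess -- require a valid login before the server will serve
# anything in this directory; this blocks spiders AND casual visitors
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```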

Brian

Ambergreen

5:13 pm on Nov 5, 2003 (gmt 0)

10+ Year Member



When I built shtml pages, I inserted a <!--#include virtual="note.html" --> line on every page.

This is what's called a server side include (SSI). It basically means that the contents of note.html are included in every page. This happens at the server level, so as far as any spider can tell, note.html is just another part of the original file, and it follows all the links from it.

As for the original question, WFN and Mohamed_E were spot on: password protecting the page and robots.txt exclusion will significantly reduce the risk of it being spidered, but the only sure way is never to publish it online in the first place.