How deep of a page will a spider read?

grnidone

10:42 pm on Sep 17, 2001 (gmt 0)

Is the "no more than two directories deep" still a good rule to follow?

I would think so.

However, how deep of a page will a spider index?

Let's say I submit a page which links to:

[whatever.com...]

will that page even be indexed in the database? Will the ranking of the page be hurt because it is so deep?

Any thoughts?

Marshall

11:44 pm on Sep 17, 2001 (gmt 0)

grnidone,

Are you referring to how deep in the directory path or how deep in links? I simply direct the search engines to my site map which pretty much covers everything. Just a suggestion.
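To make the two meanings of "deep" concrete, here is a minimal Python sketch that measures both for the same page. Everything in it is an assumption for illustration: the example.com URLs, the tiny link map, and the function names `directory_depth` and `link_depth` are hypothetical, and no search engine is claimed to compute depth this way.

```python
from collections import deque
from urllib.parse import urlparse


def directory_depth(url):
    """Count directory segments in the URL path (a trailing file name is not counted)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Heuristic (assumption): a last segment containing a dot is a file, e.g. index.shtml
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return len(segments)


def link_depth(links, start):
    """Breadth-first search: fewest clicks needed to reach each page from start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical site: the home page (or a site map) links straight to a buried page.
home = "http://example.com/"
deep_url = "http://example.com/dir1/dir2/dir3/dir4/dir5/index.shtml"
links = {home: [deep_url, "http://example.com/about.html"]}

depths = link_depth(links, home)
print(directory_depth(deep_url))  # 5 directories deep
print(depths[deep_url])           # 1 click from the home page
```

The same page is five directories deep but only one click deep, which is the effect a site map linked from the home page is meant to have.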

Marcia

12:53 am on Sep 18, 2001 (gmt 0)

As far as /dir3/ I know Google will get that far, but don't know about further because that's as far as I've gone.

Slade

4:52 am on Sep 18, 2001 (gmt 0)

I was looking at the PR of links on a site I've just started to clean up (and optimize, or try to anyway).

My speculation is this: It's more related to how far it is from the base link. The site I'm working on was rather badly done. Nearly everything is in the root dir. The main page has toolbar PR5, but each link away from root the PR goes down.

Does that make sense? Does it hold any water?

Actually, now that I'm thinking about it, inter-cross-linking will skew the results, but oh well...
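For anyone who wants to play with that speculation, here is a rough Python sketch of the textbook PageRank iteration as it has been publicly described. It is not Google's production system and does not reproduce toolbar PR. The page names, the four outside links pointing at the home page, the damping value of 0.85, and the handling of pages with no outgoing links are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical site: four outside pages link to home, and every level
# links down to two pages on the next level.
links = {
    "ext1": ["home"], "ext2": ["home"], "ext3": ["home"], "ext4": ["home"],
    "home": ["section1", "section2"],
    "section1": ["deep1", "deep2"],
    "section2": ["deep3", "deep4"],
}

ranks = pagerank(links)
for page in ("home", "section1", "deep1"):
    print(page, round(ranks[page], 4))
```

In this toy graph the score falls with each click away from the home page, because the home page holds the inbound links and each level splits what it passes on. Adding cross-links between the deeper pages changes the numbers, which matches the caveat about inter-cross-linking.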

grnidone

5:07 am on Sep 18, 2001 (gmt 0)

>how deep in the directory path or how deep in links?

Yes. Both.

Let's take my example a little further to exaggerate:

[whatever.com...]

Now, I know a spider probably won't crawl into a site that far, but let's say

1. I submit that deep URL (whatever.com) by hand to the engine (any engine)

2. And I have a page on another site which links to that URL. Let's say the link is at / level at [bla.com...]

Questions:
A) Will a spider follow a direct link from the bla.com URL to whatever.com's buried URL?

B) If so, will the whatever.com buried URL even be put into the search engine's database? (Any search engine.)

=====
Another situation:

You have two buried URLs:

[yadda.com...]

You submit Yadda.com's buried URL to the search engine as well as

[blecko.com...]

which links directly to the first.

Same two questions:

Will the spider crawl from the buried blecko.com to yadda.com?

and if so, will yadda.com be in the database anywhere?

glengara

6:17 am on Sep 18, 2001 (gmt 0)

My very early morning thought is, if you type in a deep url that may not have been spidered, would you entice googlebot to have a closer look?

Woz

6:26 am on Sep 18, 2001 (gmt 0)

I tend to suspect that a deep link from a site that Google respects highly (read PageRank) would probably get noticed quicker than a link from one that gains less respect. But then, that is just a gut feeling.

Wouldn't it be nice if spiders included the referring URL when they visited? That would answer a few questions very neatly.

Onya
Woz

grnidone

10:13 pm on Sep 18, 2001 (gmt 0)

So, it doesn't matter how deep your directories go anymore?

I can't believe this is true.

rcjordan

12:29 am on Sep 19, 2001 (gmt 0)

I just happened upon one of mine in Google that is #2 out of 52,800. It matches your exaggerated example with one notable exception (see bold):
www.whatever.com/dir1/dir2/dir3/dir4/dir5/index.Shtml

(see s-mail)

All in all, I think Woz is pretty close...

<added>
I just checked MSN for the same 3-word phrase. Out of 27,075:
#5 = www.whatever.com/dir1/dir2/dir3/dir4/index.shtml
#8 = www.whatever.com/dir1/dir2/dir3/dir4/dir5/index.shtml

grnidone

1:10 am on Sep 19, 2001 (gmt 0)

I can believe that. Hmmm.

rcjordan

1:50 am on Sep 19, 2001 (gmt 0)

FWIW, the pages I'm quoting as examples are from the directory site discussed in the one size fits all [webmasterworld.com] thread (dated Aug 2000). It was built as part of a hallway, but now this particular hallway page outranks the primary page. Why? I figure the engines consider it to have more content (in the form of link text and descriptions out to my other domains and to other informational sites) than the primary site has. In other cases where the primary site is fully developed, the hallway ranks below the primary page --but not usually too far behind.

Woz

2:00 am on Sep 19, 2001 (gmt 0)

If you backtrack the PageRank process it seems pretty logical and is certainly how I would run things.

Of course this is all Google based, but the whole idea of PR is to rate sites and pages as a measure of worth to give better results in SERPs. However, where do you start?

Supposition on my part, but it seems Google decided to use educational institutions as a starting point giving them authority status and thereby decreeing that any site linked to from a .edu was of more value than those sites not so linked. This approach would have raised these new sites up one notch in the PR ranking. More links = more notches, and so on.

So now we have a vast database of ranked sites and pages in terms of worth and value as applied to their own subject matter. But, as we all know, the web is by its own very nature extremely dynamic. So how do we keep up?

If I worked at Google, and if I was in charge of Quality Assurance, I would make sure that sites with higher PR were spidered more often to ensure there was no loss of SERP quality in the higher results. We have already seen evidence to support that.

Additionally, if I was in charge of Quality Improvement, an entirely different matter, then I would be looking at retrieving the best quality pages I could all the time. But with the large number of pages available on the web, the trick would be to prejudge to a certain extent and find pages that are of high quality before they were spidered. How do you do that?

Kind of like how do you know what the apple tastes like before you eat it? Well, if all the other apples the grower grows taste good, then chances are...

Likewise, if a sufficient number of high PR pages are linking to a particular page not yet in the spidered database, it would seem to suggest the page should be spidered for confirmation of value.

Hence, I would be ordering the "unspidered" database so that pages with a high number of links from high PR pages would be spidered earlier, regardless of position within the target site. Make sense??
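As a rough illustration of that ordering, here is a minimal Python sketch of a crawl queue that ranks not-yet-spidered URLs by the summed PR of the pages linking to them. It is only a guess at the idea described above, not how any engine is known to work. The function name `crawl_order`, the scoring rule (a plain sum), the PR values, and the example URLs are all hypothetical.

```python
import heapq


def crawl_order(unspidered, inbound_links, pagerank):
    """Return unspidered URLs, highest summed PR of linking pages first."""
    heap = []
    for url in unspidered:
        score = sum(pagerank.get(src, 0.0) for src in inbound_links.get(url, []))
        # heapq is a min-heap, so store the negated score; the URL breaks ties
        heapq.heappush(heap, (-score, url))
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical data: PR of pages already in the index
pagerank = {"a.example/": 6.0, "b.example/": 2.0, "c.example/": 1.0}

# Two pages not yet spidered: one buried deep, one at the top level
deep_page = "x.example/dir1/dir2/dir3/dir4/dir5/index.shtml"
top_page = "y.example/index.html"
inbound_links = {
    deep_page: ["a.example/", "b.example/"],
    top_page: ["c.example/"],
}

order = crawl_order([top_page, deep_page], inbound_links, pagerank)
print(order[0])  # the deep page comes first
```

Directory depth never enters the score, so the buried page is scheduled ahead of the top-level one purely because of who links to it.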

Notice that I am only talking value and worth here as they are site specific. Relevance on the other hand is Search specific.

I am also only talking Google here. Each SE works slightly differently of course but I am hoping that they would all have some similar approach to quality assurance and improvement.

Onya
Woz