Getting Google to crawl 150,000 pages

What's the best way to get Google to crawl deep?

         

jamesa

10:37 am on Feb 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm redesigning a site that will contain over 150,000 pages when I'm done. What's the best way to get Google to crawl ALL these pages? With that many pages, no matter how I structure it, it'll need to go deeper than three levels (and/or will have more than 100 links per page).

How can I get Google (or any bot) to crawl that deep? Right now the site is a PR 6 and NONE of the backlinks are from the site's own pages. Due to the former design Google is only aware of a few fluff pages (about us, etc).

The site is built around a search engine right now, which I am improving. For the user's sake I'd like to also build a hierarchical directory à la Yahoo/DMOZ but I'm faced with technical hurdles (the current database leaves a lot to be desired). If I did do that, though, it would certainly be more than 3 levels deep. How deep will Google go, and how can I get her there?

I also thought about having an alphabetical listing just for Google (click "A" to see all the A's, etc) but each letter has anywhere from several hundred to many thousands of items. Plus that wouldn't be very themed.

Is this a good case for cnames? There are about 5 broad categories I could divide it up into. Would that have any impact on any PR that would be generated from internal linking?

Whichever approach I go with will take a lot of work, so I was hoping to get some feedback from you gurus in here first. How would you do it and what are the pros and cons?

gbaker123

10:17 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



I recently modified my site and Google has crawled over 140,000 pages.

What I did was make sure that all the pages were only two levels from the root and had no query string variables. I did this with rewrites in Apache. If you're running Apache, I'll be happy to give you some tips. Otherwise you'll have to look up rewrites for your server software.

Here's a sample url on my site:

[sampleurl.com...]

The URL used to be similar to [sampleurl.com...]

I can sticky mail the url of the site to you if you wish.

Thanks,
George
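The idea of replacing query string URLs with static-looking paths can be sketched roughly as follows. The sample URLs in the post are truncated, so the scheme here (`/item.php?cat=NAME&id=N` mapped to `/NAME/N.html`) is purely an assumption for illustration, as are the function names. On a real server the second mapping would be done by a rewrite rule rather than application code.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL scheme, not the one used on gbaker123's site.
_STATIC_PATTERN = re.compile(r"^/([a-z0-9-]+)/(\d+)\.html$")


def to_static_path(dynamic_url: str) -> str:
    """Turn /item.php?cat=NAME&id=N into the crawler-friendly /NAME/N.html."""
    query = parse_qs(urlsplit(dynamic_url).query)
    category = query["cat"][0]
    item_id = int(query["id"][0])  # int() rejects non-numeric ids
    return f"/{category}/{item_id}.html"


def to_dynamic_path(static_path: str) -> str:
    """Inverse mapping: what a server-side rewrite would resolve internally."""
    match = _STATIC_PATTERN.match(static_path)
    if match is None:
        raise ValueError(f"not a recognised static path: {static_path!r}")
    category, item_id = match.groups()
    return f"/item.php?cat={category}&id={item_id}"


if __name__ == "__main__":
    print(to_static_path("/item.php?cat=widgets&id=42"))
    print(to_dynamic_path("/widgets/42.html"))
```

Links on the site would use the static form everywhere, so a crawler never sees the query string version.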

jamesa

10:52 pm on Feb 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, gbaker123. Actually I am planning to use Apache rewrites on the site. And all of the php pages will have a .html extension. :)

So with 140,000 pages not more than two levels deep, that's got to add up to a lot of links per page, which is one thing I'm concerned about.

I'd love to see the site if you don't mind sticky'ing me. Thanks! :)

globay

11:03 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



I recommend using URL rewriting. Instead of having '/' in your URL, replace it with '.', for example (assuming that no directory name contains '.'), and then use the site structure you would have used with the folder system!

Try to keep the names short. Google does not like long URLs.

Or list all folders in a database ¦ID¦Name¦ and link to them as /file/SubID1.SubID2.SubID3.SubID4

--
globay
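A minimal sketch of the dotted-ID scheme described above, assuming a folder table of ID and Name and URLs of the form /file/ID1.ID2.ID3. The `FOLDERS` contents and the function names are made up for illustration; a real site would read the table from its database.

```python
from typing import Dict, List

# Hypothetical folder table (ID -> Name); stands in for a database lookup.
FOLDERS: Dict[int, str] = {1: "Music", 2: "Jazz", 3: "Vinyl"}

PREFIX = "/file/"


def build_path(folder_ids: List[int]) -> str:
    """Join folder IDs with '.' so the whole hierarchy sits in one path segment."""
    return PREFIX + ".".join(str(folder_id) for folder_id in folder_ids)


def parse_path(path: str) -> List[int]:
    """Recover the folder IDs from a dotted path such as /file/1.2.3."""
    if not path.startswith(PREFIX):
        raise ValueError(f"unexpected path: {path!r}")
    return [int(part) for part in path[len(PREFIX):].split(".")]


def breadcrumb(path: str) -> List[str]:
    """Translate the IDs in a dotted path back into folder names."""
    return [FOLDERS[folder_id] for folder_id in parse_path(path)]


if __name__ == "__main__":
    url = build_path([1, 2, 3])
    print(url, breadcrumb(url))
```

However deep the logical hierarchy goes, the URL stays one directory below the root, and numeric IDs keep it short.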

jamesa

12:37 am on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks globay, I'll definitely be using rewrites. But even with that I still have a dilemma with the linking structure. Right now the site logically breaks up into 5 main sections. The largest of the 5 is about 55,000 pages. So when you hit the first section, you're already one level deep, and from there we need to link to the 55,000 pages. Can't have 55,000 links on one page, so if I kept the number of links per page at 100, let's say, doing the math it would break down like this on average:

home page
......level 2: contains 100 links to level 3 pages
..........level 3: 100 links to level 4 pages
...............level 4: 5 links to the real content
.....................the real content

100 * 100 * 5 = 50000

So the real content would be 5 levels deep in that scenario. If I used cnames for the level one, then it would still be 4 levels. To compress it to two or three levels, there would need to be many more than 100 links per page.

Everything I've read around here says keep it no deeper than 3 levels, try to stay within about 100 links per page max. I don't see how that's possible with such a large site. So should I:

A) keep it at 100 or so links per page and let it go 5 levels deep, or
B) go with 300-500 links per page so it'll be 2 or 3 levels deep, or
C) is there another way?

A is much more user-friendly than B I would think, but I'm hoping there is a C. :) And I have no problem with creating a multi-page site map containing several hundred links per page if that's what works.
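The trade-off between links per page and depth can be checked with a few lines of arithmetic. This is only a sketch: it assumes every index page carries the same number of links and that the section page counts as the first index level, and the function name is made up for illustration.

```python
def index_levels_needed(pages: int, links_per_page: int) -> int:
    """Smallest number of index levels such that links_per_page ** levels >= pages.

    Integer arithmetic is used instead of logarithms to avoid rounding errors.
    """
    if pages < 1 or links_per_page < 2:
        raise ValueError("need at least 1 page and 2 links per page")
    levels = 0
    capacity = 1  # how many content pages the index tree can reach so far
    while capacity < pages:
        capacity *= links_per_page
        levels += 1
    return levels


if __name__ == "__main__":
    section_pages = 55_000  # largest of the five sections
    for fanout in (100, 300, 500):
        levels = index_levels_needed(section_pages, fanout)
        # Counting the home page as level 1, content sits at level (levels + 2).
        print(f"{fanout} links/page: {levels} index levels, "
              f"content at level {levels + 2}")
```

With 100 links per page the 55,000-page section needs three index levels, which puts the content at level 5 as in the breakdown above. Anything from 300 to 500 links per page needs two.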

gbaker123

11:56 am on Feb 24, 2003 (gmt 0)

10+ Year Member



jamesa,

sticky me your url and I might have some tips on how to reduce the links per page.

Hope this helps,
George

rpking

1:04 pm on Feb 24, 2003 (gmt 0)

10+ Year Member



>home page
>......level 2: contains 100 links to level 3 pages
>..........level 3: 100 links to level 4 pages
>...............level 4: 5 links to the real content
>.....................the real content

Why do these pages have to be in sub directories?

Surely they could all be in the root if you so desired...

You can have fewer links per page, and fewer subdirectories.

globay

1:55 pm on Feb 24, 2003 (gmt 0)

10+ Year Member



Just to make sure:

>Everything I've read around here says keep it no deeper than 3 levels,

"Three levels" means no more than 3 subdirectories.

If you have the following site-structure:

/index.html -> (links to)
--- categories [/cat] ->
------ subcat [/cat/subcat] ->
--------- subcat2 [/cat/subcat.name] ->
------------ ... ->
--------------- ... -> [/cat/subcat.name.name2.name3.]...

everything should be ok, am I right?

You will be just 2 directories down from your root directory, and you can have as many subpages as you want, thus you can have fewer than 100 links a page.

Correct me if I am wrong!

--
globay

rpking

1:57 pm on Feb 24, 2003 (gmt 0)

10+ Year Member



...exactly!

>You will be just 2 directories down...

or 1 directory, or none, depending on how you use mod_rewrite

jamesa

11:41 pm on Feb 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok... I was under the impression that the linking structure determines the level, not the path name. So even if all the content pages had pathnames that were in the root (/1.html ... /150000.html) it would still be considered several levels deep if you linked to them from several subpages. In other words, I thought Google limits how many pages it will follow through a site, so if your content is 12 "clicks" deep Google won't find it because it only goes 3 "clicks" deep for most sites, for example.

But the proof is in the pudding - gbaker123 stickied me a PR6 site where the content pages are four click-levels down (the paths/filenames are all /dir/str.str.str/) and Google spidered all 140,000 pages. So I'm going to give that a shot, along with mod_rewrite of course. I'll post back the results (probably a couple of months).

Thanks for all the help so far. :)
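The two notions of "level" in this thread can be told apart with a small sketch: click depth is the shortest chain of links from the home page, while path depth is just the number of directories in the URL. The link graph below is invented for illustration and says nothing about how any crawler actually decides what to follow.

```python
from collections import deque
from typing import Dict, List


def click_depths(links: Dict[str, List[str]], start: str) -> Dict[str, int]:
    """Breadth-first search: fewest clicks needed to reach each page from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def path_depth(url_path: str) -> int:
    """Number of directories in an absolute URL path (assumes a leading '/')."""
    return url_path.count("/") - 1


# Hypothetical site: every page lives in the root, but the content page
# can only be reached through three intermediate index pages.
LINKS = {
    "/": ["/index-a.html"],
    "/index-a.html": ["/index-b.html"],
    "/index-b.html": ["/index-c.html"],
    "/index-c.html": ["/150000.html"],
}

if __name__ == "__main__":
    depths = click_depths(LINKS, "/")
    print("click depth:", depths["/150000.html"])
    print("path depth:", path_depth("/150000.html"))
```

Here /150000.html is four clicks from the home page even though its path sits in the root, which is the distinction between linking structure and path name being discussed.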

jamesa

5:51 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>linking structure determines the level, not the path name

Just for the record a lot of what I've been reading around here tends to confirm this, especially jdMorgan's posts (#2 and #5) in this thread:

[webmasterworld.com ]