How many pages are too many pages?

         

Sulla

4:49 am on Aug 18, 2003 (gmt 0)

10+ Year Member



OK, another question, then I will sit back and lurk and read for a while.

My question is this: how many pages are too many? Do you get penalized if you have too many? I am planning to add extra information to my book price comparison site "removed". The new info will come from an open source encyclopedia, dictionary and thesaurus. If users choose the extra-info form of the search, they will get a page something like this mockup "removed", with the extra info on each side. They will also be able to browse the encyclopedia, dictionary and thesaurus to find articles on authors, related words to search with, etc. Between the encyclopedia with 200,000-plus entries, the dictionary, the thesaurus, the browsable book database (also on the way) and the authors database (also on the way), I should end up with well over 500,000 dynamic pages for crawlers once it is all done. Most pages will include info from the other sources and will be optimized as best I can figure out how to do, including mod_rewrite to make nice friendly URLs.

Now is this a good idea?

I keep forgetting about removing links; the other forums I use don't have that rule.

[edited by: Sulla at 6:02 am (utc) on Aug. 18, 2003]

davewray

5:41 am on Aug 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sulla,

I'd advise that you remove all of your personal links. It's against the TOS.

BlueSky

5:48 am on Aug 18, 2003 (gmt 0)

10+ Year Member



Welcome to WebmasterWorld. I don't see why any SE would penalize a site for too much info as long as the pages are different and aren't spam. I seriously doubt they will index all 500K though. My guess is it'll be a small fraction, maybe 5% or less. Pages that can only be reached by entering terms in a search form won't get indexed at all.

It may not be a good idea to have that many indexed regularly anyway because it will eat up a ton of bandwidth.
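To put a rough number on the bandwidth concern, here is a back-of-the-envelope sketch. The 20 KB average page size and the one-full-crawl-per-month rate are assumptions made up for illustration, not figures from this thread; plug in your own.

```python
def crawl_bandwidth_gb(pages: int, avg_page_kb: float, crawls_per_month: float) -> float:
    """Estimate monthly crawler bandwidth in gigabytes (1 GB = 1024 * 1024 KB)."""
    total_kb = pages * avg_page_kb * crawls_per_month
    return total_kb / (1024 * 1024)


# Hypothetical inputs: 500,000 pages, 20 KB each, fetched once a month.
estimate = crawl_bandwidth_gb(pages=500_000, avg_page_kb=20, crawls_per_month=1)
print(f"{estimate:.1f} GB per month")
```

If several spiders each crawl the whole site, multiply accordingly.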

Sulla

6:05 am on Aug 18, 2003 (gmt 0)

10+ Year Member



Hmm, if they can reach all of the pages because they are browsable, how will the search engines pick where to stop? Do they have a maximum limit on pages, bandwidth, etc.? Yes, I have been thinking about bandwidth and server load and trying to work that out too.

BlueSky

6:55 am on Aug 18, 2003 (gmt 0)

10+ Year Member



Not sure...maybe there's some built-in max limit in their algorithms. If you're a large well-known company, they may very well index that many pages. If not, I suspect they try to take a sampling.

If I were in your position, I would use mod_rewrite to make the URLs friendly. Then, I would build a site map and place it near the top of each page to feed the bots the exact pages I wanted indexed. If they index more, great, but at least they get the ones I want. If you do this, be careful not to put more than 100 links per page.

There are also a lot of bad bots out there that think nothing of sucking up entire sites at a fairly rapid rate. If you do a search here, you'll find a very good list of them. I used it as my starting point and continue to add to it as others show up.
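As a minimal sketch of the 100-links-per-page advice, the list of URLs you want indexed can be split into site map pages of at most 100 links each. The `/book/N` URLs and the function name are hypothetical, chosen only for illustration.

```python
from typing import List

MAX_LINKS_PER_PAGE = 100  # the limit suggested in the post above


def chunk_links(urls: List[str], size: int = MAX_LINKS_PER_PAGE) -> List[List[str]]:
    """Split a list of URLs into consecutive groups of at most `size` links."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# Hypothetical friendly URLs for 250 books.
urls = [f"/book/{n}" for n in range(1, 251)]
pages = chunk_links(urls)
print(len(pages), [len(p) for p in pages])
```

Each inner list would become one site map page; with hundreds of thousands of URLs you would probably want a second level of index pages linking to these.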

Skylo

3:36 pm on Aug 18, 2003 (gmt 0)

10+ Year Member



Hi Sulla, just to put your mind at rest about the size thing. We have 12 sites, with the biggest being 20K pages and the others relatively small. They are on the same theme of the business but all offer their own unique content and target market.
So I don't think Google would penalise you for having too many pages as long as they weren't repeating themselves or each other. I also think that if your site were, for reference's sake, 100K pages big (could that even be possible? :)), then G would just take a personal look at it and give your site the manual hands up.
It could be part of their algo to eliminate sites that are too big, but that wouldn't make sense if the site was actually very good and informative.
My 2cents
Happy Surfing
Skye

Skylo

3:39 pm on Aug 18, 2003 (gmt 0)

10+ Year Member



Just adding to that, if your site is flagged by G for being too big, maybe they would have a separate spider run through your site and shout if it finds duplicate material....that would be cool :-)

BlueSky

4:04 pm on Aug 18, 2003 (gmt 0)

10+ Year Member



I looked up in Google one of the reference sites I use. Their dictionary has 67,700 pages indexed, their thesaurus has 90,500, and the whole site 171,000. Merriam-Webster has 97,000 pages indexed. These dictionaries have around half a million words. So my earlier estimate was low. 70K-200K is probably a better ballpark figure. Wow, that's a lot of pages to get spidered regularly.

Sulla

2:36 pm on Aug 19, 2003 (gmt 0)

10+ Year Member



Hmm, well, this should turn out to be an interesting experience anyway :)

visca

9:03 pm on Aug 19, 2003 (gmt 0)

10+ Year Member




But when you guys say "repeating", does that mean (i) identical pages, or (ii) pages that are "similar", with only slightly different content?

For example, my member directory of thousands and thousands of members somehow got spidered by Google. I was quite surprised because it went through the whole dynamically driven membership list, and then started going through every single member's profile page and indexing it. Every page is the same except for the user's personal information (i.e. age, gender, username, etc.).

My site has a truly pathetic ranking on Google, and I wonder if this is contributing to that. Otherwise it is quite an optimized site. I think something must have happened to the poor site and got it onto Google's "bad list" or something.

BlueSky

10:55 pm on Aug 19, 2003 (gmt 0)

10+ Year Member



I've seen Googlebot do that on other forums too. I've seen him crawl the exact same page over 13,000 times with the only difference being the session ID in the URL. All these dupe pages are in the index.
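To illustrate why those pages count as duplicates, here is a small sketch that strips a session parameter from a URL so the variants collapse to one address. The parameter names in `SESSION_PARAMS` and the example.com URLs are assumptions for illustration; check what your own forum software actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to match your own software.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any known session parameters removed from the query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = strip_session_id("http://example.com/profile.php?id=7&sid=abc123")
b = strip_session_id("http://example.com/profile.php?id=7&sid=zzz999")
print(a == b, a)
```

Run over a crawl log, something like this gives a quick count of how many distinct pages are really behind all those hits.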

If you don't want him wasting time and bandwidth indexing your members, then I recommend you try wildcard patterns in robots.txt (not full regular expressions, just * and $ matching, which Googlebot understands). You can do something like this:

User-agent: Googlebot
Disallow: /*members=*$
Disallow: /*profile=*$

or whatever is used in the URL to distinguish those pages. I had to use something similar to get him and his relatives to behave on my site. The pages I didn't want him crawling are taking a while to drop out of the index. You may see the same thing. But at least he follows robots.txt, wildcard patterns included.
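If you want to check which of your URLs the rules above would catch before relying on them, here is a rough sketch. It assumes the Google-style meaning of the two special characters (`*` matches any run of characters, a trailing `$` anchors the end of the URL) and is only an approximation, not Googlebot's actual matcher. The test paths are made up.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a Disallow pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, patterns) -> bool:
    """True if the path (matched from its start) hits any Disallow pattern."""
    return any(robots_pattern_to_regex(p).match(path) for p in patterns)


rules = ["/*members=*$", "/*profile=*$"]
for path in ["/list.php?members=all", "/view.php?profile=42", "/books/moby-dick"]:
    print(path, is_blocked(path, rules))
```

Note that the sketch ignores Allow lines and other robots.txt details, so treat it as a sanity check only.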

Sulla

4:11 am on Aug 20, 2003 (gmt 0)

10+ Year Member



Well, I should have my browsable index of books done (without extra info) in about a week or so with the new short URLs. Then I can see how Google likes that and how much bandwidth etc. it takes before the extended info comes online.

Sulla

4:16 am on Aug 23, 2003 (gmt 0)

10+ Year Member



Also, I was wondering what kind of extra load using mod_rewrite will put on the server. With a huge number of pages to crawl, I am a little concerned.

Someone also told me they could just do it with MultiViews instead of mod_rewrite. I can't find much info on MultiViews. Would it be better to use than mod_rewrite?