While the search function on his site lets visitors reach any page of content, it requires submitting a form. As a result, search engine spiders aren't finding those pages, which is a shame for many reasons.
So what's the best answer?
A. Create a massive site map with links to the millions of pages. Given the 110K limit on page size, this site map would have to be broken up into tons of tiered site map pages. It would also need to be updated every day, since new content is added daily.
B. Create a Google site map and submit to Google. I have not yet done a Google site map -- not sure how it would work with millions of pages.
C. Create a secondary separate site that includes easily spiderable links to all the pages.
D. Call up Google and say "Listen, Google. I've got millions of pages that are not in your index that people would love to see. They even have adsense on them! Can you hook me up with an engineer there to figure out the best way to get these pages into the index?"
E. Something else
I know there are folks in here that work on massive sites. Any tips would be greatly appreciated.
Thanks!
You can't have an onsite sitemap with millions of pages on it. Google sitemaps will not help - even if you get the orphaned pages in the index, they won't rank well.
The internal pages need to be de-orphaned, no two ways about it.
This is likely to be a mammoth task, so you're going to have to be really careful about quoting for it. If the internal sub-pages do not link to each other in some manner that allows spidering, someone has a lot of editing to do.
I know a fair bit about SEO for sites with, say, 200 pages -- but his site has literally millions of pages of valuable content.
All of the principles you know and understand still apply.
It would also need to be updated every day since new content is added daily.
If you have a DB backend, the easiest way forward is going to be a complete redesign and rebuild.
Again, based on everything you've said, this is going to be an absolutely mammoth task. However, if it genuinely has millions of pages of high-quality content then he may well be sitting on a goldmine, in which case throwing a few thousand at it could be a decent investment.
TJ
Create a massive site map with links to the millions of pages. Given 110K limitations on page size, this site map would have to be broken up into tons of tiered site map pages. It would also need to be updated every day since new content is added daily.
If the site (I'll assume it's coming from a database) is capable of generating that many pages, the site architecture is going to be the foundation that supports it all (which is the case for all sites).
We'll assume multiple primary categories? Links to those categories from primary pages? Index pages for each of the categories with links deeper into that content?
You're going to need to dynamically create multiple site maps based on each category. And then you'll probably have additional site maps within each sub-category.
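For what it's worth, here is a very rough sketch of how those per-category site maps could be generated from the database. Every name in it (the "pages" table, its columns, the chunk size) is invented for illustration, not taken from the actual site:

import sqlite3
from itertools import islice

CHUNK = 2000  # links per site map page, keeps each file well under the size limit

def chunked(rows, size):
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch

def write_category_sitemaps(db_path, category):
    # Pull every page in one category and spread the links across
    # numbered site map pages: sitemap-<category>-1.html, -2.html, ...
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE category = ? ORDER BY url",
        (category,),
    )
    for i, batch in enumerate(chunked(rows, CHUNK), start=1):
        links = "\n".join(
            f'<li><a href="{url}">{title}</a></li>' for url, title in batch
        )
        with open(f"sitemap-{category}-{i}.html", "w") as f:
            f.write("<ul>\n" + links + "\n</ul>\n")
    conn.close()

Run it once per category (and again per sub-category if you go that deep), and regenerate nightly since new content is added daily.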
Rewriting the URIs is going to be mandatory. Get rid of all queries in the URI string. Keep the strings short and sweet: example.com/sub/sub/ or sub.example.com/sub/. If there are millions of pages, the site most likely qualifies for a sub-domain structure. Use it sparingly and use it only for top-level categories.
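To make that concrete, here's a small sketch of the kind of clean, query-free path you'd want to end up with. The category/sub-category/title fields are hypothetical stand-ins for whatever the database actually holds:

import re

def slugify(text):
    # lower-case, replace anything that isn't a letter or digit with a hyphen
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def clean_url(category, subcategory, title):
    # e.g. /widgets/blue-widgets/acme-deluxe-widget/
    return "/{}/{}/{}/".format(slugify(category), slugify(subcategory), slugify(title))

print(clean_url("Widgets", "Blue Widgets", "Acme Deluxe Widget"))
# -> /widgets/blue-widgets/acme-deluxe-widget/

The server-side rewrite rules then map that short path back to the underlying query; the point is that spiders only ever see the short, static-looking version.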
Feed content to the spiders as if they were traveling down a breadcrumb trail. Internal linking is key.
P.S. You'll need to have patience too. Indexing millions of pages and then getting them seated in the index takes some time. They need to become established.
You need to make sure that there is only one URL for each page of content. Read up all that has been written about "duplicate content" over the last two years, and learn from it.
You need proper navigation to every page of the site, probably "breadcrumb" like, with every page linking back to the root, and to the section index for that section (and maybe to some other related section index pages).
This is a very important piece of advice. Breadcrumbs, cross-references, "see also" and so on help establish multiple routes to each page, making them more likely to be crawled and indexed. Structured breadcrumbs also concentrate PR upwards, and some say a higher PR on the main pages improves depth of crawling.
Of course, to go with this you need a downward path. Try to have 50-100 links on the home page to major sections, and a similar number on the subpages. Averaging 64 links a page gets you to 16 million pages 4 clicks or less from the home page.
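The arithmetic behind that is easy to sanity-check: pages reachable within N clicks grow as (links per page)^N.

links_per_page = 64
for clicks in range(1, 5):
    print(clicks, links_per_page ** clicks)
# 1 -> 64, 2 -> 4,096, 3 -> 262,144, 4 -> 16,777,216 (about 16.7 million)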
Google sitemaps with millions of pages are no problem, and work well. Be sure to compress them (to save bandwidth), and use a sitemap index file to refer to the many sitemaps (see the sitemaps help pages for how to set one up). You could even use more than one sitemap index (perhaps one for stable pages and one for more volatile areas).
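As a rough sketch of what that looks like in practice (assuming you can iterate over all page URLs; the file names and domain below are placeholders), each compressed sitemap holds up to 50,000 URLs and a sitemap index file points the engines at all of them:

import gzip
from itertools import islice

MAX_URLS = 50000  # per-sitemap limit in the sitemaps.org protocol

def write_sitemaps(urls, base="http://www.example.com"):
    sitemap_urls = []
    urls = iter(urls)
    part = 0
    while True:
        batch = list(islice(urls, MAX_URLS))
        if not batch:
            break
        part += 1
        name = f"sitemap-{part}.xml.gz"
        body = "".join(f"<url><loc>{u}</loc></url>" for u in batch)
        with gzip.open(name, "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>'
                    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                    + body + "</urlset>")
        sitemap_urls.append(f"{base}/{name}")

    # the index file simply lists the compressed sitemaps
    entries = "".join(f"<sitemap><loc>{loc}</loc></sitemap>" for loc in sitemap_urls)
    with open("sitemap-index.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>'
                '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                + entries + "</sitemapindex>")

You submit the index file once and regenerate the pieces as content changes.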
I have a site with an extremely large number of pages. Having a few million of the key pages in a sitemap has helped it maintain 1 to 6 million pages indexed (and cached) for over a year now.
And it doesn't take that long to build up. It's certainly possible to get the first half million indexed within a month or so, and millions in under three months.
So, as others have said, it really needs a redesign with crawlability in mind. An alternative retrofit would be a structured sitemap (i.e. main page divided some logical way or alphabetically, subpages divided down, etc.), but breadcrumbs and cross-references are much better.
Again, as others have said, watch out for the current aggressive duplicate content filter. I have another site that has information on ten million widgets, with a similar layout and external links for the data on each page. That was at 3 million pages indexed, but dropped to a few tens of thousands in a week when the new filter came in. So make sure the pages are distinctive, especially for their external links.