I'm in the process of developing a site that will probably have millions of pages, thanks to APIs exposing millions of entities. Almost certainly about 99% of the content (companies/services/products/places) is already published somewhere on the web. Will I run into massive duplicate content issues, even if I combine the content from different sources so it won't look like an exact 1:1 copy?
Now, just concerning the business listings and from a search engine's point of view, the site drills down along two paths:
homepage -> state -> city -> category -> business
homepage -> category -> business (tens or hundreds of thousands of businesses, with huge pagination in this case)
and of course there's a search form, too.
I assume I should use noindex,follow for the second hierarchy? Should I list all of a business's categories on its listing page? If so, should those also link back to the category pages? I'm not sure how the link juice flows here.
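For reference, a noindex,follow directive is just a robots meta tag in the head of each page you want kept out of the index while still letting crawlers follow its links — a minimal sketch (which pages get it is exactly the open question here):

```html
<!-- Placed in the <head> of pages in the hierarchy you choose not to index;
     "follow" still lets crawlers pass through to the linked business pages. -->
<meta name="robots" content="noindex,follow">
```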
I have worked with some very large national directories and can offer this - Google will take some time to index it all (depends on how many millions), so a good sitemap.xml index method will help to get the more important pages indexed first.
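A sitemap index is just an XML file pointing at child sitemaps, each of which can hold up to 50,000 URLs — putting your most important sections in their own child sitemaps lets you prioritize them. A minimal sketch (the filenames are hypothetical placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List the sitemaps for the most important sections first -->
  <sitemap>
    <loc>http://www.example.com/sitemap-categories.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-businesses-1.xml</loc>
  </sitemap>
</sitemapindex>
```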
Search engine visitors will come via category based phrases, so you don't want to block that path. I don't see a need for two hierarchies. The eventual business profile page should have only one instance.
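One common way to enforce a single instance is a rel=canonical tag: every alternate URL that can reach the same business profile declares the one URL you want indexed. A minimal sketch (the URL is a hypothetical placeholder):

```html
<!-- In the <head> of every variant of a business profile page,
     e.g. whether reached via the state/city path or the category path: -->
<link rel="canonical" href="http://www.example.com/business/acme-plumbing">
```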
You may have omitted one step - a "businesses" index page, which leads to the individual "business" listings.
Directory users (direct visits) will use the internal search primarily and might browse geographically. You may want to consider how and where to use noindex,follow.