|How to publish 100k+ pages?|
I read most of the posts in this forum and am aware of the possible negative consequences of publishing hundreds of thousands of pages.
This thread is about publishing huge amounts of useful, high-quality content without running into algo-related problems. Manual inspections shouldn't be a problem, so this thread is mostly about the technical aspects.
This thread is mainly about the article pages, because they will be published/crawled/indexed first.
TBPR 5, 4 years old, very clean and authoritative backlink profile
125k pages, 99% indexed
Crawler activity: 10-15k pages/day
No known penalties or algo problems
10k of the 125k pages use RSS feeds (press releases) as a source, and a high percentage of them are low quality (e.g. two sentences and a link to the source)
Most of the other pages are calculators or statistical/historical data, so there is a high percentage of thin (not shallow) content.
I have gained access to a huge collection of high-quality, topic-relevant press releases ("articles") in full text, with very nice metadata, and I have the rights to publish them. I want to publish articles, images, videos, tag pages (popular relevant topics) and profile pages (names found in the text, with the profile page filled with citations and contact data).
1. 200k article pages
2. 50k image pages / 100k images (large size and original size)
3. 10k video pages (third party hosting, not YouTube)
4. 150k profile pages
5. 10k tag pages
We are doing everything we can think of to improve the "quality" of our articles and have already reduced the number of articles by 20% (250k-->200k):
No duplicate or similar(!) articles
Only 1k-10k characters per article (true for 95% of them)
Only 2-3 syllables per word on average (true for 95% of them)
No event-centered articles (nearly useless when too old)
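As a rough sketch, the length and readability thresholds above could be applied like this (the vowel-group syllable count is only a heuristic stand-in for our real tooling, and the threshold defaults are just the numbers from this post):

```python
import re

def avg_syllables_per_word(text):
    """Crude syllable estimate: count vowel groups per word (heuristic)."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w)))
                    for w in words)
    return syllables / len(words)

def passes_quality_filter(article_text, min_chars=1_000, max_chars=10_000,
                          min_syl=2.0, max_syl=3.0):
    """Apply the length and average-syllable thresholds from the post."""
    n = len(article_text)
    if not (min_chars <= n <= max_chars):
        return False
    return min_syl <= avg_syllables_per_word(article_text) <= max_syl
```

Anything outside the window simply gets dropped from the publishing queue.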
First, all existing low-quality articles will be deleted or upgraded.
The articles will be published in alphabetical order of their authors (about 2,000 authors):
This way I can measure the percentage of indexed pages for a representative sample, in contrast to date-based publishing (2012, 2011, ...).
In the beginning 1k articles/day will be published randomly distributed over the day. Mid-term goal is to get 30% of the new articles indexed. If this works well, I will increase the publishing rate up to 2k articles/day. Otherwise I will decrease it accordingly.
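A minimal sketch of that ramp-up logic, assuming the indexed percentage of the sample is measured daily (the step size and floor are my assumptions, not fixed numbers):

```python
def next_publish_rate(current_rate, indexed_pct,
                      target_pct=30.0, max_rate=2000,
                      min_rate=250, step=250):
    """Raise the daily publishing rate while the indexed share of the
    representative sample meets the mid-term goal; back off otherwise."""
    if indexed_pct >= target_pct:
        return min(current_rate + step, max_rate)
    return max(current_rate - step, min_rate)
```

So starting at 1k/day, a healthy indexation rate walks the schedule up toward the 2k/day ceiling, and a poor one walks it back down.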
The publishing process will be supported by intense high quality linkbuilding and a redesign.
The sitemap(s) will only include already published pages.
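For example, the sitemap generator would only ever see URLs that are already live; something along these lines (URL and priority values are illustrative):

```python
from xml.sax.saxutils import escape

def build_sitemap(published_urls, priority="0.1"):
    """Emit a sitemap containing only already-published pages,
    tagged with a deliberately low priority."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in published_urls:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <priority>{priority}</priority>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```

Unpublished articles never appear in the file, so Googlebot is only ever invited to pages that exist.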
Content of the article pages:
Preview images (noindex) linking to image pages (blocked via robots.txt)
Links (in text) to profile pages (blocked via robots.txt), e.g. "John Doe"-->/person/john-doe
Links (in text) to tag pages (blocked via robots.txt), e.g. "Widgets"-->/tag/widgets
URLs in the article will be plain text (not linked)
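For reference, the robots.txt variant would look roughly like this (the paths match the URL patterns above):

```
# robots.txt - blocks crawling; blocked URLs can still appear in the index
User-agent: *
Disallow: /person/
Disallow: /tag/
Disallow: /image/
```

The noindex alternative would instead serve each of those pages with `<meta name="robots" content="noindex, follow">`, which only works if the pages stay crawlable.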
1. Which publishing rate / percentage of indexed pages/articles would you recommend?
2. Should all the links to pages blocked by robots.txt be in the articles from the beginning, or should I add them later?
3. Should I make all the content available from the beginning and use noindex instead of robots.txt?
4. Should I make all the content available from the beginning and just slow down Googlebot via WMT?
5. How would you use the sitemap(s)?
Just a small question: what is the revenue source of the website?
|I gained access to a huge collection of high quality, topic relevant press releases |
Hmm... how can all 200K articles be "high quality"? How can a press release be high quality? I have never seen one.
Tough question. I don't think there is a way to avoid Panda when publishing 200K pages like that.
In principle, penalising/de-ranking a site simply for having so many pages would be a dumb move for a company like Google.
But the circumstances described in your original post make it very complicated. And if the value of your content is questionable, then who knows? Nothing is going to happen in a day, though. If it happens at all, it will happen via Panda, Penguin or another animal update.
I also don't see how press releases can be considered high quality. They generally are:
1) Published in multiple locations (duplicate content)
2) Written in a pseudo-newsy sales-pitch style (off-putting to searchers)
3) Not targeted at topics with search volume, or if they are, they wouldn't be the best result
I can see this kind of content being valuable in a couple niches. Maybe for people searching for specific events in the history of a company. Or maybe you can data mine something out of such a large collection and provide useful summary data and insights.
You can certainly publish large amounts of content on your website without problems if you do it right. If this info is less compelling than your other content, make sure you don't feature it prominently to the search engine spiders. Don't link to it very much. Make sure it doesn't have high page rank compared to the rest of your site.
To answer your questions.
1) I usually batch publish lots of content. In my case it has been things like backlogs of user generated content that all got approved at once and full site translations. The biggest mass publish I did was going from one language to ten overnight (400K new pages across the 10 languages). The biggest UGC increment I've done is about 20% of the existing total. I've never had problems, but I have doubts that press releases would be high enough quality to not have problems with mass publishing.
2) Are the links in the articles useful for visitors? If so, include them in the beginning. Otherwise visitors won't have a good experience and the pages will never rank. Never launch user-crippled pages. Take advantage of the honeymoon period for new content where Google tests user response to new content.
5) I would be careful with a sitemap. It depends on the quality of the content. Since this content seems to me to be low quality, I would either skip the sitemap or include these pages with a very low priority setting.
I might recommend launching a small number of these and doing usability testing. With that small number, see how people react to it. See what the bounce rate is. Try to improve those metrics as much as possible. See how they rank in the search engines. Then when you have optimized the user experience on your side and are comfortable with the rankings, then launch the rest of it.
|What is the revenue source of the website? |
Ads only. AdSense for now, but a few weeks ago we hired a sales guy to sell our ad spaces directly. I guess not showing ads on these pages until everything looks fine and stable is the safest way.
|how can all 200K articles be "high quality" ... how can a press release be high quality? |
We tried to filter out anything abnormal using text length, average syllable count per word and other metrics. All of the press releases are very well categorised by their authors, so we can easily filter out, for example, event-based press releases.
Maybe you would consider for example NASA press releases high quality? Our press releases have a similar degree of quality and we filter out the worst 20% (too short, too long, event-based...).
|If this info is less compelling than your other content, make sure you don't feature it prominently to the search engine spiders. Don't link to it very much. |
|I might recommend launching a small number of these and doing usability testing. |
Personally I like to read the press releases.
There won't be PageRank-related problems at first. We will show related articles below every article (using JS) and will optimise this feature.
My long-term goal is to show related articles on every type of page (without JS).
This all sounds like a project with some potential long-term consequences. Frankly, the content sounds like rubbish and I think you're in the minority of people who like to read press releases.
|The publishing process will be supported by intense high quality linkbuilding and a redesign. |
I admire your optimism, but how do you expect high quality links to press releases or tag pages? I can't think of one quality site which would link to such pages, unless you've got "personal" connections with the editor of a national newspaper.
|We will show related articles below every article (using JS) and will optimise this feature. |
Then you're looking at serious user experience issues. Having a JS script sift through 200,000 press releases to determine whether they are related is probably going to crash many browsers. Even if it only searches articles in the same category, you should still expect problems.
|My long-term goal is to show related articles on every type of page (without JS). |
Then your short-term goal is doomed to failure.
|This all sounds like a project with some potential long-term consequences. |
It is meant to be. There are no risk-free decisions left in Pandaland. While most members of this forum seem to have chosen to focus on downsizing and/or improving usability, my strategy is rapid growth.
|Frankly, the content sounds like rubbish and I think you're in the minority of people who like to read press releases. |
More people would read high-quality press releases if finding them were as convenient as finding newspaper articles. Most scientific/technical/financial articles are just poorly rewritten press releases. Of course there are press releases like "We announce a press conference" or "We have a new manager", but we do a good job of filtering them out.
|I admire your optimism, but how do you expect high quality links to press releases or tag pages? |
This thread should be about indexing, not about questioning the quality of the links or press releases. But I don't mean to be rude:
Authoritative high quality niche
As much interaction with authoritative institutions as possible
Lots of up to date contact data
1. Getting the 2k authors involved: "We published 80% of your press releases and thank you for your work. Please check whether we have overlooked any outdated or factually wrong press releases. Your profile picture will be shown above every article and in Google search if you want: just upload a picture and your G+ account..."
2. Most of the press departments like to show how often their press release has been published and link out. They will be notified about all the old press releases and every single new one.
3. Sending the press departments a widget (with a link) which shows all of their press releases so they don't have to update their websites.
4. Offering very customizable press release widgets: if a website owner wants a newsfeed widget about the Moon (for example), the widget comes with a link to the Moon tag page, and so on. In addition to the 500+ institutions, I can contact 5k smaller non-commercial websites. Hundreds of them are already using my other widgets.
|Having a JS script sift through 200,000 press releases to determine whether they are related is going to probably crash many browsers. |
This is a misunderstanding: our server will handle the workload. The short-term goal is to test different ways of showing related articles (for the users) without totally confusing Google. Call it cloaking if you want to. My long-term goal is making this additional content available to Google. If everything works well, that will happen in 1.5 years or so.
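A server-side lookup along these lines is what I have in mind; this is only a sketch using simple shared-tag scoring (our actual ranking may differ), and it assumes the article itself has already been excluded from the corpus:

```python
def related_articles(article_tags, corpus, top_n=5):
    """Rank other articles by the number of shared tags.

    corpus: dict mapping article id -> set of tags.
    Articles with no tag overlap are dropped entirely.
    """
    tags = set(article_tags)
    scored = [(len(tags & other_tags), aid)
              for aid, other_tags in corpus.items() if tags & other_tags]
    scored.sort(reverse=True)  # highest overlap first
    return [aid for _score, aid in scored[:top_n]]
```

The result list is rendered into the "related articles" box below each article, so the browser never has to touch the full corpus.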