My main site is mostly user-created content. We employ a team of editors to weed out spam, duplicate content, and content that links to bad neighborhoods, but even with that effort, it has been impossible to know exactly which content could get us penalized.
Because of this, I've implemented a set of functions that try to guess which pages Google really doesn't like. The formula relies partly on Google's list of the kinds of sites it doesn't want AdSense displayed on, and partly on a mechanism that measures how much Google traffic each page has received over a given period. Basically, if a page averages fewer than one Google referral a month over a six-month window, the page is removed.
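For what it's worth, here's a minimal sketch of that pruning rule, assuming you already have a per-page count of Google referrals over the last six months (the referrals dict and the threshold constants are stand-ins for whatever your analytics export gives you):

```python
# Minimal sketch of the pruning heuristic described above.
# Assumes `referrals` maps each page URL to its count of Google
# referrals over the last six months, e.g. from an analytics export.

MONTHS = 6
MIN_REFERRALS_PER_MONTH = 1

def pages_to_prune(referrals):
    """Return pages averaging fewer than one Google referral per month."""
    threshold = MONTHS * MIN_REFERRALS_PER_MONTH
    return [url for url, count in referrals.items() if count < threshold]

# Example usage with made-up data:
stats = {"/article/1": 14, "/article/2": 3, "/article/3": 0}
print(pages_to_prune(stats))  # ['/article/2', '/article/3']
```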
We've also looked at Googlebot crawl dates. The vast majority of my pages are crawled every week, but some are crawled at much longer intervals. Pages that haven't been crawled for 60 days get the boot.
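A rough sketch of that crawl-date check, assuming you've already pulled the most recent Googlebot hit per URL out of your server access logs (the last_crawl dict is an assumption about your log-parsing step):

```python
# Rough sketch: flag pages Googlebot hasn't fetched in 60 days.
# Assumes `last_crawl` maps each URL to the datetime of its most
# recent Googlebot request, extracted from your access logs.

from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=60)

def stale_pages(last_crawl, now=None):
    """Return URLs whose last Googlebot visit is more than 60 days old."""
    now = now or datetime.now()
    return [url for url, seen in last_crawl.items() if now - seen > STALE_AFTER]
```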
I realize that this is a different way of looking at SEO, but it has worked very well for us for the past year and a half.
I would like to know what other ways there are of determining if Google likes a page or not. Any suggestions?
If you make good clean original pages, and you have some quality incoming links, then your pages will mostly do OK, and with a bit of work can do very well.
And Google helps you with copious webmaster tools and guidelines, including Google's SEO Starter Guide [googlewebmastercentral.blogspot.com].
Having a working knowledge of the guidelines while striving to build a quality site is probably a better approach than trying to second-guess the algo, which is an almost certain invitation to sick site syndrome: overdoing and over-obsessing SEO while simultaneously making the site unwelcoming to human beings.
Managing a large (50,000+ pages) user-created content site does not allow for finely-tuned control. I can have two pages on the same subject, one receiving an avalanche of Google referrals and the other nil. There are reasons why this occurs, and I would like to explore those reasons and how to detect them.
Using Google's Starter Guide is a given, but expecting 15,000 members, all with their own agendas, to follow best practices is not realistic. Managing this across 250,000 content-heavy pages, one page at a time, is a completely different challenge from managing an ecommerce site.
If a web site contains too many pages that the algo doesn't like, the other pages will suffer from lower general rankings or possibly a site-wide penalty.
I realize that this isn't how SEO is typically viewed, but for certain web sites, this kind of reasoning can mean the difference between ranking well (and earning a small fortune) and being forced out of business due to low rankings.
You have listed two signs that Google may consider certain pages non-essential. On a purely informational or ecommerce site I would probably not automatically remove such pages; I would first inspect the situation manually. But I appreciate that things can be different with user-generated content.
I have been thinking about your question, which I assume is aimed at finding other criteria that you could automate. I can't come up with any, and if you are currently enjoying good traffic from G, then I would probably not add anything to your current actions.
When you took the first step and your rankings returned, how many pages did you remove? Also, even though the timing of your return to ranking well is quite suggestive, I'm wondering: do you think perhaps you located just a few problematic pages that made the difference, and the rest of the pages you removed were not truly part of the problem?
Over the past year and a half we've experimented with other ideas for weeding out the bad pages and in the process removed another 15,000 pages. During this period of time, our traffic from G has doubled.
Just to be clear, we're getting twice as much traffic (compared to pre-penalty times) with 55% less content. This is why this topic is important to me.
It would make more sense to noindex the pages for Googlebot only, and let all the other bots carry on. You can do that with a simple meta tag in the header.
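For reference, that googlebot-specific variant of the robots meta tag looks like this; crawlers other than Googlebot ignore it and can still index the page:

```html
<!-- noindex for Googlebot only; other search engines are unaffected -->
<meta name="googlebot" content="noindex">
```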