|Heavy Drop in Google Rankings After Cleanup of Forum Spam|
First, thank you so much in advance for any assistance you can provide.
I have attempted to pick the most appropriate forum for this assistance request.
We are currently experiencing significantly reduced search engine positioning for most (if not all) of the search terms people use to reach us. We believe there are a few factors at play, but this whole thing started with a spammer-infested forum, so some background history may be beneficial for solving this problem.
Some time ago we delinked our support forum due to inactivity and spammers (in retrospect we should have just deleted it). Unfortunately neither the spammers nor Google forgot about it, and over time it filled up with an assortment of risqué links. We only noticed months ago when we saw our search positioning had been significantly penalized and, upon logging in to Google Webmaster, saw our top keywords were things like ‘sex’ and ‘video’ (delightful). To solve this we removed the forum, ensured all its contents returned 404 errors, and waited. About a month passed with little result and our top search terms remained, so we blocked the whole forum through robots.txt (hoping that would drop its contents from Google). After some time that seemed to work: we started ranking well in search results again and the risqué keywords vanished (although the replacement keywords were a little unusual, at least they were normal words).
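For reference, the robots.txt block was essentially this (the /forum/ path here stands in for our real directory):

```
User-agent: *
Disallow: /forum/
```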
Another month passed with little concern until, again, rather quickly, our search engine results plummeted even deeper than before. Again we visited Google Webmaster, checked every cause we could imagine, and concluded that only one really makes sense. Our website contains about 500 valid documents, yet Google Webmaster (thanks to this forum) reports over 2000 404 errors and nearly 5000 pages blocked by robots.txt (all contents of the removed forum). Nothing else on the website has really changed (I’ve been working in the background on some big projects, so the site’s actual content has remained quite steady for over a year now). We thought it was a little unusual that all those 404 errors were still being retained, and we also noticed that all those pages (returning 404 errors since February) were still in the Google index. It made no sense; we could only imagine that somehow blocking them with robots.txt had caused them to be retained for so long.
That brings us to two days ago. Our next idea was to unblock the forum and 301 redirect its entire contents back to the home page in the hope of purging it and reining it back in. After additional research (a few hours) we decided on a different approach: following Google guidelines, we 404'd the pages again, reblocked them with robots.txt, and used the Google URL Removal Tool to remove the entire directory. Today Google's index entries for those pages have vanished, although all the associated 404/blocked notifications remain in Google Webmaster (perhaps for historical reference?). Our search engine ranking has not recovered. We also noticed something else which is curious: our keywords, as reported in Google Webmaster, are derived in strong majority from manufacturer PDF manuals we host on the website rather than from actual content pages.
We know it takes time for some of these things to sort out, but as the current circumstance has a significant impact on our business we want to make sure we’re doing the smartest and most efficient thing we can. I would very much love to hear feedback and thoughts on this matter, our approach, and any other possible approach from folks in this forum.
We have never lost page rank.
Google has not associated our site with malware.
We originally submitted a reconsideration request and were told we are not being penalized—that whatever is happening right now is algorithmic.
What do you think about our current circumstance, solution, and options?
Why are PDF documents determining 90% of our keywords in Google Webmaster?
If relevant, should we block the PDF directory (robots.txt)?
I am just passing along second hand knowledge, so take it for what it is worth. Here goes...
If I thought that Google SUSPECTED I had some shady content, the last thing I would do is block it via robots.txt. I would want Googlebot to see as much of my site as possible so they would know I wasn't up to no good.
I would, if at all possible, serve a 410 instead of a 404 status for those forum pages. I don't know how to do that at the server level (I only know how to do it at the page level with PHP, which is not fun if you have lots of pages to get rid of).
I think if you block with robots.txt, then googlebot won't see the 404 / 410 status. So I don't know that robots.txt would do you any good.
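That matches how robots.txt works mechanically: a disallowed URL is never requested at all, so the server never gets a chance to answer 404 or 410 for it. A quick sketch with Python's standard-library robots parser (the /forum/ path and example.com are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the old forum directory
robots_txt = """\
User-agent: *
Disallow: /forum/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler never fetches blocked URLs, so it never
# sees whatever status code (404/410) the server would return there.
print(rp.can_fetch("Googlebot", "https://example.com/forum/thread-123"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/products/widget"))   # True
```

So if the goal is for Google to see that the pages are gone, the pages have to stay crawlable until the 404/410 responses have been observed.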
I think there is a way to (temporarily) remove lots of URLs from the google index now in webmaster tools, so that might be an option as well.
But I think the important thing is letting google see that those pages are truly gone.
Wait for others to pitch in here first, though.
Hi Planet 13,
Thank you for your response.
I’m not sure we are still being penalized for the original forum content as we did recover our search engine positioning after we removed all of that content. I’m not certain why we originally lost positioning, but my best theory is that either our main site’s content was diminished because the forum content changed the purpose of our site in Google’s eyes (diminished the legitimate keywords) or that we got into some kind of premature link farm trouble (because that forum was filled with spammers and equated to 10x more documents than the main site), although if that’s the case it never went so far that we lost page rank.
We’re now in a second, deeper ranking dip (see explanation above), and it seems a step detached from the original actual content of the forum. Of course I’m always happy to hear other perspectives.
As for 410 errors, we have considered that. I’d be curious to know how Google’s handling of 410 errors differs from its handling of 404 errors. It seems logical to process them more definitively, as they are deliberate by definition. Flagging the whole forum with 410 would be easy enough as well: we’d just have to display the 410 error page using mod_rewrite and have it send the proper 410 header.
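For what it's worth, a minimal .htaccess sketch of that idea (the /forum/ path is an assumption; mod_rewrite's [G] flag is what forces the 410 Gone response):

```apache
# Hypothetical .htaccess at the site root -- adjust /forum/ to the real path
RewriteEngine On

# The [G] flag answers 410 Gone for anything under the old forum directory
RewriteRule ^forum/ - [G]
```

An `ErrorDocument 410 /gone.html` line could then point at the custom error page, rather than generating the header per page in PHP.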
Google responds much more quickly to the 410 status - removing the URLs with fewer cycles of trust checking and going back to your server for the URLs much less frequently. In short - they support the idea that "Gone" is a much stronger statement than "Not found."
Terrific. Thanks for your thoughts on that, tedster.
Does anyone else have thoughts on this subject? I've received some ideas and feedback, but there is still so much we don't know about the original problem and the course of action we are taking. That may simply be how it goes when dealing with Google, but if someone does have an understanding of these things I would appreciate hearing their thoughts.
What dates were your traffic drops? Do they coincide with the Panda updates?
Good question, and it is one we have considered pretty carefully.
The initial drop was closer to the Panda update, but it came at least a few weeks or even a month or so after Panda hit all the content mills really hard, and it went along with the huge shift in our keywords and all the risqué content in our forum. Blocking that content let us recover our search engine positioning, which seems inconsistent with Panda. And now our second drop has taken place despite almost no changes to the main site.
I read the 'charter' of this forum and recall that we're not to be too specific about our website, which makes it a little difficult to troubleshoot, but perhaps this gives a good picture without breaking any rules. We are a commercial sales company operating in a specific but smaller industry. Our site is broken down into sections for rentals, sales, service and support.

The only aspect of Panda which concerned me to some extent is that many of the models in our sales section use content drawn from promotional material (content which various competitors also use), so there will be some degree of duplicate content in this regard across the internet. I'm attempting to address this with a big upgrade I'm working on (it is difficult when you're dealing with descriptions of some pretty technical equipment), but if this is having any impact, it is probably an opportunity in disguise. I don't think it is Panda, though: relative to the rest of the website this content is not a great portion, many detailed top-ranking pages with unique content are suffering alongside everything else, and the dips/drops don't coincide very well with anything I've read about Panda.
|We also noticed something else which is curious: our keywords, as reported in Google Webmaster, are derived in strong majority from manufacturer PDF manuals we host on the website rather than from actual content pages. |
Probably the bots have encountered these PDF manuals on other sites and considered them redundant and shallow when indexing them on yours.
Blocking PDF manuals via robots.txt won't help, as Google has already seen the content. You may consider removing them.
Oh my panda, you are making life hell for many with your stupid algo.
Although it may not be Panda, this wouldn't be the reason to dismiss it: "many detailed top-ranking pages with unique content are suffering alongside everything else"
That's actually a big part of Panda, as Panda happily destroys good pages simply because bad pages are affecting the entire site. (And of course I'm not making any judgments by using the terms good and bad; those are just the quality designations assigned by Panda, not me.) That's the one piece of Panda that Google has been pretty clear on: some bad pages can bring down the whole site.
So, although Panda may not be involved, don't dismiss it simply because the great pages are suffering as well.
|...many detailed top-ranking pages with unique content are suffering alongside everything else... |
Are those pages that are unique and were previously ranking high linking out TO the lower quality / duplicate content? If so, I would seriously consider removing links to your shallow content until you are able to improve the content of those shallow pages.
However, this is just pure speculation on my part, and I have no proof that it actually makes a difference.
We do know that if you have a good page that links to bad pages on OTHER websites, the good page on your site has an increased chance of suffering in rankings (hence, the development of the nofollow tag).
So by extrapolation (I think that is the right word), then maybe linking to shallow content on your OWN site from good content can hurt the rankings of your high quality content pages.
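For completeness, the markup in question is just the rel attribute on the anchor (the URL here is made up):

```html
<!-- Hypothetical internal link; nofollow asks engines not to pass credit -->
<a href="/sales/old-model.html" rel="nofollow">Old model details</a>
```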
|Probably the bots have encountered these pdf manuals on other sites and considered them to be redundant and shallow when it indexed them on yours. |
I missed indyank's previous post. Yeah, I would have to agree with that, too.
|Are those pages that are unique and were previously ranking high linking out TO the lower quality / duplicate content? |
Well, I've completely removed any pages I deemed as "low quality" based on what I assume Panda is looking for, so if anything was linking to them, they aren't anymore. Not that anything has made a difference yet.
Panda actually seems to like SPAM in many SERPs I see, so you probably removed something it actually prefers, regardless of what the talking heads are saying.