Forum Moderators: Robert Charlton & goodroi
We run a large Alexa top-10k website, receive over 2 million visitors a month, and have over 1.9M pages indexed in Google. The content is similar to Digg/Technorati. Starting around August 2008 we rapidly began receiving more press/links/visitors/etc.
By December Googlebot was indexing the site at 10 pages/sec; we thought this was a good thing. Then we got our first Google penalty out of the blue. We didn't even know it existed before it hit us. We did our research and assumed we had too many duplicate pages, since our site is otherwise pretty straightforward: we don't allow javascript or 'unclean' user-submitted content. So we started NOINDEX'ing just about every page that was a near/exact duplicate. We submitted a reconsideration request and 6 days later we were back.
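(For anyone wanting to do the same: the NOINDEX'ing we did is just the standard robots meta tag; this is an illustrative snippet, not our exact markup.)

```html
<!-- in the <head> of each near/exact duplicate page -->
<meta name='robots' content='noindex'>
```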
Two weeks later, the penalty hit us again, Dec 22. Then two days later it lifted, then two days later it hit again. This has continued up to and including today; it's happened so many times we've started timing it, so we can tell to the exact hour when it'll hit or lift. Exactly 52 hours. Literally, it'll penalize for 52 hours, then exactly 52 hours later it'll lift, and the cycle starts again. What's triggering it, and whether we did anything to fix it, is still a mystery. With 1.9M pages indexed, the effects are pretty extreme when the penalty drops and lifts.
We can see lots of quality and spam sites linking to us; we assume that shouldn't have anything to do with it since it's outside our control. All our outgoing links have always been NOFOLLOW'd to ensure we don't get caught up in link schemes.
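(For clarity, by NOFOLLOW'd outgoing links I mean the standard rel attribute; example.com here is a stand-in for a real destination.)

```html
<a href='http://www.example.com/source-article' rel='nofollow'>Source article</a>
```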
Here's the only thing we still don't know about: the image badge.
Like Technorati, we offer a small badge that people can put on their blog if they wish to track traffic on their site. It doesn't contain keywords, just a link back to our site and a small icon image. Nothing underhanded. Does this constitute a paid link even if we're not paying for it, but instead offering a service for it? We've seen too many other large, well-established sites doing it to assume it was the culprit. But who knows?
Anyways, I'll keep this updated with news on our experiences.
With such a big (and successful) site I'm surprised that you were given a penalty that required a reconsideration request to be lifted, and that the cause was duplicate content. Btw, the content was duplicated within your own site, right? Not copied from other sites?
What's the PR of the site? Did it change after the penalty? (Not that it can be the source of the problems, but it might give some hints on what's wrong.)
I would be surprised if the badge harmed your site anyway. As you say, it's a common practice, especially for well-established websites.
Like Techmeme and Technorati, the content is a sentence-or-two excerpt of an article, with links to the source. We were under the assumption that Google had a way of dealing with same-site duplicates, so we never bothered to filter out articles that were the same. Essentially we didn't know that could harm a site, so we never bothered NOINDEXing the dupe articles. We're still not sure if that's the cause, since the official blogs over at Google seem to maintain that a same-site dupe-content penalty doesn't exist. Which is why we've been looking at the image badge (though another article by Matt Cutts seems to clearly state that widgets+links are legitimate as long as nothing underhanded is being done).
In any case, everything on the site is 'common practice' when it comes to quoting articles, citing sources, excerpting text and following strict DMCA control. Part of our success was that we followed what established sites had been doing but tried to execute it better. Obviously we executed something worse.
About the first penalty, what was it like? Was the site deindexed completely, or did it "only" suffer bad rankings?
And what about now? When you say you're penalized for 52 hours, what do you mean? Less traffic? Half traffic? No traffic?
Did you notice if portions of your site (pages or sections) remain "stable" even during the penalization period?
Of the 1.9M indexed pages, what percentage do you think is duplicate content?
You "noindexed" duplicate pages; did you also "nofollow" them?
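(To clarify the distinction I'm asking about; these are the standard robots meta values.)

```html
<meta name='robots' content='noindex'>           <!-- kept out of the index, links still followed -->
<meta name='robots' content='noindex, nofollow'> <!-- kept out of the index, links not followed -->
```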
Could you check Googlebot's crawl frequency graphs in your Google Webmaster Tools?
Tell us if you have 2 peaks of the spider (in the green and blue graphs) and if they happen, as I think, BEFORE and AFTER the application of the penalty.
Because it could also be a specific spider that, by checking duplicates, determines whether you must be penalized.
The peaks should happen when you enter and exit the penalty.
On Dec 22 it started to yo-yo: 60k/day for 52 hours, then 1,000 for 52 hours. When we searched for specific pages on our site, Google would list the page as the last result. Then when the penalty lifted, that same search would show us in the #1 spot again.
All pages on the site are affected; only when a search includes the actual domain name do we come up as the first result. Even for specific pages on the site.
Misterjinx:
Our Webmaster Tools shows a steady upward line, no spikes really, just an increase in spider traffic starting in August:
Aug 300k
Sept 350k
Oct 400k
Nov 500k
Dec 600k, but there is a drop right around mid-December to about 300k, then back up to 600k what looks like a few days later
Today, again right on schedule, we're out of the penalty. But we know it'll drop back on us in 52 hours. Does this yo-yo ever go away?
the content is a sentence or two excerpt of an article, with links to the source
This doesn't sound like a heck of a lot of content, does it? An add-on question would be: Why would Google or any other search engine want to index this "content" and not simply the source?
Again, sorry for the tone. Just can't think of a better way of asking the questions at the moment.
Jim, our site noindexes pages that are very brief but lets Google index some of the longer quotes. However, all pages display more than just the excerpt, very similar to Technorati/Techmeme in that there is more displayed for our visitors regarding the article than just the excerpt itself. We spoke with developers at Technorati to see when/what/why they noindex cited articles and followed their lead when applying the same indexing rationale on our site.
The timeframe of the yo-yo penalty appears to be tightening. What started as a 52-hour yo-yo two weeks ago has now shrunk to a 17-20 hour yo-yo. We were out of the penalty starting yesterday around 9:30am, with Google referrals returning to normal levels, but had it reapply today around 3:30am; Google referrals are now back down to near 0.
We have many search terms specific to our site that we use to test whether we're in the penalty, and sure enough, we appear as result #1 when we're OK, and on page 9+ when we're penalized. For all pages and searches. Obviously some sort of automated procedure is moving us around; the real question is whether we can do anything about it.
Are you able to compare global search on separate Google TLDs to see if there are variances? They may be different; however, the point is probably the same.
Chief suspects [without qualification] in my mind are:
- Content [not enough, original, or thin]
- Badges [who do the badge URLs point to, on topic, purpose?]
- Internal linking [repetitive terms Y/N]
- External links to your site [maybe a pattern Y/N]
Although I suspect badges, they may just be part of a trigger that brings the other issues into play.
The URLs in the badges simply point to the domain name; no keywords in the link, just an image with a URL, i.e.:
<a href='http://www.example.com'><img src='http://www.example.com/badge.gif'></a>
The internal linking is pretty straightforward; if the subject is Basketball (think tags at Technorati), the link will be
<a href='/tag/basketball'>Basketball</a>
If it's an article,
<a href='/article/this_and_that/4000.html'>This and that</a>
External links are as usual; we get more every day, from everything from spam blogs to the New York Times. Since that's out of our control we assume it can't negatively affect us, but who knows.