|So, for a big site with hundreds of thousands of pages, is there any way to do this? Any third-party tool or service? |
I'm sure a good coder could find a way to make use of one of these:
Of course, to do anything with either you basically have to write a bot and parse the information you get from other sites. That may be beyond many people, but it would probably be very enlightening for just as many, even just trying to detect similar text you know exists on another site. To do it reliably you really have to work out how to extract the main text from the template, and that's definitely a challenge...
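For anyone curious, the similar-text part can be roughed out with word shingles. This is only a minimal sketch: a real bot would also have to fetch pages and strip the template/boilerplate first, and all the names and sample texts below are made up for illustration.

```python
# Minimal sketch of near-duplicate text detection using word shingles.
# A real scraper detector would first fetch pages and strip templates;
# this only compares two plain-text strings.

def shingles(text, n=5):
    """Return the set of n-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=5):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog near the river"
copied = "the quick brown fox jumps over the lazy dog near the river"
unrelated = "completely different words appear in this other sentence right here"

print(similarity(original, copied))     # identical text scores 1.0
print(similarity(original, unrelated))  # no shared shingles scores 0.0
```

A score near 1.0 flags a likely copy; partial overlaps land in between, which is why this kind of fingerprinting tolerates light rewording better than exact string matching.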
Do the following:
1. Remove all pages with a low amount of original content from the Google index and cache (you can do this in your Webmaster Tools account)
2. Block them in robots.txt
3. Resubmit your sitemap
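As a sketch of step 2, you can check robots.txt rules locally with Python's standard-library parser before deploying them. The /tags/ and /print/ paths here are hypothetical thin-content sections, not anything mentioned in this thread.

```python
# Sketch: robots.txt rules for hypothetical thin-content sections,
# verified locally with Python's stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /tags/
Disallow: /print/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# /tags/ pages are blocked; regular article pages are not
print(rp.can_fetch("*", "https://example.com/tags/widgets"))
print(rp.can_fetch("*", "https://example.com/articles/widgets"))
```

Keep in mind robots.txt blocks crawling, not indexing, so already-indexed URLs can linger until the removal request is processed.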
The key is to keep only quality content on your website!
You'll be surprised very soon. Good luck :)
Welcome to the forums, Andreas. Have you recovered lost rankings for your site by doing that?
About your comment "Remove all pages with a low amount of original content from the Google index and cache (you can do this in your Webmaster Tools account)":
Where is the option in Google Webmaster Account to remove pages from Index and cache?
I'm seeing big ranking fluctuations right at this moment depending on which of my computers I search from. Some pages that dropped hard on 2/24 have become several pages worse, some have improved by several pages. These are the first changes of any magnitude that I've seen since 2/24.
I wanted to note a quote from Matt Cutts on RustyBrick's site that says:
"Google's Matt Cutts specifically said Cult Of Mac was not impacted by the Farmer / Panda update because if they were, they would still not be ranking."
To me, that's a pretty significant statement, and implies to me that all sites impacted are still tanked.
|Where is the option in Google Webmaster Account to remove pages from Index and cache? |
Search Google for the Google URL Removal Tool.
Do NOT submit pages that you intend to keep on your site, because they will be removed from the index for at least 90 days. You cannot delete a page, remove it from the index, resubmit the same URL in a sitemap, and expect it to immediately return to ranking; that would take several months. You can, however, quickly remove dead URLs and deleted content using the removal tool.
There is little risk of removing a page by accident, because the URL must return a 404 or 410, or be blocked in robots.txt. If you are deleting pages permanently, the removal tool is a quick way to get pages out of Google's index, often within a day or two. I have never seen it have a rapid impact on rankings, however.
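As a rough pre-flight check before submitting a URL to the removal tool, you can confirm it actually returns one of the statuses the tool requires. This is only a sketch; the helper names are made up, and the fetch part needs network access.

```python
# Sketch: verify a URL returns 404 or 410 before submitting it to the
# removal tool. Fetching requires network access; the eligibility rule
# itself is a pure function.
import urllib.request
import urllib.error

REMOVABLE_STATUSES = (404, 410)

def status_allows_removal(code):
    """True if this HTTP status makes a URL eligible for quick removal."""
    return code in REMOVABLE_STATUSES

def check_url(url):
    """Fetch a URL and report whether the removal tool should accept it."""
    try:
        urllib.request.urlopen(url)
        return False  # page is live (2xx): do NOT submit it for removal
    except urllib.error.HTTPError as e:
        return status_allows_removal(e.code)
```

Running something like `check_url` over a list of deleted URLs before submitting them is a cheap way to avoid accidentally requesting removal of a live page.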
@robert76, actually my US traffic is the same, but international traffic has improved a bit.
|Do NOT submit pages that you intend to keep on your site because they will be removed from the index for at least 90 days. |
That used to be true, but it's no longer the whole story. Google improved the URL Removal process a while ago to allow speedy reinclusion, because after all, stuff happens ;).
|You can reinclude your content at any time during the 90-day period by following these steps: |
1. On the Webmaster Tools Home page, click the site you want.
2. Under Site configuration, click Crawler access.
3. Click the Remove URL tab.
4. Select the Removed content tab, and then click Reinclude next to the content you want to reinclude in the Google index.
Pending requests are usually processed within 3-5 business days.
|Google improved the URL Removal process a while ago to allow speedy reinclusion, because after all, stuff happens |
Oh, that's definitely a change that I wasn't aware of. I learned the hard way years ago, after removing some pages, and back then I had to wait 6 months. When I saw it posted here, I thought better safe than sorry. Glad to see they are more flexible now.
Something is happening. SERPs are different and shifting. Looks like Google is ready to stir the pot again.
In two more days Panda will be a month old, and Google might have some plans for this.
Yep, they're definitely back stirring the pot once again in my area's prime time. They're beginning to take it a little too far now. Many of the top sites aren't doing anything that would trigger a quality check, or they're frozen to begin with. Very similar to AdWords.
|For some reason the one advice that Google keeps giving is: delete or redo the 'bad /thin' pages as they will hurt your entire site and wait for Google to index and later re-calculate. |
I'm beginning to think they're focusing too much on the word delete, delete, delete. It's almost as if, in a weird way, you must delete pages even after creating new high-quality ones, or they too will be down-ranked.
I get a bit angry every time I hear Google suggest that the <em>solution</em> to their algorithm's confusion resulting from the widespread theft of articles <em>I wrote</em> is for me to delete them from my site.
Where did you see or hear that quote, hyperkik?
It seems we are getting better conversion from Google in the past couple of days. Traffic from Google is still about the same as it has been since Feb. 24th, but conversion seems much better now.
Us too, Grimmer. Same traffic, but much more targeted the last few days. A step in the right direction. :-)
tedster, the official word from the Google Blog is,
|This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on. |
John Mu advised a site owner on Webmaster Central,
|One thing that is very important to our users (and algorithms) is high-quality, unique and compelling content. Looking through that site, I have a hard time finding content that is only available on the site itself. If you do have such high-quality, unique and compelling content, I'd recommend separating it from the auto-generated rest of the site, and making sure that the auto-generated part is blocked from crawling and indexing, so that search engines can focus on what makes your site unique and valuable to users world-wide. |
I definitely wouldn't get cocky; your time could be fast approaching. I was way up, but on March 10th (I misquoted the date earlier as the 12th) I saw them tinkering once again. Those re-evaluations of your site's quality could come multiple times in a year, or only once. Then you might get lock-out periods where any change sends you plummeting further.
I found two good articles on Google's search bias. U.S. senators have stepped up in recent days to call for public hearings. Check these links.
Let's see what Google is going to do now.
I don't believe that the <em>solution</em> is in what <em>you write</em> anymore. As many have observed since Panda, they stopped using the word 'content' in their statements and replaced it with 'quality', which is not exactly the same thing.
The John Mu quote is interesting. He seems to suggest implicitly that he thinks auto-generated content cannot be high-quality, unique and compelling. If Deep Blue can beat Kasparov, I think a computer can easily outwrite, say, a talented Bloomberg reporter, provided that the computer has access to a good database of information that it can use to auto-generate high-quality, unique and compelling content. Let's hope Google's war on auto-generated content is a war on ** low-quality ** auto-generated content, not just auto-generated content in general. Personally, I try to balance high-quality, auto-generated content with high-quality, hand-written content. That's the goal anyway. It's easier said than done.
|Google maintains that its criteria for evaluating sites are based strictly on what best serves users. |
And under cross-examination at Google-Gate, Eric (I'm Sane) Schmidt admits that "users" was a typo and they meant to say "us". He also alludes to the fact that "We never intentionally misled anybody who couldn't already be misled".
Whoa, all Google has done is suggest deletion/blocking of 'thin' and user content. I certainly hope that we get a re-re-rank of what's left very shortly. I deleted all my tags and many other pages. Many were useful and were getting some traffic, but I cannot afford to take a chance, and I will worry about the rest as time passes.
Right now we are waiting to find out how much longer Google will wait for the re-calc.
|As many have observed since Panda, they stopped using the word 'content' in their statements and replaced it with 'quality', which is not exactly the same thing. |
Yet they explicitly define low quality as "copied content". I've had a Google employee compliment my site's content. I've been plagiarized by nonprofits and government agencies. I've had some of my articles translated into Spanish, with permission, by a public service agency that wanted to use them as educational materials. I've had my articles reprinted, with permission, in various print publications, for college classes, client newsletters, and the like. So yes, when I got hit by Panda it seemed quite reasonable to infer that "copied content" is the issue, and I'm not really holding my breath that Google's next "fix" will suddenly distinguish my originals from the copies. Is there a Google statement I should look at that would suggest otherwise?
It does sound like you may be a false positive, hyperkik. Did your rankings fall with the earlier "Scraper" update, or only when Panda was released?
Google is severely challenged right now in the area of scraping, spinning, syndicating and quoting. I wish they would just say so - then we could assume they are working on it - which I do think they are. The recently announced "original-source" attribute for Google News websites is a sign to me that they know it's way out of line.
Still, Panda is FAR more complex than a measure of copied content.
|Yet they explicitly define low quality as "copied content". |
All content for my website was written by professionals and was 100% unique ... until a lot of other websites and blogs started to copy it. Is the 40% traffic loss Google's "thank you" for all the work I put into the website and for all the work the writers did? Am I being punished because a bunch of thieves copied my content? Because Blogspot users think they can use anyone's texts and images? More than 50% of the copied content I find is on Blogspot! This Panda update is sickening. Copyright violators are still ranking, stolen content on Blogspot is still ranking, and eHow is still ranking for partly re-written content.
If Google can not detect the difference between original content and copied content on Blogspot, then it says a lot about the "quality" of the Panda update.
One good thing is that a lot of copyright violators use AdSense to monetize, and a lot of them violate the AdSense TOS. For the past couple of weeks I've been reporting AdSense TOS violators, and I've already seen AdSense disappear from two websites with stolen content. My revenge ... Reporting AdSense TOS violators seems to be more effective than filing DMCAs.
|This Panda update is sickening. Copyright violators are still ranking, stolen content on Blogspot is still ranking and eHow is still ranking for partly re-written content. |
It really does get a bit depressing. I don't understand how Google can continue to support eHow's business model. It is obvious from what they pay that people aren't going to spend time on original research. I have put obscure tidbits about my life in my articles - stuff that is 100% unique - and I saw the same tidbits show up on a content farm. It's like they aren't just stealing my articles; they are literally stealing bits and pieces of my life and trying to pass them off as their own.
Content theft is out of control. If they are not stealing your content, they take what you researched and wrote, reword it, and republish it as if they did all the hard work.
The biggest facilitator of it is Blogspot. Google never did a thing about the half dozen DMCA complaints I filed. eHow is just a regurgitator of other people's content. I've found my articles on the biggest article directories. I've had a doctor and other professionals in foreign countries steal my content.
Solving stolen content is Google's real challenge, and one they have never cared about. It's hypocritical for Google to proclaim they are the sheriff of quality without ever coming up with effective solutions for stolen content. They run around trying to put out grass fires while the rest of the town is burning.
In searching for copied content, I was really surprised to find that a well-respected online retailer in my niche had stolen a paragraph or two about a brand of widgets. It's something I wrote a few years ago, but I would have thought they would be above such things, particularly since they already rank extremely well.
I'm finding portions of my content scraped as well as the better part of whole pages. I don't see filing complaints as a solution, as it's very time-consuming. The material copied is years, months or even weeks old. It would seem that the only solution would be to constantly re-write pages.
How would Google rank sites where the text changes every few weeks?
Don't you guys find this all a bit strange?
Google is saying, we're clamping down on low quality content.
But scrapers are replacing original content in the SERPs. (I'm actually seeing more junk in the SERPs for some queries; maybe it's just me.)
eHow is still untouched.
The best part was the official sounding announcement on their blog, claiming biblical improvements in search quality.
And about the "junk floating to the top before they can skim it" argument, I see nothing from Google saying the algo needs time to learn whatever it needs to learn.
The message I got was, "We rolled out a super duper algo and the search quality is now better than before."