| This 133 message thread spans 5 pages: < < 133 ( 1 2 3  5 ) > > || |
|Matt Cutts: Google Algo Change Targets Dupe Content|
|Earlier this week Google launched an algorithmic change that will tend to rank scraper sites or sites with less original content lower. The net effect is that searchers are more likely to see the sites that wrote the original content. An example would be that stackoverflow.com will tend to rank higher than sites that just reuse stackoverflow.com's content. Note that the algorithmic change isn't specific to stackoverflow.com though. |
I know a few people here on HN had mentioned specific queries like [pass json body to spring mvc] or [aws s3 emr pig], and those look better to me now. I know that the people here all have their favorite programming-related query, so I wanted to ask if anyone notices a search where a site like efreedom ranks higher than SO now? Most of the searches I tried looked like they were returning SO at the appropriate times/slots now.
I know there's an existing thread for SERP/algo changes, although this mainly seems to be a 'new' development in that it relates to further tackling dup content scrapers. Mods feel free to merge with an existing thread if needed though.
From Matt Cutts Blog:
I just wanted to give a quick update on one thing I mentioned in my search engine spam post.
My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week.
This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice. The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site’s content.
[edited by: Brett_Tabke at 9:22 pm (utc) on Jan 28, 2011]
[edit reason] Added link for the Cuttlets [/edit]
LOL @ that definition ... I didn't read it and was going with the more common usage: From merriam-webster; 2. a: double 2a b: alter ego
I second "Scrapegate".
At our agency we have a client with two closely related sites, and quite a bit of product crossover. One site is about 12 years old, the other only two. We had the 2-year-old site ranking at #2 (below its brother, the older site) for a while on our primary keyword, but it was kicked down to the middle of page two on Jan. 16.
We're not sure if this is a victim of dupe content, but we'll be working on that and see if it makes a difference.
One man's duplicate content is another man's product breadth
OK, Brett - let me clarify, why I think both algo changes are connected strongly:
read 19.6, starting on page 18: "near duplication"
To identify "near duplication" or content farms if you like, a coder would extract "shingles" from a document and create checksums (to keep it simple). So, to understand Google better we have to come back to the scale they operate on... If some Google employee comes up with the idea of saving new datasets per document (e.g. shingles), he would create a serious amount of data:
e.g.: 12 billion URLs? Let's keep 8 shingles of 64 bits each per URL: that is 768 GB raw, so at least 1 TB of data once you calculate some overhead into it.
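To make that back-of-the-envelope figure concrete, here is a minimal Python sketch, not anything Google has confirmed. It assumes word-level shingles of 4 words, keeps the 8 smallest 64-bit checksums per document, and compares two sketches with a rough overlap ratio. The function names (`shingles`, `sketch`, `overlap`, `storage_bytes`) and the choice of SHA-1 as the checksum are illustrative assumptions.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def sketch(text, k=4, keep=8):
    """Keep the `keep` smallest 64-bit checksums of the document's shingles."""
    hashes = {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }
    return sorted(hashes)[:keep]

def overlap(sketch_a, sketch_b):
    """Rough similarity: shared checksums divided by all checksums seen."""
    a, b = set(sketch_a), set(sketch_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def storage_bytes(urls=12_000_000_000, per_doc=8, bits=64):
    """Raw storage for the sketches alone, ignoring any index overhead."""
    return urls * per_doc * bits // 8
```

With the numbers from the post, `storage_bytes()` comes to 768,000,000,000 bytes (768 GB) before overhead, which is consistent with the "at least 1 TB" estimate. The overlap ratio is only a crude estimate of how similar two documents are; a real near-duplicate detector would need more care.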
And then I would start my duplicate content finder.
That's why I think BOTH algo changes root back to the same start: BigDaddy is all about this meta data. If it is shingles, shards or whatever they call it. To detect poor content, you need the data anyhow and while you are at it, compare the data...
section 3 and 4...
I think Google is feeding heavily on the academic world and that is a good place to look for hints!
just 2 cents!
I do see where you're coming from, pontifex. However, the purest form of "content farm", as I understand the term, isn't near duplication at all. It's just low quality writing created only to rank - and not to give the user anything real at all.
It's so bad that some of the people who write for these "businesses" have been publicly saying that they're embarrassed to have their name featured in a byline and choose not to. They're just told to crank out x number of words about such-and-such a key phrase.
There's a lot of fog being blown around right now, confusing duplicates (scrapers/spinners/syndication) and content farms. Not at all the same thing.
Agreed! Just wanted to point out that the simultaneous timing struck me, and from a technical point of view I think both algo changes are "cousins". But you are right, we should separate them by the (economic) effect they have...
All we are doing is speculating... As the wise old owl says, when it comes to Google, "the world may never know!"
How long until Google does a rollback after evaluating the changes and feedback in this thread? :)
Personally, I think Google was doing a fine job of selecting the true authoritative content before they made the changes to the 2% (how many pages is that, I wonder?) deployed this past week.
Why fix something that isn't broke and reward that garbage? How much Adsense is on that junk?
Has anyone considered that "some" of these "google changes" are meant to give users a new set of pages to see adsense? After all, if the top sites were the same day after day... Most of these algo changes seem to run about a quarter, and, ahem, seem to happen at the beginning or end of a quarter...
Dang it, rumpled my tin foil hat!
I think the response that was written by Matt "slightly over 2% of queries change in some way" speaks volumes.
That shows me they were not confident in making these changes in the first place, as they must have had an idea that the quality of the results would likely suffer; hence only "2%" and not more.
Maybe 2% = $x in potential earnings from sites that are rising with Adsense on it that were previously being filtered properly before the changes?
No that could never happen! ;)
You could look at it the other way also, that they only did 2% to get the feedback on the changes before deploying it across the board, that's better spin for Google. :D
|It's just low quality writing |
The word "quality" keeps coming up in this discussion, and I'd practically bet that if we ask 10 different Google employees what constitutes "quality", we'll get 11 different answers. If Google has it in its collective head that they will be the arbitrator of tastes, we're heading down a slippery slope. I am totally in favor of them ranking original content ahead of other sites which either copy that content outright, or obviously reword it. But having Google setting themselves up as the authority of what does and does not constitute "quality" could end up making a real mess of things. When it comes to corporate/tech driven subjectivity, their best intentions could end up simply muddying the water.
Regarding the name: "Trying to Detect .. Update" or "CEO Update" (changing top management refresh employees minds).
For my affected site, traffic has been recovered 50%.
Reading what MC/Google have said on both subjects, this is what I took away
1) Google is trying to promote original material ABOVE duplicates, BUT will not remove the duplication, merely rank it slightly lower
2) Content farms are low quality, but are over-represented near the top of SERPs. Google intends to stop this
Now, I'm sure that everyone is aware of this, but I'll repeat: Google does not always tell the full and unedited truth at all times. MC not mentioning content farms does not mean they are not affected, just as the Google blog not mentioning duplicate-targeting did not mean that no algo change was imminent.
In order to suppress duplicate content, I would imagine active promotion of original content would be required. I would surmise that this promotion will lead to content farms being even more heavily represented. So, I would expect a simultaneous damping of Content Farms, however so defined.
Finally, I would expect very, very little collateral damage from making original content rank higher (a timestamp should suffice), whereas I would expect MASSIVE collateral damage from mitigating against Farms. Since we are observing damage, I would infer a priori and a posteriori that farms would be and indeed have been targeted in two separate strands of one single update.
PS. I would expect SEVERAL more major (i.e. observable) algo changes to refine the fight against Farms. I will be most interested in the implication for ecoms using centralised product text (be it third party or manufacturer), or whether that "class" of site will be exempted.
|Is the original author getting credit a bit more often? Do we have some positive reports, rather than puzzling over lost traffic? |
I am still seeing little change in credit for original content, but it may be that making too many changes at the same time (new server, 4.0 - xhtml etc, too many new links, etc) caused Google to distrust me.
I think local results have really improved, no doubt an ongoing challenge for a presence in international search, especially maintaining top 1-3 out of often tens of millions of results, so in that sense Google seems like it's improving.
What is surprising though is that I do notice a lot of big, well-established, trusted sites have disappeared from the results or been pushed deep.
|The net effect is that searchers are more likely to see the sites that wrote the original content. |
Good for Google. I would guess it will take a while to see the full effect of those changes, and I wonder if getting bogged down with thousands if not hundreds of thousands of DMCA's prompted this change.
|After all, if the top sites were the same day after day |
That reminds me of something I had forgotten to consider when criticizing Google. Of course they want to keep people clicking ads, but they also need to keep it fresh. Giving searchers the same top ten results day after day, month after month would not be good in competitive niches anyway. I see now why they have to shake it up. If I thought I could have hurt a corp's feelings I'd apologize. :D
|I would expect very, very little collateral damage from making original content rank higher (a timestamp should suffice), whereas I would expect MASSIVE collateral damage from mitigating against Farms. |
I don't know about that, Shaddows. If the timestamp was all it took, then why the years of scrapers ranking higher? I don't think timestamps are dependable enough to use, at least not exclusively. One tweak to the sitewide template and you've lost your "original source" status.
Ok, I was being a little facetious with the timestamp comment. But still, the mechanism will have to be built around discovery dates. It was probably not possible to do this meaningfully even 5 years ago, as G was only 5 years old. While I have no doubt that a good amount of information is MORE than 10 years old, it will be a tiny fraction of the whole of the web- and crucially most scrapers came later.
Once the original content has been identified, the way its rankings are enhanced will have very little collateral damage. Either you are original (boosted), unique (unchanged) or republished (NO "drop", but other sites move past you).
A mechanism to do the boosting could be made to "trump" all else. But then a crap site would beat a rehashed site that was much better as a user experience. Academically pure, but not ACTUALLY the desired result for Google's shareholders. As a result, I reject it.
Next up, you give extra ranking points for being original. Ok, so you are 8th, scraper is 3rd, but with additional usability. All other spots are unique content. Where do you rank now? Swap with scraper? Move to 7th, with scraper 8th? Move to 2nd or 3rd with scraper behind you? I don't think this is a viable solution, especially if there are multiple sets of duplicated data on the same SERP.
My initial thoughts were to therefore just give extra points to ANY original content, and go from there. As (IMHO) SERPs are folded from different parallel criteria for different sites (partitions again, if anyone cares to look back), the pages may or may not change positions, or the SERP might be rebuilt with a different composition of site-types in different positions. Some of this paragraph will be gibberish to some readers. I apologise. Suffice to say things are done in the normal Google way, not mechanistically after the event to fix dupe issues.
However, while writing this, I have had another idea. Give value to the CONTENT, regardless of where it resides. Then apportion value to sites, based on origination. Hmmm, more thought required.
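The three outcomes above can be sketched as follows, purely as an illustration. It assumes each page has already been reduced to a content fingerprint and that a first-discovery date is known per URL. The data layout, the example URLs and the `classify` function are all hypothetical, not a description of how Google does it.

```python
from datetime import date

def classify(pages):
    """Label each URL as 'original', 'unique' or 'republished'.

    pages: iterable of (url, fingerprint, first_seen) tuples, where
    fingerprint is any hashable stand-in for the page's main content.
    """
    pages = list(pages)
    earliest = {}  # fingerprint -> (url, first_seen) of the earliest discovery
    copies = {}    # fingerprint -> number of URLs carrying that content
    for url, fp, seen in pages:
        copies[fp] = copies.get(fp, 0) + 1
        # On equal dates the URL encountered first keeps the credit.
        if fp not in earliest or seen < earliest[fp][1]:
            earliest[fp] = (url, seen)

    labels = {}
    for url, fp, seen in pages:
        if copies[fp] == 1:
            labels[url] = "unique"       # nobody else has this content
        elif earliest[fp][0] == url:
            labels[url] = "original"     # discovered first: candidate for a boost
        else:
            labels[url] = "republished"  # no penalty, others simply move past
    return labels

# Made-up example data.
pages = [
    ("original.example/post", "abc", date(2009, 3, 1)),
    ("scraper.example/copy", "abc", date(2010, 7, 15)),
    ("other.example/page", "xyz", date(2010, 1, 5)),
]
```

The objection raised earlier still applies to a sketch like this: if the fingerprint covers the whole page rather than just the content area, one sitewide template tweak changes it and the "original" label is lost.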
I think you're probing in the right direction, Shaddows - and I think that direction uncovers a dirty little secret. Google may not be as good at page segmentation as is generally thought in the SEO community. Looking at the entire web, and not just the code generated by widely used CMS, there are probably more edge cases than easy page segmentation examples.
This means that assigning credit to the original author is a lot more complex than it seems - even with a discovery tag for time/date and maybe a hash thrown in the mix.
I'm nursing a theory that having a cleanly defined content area in your markup is a critical step toward not being outranked by anyone else's copy. I'm thinking about using the HTML5 <article> element all the time.
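As a rough illustration of why a cleanly marked content area makes segmentation easier, here is a small Python sketch using only the standard library's `html.parser`. It pulls out the text inside `<article>` and ignores navigation, footers and other template markup. Whether Google gives `<article>` any special weight is not known; the `ArticleText` class and `article_text` helper are hypothetical names for this example.

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect text that appears inside <article> elements only."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <article> elements we are currently inside
        self.parts = []  # text fragments found inside an article

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

def article_text(html):
    """Return the whitespace-normalised text of all <article> content."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Text extracted this way would stay the same when the surrounding template changes, so a hash or shingle set computed from it would survive a redesign.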
|Give value to the CONTENT, regardless of where it resides. Then apportion value to sites, based on origination. |
A well stated argument Shaddows, but for me, the missing phrase in your post is: "all things being equal" (which of course they rarely are). So yes, give more value to origination, but (as you articulated) if everything else on that site is awful, then it's not to the viewer's benefit to push it to the top.
We know the G algo has hundreds of facets, so all things being equal, original content deserves to be rewarded and scraped content deserves to be dampened.
And yet... Experience tells us that if a site has scraped word-for-word content from 9 other sites, but does just about everything else really well (good backlink profile, logical navigation, keyword domain, valid code, etc) and the 9 other sites fall noticeably short, then the scraper probably outranks them, and from the user point of view (as opposed to the webmaster POV), there are probably no complaints.
This, it strikes me, is Google's challenge ~ At what point does it do their users a disservice to present scraper/dupe/contentfarm sites above originators? If that point is due to "low quality", then what does that mean? Because we all know, there are a TON of original content sites out there that by any definition are low quality, and I hate to say it, there are a bunch of sites that borrow generously from others that are pretty well done. If the user has a vote, they'll likely go for number 2, and I'd be surprised if Google would trump that, no matter what they say.
Wheel's thought.. "Google apparently thinks the web owes them a living" --> so true, and if we are not careful, Google will 'become the web'... scary but we're heading in that direction more than we might think.
And my concern is not for Google specifically - because as big companies go they are pretty benign. My concern is the same as it was for Microsoft, and this was well expressed by economist Milton Friedman:
"Concentrated power is not rendered harmless by the good intentions of those who create it."
So all humanity needs to play watch dog. At the same time, we all need Google to get this scraped content issue right - because they are the big guys right now and the problem IS creating harm for others right now.
And though there are signs of improvement, I'm assuming that there will be another push on this - and then continued maintenance. I think anyone whose site has enjoyed lots of search traffic because of republishing others' content should be hard at work, re-thinking their business model right now.
|I wonder if getting bogged down with thousands if not hundreds of thousands of DMCA's prompted this change |
I tend to agree with that because quite a few article site owners told me that DMCA filings were going through the roof. That's an understatement compared with what some were saying. In fact Google’s own announcement of a 24 hour response time regarding DMCAs mirrors this. They aren’t responding any quicker though. In other words, most changes are the result of factors you’re forced to act upon, or Google wants something that they currently know they’ll be rejected on.
I'm not doing much analyzing though because Google is going to do what it darn well wants to and little stops them.
I think the choice of result could (or should?) be influenced by trying to pin down user intention and the type of search answer that should be served. If the search is such that one page will do and would answer the searcher's question, then serve the original content page. But if the intention can be classified as more general around the subject area, and the scraper site has a better organised site focused on the whole wider subject with a good taxonomy, additional (connected) info on other pages of the site, good interlinking etc, then I (purely as a user) would like to see the scraper site. From the selfish user point of view, I would not be interested in who wrote the article first - unless the article is such that the site it comes from originally gives additional credibility to the article.
So the challenge is to weed out sites that steal/republish other people's content in order to use it as landing pages for the rest of their site, where they perhaps sell something remotely connected. But on the other hand I would like to see in the results a well designed informational site with a good structure and a focus on a perhaps slightly wider subject than what I was searching for, where I could get additional info without having to go back to Google to search again.
|...that the algorithm change from last week was just related to blocking low quality scraper sites from showing up in Google's search results. |
Again, the algorithm that is live is related to low quality scraper sites and not content farms.
"low quality scraper sites"
Does this mean there are "high quality" scraper sites that are ok? Scraping is scraping. How could there be a differentiation between high and low quality scrapers?
|Does this mean there are "high quality" scraper sites that are ok? Scraping is scraping. How could there be a differentiation between high and low quality scrapers? |
My definition might be wrong, but if a scraper is simply an automated bot which collects content, then I can see how there could be high quality scrapers.
fflick (before being bought by Google) pretty much 'just' scraped content from Twitter (albeit then running it through their semantic analysis engine). Nonetheless it was a fairly automated site and its content was all scraped.
There's also a site I know of (whose name eludes me at the moment) which scrapes Twitter for #haiku and displays it on the site. So again, it's an automated scraper. But it offers value and apparently was starting to get fairly popular when I last checked.
So yeah, I guess that scrapers don't necessarily have to be low quality. Probably 99.9% of them are, though.
|related to low quality scraper sites |
Our understanding of this would be greatly enhanced if someone at Google would explain what they mean by the word "quality". It can mean anything. Suppose they said "undistinguished scraper sites", or "unaccomplished scraper sites", or "scraper sites with a lack of character" ~ would any of us know any more than what we know now, with their use of the adjective "quality"? I doubt it.
It looks to me like this new algo may have tanked and they are doing a rollback. I see the few junk sites that dropped one whole position are now back to the top. FAIL
One useful change Google could do is to make wikipedia a result you must ASK for.
I love wikipedia, but it does not need to be included in every search result.
If I need to wiki something, I should just have to type "Jimi Hendrix wiki" (or whatever).
I see no signs of this update actually working... a scraper site that autoposts from my rss feed is ranking #1 on the exact match of one of my page titles... and my page (the original) is buried.
Second on the FAIL!
|One useful change Google could do is to make wikipedia a result you must ASK for. |
Forget about making people ask, just give it the top left of the results page right below the logo...
ADDED: They could even change the name to Googipedia and move to a .org.
How could they present the result section? Hmmm... 'Wikipedia the Official Result of Googipedia.Org - At Least One Wikipedia Page Guaranteed to be Served for Every Single Search'
Sry. Feeling like a bit of a smart a** today. ;)