
Google no longer knows who the owner of content is

chrisv1963

4:45 pm on Apr 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I guess we can no longer count on Google to protect our property.

I have been searching today with snippets of text from my website to find content thieves. Very disappointing. In many cases Google is ranking the thief higher than the original source.

One example of stolen content (text + image) is really unbelievable. The infringing website is absolutely low quality, with nothing on it but stolen text and images, and advertising all over (the area used for advertising is about twice the area used for text): three 300x250 AdSense blocks and one 300x250 Amazon block.

We have been working like crazy to improve the quality of our websites, because that is what Google told us to do after Panda. What we see, however, is low quality websites running off with our content and getting good rankings for it. This is not the Google I used to know. Something is very wrong.

I'm sorry, but I have lost ALL trust in Google. Is Google simply broken, or do we need to use black hat tactics to rank for our own content?

Shatner

6:58 am on Apr 7, 2011 (gmt 0)

10+ Year Member



For my part, I really don't understand why it's so hard for Google to detect scraper sites.

Fact: Scraper sites rarely scrape from just one source; usually they scrape content from 3 or 4 sources, sometimes more.

To detect them, simply look for sites that seem to have content which exactly duplicates the content on several different sites, and contain no content which isn't a duplicate.

That is a scraper site. Ban.
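
In rough Python, that heuristic might look something like the sketch below. Everything in it is an illustrative assumption (the fingerprinting scheme, the thresholds, and the prebuilt corpus_index structure), not anything Google has documented:

```python
import hashlib

def looks_like_scraper(site_pages, corpus_index,
                       dup_threshold=0.95, min_sources=3):
    """site_pages: list of page texts from one site.
    corpus_index: dict mapping a text fingerprint to the set of
    other sites that same text appears on (assumed prebuilt).
    """
    if not site_pages:
        return False
    duplicated = 0
    source_sites = set()
    for page in site_pages:
        fingerprint = hashlib.md5(page.strip().lower().encode()).hexdigest()
        sites = corpus_index.get(fingerprint, set())
        if sites:                 # this page's text also appears elsewhere
            duplicated += 1
            source_sites |= sites
    # "exactly duplicates the content on several different sites,
    #  and contain[s] no content which isn't a duplicate"
    return (duplicated / len(site_pages) >= dup_threshold
            and len(source_sites) >= min_sources)
```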

My fear is that Google is hoping the +1 button will help them solve this, thinking that if a site is a scraper site, people won't +1 it and will +1 the original site instead. If so, we are doomed, because that is 100% false. The average internet user not only can't tell the difference, they don't care about the difference even if they could tell. They will +1 the crap out of the first site they see.

Shatner

7:00 am on Apr 7, 2011 (gmt 0)

10+ Year Member



>>I'm not convinced that the template is a major algorithm factor

Yeah, I don't think that either. I have a really nice template design, and since Panda my site is outranked by scraper sites stealing my content with AWFUL template designs that are basically just a bunch of garbage, banner ads, and viruses.

Shatner

7:01 am on Apr 7, 2011 (gmt 0)

10+ Year Member



>>>most of the scrapers that outrank me have no authority, and are even totally off topic. I would agree that the NY Times could outrank me by scraping my content, but not a Joe the Plumber site.

There seems to be no middle area. You're either the New York Times, or you're everyone else.

jecasc

7:12 am on Apr 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




Fact: Scraper sites rarely scrape from just one source; usually they scrape content from 3 or 4 sources, sometimes more.

To detect them, simply look for sites that seem to have content which exactly duplicates the content on several different sites, and contain no content which isn't a duplicate.


Sounds like you are describing Google News...

tristanperry

7:26 am on Apr 7, 2011 (gmt 0)

10+ Year Member



My fear is that Google is hoping the +1 button will help them solve this, thinking that if a site is a scraper site, people won't +1 it and will +1 the original site instead. If so, we are doomed, because that is 100% false. The average internet user not only can't tell the difference, they don't care about the difference even if they could tell. They will +1 the crap out of the first site they see.

This. It is worrying to see that Google does seem to be losing the fight against scrapers, which is a scary thought, since detecting scrapers (and not rewarding them over the original sources) shouldn't be a difficult task IMO.

I hope they aren't relying on the +1 button either. Heck, Google's own user tests have shown that the average user can't tell a spam site from a good one, so I'm not sure how the +1 button is meant to help them.

levo

7:27 am on Apr 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm going to try adding a 90-120 minute delay to the feed.
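
For anyone curious what such a delay looks like in practice, here is a minimal sketch in Python. The item structure and the 90-minute cutoff are assumptions for illustration; a real feed generator would need its own wiring:

```python
from datetime import datetime, timedelta, timezone

FEED_DELAY = timedelta(minutes=90)   # the low end of the 90-120 min window

def delayed_feed_items(items, now=None):
    """items: iterable of dicts, each with a 'published' datetime in UTC.
    Returns only the items old enough to have been crawled already."""
    now = now or datetime.now(timezone.utc)
    return [item for item in items
            if now - item["published"] >= FEED_DELAY]
```

The idea is that Googlebot sees the article on the originating page first, and scrapers pulling from the feed only get it after the original is already in the index.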

bramley

1:27 am on Apr 8, 2011 (gmt 0)

10+ Year Member



A thought on 'thin' scraped sites ranking better. If it is true, as I believe, that the uniqueness of a page within a site is part of the algo, a consequence is that a scraper site with only one page on a topic could rank higher than the originating site, which has a number of pages covering the topic to some degree. It's a dilution effect, and it's why noindexing can be helpful.

The issue is that page uniqueness / focus competes somewhat with site focus - good site focus entails more overlap of themes across pages.

This is why careful organisation of what is and is not indexed is the key.

The algo could also be improved. For instance, site focus is maybe not weighted highly enough, though the difficulty is what to do with the eHows and Wikipedias.
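
Purely as an illustration of the dilution hypothesis described above (not a claim about Google's actual algorithm), a toy per-page uniqueness score might look like this; the function names and scoring are hypothetical:

```python
def jaccard(a, b):
    """Word-level overlap between two page texts, 0.0 to 1.0."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def page_uniqueness(page, other_pages_on_site):
    """1.0 means nothing else on the same site resembles this page."""
    if not other_pages_on_site:
        return 1.0   # a one-page scraper site maxes this out by default
    return 1.0 - max(jaccard(page, other) for other in other_pages_on_site)
```

Under this toy score a one-page scraper trivially gets a uniqueness of 1.0, while the originator's overlapping topic pages pull each other's scores down, which is exactly the dilution effect bramley describes.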

TheMadScientist

2:22 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...they USED TO BE BETTER AT THIS.

Yeah, there's something we're missing wrt why...

I'm not convinced that the template is a major algorithm factor...

Hmmm ... I'm not sure. Major? Maybe not. But remember they can't 'see' colors and graphics with an algo, so 'design' specifically may not be the right word. My guess is that there are layout styles people react to more positively than others, and, well, it would take a really long post to describe how I would try to use that, and I really don't feel like writing it right now, but I may post on bounce rate again one of these days... lol.

For my part, I really don't understand why it's so hard for Google to detect scraper sites.

Fact: Scraper sites rarely scrape from just one source; usually they scrape content from 3 or 4 sources, sometimes more.

To detect them, simply look for sites that seem to have content which exactly duplicates the content on several different sites, and contain no content which isn't a duplicate.

I think this may be an occasion where you can't go off half-cocked ... As soon as you do, you increase the spinning and reworking, making it more difficult to detect what's scraped and what's not while you're working on a solution. I think the way to handle the issue is to use what you have now as a tool and learn to detect scraping while it's still out in the open, without making the problem obscure itself by blasting away now and then trying to deal with the problem you just compounded ... I hope that's close enough to English ... Basically, I think some things are better fixed after you have more than the 'easy solution' in place, because if you simply implement the easy (or seemingly easy) solution, you may end up causing a new issue that's even tougher to solve. It may be better to solve both before you implement a solution for only one.

I'm not sure the preceding is why they don't do something like what's being suggested, but my guess is that sometimes we don't see all the issues the same way they do, because our POV is the results output, and theirs is the results input (for lack of better phrasing).

[edited by: TheMadScientist at 2:23 am (utc) on Apr 8, 2011]

Dan01

2:22 am on Apr 8, 2011 (gmt 0)

10+ Year Member



For my part, I really don't understand why it's so hard for Google to detect scraper sites.



Good point.

In the video on the previous page, Matt Cutts was saying that scraper sites can get the Google bots to come around more quickly because they put up more content. All they have to do is scrape it. Meanwhile, we have to collect data, check sources, do research, check spelling and grammar... and finally we make a post. The Googlebot says, "Oh, they don't produce much, let's only hit their site once or twice a day..."

bramley

2:30 am on Apr 8, 2011 (gmt 0)

10+ Year Member



The logic might not be too complex, though MadScientist has a point that it is not as simple as it first appears, at least if you want a robust, future-proof solution that isn't easily worked around. The reason it hasn't been done might be that it is so computationally expensive. Even if you have all the content of every page, in every version, stretching back years, cross-referencing all of it is a truly massive task.
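
For context on the cost question: the textbook way to make near-duplicate detection tractable at web scale is shingling plus MinHash, which reduces each document to a small fixed-size signature that can be compared cheaply. A minimal sketch (illustrative only; nothing here is a claim about Google's internals):

```python
import hashlib

def shingles(text, w=5):
    """All w-word sequences in the text (its 'shingle' set)."""
    words = text.lower().split()
    return {" ".join(words[i:i + w])
            for i in range(max(len(words) - w + 1, 1))}

def minhash_signature(text, num_hashes=64):
    """Fixed-size signature; similar texts share signature positions."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles(text))
            for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity
    between the two underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The point is that a signature is computed once per document and compared in constant time, so the "truly massive task" becomes an index lookup rather than a full pairwise text comparison.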

TheMadScientist

2:39 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, just an addition to my previous post, because in thinking about it, I would really wait...

Does anyone realize how much more difficult you make the issue of solving scraping when people stop posting exact duplicates of the originals because they don't rank any more?

Right now people still do it, so you have exact copies with 'spider times' and other signals on scraped content that's easily detectable. But as soon as that content no longer ranks, you lose those signals, because people will not quit scraping; they will rework better, which severely hinders your ability to take the signals from (A === B === scraped) and put them together to more reliably detect (A is close to (B + other signals) === scraped).

I would wait, because once people stop posting exact copies of originals (since those no longer rank), you stand to lose a whole bunch of possible signals for detecting reworked scrapes. Right now all those other signals can be found easily, because there are exact copies to work with. You almost have to leave it alone until you have all the signals you need to reliably detect close and spun duplication; otherwise you may seriously compound the problem before solving it.

bramley

2:57 am on Apr 8, 2011 (gmt 0)

10+ Year Member



Maybe the whole issue is almost moot now, because in a few days I could probably write some software that could rewrite some text so differently (and with randomised elements) that one could not see it as scraped or detect that it had been computer generated.

Too tired to search for what already exists, but if there's not much now, there probably will be soon enough.

brotherhood of LAN

3:01 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



bramley, there's a running thread that covers that exact topic [webmasterworld.com], 'spinning' of content has been done for a while and there are a number of public and private tools at work to do just that.

TheMadScientist

3:01 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...(and with randomised elements)...

That's exactly why you have to try to detect scraping through content similarities and non-content signals, and to have easy access to the non-content signals you have to let the duplication slide so it still happens ... I believe there is a way to do it, but it's HIGHLY complicated and imo external signals could really help the process.

How many English Language Experts do they have on staff?

Detecting duplicate content isn't really the issue, imo ... It's the spun content you have to be able to detect and that's a different cup-o-tea, or kettle-o-fish, or even possibly a different barrel-o-monkeys...

TheMadScientist

3:17 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually, I'm fairly sure it's possible to do, but saying 'do this' and writing a script to do it reliably are two different containers-o-items.

BTW: I think that's all you're getting out of me on this topic! lol

bramley

3:34 am on Apr 8, 2011 (gmt 0)

10+ Year Member



Ideally an AI approach would not be a matter of rewriting scraped content, but an intelligent crawl of the web, learning the concepts, making intelligent insights and writing an engaging article with a unique twist, just as we do.

Some guy at Bing suggested that search as we know it might soon have had its day, but I don't think he had quite this in mind!

But surely this day is not too far off. If it can beat Wikipedia - and that's not too difficult - it could be sooner than we think that the best web content is machine created ...

TheMadScientist

3:38 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know nuthink!

I gotta quit reading this thread ... It's their job to figure out how to do it, and if they haven't already then I'll leave it to them, but again, it's possible to do imo. (Note: Bing's view relates to the method of querying and displaying answers, but not organizing the answers displayed, so they're really two different things.)

bramley

3:47 am on Apr 8, 2011 (gmt 0)

10+ Year Member



Maybe the future lies with scrapers - not the low-life rip-off sites that dominate now - but intelligent sites that can give you just what you need in the style you like and all generated on the fly.

Facts are not copyrighted. Once the s/w is sufficiently intelligent this will be possible, and Google is the likely front runner (even if they haven't thought of this yet). It's more applicable to info sites than e-commerce or forums, but it might be the end for eHow, Wikipedia and, er, my sites :(

Reno

4:26 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Facts are not copyrighted

And if Google simply sees its mandate as supplying the correct answer to a query, then it will not matter to them where that information was uncovered, nor will it matter how that correct answer ended up on the page that they ranked.

A concept such as "content origination" has a huge moral component to it, and I honestly believe that is of only marginal interest to the current PTB in Mountain View. If the user is satisfied, then from their point of view, they did their job.

..................

Dan01

4:39 am on Apr 8, 2011 (gmt 0)

10+ Year Member



Maybe the whole issue is almost moot now, because in a few days I could probably write some software that could rewrite some text so differently (and with randomised elements) that one could not see it as scraped or detect that it had been computer generated.


I think that software is already out there. My wife and I bought a program a few years ago but never used it.

tedster

5:18 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This thread is so painful. Not too long ago (January 28) we were discussing Google's Scraper Update [webmasterworld.com] which Matt Cutts described like this: "The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site's content."

Seems like something went seriously wrong with that intention.

chrisv1963

5:49 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This thread is so painful. Not too long ago (January 28) we were discussing Google's Scraper Update [webmasterworld.com] which Matt Cutts described like this: "The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site's content."


And ... Cutts and Google stated that they are happy with the results of the Panda update. Did they check the results properly before making such a stupid statement? Something is broken and they didn't even notice it.

rico_suarez

6:35 am on Apr 8, 2011 (gmt 0)

10+ Year Member



Something has changed. Looking at my niche, scraper sites have more or less disappeared from pages 1 and 2. However, just about when Panda rolled out they were on page 1, outranking a ton of high quality sites and sending them to page 2. It lasted for several weeks, and you can see from their Alexa stats that their traffic increased significantly during that time. It was unbelievable: sites a few months old with stolen content ranking high on page 1. I assume Google has tweaked the algo significantly since the first Panda update. Now searches for the most important keywords look more like they did before the update.

Shaddows

8:04 am on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And ... Cutts and Google stated that they are happy with the results of the Panda update. Did they check the results properly before making such a stupid statement? Something is broken and they didn't even notice it.


Panda is a substantial improvement. Scraper was a major fail. They're not the same update.

rlange

2:36 pm on Apr 8, 2011 (gmt 0)

10+ Year Member



falsepositive wrote:
So Google may be sending the signal that we fix our site's quality or suffer the humiliation of being outranked by scrapers...?

Not a chance. Even ignoring that they're a for-profit business, Google's search engine doesn't exist to benefit website owners; it exists to benefit people looking for information. There's absolutely no reason that they'd intentionally make their users' experience worse just to shame a few website owners into improving their sites.

Besides, if this was their intent, I would expect a very clear statement from Google indicating such. Otherwise, it would send the opposite message: If scrapers are ranking higher than you, then that must be what Google likes.

No, this is definitely unintentional.

bramley wrote:
Maybe the future lies with scrapers - not the low-life rip-off sites that dominate now - but intelligent sites that can give you just what you need in the style you like and all generated on the fly.

Heh. Isn't that what search engines are supposed to be?

chrisv1963 wrote:
And ... Cutts and Google stated that they are happy with the results of the Panda update. Did they check the results properly before making such a stupid statement? Something is broken and they didn't even notice it.

Keep in mind that they are a business. Even if they did notice that something was very, very wrong after unleashing Panda on the U.S., they're not going to come out and say, "Yeah... this update that we've been working on for a year is apparently not that great. In fact, it's actually backfiring in more situations than we expected." They have shareholders and public perception to keep in mind.

That said, I doubt the problems that we're all complaining about are significantly widespread. The update was probably an overall improvement. In any complex equation there's bound to be some combinations of inputs that give unexpected results. We may just be those unlucky "edge cases" where the update has had the opposite effect.

--
Ryan

Brett_Tabke

4:08 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Forget the idea that Google can do anything here. Spam reports take months upon months for someone to lay eyes on - if at all. DMCA is nice if you are talking about one site - often we are talking dozens, and you'd have to be a full-time lawyer to send out all those notices.

Solution: only allow the original content to be crawled by Google for the first 48 hours. E.g., cloak it, then release it to the general public after it shows up in Google's index.
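
As a side note on how such a 48-hour window could avoid the user-agent spoofing rlange raises later in the thread: Google does document a way to verify real Googlebot visits via a double DNS lookup (reverse-resolve the IP, check the host ends in googlebot.com or google.com, then forward-resolve that host and confirm it returns the same IP). A minimal sketch, with the server/handler wiring left out:

```python
import socket

def is_real_googlebot(ip):
    """True only if the IP reverse-resolves to a Google crawler host
    and that host forward-resolves back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse DNS
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward DNS confirms
    except (socket.herror, socket.gaierror):
        return False
```

A fake Googlebot sending a spoofed User-Agent header fails the reverse lookup, so the cloaking window would only open for the real crawler.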

indyank

5:11 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brett, the problem is that content copied a year after the original was indexed is ranking above the original. This is the biggest failure of Panda, and Google ignores it in the name of quality.

crobb305

5:15 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brett, the problem is that content copied a year after the original was indexed is ranking above the original. This is the biggest failure of Panda, and Google ignores it in the name of quality.


I had some articles that I wrote 5 years ago that suddenly got scraped over the past few months. They were being republished in article hubs by lazy webmasters trying to get some links. Those hubs were outranking me after Panda. A pathetic failure on Google's part, given that they seem to have zero memory of the original after 5 years.

tedster

5:30 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DMCA is nice if you are talking one site - often we are talking dozens and you'd have to be a full time lawyer to send out all those notices.

I wonder if sending DMCA notices only for those scraper sites that are ranking well is a better approach. Handling all of them can be impossible sometimes, but handling just the one or two doing current SERP damage might help.

rlange

6:11 pm on Apr 8, 2011 (gmt 0)

10+ Year Member



Brett_Tabke wrote:
Solution: only allow the original content to be crawled by Google for the first 48 hours. E.g., cloak it, then release it to the general public after it shows up in Google's index.

It's trivially easy for any client to send a different user agent string. I would be surprised if the "larger" scrapers aren't already using scripts that pretend to be Googlebot.

--
Ryan

[edited by: rlange at 6:13 pm (utc) on Apr 8, 2011]
