Forum Moderators: Robert Charlton & goodroi


Google no longer knows who the owner of content is


chrisv1963

4:45 pm on Apr 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I guess we can no longer count on Google to protect our property.

I have been searching today with snippets of text from my website to find content thieves. Very disappointing. In many cases Google is ranking the thief higher than the original source.

One sample of stolen content (text + image) is really unbelievable. The infringing website is absolutely low quality, with nothing but stolen text and images. Advertising all over (the area used for advertising is about twice the area used for text): three 300x250 AdSense blocks and one 300x250 Amazon block.

We have been working like crazy to improve the quality of our websites, because that is what Google told us to do after Panda. What we see, however, is that low quality websites are running off with our content and getting good rankings for it. This is not the Google I used to know. Something is very wrong.

I'm sorry, but I have lost ALL trust in Google. Is Google simply broken, or do we need to use black hat tactics to rank for our own content?

iamlost

6:12 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



piatkow:

It isn't always made clear to newcomers that DMCA is a remedy under United States law only. I am located in the UK, my site is hosted in the UK and I have a .uk domain. In the unlikely event of receiving a DMCA notice (remotely possible I suppose if one of our reviewers submitted the same reviews elsewhere as well) I might just about be bothered to contact the originator with the words "go forth and multiply".

While non-US sites and hosts may well ignore a DMCA notice, both Google and Bing are US corporations and do comply. So the infringing copy may still exist on those sites, but it will not be shown by the two major English-language SEs.

Of course, DMCA is not a universal panacea but it can be beneficial in itself plus be a helpful foundation should one proceed further.

tedster:

I wonder if sending DMCA only for those scraper sites that are ranking well is a better approach. Handling all of them can be impossible sometimes, but handling just the one or two doing current SERP damage might help.

Absolutely.
I only check query results when there is a significant change in SE query traffic. If a traffic-drop check shows one or more copyright infringers, I simply send the necessary information to my law-type-person and she files the appropriate DMCAs. Each year we file several hundred. Costly? Yes. Worth it? Yes. Whether it is worthwhile for someone else is their business decision.

hyperkik

6:15 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not a chance. Even ignoring that they're a for-profit business, Google's search engine doesn't exist to benefit website owners; it exists to benefit people looking for information.

Google's search engine exists to create profit for Google. There's a strong correlation between what causes Google users to be happy and what maximizes Google's profits, but the effective penalizing of quality content on white hat sites in favor of plagiarists and those content farms that benefited from the algorithm changes serves to highlight that the two are not always the same.

I wonder if sending DMCA only for those scraper sites that are ranking well is a better approach.

Honestly, I got frustrated enough with the algorithm changes that I submitted DMCA reports for the first time. They're time-consuming to prepare, some are arbitrarily rejected, and Google may reserve the right (as it did with Scientology documents) to keep the copyright violator's entry in its index but with a link to your DMCA report at Chilling Effects instead of the stolen content. All in all, whatever their intent, it feels like they're doing their best to deter and to some extent even punish the submission of legitimate DMCA reports. My present feeling is that DMCA reports for Google Search aren't worth the time or aggravation.

(I mean no offense to Google employees who have to work through thousands of DMCA reports. I assume they're acting in good faith.)

[edited by: hyperkik at 6:24 pm (utc) on Apr 8, 2011]

Content_ed

6:15 pm on Apr 8, 2011 (gmt 0)

10+ Year Member



Google generally takes over a month to even respond to DMCA, and if there's room to wiggle, they try to wiggle. I'm just hearing back now on Google DMCAs filed right after Panda.

tedster

6:45 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's trivially easy for any user agent to pretend to be a different user agent. I would be surprised if the "larger" scrapers aren't already using scripts that pretend to be Googlebot.

But it is not so trivial to spoof an IP address and as far as I know, it's impossible to spoof your way through this process. See How To Verify Googlebot [webmasterworld.com]
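
For anyone who wants to script that check, here is a rough sketch of the reverse-then-forward DNS test the linked thread describes (Python; how you pull the IPs out of your logs is up to you):

```python
import socket

def verify_googlebot(ip):
    """Reverse-then-forward DNS check: the claimed Googlebot IP must
    resolve to a googlebot.com/google.com host, and that host must
    resolve back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]   # forward DNS must map back
    except (socket.herror, socket.gaierror):
        return False

# Example: a hit whose user agent claims "Googlebot"
# verify_googlebot("66.249.66.1")  -> True only if both lookups agree
```

A user agent string can be faked in one line; passing both lookups from someone else's IP cannot.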

Reno

7:53 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You guys are expecting too much of google.
........
Solution: Only allow the original content to be crawled by Google for 48hrs
........
It's trivially easy for any user agent to pretend to be a different user agent

I think we can all agree that it will not be easy to adequately deal with this scraper problem, and I don't know if it's realistic to put the entire solution on Google's back, though they must of course play a central role.

The first thing I want them to do is make a public announcement that scraping content is Target #1, and that they won't rest until there is a mechanism in place to handle it.

Then they may need to try multiple approaches until they hit on the right combination. They may have to start in one direction then take a radical turn, and do that more than once. The important thing will be to make it clear to the scrapers that they are in the crosshairs.

So here's one thought:

Via GWT, Google provides storage space for text only. They could charge for it, say $10 annually per 100 MB. It would not be "linkable"; it's strictly so new text content can be uploaded to the site's password-protected GWT account prior to public viewing. The text could be in a Word doc, or a Notepad .txt, or whatever. Our GWT verification code is put at the top of the page and a button is clicked that tells Google to index it. After they do that, they have a copy in that account showing the siteowner uploaded the text on a specific date. So you check your GWT a day later, see the "Page Indexed" icon along with a unique ID, which gives you the green light to upload to the public.

So what does this do? The next time we see a scraper ranking higher than us, we submit to Google that scraper's page URL along with our unique ID as verification proof. They send their bot to check it out so it can be compared to the original; then they notify the scraper that they have X number of hours to take it down; and finally, if necessary, slap that scraper page with a -50 penalty. That will only have to happen a few times before the scrapers get the message.
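
To make that concrete, here is roughly what I picture on the siteowner's side - the registry URL, the request format and the ID field are all made up, strictly to illustrate the flow:

```python
import hashlib
import json
import time
import urllib.request

REGISTRY_URL = "https://gwt.example.com/content-registry"  # hypothetical endpoint

def fingerprint(text):
    # Normalize whitespace and case, then hash, so only a compact
    # fingerprint of the draft needs to be stored.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def register_draft(site_id, text):
    # Submit the fingerprint plus a timestamp; the returned unique ID is
    # the proof that this account held the text before it went public.
    payload = json.dumps({
        "site": site_id,
        "sha256": fingerprint(text),
        "submitted": int(time.time()),
    }).encode("utf-8")
    request = urllib.request.Request(
        REGISTRY_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["unique_id"]  # hypothetical response field
```

Only after the ID comes back would the page go live on the public site.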

I do not pretend that this is the "solution", but as I said, if Google wants to convince the webmaster community that they are serious, then it's time to think outside the box. This is simply one idea ~ no doubt there are hundreds more.

Memo to Google: DO SOMETHING.

..............................

TheMadScientist

8:05 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So what does this do? The next time we see a scraper ranking higher than us, we submit to Google that scraper's page URL along with our unique ID verification proof.

Now, go one step further with your thought ... What does a good scraper do when the exact duplicate no longer ranks? Do they stop scraping, or do they make it more difficult to detect? You have to solve the issue of near duplicates and spun content before you stop the duplication, imo, and your solution only applies to people 'in the know', which leaves out all the people who don't have a WMT account, don't know it exists, don't want to have one, or just plain don't want to go to all the extra trouble of logging in to Google and telling them we've posted a new page, like me.

It's a problem they need to solve on a large-scale basis, not by creating some new work-around which forces people to use their system even more than they already try to make us do. That said, they'll probably do some BS just like you're suggesting rather than fixing the issue, because that's how they are...

Really, way to suggest we all be forced to use Google's system whether we want to or not, even more than we already are, and your solution doesn't do anything for the trillion or so pages already on the web...

Yeah, I'm a bit ranty about this one, sorry.

ADDED: And what about directories? Do I seriously need to find a way to drop 10,000+ pages into a WMT account when I update one, or does my content there not really 'count' as content? You're thinking about a limited, one-page-at-a-time solution, and I can't even imagine the effort it would take to try and get a neat little Unique ID tag for a site like a directory where I would have to generate and install 10,000+ of them to have one on each and every page, or a site like CNN where they publish more than a handful of pages a day and the author likely does not have access to the source code to insert an ID of any type.

TheMadScientist

8:28 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since I'm ranting a bit (lol)...

What would happen to the people who don't know the system exists, or choose to not use the system, or publish so much content it would require a massive investment and full-scale system change (like CNN or NYT) to even think about implementing it?

What about a site like wikipedia where the content can be updated anytime by anyone? How would they get a unique id tag to put on their pages? Or shouldn't their content be protected too?

What if a scraper got the content from a page where someone forgot to use the neat little id system and got an id first? Is that just too bad for the content publisher?

What if they don't know it exists and a scraper does, so they check for the id on the page, and when they don't see it they copy and paste the content into their account? Is that just too bad for the site owner and content originator?

Leosghost

8:31 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What about a site like wikipedia where the content can be updated anytime by anyone? How would they get a unique id tag to put on their pages? Or shouldn't their content be protected too?


I thought they were OK with you using their content anyway (I don't) ... Just posting so you could catch your breath, TMS ;-)

Reno

8:33 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



TMS...

The scraper problem is so serious it may be necessary to fight it on multiple fronts, with scalable solutions. For those of us with small websites, what I suggested may be one approach; for megasites, it may be something else. The point is, Google could benefit if they would only get the dialog going with professional webmasters, to fully define the extent of the problem, and perhaps to brainstorm some ways to combat it. Right now, I'm not totally convinced it's a top priority for them, especially when they use meaningless generic terms like "quality", which they refuse to define.

..................

TheMadScientist

8:33 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, needed it ... lol ... I think you have to give attribution, but isn't that still duplication even if you attribute? And why should the republisher outrank the original? And if Wikipedia didn't use the system and the copier did, then the copier could apply for the ID and outrank Wikipedia by claiming originality, right?

Reno, the point I'm trying to make is that your system totally backfires ... There's no easy solution to this, because as soon as a unique ID is the answer, anyone who doesn't have one is completely hosed: the moment a scraper sees it's not present, they're going to claim the content...

TheMadScientist

8:39 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A new webmaster creates a website and does not know the unique ID system exists ... After publishing content for 6 months and not ranking, they learn about the system and try to get an ID because they're the originator of the content, only to find out someone else has stolen their content and already has the unique ID...

What now? DMCA? The content thief has proof of an initial publication date on their site. Does the new webmaster, who has no clue scraping and content theft are an issue, have proof of when they originally published something lying around? Probably not, imo, so the site they spent 6 months working on, which is original, is considered a duplicate and won't ever rank, because the 100 pages, although originally created by them, are claimed by someone else...

The only type of unique ID system that could possibly work is one internal to Google, and then it has to be based on discovery date, so it's not even going to be correct all the time ... I'm sure if there were an easy solution to this they would have already made it happen, just to shut people up about it.

[edited by: TheMadScientist at 8:46 pm (utc) on Apr 8, 2011]

brotherhood of LAN

8:45 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I made a post here in the private forums about using a unique ID approach, [webmasterworld.com...] ... basically to encapsulate unique content with a signature and ping the signature to search engines, which wouldn't require waiting for Googlebot. Brett's idea was mentioned as an alternative; either way it involves technical know-how.

The nice thing about the ID approach is that it's better than the current free-for-all. Those who value their intellectual property can participate. The major downside is that all content published before the release of such a service is still vulnerable. Content would still have to be evaluated (shingling) for near duplicates.
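
For the shingling side, the basic scoring is simple enough - real systems hash the shingles and use sampled sketches to cope with scale, but this is the idea:

```python
def shingles(text, w=5):
    # Overlapping w-word "shingles" of a document.
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc_a, doc_b, w=5):
    # Jaccard resemblance of the two shingle sets: close to 1.0 means
    # near-duplicate, close to 0.0 means unrelated.
    sa, sb = shingles(doc_a, w), shingles(doc_b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# resemblance(original_page_text, suspect_page_text) -> 0.0 .. 1.0
```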

[edited by: brotherhood_of_LAN at 8:47 pm (utc) on Apr 8, 2011]

TheMadScientist

8:47 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Major downside is that all content up until the release of such a service is still vulnerable.

And any content not claimed by the originator could be claimed by anyone ... Using the system you're talking about, unless Brett found a way to claim on a post-by-post basis, I could claim this thread!

All I would have to do is copy and paste the stinking thread on to my site and apply for the id before Brett did, and the thread would be mine...

brotherhood of LAN

8:50 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Agreed TMS,

Talk is cheap nowadays. Look at the price of hosting, how easy it is to make a site and to use a keyboard to write content.

If there were a service that, for a small fee - say 1 cent a page - could guarantee any (new) content would not be plagiarized, I think anyone who takes their site seriously would be interested.

onepointone

8:58 pm on Apr 8, 2011 (gmt 0)

10+ Year Member



Maybe g somehow losing its ability (or desire?) to keep track of the source of original content factors into the latest algo changes?

i.e. 'ranking websites' vs. 'organizing the world's information'?

Who cares if site B wrote it? Site A has so many more tweets and +1's!

johnhh

9:03 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think "discovery date" is probably the best option -If you create new unique content regularly as we do - at some expense - the site will be crawled on a regular basis.

Or even combine it with an ID system - for new webmasters it would be just another thing to learn, in the same way we learnt about H1 tags etc. Given enough publicity it would work.

Content theft (for that's what it is, theft) is one of our major problems - last week we found a school in New Zealand that copied most of our site, including the page design! Their excuse - oh, it was meant for internal use only!

Content_ed

9:07 pm on Apr 8, 2011 (gmt 0)

10+ Year Member



Pre-Panda was the best option. Never saw scrapers ranking above us until post-Panda :-)

johnhh

9:08 pm on Apr 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oddly, onepointone, I just read this:

[computerworld.com...]

[note to mods - remove link if unacceptable under T&C ]

tedster

12:19 am on Apr 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The link is fine - it points to an authoritative news source. We've got a thread dedicated to discussing that news, in fact: +1 really counts - Larry Page ties bonuses to social media success [webmasterworld.com]

Freedom

1:54 am on Apr 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DMCA rarely works. It's not a real solution. It's a make-believe solution to a problem that is out of control.

GoogleSoft won't solve the original content problem unless they are embarrassed into it publicly. The more this is discussed on forums, the more the press picks it up. The more they pick it up, the more GoogleSoft can't ignore it.

Meanwhile, they are too busy beating up webmasters who follow their guidelines, then deciding to punish them by changing the guidelines without telling them.

Their own hubris is killing them.

ScubaAddict

1:57 pm on Apr 10, 2011 (gmt 0)

10+ Year Member



< moved from another location >

Just like many others, we took a big hit on Feb 24th.

One area that took a hit was a large educational resource database for teachers to give to their students, to encourage creative writing and learn about historical events at the same time. We have always had our copyright listed boldly on our pages (~"republication on the internet is NOT allowed"). But with technology being integrated more and more into the classroom, a lot of assignments are given, and completed, online. As a result, our intellectual property is being copied and pasted into websites and teacher 'edublogs' all over the internet. These are NOT malicious republications - rather a "natural progression" of education with technology and the internet.

Since Panda was released, this was one portion of our site that was 'penalized'. For hundreds of these resources, we are now outranked by the teachers and students who have republished our copyrighted work. NOTE: I have read thousands of pages of theories about Panda, and how it is not a 'penalty', etc. etc. - so please let's not fight over semantics.

So now, since Google no longer knows that I am the originator of the text, and I am losing traffic and income due to decreased rankings, I am forced to stop creating free educational content and instead chase down hundreds of copyright infringers, sending cease-and-desist emails and DMCAs to the students and teachers that I serve. And what for? Using the materials I create to educate (their intended purpose)?

Google (in their infinite wisdom) really can't decipher the originator of duplicate text when one site is a blog with a single link to it from the teacher's website, and the other has hundreds of links to the original content? More importantly: *why could they prior to Feb 24th*?

[edited by: tedster at 3:58 pm (utc) on Apr 10, 2011]

tedster

4:15 pm on Apr 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



More importantly: *Why could they prior to Feb 24th*?

That is the right question for us to look at, I think. First came the Scraper Update [webmasterworld.com] at the end of January. To my eye, it kind of hit in some areas and made new problems in others. But then a month later we got Panda, and the scraper problem - as well as the legitimate syndicator problem - seemed to get a lot worse.

TheMadScientist

4:21 pm on Apr 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I honestly think they applied the filters in the wrong order; if they applied quality after trust and relevance, etc., some of the other things couldn't have worked the way they did before.

If you apply relevance and trust first, then sort by quality, you get 'top quality' at the top, but sacrifice some relevance and trust ... If you apply quality first, then relevance and trust, you get higher relevance and trust, but sacrifice some quality ... It's all an order of operations thing, imo.
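
A toy version of what I mean - every score and cutoff here is invented, and I'm using a hard quality cutoff instead of a sort just to keep it short, but it shows how the order changes who ends up on top:

```python
# (url, relevance, quality) - all numbers invented for illustration
results = [
    ("original.example/page",  0.95, 0.60),  # original source, weak sitewide "quality"
    ("scraper.example/copy",   0.90, 0.80),  # scraped copy on a slicker template
    ("unrelated.example/page", 0.40, 0.90),
]

def relevance_first(rs):
    # Rank purely by relevance: the original (most relevant) stays on top.
    return sorted(rs, key=lambda r: -r[1])

def quality_first(rs, q_min=0.7):
    # Apply the quality cutoff first, then rank the survivors by relevance:
    # the original is filtered out and the scraper inherits the top spot.
    return sorted((r for r in rs if r[2] >= q_min), key=lambda r: -r[1])

print(relevance_first(results)[0][0])   # original.example/page
print(quality_first(results)[0][0])     # scraper.example/copy
```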

I'm sure they'll figure it out, unless they're all too concerned with figuring out how to get people to click the neat little +1 buttons so they get a bigger bonus check, in which case the content originators could be completely hosed ... So much for that year of focus on quality we heard about, huh? Maybe it was a short year, like US New Year's Day to Chinese New Year's Day or something.

ScubaAddict

5:03 pm on Apr 10, 2011 (gmt 0)

10+ Year Member



We are copied/scraped on many sites, and I didn't see a difference from the Scraper Update; my traffic graphs for 2011 are identical to 2010 up to Feb 24th, where we take a very noticeable hit.

Prior to Feb 24, I used to be able to take sentences from our content, place them in Google, and we would be at the top, followed by all of the sites who copied our content. Now, in most instances our site is preceded by all of the user sites at wordpress.com, edublogs, (insert site here) who have copied us. In addition, and more importantly, for the main keyword phrase having to do with this content (the authoritative keyword phrase?), which is a competitive phrase with medium traffic, we used to rank #1, #2 and some days #3; we no longer rank on the first page.
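
If anyone wants to automate that spot check, something along these lines works - this sketch assumes you have set up Google's Custom Search JSON API (the key and engine ID are your own), and it only looks at the first page of results:

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"      # assumed: Custom Search JSON API credentials
ENGINE_ID = "YOUR_ENGINE_ID"

def position_of(sentence, my_domain):
    # Search for the sentence as an exact phrase and return my_domain's
    # position in the first page of results (None if it isn't there).
    query = urllib.parse.quote('"%s"' % sentence)
    url = ("https://www.googleapis.com/customsearch/v1"
           "?key=%s&cx=%s&q=%s" % (API_KEY, ENGINE_ID, query))
    with urllib.request.urlopen(url) as response:
        items = json.load(response).get("items", [])
    for rank, item in enumerate(items, start=1):
        if my_domain in item.get("link", ""):
            return rank
    return None

# for sentence in sentences_pulled_from_my_pages:
#     print(sentence[:50], "->", position_of(sentence, "example.com"))
```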

I wonder if wordpress.com and edublogs (and eHow, etc.) have for some reason been given some sort of authoritative status, which therefore makes them appear as the originator of the content... heck, I don't know. Age can't be much of a factor, as we have been online for well over a decade.

I do have multiple other sites that are more programmatic tools for parents and educators (i.e. stuff that can only have its "concept" copied, not actual words and text). These sites suffered no hit from Panda. This only reinforces my thought that copied textual content plays some role in Panda.

indyank

5:08 pm on Apr 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They figured out that only by losing this memory could they bring some "good" sites like eHow to the top, and they are happy to lose the memory of the content originator in their "true" fight against spam.

synthese

12:43 am on Apr 11, 2011 (gmt 0)

10+ Year Member



@chrisv1963 I feel your pain. After Panda, if I do a search outside the US, our article always comes up first when searching for the title. Inside the US, it is often down on page 2 or 3, buried below some very low quality sites.

Before Panda it was fine. So Google have been pretty good at spotting the original article, but something in Panda has got it the wrong way round.

It's not about better templates or nicer design.

One difference is my site is in Google News. Articles still rank well in Google News (US and outside) -- yet rank below low-brow spam sites on the main search engine. Go figure?

DMCAs etc are futile. I've spent days and days on it, and barely got anywhere. I've even asked authorized syndicators to stop - but nothing makes any difference.

luke175

12:45 am on Apr 11, 2011 (gmt 0)

10+ Year Member



It boggles my mind that two important metrics aren't considered by Google.

1. Publication Date: How on earth can a site whose domain was registered after my content was posted outrank my original content? This is even more flagrant than just the date the content was first published.

2. Backlinks: Ya, those things that are supposed to be so important to Google. A site can scrape my RSS feed with a full backlink back to my site yet still outrank me for that content? How does that make sense at a basic level?

ScubaAddict

1:04 am on Apr 11, 2011 (gmt 0)

10+ Year Member



luke175 - where would the original publication date come from? If this is a databased date of publication, it would be easily forged. If it is a server timestamp, what happens when you change the file, or change its name? What if you have a hard drive failure and had to reupload an entire site?

I can't see how they could get a reliable 'publish' date from anything.

The only thing that could be of use (that I can see) would be an index date as recorded in Google's database, even though that still has its faults.
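
Just to underline how weak a server timestamp is as proof - anyone, including a scraper, can backdate a file before serving it, so the Last-Modified date means nothing (hypothetical filename, purely illustrative):

```python
import os
import time

# Backdate the copied page's file to 2005; the web server will then
# report that date in its Last-Modified header as if it were genuine.
backdated = time.mktime((2005, 1, 1, 0, 0, 0, 0, 0, -1))
os.utime("stolen-article.html", (backdated, backdated))

print(time.ctime(os.path.getmtime("stolen-article.html")))  # Jan  1 ... 2005
```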

Content_ed

1:05 am on Apr 11, 2011 (gmt 0)

10+ Year Member



Pre-Panda, when Google trusted themselves to determine relevancy through linking and PageRank, the originator's site was almost always the winner for content. It's natural. The only exceptions would come if your content was ripped off by a higher-authority site, which almost never happens, and could always be dealt with when it did.

And when it came to legitimate quotes and excerpts, legitimate sites always gave real links, without NOFOLLOW, which made it easy for Google to keep things straight. Is it any wonder that the big winners of Panda, such as eHow, Yahoo Answers, etc., use NOFOLLOW in all its varieties?

I've got no opinions about quality in big retail sites, community sites, etc, but for original content sites, Panda got it exactly wrong.

falsepositive

1:43 am on Apr 11, 2011 (gmt 0)

10+ Year Member



This whole thread truly makes me wonder where Google's head is. It makes me think more and more that they just don't care that much about original source. Firstly, they keep saying at the Plex that they are pretty good at identifying original source and who does the copying. Then later on, I see people claiming that they've lost the ability to figure things out. Which is it? I'd certainly like to know because in a way, knowing would allow me to focus on certain actions vs others. If it's a duplication issue, then I would resort to spending some time/energy to fight duplicators. If copies ranking ahead is just a side effect of what's going on, then I would focus on making my site better so that it's "strong" enough to fight off the scrapers.

Of course SEOers will suggest trying both, but I don't have the time (I have hundreds of scrapers copying thousands of pages). The sad story is how much time/effort/energy we've spent and wasted on this bull - time and energy we could have spent providing our users with our services/products/ideas instead of being sapped by this nonsense.