| This 66 message thread spans 3 pages: < < 66 ( 1  3 ) > > || |
|Scraped content ranking higher than the original source|
Since last month my traffic has been dropping tremendously, and after investigating I found out that another website (in China) is manually copying the content and pictures from my website and pasting them onto their site.
For the pages that are duplicated (around half my site), Google ranks their pages higher than mine for keywords. In some cases, Google only shows their site and drops ours completely.
All our pictures are watermarked with our domain, so how do I report this to Google? The drop in traffic is hurting us real bad, and all our photos are taken in house, so we have to pay our photographers etc...
Really need help!
|brotherhood of LAN|
It definitely is a tight situation, short of Google having a copy of the Internet at any one time, and spidering everything as soon as it's online.... and the storage/mechanism to filter out appropriate duplicate information... the problem remains.
Perhaps it's time for a pro-active search engine to have content 'receivers' rather than emphasis on fetchers.
The solution should be simple, something I call "Crawl-Delayed Publishing"
Publish content via a sitemap ping only to Google first, do not link it into your site or add to RSS feeds yet. Wait for Googlebot to crawl the new pages before actually linking the page to your site and adding to the RSS feed for the rest of the world to see.
At this point your page is 100% unique, absolutely first, nobody can claim ownership otherwise because you were the first to publish it worldwide.
Pretty trivial really, just keep it hidden until Googlebot fetches the page.
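A rough sketch of the "wait for Googlebot" step, for anyone who wants to try it. It only covers checking your own access log for a Googlebot fetch of the still-unlinked page; how you tell Google about the URL is left out. The log format (Apache combined), the function name and the sample path are assumptions for illustration, not anything Google publishes. A user agent can be faked, so treat a match as a hint rather than proof.

```python
import re

# Assumed format: Apache "combined" log lines, e.g.
# ip - - [time] "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_has_fetched(log_lines, path):
    """Return True if a client calling itself Googlebot got a 200 for `path`.

    Hypothetical helper: the user agent string is not verified here, so this
    only says that *something claiming to be Googlebot* fetched the page.
    """
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if (
            match.group("method") == "GET"
            and match.group("path") == path
            and match.group("status") == "200"
            and "Googlebot" in match.group("agent")
        ):
            return True
    return False
```

The idea would be to run this against the log on a schedule and only link the page into the site and the RSS feed once it returns True for that page's path.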
Bill, just a thought, if you keep it hidden Google will not have record of any links to the page when they find it, until they re-crawl the page it can easily be outranked with a copy that DOES have links pointing to it.
When you make the article live is when the link juice to the page can kick in and it's possible what you describe results in a much bigger window between publish and re-crawl versus the normal publish and first crawl.
Google will know your page came first but there is no guarantee that publish date is more important than popularity (ie: pagerank, trust rank and whatever else Google uses). There's a chance delayed publishing actually helps the scraper... if the scraping site is of high quality (in google's robotic eyes)
brotherhood - receivers don't work, they accumulate too much automated spam.
Yea I noticed for the first time yesterday that a blog post I published yesterday morning was ranking 1 result higher on another blog that's been republishing all of my posts. I've got it pretty well laced with links back to my site and proper attribution, which they don't take out, so it usually ends up sending me traffic. But this is the first time I ever saw Google rank it above mine. So now I guess I gotta do something.
One of my websites, which I published last year, received good rankings in Google and had 100% unique content. Then a few months back my site got penalized, and on checking a few things I found that all of my articles had been copied by more than a thousand scraper sites. And they were now ranking above me on all keywords. I cannot do a thing here, because I cannot even think about going after a thousand site owners; it would take ages. I still remember those good old days of 2000-2007.
|if you keep it hidden Google will not have record of any links to the page when they find it |
When Google crawls a single page from a sitemap it doesn't know of any other links to that page already, this is nothing new.
My suggestion is just a possible way to establish content ownership by flagging it to be crawled before being displayed to the rest of the world, thus preventing aggregators that may be indexed faster than you from claiming the content first.
Once Google crawls that new page, you can use another sitemap ping to show them a page that links to that new page.
It sure would be nice if they honored such a simple scheme to prove ownership.
This is what I do: I get the page that the link is on crawled first (only the bot sees it), then as soon as the page that the link points to gets crawled, it goes live within 20 minutes or so. The RSS feed usually carries one sentence from the entire article, and in the article itself that sentence is usually rewritten to use different wording. People who subscribe to that RSS feed know that it's in "we're almost there" status, and would usually link to it as soon as it becomes live. Custom CMS.
Yes it is a bit of cloaking but for the sake of being first to deliver 100% unique, absolutely first, nobody can claim ownership of that content.
My 403 Magic Wand is pretty rough and I do keep automatic Scrapers at BAY, been doing it for a while so I am not really worried about that.
And still, after the content is copied to another site, off the rankings go.... :(
I've used the DMCA route about 200 times with I'd guess a 95% success rate... it is very simple once you get the paper work correct. At this point I've done so many that I don't need to do paper work.. I just tell google it is me again and send my url and the offender url and it is taken care of.
And completely pointless because Google seems to ignore the age of urls (as I pointed out above and pointed out several years ago).
|Pretty trivial really, just keep it hidden until Googlebot fetches the page. |
Google have a habit of first indexing a page, then deleting it, then indexing it again. Scraper sites can benefit this way.
Great topic; as with many above, the sites I run are continuously copied, both manually and automatically.
Three interesting issues
1) The mechanics of issuing DMCAs... I always file against google, and the host if based in Europe or the US (I've had no response from Chinese ISPs) - create the paper work for them, it costs them - though google seem to have outsourced their DMCA handling to some country with English as a 9th language, and instructed the drones to not read the DMCA and send scripted responses. I hate the "please list all the copied URLs" - it's the ENTIRE DAMN SITE - it has thousands of copied pages.
2) Why does Google (any SE) not establish some kind of trust mechanism that means that if in doubt site A (established, good site, running for years, often updated) outranks site Z (new scraper site hosted in China) when there is any duplicate content?
Tedster: " The part of Google's algorithm that is supposed to locate the original and filter out the copies is currently broken. Make that more broken than it was before. It's a tough problem, I'll give them that, but it used to be better than it is. "
Totally agree with Tedster here.
3) The ethics of the no-cost instant digital copies.
Tedster: "There is a philosophy that underpins a lot of the technical world - not just Google - that the human race is entering a new age where intellectual property rights just will vanish into "the cloud". If you ever casually reused an image without being certain you had the right, or if you ever took a copyrighted song or movie from P2P or the torrents, you may be sharing in this same mindset."
No no no ;) - there is a huge difference here! If I were to copy Avatar, remove all the credits, list myself as the director, release it at the cinema, convince Amazon and Walmart to carry my "Avatar" and drop the James Cameron original, rename the actors, take the credit for the special effects, sell merchandise and scoop up the profit from it - that is what is happening here.
What would happen if Amazon were found to be selling pirated DVDs? This is what google are complicit in - Amazon would still make the profit, as does google, since 99% of scraper sites have adsense plastered all over them. It's disgusting that they do not take it more seriously.
gethan, I'm not saying that the two different acts are of the same severity, because they clearly are not. But they do both spring from the same underlying mindset - at least for those who feel they have a right to use others' creativity for free. For people who know they are stealing but do it anyway, well, they're in a different place.
I agree with Tedster. The definition of intellectual property is becoming fuzzy. With digital technology, the concept of property is freed from being tied to something tangible. That makes re-evaluating what we mean by "property" inevitable.
It doesn't mean all respect for IP will necessarily go away, though. I actually think the opposite will happen. As people collaborate MORE and share MORE of the fruits of their creativity, there will grow a common sense of mutual respect for everyone's property - however its limits are defined.
An area ripe for change is fan fiction. Unless going through special arrangements, you can't make money off of fan fiction because of miserly intellectual property rights. But fan fiction has blossomed with the Internet - and there's a hugely loyal following. The only reason it's not monetizable are those old-fashioned notions of IP.
But I can think of a couple of business models offhand that could allow fan fiction to be monetized so nobody is hurt (except perhaps the original writer's ego) and everybody wins (including the original writer's pocketbook).
With more liberal ideas of what constitutes fair use, and more models for compensation, we could see such an explosion of creativity it's not funny.
That said, I think all that is in the far future, decades away. In the meantime, we're in a struggle between people who respect IP and those who don't. Although I do think it's a pretty weak struggle, because once again, IP's going through a change. And that change seems to correlate with a generational shift as younger people are staking out their territory online.
The ethics of intellectual property are tied to the daily reality of living. And that reality differs among the generations.
When I was growing up, it was easy to make tape recordings and photocopies and sneak into movie theaters. Our generation (Gen X) didn't think that was wrong; the reasoning went, if it was that wrong, there would be some enforcement, wouldn't there? And anyway (the reasoning continued) the rules couldn't be fair, because not everybody could afford to read, see, and listen to all that media, but knowledge of the media was vital to keeping our social position, which was vital to our finding mates, succeeding in careers, etc., and what kind of "equal opportunity" was that, anyway? And so on. The point is, the world looked different to a Gen Xer than to a baby boomer. (In general.)
These days, it's easy to copy and paste and watch pirated TV on YouTube. Try to tell a Millennial that copying is wrong, mashups are so wrong they're thievery, and not paying for something they can't afford in the first place is malicious mischief, and you'll get looked at like you're insane. Not because they're morally destitute, but because the ethical world you're describing isn't reflected in the real daily fabric of their world.
And incidentally, there's another dimension here, too, across cultures. It may not always be a lack of respect that's behind scraped content. Though I can't remember the source - sorry - what I recall is that in China, the government is reputed to suppress people's access to certain web content. As a consequence, it's a common practice to distribute content as far and wide as possible, not necessarily, or only, to steal the content, but to protect it.
Not saying that's what is happening even most of the time, or even that that makes a difference to us - after all, from our perspective, scraped content is scraped content - but just that ideas of intellectual property blur from culture to culture because of different degrees of freedom and opportunity.
Once language barriers are surmounted by technology...once the younger generations gain more of a voice...different populations will be influencing each other like mad.
Which is why I just don't know what IP will look like in the future, but I think it's going to be very different from the IP of the past.
I reported it to Google through google.com/webmasters and asked them to review the site. I've also written to the offending site's web host, but have still received no reply. What else can I do?
To be blunt, this is nonsense.
|The definition of intellectual property is becoming fuzzy. |
I haven't bothered to read a copyright notice in full for many years but they are pretty all-encompassing as are relevant laws. They are certainly able to cope with the digital age.
Maybe I'm cynical but I think this is an absolute classic mistake. I doubt there ever was or has been a filter that attempted to distinguish copies from originals. Any perceived filter is almost certainly a side effect of other algorithms.
|The part of Google's algorithm that is supposed to locate the original and filter out the copies is currently broken. Make that more broken than it was before. |
Frankly, I'm fed up of saying it but, just one more time...
The only way Google could implement automated identification of copies would be by using the age of urls. If Google (and others) had ever done that, there would be very little professionally copied material on the internet.
Google has an interest in removing duplicate content because it clogs up its database but it has no interest in determining what is original and what is the copy. Google does not care and has never cared about third-party IP rights and they never will until someone forces them to do so.
PS I hereby rescind all IP rights to the above paragraph. Copy, paste, do whatever you want, start a new thread, but please stop talking about Google's broken filters, etc. - this is almost certainly pure fantasy.
Actually, I couldn't care less if someone else copies my content. I just need Google to stop ranking them instead of my site!
Imagine working so hard to create original content but traffic ends up going to an ugly looking site elsewhere that uses your pictures with your watermark.
It's really adding insult to injury.
I thought I was the only one thinking the same.
|Google is getting original source attribution wrong more often than they were, ranking the scraped or mashed-up URL and filtering the original. My theory? |
Our sites' content is completely available on blogspot and wordpress subdomains. After we report them those copies are removed, but we cannot control this automatically.
The only way we found was to disable the RSS feature... but scrapers still copy/paste entire pages, affecting our site very badly.
tedster, is there any way we can keep a watch on the guests continuously harvesting our website?
Is there some tool which can track this?
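One low-tech way to keep watch is to count requests per client in the server access log and look at whoever is pulling far more pages than a human would. This is only a sketch: it assumes each log line starts with the client IP (as in the common Apache/nginx formats), and the function name and threshold are made up for illustration. It will not catch a person copying a handful of pages by hand.

```python
from collections import Counter


def heavy_clients(log_lines, threshold=100):
    """Return (ip, request_count) pairs for clients at or above `threshold`.

    Assumes the first space-separated field of every log line is the
    client IP address. Results are sorted by request count, highest first.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(" ", 1)
        if len(parts) == 2 and parts[0]:
            counts[parts[0]] += 1
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]
```

Feeding it an hour or a day of log lines at a time gives a short list of addresses to check by hand before deciding whether to block them.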
I broadly agree with Kaled, and can even give you a reason for the apparent deterioration of the perceived "filters".
Google used to revere "age" for its own sake. Older was better, whether it be domains or content. IMHO, it wasn't just valued, it was overvalued.
Recently, the value of "age" has decreased and "Freshness" is the new UberValue.
So, what's the real world implication of this, as it relates to scraping? Well, the first thing to say is that original content used to outrank the scraper because it was older. Only older, not because it was original, although this was a happy coincidence. Older scrapings might eventually outrank the original, but usually the scraper had been penalised into oblivion before that happened. New scraper sites had no chance, except where the content was also newish. This led to the myth that scrapers only win because Google could not differentiate the original, when in fact no attempt was made to differentiate.
In the Era of Freshness, "Age" might be a tie-breaker, but it's not a defence against scrapers, especially if the scraper "buzzes" its content through social media.
Result: Scrapers outranking original content becomes more common.
Solution: Probably can't be tackled from inside the system (i.e. through SEO). DMCA is one option. A class action suit might be an option. I suggest it be filed in Europe, probably Germany or France. Much better chance of scaring Google, but with the potential downside of uniting America against those pesky, interfering, anti-American Europeans. Also, the financial settlements are historically smaller outside the US than within.
Otherwise, it's long and messy .htaccess files to block at point of scrape. Which is silly, because Google could stop this very easily. The overwhelming majority of content will be indexed before it is scraped. Google HAS the age information; it should use it.
Although, of course, ecommerce and affiliate sites would be in a fix, because most product pages use manufacturer information, for the very good reason of not misdescribing goods for sale.
Short of taking legal action (which is probably the only thing that will work), writing to the BBC (or another news organisation) might help. If this problem became well known it is possible that Google would be forced to take action.
If someone fancies going this route, be aware that the blame must be laid firmly at Google's door. It's no good allowing them to claim innocent victim status; it must be made clear that not only are they active participants in IP theft and deception of the public, but they profit from this theft and their algorithms actively encourage it when it would be easy to crush it.
Make that argument loudly enough and maybe someone will sit up and take notice. Maybe one or two agencies that are already investigating Google over the wireless data-logging fiasco will take it up to use as a big stick.
|2) Why does Google (any SE) not establish some kind of trust mechanism that means that if in doubt site A (established, good site, running for years, often updated) outranks site Z (new scraper site hosted in China) when there is any duplicate content? |
I really think that part of the problem we face as Webmasters today, is that almost ALL of our sites DO HAVE HISTORY with Google, and regardless of whether we admit or not, what we did yesterday "to rank better" that was squeaky clean and cutting edge, is now considered questionable and often penalized, putting us at a disadvantage to begin with. (Link directories, article marketing, long tail targeting, link exchanges, etc)
The REAL ISSUE with scraper sites are sites that are +90 days old and still outrank you, or are even in the engine for that matter! If you have those kinds of aged scrapers outranking you in the SERPs, you need to fix the inhouse or onsite issues as well as managing the offsite scraped content.
If Google continues to keep scrapers in their serps after they KNOW the content is copied, and lets face it, they DO know it... well then they suck! There, I said it! :-)
Just my .02
|The REAL ISSUE with scraper sites are sites that are +90 days old and still outrank you, or are even in the engine for that matter! |
No, the real issue is the technology exists to stop most scrapers and instead of installing something that will solve the problem, people just keep getting scraped and complain about it.
Google can only do so much and Google isn't the only search engine in town, the problems also exist in Yahoo and Bing (soon to be the same).
At some point the webmaster has to protect themselves and claiming they "don't care about scrapers" yet "care scrapers outrank them" is idiotic because technically they CARE ABOUT SCRAPERS!
We used to have 302 hijackings, back in 2006, and we raised hell about validating spiders until it culminated with Dan Thies and myself raking Google over the coals in public at SES '06 in San Jose to get them to agree to give us the tools to fix the problem. Both Microsoft and Ask jumped in at that session and promised to do it as well, and they all did, and it became an easy fix.
However, webmasters still didn't install the simple fix to validate the spiders, which stopped 302 hijackings, and just kept complaining.
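The post does not spell out the fix, so take this as an assumption about what is meant: the commonly documented way to validate a spider is a reverse DNS lookup on the visiting IP, a check that the hostname belongs to the search engine's domain, and then a forward lookup to confirm the hostname resolves back to the same IP. A minimal sketch for Googlebot follows. The function name and the injectable lookup parameters are hypothetical (they are there so the logic can be tried without touching the network), and the default lookups only handle IPv4.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check for an IP that claims to be Googlebot.

    reverse_lookup(ip) -> hostname, forward_lookup(hostname) -> list of IPs.
    Both default to the standard library resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        # Step 1: the PTR record must sit under one of Google's domains.
        if not host.lower().endswith(GOOGLEBOT_SUFFIXES):
            return False
        # Step 2: the hostname must resolve back to the original IP,
        # otherwise anyone could set a fake PTR record on their own range.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror / socket.gaierror: treat DNS failure as unverified.
        return False
```

Anything that sends a Googlebot user agent but fails this check can be served a 403 without risk of blocking the real crawler. DNS lookups are slow, so in practice the result would be cached per IP.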
If Google gave everyone the technology to make sure scrapers didn't outrank their content tomorrow, based on past history, my suspicion is most webmasters still wouldn't install it.
Many in the industry, including myself, don't mind fighting the fight to get things fixed but when the fight is over and people ignore the solution, why bother?
|If Google continues to keep scrapers in their serps after they KNOW the content is copied, and lets face it, they DO know it... well then they suck! There, I said it! :-) |
a poster on this board mentioned that bing's share of the search market has been growing slowly, and he interpreted the data to mean that people are searching for something on google first, becoming dissatisfied with the results, then turning to bing as a backup.
I am not sure what the best course might be - legal action would be costly and difficult, so maybe we just need to spam google into action; I am thinking that all legitimate webmasters should buy up a bunch of cheap keyword-rich domains and scrape as much content as possible. Only when google sees its user base plunge - and a drop in adwords revenue, will it actually change how it ranks duplicate content.
|No, the real issue is the technology exists to stop most scrapers and instead of installing something that will solve the problem, people just keep getting scraped and complain about it. |
Doesn't the technology already exist in the voodoo of the algo though? Haven't Goog built filters or triggers for duplicate content, and presumably weight the results toward the authority site?
I hear what you are saying and agree. It's not up to anyone other than the webmaster of a site to maintain its own backyard.
My previous point was that if a toddler-aged scraper site outranks what you (referring to the site owner) think is an authority site, it's probably time to look inhouse for other issues.
|Doesn't the technology already exist in the voodoo of the algo though? |
Not based on conversations I've had with some of the search experts at the 'plex.
Content ownership appears to be a big bugaboo that is yet to be solved and aggregators often get the upper-hand with our content just because of their popularity alone.
Remember, technically Google is just a scraper/aggregator of our data so it's hard to fault them for not penalizing others that technically do the same thing.
IMO it's really up to the webmaster to decide which scraper/aggregators are allowed to get your content, not Google.
|Many in the industry, including myself, don't mind fighting the fight to get things fixed but when the fight is over and people ignore the solution, why bother? |
Ok, so what is the solution? I'll implement it right now.
The solution starts at the search engine spider section on this wonderful forum [webmasterworld.com...]
Well, that link tells you how to block the scraping bots. But how about human scrapers? This is particularly difficult to handle if your site has only a few pages, as they could manually copy the content.
@kaled, regarding age, what if the scraper site gets indexed first? They do it by feeding the googlebots a lot of "fresh scraped content" and thereby keep them arrested on their website.
I guess the only solution to this is a way for the original content provider to ping google and other search engines to let them know that they have some fresh original content, before they release it to the world (a sort of "push" architecture as against the current "pull" architecture that google bots use to devour (index) content).
I am not convinced that this is deliberate because some of the scraper sites are scraping Google (Google Groups hosted mailing lists in particular), so Google are losing revenue.
That said, the duplicate content filters are not working well.
Almost everyone who has sent DMCA notices says they work, so there is a solution once you notice the scraper: DMCA notice to site owner AND host AND Google and Bing.
Now how do we spot scrapers? Copyscape charges per page, so it becomes really expensive very quickly ($250 a month per thousand pages).
@indyrank, If you go back, you'll see that I covered that. The solution isn't perfect but it's probably the best that can be done.
|Now how do we spot scrapers? |
A workaround for Copyscape would be to add some unique words, or typos, within your long text. Most scrapers wouldn't notice and edit them. Use your SE to find and spot them.
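One way to produce such unique words without keeping a list of them is to derive a nonsense word from each page's URL and a private secret, so it can be regenerated any time. This is just a sketch of the idea; the function names, the secret and the token length are all assumptions, and nothing here does the actual searching.

```python
import hashlib
import hmac


def canary_token(page_url, secret, length=10):
    """Derive a repeatable nonsense word for one page.

    The HMAC digest is mapped to the letters a-p so the result looks like
    an odd word rather than an obvious hash.
    """
    digest = hmac.new(
        secret.encode("utf-8"), page_url.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])


def search_query(token):
    """Exact-phrase query to paste into a search engine."""
    return f'"{token}"'


def pages_found_in(text, token_by_url):
    """Given suspect text, list which of your pages' tokens appear in it."""
    return [url for url, token in token_by_url.items() if token in text]
```

You would work each page's token into its text once, then periodically search for the quoted token. Any result that is not your own site is a copy, and the token tells you which page it was lifted from.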
|Most scrapers wouldn't notice and edit them |
What is the purpose of "them" noticing it if you have to fight it later on anyway? The source of the Issue is $$$.$$$.$$$ collected by IT(GORG, Scrapers), in the same layer of: will let it go tile till most B&M&W start "*itching bout that"?