
Google SEO News and Discussion Forum

This 41 message thread spans 2 pages: 1 [2]
My SERP positions taken by scraper sites
danijelzi



 
Msg#: 4298972 posted 3:54 pm on Apr 16, 2011 (gmt 0)

I don't know if it's due to Panda or not:

My site is 5 years old and has relevant inbound links, mostly pointing to originally written news post pages, and I don't think my site is a content farm.

Here's the pattern for the last two days:

- I publish an original news post and after an hour or more I get links from relevant sites.
- The related posts with backlinks to my post (on these relevant sites) get to the top of the Google SERPs and my page is somewhere around #5.
- After a couple of hours, a couple of scraper sites take over my position and I'm on the 2nd or 3rd page, or simply nowhere.

I was curious and ran the same check on news from a competitor whose site is similar to mine. The pattern in his case is less severe, but scrapers still rank above him.

I've filed a spam report in GWT and am waiting for a resolution.

Does anyone else experience the same or similar thing?

 

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4298972 posted 7:58 am on Apr 20, 2011 (gmt 0)

Update 2: It takes around 10 minutes for a scraper site to take my content and position itself on #1 in SERPs


Why are you allowing anyone to scrape?

Are you publishing full RSS feeds or just snippets?

Are you whitelisting in robots.txt and .htaccess so that only a few spiders and valid browsers can access your site?

Are you cloaking tagged content for browsers (not SEs), putting tracking codes hidden with CSS in your text, so you can ID who scraped your site and block them?

Are you sending DMCA notices?

If you aren't, stop complaining and get busy stopping the scrapers.
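
A minimal sketch of the tracking-code idea mentioned above, in Python. The secret key, the `trk-` token format and the helper names (`make_tag`, `embed_tag`, `find_tags`, `serve`) are all assumptions made for illustration; nothing here comes from the thread. Each request gets a short token derived from the visitor's IP and the time, hidden in the page with CSS. If that token later shows up on a scraper's copy, it points back to the request that fetched it.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"change-me"  # hypothetical secret; keep the real one private

# Maps each issued token to the (ip, timestamp) of the request that got it.
access_log = {}

def make_tag(ip, timestamp):
    """Return a short token tied to who fetched the page and when."""
    payload = f"{ip}|{timestamp}".encode("utf-8")
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:12]
    return f"trk-{digest}"

def embed_tag(html_body, tag):
    """Append the token in a span that CSS hides from human readers."""
    return html_body + f'<span style="display:none">{tag}</span>'

def serve(html_body, ip, timestamp):
    """Issue a token for this request, remember it, and return the tagged page."""
    tag = make_tag(ip, timestamp)
    access_log[tag] = (ip, timestamp)
    return embed_tag(html_body, tag)

def find_tags(scraped_text):
    """Return every token found in text copied from a suspected scraper."""
    return re.findall(r"trk-[0-9a-f]{12}", scraped_text)
```

Looking up a token returned by `find_tags` in `access_log` gives the IP to block. The post suggests serving the tagged version to browsers only, not to search engine spiders; that check is left out of this sketch.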

rico_suarez



 
Msg#: 4298972 posted 10:25 am on Apr 20, 2011 (gmt 0)

Google cannot possibly tell the original author from a scraper site (unless you make a complaint). When a scraper site steals your content, it must happen after you published it (obviously), and that makes the scraper's copy more recent and thus higher in the SERPs. What I've noticed with scraper sites is that most of them still use black hat SEO, with a ton of keywords and tags on the page, and it obviously works. Whoever said that a ton of internal links and keyword-stuffed content doesn't work has no idea what they are talking about.

incrediBILL




 
Msg#: 4298972 posted 10:43 am on Apr 20, 2011 (gmt 0)

Google cannot possibly tell the original author from a scraper site (unless you make a complaint).

Not true whatsoever.

Simple methods exist but Google ignores them.

If the author of the content uses a sitemap ping and waits for Google to index the content before releasing it to the public, it's quite obvious where it appeared first on the web.

Only the original owner would be capable of making that first sitemap ping. Just wait a bit before releasing the content, to make sure nobody else could register it in a sitemap within a minute or so. Real simple.

Oh wait, they could already do that and don't, assuming people are doing live sitemap pings for new content.
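
For context, a sitemap ping was a plain HTTP GET that carried the sitemap URL as a query parameter. The Python sketch below only builds such a URL and does not send anything. The endpoint is an assumption for illustration (Google has since retired its ping endpoint, so treat it as historical), and the sitemap address is a made-up example.

```python
from urllib.parse import urlencode

# Assumed ping endpoint, for illustration only; check the search engine's
# current documentation before relying on anything like this.
PING_ENDPOINT = "http://www.google.com/ping"

def build_ping_url(sitemap_url, endpoint=PING_ENDPOINT):
    """Return the GET URL that announces a sitemap to the search engine."""
    return endpoint + "?" + urlencode({"sitemap": sitemap_url})

# Hypothetical sitemap that lists only the brand new article.
url = build_ping_url("http://www.example.com/sitemap-news.xml")
```

In the workflow described above, the publisher would request this URL first, wait until the page is indexed, and only then link the article publicly.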

tranquilito

5+ Year Member



 
Msg#: 4298972 posted 11:30 am on Apr 20, 2011 (gmt 0)

Finding the original source seems to be a total disaster [seomoz.org]

rico_suarez



 
Msg#: 4298972 posted 11:38 am on Apr 20, 2011 (gmt 0)

You are right, but there is a simple technical catch. Imagine hundreds of millions of posts, images, sounds etc. being published every day. Can you imagine an algo that could compare all those sitemaps with each other and then make a decision on who was first for what? I think many people are overstating the power of Google's algo. Because of the amount of data it must take into consideration, it must take shortcuts, and it's certainly doing so. It's simply not viable to take an article, compare it to a billion pages, and then do the same thing for every new article on the Internet just to establish the original author. If Google could do that, DMCA complaints wouldn't exist; Google would know who is stealing what. That's why they created PR and other parameters to give weight to sites. And sometimes scraper sites outrank original sites. That's sad but true.

incrediBILL




 
Msg#: 4298972 posted 11:48 am on Apr 20, 2011 (gmt 0)

can you imagine an algo that could compare all those sitemaps with each other and then make a decision on who was first for what.


Yes, wouldn't be that complicated at all.

I was suggesting a simple PING thing, like maybe one document (or more) per sitemap, not a whole sitemap, just brand new content only, maybe with an extra field on the ping for "author".

When the SE ranks the sites to display the SERPs, simply rank in order of the original content sitemap PING, with the first original AUTHOR by date as #1. The real author would always take top billing, not anyone hijacking the content. If you find additional sites claiming to be the author of the same content, or someone trying to "author" pre-existing content, the fake "authors" go supplemental instantly.

How's that for an incentive not to cheat, scrape or aggregate knowing you'll nuke your page if you attempt to claim ownership for something you didn't write?

Simple solution, authors get top billing, nobody gets to claim historical pre-indexed content as new.

I like it.
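
The proposal above can be sketched as a first-claim registry. This Python example only illustrates the idea as described in this post and is not anything Google is known to do. The `AuthorRegistry` class, the SHA-256 fingerprint and the whitespace/case normalization are assumptions made for the example.

```python
import hashlib

def fingerprint(text):
    """Hash the text after collapsing whitespace and letter case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class AuthorRegistry:
    """Remembers which author pinged each piece of content first."""

    def __init__(self):
        self._first_claim = {}  # fingerprint -> (author, ping_time)

    def ping(self, author, text, ping_time):
        """Record an authorship ping; pings are handled in arrival order.

        Returns "original" for the first claimant (and for that author's
        later re-pings) and "supplemental" for anyone else claiming the
        same content afterwards.
        """
        key = fingerprint(text)
        first = self._first_claim.get(key)
        if first is None:
            self._first_claim[key] = (author, ping_time)
            return "original"
        return "original" if first[0] == author else "supplemental"
```

An exact hash like this only catches straight copies. Content that has been reworded would need a near-duplicate comparison instead.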

rico_suarez



 
Msg#: 4298972 posted 1:09 pm on Apr 20, 2011 (gmt 0)

If you ping authorship of the article "How to build a house" at 12:05, and a scraper site steals your article, changes the title to "Building a house - simple way" at 13:05, and creates its own authorship ping, then the scraper could easily outrank you if it has higher PR or more traffic. Google does not read content to compare it. Most probably it uses other parameters, like PR, links, web brand etc., to decide where the article will be placed. Your article and the stolen article are the same, but they have different authors, and if Google doesn't read them, both look perfectly legit and different. Then other parameters decide.

incrediBILL




 
Msg#: 4298972 posted 1:14 pm on Apr 20, 2011 (gmt 0)

The title isn't the content. If the content fundamentally matches (and we have to assume Google can detect this in Panda), then the original should win, unless it's gotten some serious text spin. However, if you change the title and spin the text, it ranks for different words.
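
To illustrate why a changed title alone does not hide a copy, here is a rough Python sketch of one common near-duplicate measure: word shingles compared with Jaccard similarity. Nothing in this thread says Google uses this method. The function names, the 3-word shingle size and the sample texts are assumptions.

```python
import re

def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given length."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(text_a, text_b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Same body under two different titles, plus an unrelated text.
body = ("Pour the foundation first, then frame the walls "
        "and raise the roof before winter.")
original = "How to build a house. " + body
scraped = "Building a house - simple way. " + body
unrelated = ("Ten recipes for a quick weeknight dinner "
             "with pasta and fresh vegetables.")

same_body_score = jaccard(original, scraped)
unrelated_score = jaccard(original, unrelated)
```

With a realistic article length the title is a tiny share of the shingles, so a retitled copy scores close to 1.0. Heavily spun text would score much lower, which matches the caveat above.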

rico_suarez



 
Msg#: 4298972 posted 1:23 pm on Apr 20, 2011 (gmt 0)

I seriously doubt that Google compares your content or mine with other sites to see if it matches. It would take tremendous resources to find out which articles, images, sounds, products etc. were copied from where, and why they were copied: because I, as the author, authorized the other site to publish them, or because that site stole them. I agree that the original author should win, but there is a whole science behind why it doesn't always win.

londrum

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4298972 posted 1:37 pm on Apr 20, 2011 (gmt 0)

the problem is this: the only people who care about who wrote it first are us, the writers.
but google doesn't need to please us, they need to please the reader.
so if a good, big site copies it then the chances are that the reader will prefer to read it there.

google doesn't want little blogs filling all the top spots when they can have bigger sites there instead. it's not worth the effort of trying to sort it all out, there's no incentive.

danijelzi



 
Msg#: 4298972 posted 2:35 pm on Apr 20, 2011 (gmt 0)

stop complaining and get busy stopping the scrapers.


incrediBILL, I'm not actually complaining that someone scrapes my site. My complaint is that, after Panda, the SERPs for some keywords in my niche look like this:

#1 scraper, ads all over the page
#2 scraper, ads all over the page
#3 scraper, ads all over the page + malware
...
#20 relevant site, original source.

I don't want Google to investigate who first wrote an article, I just want them to provide people who search with at least a normal user experience, not loads of totally irrelevant sites with pages full of irrelevant ads and text 1000 pixels below the fold.

Regarding the scrapers, I've tried a couple of things:
- blocking their IPs is effective only for those who take content from RSS feeds.
- RSS delay doesn't help, it just delays the scraping and the high ranking of spammers.
- filing spam reports didn't help; maybe it will help later.

However, my last article wasn't scraped by anyone even after 18 hours, and that's after I had made some changes:

- immediate ping to PubSubHubbub
- added a copyright notice and privacy policy link in the footer (maybe these scrapers automatically avoid scraping pages with a copyright notice, I don't know)
- added a rel=canonical tag to pages.
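
For readers unfamiliar with the first change in the list above, a PubSubHubbub publish ping is a form-encoded POST telling a hub that a feed has new content. The Python sketch below only builds that request with the standard library and does not send it. The hub address and feed URL are placeholders, not the poster's actual setup.

```python
from urllib.parse import urlencode
from urllib.request import Request

HUB_URL = "https://pubsubhubbub.appspot.com/"  # assumed hub; use your own
FEED_URL = "http://www.example.com/feed.xml"   # hypothetical feed address

def build_publish_ping(feed_url, hub_url=HUB_URL):
    """Build (but do not send) the POST that notifies the hub of new content."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url})
    return Request(
        hub_url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

request = build_publish_ping(FEED_URL)
```

Passing `request` to `urllib.request.urlopen` right after publishing would send the ping, so the hub learns about the article from the original feed before any copy appears.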

If it stays that way, I can say that the scraping problem is solved. I'll report back after more analysis on whether that helped me in the SERPs.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved