This 31 message thread spans 2 pages.
Idea for new algorithm to prevent scraper sites from outranking you
Just a little more intelligence could weed the scum out of the index
I have a great idea for Google’s algorithm architects. Since there are over 65,000 scraper sites out there that have copied contents from my web site alone, and probably yours as well, and they rank higher than you for searches of your own content, I have a great suggestion for Google’s algorithm.
Maybe Matt Cutts can pass this suggestion over the wall to his coworkers:
BASE THE RANKING ON AGE, goof balls!
They don’t seem to do that now!
Use an automatic WHOIS lookup to see which sites have been around the longest.
If you come across 2 sites with duplicate content, give the ranking to the site that has been online the longest, and drop the second site from the index.
Period. End of story.
Further enhancement: Delete any site that looks like a search engine results page.
Further enhancement: Delete any site with a ton of keywords stuffed in the bottom of their page.
1. Crawl Jeff's original bridal tips and diamond buying guide site.
2. Crawl the scammer's scraper site.
3. Find the duplicate content (which the scraper site stole from our site).
4. Perform a WHOIS lookup on both sites.
5. Compare ages. Jeff's site online: 8 years. Scammer's site online: 8 days.
6. Result: scammer's site is not in the index, and its URL is sandboxed. Jeff's site: Rank = 1, PR = 7.
End of algorithm.
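The flow above could be sketched roughly as follows. This is a minimal illustration, assuming the WHOIS creation dates have already been fetched somehow; the domain names and dates are invented for the example:

```python
from datetime import date

def pick_original(site_a, site_b, creation_dates):
    """Given two sites carrying duplicate content, credit the one whose
    domain was registered first; the newer domain is treated as the copy."""
    # Older WHOIS creation date wins the ranking slot.
    if creation_dates[site_a] <= creation_dates[site_b]:
        return site_a
    return site_b

# Hypothetical WHOIS data matching the example above.
creation_dates = {
    "jeffs-bridal-tips.example": date(1998, 3, 1),  # online ~8 years
    "scraper.example": date(2006, 5, 20),           # online ~8 days
}
winner = pick_original("jeffs-bridal-tips.example", "scraper.example",
                       creation_dates)
# winner == "jeffs-bridal-tips.example"; the scraper would be dropped
```

A real implementation would also have to handle missing or privacy-shielded WHOIS records, which is one place the idea gets harder than the sketch suggests.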
What if the scraper's domain is older than the domain they stole the content from?
Usually when a scraper outranks your site for your own site name or for specific text from your site, it means your site either doesn't have many links pointing to it or has a penalty that is causing it to rank lower than the scrapers.
So you could scrape stuff off my year old wedding site and yours would get indexed and mine wouldn't, even though I'm the author?
I have always said that a simple cache comparison might tell them who had the content first. If my content was created and indexed 18 months ago, then bam - a new page shows up with the same content today, clearly it has used or stolen the original.
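That cache comparison could be sketched as a first-seen index: fingerprint each page body and remember which URL was crawled with it first. This is a toy in-memory version under that assumption (a real index would persist the fingerprints, and the URLs and dates here are invented):

```python
import hashlib

first_seen = {}  # content fingerprint -> (url, first crawl date)

def credit_original(url, text, crawl_date):
    """Return the URL first crawled with this content;
    any later URL carrying the same content is a duplicate."""
    fp = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if fp not in first_seen:
        first_seen[fp] = (url, crawl_date)
    return first_seen[fp][0]

article = "Eighteen-month-old page body..."
credit_original("original-site.example/page", article, "2004-06-01")
# The same content surfacing much later is attributed to the original:
owner = credit_original("new-scraper.example/page", article, "2005-12-01")
# owner == "original-site.example/page"
```

Note this exact-hash version only catches word-for-word copies; a changed word produces a different fingerprint.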
Jeff, your advice is awful.
In case of that algorithm spammers will buy old sites.
In case of that algorithm spammers DO buy old sites.
|I have always said that a simple cache comparison might tell them who had the content first. If my content was created and indexed 18 months ago, then bam - a new page shows up with the same content today, clearly it has used or stolen the original. |
This sounds pretty good. I would prefer the above idea than the one suggested by the OP.
Hey jeffostroff, I have a better idea. Go right down to the patent office, register that algo immediately, and set up your own SE! G will be out of business in no time. Just make sure you use a very old domain name.
Seriously though, there are many reasons why this is not such a hot idea. I know scrapers do buy old domains. There are also millions of good sites out there that set up in the last 3 years, and why should they be penalized? I have 2 sites - the one set up 3 years ago is arguably 'better' than my older site set up 8 years ago. Nevertheless the younger one has been mysteriously penalized in BD costing me untold revenue.
I am of the opinion that maybe Google does have a bit of code already in place that favours very established (5 years+) sites, as I know of many other old sites that have stood the test of time with G right up to present. But I hear where you are coming from, this epidemic of rubbish/spam/scraper sites is killing the middle ground.
You are a genius. I wouldn't be surprised if Google recruits you directly from WebmasterWorld.
I have a good way to clean up some spam and scrapers in the index.
Ban AdSense sites....
How about indexing faster and keeping a history .. kinda like the sup results stuff .. then comparing the historic dates with scraped results?
Oh, they already do that, and it is failing miserably. Never mind.
I think you got it backwards, your assessment of my advice is awful.
When I look at the 65,000 to 85,000 scraper sites that have targeted our site, they are mostly created this year, some last year and in 2004 as well, many in the last month.
Many of these scraper sites use automated tools to generate their sites quickly using new domain names. The algorithm I propose would work on the vast majority of scraper sites.
Could a few scrapers sneak by the algorithm via obtaining older domain names? Absolutely! But we are talking a few, not all of them.
As it turns out, we created our sites in 1998, so the scrapers have their work cut out trying to find domain names older than that to camp out on. We have good links pointing to us, and we have shut down over 40 scraper sites this month alone. That will all help. Whenever we find a Page Not Found scraper site, we submit to Google’s automated URL removal tool, 3 days later they are gone.
But to dismiss my idea as awful shows a clear lack of thinking this through. The idea is to remove as many spam results as we can, and certainly this would remove the vast majority of scraper sites.
No single solution can remove 100% of the fraud.
It's getting so bad that almost any search I do on Google these days hardly ever yields what I am looking for; all I get is scraper sites that look like SERP pages, without the content that Google says should be there.
Anyway, I stand by my ideas, you can't just pick a few exceptions and claim it won't work.
I like the idea that crobb305 presented above, where they use the cache as well to help filter out scraper sites. That would catch the scrapers who buy older domain names.
Google should buy Archive.org, so they could also bounce the searches off Archive.org to see who has the original content (me), and who has the duplicate content (scammer from Korea). I often use screen shots of Archive.org in my DMCA Cease & Desist letters to web hosts to shut down sites who steal content from us.
|When I look at the 65,000 to 85,000 scraper sites that have targeted our site, they are mostly created this year, some last year and in 2004 as well, many in the last month. |
Bingo, and the entire idea makes a lot of sense.
There are going to be spammers who get smart and buy old domains to try to outrank the originals, but that is usually not the case today: 95% of spammers use new domains.
Done correctly, this could effectively eliminate the duplicate-content and spam ranking issues caused by that 95%.
Older domains and established sites often cost more to purchase as well. Adding a check for how long the domain has been registered would help further separate the wheat from the chaff.
Great idea about comparing pages, and their dates.... until you change one word on your page and suddenly your page is the brand new page...
|Great idea about comparing pages, and their dates.... until you change one word on your page and suddenly your page is the brand new page... |
Still, they could maintain a cache, similar to what webarchive does, and do a cache comparison. You could change a word, or the entire body, but there would still be a history showing the original content, and how it changed. The folks who change out their content regularly might be less concerned about people using their content than those who archive their old articles and keep them up, unchanged, indefinitely.
Of course, the person stealing your content could change a word in every sentence. That sort of theft would be hard to catch, I think. But in the instances where bots are just taking content verbatim, the cache comparisons might work.
I think we are all just thinking out loud here, and I don't think the original idea that started this thread would work too well, but it is an interesting problem to think about nevertheless.
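Changing one word defeats an exact-match comparison, but an overlap measure over short word "shingles" still flags near-verbatim copies. A minimal sketch of that idea, with invented sample sentences:

```python
def shingles(text, k=4):
    """All runs of k consecutive words, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=4):
    """Jaccard overlap of two pages' word shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "compare cut color clarity and carat weight before you buy any diamond"
scraped  = "compare cut color clarity and carat weight before you buy any gemstone"
# One changed word still leaves most shingles intact:
score = similarity(original, scraped)
# score stays well above 0.5, flagging a near-duplicate
```

A scraper rewriting a word in every sentence would still slip past this at k=4, so it only raises the bar rather than solving the problem.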
Using just age and/or the first-crawled date as factors is a bit dodgy.
Today, if you have RSS or rank well, your site and content is duped in a matter of minutes. There is a good statistical chance that it will end up on a site that is older than yours and googlebot will index it before you.
Unfortunately ... Google's mantra these days is statistical, and there can be some collateral damage.
Having said all of that, imagine how many times Amazon or any of the seriously top ranked sites are duped every day. They still rank ... why? :)
Forgive me, I am fairly new at this; I understand scraping, but have never heard of a scraper site. Can someone please explain why this is done?
It's done in an attempt to get the scraped version of the page to rank, get traffic and then show that traffic advertising (make money from either impressions or clicks).
I think that we're overlooking the main point here. Search engines are dying, and it's at least in part because of the mass of junk directories and scraped content, and even because they are just yesterday's technology. In the early days people needed help finding things online, but it's all changed now. Most people have their trusted sites, blogs, auction sites, etc. They don't need to look them up on a search engine anymore, as the world now accepts that the internet is a great place to buy & sell, and there is much wider advertising for websites than just Google.
My point in relation to this post is that, yes, it would be very nice to do away with the junk on Google SERPs, but in the scheme of things it doesn't make a blind bit of difference to the casual searcher. IMHO I think Google is moving away from the old search engine model and gradually becoming community based, as with MSN & Yahoo, who seem to offer an altogether more rounded experience for the user.
Really dragging this out (sorry for harping on): how many people use the telephone book anymore? It has simply become an old-fashioned method of finding information, used mainly by call centres trying to sell you double glazing and cheap flights. I think the search engines are the same. They've become a place for people to make a few bucks as commission salesmen ... very rarely do you actually find something earth-shattering on Google anymore (my apologies if your site does offer this kind of thing).
Phew! Enough for me today, I'm going for a nap ;-)
All the Best
For sites like ours, it seems like these scraper sites are crawling us daily. At any given time, I can grab a unique sentence off our site and Google it, and several scraper sites show up in the search results for that sentence, even though it does not appear anywhere on their pages. They are using PHP files to feed Google's crawler.
In fact we added a sentence on our site 2 weeks ago asking people to email us about bridal scams that they have seen lately. Sure enough, some scum bag in the UK copied that sentence already.
We have about a 75% success rate at getting them shut down by their web host, then once they are Page Not Found, we submit them to Google's automated URL removal tool, and 3 days later it's bye bye scum bag.
Got me thinking in a whole new direction...thanks!
What you said makes a lot of sense. I particularly like the phone book analogy.
So, who's going to be the next billionaire who comes up with the latest medium? Me!
<<< So, who's going to be the next billionaire who comes up with the latest medium? Me! >>>
I hope so. Remember me in your memoirs ;-)
All the best
Addition: I see geo-tracking systems being the future right now. Linking these to local services will allow people to skip the "find me a local business" search online, which only turns up a company in India offering a service in your area. Couple the geo-tracking system with a customer review system like eBay's and you may have a real winner.
As I have said many times, there is only one way to determine which of two pages is genuine and which is a copy - that is to use the first-spidered date. This would not get it right 100% of the time (so manual intervention would be required) but it would probably have a success rate greater than 95%.
Unfortunately, Google does not seem to keep this information - I guess they're just too smart to do things the easy way.
What if old sites copy content from new sites and place it on old pages?
You know what would really help us all out? A list of IP addresses that these spambots use. It's really easy to stop these bots by blocking their IPs.
-- A list of IP addresses that these spambots use --
...and then get the hosting company's IP block range and post it on a hosting-related forum (or create a new one) about hosting companies that tolerate this kind of behavior from their clients, of course only if it comes from a hosted environment. Then have that data scraped so it gets ranked accordingly, and charge hosting companies for on-the-fly access to the most current data to cover site-related expenses.
There are a lot of crazy things that could be done to help Google organize the world's information....
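On the blocking side, a shared list like that could be turned into Apache 2.2-style deny rules mechanically. A small sketch, using documentation-reserved example addresses rather than real bot IPs:

```python
def htaccess_deny_rules(bot_ips):
    """Render .htaccess lines that block each given address or CIDR range."""
    lines = ["Order Allow,Deny", "Allow from all"]
    lines += [f"Deny from {ip}" for ip in bot_ips]
    return "\n".join(lines)

rules = htaccess_deny_rules(["203.0.113.0/24", "198.51.100.42"])
print(rules)
# Order Allow,Deny
# Allow from all
# Deny from 203.0.113.0/24
# Deny from 198.51.100.42
```

With `Order Allow,Deny`, everything is allowed except the listed addresses, which matches the "block the known bots" intent here.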
|As I have said many times, there is only one way to determine which of two pages is genuine and which is a copy - that is to use the first-spidered date. |
That wouldn't necessarily work either, as there could be technical spidering issues with the site that writes the original content, while a scraper site with PR8 links pointed at its sitemap page gets new articles indexed faster. Then Google would falsely attribute articles and content to the wrong site.
Relying on spiders to judge which content belongs to whom is faulty.
I explicitly said that using the first-spidered date would not be 100% effective. However, if supplemented by manual reviews (in response to complaints) and the complete banning of offending sites, the problem could be much reduced in a couple of years. If Google already keeps first-spidered dates (and simply isn't using this information) then the problem could be sorted much more quickly.