I run a network of websites that are copied wholesale so frequently (15 to 20 times a month or so), mostly by foreign entities, that writing DMCA letters has become a part-time job, and a mostly pointless one: 7 out of 10 site owners never respond, and the ones that do often simply republish the site under a different domain a week later. The offending sites merely replace my logo with one of their own and change the AdSense code (though they rarely change the affiliate links).
Is there a prophylactic solution to this that I'm missing, or are scrapers unavoidable for folks with quality sites and original content?
At any rate, it's become increasingly frustrating, and I'd just like to find a magic bullet for future websites. I wonder how larger, commercial sites avoid duplication by scrapers?
In addition, some Webmasters may block entire class A networks because their site will not benefit from allowing access from those IP address ranges. For example, some sites may not benefit from out-of-country traffic at all, but might suffer site-scraping from other countries. I'm neither recommending nor condemning such a wide-reaching access restriction, just mentioning it; the choice is up to the individual Webmaster.
Many of these solutions have been discussed in the technical/scripting forums here. There are also many Web sites about IP address and open-proxy blacklisting. These methods can help to reduce the number of successful scrapes, and thereby reduce your legal costs, time spent on DMCA filings, worry, etc. None are foolproof, but they can discourage those who might take your site for easy prey.
Have any of your urls noticeably suffered in their rankings due to duplicate content issues, or have you managed to avoid that because you have more pagerank on your urls?
Also, I wonder how you first noticed this was happening. And, for that matter, I wonder how they began to notice your urls? Was it because the sites were prominent for certain keyword rankings?
I have a new site I am working on that will be loaded with hundreds of pages of in-depth, original content. Because the subject matter has broad appeal, I am somewhat hesitant to optimize it so that it surfaces in the serps, because of this very issue. I've thought about taking EFV's advice and getting a copyright. I've also considered using the *opyscape service.
The thought of someone stealing the content makes me ill. The thought of someone stealing it and causing ME duplicate content problems makes me want to hurl.
How do you block whole countries? I had a list of ip ranges once and it didn't seem all that easy to do. Many countries seemed to be assigned portions of quite a few ip ranges and due to that it seemed very difficult to sort out. Ideally, to be honest, I would block everyone but north america and europe.
There are paid services which provide updated/maintained IP address ranges by country.
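Once you have a list of ranges, the check itself is the easy part. Here is a minimal Python sketch, assuming the ranges arrive as CIDR blocks. The two networks shown are reserved documentation ranges standing in for real per-country data, and `ALLOWED_NETWORKS` and `is_allowed` are names made up for illustration.

```python
import ipaddress

# Placeholder allowlist: reserved documentation ranges standing in for the
# per-country CIDR blocks that a maintained list would supply.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(ip_text, networks=ALLOWED_NETWORKS):
    """Return True if the address falls inside any allowed network."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # unparseable address: treat as not allowed
    return any(address in network for network in networks)
```

The hard part remains what was said above: a country is usually spread over many separate ranges, so the list gets long and has to be kept current.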
I didn't mention 'easy' or 'cheap' above, because these are both subjective terms. But the question was asked, "how can we prevent scraping?"
The next step is to make a page on your site that will add the IP address of ANYONE WHO VISITS IT to the ban list, instantly blocking them from your entire website.
Then you update your robots.txt to disallow access to this page, so no LEGITIMATE spiders will ever visit.
Then the final step is to make a link to this page (the one that bans any visitors) either in a table with style set to display:none or some other hidden method that no one will ever click on but a spider will see the link and follow it.
If the spider is obeying robots.txt it will know not to follow that link - but if the spider ignores robots.txt, as all of the "scrapers" do, it will spider this link and immediately ban itself from your site.
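To make the steps concrete, here is a rough Python sketch of the trap logic. It is only an illustration under assumptions: `/trap.html` is a made-up path, the ban list is an in-memory set rather than whatever your server really uses (.htaccess, a database, a firewall rule), and `handle_request` is a hypothetical stand-in for your request handler.

```python
TRAP_PATH = "/trap.html"  # hypothetical URL of the page that bans visitors

# robots.txt entry telling well-behaved spiders to stay away from the trap.
ROBOTS_TXT = "User-agent: *\nDisallow: " + TRAP_PATH + "\n"

# Hidden link to place on a normal page; human visitors never see it.
HIDDEN_LINK = '<div style="display:none"><a href="' + TRAP_PATH + '">.</a></div>'

banned_ips = set()  # in-memory stand-in for a persistent ban list


def handle_request(ip, path):
    """Return the HTTP status code for a request from ip for path."""
    if ip in banned_ips:
        return 403  # already banned: blocked from the entire site
    if path == TRAP_PATH:
        banned_ips.add(ip)  # visiting the trap bans the visitor instantly
        return 403
    return 200
```

A spider that reads `ROBOTS_TXT` and obeys it never requests `TRAP_PATH`, so it never lands on the ban list.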
Sometimes there are no "good" ways of doing something, just any way.
As for the dup content penalties ... there is absolutely nothing more infuriating to me than when scrapers are capable of not only duping an entire site of mine, but SEOing it to death, and ultimately getting my site delisted while the scraper's site achieves decent PR, which is, unfortunately, an all too common experience because I'm more interested in drawing traffic through content than SEO.
I will reiterate, however, that the webmasters who copy my sites are not exclusive to China or South America - I have actually tracked one or two thieves to members of webmasterworld, an even more infuriating experience.
So I wrote him back asking what kind of sites he wanted and some other specs. Told him I wasn't about to give out my urls until I knew more about who he was and what he wanted.
In retrospect, a good idea because I safely assume he was just fishing for profitable AS ideas and content to steal.
If someone you don't know stickies you here about wanting to buy your sites, he just might be fishing for ideas and content to steal. After you've talked about daily income, site specs, and details, he has all he needs to know to beat you at your own game.
"Have any of your urls noticeably suffered in their rankings due to duplicate content issues, or have you managed to avoid that because you have more pagerank on your urls?"
I have the same question. I have a couple of informational sites that have been scraped to death -- normally I would not care, but some of the "scrapers" have MANY, MANY, MANY links to me from their site and my rankings in Google have dropped "significantly" (no other meaningful changes have been made to the site to cause this).
Does anybody know if Google could be penalizing a site because a "scraper" has MULTIPLE, MULTIPLE, MULTIPLE links to a site it scraped?
This is similar to the situation --often called a 'duplicate content penalty' for the sake of simplicity-- where you have one page on your site that resolves under two or more URLs; it competes with itself for incoming links and PageRank, and competes with itself for eyeballs in the SERPs if both copies are listed. The 'penalty' is the loss of content-uniqueness, and your loss of control over which URL will be used/shown/linked-to. I've never tried massively-duplicating a page to see if it actually trips a filter or garners a real penalty.
However, in the case at hand, this is scrapers competing with you using your content. If their page gets higher-PR incoming links, your page may well disappear.
An ounce of prevention is worth...
an occasional good laugh
Here are two typical sequences of events on a site that takes a few steps to head off problems from scrapers, harvesters, and/or stealth 'bots:
128.***.124.24 - - [18/May/2005:07:06:24 -0800] "GET / HTTP/1.1" 403 867 "-" "-"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET / HTTP/1.1" 403 867 "-" "http://www.***.com/ <more UA text>; Mozilla/4.0 compatible crawler"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "HEAD / HTTP/1.1" 200 0 "-" "-"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET /favicon.ico HTTP/1.1" 403 867 "-" "http://www.***.com/ <more UA text>; Mozilla/4.0 compatible crawler"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET /403info.html HTTP/1.1" 200 8043 "-" "-"
The bot came in with no user-agent and no referrer, and was denied access.
It then came back with a user-agent, but an unknown user agent, and so was denied.
Next, it tried a HEAD request, all of which we allow except in the most egregious cases.
Next, it tried fetching our Favicon, and was denied (unknown user-agent).
Finally, it fetched the 403 info page, which explains our site access policies, in case a human might actually read it.
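The decisions in that first sequence can be written down as a small rule function. This Python sketch is a guess at the policy from the log alone, not the actual server configuration: `KNOWN_AGENT_MARKERS` is an invented allowlist, and the exception for the most egregious HEAD requests is left out.

```python
# Invented allowlist of substrings that mark user agents the site accepts.
KNOWN_AGENT_MARKERS = ("Gecko", "MSIE", "Googlebot")

# The 403 info page stays readable so a blocked human can see the policy.
ALWAYS_ALLOWED_PATHS = ("/403info.html",)


def access_status(method, path, user_agent):
    """Return 200 or 403 for a request, mirroring the log sequence above."""
    if path in ALWAYS_ALLOWED_PATHS:
        return 200
    if method == "HEAD":
        return 200  # HEAD requests are allowed (egregious cases not modeled)
    if not user_agent or user_agent == "-":
        return 403  # blank user-agent is denied
    if not any(marker in user_agent for marker in KNOWN_AGENT_MARKERS):
        return 403  # unknown user-agent is denied
    return 200
```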
64.**.136.196 - - [18/May/2005:08:49:50 -0800] "GET / HTTP/1.0" 200 21356 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0500] "GET /accessv/mail.cgi?id=user HTTP/1.0" 200 162 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0800] "GET /accessv/user_auth.html HTTP/1.0" 200 162 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0800] "GET /accessv.html HTTP/1.0" 403 670 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
This poor harvester came in, grabbed the home page, analyzed it, and then came back and immediately tried to request three 'poison' files in a row! On the last of three simultaneous requests, it discovered it was already banned by IP address, and crawled away to die.
So much pain and sorrow could have been avoided if they had only fetched and obeyed robots.txt... Requiescat in pace.
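If you would rather find these visitors after the fact than ban them live, the same information can be pulled out of the access log. A hedged Python sketch follows. It assumes the three `/accessv` paths in the second sequence are the 'poison' files and that the log is in the format shown above; `offenders` is a made-up helper name.

```python
import re

# Assumed from the second log sequence: the three 'poison' URLs.
POISON_PATHS = {"/accessv/mail.cgi", "/accessv/user_auth.html", "/accessv.html"}

# Captures the client address and the requested URL from one log line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*"')


def offenders(lines):
    """Return the set of client addresses that requested a poison path."""
    hits = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, target = match.groups()
        path = target.split("?", 1)[0]  # ignore any query string
        if path in POISON_PATHS:
            hits.add(ip)
    return hits
```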
Modified "bad-bot" script blocks site downloads [webmasterworld.com]
Blocking Badly Behaved Bots [webmasterworld.com]
A Close to perfect .htaccess ban list [webmasterworld.com], the WebmasterWorld classic in four(?) parts
>>But you lose your implicit 'ownership' of the contents
In our case it is more than that. They ‘hot link’ to our software after they scrape download.com. This costs us bandwidth and we have to constantly perform maintenance to keep them out, and time is money. So it is costing us X 2. I have even seen humans from these sites come in after a block, take the URL to the software, and then fix their web pages.
After banning referrals from scrapers-r-us.com from getting the downloads, I have seen them then hot link directly to download.com. I told their ISP about it, and the ISP said that if we file a complaint about copyright, and if we were wrong, then we would have to pay attorney’s fees. So the ISPs, apparently Google, and the scrapers themselves are the only people that support these unethical activities. I don’t have a clue why the ISPs and G don’t do something about it. In the long term, they are indeed hurting themselves due to their lack of respect for the original content providers.
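For anyone who wants to try the referrer ban described here, this is roughly what the check looks like in Python. It is a sketch, not our actual setup: `BLOCKED_REFERRER_HOSTS` and `allow_download` are illustrative names, and scrapers-r-us.com is the placeholder name used above.

```python
from urllib.parse import urlparse

# Placeholder host taken from the post; add real offenders as you find them.
BLOCKED_REFERRER_HOSTS = {"scrapers-r-us.com"}


def allow_download(referrer):
    """Return False if the Referer header points at a blocked site."""
    host = (urlparse(referrer).hostname or "").lower()
    for blocked in BLOCKED_REFERRER_HOSTS:
        # Match the bare domain and any subdomain such as www.
        if host == blocked or host.endswith("." + blocked):
            return False
    return True
```

A blank or missing referrer passes this check, so it only stops the simple case. As noted, a determined scraper can just link somewhere else.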
I'd love to get input on this. I know of a few sites that occupy their positions high in the serps and this has little to do with their on-page content or page rank. Essentially, they do have good onpage content AND high pagerank----but to place high in a crowded field of pages (millions of other pages to compete with), they rely mainly on the strength of incoming links that have good anchor text.
So, even if a copier copies their sites word for word, it won't matter. That site won't make it onto page one of the serps because the copier won't have the thousands of incoming links from other sites with good anchor text. If a copier copies the site and has good pagerank on the copied page, it won't matter for the same reason.
These sites are where they are due to 1. pagerank, 2. content (good keyword relevance and density) and 3. lots and lots and lots of anchor-text-keyword-relevant links from lots of different websites on different IPs.
Seems to me, the best way to defeat scrapers and copiers is 1. get as much pagerank as you can get (and never stop accumulating), 2. get as many inbound links as you can (and never stop accumulating), 3. optimize the site well, and 4. never stop adding content.
>>I told their ISP about it, and the ISP said that if we file a complaint about copyright, and if we were wrong, then we would have to pay attorney’s fees.
Actually, if that is what they told you, then they are not completely correct. The US is not Britain.
In copyright cases, awarding attorney's fees is not automatic. In fact there are some fairly famous cases where they were not awarded.
Fogerty v. Fantasy, Inc. is one that comes to mind. <added> I believe he eventually won the fees, but they were never ruled to be automatic</added>
The key word in section 505 is "may".
I have heard of these spider traps before and have considered implementing one but have a question -- I have read in the Google forums here that some have reported that Google has been actually following links despite being disallowed in robots.txt but then not indexing them. I suppose the same concern could run to Google's mediabot as this might do the same.
In these cases are you whitelisting known Google IP ranges or has this just not been a problem thus far?
I think the issue was that Google stored a copy of the robots.txt file when it spidered and didn't fetch a fresh one before every crawl. So if Google spiders you today (without this "trap" in place), they have your current robots.txt on file. If you change your robots.txt, the next time they spider you they're going off the one they have stored on their end, not your most current one. So Google would fall into your trap and get banned.
The solution was to set up robots.txt with the Disallow in there but NOT to make the page do the automatic banning right away. After you've been spidered by all of the major search engines with your new robots.txt in place, THEN you add the auto-ban code to the disallowed page.
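The two-stage rollout can be guarded in code so the ban never switches on early. A minimal Python sketch, with assumptions: `/trap.html` is a made-up path, and the 30-day `GRACE_PERIOD` is an arbitrary guess at how long a full recrawl takes, not a figure from this thread. Python's standard `urllib.robotparser` confirms that the trap really is disallowed.

```python
from datetime import date, timedelta
from urllib.robotparser import RobotFileParser

TRAP_PATH = "/trap.html"            # hypothetical trap URL
GRACE_PERIOD = timedelta(days=30)   # assumed; long enough for a full recrawl


def trap_is_disallowed(robots_text, user_agent="*"):
    """Return True if robots.txt forbids user_agent from fetching the trap."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(user_agent, TRAP_PATH)


def banning_enabled(robots_text, robots_published, today):
    """Enable auto-banning only after the new robots.txt has had time to spread."""
    waited_long_enough = today - robots_published >= GRACE_PERIOD
    return trap_is_disallowed(robots_text) and waited_long_enough
```

For example, `banning_enabled(text, date(2005, 5, 1), date.today())` stays False until the Disallow line is present and the grace period has passed.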