| 11:57 pm on Feb 17, 2010 (gmt 0)|
jomaxx - what you don't seem to realise is that Google is forcing geo-location on web sites anyway by not including (eg) UK sites in US SERPs - or at least, forcing them way down. Google has invented the LRW - Locally Restricted Web - and there's not a lot WE can do. If you're suggesting that Google block our sites because WE geo-locate, then Google could well be accused of hypocrisy.
| 12:44 am on Feb 18, 2010 (gmt 0)|
I believe it's possible to compile GeoIP as an Apache module. You'd still need some form of database. Sorry I'm poorly informed - just providing you with some hopefully helpful pointers.
| 12:56 am on Feb 18, 2010 (gmt 0)|
You'd be surprised how far "Etot osel skopiroval nashu temu" ("this jackass copied our content") will go with Russian ISPs.
| 1:10 am on Feb 18, 2010 (gmt 0)|
Sally, ignore all the naysaying - I got a lot of it too when I took matters into my own hands to stop rampant copying. Over the following months my stats improved, simply because I was no longer competing with myself once the scrapers were blocked.
However, just blocking a country isn't enough - it's only a start. Those country allocations don't change that often, so you can refresh your list every 6-12 months and be just fine. I would recommend adding Russia and Ukraine, then visiting WebmasterWorld's Spider forum to learn a lot more about how to secure your site.
However, if you really want to stop the scraping, you'll also want to use the NOARCHIVE meta tag on your pages to stop search engines from displaying cached copies of them - caches are a huge scraping source. Likewise, block the Internet Archive's bot, ia_archiver, since archive.org is also a scrape target.
Add this to all your HTML pages:
|<meta name="robots" content="noarchive"> |
Add this to your robots.txt file:
|User-agent: ia_archiver
Disallow: / |
Also consider whitelisting in robots.txt instead of blocking bots one at a time.
New bots that give you no value pop up daily, and there are thousands of them, so blocking them individually is a no-win situation.
Example of a whitelisted robots.txt:
|User-agent: Googlebot
Disallow:

User-agent: *
Disallow: / |
Be careful as the above robots.txt may need a few more bots allowed for your site.
Of course, many robots flatly ignore robots.txt altogether, so you'll want to write a similar set of whitelisting rules for .htaccess that blocks them 100% - a topic touched on many times in the Spiders and Apache forums.
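For anyone wondering what such .htaccess rules look like, here's a minimal sketch of the whitelist idea, assuming Apache 2.2-style access control with mod_setenvif. The bot names and patterns are illustrative only, not a complete or recommended list:

```apache
# Mark known-good crawlers by user-agent (names are examples only)
SetEnvIfNoCase User-Agent "Googlebot" allowed_bot
SetEnvIfNoCase User-Agent "Slurp"     allowed_bot
SetEnvIfNoCase User-Agent "msnbot"    allowed_bot

# Mark anything else that self-identifies as a crawler
SetEnvIfNoCase User-Agent "(bot|crawl|spider)" is_bot

# With Order Deny,Allow the Allow lines win, so whitelisted bots
# get through even though they also match the generic bot pattern;
# ordinary browsers match neither and are allowed by default.
Order Deny,Allow
Deny from env=is_bot
Allow from env=allowed_bot
```

On Apache 2.4 the same idea would be expressed with `Require` directives instead of `Order`/`Deny`/`Allow`.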
To stop international translation scraping, block translation proxies like Google's translator and Yahoo's Babel Fish - topics you can get help with in the Spiders forum.
Besides, as I've previously posted here, AdSense doesn't work properly (or at all) in the translated pages or caches of various search engines. So unless you sell to foreign countries or depend on them for your income - which you can easily determine with AdSense in Google Analytics - feel free to block them.
Additionally, make sure all your domain names are listed under AdSense "Allowed Sites". Bad scrapers, or someone with malicious intent who leaves your AdSense code in copied pages (it's happened to me too many times), can really mess up your AdSense in many ways.
Lock it down, bolt it up, then enjoy life without the bad guys messing with you nearly so much.
| 7:21 am on Feb 18, 2010 (gmt 0)|
archive.org respects a noarchive meta tag, so you do not have to block the robot as well - although blocking it might save you a bit of bandwidth.
I am surprised that scrapers rank well enough to be competition.
|I am quite popular in India and Pakistan, and I don't want to be. I feel that showing my pages over there is costing me money, additional competition and lowered ranking. |
It probably means you are getting lots of links from there as well. What is the problem, unless the scrapers outrank you? A lot of them are silly enough to keep the links in the text, so making sure your internal links are absolute means you can get links from them as well.
Are you sure you actually have a problem? If it ain't broke, don't fix it.
| 7:45 am on Feb 18, 2010 (gmt 0)|
|archive.org respects a noarchive meta tag so you do not have to block the robot as well - although it might save you a bit of bandwidth. |
In my experience - it did NOT honor the tag, nor the robots.txt, and I had to contact them directly to get out of archive.org.
IMO they are not as benign as they appear.
|I am surprised that scrapers rank well enough to be competition. |
Google likes NEW content, and black-hat scraping sites play by different rules.
So new scraped snippets mixed into other scraped snippets become "new" content, and some of the other tricks they play let them overrun your long-tail keywords for a period of time.
I've been battling this nonsense for years, I beat them down, (mostly) won the battle, so I kind of know what I'm talking about in this arena.
| 11:57 am on Feb 18, 2010 (gmt 0)|
A few years back I tested a script for generating IP block lists by country for use with iptables, which seemed to do the job pretty well. I forget the name, but the following search terms should get you on your way: 'country block list IP'...
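In that spirit, the core of such a script is only a few lines. Here's a Python sketch that turns a downloaded per-country zone file (one CIDR network per line, as several free services publish them) into iptables DROP rules - the file format and the COUNTRYBLOCK chain name are my own assumptions:

```python
def iptables_rules(cidrs, chain="COUNTRYBLOCK"):
    """Return one iptables command string per CIDR network,
    skipping blank lines and '#' comments in the zone file."""
    rules = []
    for line in cidrs:
        cidr = line.strip()
        if not cidr or cidr.startswith("#"):
            continue  # blank line or comment
        rules.append(f"iptables -A {chain} -s {cidr} -j DROP")
    return rules

if __name__ == "__main__":
    # Example zone-file contents (documentation-range addresses)
    sample = ["# example zone file", "203.0.113.0/24", "198.51.100.0/24"]
    for rule in iptables_rules(sample):
        print(rule)
```

You'd pipe the output into a shell, or better, feed it to `iptables-restore` after flushing the chain, so the monthly refresh is atomic.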
| 12:01 pm on Feb 18, 2010 (gmt 0)|
to block archive.org just disallow it in robots.txt and it should drop all your pages... Takes a little while though.
| 2:03 pm on Feb 18, 2010 (gmt 0)|
|to block archive.org just disallow it in robots.txt and it should drop all your pages... Takes a little while though. |
Not quite: it might stop making archived versions of your pages publicly available, but it is still archiving your pages - which means it is still wasting your server resources.
| 2:22 pm on Feb 18, 2010 (gmt 0)|
I just thought it would be more interesting to do this at the DNS level, not the web server level - make the entire domain appear not to exist, depending on geolocation.
| 6:18 pm on Feb 18, 2010 (gmt 0)|
Blocking at the DNS level sounds like an interesting solution, but it is only possible if you run your own primary DNS server for your domain name. I can't think of any third-party DNS provider, ISP or hosting company that would implement this on their main servers.
And if you have to host your own primary DNS server anyway, you can just as easily add the IP lists of the blocked countries to the firewall. That reduces traffic even further, as DNS queries from blocked sources won't even reach the DNS server.
| 11:08 am on Feb 19, 2010 (gmt 0)|
You could check the Accept-Language header, and if you don't like the primary language you could redirect them to one of your link exchange partners with similar subject matter.
I'm not sure off the top of my head what would happen with the referrer, or what would happen if a search engine requested the page with an unwanted language header.
The visitor would probably be clueless about what was going on though, at least for awhile.
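For what it's worth, pulling the primary language out of that header only takes a few lines. A Python sketch (simplified parsing - it takes the first listed tag and ignores q-values, which is enough for the idea above but not a full Accept-Language parser):

```python
def primary_language(accept_language):
    """Return the first-listed language from an Accept-Language
    header as a bare lowercase code ("en-US,en;q=0.9" -> "en"),
    or None if the header is missing or empty."""
    if not accept_language:
        return None
    first = accept_language.split(",")[0]      # highest-listed entry
    tag = first.split(";")[0].strip().lower()  # drop any ;q=... part
    return tag.split("-")[0]                   # "en-us" -> "en"
```

As later posts in this thread argue, what you then *do* with that value is the questionable part - the parsing itself is trivial.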
| 3:25 pm on Feb 19, 2010 (gmt 0)|
Does that really sound like a good idea? Banning even Americans whose first language isn't English? Anyone this obsessed with controlling who may see their website should just open a drive-through window and dispense their information that way.
| 3:46 pm on Feb 19, 2010 (gmt 0)|
|Anyone this obsessed with controlling who may see their website should just open a drive-through window and dispense their information that way. |
It's not an obsession thing, it's a survival thing.
I think you missed the part about traffic and sales actually going up, not down.
Also, those Americans no longer in America didn't amount to any sales.
None, nada. I can't worry about the few in those countries where all the trouble originates - block 'em and be done with it.
The countries I blocked used to mirror the site, or scrape it so fast the server would stop serving other customers, etc.
I don't run a site for kiddies with scripts or some criminal underground to use my resources, I run it for the visitors.
If you don't think there's a war going on, just ask Brett right here at WebmasterWorld because this very forum blocks everything except a few search engines.
Read this site's robots.txt file for more information.
Like I said, it's not an obsession, it's about survival.
Smaller site operators probably won't notice but larger busy site operators definitely know what I'm talking about.
| 3:51 pm on Feb 19, 2010 (gmt 0)|
You're wasting your time, Bill.
| 6:05 pm on Feb 19, 2010 (gmt 0)|
With such respected and knowledgeable experts in disagreement, what is a poor, hapless non-programmer webmaster to do?
I guess I can only hope for Google to someday empower us with better options and controls. Will it ever happen? Only if they perceive that it will make them more money. So probably not.
| 6:16 pm on Feb 19, 2010 (gmt 0)|
Get all the IP from the country and block them.
| 6:43 pm on Feb 19, 2010 (gmt 0)|
jomaxx, I am going to give you a quick practical example of why.
I have a store, 400-500 products - items of FASHION. I write ALL the product descriptions myself, while I hold the item in my hand - 2 sentences, maybe 25 words, pages properly SEO'd. I allow ARIN (80%), partly RIPE (20%) and partly APNIC (10%). On average, since the site is popular, a quarter of the traffic gets a 403.
One of my main competitors: wide OPEN, over 1000 unique items on average. In order to compete with me for the long-tail phrases, they have to write 75-100 word descriptions (boring) on average and spend $$$ to promote the product via some type of ad system. Their content gets scraped and republished in no time. I know of several sites that republish their content and slap ads on it - straight MFA types. Those MFA sites get scraped by others, and so on. Since they don't rank well for the long tail, their inventory is stale.
I know what they buy their widgets for, and the pricing on my site is 15% higher just due to that.
Result: it takes me 15 minutes to write a description, 3-4 days to get the page ranked, another 5 to sell the item. Fast turnaround = better pricing from manufacturers = more $$$ in the bank, period.
0 dollars spent on ads. $$$$$ made from ads from leading manufacturers advertising on my site, a healthy profit on widgets, and TONS of 403s.
| 7:23 pm on Feb 19, 2010 (gmt 0)|
I have no idea how that story demonstrates anything, and I note that everybody responding to my last post is ignoring the fact that I was responding to the nonsensical suggestion of banning users based on their primary language settings. If anybody wants to defend that idea, speak up.
| 7:47 pm on Feb 19, 2010 (gmt 0)|
Just as KenB alluded to, if you block based on IP addresses, you need to update the list periodically. A better method might be to block using data/services from a major geolocation data provider, such as Quova.
Quova is likely best-in-class, and would allow you to block beyond mere IP ranges -- they also use some network tracerouting methodology and have partnered with Wi-Fi and mobile phone providers to be able to pinpoint users' geolocations in more cases than through a very basic block of IP ranges.
And, jomaxx is right - banning based on language settings might be attractive due to its ease of implementation, but it's hardly a good idea. Not only would there be potentially a great many high-quality clicks blocked from America (do you know how many programmers, doctors, and other professionals there are in America who originated from India or Pakistan?), but there are a lot of people in other countries who haven't changed the language preference settings for their browsers, so they'd appear English-speaking. So, you'd end up blocking people you'd want, and missing people you want to block.
| 9:00 pm on Feb 19, 2010 (gmt 0)|
|A better method might be to block using data/services from a major geolocation data provider, such as Quova. |
What happens when their service is down or the route between your server and theirs fails?
Plus you have to connect to their service for every page anyway, adding built-in lag to your entire site.
Not to mention that solutions like Quova are kind of pricey and would cost me about $200/day or more to use it!
That's why I recommend DIY: hosted services have some serious side effects that maintaining your own doesn't, and the minor amount of potential error in the data certainly isn't worth $6K/month!
[FYI, I tested them for giggles and they got my city wrong ;)]
|banning based on language settings might be attractive due to its ease of implementation, but it's hardly a good idea. |
Agreed, language blocking is a bad idea, use location blocking.
| 10:47 pm on Feb 19, 2010 (gmt 0)|
All of the IPv4 ranges are available for FREE, last time I checked. I simply reload the data into a local DB on a monthly basis with one click and a little home-made 15-line script voodoo.
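A monthly reload along those lines really is only a dozen-odd lines. A Python/SQLite sketch for illustration - the CSV column layout (start integer, end integer, country code) and table name are assumptions; adjust them to whichever free feed you actually use:

```python
import csv
import sqlite3

def reload_ranges(csv_path, db_path="geoip.db"):
    """Replace the local ip_country table with the contents of a
    country-range CSV whose rows are: start_int, end_int, country_code.
    Dropping and recreating the table makes the monthly reload simple."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS ip_country")
    con.execute(
        "CREATE TABLE ip_country (start_ip INTEGER, end_ip INTEGER, cc TEXT)"
    )
    with open(csv_path, newline="") as f:
        rows = ((int(r[0]), int(r[1]), r[2]) for r in csv.reader(f))
        con.executemany("INSERT INTO ip_country VALUES (?, ?, ?)", rows)
    # Index on the range start so lookups can binary-search efficiently
    con.execute("CREATE INDEX idx_start ON ip_country (start_ip)")
    con.commit()
    con.close()
```

Run it from cron once a month after downloading the fresh file and the "one click" goes away entirely.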
| 10:51 pm on Feb 19, 2010 (gmt 0)|
I agree with incrediBill that a DIY solution would probably be the best way to go. I use a stripped-down GeoIP CSV file (integer IPs and country codes only) that has all US IPs removed to make it as small as possible. I use geolocation to target ads, etc. I use a really simple PHP function to parse the file and return the country code. All in all, it only takes a few milliseconds for the file to be opened and parsed.
You can see a thread discussion of my implementation at: [webmasterworld.com...]
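The lookup side of that kind of stripped integer-IP file is also small. A Python sketch (the original post describes a PHP function; here parallel sorted lists stand in for the parsed CSV, and a binary search keeps the lookup fast):

```python
import bisect
import socket
import struct

def ip_to_int(ip):
    """Dotted-quad IPv4 address -> 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def country_for(ip, starts, ends, codes):
    """Find the range containing `ip` in three parallel lists
    (range starts, range ends, country codes) sorted by start.
    Returns the country code, or None if no range matches."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(starts, n) - 1  # last range starting <= n
    if i >= 0 and n <= ends[i]:
        return codes[i]
    return None
```

With the US ranges stripped out of the file, as described above, a `None` result simply means "treat as US / unblocked".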
| 6:30 am on Feb 20, 2010 (gmt 0)|
I agree with IncrediBill.
As for the mentions of .htaccess etc. and manually blocking IPs, this is a hiding to nothing. It will be a maintenance nightmare and full of holes even if you try to keep it updated.
The only solution is to use a geo-IP database. You don't need to pay for it (although some are cheap), and you don't need to connect to a third-party server every time; it's quite easy to cache their databases locally. The above naysaying is totally wrong. I use one myself which is quite good and reliable. You can cron a script to pull in their monthly free update and store it locally, so you have an automated update system which can be left to its own devices once set up. It is 99% accurate, very fast, and will let you do what you will with country info for any given visitor.
In your position I'd be setting all no-archive options, including blocking translation proxies etc., then cloaking the site in the suspect countries. I wouldn't return a 403, as that's an "in your face" challenge to the bad guys. I'd replace it with rubbish content to mislead them. And obviously your ads will only be embedded where you wish.
[edited by: tedster at 7:42 am (utc) on Feb 20, 2010]
| 1:27 pm on Feb 20, 2010 (gmt 0)|
Why is language filtering a bad idea ?
| 2:54 pm on Feb 20, 2010 (gmt 0)|
|Why is language filtering a bad idea ? |
Because it has very little basis in where the user is. A Chinese speaker may live in the U.S. and speak English but prefer to read Chinese when available. Someone in India would in all probability still have their default language set to English. As such, if you wanted to block Indian users, blocking by language would not work.
| 4:01 pm on Feb 20, 2010 (gmt 0)|
I have had a huge IPtables block list of thousands of IP addresses for years now including a whole stack of /8s and I add more nearly every day. Result: very little spam, very few access attempts on my server, stats that actually mean something, unique content that stays unique and isn't spread all over the spammiest sites in the world, a low level of fraud attempts and sites that still load quickly for those countries I want to show them to. My Google, Yahoo and MS rankings haven't suffered in the slightest as far as I can see and the rest don't really matter anyway. The only thing you have to watch is to make sure you don't block welcome bots like those from search engines but lists of their IP addresses are easily available on the web anyhow.
| 9:13 am on Feb 22, 2010 (gmt 0)|
|In my experience - it did NOT honor the tag, nor the robots.txt, and I had to contact them directly to get out of archive.org. |
No new pages on my site have been indexed by them since I added the meta tag about two years ago.
|So new scraping snippets mixed into other scraping snippets becomes new content and some of the other tricks played manage to make them overrun your long tail keywords for a period of time. |
Is this a problem she actually faces, though? My biggest problems are search sites that get indexed by Google (similar result to the black hats: a page of snippets from other people's sites) and bloggers manually copying and pasting stuff (I do not care too much as long as they link back).
I have found one Russian site that has a complete copy of my site, though. Not sure what to do about that, given that the contacts are all in Russian. I could send a DMCA complaint to Google, but they do not rank anyway.
| 10:10 am on Feb 22, 2010 (gmt 0)|
Country blocking is always problematic.
There's not just the technical issue of getting the right IP ranges blocked, but the human issues.
Some of your viewers will be Western citizens on holiday or doing business in blocked countries. Do you want to block those people too? One way around that is to exempt visitors who already have a cookie from the site (dropped on their machine while they were at home) and allow them access even when they visit from a 'blocked' country.
Even when you block direct access to the site, content can still be copied simply by viewing the search engine cache page, the Wayback Machine, or other such places.
Serve 'blocked' users alternative pages of content, and they may never know they were blocked.
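The cookie exemption described above can be sketched in a few lines - here in Python, with a hypothetical cookie name and placeholder country codes standing in for whatever your setup actually uses:

```python
def should_block(country_code, cookies, blocked=frozenset({"XX", "YY"})):
    """Block by country of origin, but let through anyone carrying a
    returning-visitor cookie set while they browsed from an allowed
    country. The 'known_visitor' cookie name and the 'XX'/'YY' codes
    are placeholders for illustration."""
    if cookies.get("known_visitor") == "1":
        return False  # regular visitor currently travelling abroad
    return country_code in blocked
```

The obvious caveat: anyone can forge the cookie once they know its name, so this exempts honest travellers rather than determined scrapers.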
| 7:08 pm on Feb 23, 2010 (gmt 0)|
There is a service out there that does this at the DNS level.
We were using the .htaccess file to block countries, but when IPs are changed or reassigned, it is very difficult to keep up.
We now use the DNS service and can redirect traffic from a particular country or group of countries to an entirely different server.
[edited by: tedster at 8:30 pm (utc) on Feb 23, 2010]
| 6:34 am on Feb 24, 2010 (gmt 0)|
This discussion did prompt me to do some checking for how much scraping there was, and I sent a DMCA notice to the US based host of a blog that copied and pasted stuff with no acknowledgement, and that had no contact details.
Most copies of my text on sites that rank at all are on Yahoo Answers. Maybe I should start a thread to discuss that.