| This 46 message thread spans 2 pages |
|Stopping scrapers from the get-go|
I'm putting a *huge* number of pages of content online. I'm looking to stop the scraping/copying/bots from the outset and I need bandwidth kept to a minimum. I've never done this before, so I'm not quite sure where to start.
Most of the content is on static html pages. My prelim reading suggests that may be problematic (since I'm not putting out the pages programmatically).
Can anyone suggest details as to what I should be doing? Here are the areas I have in mind:
1) in htaccess, block a list of IPs from spamhaus
2) in htaccess, block a large list of IPs from other countries?
3) in htaccess, block a lot of user agents (get the code from WebmasterWorld)?
4) Whitelist Google, Yahoo and MSN in robots.txt
5) block google and the other bots from crawling my images. I think this will block all robots from crawling GIFs at any level of my site?
6) Then I think I'd like to block IPs from hosting companies. Is there an easy to use list of those IPs?
7) after that I should do some IP blocking dynamically I think. Like trigger a block if someone is crawling too many pages too fast. But since I'm serving static html, how do I do that? Set up a cron job to run a script every minute that reads the log and takes action? This seems complex and burdensome.
8) Since the content is static, Google and the rest don't need to download the html 8 times a month. Once a year is fine. What's the best way to tell the bots that a page hasn't changed, thus no need to crawl? etags? I think that stuff requires I change the page headers, and that's tough to do with static html pages.
Anything else I missed?
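For points 4 and 5, a robots.txt whitelist can be checked offline before it goes live. The following is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content is hypothetical: it assumes images live under an `/images/` directory (not stated in the post) and uses the crawler tokens Googlebot, Slurp (Yahoo) and msnbot. Keep in mind that robots.txt is only advisory, so this affects well-behaved bots and does nothing against scrapers that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: three named crawlers may fetch pages but not
# anything under /images/ (an assumed directory); every other robot is
# asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
Disallow: /images/

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text so the rules can be checked offline."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Example checks: a whitelisted crawler on a page, on an image,
# and an unlisted robot on the same page.
print(parser.can_fetch("Googlebot", "/article.html"))
print(parser.can_fetch("Googlebot", "/images/logo.gif"))
print(parser.can_fetch("SomeScraper", "/article.html"))
```

Python's parser does plain prefix matching, so this sketch blocks a directory rather than every GIF "at any level"; wildcard patterns are an extension that only some crawlers honor.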
>> rewrite *.html to a php file
OP says this is not an option.
|I ended up with static HTML just because of the way the content was generated. It's way too much work/effort/money to put this into a cms right now. |
John, I think what he's saying is you can force html to be executed as a php script (instead of viewed as a static file). That way I can inject php code into the top of the html files, and it'll work just fine.
I suppose I could also just rename everything from .html to .php.
|>> rewrite *.html to a php file |
OP says this is not an option.
also not required, you can prepend a php file to an html file anyway
Known scrapers are easy, but if it's competitors using commercial software or a professional scraper, then best of luck with htaccess. Test with an HTML parser to see how scrapable you are.
Scraping software is set up to take a bit at a time to avoid detection. So if you are re-publishing static HTML with headers etc., why not mess with your templates to break the parser, or mix in some partial feeds with your affiliate IDs in them? Make it high effort/low value to scrape, or, if it's high-value content, make it work to your advantage to be syndicated.
Not quite clear on how I would implement that stuff, aspdaddy. I'm new to this (and I don't have any affiliate stuff at all).
I did remove the Crawl-delay: 5 line from robots.txt. After a week Google only had my home page and about page indexed. Granted, part of that's the fact that I have no backlinks, but I'd have thought I'd have more pages indexed.
In any event, I've got about 3-4K pages uploaded. I get another 10K or so uploaded and I'll start going after some backlinks. Then I should be able to publish with impunity.
I run a cron job every 5 min to check the latest log file entries. I export the results to a perl script that then does a lookup on the IP of any infringers. At that point, if the infringer is not in my allowed list of bots, they get banned through APF for 4 hours.
This seems to keep the aggressive ones at bay. They get a handful of pages before my script snags them. If they come back 4 hours later they just get a handful more.
Every 4 hours I clear the IP blocks. Less risk of blocking good traffic down the road when an IP gets reassigned. It happens. I've seen it.
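The log-scanning half of that cron job can be sketched roughly as follows. This is an illustration in Python, not the poster's perl script. It assumes Apache's common/combined log format, and the names `find_offenders`, `max_hits` and `allowed` are hypothetical. The threshold of 60 hits per 5 minutes is an arbitrary placeholder, and the ban itself (APF in the post) is left out: the function only returns the IPs that would be handed to the firewall.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Matches the start of an Apache common/combined log line:
# client IP, two ignored fields, then the bracketed timestamp.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def find_offenders(lines, now, window=timedelta(minutes=5),
                   max_hits=60, allowed=frozenset()):
    """Return sorted IPs with more than max_hits requests in the window.

    `now` must be a timezone-aware datetime; `allowed` holds IPs that
    are never reported (for example, verified search engine crawlers).
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, TIME_FORMAT)
        age = now - when
        if timedelta(0) <= age <= window and ip not in allowed:
            hits[ip] += 1
    return sorted(ip for ip, count in hits.items() if count > max_hits)
```

Run from cron, the returned list would be passed to whatever does the blocking, with a separate job clearing the blocks every few hours as described above.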
>> John, I think what he's saying is you can force
>> html to be executed as a php script (instead of
>> viewed as a static file). That way I can inject
>> php code into the top of the html files, and
>> it'll work just fine.
>> I suppose I could also just rename everything from .html to .php.
Perhaps an academic point if you've already launched content and begun linkbuilding, but there is a tried-and-true method of creating a mini-CMS on the fly to serve static content and enforce access restrictions.
In a nutshell:
1- store the content outside the docroot, e.g. /home/httpd/static/
2- write a php script called "widgets" (no extension)
3- publish content URLs like example.com/widgets/1, example.com/widgets/2, etc
The "widgets" script enforces access rules, and if it determines the client should see the requested file, imports /home/httpd/static/1 or /home/httpd/static/2 or whatever is requested.
Needless to say, you can have full-text filenames, even paths too.
The point is that the static content remains static, but you get the full benefit of intelligent access control without modifying every static file, without importing them into a database, etc. The server config is minimal. And it's a trivial task to have the "widgets" script parse Apache's REQUEST_URI to figure out what static file had been requested.
Alternatively, put all the logic into the 404 handler.
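The REQUEST_URI lookup described for the "widgets" script might look something like the sketch below. It is written in Python purely for illustration (the post proposes PHP). The function name `resolve_static` is hypothetical, the directory is the one given in the post, and the access rules themselves are left out. The part worth showing is the check that a request cannot climb out of the content directory with `../`.

```python
from pathlib import Path
from typing import Optional

# Assumed locations: the directory comes from the post above; the URL
# prefix matches the example.com/widgets/1 style of address.
STATIC_ROOT = Path("/home/httpd/static")
URL_PREFIX = "/widgets/"


def resolve_static(request_uri: str,
                   root: Path = STATIC_ROOT) -> Optional[Path]:
    """Map a request URI to a file path under root, or None if invalid."""
    path = request_uri.split("?", 1)[0]  # drop any query string
    if not path.startswith(URL_PREFIX):
        return None
    relative = path[len(URL_PREFIX):]
    if not relative:
        return None
    candidate = (root / relative).resolve()
    # Refuse anything that escapes the content directory, e.g. "../".
    if root.resolve() not in candidate.parents:
        return None
    return candidate
```

The caller would apply its access rules first, then read and send the returned file, answering with a 404 when the function returns None or the file does not exist.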
Search "incrediBILL" on G, his blog appears at the top, and check some of his categories in the right column like "Scrapers". There's some great stuff in there.
Going back to the original post:
|I'm looking to stop the scraping/copying/bots from the outset and I need bandwidth kept to a minimum |
A lot of posts have discussed the first part, but not much has been said about the second - I would like to hear some comments from others regarding how you reconcile the desire to reduce bandwidth and latency to a minimum on a large, static dataset with the legitimate concerns over content-scraping.
If you are looking for speed (and you should be, as speed is a critical factor), then there is nothing faster than static content. For evergreen content such as the original post implies, static HTML and aggressive caching rules can make a huge difference in server load and page-load speed for the end user as well as significantly reduce bandwidth. You can tell Apache to send a Cache-Control max-age (or Expires) header with a long expiry time for text/html content. And because Apache already sends Last-Modified and ETag headers for static files, it can answer conditional requests from user-agents such as Googlebot with a bodiless 304 Not Modified response.
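To make the 304 mechanism concrete, here is a rough sketch of the decision a server makes for a conditional request. Apache does this on its own for static files, so nothing like this needs to be written for the original poster's setup. The function name `conditional_response` and the one-year lifetime are assumptions chosen for illustration.

```python
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Optional, Tuple

ONE_YEAR = 365 * 24 * 60 * 60  # assumed cache lifetime, in seconds


def conditional_response(
        mtime: float,
        if_modified_since: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a static file last changed at mtime."""
    headers = {
        "Last-Modified": formatdate(mtime, usegmt=True),
        "Cache-Control": "max-age=%d" % ONE_YEAR,
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full 200
        # HTTP dates have one-second resolution, so compare whole seconds.
        if since is not None and int(mtime) <= int(since):
            return 304, headers  # client copy is current; send no body
    return 200, headers
```

A crawler that stored the `Last-Modified` value from its first visit sends it back as `If-Modified-Since`, and an unchanged page then costs only a few hundred bytes of headers instead of the full HTML.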
Does anyone have any evidence of scraping via ISP or other public caches? If not (and I'm not aware of such a problem ever being discussed), then static HTML is the way to go in my opinion.
|Search "incrediBILL" on G, his blog appears at the top, and check some of his categories in the right column like "Scrapers". There's some great stuff in there. |
Exactly what I meant to say:
|> Anything else I missed? |
incrediBILL would most probably have some thoughts to add on this issue.
> Does anyone have any evidence of scraping via ISP or other public caches?
Two points, probably minor (they are for me):
1. TalkTalk (UK ISP) are sending a Chinese bot to re-read a file after it has been loaded by one of their customers. It is claimed to be an anti-virus check, but arriving 30 minutes after the original page read rather kills that excuse. Also illegal under UK/EU law.
2. UK schools often use a common proxy service. Some of these are a bit scrapy. A few UK ISPs also use caches, but I haven't seen so much mis-configuration in recent years.
Apart from that there is some good news:
If you block the first access attempt, usually to the default home page, then unless the "scraper" knows about your complete site, that is the only page that will be attempted. This alone will diminish the bandwidth overhead. Feed the scraper a purely minimal 403 whatever and that reduces it further. You need some kind of blocking mechanism for this - if you have htaccess then that should do it.
Block any country you don't want by IP range (there are public databases available for download - look into rbldnsd (wrbldnsd if Windows) which is for mail servers but could probably be adapted for web servers).
Block all server farms as discovered.
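The range check behind this kind of blocking is simple enough to sketch. The example below uses Python's standard `ipaddress` module. The two networks are placeholders taken from the reserved documentation ranges, not real offenders, and `is_blocked` and `respond` are hypothetical names. In practice the same test would normally be done by htaccess rules or a firewall rather than application code.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks; a real list
# would come from a country or hosting-provider database.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str, networks=BLOCKED_NETWORKS) -> bool:
    """True if client_ip falls inside any blocked CIDR range."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return True  # treat unparseable addresses as hostile
    return any(address in network for network in networks)


def respond(client_ip: str):
    """Return (status, body); blocked clients get a body-less 403."""
    if is_blocked(client_ip):
        return 403, b""
    return 200, b"<html>...page content...</html>"
```

Returning an empty body with the 403 is the "purely minimal" response mentioned above: the blocked client costs a few header bytes and nothing more.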
Again if Windows: add serious IP offenders into IIS Directory Security - except from reading this thread you probably can't as it's not your server. :(
We use ModSecurity combined with honey traps. Works pretty well.
Common user agents used by scrapers:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)
ModSecurity 2.x rules:
SecRule HTTP_User-Agent "Indy Library" "deny,log,status:403"
SecRule HTTP_User-Agent "Nutch" "deny,log,status:403"
And these geniuses set their crawler to use a malformed user-agent called.... 'user-agent'.
SecRule HTTP_User-Agent "User-Agent" "deny,log,status:403"
Also, we block known spam/hacker server farms at Leaseweb, Singlehop, Limestone Networks, Calpop, Softlayer/ThePlanet, etc.
In some cases we don't want to reveal all our independent research as to which clueless UAs these folks use. Post it here and they will change it, so keep SOME OF THAT under your hat!
Nutch, Indy Library, libwww-perl and all that other out of the can stuff is okay.
Let's just not make our work harder by saying "Hey, idiot! We found 'this' bwahahaha!" because next week we get 'that'...and it only takes us six months extra to figure 'that' out. Let's just be smart...
The best way to prevent scrapers is to get your content indexed before scrapers get to it.
|incrediBILL would most probably have some thoughts to add on this issue. |
I shared them in front of a large audience at PubCon back in Nov. 2010, if I can find my memory stick I might post the slide deck.
The message doesn't change: whitelisting
Anything else is a waste of time chasing your tail monitoring logs and making big stupid ugly lists and I really hate wasting my time. Work smart, not hard, especially when it comes to real time sucks like spider hunting.
Then, to stop cloaked bots, you need scripts for speed traps and volume traps. Also monitor built-in spider traps: visitors don't typically open robots.txt, privacy policies or legal info pages, but cloaking spiders nail 'em every time on the first visit.
What do you do when something hits a spider/speed/usage trap?
Make them solve a simple captcha, ask up to 10+ times and auto-block if you don't get a response.
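As a rough illustration of that trap-then-captcha flow, here is a small Python sketch. It is not the poster's actual implementation: the class name `TrapTracker`, the trap URLs and the in-memory storage are all assumptions, and whitelisted crawlers (which legitimately read robots.txt) are presumed to have been let through before this code is reached.

```python
from collections import defaultdict

# Assumed trap URLs: pages that real visitors rarely open first.
TRAP_PATHS = {"/robots.txt", "/privacy.html", "/legal.html"}
MAX_CHALLENGES = 10  # unanswered captchas tolerated before blocking


class TrapTracker:
    """Track clients that hit a spider trap and must solve a captcha."""

    def __init__(self):
        self.unanswered = defaultdict(int)  # ip -> challenges ignored
        self.suspects = set()

    def solved(self, ip):
        """Call when the client answers the captcha correctly."""
        self.suspects.discard(ip)
        self.unanswered.pop(ip, None)

    def decide(self, ip, path, is_first_request):
        """Return 'allow', 'challenge' or 'block' for this request."""
        if is_first_request and path in TRAP_PATHS:
            self.suspects.add(ip)
        if ip not in self.suspects:
            return "allow"
        if self.unanswered[ip] >= MAX_CHALLENGES:
            return "block"
        self.unanswered[ip] += 1
        return "challenge"
```

A speed or volume trap would add the IP to `suspects` in the same way. A human who trips a trap by accident solves the captcha once and carries on, while a bot keeps ignoring it until it is blocked.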
Moving right along...
|The best way to prevent scrapers is to get your content indexed before scrapers get to it. |
Completely ineffective. Often the content is scrambled into a keyword gibberish stew, and you'll never know which site grabbed your content for that purpose unless you put beacons in your content, which I do.
Besides, not all scrapers republish content. Often they are data miners and other resource suckers that don't belong, and they make millions mining your sites.
- Thumbs up!
Thanks for the post.