I have a fairly large database-driven website (about 150,000 entries).
Now I see that some people want my database contents and are using some kind of downloading software to fetch all my pages, presumably so they can rip the content out of the downloaded pages. Not only are they stealing my database, they are also putting a high load on my server and eating up my bandwidth.
How can I prevent that? Does anyone have experience in this field?
Thank you!
I use a throttle control to damp both problems.
Anyone hitting my sites faster than a given rate gets put onto an escalating series of bans -- ultimately their IP address gets banned for 7 days.
Usually, the ten-minute ban (during which all incoming requests get sent a page saying "you are spidering too fast") is enough to stop most out-of-control spiders: they exhaust their cache of links and assume their job is done.
There are several thresholds for acceptable spidering. For example (not the actual numbers): more than 3 CGI executions in a second is a ban, and more than 30 in a minute is also a ban.
That won't stop a well-behaved spider getting the whole site. But (for a typical site of mine) that'll take them a week or more. That solves the crazy bandwidth problem.
It also solves several other problems as badly behaved spiders (like HTTrack) do not retry at a controlled rate -- they assume the site is closed to them.
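For anyone who wants to try the throttle idea above, here is a rough Python sketch of how it might look. It is not the poster's actual code: the class name `Throttle`, the intermediate ban lengths, and the in-memory storage are all assumptions, and the limits are just the example figures from the post (3 per second, 30 per minute, first ban of ten minutes, last ban of 7 days).

```python
import time
from collections import defaultdict, deque

# Example figures only (the post says its real numbers differ).
LIMITS = [(1.0, 3), (60.0, 30)]             # (window in seconds, max requests)
# Hypothetical ban ladder: 10 minutes, 1 hour, 1 day, 7 days.
BAN_STEPS = [600, 3600, 86400, 7 * 86400]


class Throttle:
    """Per-IP rate limiter with escalating bans (in-memory sketch)."""

    def __init__(self, limits=LIMITS, ban_steps=BAN_STEPS):
        self.limits = limits
        self.ban_steps = ban_steps
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests
        self.offences = defaultdict(int)  # ip -> number of bans so far
        self.banned_until = {}            # ip -> time at which the ban ends

    def allow(self, ip, now=None):
        """Return True if the request may proceed, False if the IP is banned."""
        now = time.time() if now is None else now

        # Still serving a ban: reject without counting the request.
        if self.banned_until.get(ip, 0) > now:
            return False

        hits = self.hits[ip]
        hits.append(now)

        # Forget timestamps older than the longest window.
        longest = max(window for window, _ in self.limits)
        while hits and now - hits[0] >= longest:
            hits.popleft()

        # Any exceeded limit triggers the next ban on the ladder.
        for window, max_requests in self.limits:
            recent = sum(1 for t in hits if now - t < window)
            if recent > max_requests:
                step = min(self.offences[ip], len(self.ban_steps) - 1)
                self.banned_until[ip] = now + self.ban_steps[step]
                self.offences[ip] += 1
                hits.clear()
                return False
        return True
```

A script would call `allow(ip)` at the top of each request and, when it returns False, send the "you are spidering too fast" page instead of the real content. A real setup would need to keep the counters somewhere shared (a file or a database), since separate CGI processes do not share memory.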
I agree with what's been said so far. However, there are some solid steps you can take via .htaccess with mod_rewrite (for Apache servers) to ban known downloading agents. Be careful to research each user agent before you ban it. What's bad for one website may be a good thing for another.
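If you cannot edit .htaccess, the same user-agent check can be done inside the script itself. The sketch below is an application-level stand-in in Python, not a mod_rewrite rule. The names `BLOCKED_AGENTS` and `is_blocked` are made up for illustration, and the list contains only HTTrack because that is the one agent named in this thread; research any others before adding them.

```python
import re

# Hypothetical blocklist: HTTrack is the only agent named in this thread.
BLOCKED_AGENTS = ["HTTrack"]

# One case-insensitive pattern that matches any listed agent as a substring.
_pattern = re.compile(
    "|".join(re.escape(agent) for agent in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blocked agent name."""
    return bool(user_agent) and _pattern.search(user_agent) is not None
```

A request for which `is_blocked` returns True would get a 403 Forbidden response, which is also what mod_rewrite's `[F]` flag sends. Keep in mind that the User-Agent header is trivially faked, so this only stops tools that identify themselves honestly.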