Msg#: 7974 posted 10:53 pm on Feb 27, 2005 (gmt 0)
I have one pretty large website running from a database (with about 150,000 entries).
Now I see that some people would like to get my database contents and are using some kind of downloading software to fetch all my pages, and then they probably rip the content out of the downloaded pages. Not only are they stealing my database, they are also putting a high load on my server and eating up my bandwidth.
How can I prevent that? Does anyone have experience in this field?
Never put the site up. You could track whether someone is requesting every page in order, like a spider, and allow only certain spiders to do that. A spider trap might help as well. There is no way to stop somebody from getting one page or a few pages, but you can detect when somebody is methodically fetching every page in a certain order and ban them by cookie or IP.
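To make the "every page in order" idea concrete, here is a rough Python sketch. It assumes pages are addressed by a numeric id and that a client can be identified by IP or cookie value; the class name, the `run_length` parameter and the choice of "consecutive ids" as the signal are all illustrative assumptions, not something a particular server provides.

```python
from collections import defaultdict, deque


class SequentialAccessDetector:
    """Flags clients whose most recent requests walk page ids in order.

    Hypothetical helper: assumes numeric page ids and a stable client key
    (an IP address or a cookie value).
    """

    def __init__(self, run_length=5):
        self.run_length = run_length
        # Keep only the last `run_length` page ids seen for each client.
        self.recent = defaultdict(lambda: deque(maxlen=run_length))

    def record(self, client, page_id):
        """Record one hit; return True if the last run_length ids were consecutive."""
        history = self.recent[client]
        history.append(page_id)
        if len(history) < self.run_length:
            return False
        ids = list(history)
        return all(later - earlier == 1 for earlier, later in zip(ids, ids[1:]))
```

A request handler would call `record(ip, page_id)` on every page view and hand the client to whatever ban mechanism is in place when it returns True. Whitelisted spiders would have to be skipped before this check.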
Spiders sometimes go evil too (even Google's) and start sucking up bandwidth as if they were the only thing that mattered.
I use a throttle control to damp both problems.
Anyone hitting my sites faster than a given rate gets put onto an escalating series of bans -- ultimately their IP address gets banned for 7 days.
Usually, the ten-minute ban (during which all incoming requests get sent a page saying "you are spidering too fast") is enough to stop most out-of-control spiders: they exhaust their cache of links and assume their job is done.
There are several thresholds for acceptable spidering (for example, and these are not the actual numbers: more than 3 CGI executions in a second is a ban; more than 30 in a minute is also a ban).
That won't stop a well-behaved spider getting the whole site. But (for a typical site of mine) that'll take them a week or more. That solves the crazy bandwidth problem.
It also solves several other problems as badly behaved spiders (like HTTrack) do not retry at a controlled rate -- they assume the site is closed to them.
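For anyone who wants to try something similar, the throttle described above could look roughly like this in Python. This is only a sketch under assumptions: the limits are the example numbers from the post (explicitly not the real ones), the ten-minute and 7-day bans come from the post but the steps in between are invented, and state is kept in memory rather than in whatever store a real site would use.

```python
from collections import defaultdict, deque

# Example limits from the post (not the poster's actual numbers):
# (window in seconds, maximum requests allowed inside that window).
LIMITS = [(1.0, 3), (60.0, 30)]

# Escalating ban lengths in seconds. Ten minutes and 7 days are from the
# post; the one-hour and one-day steps are assumptions for illustration.
BAN_LADDER = [600, 3600, 86400, 7 * 86400]


class Throttle:
    def __init__(self, limits=LIMITS, ban_ladder=BAN_LADDER):
        self.limits = limits
        self.ban_ladder = ban_ladder
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests
        self.banned_until = {}            # ip -> time at which the ban ends
        self.offences = defaultdict(int)  # ip -> number of bans so far

    def allow(self, ip, now):
        """Return True to serve the request, False to send the 'too fast' page."""
        if self.banned_until.get(ip, 0.0) > now:
            return False
        hits = self.hits[ip]
        hits.append(now)
        # Forget timestamps older than the longest window.
        longest = max(window for window, _ in self.limits)
        while hits and now - hits[0] > longest:
            hits.popleft()
        for window, max_hits in self.limits:
            recent = sum(1 for t in hits if now - t <= window)
            if recent > max_hits:
                # Each new offence moves one step up the ban ladder.
                level = min(self.offences[ip], len(self.ban_ladder) - 1)
                self.banned_until[ip] = now + self.ban_ladder[level]
                self.offences[ip] += 1
                hits.clear()
                return False
        return True
```

In a CGI script this would be called as `allow(remote_ip, time.time())` before doing any database work. Taking the timestamp as an argument is just a design choice that makes the logic easy to test.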
I agree with what's been said so far. However, there are some solid steps you can take via .htaccess with mod_rewrite (for Apache servers) to ban known downloading agents. Be careful to research each user agent before you ban it. What's bad for one website may be a good thing for another.
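If mod_rewrite is not available, the same kind of user-agent check can be done inside the application instead. The Python sketch below is that swapped-in approach, not an .htaccess rule. The pattern list is hypothetical: HTTrack is the only downloader named in this thread, and anything else would need the research mentioned above before being added.

```python
import re

# Hypothetical blocklist. HTTrack is the only agent named in this thread;
# add others only after researching what they actually do.
BLOCKED_AGENT_PATTERNS = [r"httrack"]

_BLOCKED = re.compile("|".join(BLOCKED_AGENT_PATTERNS), re.IGNORECASE)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header matches a blocked pattern."""
    # A missing or empty header is not treated as blocked here.
    return bool(user_agent) and _BLOCKED.search(user_agent) is not None
```

A handler would answer with 403 Forbidden when `is_blocked_agent(...)` is true. Keep in mind that the User-Agent header is trivial to fake, so this only stops downloaders that identify themselves honestly.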
Thanks for all the info, guys! I think I'll try both: the too-many-hits ban (although I have some concerns about the search engine spiders, which can crawl a lot of pages very quickly) and the user agent ban.