
Webmaster General Forum

    
How to protect a website from being downloaded?
Prevent downloading for offline browsing
Philarmon
msg:336001
10:53 pm on Feb 27, 2005 (gmt 0)

Hi!

I have a pretty large website running from a database (with about 150,000 entries).

Now I see that some people want my database contents and are using some kind of downloading software to fetch all of my pages, presumably so they can rip the content out of the downloaded copies afterwards. Not only are they stealing my database, they also put a high load on my server and eat up my bandwidth.

How can I prevent that? Does anyone have experience in this field?

Thank you!

 

txbakers
msg:336002
4:11 am on Feb 28, 2005 (gmt 0)

You can't.

ogletree
msg:336003
5:36 am on Feb 28, 2005 (gmt 0)

Never put the site up. Short of that: you could track when someone is requesting every page in order, like a spider, and only allow certain known spiders to do that. A spider trap might help as well (see the sketch below). There is no way to stop somebody from grabbing one page or a few pages, but you can detect when somebody is methodically fetching every page in a certain order and ban them by cookie or IP.
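A minimal sketch of the trap idea, assuming an Apache/CGI setup and Python -- the script name, banlist path and helpers are examples, not a drop-in solution. Link to the trap URL from every page with an invisible anchor and disallow it in robots.txt; polite spiders skip it, while download tools that follow every link fall in and get their IP recorded so the rest of the site can refuse them:

#!/usr/bin/env python3
# trap.cgi -- hypothetical spider-trap endpoint (illustrative sketch).
# Every normal page checks is_banned() before serving its content.

import os
import fcntl

BANLIST = "/var/tmp/banned_ips.txt"   # example path, adjust to your server

def ban_current_ip():
    ip = os.environ.get("REMOTE_ADDR", "unknown")
    with open(BANLIST, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # avoid interleaved writes
        f.write(ip + "\n")
        fcntl.flock(f, fcntl.LOCK_UN)
    return ip

def is_banned(ip):
    try:
        with open(BANLIST) as f:
            return ip in (line.strip() for line in f)
    except FileNotFoundError:
        return False

if __name__ == "__main__":
    ban_current_ip()
    # Answer with an empty page so the trap itself gives nothing away.
    print("Content-Type: text/html")
    print()
    print("<html><body></body></html>")

Checking a flat file on every request is crude, but it fits a 2005-era CGI setup; a database table would do the same job.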

victor
msg:336004
7:53 am on Feb 28, 2005 (gmt 0)

Spiders sometimes go evil too (even Google's) and start sucking up bandwidth as if they were the only thing that mattered.

I use a throttle control to damp both problems.

Anyone hitting my sites faster than a given rate gets put onto an escalating series of bans -- ultimately their IP address gets banned for 7 days.

Usually, the ten-minute ban (during which all incoming requests get sent a page saying "you are spidering too fast") is enough to stop most out-of-control spiders -- they exhaust their cache of links and assume their job is done.

There are several thresholds for acceptable spidering (e.g., and these are not the actual numbers: more than 3 CGI executions in a second is a ban; more than 30 in a minute is also a ban).

That won't stop a well-behaved spider from getting the whole site, but (for a typical site of mine) that will take them a week or more. That solves the crazy bandwidth problem.

It also solves several other problems, because badly behaved spiders (like HTTrack) do not retry at a controlled rate -- they just assume the site is closed to them and give up.
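To make the escalation concrete, here is a rough sketch of that kind of throttle in Python; the limits, ban lengths and in-memory dictionaries are illustrative assumptions, not victor's actual numbers or code:

# throttle.py -- sketch of an escalating per-IP rate limit.

import time
from collections import defaultdict, deque

PER_SECOND_LIMIT = 3           # e.g. more than 3 hits in one second -> ban
PER_MINUTE_LIMIT = 30          # e.g. more than 30 hits in one minute -> ban
BAN_STEPS = [600, 3600, 7 * 24 * 3600]   # 10 minutes, 1 hour, 7 days

hits = defaultdict(deque)      # ip -> timestamps of recent requests
banned_until = {}              # ip -> unix time when the current ban expires
ban_level = defaultdict(int)   # ip -> how many bans this ip has earned

def allow(ip, now=None):
    """Return True to serve the request, False to send the
    'you are spidering too fast' page instead."""
    now = now if now is not None else time.time()

    if banned_until.get(ip, 0) > now:
        return False                      # still inside an earlier ban

    q = hits[ip]
    q.append(now)
    while q and q[0] < now - 60:          # keep only the last minute of hits
        q.popleft()

    last_second = sum(1 for t in q if t > now - 1)
    if last_second > PER_SECOND_LIMIT or len(q) > PER_MINUTE_LIMIT:
        step = min(ban_level[ip], len(BAN_STEPS) - 1)
        banned_until[ip] = now + BAN_STEPS[step]
        ban_level[ip] += 1                # the next offence earns a longer ban
        return False

    return True

In-memory state like this only works inside one long-running process; a CGI or multi-server setup would need to keep the counters in a shared file or database instead.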

keyplyr
msg:336005
6:39 pm on Mar 1, 2005 (gmt 0)


Hello Philarmon,

I agree with what's been said so far. However, there are some solid steps you can take via .htaccess with mod_rewrite (on Apache servers) to ban known downloading agents. Be careful to research each user agent before you ban it; what's bad for one website may be a good thing for another.
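For example, something along these lines in .htaccess (the agents listed are just common offline-downloader strings used for illustration; check your own logs and research each one before you block it):

# Example only -- verify each agent against your own logs first
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Offline.Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [NC]
RewriteRule .* - [F,L]

The [NC] flag makes the match case-insensitive and [F] sends a 403 Forbidden instead of the page, so a blocked agent never gets your content.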

Related Threads [google.com]

Philarmon
msg:336006
6:57 pm on Mar 1, 2005 (gmt 0)

Thanks for all the info, guys! I think I'll try both - the too-many-hits ban (although I have some concerns about SE spiders, which can crawl a lot of pages pretty fast) and the user agent ban.

You're great :)
