Forum Moderators: DixonJones

How can I know if somebody is downloading my site through a bot/sw?

And how can I prevent this?

gutabo

3:19 pm on Nov 4, 2002 (gmt 0)

10+ Year Member



I have a site with a huge gallery. Last month my bandwidth (data transfer) jumped from 18 GB to 28 GB(!). I haven't checked the logs yet (I'll do it tomorrow), but I read somewhere around here about bandwidth growth and software that copies all of a site's content, and I think that's the problem. I won't be sure until I check the logs, though. So I wanted to ask: how can I tell if some software is downloading pics from my site? And how can I prevent it?
I will keep you updated (just in case).
Thanks in advance!

sugarkane

1:35 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Gutabo,

Did your logs turn anything up? There are plenty of site mirroring programs out there, and they may or may not respect your robots.txt file. First thing I'd do is block your images directory from all robots, or just selectively if you find one particular problem program in your logs.
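For instance, a robots.txt along these lines asks all compliant robots to stay out of an images directory (the /gallery/ path here is just an example, not taken from the thread):

```
User-agent: *
Disallow: /gallery/
```

Of course, this only helps against software that actually reads and respects robots.txt.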

If the software doesn't obey robots.txt, you might be able to block by user agent [webmasterworld.com] using .htaccess

If that still doesn't help, you'll probably have to look into banning the IP addresses of persistent offenders.
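To make those last two suggestions concrete, a rough .htaccess sketch might look like this (the "BadMirror" user-agent string and the IP address are placeholders, not anything identified in this thread):

```apache
# Refuse requests from a hypothetical mirroring program by User-Agent
# (requires mod_rewrite; "BadMirror" is a made-up name for illustration)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadMirror [NC]
RewriteRule .* - [F,L]

# Ban one persistent offender by IP (placeholder address)
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
```

With Order Allow,Deny, the Deny directive takes precedence over the matching Allow, so that one address gets a 403 while everyone else gets through.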

incywincy

3:01 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



I've thought about this problem for quite a while now and have come up with the following conclusions:

1) Most spidering/copying programs let you set the user agent. It's easy to disguise yourself as Googlebot, for example (although the IP address would be wrong, of course), so filtering on user agent alone wouldn't work.

2) Use of an anonymous proxy would hide the IP address of a transgressor, so you couldn't filter on IP address either.

3) Even if you detected an IP address that was downloading a lot of data, the culprit could still cycle through a list of anonymous proxies.

The only way I can think of to prevent this is to force your users to register before downloading, and to ensure that the login page requires human attendance to succeed, using some sort of dynamically created PIN (AltaVista uses this for URL submission).

Just an idea I had once!

PS: you could also hide a link somewhere, exclude the target URL using robots.txt, then ban any IP that tries to retrieve it, or maybe redirect it to some unsavoury website!
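That hidden-link trap might look something like this (the /trap/ path and image name are made up for illustration). Compliant robots skip the link because of robots.txt, so anything that fetches it is ignoring the rules:

```html
<!-- In robots.txt:
       User-agent: *
       Disallow: /trap/
-->
<!-- Invisible 1x1 trap link placed on a normal, high-traffic page -->
<a href="/trap/index.html"><img src="/images/blank.gif" width="1" height="1" border="0" alt=""></a>
```

Any IP that shows up in the logs requesting /trap/index.html is a candidate for the ban list.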

[edited by: incywincy at 3:24 pm (utc) on Nov. 5, 2002]

jdMorgan

3:21 pm on Nov 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The majority of site downloaders are not very sophisticated. The ones I see seem to be downloading pages to scan for e-mail addresses, since most of them download only html pages and not graphics or included scripts. I may be biased here, because I don't have any "valuable" graphics on my pages anyway. I haven't seen any of them using sophisticated multiple-proxy attacks.

The most useful recent addition to my site's armament is a small script which automatically adds a "ban" to my .htaccess file. It does this whenever a bot attempts to fetch a page which is Disallowed in robots.txt

A few "trap" links are scattered within the high-traffic pages of the site. These invisible links lead to pages which are disallowed by robots.txt. If a 'bot requests one of these pages, the script is invoked, the 'bot's IP address is added to .htaccess, and further requests receive a 403 Forbidden response.

This script [webmasterworld.com] was originally posted here on this site by Key_Master, and another member and I have tweaked it a little, adding file-locking to avoid problems if the script is invoked from two or more requests at the same time. It works great, and I recommend it to anyone who is tired of manually adding unwelcome visitors' IP addresses to their ban list.
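Purely to illustrate the idea (this is not Key_Master's script, just a rough modern sketch of the same mechanism), the core of such an auto-ban routine is an append to .htaccess under an exclusive file lock; the path and the CGI environment usage are assumptions:

```python
import fcntl

def ban_ip(ip, htaccess_path):
    """Append a 'deny from <ip>' line to the given .htaccess file,
    holding an exclusive lock so two simultaneous trap hits
    can't corrupt the file."""
    with open(htaccess_path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # wait for exclusive access
        f.write("deny from %s\n" % ip)
        f.flush()
        fcntl.flock(f, fcntl.LOCK_UN)

# In a CGI trap page you would call it with the visitor's address,
# e.g. ban_ip(os.environ["REMOTE_ADDR"], "/path/to/.htaccess")
```

The lock matters because two 'bots (or one 'bot with parallel connections) can trip the trap at the same moment, and interleaved writes could leave .htaccess in a state Apache refuses to parse.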

Jim

gutabo

4:05 pm on Nov 5, 2002 (gmt 0)

10+ Year Member



*partial update*

We're moving from Verio to myacen, so all activities are frozen till we're up again... so no WebTrends until then...
BTW, thank you all for your posts. I don't know yet what's going on, though...
Right now (today is the 5th) we're already at 5.9 GB, data-transfer-wise...
Verio is too expensive...
And... we're DOWN.
(will keep you updated)
Thanks again! and...
Thanks in advance!