Forum Moderators: phranque
So far, I have found this on the forum: [webmasterworld.com...]
But the last entry is dated in early 2003...
Can anyone direct me to the best solution for blocking these rippers? I want to protect the member zone on my website, and too many users are currently running rippers against my server.
Thanks.
DrJOnes
Do you know the user-agent names of the common site rippers? If so, block them by user-agent name. Next, add the bad-bots script [webmasterworld.com], and finally, the php script you found.
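As a sketch of the user-agent blocking step, assuming Apache and .htaccess (the agent strings below are just examples; check your own access logs for what the rippers actually send):

```apache
# Flag requests whose User-Agent matches a known ripper (case-insensitive).
SetEnvIfNoCase User-Agent "Teleport"  bad_bot
SetEnvIfNoCase User-Agent "HTTrack"   bad_bot
SetEnvIfNoCase User-Agent "WebZIP"    bad_bot
SetEnvIfNoCase User-Agent "WebCopier" bad_bot

# Deny flagged requests, allow everyone else.
<Limit GET POST>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>
```

Keep in mind this only stops rippers that announce themselves honestly; anything that fakes its user-agent sails straight past it, which is why you layer the bad-bots script and the trap on top.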
Jim
I am in the process of installing the trap you directed me to.
I have a question. I see in the trap script that I can receive an email informing me when a user was caught using a site ripper. The info being sent to me is his IP and the agent he was using (assuming the agent name is not fake). What interests me most is to install this trap in a member zone (pw protected), and I would like to receive the USERNAME (login name) in the email as well. That way I could know which user was running a site ripper.
Is this possible to get this info and what code should I add in the trap.cgi in order to fetch this kind of information?
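Not the script author's answer, but as a sketch: if the member zone uses HTTP Basic Auth, Apache passes the login name to CGI scripts in the REMOTE_USER environment variable, alongside REMOTE_ADDR and HTTP_USER_AGENT that the trap already reads. In a Perl trap.cgi that would be `$ENV{'REMOTE_USER'}`; here is the same idea in Python, with a faked CGI environment for illustration:

```python
import os


def trap_report_fields(environ=None):
    """Collect the fields for the trap's alert e-mail.

    Under HTTP Basic Auth, the web server exposes the authenticated
    login name to CGI scripts as REMOTE_USER (only set inside the
    password-protected area).
    """
    env = environ if environ is not None else os.environ
    return {
        "ip": env.get("REMOTE_ADDR", "unknown"),
        "agent": env.get("HTTP_USER_AGENT", "unknown"),
        "user": env.get("REMOTE_USER", "not logged in"),
    }


# Faked CGI environment, as a ripper request might look:
fields = trap_report_fields({
    "REMOTE_ADDR": "203.0.113.7",
    "HTTP_USER_AGENT": "Teleport Pro",
    "REMOTE_USER": "drjones",
})
print(fields["user"])  # -> drjones
```

The caveat from earlier in the thread still applies: the agent string can be faked, but REMOTE_USER comes from the server's own auth check, so it is the one field the ripper can't spoof without valid credentials.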
Thanks!
You'll be able to tell who tried it, because they'll probably complain by e-mail that they can't access the site anymore.
[added] Don't install the script into user directories. That would be a major security vulnerability. Instead, install the 'bait' into pages accessed by users, and use mod_rewrite to rewrite requests for bait files to the script. In this way, the script and the real path to the script are invisible to users. [/added]
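One way to wire that up, as a sketch with hypothetical paths (/bait/ only ever appears in links on your pages; /cgi-bin/trap.cgi is wherever the script really lives):

```apache
RewriteEngine On
# Any request for the bait path is internally rewritten to the trap
# script; the visitor never sees the script's real location.
RewriteRule ^bait/ /cgi-bin/trap.cgi [L]
```

Because the rewrite is internal (no [R] redirect flag), the browser's address bar still shows the bait URL, and the real path stays hidden.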
Jim
The good rippers, like wget, can also be set to follow robots.txt, so the spider trap won't work in that case. Somebody just ripped my site using it, and they also changed the user agent to something that wasn't predictable.
I'm not certain, but I suspect the better rippers can also be set to limit requests per second, so in practical terms there is no way to block a smart site ripper: it will act exactly like a search engine spider, possibly even better behaved, in the case of the recent over-zealous msn bot. Or like a very curious visitor browsing your whole site over an hour. That behavior is very difficult to stop; I don't think it's actually possible if they know what they're doing.
Which describes most of them, on my sites.
Just because you can't stop them all doesn't mean it's not worth trying to stop some of them, if there is some 'cost' to your site due to their activities. But this discussion is off-topic anyway, because the thread owner is looking for methods, not discussion of whether it's a good idea.
Jim
I'm having some trouble making the above two work. So far I must be doing something wrong... 'cause I can still rip the site with TELEPORT PRO. I'm surely missing something in the installation steps. I have emailed someone I know who will probably be able to assist me. I wish there were an "install guide for dummies" for these two scripts...
Martin
But this discussion is off-topic anyway, because the thread owner is looking for methods,
I'd say it's on topic: it's important to understand the limitations of methods like these when you implement them. I use versions of both, but they are getting increasingly easy to work around with easily available software, so it's good to understand what level of protection you're actually getting.
The PHP spider-blocking script is pretty much plug and play and easy to implement; search for it here. It works fine for standard blocking.