Forum Moderators: phranque

How can I block users using site rippers?

looking for the most efficient way to detect and block a user site ripper


DrJOnes

7:09 am on Sep 3, 2004 (gmt 0)

10+ Year Member



I would like to find an efficient way to detect a user using a site ripper and automatically block him when he's detected.
I tried a search with the keywords "block ripper" and haven't had any luck...

So far, I have found this on the forum: [webmasterworld.com...]
But the last entry is dated in early 2003...

Can anyone direct me to the best solution to block these rippers? I want to protect my member zone on my website and too many users currently use rippers on my server.

Thanks.

DrJOnes

jdMorgan

6:53 pm on Sep 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DrJOnes,

Do you know the user-agent names of the common site rippers? If so, block them by user-agent name. Next, add the bad-bots script [webmasterworld.com], and finally, the php script you found.
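For reference, user-agent blocking like Jim describes is usually done in .htaccess with mod_rewrite. A minimal sketch, assuming Apache with mod_rewrite enabled — the ripper names below are just common examples, so check your own access logs for the strings that actually show up:

```apache
# Deny requests whose User-Agent matches known site-ripper names.
# The substrings below are illustrative; add whatever appears in your logs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Teleport|HTTrack|WebZIP|WebCopier|Wget) [NC]
RewriteRule .* - [F]
```

[F] returns 403 Forbidden and [NC] makes the match case-insensitive. Keep in mind this only stops rippers that announce themselves honestly in the user-agent string.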

Jim

DrJOnes

11:03 pm on Sep 3, 2004 (gmt 0)

10+ Year Member



Thanks jdMorgan!

I am in the process of installing the trap you directed me to.

I have a question. I see in the trap script that I can receive an email informing me when a user is caught using a site ripper. The info being sent to me is his IP and the agent he was using (assuming the agent name is not fake). What interests me most is to install this trap in a member zone (pw protected), and I would like to receive the USERNAME (login name) in the email as well. This way, I could know which user was using a site ripper.

Is it possible to get this info, and what code should I add to trap.cgi in order to fetch it?

Thanks!

jdMorgan

11:20 pm on Sep 3, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "User Login" state and information is maintained by the user's browser, not by your server. Therefore, the username will not be available if the user switches to another program to rip the site.

You'll be able to tell who tried it, because they'll probably complain by e-mail that they can't access the site anymore.

[added] Don't install the script into user directories. That would be a major security vulnerability. Instead, install the 'bait' into pages accessed by users, and use mod_rewrite to rewrite requests for bait files to the script. In this way, the script and the real path to the script are invisible to users. [/added]
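A minimal sketch of the bait rewrite Jim describes, again assuming Apache and mod_rewrite — the bait filename and script path here are hypothetical placeholders, not the thread's actual paths:

```apache
RewriteEngine On
# Internally map requests for the bait page to the trap script.
# Links only ever point at /special-offers.html; the script's
# real location never appears in any URL the visitor sees.
RewriteRule ^special-offers\.html$ /cgi-bin/trap.cgi [L]
```

The bait link is typically placed where humans won't follow it and disallowed in robots.txt, so well-behaved spiders skip it; anything that requests it anyway gets handed to the trap.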

Jim

isitreal

2:05 am on Sep 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All site rippers worth anything offer the option to identify as whatever browser they want, so looking at the user agent is only going to keep out the most clueless site rippers. That might stop one or two, but once they realize what happened, all they have to do is go to another IP and re-rip the site with an IE 6 browser string.

The good rippers, like wget, can also be set to follow robots.txt, so the spider trap won't work in this case. Somebody just ripped my site using it, and also changed the user agent to something that wasn't predictable.

I'm not certain, but I suspect that the better rippers can also be set to limit requests per second, so in practical terms there's really no way to block a smart site ripper: it will act exactly like a search engine spider (possibly even better, in the case of the recent overzealous msn bot), or like a very curious visitor browsing your whole site over an hour. That's very difficult behavior to stop; I don't actually think it's possible if they know what they're doing.

jdMorgan

2:31 am on Sep 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> so looking at the ... user agent is only going to keep out the most clueless site rippers.

Which describes most of them, on my sites.

Just because you can't stop them all doesn't mean it's not worth trying to stop some of them, if there is some 'cost' to your site due to their activities. But this discussion is off-topic anyway, because the thread owner is looking for methods, not discussion of whether it's a good idea.

Jim

DrJOnes

3:16 am on Sep 4, 2004 (gmt 0)

10+ Year Member



Correct... I'd rather block most of them than none of them. Of course, ideally I'd LOVE to be able to block them ALL! If anyone can come up with that kind of solution, that'd be just great! Otherwise, I'll settle for the best possible solution that can block as many rippers as possible.

I'm having some trouble making the above two work. I must be doing something wrong so far, because I can still rip the site with TELEPORT PRO. I am surely making a mistake somewhere in the installation steps. I have emailed someone I know who will probably be able to assist me. I wish there were "install guides for dummies" for these two scripts...

Martin

jdMorgan

3:46 am on Sep 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From msg#2 above:
Do you know the user-agent names of the common site rippers? If so, block them by user-agent name. Next, add the bad-bots script, and finally, the php script you found.

Jim

isitreal

3:48 am on Sep 4, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But this discussion is off-topic anyway, because the thread owner is looking for methods,

I'd say it's on topic: it's important to understand the limitations of methods like this when you implement them. I use versions of both those methods, but they are getting increasingly easy to work around with easily available software, so it's good to understand what level of protection is actually available.

The php spider-blocking script is pretty much plug and play and easy to implement; search for it here. It works fine for standard blocking.