Preventing site scraping

rewrite, scraping, RewriteCond, LWP

         

uncobeth

5:35 pm on Feb 17, 2009 (gmt 0)

10+ Year Member



Hi -

I have someone who's lifting material right from my web site using different tools. First it was lwp-trivial, and I was able to stop that from happening. Now it's LWP::Simple/5.805 since they couldn't get the content anymore using lwp-trivial.

I added just plain LWP (see the second RewriteCond below), but it didn't stop them last night. Do I need the actual version number (like the fourth RewriteCond)? I didn't for lwp-trivial or Wget. Maybe something else is wrong with this? I had accidentally put an [OR] after the last line, with no other condition following. Could that have made it fail?

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP::Simple/5.805
RewriteRule ^.*$ [my.domain...] [R,L]

Thanks,
Beth

wilderness

6:40 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



# Matches if the User-Agent contains "lwp", regardless of case
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} lwp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule ^.*$ [my.domain...] [R,L]
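
A variant of the rules above, offered as a sketch rather than a drop-in fix. It assumes the goal is simply to refuse the request, so it answers 403 Forbidden with [F] instead of redirecting. The libwww-perl condition is an addition that is not in the original rules: that is the default agent string LWP::UserAgent sends when a script doesn't go through LWP::Simple.

```apache
RewriteEngine on
# Unanchored and case-insensitive: matches lwp-trivial and LWP::Simple/5.805
RewriteCond %{HTTP_USER_AGENT} lwp [NC,OR]
# Assumption: also catch LWP::UserAgent's default agent string
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC,OR]
# The last condition in the chain carries no [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
# Leave the URL untouched and answer 403 Forbidden
RewriteRule ^ - [F]
```

Dropping the leading ^ is what lets one pattern cover both agent strings, since the match no longer depends on where "lwp" appears or how it is capitalized.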

uncobeth

6:59 pm on Feb 17, 2009 (gmt 0)

10+ Year Member



Many thanks!

janharders

7:23 pm on Feb 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That'll only slow them down. Once they figure out how to use LWP with a custom user agent (and it's not that hard, really; after all, it's Perl, and Perl makes easy things easy), they'll be back on your site. Your only chance is to go by cookies (if they use them) and originating IPs. Sadly, those can't be handled just by adding a few rewrite rules.
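
Rewrite rules can't work out which visitor is the scraper, but once an address has been picked out of the access log, the block itself can be written in the same style. A minimal sketch, assuming a single offending address; 192.0.2.15 is a placeholder from the documentation range, not an address from this thread.

```apache
RewriteEngine on
# Hypothetical address; replace with the one found in the access log
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.15$
# Refuse every request from that address with 403 Forbidden
RewriteRule ^ - [F]
```

This holds only until the scraper moves to another address or a proxy, so the log-watching has to continue.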

uncobeth

8:07 pm on Feb 17, 2009 (gmt 0)

10+ Year Member



Thanks. I know they figured out that I got them on lwp-trivial and switched to LWP::Simple. The URL they got sent to is a page full of irrelevant and bizarre Perl error messages that I pasted in. I hope it stumped them for a while. :) I wonder what other fun pranks I could pull on them for stealing my content and bandwidth? A honeypot that contained #*$! would be hilarious, because it's a realtor's MLS stuff that's being stolen by a company that's charging them for the same thing they're paying us for. Oh, how funny that would be!

uncobeth

8:09 pm on Feb 17, 2009 (gmt 0)

10+ Year Member



Hmm. I had used a four-letter word that means dirty pictures, and it was censored. :)