Forum Moderators: phranque

Message Too Old, No Replies

site being scraped I think

scraper bot

         

Roger30

2:02 am on Mar 3, 2006 (gmt 0)

10+ Year Member



Hi everyone. I'm new to the forum and hoping someone can help me out. I'm also new to FreeBSD and have recently had problems with what appear to be scraper bots hitting my website nonstop all day long. Here is a bit of my log file to show what I mean:


204.13.153.60 - - [01/Mar/2006:10:47:01 -0800] "GET / HTTP/1.1" 200 407 "-" "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"
72.18.136.188 - - [01/Mar/2006:10:50:31 -0800] "GET / HTTP/1.1" 200 407 "-" "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"
216.153.94.14 - - [01/Mar/2006:10:53:03 -0800] "GET / HTTP/1.1" 200 407 "-" "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"

Anyone know of an easy way I can ban the user agent involved here? It's coming in from hundreds of different IPs, so I can't really just ban by IP.

Remember, I'm a FreeBSD newbie, so be specific if you can. I was thinking there should be a way to add it to my IPFW rules, or possibly in httpd.conf. Thanks a lot in advance!

extras

5:51 pm on Mar 3, 2006 (gmt 0)

10+ Year Member



Please learn to search.
There are tons of examples all over the net.

[webmasterworld.com...]
[google.com...]
[webmasterworld.com...]

Once you've done your homework, if you still have a specific problem,
someone will most probably give a helpful answer.

You can also hire someone if you don't have time to search/learn.

jdMorgan

6:34 pm on Mar 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Roger30,

Welcome to WebmasterWorld!

You can block WinHttp user-agent requests using mod_rewrite or mod_access, as described in our long-running series of threads, A close to perfect htaccess ban list [google.com].

Note that many of the user-agents cited in those threads are no longer active; if you decide to use this method, block only those user-agents that are actually a problem on your site.

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim
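
For reference, here is a minimal sketch of the mod_rewrite approach jdMorgan describes, assuming an .htaccess file (or httpd.conf section) on an Apache server with mod_rewrite enabled. The pattern is deliberately left unanchored, because the full UA string begins with "Mozilla/4.0", not "WinHttp":

```apache
# Return 403 Forbidden to any request whose User-Agent contains "WinHttp"
# (matches "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)")
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} WinHttp [NC]
RewriteRule .* - [F]
```

The mod_access alternative mentioned above would instead tag the request with SetEnvIfNoCase User-Agent "WinHttp" bad_bot and then use "Deny from env=bad_bot".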

Roger30

12:15 am on Mar 4, 2006 (gmt 0)

10+ Year Member



OK, I tried to figure it out from that thread, but I'm getting an internal server error from a rewrite rule I already had in there. Here is what I have.

Note: I actually have my own website's URL where it says google.com, though.

RewriteEngine On
deny from all
</htdocs>
RewriteCond %{HTTP_USER_AGENT} ^WinHttpRequest.5
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule!^http://[^/.]\.google.com* - [F]
RewriteRule linkout [linkouthere.com...]

jdMorgan

12:35 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A few tweaks that may help:

RewriteEngine on
# Next line is for mod_access, not mod_rewrite - commented out
# deny from all
#
# Next line is almost certainly misplaced -- I can't comment without more info
</htdocs>
# Remove the anchoring to match what you posted for the visiting UA string above
RewriteCond %{HTTP_USER_AGENT} WinHttp
RewriteRule .* - [F]
#
# Remove end-anchor, escape literal periods
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org
# Syntax was 'very wrong' - guessing at your intent
RewriteRule .* - [F]
#
# Any URL on your site containing "linkout" gets redirected to another site
RewriteRule linkout http://linkouthere.com [R=301,L]

If you get a 500 Server Error, look at your raw server error log file -- it will often tell you exactly what is wrong.

Jim

iProgram

2:09 am on Mar 4, 2006 (gmt 0)

10+ Year Member



You are lucky, because it has a distinctive agent name.

Roger30

6:20 am on Mar 4, 2006 (gmt 0)

10+ Year Member



Thanks so much for the help. I almost have it working now, but I'm still stuck. I have this so far, and the internal server error is gone.

The RewriteRule for my linkout is working fine now. The only problem is the user agent in question is still hitting the site nonstop. What I have so far doesn't appear to be working.


RewriteEngine On
# Remove the anchoring to match what you posted for the visiting UA string above
RewriteCond %{HTTP_USER_AGENT} ^WinHttp*
RewriteRule .* - [F]
#
# Remove end-anchor, escape literal periods
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org
RewriteRule .* - [F]
#
# Any URL on your site containing "linkout" gets redirected to another site
RewriteRule linkout http://linkout.com [R=301,L]

Roger30

6:50 am on Mar 4, 2006 (gmt 0)

10+ Year Member



I have also tried:
RewriteCond %{HTTP_USER_AGENT} WinHttp.WinHttpRequest.5

The log file looks different now, though; the hits look like this.

209.132.211.130 - - [03/Mar/2006:01:54:12 -0800] "GET / HTTP/1.1" 403 279 "-" "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"
216.239.136.23 - - [03/Mar/2006:01:54:17 -0800] "GET / HTTP/1.1" 403 279 "-" "Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)"

It's now a "GET / HTTP/1.1" 403 279.

Does this mean they are being blocked? And if so any way to not log the hits?

larryhatch

6:58 am on Mar 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



403 means they are indeed getting blocked.

Look up 403, 404, and similar codes for their definitions. - Larry
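
Roger30's follow-up question about keeping the blocked hits out of the log went unanswered above. A hedged sketch using Apache's mod_setenvif together with conditional logging (the log path and the "combined" format name here are examples - adjust them to match your own httpd.conf):

```apache
# Tag the scraper's requests, then exclude tagged requests from the access log
SetEnvIfNoCase User-Agent "WinHttp" dontlog
CustomLog /var/log/httpd-access.log combined env=!dontlog
```

The blocked requests still receive the 403 response; they simply stop appearing in the access log.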

fusion5

9:04 pm on Mar 23, 2006 (gmt 0)

10+ Year Member



I got hit by that UA also.
Mozilla/4.0 (compatible; Win32; WinHttp.WinHttpRequest.5)
Coming out of Germany at: 82.165.250.52

Cheers.