Forum Moderators: phranque

How to block all those domain name scrapers

         

zeus

10:33 pm on Mar 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello - does anyone know how I can block all those sites that just use my whois info as content? I think such sites ignore robots.txt, so htaccess is the only option.

But how do I find those IPs to block?

wilderness

5:03 am on Mar 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But how do I find those IPs to block?


1) View your visitor logs.
2) Copy and paste the IP range into a WHOIS search (i.e., the same way they find you).
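
Step 1 above can be partly automated. The following is a minimal Python sketch, assuming the server writes the Apache "combined" log format; the function name `count_ips` and the search term are hypothetical and only for illustration. It tallies requests per IP for any User-Agent containing a given substring, which gives a short list of addresses to feed into the WHOIS lookup in step 2.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>.*)"$'
)


def count_ips(lines, agent_substring):
    """Count requests per IP whose User-Agent contains agent_substring.

    The match is case-insensitive; lines that do not fit the assumed
    log format are skipped.
    """
    needle = agent_substring.lower()
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and needle in match.group("agent").lower():
            hits[match.group("ip")] += 1
    return hits
```

To run it against a real log, something like `count_ips(open("access.log"), "surveybot").most_common(20)` would list the busiest matching addresses (the file name here is an assumption).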

topr8

8:05 am on Mar 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can't block bots from getting your Whois info; it is scraped from the registrar.

zeus

11:09 am on Mar 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, yes, I think you are right. I totally missed that.

wilderness

3:38 pm on Mar 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW,
A webmaster cannot control access requests to another domain; on your own server, however, your control over access requests is unrestricted.

The following requests are all from the same domain (with the exception of the HEAD request) and span 2002-2011. All of these requests are certainly controllable, whether by IP or by UA.

216.122.66.zz - - [19/Nov/2002:05:23:01 -0800] "GET / HTTP/1.1" 206 4096 "http://www.example.sc/" "SurveyBot/2.2 <a href='http://www.example.sc'>Whois Source</a>"
69.225.183.zz - - [14/Jan/2005:02:56:57 -0800] "GET /myPage.html HTTP/1.1" 200 10727 "www.example.sc" "Mozilla/5.0 (X11; U; Linux; i686; ja-JP; rv:1.5) Gecko"
72.52.193.zzz - - [02/Mar/2008:22:17:46 -0600] "HEAD /cgi-bin/whois.pl HTTP/1.1" 403 - "-" "libwww-perl/5.805"
64.246.187.zz - - [12/May/2008:06:02:00 -0500] "GET /robots.txt HTTP/1.0" 403 - "http://www.example.sc/" "SurveyBot/2.3 (Whois Source)"
64.246.165.zz - - [06/Feb/2010:20:21:49 +0000] "GET /robots.txt HTTP/1.0" 200 4954 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (#*$!)"
64.246.161.zz - - [06/Feb/2010:23:30:56 +0000] "GET /robots.txt HTTP/1.0" 200 5046 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (#*$!)"
216.145.5.zz - - [28/Feb/2011:21:05:30 -0700] "GET /robots.txt HTTP/1.0" 200 3966 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (#*$!)"
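
To make the "IP or UA" point concrete, here is a rough Python sketch of the decision itself, not an htaccess rule. The two /24 blocks are simply the masked addresses from the log above widened to their last octet, and the substrings are taken from the UAs shown; the real allocations may be wider or narrower (a WHOIS lookup would tell), and the name `should_deny` is hypothetical.

```python
import ipaddress

# Illustrative only: /24 blocks derived from the masked log addresses above.
# The actual ranges must be confirmed with a WHOIS lookup.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("64.246.187.0/24", "216.145.5.0/24")
]

# Illustrative only: lowercase substrings seen in the logged User-Agents.
BLOCKED_AGENT_SUBSTRINGS = ("surveybot", "libwww-perl")


def should_deny(ip, user_agent):
    """Return True if the request matches a blocked range or a blocked UA."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(token in agent for token in BLOCKED_AGENT_SUBSTRINGS)
```

Checking both conditions covers the case seen in the log, where the same bot shows up from several different ranges over the years but keeps the same UA token.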

This particular bot may in fact honor robots.txt (there are many more of these tools in existence, and I frequently see new ones under different names).

Additionally, two of these bot requests have malformed UAs, and rejecting malformed UAs is a simple solution that applies to all requests (not just bots).
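
The post does not say which two requests are meant or what counts as malformed, so the following Python sketch rests on an assumption: a UA is treated as malformed if it is empty, is just "-", or contains HTML markup (as the 2002 SurveyBot/2.2 entry does). The function name `looks_malformed` is hypothetical, and other checks may well be what the poster had in mind.

```python
import re

# Assumed heuristic: real browsers do not put HTML tags in the User-Agent.
HTML_TAG = re.compile(r"<[^>]+>")


def looks_malformed(user_agent):
    """Return True for an empty or placeholder UA, or one containing HTML tags."""
    agent = user_agent.strip()
    if agent in ("", "-"):
        return True
    return HTML_TAG.search(agent) is not None
```

Because the check looks only at the shape of the UA and not at a bot name, it would also catch unrelated clients that send the same kind of broken header.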