
Apache Web Server Forum

    
How do I block these scrapers?
IP blocking doesn't work
tangster



 
Msg#: 4549211 posted 9:47 pm on Feb 26, 2013 (gmt 0)

So far I have tried IP blocking, and I even put the words "torkaland" and "streamica" in my referrer and user-agent block lists. None of it works! Please help.

torkaland.blogspot.com and streamica.com

 

Frank_Rizzo

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4549211 posted 10:10 pm on Feb 26, 2013 (gmt 0)

Did you restart apache after each change?

tangster



 
Msg#: 4549211 posted 10:31 pm on Feb 26, 2013 (gmt 0)

The blocks are in the .htaccess. I didn't know you have to restart Apache for them to take effect?

The problem is I can't block the IP of the blogspot site because it's owned by Google. I am afraid Google might use the same IP to crawl my site, and then Googlebot would get blocked too.


Streamica, on the other hand, seems to be pulling RSS feeds from a different IP than the one it's hosted on, and I don't know which IP they are using to scrape the site.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributors Of The Month



 
Msg#: 4549211 posted 11:29 pm on Feb 26, 2013 (gmt 0)

Anything in htaccess takes effect immediately. The only exception is that if your browser has already cached the page, it may not know that there have been changes.

It is trivial to make a conditional block to say, for example,

RewriteCond %{REMOTE_ADDR} {give the numerical IP here}
RewriteCond %{HTTP_USER_AGENT} !Googlebot


... and then take it from there. Currently all the googlebot variants such as the imagebot and the three-or-more mobiles contain the element "Googlebot" (capitalized) in their User-Agent string.
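
For reference, a minimal sketch of how those two conditions could be completed into a working rule. The address 192.0.2.10 is a placeholder from the documentation range, not the scraper's real IP, and the sketch assumes mod_rewrite is available in .htaccess. Keep in mind that a User-Agent string can be faked, so the Googlebot exemption is only as trustworthy as the client sending it.

```apache
RewriteEngine On

# Placeholder IP (assumption): substitute the address seen in your access log.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
# Exempt any request whose User-Agent contains "Googlebot" (case-sensitive).
RewriteCond %{HTTP_USER_AGENT} !Googlebot
# Both conditions matched: answer 403 Forbidden without rewriting the URL.
RewriteRule ^ - [F]
```

The two RewriteCond lines are joined by an implicit AND, so the request is refused only when it comes from that address and does not identify itself as Googlebot.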

even put in the word "torkaland" and "streamica" in referrer and user-agent block list.

What exactly do you mean by this? That is, what did you do physically?

tangster



 
Msg#: 4549211 posted 2:06 am on Feb 27, 2013 (gmt 0)

Would it be safe to block blogspot.com, which is on a Google IP? My concern is that Googlebot may sometimes use the same IP. If someone can confirm it doesn't, that would be of immense help.


Below is an example of what I meant...


SetEnvIfNoCase user-agent "torkaland" keep_out
SetEnvIfNoCase user-agent "streamica" keep_out

and

RewriteCond %{HTTP_REFERER} streamica.com [NC,OR]
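
A note on the excerpt above: SetEnvIfNoCase only sets the keep_out variable and does not refuse anything by itself, and a RewriteCond does nothing without a RewriteRule after it. The sketch below shows one way the pieces could fit together, assuming Apache 2.2 access-control syntax and assuming the rest of the .htaccess does not already contain these lines. Either approach can only match if the scraper actually sends "torkaland" or "streamica" in its User-Agent or Referer header, which a server-side fetcher is not obliged to do.

```apache
# Flag requests by User-Agent or Referer (case-insensitive match).
SetEnvIfNoCase User-Agent "torkaland" keep_out
SetEnvIfNoCase User-Agent "streamica" keep_out
SetEnvIfNoCase Referer "streamica\.com" keep_out

# Apache 2.2 syntax: allow everyone except flagged requests.
Order Allow,Deny
Allow from all
Deny from env=keep_out

# On Apache 2.4, the equivalent of the three lines above would be:
# <RequireAll>
#     Require all granted
#     Require not env keep_out
# </RequireAll>
```

Checking the access log for the headers these requests really carry is the quickest way to tell whether a header-based block can work at all.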

tangster



 
Msg#: 4549211 posted 8:59 pm on Feb 28, 2013 (gmt 0)

I found out torkaland has added our site to the BlogList widget on Blogger. Is there any way to stop the RSS feed?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved