
Home / Forums Index / Code, Content, and Presentation / Apache Web Server

Apache Web Server Forum

    
How do I block these scrapers?
IP blocking doesn't work
tangster




msg:4549213
 9:47 pm on Feb 26, 2013 (gmt 0)

So far I tried IP blocking, even put in the word "torkaland" and "streamica" in referrer and user-agent block list. None of it works! Please help.

torkaland.blogspot.com and streamica.com

 

Frank_Rizzo




msg:4549230
 10:10 pm on Feb 26, 2013 (gmt 0)

Did you restart apache after each change?

tangster




msg:4549237
 10:31 pm on Feb 26, 2013 (gmt 0)

The blocks are in the .htaccess. I didn't know you had to restart Apache for them to take effect?

The problem is I can't block the IP of the blogspot site because it's owned by Google; I am afraid they might use the same IP to crawl my site and get blocked.


Whereas streamica seems to be pulling RSS feeds from a different IP than the one it's hosted on, and I don't know which IP they are using to scrape the site.

lucy24




msg:4549256
 11:29 pm on Feb 26, 2013 (gmt 0)

Anything in htaccess takes effect immediately. The only exception is that if your browser has already cached the page, it may not know that there have been changes.

It is trivial to make a conditional block to say, for example,

RewriteCond %{REMOTE_ADDR} {give the numerical IP here}
RewriteCond %{HTTP_USER_AGENT} !Googlebot


... and then take it from there. Currently all the googlebot variants such as the imagebot and the three-or-more mobiles contain the element "Googlebot" (capitalized) in their User-Agent string.
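
For what it's worth, here is a minimal sketch of how the two conditions above could be finished off into a working block. The address 203.0.113.45 is a placeholder from the documentation range, not an IP anyone in this thread has identified, and the sketch assumes mod_rewrite is enabled and permitted in .htaccess.

```apache
RewriteEngine On

# Placeholder address: substitute the IP you actually see in your logs.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$
# Skip the block when the request identifies itself as Googlebot.
# HTTP_USER_AGENT is the server variable holding the User-Agent header.
RewriteCond %{HTTP_USER_AGENT} !Googlebot
# Match every URL, leave it unchanged, and answer 403 Forbidden.
RewriteRule ^ - [F]
```

Keep in mind that the User-Agent header is supplied by the client, so a scraper that claims to be Googlebot would slip past this exception.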

even put in the word "torkaland" and "streamica" in referrer and user-agent block list.

What exactly do you mean by this? That is, what did you do physically?

tangster




msg:4549287
 2:06 am on Feb 27, 2013 (gmt 0)

Would it be safe to block blogspot.com, which is on a Google IP? My concern is that Googlebot may also use the same IP sometimes. If someone can confirm that it doesn't, that would be of immense help.


Below is an example of what I meant...


SetEnvIfNoCase user-agent "torkaland" keep_out
SetEnvIfNoCase user-agent "streamica" keep_out

and

RewriteCond %{HTTP_REFERER} streamica.com [NC,OR]
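
For anyone following along, lines like the ones above only set a flag or a condition; they do nothing until something acts on them. A rough sketch of one way to complete the environment-variable approach is below. It assumes the host allows these directives in .htaccess, and the Referer line is an added illustration that replaces the mod_rewrite condition. It can only work if the scraper really sends those strings, which may well not be the case here.

```apache
# Flag requests whose User-Agent or Referer contains the scraper names.
SetEnvIfNoCase User-Agent "torkaland" keep_out
SetEnvIfNoCase User-Agent "streamica" keep_out
SetEnvIfNoCase Referer "streamica\.com" keep_out

# Apache 2.2 syntax: allow everyone except flagged requests.
Order Allow,Deny
Allow from all
Deny from env=keep_out

# Apache 2.4 equivalent (use instead of the three lines above):
# <RequireAll>
#     Require all granted
#     Require not env keep_out
# </RequireAll>
```

Checking the access log for the User-Agent and Referer values the scraper actually sends would show whether either pattern can ever match.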

tangster




msg:4549971
 8:59 pm on Feb 28, 2013 (gmt 0)

I found out torkaland has added our site to the BlogList widget on Blogger. Is there any way to stop the RSS feed?
