I am attempting to block ia_archiver. However, when I look at my logs there it is, sucking up bandwidth. How do I successfully get rid of it?
For several weeks I have had this in my robots.txt:
User-agent: ia_archiver
Disallow: /
I also have something like the following in .htaccess:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^rufus [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule !^http://[^/.]\.example.com.* - [F]
where "example.com" is my website. I know mod_rewrite is on.
Still, ia_archiver shows up.
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^rufus [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule .* - [F]
Your robots.txt entry appears to be correct, at least the part that you posted. And IA usually obeys robots.txt. Because of that, I wonder if this is really the IA Archiver 'bot, or if it might be someone using the "WayBack Machine" to check old versions of your site, or even someone spoofing ia_archiver. You might want to look at the IP addresses that these requests are coming from in your raw logs to find out.
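One way to follow up on the IP suggestion is to pull the client addresses out of the raw log for every request whose user-agent starts with ia_archiver. The sketch below is in Python and assumes the Apache "combined" log format. The sample lines and addresses are made up for illustration (documentation-range IPs) and are not taken from any real log.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def ips_for_agent(lines, agent_prefix):
    """Count requests per client IP for user-agents starting with agent_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("agent").startswith(agent_prefix):
            counts[match.group("ip")] += 1
    return counts

# Hypothetical sample lines; a real log will differ.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2005:00:00:01 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ia_archiver"',
    '192.0.2.10 - - [01/Jan/2005:00:00:05 +0000] "GET /index.html HTTP/1.0" 403 210 "-" "ia_archiver"',
    '198.51.100.7 - - [01/Jan/2005:00:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible)"',
]

if __name__ == "__main__":
    # Print each address that claimed to be ia_archiver, busiest first.
    for ip, hits in ips_for_agent(SAMPLE_LOG, "ia_archiver").most_common():
        print(ip, hits)
```

The addresses it reports can then be checked by hand (reverse DNS, whois) to see who actually operates them.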
Be aware that neither the robots.txt nor the mod_rewrite code will stop ia_archiver from 'coming back.' But robots.txt should stop it from trying to spider your site (assuming your entire robots.txt file is valid and correctly structured), and if robots.txt fails, the mod_rewrite code will feed it a 403-Forbidden response to every request. In the best case, you'll still see IA fetch your robots.txt occasionally, but it should then leave without fetching any other pages.
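To sanity-check that a robots.txt block like the one posted really excludes ia_archiver, one option is Python's standard urllib.robotparser. This is only a sketch: it shows how that particular parser reads the rules, and a real crawler's parser may behave differently. The example.com URL and the "SomeOtherBot" name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted in the thread.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    page = "http://example.com/index.html"  # placeholder URL
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

If the first check reports that ia_archiver is allowed when run against your full robots.txt, the problem is probably elsewhere in the file rather than in these two lines.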
Jim
There is a lot of confusing information out there as to how to block these bad bots...
Consider this line:
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
I have also seen something like:
RewriteCond %{HTTP_USER_AGENT} WebStripper [OR,NC]
What is the difference in the way the two lines work? Is it true the second line covers more possibilities?
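The second pattern is not anchored, so it matches "WebStripper" anywhere in the user-agent string rather than only at the start, and its [NC] flag makes the match case-insensitive. So yes, it covers more possibilities, but the lack of anchoring makes the pattern much less efficient processing-wise.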
You should anchor patterns when you can based on the observed user-agents visiting *your* site.
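The difference between the two WebStripper patterns can be seen by trying them against a few user-agent strings. The sketch below uses Python's re module as a stand-in for mod_rewrite's regex matching (both look for the pattern anywhere in the string unless it is anchored). The sample user-agent strings are invented for illustration, so test against the ones in your own logs.

```python
import re

# ^WebStripper : anchored, case-sensitive (the first form asked about)
anchored = re.compile(r"^WebStripper")
# WebStripper with [NC] : unanchored, case-insensitive (the second form)
unanchored_nc = re.compile(r"WebStripper", re.IGNORECASE)

def is_blocked(pattern, user_agent):
    """Return True if the pattern matches somewhere in the user-agent string."""
    return pattern.search(user_agent) is not None

# Hypothetical user-agent strings.
SAMPLES = [
    "WebStripper/2.0",
    "webstripper/2.0",
    "Mozilla/4.0 (compatible; WebStripper)",
    "Mozilla/4.0 (compatible; MSIE 6.0)",
]

if __name__ == "__main__":
    for ua in SAMPLES:
        print(f"{ua!r}: anchored={is_blocked(anchored, ua)}, "
              f"unanchored_nc={is_blocked(unanchored_nc, ua)}")
```

The anchored form only catches strings that begin with exactly "WebStripper", while the unanchored [NC] form also catches lowercase variants and strings that merely contain the name.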
For more information on anchoring and patterns, see the documents cited in our forum charter [webmasterworld.com]. For some examples, see the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].
Jim