
ia_archiver bot

         

rfontaine

8:58 pm on Dec 26, 2005 (gmt 0)

10+ Year Member



Anyone face a similar problem?

I am attempting to block ia_archiver. However, when I look at my logs there it is, sucking up bandwidth. How do I successfully get rid of it?

For several weeks I have had this in my robots.txt:

User-agent: ia_archiver
Disallow: /

also, in .htaccess something like the following:

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^rufus [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule !^http://[^/.]\.example.com.* - [F]

where "example.com" is my website. I know mod_rewrite is on.

Still, ia_archiver shows up.

jdMorgan

10:25 pm on Dec 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That code can't possibly work, because RewriteRule never sees the protocol or the domain, only the local URL-path:

RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^rufus [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule .* - [F]

should work better.

Your robots.txt entry appears to be correct, at least the part that you posted. And IA usually obeys robots.txt. Because of that, I wonder if this is really the IA Archiver 'bot, or if it might be someone using the "WayBack Machine" to check old versions of your site, or even someone spoofing ia_archiver. You might want to look at the IP addresses these requests are coming from in your raw logs to find out.
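As a quick way to do that check, assuming a standard Apache combined-format access log (the filename access.log is a placeholder for your own raw log), something like this pulls out the unique client IPs behind the ia_archiver requests:

```shell
# Sketch: list the unique client IPs whose requests carried a
# user-agent containing "ia_archiver" (case-insensitive).
# "access.log" is a placeholder for your raw access log file.
grep -i 'ia_archiver' access.log | awk '{print $1}' | sort -u
```

If those IPs don't resolve back to Internet Archive / Alexa address space, the user-agent string is probably being spoofed by someone else.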

Be aware that neither the robots.txt entry nor the mod_rewrite code will stop ia_archiver from 'coming back.' But robots.txt should stop them from trying to spider your site (assuming your entire robots.txt file is valid and correctly structured), and if robots.txt fails, then the mod_rewrite code will feed them a 403-Forbidden response to every request. At best, you'll still see IA fetch your robots.txt occasionally, but it should then leave without fetching any other pages.
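For reference, a minimal, well-formed robots.txt containing that block might look like the following (the second record is illustrative, not from the original post):

```
# Block the Internet Archive / Alexa crawler entirely
User-agent: ia_archiver
Disallow: /

# Allow all other robots (an empty Disallow permits everything)
User-agent: *
Disallow:
```

The specific ia_archiver record must come before the catch-all "*" record, since compliant robots obey the first record that matches their name.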

Jim

rfontaine

2:46 am on Dec 27, 2005 (gmt 0)

10+ Year Member



Excellent jdMorgan, it seems to be working the way you showed it.

There is a lot of confusing information out there about how to block these bad bots....

Consider this line:

RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]

I have also seen something like:
RewriteCond %{HTTP_USER_AGENT} WebStripper [OR,NC]

What is the difference in the way the two lines work? Is it true the second line covers more possibilities?

jdMorgan

4:55 am on Dec 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The difference is that the first pattern is start-anchored, and will match only a user-agent that *starts with* WebStripper.

The second pattern is not anchored, and also has an [NC] flag to make it case-insensitive. So yes, it covers more possibilities: it matches "WebStripper" anywhere in the user-agent string, in any mix of upper- and lower-case. The lack of anchoring also makes the pattern less efficient, because the regex engine has to try it at every position in the string instead of only at the start.

You should anchor patterns when you can, based on the observed user-agents visiting *your* site.
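To see the difference concretely, here is a small sketch (not from the original posts) using grep, whose -E patterns behave like mod_rewrite's regular expressions for this comparison; the sample user-agent strings are made up:

```shell
# Three hypothetical user-agent strings:
agents='WebStripper/2.62
Mozilla/4.0 (compatible; WebStripper)
webstripper/1.0'

# Start-anchored, case-sensitive: matches only the first string.
printf '%s\n' "$agents" | grep -cE '^WebStripper'

# Unanchored, case-insensitive (like adding [NC]): matches all three.
printf '%s\n' "$agents" | grep -icE 'WebStripper'
```

The first count is 1 and the second is 3, which is exactly the "covers more possibilities" difference between the two RewriteCond lines.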

For more information on anchoring and patterns, see the documents cited in our forum charter [webmasterworld.com]. For some examples, see the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim