I'll call these: site1 and site2.
The following mod_rewrite code allows ia_archiver to access these sites and works fine:
RewriteCond %{HTTP_HOST} !^www\.(site1¦site2)\.com
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
It does not, however, stop ia_archiver from spidering site1/sub-directory, site2/sub-directory, etc.
Does anyone have a suggestion for how best to accomplish this, other than using robots.txt?
RewriteCond %{HTTP_REFERER} !www\.(site1¦site2)\.com/(home¦index)\.(html?¦php)
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]
If not (home page of a defined site) and (UA is ia_archiver) then forbid.
Not tested, but it should work. You still have to adapt the last part of the first regexp to match your home pages' URIs.
Welcome to WebmasterWorld!
I suspect you want to block ia_archiver if the requested host is not in the allowed list or the requested page is not in the allowed list:
RewriteCond %{HTTP_HOST} !^www\.(site1¦site2)\.com [OR]
RewriteCond %{REQUEST_URI} !^(/¦/index\.html)$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
Consider that you may want to allow ia_archiver to fetch the images you display on the home page as well.
Replace all broken pipe "¦" characters in this code with solid pipes before use.
It seems to me that using robots.txt to do most of the work might be easier, but that's not what you asked...
Jim
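The point above about home-page images could be handled by widening the REQUEST_URI test. This is an untested sketch rather than a drop-in rule. The image extensions (gif, jpg/jpeg, png) are an assumption, and it lets ia_archiver fetch any image on the allowed hosts, not only the ones shown on the home page. It is written with solid pipes.

```apache
RewriteEngine On

# Forbid ia_archiver when the host is not in the allowed list, OR when the
# request is neither the home page nor an image.
# [OR] binds tighter than the implicit AND between RewriteCond lines, so this
# reads as: (bad host OR bad URI) AND user-agent is ia_archiver.
RewriteCond %{HTTP_HOST} !^www\.(site1|site2)\.com [OR]
# Assumed image extensions; adjust to whatever the home page actually uses.
RewriteCond %{REQUEST_URI} !(^/(index\.html)?$|\.(gif|jpe?g|png)$)
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
```

If the home-page images live in one directory, anchoring the image alternative to that path would be tighter than matching on extension alone.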
RewriteCond %{HTTP_REFERER} !www\.(site1¦site2)\.com(/((home¦index)\.(html?¦php))?)?
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]
Should now match:
www.site1.com
www.site1.com/
www.site2.com
www.site2.com/
www.site1.com/home.htm
www.site1.com/home.html
www.site1.com/home.php
www.site1.com/index.htm
www.site1.com/index.html
www.site1.com/index.php
www.site2.com/home.htm
www.site2.com/home.html
www.site2.com/home.php
www.site2.com/index.htm
www.site2.com/index.html
www.site2.com/index.php
RewriteCond %{REQUEST_URI} !^(/¦/index\.html)$
By De Morgan's theorem, this is logically equivalent to:
If (REQUEST_URI != "/") AND (REQUEST_URI != "/index.html")
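For illustration, the same equivalence can be written directly in mod_rewrite, since consecutive RewriteCond lines are ANDed by default. This is a sketch with solid pipes that leaves out the host check on purpose: the expanded form is not a drop-in replacement inside a ruleset that uses [OR], because [OR] would then group only with the first of the two lines.

```apache
RewriteEngine On

# Compact form, one negated alternation:
#   RewriteCond %{REQUEST_URI} !^(/|/index\.html)$
#
# Expanded form, two negated tests joined by the implicit AND:
RewriteCond %{REQUEST_URI} !^/$
RewriteCond %{REQUEST_URI} !^/index\.html$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
```

Either form forbids ia_archiver on every URL except "/" and "/index.html". The compact form is the one to keep when a host condition with [OR] precedes it.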
I should also note that you must use REQUEST_URI here. HTTP_REFERER is not going to work for several reasons. First, because you want to control (allow) access to the URLs "/index.html" and "/", rather than to control (allow) access to the files referred-to by "/index.html" and "/". And second, because spiders rarely or never provide an HTTP_REFERER header (which means mod_rewrite will see it as blank).
Jim
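To make the second reason concrete, here is an annotated sketch of what a referrer-based condition does when the spider sends no Referer header. The pattern is shortened from the ones posted earlier and uses solid pipes. It is shown only to illustrate the failure, not as something to deploy.

```apache
# %{HTTP_REFERER} expands to an empty string when the client sends no
# Referer header, which is the usual case for spiders. An empty string can
# never match the pattern below, so the negated condition is true on every
# such request...
RewriteCond %{HTTP_REFERER} !www\.(site1|site2)\.com
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
# ...and ia_archiver is refused everywhere, home page included.
RewriteRule .* - [F]
```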