Forum Moderators: phranque

mod_rewrite to allow ia_archiver to spider only index page

ia_archiver, mod_rewrite

         

gordongecko

9:03 am on Jan 13, 2005 (gmt 0)

10+ Year Member



We've got several dozen sites and I'm trying to allow ia_archiver to spider ONLY the index page of some of these sites, BUT NOT any of the sub directories of said sites.

I'll call these: site1 and site2.

The following mod_rewrite code allows ia_archiver to access these sites and works fine:

RewriteCond %{HTTP_HOST} !^www\.(site1¦site2)\.com
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]

It does not, however, stop ia_archiver from spidering site1/sub-directory, site2/sub-directory, etc.

Does anyone have a suggestion for how best to accomplish this, other than using robots.txt?

Marino

9:28 am on Jan 13, 2005 (gmt 0)

10+ Year Member



Hi,

RewriteCond %{HTTP_REFERER} !www\.(site1¦site2)\.com/(home¦index)\.(html?¦php)
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]

If NOT (home page of a defined site) AND (UA is ia_archiver), then forbid.

Not tested, but it should work. You still have to adapt the last part of the first regexp to match your home page URIs.

jdMorgan

3:46 pm on Jan 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



gordongecko,

Welcome to WebmasterWorld!

I suspect you want to block ia_archiver if the requested host is not in the allowed list or the requested page is not in the allowed list:


RewriteCond %{HTTP_HOST} !^www\.(site1¦site2)\.com [OR]
RewriteCond %{REQUEST_URI} !^(/¦/index\.html)$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]

Here the allowed list of pages is "/" and "/index.html".

Consider that you may want to allow ia_archiver to fetch the images you display on the home page as well.

Replace all broken pipe "¦" characters in this code with solid pipes before use.
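
As a rough illustration of the image suggestion above, here is the same rule set with solid pipes already in place and one extra alternative in the URL pattern. This is only a sketch: site1 and site2 are the placeholders from this thread, and the list of image extensions is an assumption about what the home page embeds, not something the original poster stated.

```apache
# Sketch only. site1/site2 are the thread's placeholder names; the image
# extensions (gif, jpg/jpeg, png) are an assumption for illustration.
RewriteEngine On
# Forbid ia_archiver when the requested host is not on the allowed list ...
RewriteCond %{HTTP_HOST} !^www\.(site1|site2)\.com [OR]
# ... or when the URL is neither the home page nor an image file.
RewriteCond %{REQUEST_URI} !^(/|/index\.html|/.+\.(gif|jpe?g|png))$
# Both conditions above are ORed together, then ANDed with this one.
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
```

Note that the image alternative as written matches image files at any depth, including those inside sub-directories. If that is too permissive, the pattern could be narrowed to whatever directory actually holds the home page images.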

It seems to me that using robots.txt to do most of the work might be easier, but that's not what you asked...

Jim

gordongecko

4:25 pm on Jan 13, 2005 (gmt 0)

10+ Year Member



Thanks Jim - that works.

One thing though, regarding the "¦": how exactly does it operate in your solution?

As far as I'm aware "¦" indicates a logical "OR". I don't quite see how that works.

How would this be different from say:

!^(/.*/index\.html)$
!^(/[a-z]/index\.html)$

Regards,
gg

Marino

9:52 am on Jan 14, 2005 (gmt 0)

10+ Year Member



Yes, the pipe in a regular expression is a logical OR.
BTW, I had not taken the "/" into account in my solution.
It should be:

RewriteCond %{HTTP_REFERER} !www\.(site1¦site2)\.com(/((home¦index)\.(html?¦php))?)?
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]

It should now match:
www.site1.com
www.site1.com/
www.site2.com
www.site2.com/
www.site1.com/home.htm
www.site1.com/home.html
www.site1.com/home.php
www.site1.com/index.htm
www.site1.com/index.html
www.site1.com/index.php
www.site2.com/home.htm
www.site2.com/home.html
www.site2.com/home.php
www.site2.com/index.htm
www.site2.com/index.html
www.site2.com/index.php

jdMorgan

2:45 pm on Jan 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteCond %{REQUEST_URI} !^(/¦/index\.html)$

translates to:
If NOT( (REQUEST_URI == "/") OR (REQUEST_URI == "/index.html") )

Using DeMorgan's theorem, this is logically equivalent to

If (REQUEST_URI != "/") AND (REQUEST_URI != "/index.html")

I should also note that you must use REQUEST_URI here. HTTP_REFERER is not going to work for several reasons. First, because you want to control (allow) access to the URLs "/index.html" and "/", rather than to control (allow) access to the files referred-to by "/index.html" and "/". And second, because spiders rarely or never provide an HTTP_REFERER header (which means mod_rewrite will see it as blank).
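
To make the DeMorgan equivalence above concrete, the single negated alternation can be written as two separate negated conditions, since consecutive RewriteCond lines without [OR] are ANDed. This is a sketch of the URL test in isolation; the host check from the earlier rule set is deliberately left out here.

```apache
# Single-condition form discussed above (with a solid pipe):
#   RewriteCond %{REQUEST_URI} !^(/|/index\.html)$
# Equivalent pair of conditions, implicitly ANDed by mod_rewrite:
RewriteCond %{REQUEST_URI} !^/$
RewriteCond %{REQUEST_URI} !^/index\.html$
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC]
RewriteRule .* - [F]
```

The split form should not simply be dropped in after a host condition that carries [OR]. mod_rewrite groups ORed conditions first, so the result would be (host OR first URL test) AND second URL test, which is not the same logic. The single-pattern version avoids that problem.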

Jim

gordongecko

3:12 pm on Jan 14, 2005 (gmt 0)

10+ Year Member



Thank you for clearing that up Jim.

Can you recommend a good book or text that might help one better understand "DeMorgan's theorem" and/or the various regex rules, especially as applied in mod_rewrite?

Thanks,
gg