Forum Moderators: goodroi


Allowing only some pages for ia archive


sosoo

6:53 am on Dec 13, 2006 (gmt 0)

10+ Year Member



How can I allow ia_archiver to index just the index.html page and maybe one or two others?
Although googlebot seems to support more sophisticated commands and wildcards, I'm not sure what ia_archiver can understand.

I have too many pages to disallow individually, and I don't want to allow everything, since in my experience ia_archiver can be a bandwidth hog if left unchecked.

physics

6:58 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld.com sosoo!

You might not be able to do this with robots.txt, but you can with .htaccess if you're using Apache (are you?).
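For instance, something along these lines could work (just a sketch, assuming Apache with mod_rewrite enabled; the forbidden response code and the home-page exception are illustrative, adjust to taste):

# .htaccess sketch: serve 403 to ia_archiver for everything except the home page
RewriteEngine On
# only act when the requesting user-agent contains "ia_archiver" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
# let requests for "/" or "/index.html" through
RewriteCond %{REQUEST_URI} !^/(index\.html)?$
# everything else gets a 403 Forbidden
RewriteRule .* - [F]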

phranque

7:08 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you might find some answers in this relevant thread:
[webmasterworld.com...]

sosoo

7:15 am on Dec 13, 2006 (gmt 0)

10+ Year Member



Yes, I'm using Apache, and I could use .htaccess.

It just seems it would make perfect sense if you could do something like this:

User-agent: ia_archiver
Allow: /index.html
Disallow: *

Allow doesn't seem to be widely supported, though.

phranque

8:20 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should put bot-specific rules before the
User-agent: *
rules.
you must either keep the allowed files in a directory above the disallowed files, or specifically disallow a list of files after allowing all others.
allow is not part of the robots.txt standard.
wildcarding is not supported for path names.
the correct way to allow all paths is
Disallow:
(with an empty value).
the correct way to disallow all paths is
Disallow: /
please see this for details and examples:
[robotstxt.org...]
you might also try the google webmaster tools for robots.txt verification and testing.
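to make that concrete, here's a sketch of the standard-compliant approach, assuming (hypothetically) that everything except index.html lives under directories like /articles/ and /images/:

User-agent: ia_archiver
# disallow the directories holding the bulk of the site
Disallow: /articles/
Disallow: /images/
# anything not matched above (e.g. /index.html) stays allowed

since there's no Allow line, anything whose path doesn't start with a disallowed prefix remains crawlable, which is why grouping the off-limits pages into a few directories matters.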