Forum Moderators: goodroi


Allowing only some pages for ia archive


sosoo

6:53 am on Dec 13, 2006 (gmt 0)

10+ Year Member



How can I allow ia_archiver to index just the index.html page and maybe one or two others?
Although googlebot seems to support more sophisticated commands and wildcards, I'm not sure what ia_archiver can understand.

I have too many pages to disallow individually, and I don't want to allow everything, since in my experience ia_archiver can be a bandwidth hog if left unchecked.

physics

6:58 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld.com sosoo!

You might not be able to do this with robots.txt, but you can with .htaccess if you're using Apache (are you?).
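For instance, something along these lines could work (just a sketch, assuming Apache with mod_rewrite enabled; the forbidden response code and the home-page exception are illustrative, adjust to taste):

# .htaccess sketch: serve 403 to ia_archiver for everything except the home page
RewriteEngine On
# only act when the requesting user-agent contains "ia_archiver" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
# let requests for "/" or "/index.html" through
RewriteCond %{REQUEST_URI} !^/(index\.html)?$
# everything else gets a 403 Forbidden
RewriteRule .* - [F]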

phranque

7:08 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you might find some answers in this relevant thread:
[webmasterworld.com...]

sosoo

7:15 am on Dec 13, 2006 (gmt 0)

10+ Year Member



Yes, I'm using Apache, and I could use .htaccess.

It just seems it would make perfect sense if you could do something like this:

User-agent: ia_archiver
Allow: /index.html
Disallow: *

Allow doesn't seem to be widely supported, though.

phranque

8:20 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should put bot-specific rules before the
User-agent: *
rules.
you must either keep the allowed files in a directory above the disallowed files, or specifically disallow a list of files after allowing all others.
allow is not part of the robots.txt standard.
wildcarding is not supported for path names.
the correct way to allow all paths is
Disallow:
(with an empty value).
the correct way to disallow all paths is
Disallow: /
please see this for details and examples:
[robotstxt.org...]
you might also try the google webmaster tools for robots.txt verification and testing.
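to make that concrete, here's a sketch of the standard-compliant approach, assuming (hypothetically) that everything except index.html lives under directories like /articles/ and /images/:

User-agent: ia_archiver
# disallow the directories holding the bulk of the site
Disallow: /articles/
Disallow: /images/
# anything not matched above (e.g. /index.html) stays allowed

since there's no Allow line, anything whose path doesn't start with a disallowed prefix remains crawlable, which is why grouping the off-limits pages into a few directories matters.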