
Forum Moderators: goodroi


ban all except index.htm



7:46 pm on Feb 9, 2005 (gmt 0)

Hi, I was wondering whether the best way to ban robots from
spidering an entire site except the index.html page would be to add a
META NAME="robots" CONTENT="noindex, nofollow"
tag to the index.html page, or if there is a better way.


3:08 pm on Feb 12, 2005 (gmt 0)

10+ Year Member

That would tell the robots not to index the index.html page and not to follow any links from it. In theory, they wouldn't crawl anything via that page. However, this only excludes index.html itself; if there are external links pointing to other pages within the site, those pages could still be crawled and indexed.

The correct way would be to exclude everything else in the robots.txt file and add "noindex, nofollow" meta tags to all the other pages in the site (as a safety measure in case robots.txt is ignored), and just a "nofollow" tag to the index.html page.
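As a sketch, the meta tags described above might look like this (these go inside each page's HEAD section):

```html
<!-- On every page EXCEPT index.html: keep the page out of the index and don't follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- On index.html only: allow indexing, but don't follow links into the rest of the site -->
<meta name="robots" content="nofollow">
```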

Sample of robots.txt:

User-agent: *
Disallow: /images
Disallow: /cgi-bin/
Disallow: /each-and-every-directory
Disallow: /page-name.html
Disallow: /another-page-name.html
Disallow: /still-another-page-name.html

You would need to list each page in the root directory and each directory.
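As an alternative, some major crawlers (Googlebot, for example) also understand a non-standard "Allow" directive, which can shorten the file considerably. Allow is not part of the original robots.txt standard and support varies, so test it against the crawlers you care about before relying on it:

```
User-agent: *
Allow: /index.html
Disallow: /
```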

Robot Manager is an excellent tool to use if you have trouble writing the robots.txt file by hand (hope that's OK to list that resource). There is also an excellent validator here at SEW > [searchengineworld.com...]

Remember, robots/crawlers/spiders have been known to ignore all these safeguards. If you have something sensitive you don't want showing up in a search engine, password protect the directory and/or page.
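On Apache servers, for example, password protection can be done with a .htaccess file using Basic authentication; the realm name and file path below are placeholders you would adapt to your own server:

```apache
# .htaccess in the directory to protect
AuthType Basic
AuthName "Private Area"
AuthUserFile /full/server/path/to/.htpasswd
Require valid-user
```

The .htpasswd file itself is created with the htpasswd utility and should be stored outside the web root.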

Hope that helps.
