Forum Moderators: goodroi


Blocking a file extension with robots.txt

Inherited a site with a problem


abbeyvet

10:40 am on Jul 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have inherited a site which uses a system of printable pages, where the standard page has a .html extension and the print version a .htm extension, but both share the same name. The files are in the same directories.

There are several hundred articles using this convention, and thus many duplicate pages. The site does poorly in the Google SERPs in spite of regular indexing and bucketloads of good content. I can only assume this is at least part of the reason.

From my understanding of robots.txt, it isn't possible to exclude a file extension. Is there a way?

I have already added a noindex, nofollow to the print pages - will this be effective? Or will it be interpreted as an attempt to hide duplicate content?
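For reference, a noindex, nofollow directive of the kind described here would normally sit in the <head> of each print page - a sketch, assuming the standard meta robots syntax:

```
<!-- placed in the <head> of each .htm print version -->
<meta name="robots" content="noindex,nofollow">
```

Compliant crawlers that fetch the page will then drop it from their index and ignore its links, though they must still be allowed to crawl the page to see the tag.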

Span

9:37 am on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[google.com ]

User-agent: Googlebot
Disallow: /*.htm$

But wildcard patterns like this are a Google-only extension to robots.txt.

ThomasB

9:42 am on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another idea is to move the print versions into a subdirectory and exclude that directory from being crawled. This solution would work with every major search engine.
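As an illustration of this approach - assuming the print versions were moved to a hypothetical /print/ directory - the robots.txt rule would be the standard path-prefix form that all major crawlers support:

```
User-agent: *
Disallow: /print/
```

A Disallow line matches by prefix, so no wildcard is needed: every URL under /print/ is blocked for every compliant crawler.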

easternBrain

1:47 pm on Jul 26, 2005 (gmt 0)



Just what I needed. But what about other search engines?

ThomasB

5:45 pm on Jul 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



easternBrain, first of all welcome to WebmasterWorld!

Unfortunately, only Googlebot currently supports wildcards. If you have the same problem as described by the original poster, I'd suggest a dedicated directory for the print versions.