
Apache Web Server Forum

    
Using .htaccess and X-Robots-Tag to noindex directories
riatkstarley
Msg#: 4598050 posted 2:58 pm on Jul 31, 2013 (gmt 0)

Hello!

I was wondering if someone might be kind enough to help me? I'm trying to noindex a few directories and files on a site using the X-Robots-Tag in the .htaccess file, and I'd be grateful if someone could check my code before I upload the file, in case everything goes horribly wrong. I'm fairly new to .htaccess and don't want to break the site! I believe I have to do it this way: all of the URLs (there are thousands) have already been indexed by Google, so just disallowing them in robots.txt isn't good enough, and I can't noindex each page by hand or I'd be doing it for years!

Below are a few example URLs and what I've done in the .htaccess to try and noindex them:

http://www.example.com/index.php?action=highscores&gameid=885&type=overall&p=1770

<IfModule mod_headers.c>
<FilesMatch "^index.php\?action=highscores?$">
Header set X-Robots-Tag: "noindex"
</FilesMatch>
</IfModule>


http://www.example.com/blog/category/new-games/page/51/

<IfModule mod_headers.c>
<FilesMatch "^blog/category/?$">
Header set X-Robots-Tag: "noindex"
</FilesMatch>
</IfModule>


http://www.example.com/games/1057/play.html

<IfModule mod_headers.c>
<FilesMatch "^play\.html$">
Header set X-Robots-Tag: "noindex, nofollow"
</FilesMatch>
</IfModule>


http://www.example.com/rate_aus.php&gameid=1312

<IfModule mod_headers.c>
<FilesMatch "^rate_?$">
Header set X-Robots-Tag: "noindex"
</FilesMatch>
</IfModule>


I also need to redirect URLs with underscores to their (already existing) counterparts with hyphens, e.g.:

http://www.example.com/games/example_game_name.html
to
http://www.example.com/games/example-game-name.html

I've coded this as:

RewriteCond %{QUERY_STRING} ^.+$
RewriteRule ^_$ -? [L,R=301,NC]


Would I then also need to noindex the underscored URLs to avoid duplicate content? Duplicates are currently a big issue and the main reason for most of this work!

I know it's a massive ask, but if anyone could check these and let me know if I'm completely out, it would be helpful. Normally I'm all about the trial and error, but I know .htaccess files are fragile beasts and I don't want to break anything!

Thanks :)

 

dougwilson
Msg#: 4598050 posted 6:06 pm on Jul 31, 2013 (gmt 0)

I'm after a similar solution, to "noindex" URLs containing characters like "?", "=" or "&".

I use [an online htaccess checker] to check syntax, but an "OK" there just means the file won't throw a 500 error.
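
One approach I've seen sketched elsewhere -- untested by me, and the NOINDEX variable name is just my own label -- is to have mod_rewrite flag any request carrying a query string, then key the header off that flag:

<IfModule mod_rewrite.c>
RewriteEngine On
# flag any request that arrives with a query string
RewriteCond %{QUERY_STRING} .
RewriteRule ^ - [E=NOINDEX:1]
</IfModule>
<IfModule mod_headers.c>
# per-directory rewrites can re-expose the variable as REDIRECT_NOINDEX,
# so cover both spellings
Header set X-Robots-Tag "noindex" env=NOINDEX
Header set X-Robots-Tag "noindex" env=REDIRECT_NOINDEX
</IfModule>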

[edited by: phranque at 10:41 pm (utc) on Jul 31, 2013]
[edit reason] no tools please [/edit]

lucy24
Msg#: 4598050 posted 9:57 pm on Jul 31, 2013 (gmt 0)

Can you include parameters in a FilesMatch envelope? I'd be afraid to.

The parameters by definition mean you've got a dynamic page. So why not shift the header-setting into the page code itself? Just remember that the header has to be set before the page returns any content at all -- not so much as a line break. Otherwise it's too late; you can't send out a supplementary "Oh! And this stuff too" response header.
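
In PHP, for instance, it's one line at the very top of the script -- a bare sketch, assuming your pages are PHP like the index.php in your example URLs:

<?php
// must execute before any output at all -- even whitespace ahead of
// the opening tag counts as output and makes it too late
header('X-Robots-Tag: noindex');

You'd make it conditional on whatever marks a page as one you don't want indexed.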

RewriteRule ^_$ -? [L,R=301,NC]

This rule will only work on requests for
www.example.com/_
where the underscore is the only thing in the "path" of the URL. And the condition says a query string has to exist -- though the + and the anchors are superfluous; a bare . would match the same requests.

That's just as well, since the target means, similarly: redirect to
www.example.com/-
dropping the query string.

The [NC] flag is only needed when you're matching literal text containing alphabetic characters; your pattern has none, so it does nothing here.

The _ to - redirect should probably be spun off into a separate thread, since it's a completely unrelated question.
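
Still, for what it's worth, here's a literal-minded sketch -- assuming the pages really do live under /games/ and carry at most two underscores; you'd add a rule for each extra underscore, or loop with the [N] flag instead:

RewriteRule ^games/([^/_]+)_([^/_]+)_([^/_]+)\.html$ http://www.example.com/games/$1-$2-$3.html [R=301,L]
RewriteRule ^games/([^/_]+)_([^/_]+)\.html$ http://www.example.com/games/$1-$2.html [R=301,L]

No RewriteCond is needed: the patterns themselves guarantee an underscore is present.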

riatkstarley
Msg#: 4598050 posted 11:07 am on Aug 1, 2013 (gmt 0)

Hi Lucy, thanks for getting back to me.

Unfortunately there are far too many pages for me to enter code into each page, which is why I wanted to do it this way. Someone elsewhere told me that this one:

<IfModule mod_headers.c>
<FilesMatch "^play\.html$">
Header set X-Robots-Tag: "noindex, nofollow"
</FilesMatch>
</IfModule>


would definitely work, so I based the others on this one.

Do you think all the pages are dynamic? I wouldn't have thought the blog archives would be.

Thanks!

lucy24
Msg#: 4598050 posted 12:03 pm on Aug 1, 2013 (gmt 0)

If there's a parameter, the page is dynamic. That's what parameters are for. "Dynamic" doesn't necessarily mean that the page is different every time you open it. It just means that the HTML sent to the user's browser is not identical to some HTML file sitting on your server. Behind all those separate pages is a single PHP file that builds each page.

If you want to no-index the entire contents of a directory, it's most easily done by putting a minimalist .htaccess file in that directory, containing just the Header set directive.

The form
<FilesMatch "^play\.html$">
only matches files whose exact name is "play.html", so you don't need the Match; a simple
<Files "play.html">
will do. For an entire directory, <FilesMatch "\.html$"> should do nicely. Just make sure the end of the envelope matches the beginning: </Files> or </FilesMatch>.
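
So for something like /blog/category/ -- assuming that directory physically exists on the server rather than being conjured by a rewrite -- the entire per-directory .htaccess could be just:

<IfModule mod_headers.c>
# applies to everything served from this directory and below
Header set X-Robots-Tag "noindex"
</IfModule>

with no <Files> envelope at all, since you mean everything.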

dougwilson
Msg#: 4598050 posted 3:34 pm on Aug 1, 2013 (gmt 0)

I always wondered why something as simple as this wasn't implemented:

User-Agent: *
Noindex: /

While testing some URLs against the robots.txt tool in Google Webmaster Tools, I tried

User-Agent: *
NoIndex: /*stats

It worked... If you want to test it yourself: remove the URLs from Google's cache and the SERPs, then NoIndex them in robots.txt.

Tested: hxxp://domain/directory/stats.php
Result: Blocked by line 47: NoIndex: /*stats

This works with Google. Whether other search engines honor it, I can't say.

lucy24
Msg#: 4598050 posted 10:57 pm on Aug 1, 2013 (gmt 0)

"Noindex" isn't a canonical robots.txt directive. It happens to be recognized by google, along with a few other non-canonical formats like "Allow" (as opposed to the normal "Disallow").

User-Agent: *

only applies to the googlebot if nothing else in robots.txt calls the googlebot by name; a more specific User-agent block always takes precedence.
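
For example, given a robots.txt like this (hypothetical paths), the googlebot obeys only its own block and skips the * block entirely:

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/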

riatkstarley
Msg#: 4598050 posted 9:46 am on Aug 2, 2013 (gmt 0)

I've read that while Google does unofficially support Noindex in robots.txt, they could take it away at any moment, so it's preferable to do it with a meta robots tag or the X-Robots-Tag header.

Lucy, thanks for your help, I'll give the individual directory approach a go!
