Forum Moderators: phranque

Noindexing in the htaccess file

         

kenroar

10:29 am on Jul 11, 2014 (gmt 0)

10+ Year Member



I have been trying to de-index all pdf, doc, docx, ppt, pptx, rtf, and txt files from the Apache server. Although I've added what I thought was the correct code, my files still appear in a Google search two months later. Can someone look at the following code and tell me what I'm doing wrong?

CheckCaseOnly On
CheckSpelling On
RewriteEngine on
# BEGIN GZIP
# mod_gzip compression (legacy, Apache 1.3)
<IfModule mod_gzip.c>
mod_gzip_on Yes
mod_gzip_dechunk Yes
mod_gzip_item_include file \.(html?|xml|txt|css|js)$
mod_gzip_item_include handler ^cgi-script$
mod_gzip_item_include mime ^text/.*
mod_gzip_item_include mime ^application/x-javascript.*
mod_gzip_item_exclude mime ^image/.*
mod_gzip_item_exclude rspheader ^Content-Encoding:.*gzip.*
</IfModule>
# END GZIP

# DEFLATE compression
<IfModule mod_deflate.c>
# Set compression for: html,txt,xml,js,css
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml application/xhtml+xml text/javascript text/css application/x-javascript
# Deactivate compression for buggy browsers
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
# Set header information for proxies
Header append Vary User-Agent
</IfModule>
# END DEFLATE

<filesMatch "\.(pdf|doc|docx|ppt|pptx|rtf|txt|php|swf)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive"
</filesMatch>

# 480 weeks
<filesMatch "\.(ico|flv|jpg|jpeg|png|gif|js|css)$">
Header set Cache-Control "max-age=290304000, public"
</filesMatch>

# 2 DAYS
<filesMatch "\.(xml|txt)$">
Header set Cache-Control "max-age=172800, public, must-revalidate"
</filesMatch>

# 2 HOURS
<filesMatch "\.(html|htm)$">
Header set Cache-Control "max-age=7200, must-revalidate"
</filesMatch>

#<IfModule mod_speling.c>
#CheckCaseOnly On
#CheckSpelling On
#</IfModule>

Header always append X-Frame-Options SAMEORIGIN

RewriteCond %{HTTP_HOST} ^incredibleart.org$
RewriteRule ^(.*)$ "http\:\/\/www\.incredibleart\.org\/$1" [R=301,L]

penders

12:37 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do the HTTP response headers show what you expect, i.e. the X-Robots-Tag header?

aristotle

2:31 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's what I use to prevent indexing of images:
<Files ~ "\.(gif|jpe?g|png)$">
Header append x-robots-tag "noindex"
</Files>

I've been using this for a long time, and as far as I know, it always works.

not2easy

3:28 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



A note - AFAIK, the Apache directives are case sensitive and the format you use may be contributing to how well they work or don't. I haven't read up recently so that may be old info. Rather than filesMatch, try FilesMatch.

You might need to also disable caching for those files:
#disable cache for script and noindex files

<FilesMatch "\.(cgi|doc|docx|fcgi|pdf|php|pl|ppt|pptx|rtf|txt|spl|scgi|swf)$">
Header unset Cache-Control
</FilesMatch>


Note: I have no idea how this might or might not affect the X-robots "noarchive" header.

aristotle

4:30 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you prevent indexing, wouldn't that automatically prevent caching and archiving?

penders

4:45 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AFAIK, the Apache directives are case sensitive...


Directives in the configuration files are case-insensitive, but arguments to directives are often case sensitive.


Reference: [httpd.apache.org...]

"filesMatch" should work OK, although to be honest it does seem rare to see Apache directives that aren't correctly camel-cased. (?)

Maybe two months isn't long enough? How often does Google crawl these files? Have these files previously been cached?

encyclo

5:08 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think we need to step back a little from playing with the code or preventing caching (which may help the immediate problem only at the cost of unnecessary increased server load and bandwidth). The .htaccess code mentioned should work in normal circumstances, so there is something else which is causing difficulties.

The first question is in the first reply above, and this needs to be answered before anything else. Is the header actually being set? (Can you see it with a header checker?)
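
For anyone without a header checker to hand, here is a minimal Python sketch of the same test. It is only an illustration: the function names `fetch_headers` and `has_noindex` are made up for this example, and the check ignores the optional user-agent prefix that an X-Robots-Tag value may carry.

```python
from urllib.request import Request, urlopen


def fetch_headers(url):
    """Send a HEAD request and return the response headers as a dict."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        return dict(resp.headers.items())


def has_noindex(headers):
    """Return True if an X-Robots-Tag header contains a noindex directive."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-robots-tag":
            directives = [d.strip().lower() for d in value.split(",")]
            if "noindex" in directives:
                return True
    return False


# Usage (needs network access), with a URL of your own:
#   print(has_noindex(fetch_headers("http://www.example.com/some-file.pdf")))
```

If `has_noindex` comes back False for a PDF or PPT that the .htaccess rule should cover, the rule is not reaching the response.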

If the header is not there, then I would suspect that the files are not actually being served by the Apache server. This may well be because you have a reverse proxy (nginx) handling static files, so the .htaccess is never consulted in that situation.

Using the server header to control indexing is useful, but not a panacea. This is due to caching (which is almost always advantageous for static files) and proxies. It is usually better to place such files in robots.txt-excluded directories.
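
As a rough illustration of the robots.txt approach, the Python sketch below parses a sample robots.txt and asks whether a crawler may fetch a given URL. The `/docs/` directory and the file names are hypothetical, not taken from the actual site. Bear in mind that robots.txt governs crawling, which is a separate mechanism from the noindex header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of one directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/
"""


def is_crawlable(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Anything under /docs/ is blocked; everything else stays crawlable.
blocked = is_crawlable(ROBOTS_TXT, "Googlebot", "http://www.example.com/docs/lesson.pdf")
allowed = is_crawlable(ROBOTS_TXT, "Googlebot", "http://www.example.com/index.html")
```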

lucy24

7:43 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you prevent indexing, wouldn't that automatically prevent caching and archiving?

Archiving yes. That is: for all we know g### might keep everything in its archive and there's simply no way of knowing about it. To test, you'd have to remove a Noindex header and see whether you suddenly find archived results from before you removed the header.

Caching no, that's an entirely different process, not limited to robots.

Can the X-Robots header ever contain information other than "Noindex"? If not, use "Header set" instead of "Header append".

aristotle

9:32 pm on Jul 11, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Lucy
But I'm still a bit puzzled. If you explicitly say "noarchive", that still wouldn't prevent Google from saving a hidden copy either.

And apparently I don't understand the role of a cache. Why would Google need it if the page isn't in the index and a public cached copy can't be shown?

kenroar

6:17 am on Jul 12, 2014 (gmt 0)

10+ Year Member



+aristotle Thanks for the code, but I don't want to deindex images, only PDFs, docs, ppts, etc.

kenroar

6:29 am on Jul 12, 2014 (gmt 0)

10+ Year Member



This is the example of a server header for one of the PowerPoints on the site:

SERVER RESPONSE: HTTP/1.1 200 OK
Server: nginx/1.4.0
Date: Sat, 12 Jul 2014 06:35:13 GMT
Content-Type: application/vnd.ms-powerpoint
Connection: keep-alive
Vary: Cookie,Host,X-Middleton-Mode
Set-Cookie: ezouid=1361990311; expires=Fri, 01-Jul-2016 06:35:13 GMT; path=/; domain=incredibleart.org; httponly
Last-Modified: Sat, 06 Dec 2003 22:35:20 GMT
Expires: Mon, 11 Aug 2014 06:35:13 GMT
Cache-Control: max-age=2592000

kenroar

6:37 am on Jul 12, 2014 (gmt 0)

10+ Year Member



+encyclo
Placing those files in a robots-blocked directory sounds like it would solve the problem. However, it would take a lot of time to change the links to about 1,000 files. I was hoping I could do it in one go, with something like a rule in Apache.

lucy24

9:06 am on Jul 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And apparently I don't understand the role of a cache. Why would Google need it if the page isn't in the index and a public cached copy can't be shown?

The cache has nothing to do with search engines. It's what prevents your browser from having to make a fresh request for 17 images, three scripts and eleven stylesheets when you pull a page out of your history a few hours after your first visit. Or, for that matter, what prevents your browser from making sixty image requests-- most of them duplicates-- when you load up the present page. (Yes, OK, I counted, so shoot me.)

Placing those files in a robots-blocked directory sounds like it would solve the problem. However, it would take a lot of time to change the links to about 1,000 files.

Oh, no, I don't think anyone envisioned moving files. I think the assumption was that you've already got your assorted non-page files in subdirectories that can then be roboted-out. It's not worth moving things otherwise.

You've got a <FilesMatch> envelope, right?

<FilesMatch "\.(pdf|docx?|pptx?|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Only don't quote me, because it isn't easy to cut-and-paste from beginning to end of a thread. Mine's only js|txt|xml.

Make your own decisions about .doc and .pdf. Google is able to index these-- or at least pdf, don't know about .doc-- so you have to think whether it's useful to you. If you don't want people heading straight for your PDFs without giving some html page a look-in first, then yes, you want a noindex.

phranque

9:53 am on Jul 12, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



apparently I don't understand the role of a cache. Why would Google need it if the page isn't in the index and a public cached copy can't be shown?


google keeps a cached version so it knows about the noindex/noarchive directive.
on subsequent requests for the resource google sends the If-Modified-Since header.
if you change your mind about the noindex/noarchive, google would get a fresh version to cache.
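
The conditional-request exchange described above can be sketched in Python roughly as follows. This models a generic server's decision only, not Google's or Apache's actual code, and `conditional_status` is a name invented for the example.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the resource is unchanged since the client's copy, else 200.

    Both arguments are HTTP date strings such as
    "Sat, 06 Dec 2003 22:35:20 GMT".
    """
    if if_modified_since is None:
        # No conditional header: always send the full response.
        return 200
    resource_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    # Unchanged since the client's copy: answer 304 Not Modified, no body.
    return 304 if resource_time <= client_time else 200
```

A 304 carries no body, so the requester keeps using the copy it already holds; a 200 means it gets a fresh version along with whatever headers are set at that point.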

penders

10:03 am on Jul 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... server header for one of the PowerPoints ...


And the X-Robots-Tag header is missing. I assume these files are either ".ppt" or ".pptx"?

By itself, your code in .htaccess should work, so it doesn't appear to be executed at all? (Or something is overriding it?)

Maybe what encyclo mentioned has something to do with the problem:

Server: nginx/1.4.0


encyclo: If the header is not there, then I would suspect that the files are not actually being served by the Apache server. This may well be because you have a reverse proxy (nginx) handling static files, so the .htaccess is never consulted in that situation.


?

aristotle

12:11 pm on Jul 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy and phranque
Thanks for the replies. Actually what prompted my questions was the OP's code:
<filesMatch "\.(pdf|doc|docx|ppt|pptx|rtf|txt|php|swf)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive"
</filesMatch>

If I understand it, the "noindex" and "noarchive" are intended for googlebot, bingbot, etc, although the "noarchive" might be redundant. As for the "nofollow", I somehow originally misread that as "nocache", so that was partly the cause of my confusion.

encyclo

12:50 pm on Jul 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cache-Control: max-age=2592000


So the header doesn't include the X-Robots-Tag or the X-Frame-Options headers, and the cache-control max-age is for 30 days, which isn't defined in the .htaccess either.

What is the deal exactly with the server technology that you are using? If the server response says it is nginx, where does the Apache part fit in? Are you using a CDN or other reverse proxy? Do the other rules work in other contexts, such as the X-Frame-Options header or cache-control (or even the bare domain to www redirect)?

lucy24

9:21 pm on Jul 12, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<topic drift>
on subsequent requests for the resource google sends the If-Modified-Since header.

Does it? I routinely see 304 responses on image requests, but never for pages-- and I don't have any content that changes daily. (It could happen automatically if you include material with a timestamp. I don't.)

:: detour to check a few random days of logged headers ::

6 July: 17 page requests ("I told you I had a tiny site, but you thought I was exaggerating"), of which 4 have the If-Modified-Since header
7 July: 27 page requests, of which 17 have the If-Modified-Since header

:: further detour to raw logs, because headers don't say what page they asked for (Bill? can I tweak the code to include this information?) ::

Huh. So it's about half-and-half with or without the If-Modified-Since header. This only applies to page requests, except that this batch happened to include one 404 for a css file, so the headers ("WITH") were logged when the server sent out the 404 page. Admittedly it's a tiny sample, but I really can't see any pattern. Where a particular page was requested more than once, the with/without was consistent-- but side-by-side pages in the same directory might go either way. This notably includes a few pages whose only English-language content is boilerplate.
</topic drift>

The genuine Googlebot never sent a cache-control header within these two days, though fake ones occasionally do.

kenroar

6:26 am on Jul 14, 2014 (gmt 0)

10+ Year Member



+lucy24
I notice you have a question mark after the docx and pptx.

<FilesMatch "\.(pdf|docx?|pptx?|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

I don't have any question marks in mine. (See top post.) Do these have to have question marks to work?

lucy24

7:37 am on Jul 14, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



docx? = doc|docx

Similarly
pptx? = ppt|pptx

A ? after a character means it's optional. (In other situations, it may have other meanings.) It makes things easier for the server because you're saying "now that you've matched 'ppt' there may or may not be a following x". So the server doesn't have to start the match all over again. It's especially a timesaver in situations like this, where it is hardly likely that you would have any extensions beginning in "ppt" that are neither ppt nor pptx.
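
A quick way to convince yourself that the two forms match the same files is a small Python check. Python's `re` module is not the regex engine Apache uses, but for a pattern this simple the behaviour should be the same; the constant and function names are made up for the example.

```python
import re

# The spelled-out form from the top post (minus php and swf) and the
# shorter form using an optional trailing "x".
LONG_FORM = re.compile(r"\.(pdf|doc|docx|ppt|pptx|rtf|txt)$")
SHORT_FORM = re.compile(r"\.(pdf|docx?|pptx?|rtf|txt)$")


def matches_long(filename):
    """True if filename ends in one of the spelled-out extensions."""
    return LONG_FORM.search(filename) is not None


def matches_short(filename):
    """True if filename matches the version with optional 'x'."""
    return SHORT_FORM.search(filename) is not None


def same_result(filename):
    """True if both patterns agree on whether filename matches."""
    return matches_long(filename) == matches_short(filename)
```

Feeding both patterns a mix of names (report.doc, report.docx, slides.pptx, index.html) should give the same answer from each.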

Oh, and, ahem, in the preceding post I was talking about number of requests specifically from the Googlebot ;) It looks pretty comical otherwise.