
Forum Moderators: Ocean10000 & phranque


Noindexing in the htaccess file

     
10:29 am on Jul 11, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:July 29, 2007
posts: 71
votes: 0


I have been trying to de-index all pdf, doc, docx, ppt, pptx, rtf, and txt files from the Apache server. Although I've added what I thought was the correct code, my files still appear in a Google search two months later. Can someone look at the following code and tell me what I'm doing wrong?

CheckCaseOnly On
CheckSpelling On
RewriteEngine on
# BEGIN GZIP
# mod_gzip compression (legacy, Apache 1.3)
<IfModule mod_gzip.c>
mod_gzip_on Yes
mod_gzip_dechunk Yes
mod_gzip_item_include file \.(html?|xml|txt|css|js)$
mod_gzip_item_include handler ^cgi-script$
mod_gzip_item_include mime ^text/.*
mod_gzip_item_include mime ^application/x-javascript.*
mod_gzip_item_exclude mime ^image/.*
mod_gzip_item_exclude rspheader ^Content-Encoding:.*gzip.*
</IfModule>
# END GZIP

# DEFLATE compression
<IfModule mod_deflate.c>
# Set compression for: html,txt,xml,js,css
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml application/xhtml+xml text/javascript text/css application/x-javascript
# Deactivate compression for buggy browsers
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
# Set header information for proxies
Header append Vary User-Agent
</IfModule>
# END DEFLATE

<filesMatch "\.(pdf|doc|docx|ppt|pptx|rtf|txt|php|swf)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive"
</filesMatch>

# 480 weeks
<filesMatch "\.(ico|flv|jpg|jpeg|png|gif|js|css)$">
Header set Cache-Control "max-age=290304000, public"
</filesMatch>

# 2 DAYS
<filesMatch "\.(xml|txt)$">
Header set Cache-Control "max-age=172800, public, must-revalidate"
</filesMatch>

# 2 HOURS
<filesMatch "\.(html|htm)$">
Header set Cache-Control "max-age=7200, must-revalidate"
</filesMatch>

#<IfModule mod_speling.c>
#CheckCaseOnly On
#CheckSpelling On
#</IfModule>

Header always append X-Frame-Options SAMEORIGIN

RewriteCond %{HTTP_HOST} ^incredibleart.org$
RewriteRule ^(.*)$ "http\:\/\/www\.incredibleart\.org\/$1" [R=301,L]
12:37 pm on July 11, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


Do the HTTP response headers show what you expect? i.e. is the X-Robots-Tag header actually present on those files?
2:31 pm on July 11, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3671
votes: 374


Here's what I use to prevent indexing of images:
<Files ~ "\.(gif|jpe?g|png)$">
Header append x-robots-tag "noindex"
</Files>

I've been using this for a long time, and as far as I know, it always works.

3:28 pm on July 11, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4558
votes: 363


A note - AFAIK, the Apache directives are case sensitive, and the casing you use may be contributing to how well they work (or don't). I haven't read up recently, so that may be old info. Rather than filesMatch, try FilesMatch.

You might need to also disable caching for those files:
#disable cache for script and noindex files

<FilesMatch "\.(cgi|doc|docx|fcgi|pdf|php|pl|ppt|pptx|rtf|txt|spl|scgi|swf)$">
Header unset Cache-Control
</FilesMatch>


Note: I have no idea how this might or might not affect the X-Robots-Tag "noarchive" directive.
4:30 pm on July 11, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3671
votes: 374


If you prevent indexing, wouldn't that automatically prevent caching and archiving?
4:45 pm on July 11, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


AFAIK, the Apache directives are case sensitive...


Directives in the configuration files are case-insensitive, but arguments to directives are often case sensitive.


Reference: [httpd.apache.org...]

"filesMatch" should work OK, although to be honest it does seem rare to see Apache directives that aren't correctly camel-cased. (?)

Maybe two months isn't long enough? How often does Google crawl these files? Have these files previously been cached?
5:08 pm on July 11, 2014 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 31, 2003
posts:9074
votes: 6


I think we need to step back a little from playing with the code or preventing caching (which may help the immediate problem, but only at the cost of unnecessarily increased server load and bandwidth). The .htaccess code mentioned should work in normal circumstances, so something else is causing difficulties.

The first question is in the first reply above, and this needs to be answered before anything else. Is the header actually being set? (Can you see it with a header checker?)

If the header is not there, then I would suspect that the files are not actually being served by the Apache server. This may well be because you have a reverse proxy (nginx) handling static files, so the .htaccess is not called in that situation.

Using the server header to control indexing is useful, but not a panacea. This is due to caching (which is almost always advantageous for static files) and proxies. It is usually better to place such files in robots.txt-excluded directories.
7:43 pm on July 11, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


If you prevent indexing, wouldn't that automatically prevent caching and archiving?

Archiving yes. That is: for all we know g### might keep everything in its archive and there's simply no way of knowing about it. To test, you'd have to remove a Noindex header and see whether you suddenly find archived results from before you removed the header.

Caching no, that's an entirely different process, not limited to robots.

Can the X-Robots-Tag header ever contain information other than "noindex"? If not, use "Header set" instead of "Header append".
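
As a minimal illustration of the difference (a sketch only, not something to copy wholesale):

# "set" replaces whatever X-Robots-Tag value is already on the response,
# so repeated matches cannot stack up duplicate values.
Header set X-Robots-Tag "noindex"
# "append" merges onto an existing header value with a comma; after this
# line the response would carry: X-Robots-Tag: noindex, noarchive
Header append X-Robots-Tag "noarchive"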
9:32 pm on July 11, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3671
votes: 374


Thanks Lucy
But I'm still a bit puzzled. Even if you explicitly say "noarchive", that wouldn't prevent Google from keeping a hidden copy either.

And apparently I don't understand the role of a cache. Why would Google need one if the page isn't in the index and a public cached copy can't be shown?
6:17 am on July 12, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:July 29, 2007
posts: 71
votes: 0


+aristotle Thanks for the code, but I'm not wanting to deindex images, only PDFs, docs, PPTs, etc.
6:29 am on July 12, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:July 29, 2007
posts: 71
votes: 0


This is an example of the response headers for one of the PowerPoints on the site:

SERVER RESPONSE: HTTP/1.1 200 OK
Server: nginx/1.4.0
Date: Sat, 12 Jul 2014 06:35:13 GMT
Content-Type: application/vnd.ms-powerpoint
Connection: keep-alive
Vary: Cookie,Host,X-Middleton-Mode
Set-Cookie: ezouid=1361990311; expires=Fri, 01-Jul-2016 06:35:13 GMT; path=/; domain=incredibleart.org; httponly
Last-Modified: Sat, 06 Dec 2003 22:35:20 GMT
Expires: Mon, 11 Aug 2014 06:35:13 GMT
Cache-Control: max-age=2592000
6:37 am on July 12, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:July 29, 2007
posts: 71
votes: 0


+encyclo
Placing those files in a robots-blocked directory sounds like it would solve the problem. However, it would take a lot of time to change the links to about 1,000 files. I was hoping to handle it all in one place, such as a rule in Apache.
9:06 am on July 12, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


And apparently I don't understand the role of a cache. Why would Google need one if the page isn't in the index and a public cached copy can't be shown?

The cache has nothing to do with search engines. It's what prevents your browser from having to make a fresh request for 17 images, three scripts and eleven stylesheets when you pull a page out of your history a few hours after your first visit. Or, for that matter, what prevents your browser from making sixty image requests-- most of them duplicates-- when you load up the present page. (Yes, OK, I counted, so shoot me.)

Placing those files in a robots-blocked directory sounds like it would solve the problem. However, it would take a lot of time to change the links to about 1,000 files.

Oh, no, I don't think anyone envisioned moving files. I think the assumption was that you've already got your assorted non-page files in subdirectories that can then be roboted-out. It's not worth moving things otherwise.

You've got a <FilesMatch> envelope, right?

<FilesMatch "\.(pdf|docx?|pptx?|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Only don't quote me, because it isn't easy to cut-and-paste from beginning to end of a thread. Mine's only js|txt|xml.

Make your own decisions about .doc and .pdf. Google is able to index these-- or at least .pdf, I don't know about .doc-- so you have to think about whether indexing them is useful to you. If you don't want people heading straight for your PDFs without giving some html page a look-in first, then yes, you want a noindex.
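
Purely as a hypothetical variant, not a recommendation either way: if you decided the PDFs should stay indexable but the rest shouldn't, you would simply drop pdf from the alternation:

# Hypothetical variant: leave PDFs indexable, noindex the Office/text formats
<FilesMatch "\.(docx?|pptx?|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>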
9:53 am on July 12, 2014 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11869
votes: 244


apparently I don't understand the role of a cache. Why would Google need one if the page isn't in the index and a public cached copy can't be shown?


google keeps a cached version so it knows about the noindex/noarchive directive.
on subsequent requests for the resource google sends the If-Modified-Since header.
if you change your mind about the noindex/noarchive, google would get a fresh version to cache.
10:03 am on July 12, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2006
posts: 3153
votes: 7


... server header for one of the PowerPoints ...


And the X-Robots-Tag header is missing. I assume these files are either ".ppt" or ".pptx"?

By itself, your code in .htaccess should work, so it would appear it isn't even being executed? (Or something is overriding it?)

Maybe what encyclo mentioned has something to do with the problem:

Server: nginx/1.4.0


encyclo: If the header is not there, then I would suspect that the files are not actually being served by the Apache server. This may well be because you have a reverse proxy (nginx) handling static files, so the .htaccess is not called in that situation.


?
12:11 pm on July 12, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member aristotle is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 4, 2008
posts:3671
votes: 374


Lucy and phranque
Thanks for the replies. Actually what prompted my questions was the OP's code:
<filesMatch "\.(pdf|doc|docx|ppt|pptx|rtf|txt|php|swf)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive"
</filesMatch>

If I understand it, the "noindex" and "noarchive" are intended for googlebot, bingbot, etc., although the "noarchive" might be redundant. As for the "nofollow", I somehow originally misread that as "nocache", so that was partly the cause of my confusion.
12:50 pm on July 12, 2014 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 31, 2003
posts:9074
votes: 6


Cache-Control: max-age=2592000


So the response doesn't include the X-Robots-Tag or X-Frame-Options headers, and the Cache-Control max-age is 30 days, which isn't defined in the .htaccess either.
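
For comparison, and purely as a hypothetical sketch: if Apache itself were setting that 30-day lifetime, you would expect to find something like the following (mod_expires) in the configuration, and it isn't there:

# Hypothetical only: a 30-day max-age set from Apache itself would look
# roughly like this via mod_expires. Nothing of the sort is in the posted
# .htaccess, so the Cache-Control on the .ppt response is probably being
# added in front of Apache (e.g. by the nginx layer).
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType application/vnd.ms-powerpoint "access plus 30 days"
</IfModule>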

What is the deal exactly with the server technology you are using? If the server response says it is nginx, where does the Apache part fit in? Are you using a CDN or other reverse proxy? Do the other rules work in other contexts, such as the X-Frame-Options header or Cache-Control (or even the bare-domain to www redirect)?
9:21 pm on July 12, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


<topic drift>
on subsequent requests for the resource google sends the If-Modified-Since header.

Does it? I routinely see 304 responses on image requests, but never for pages-- and I don't have any content that changes daily. (It could happen automatically if you include material with a timestamp. I don't.)

:: detour to check a few random days of logged headers ::

6 July: 17 page requests ("I told you I had a tiny site, but you thought I was exaggerating"), of which 4 have the If-Modified-Since header
7 July: 27 page requests, of which 17 have the If-Modified-Since header

:: further detour to raw logs, because headers don't say what page they asked for (Bill? can I tweak the code to include this information?) ::

Huh. So it's about half-and-half with or without the If-Modified-Since header. This only applies to page requests, except that this batch happened to include one 404 for a css file, so the headers ("WITH") were logged when the server sent out the 404 page. Admittedly it's a tiny sample, but I really can't see any pattern. Where a particular page was requested more than once, the with/without was consistent-- but side-by-side pages in the same directory might go either way. This notably includes a few pages whose only English-language content is boilerplate.
</topic drift>

The genuine Googlebot never sent a cache-control header within these two days, though fake ones occasionally do.
6:26 am on July 14, 2014 (gmt 0)

Junior Member

10+ Year Member

joined:July 29, 2007
posts: 71
votes: 0


+lucy24
I notice you have a question mark after the docx and pptx.

<FilesMatch "\.(pdf|docx?|pptx?|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

I don't have any question marks in mine (see top post). Do these have to have question marks to work?
7:37 am on July 14, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


docx? = doc|docx

Similarly, pptx? = ppt|pptx

A ? after a character means it's optional. (In other situations, it may have other meanings.) It makes things easier for the server because you're saying "now that you've matched 'ppt' there may or may not be a following x". So the server doesn't have to start the match all over again. It's especially a timesaver in situations like this, where it is hardly likely that you would have any extensions beginning in "ppt" that are neither ppt nor pptx.
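
So no, the question marks aren't required, they're just shorthand. As a sketch (with the same caveat about not copy-pasting from mid-thread), these two envelopes match exactly the same file names:

# These two patterns match exactly the same file names; the question
# marks are just shorthand, not a requirement:
#   "\.(pdf|docx?|pptx?|rtf|txt)$"
#   "\.(pdf|doc|docx|ppt|pptx|rtf|txt)$"
<FilesMatch "\.(pdf|doc|docx|ppt|pptx|rtf|txt)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>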

Oh, and, ahem, in the preceding post I was talking about number of requests specifically from the Googlebot ;) It looks pretty comical otherwise.