


Should robots.txt be 301'd on an HTTPS-only site?

     
1:05 am on Mar 5, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


Bit of a tricky one here, so going to try my best to explain it clearly!

Basically we launched an HTTPS-only site, [example.com...]
This site uses an NGINX reverse proxy in front of the webapp, with a 301 from http to https on all URLs. No http is allowed, and we send HSTS headers on every response.
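For reference, the front end boils down to roughly this (simplified sketch; real hostnames, certificate paths and cipher settings differ):

# Simplified sketch of the HTTPS-only front end: plain http is 301'd,
# https traffic gets HSTS and is proxied to the webapp. Paths are placeholders.
server {
    listen 80;
    server_name www.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    server_name www.example.com;

    ssl_certificate     /etc/nginx/ssl/example.pem;   # placeholder paths
    ssl_certificate_key /etc/nginx/ssl/example.key;

    # HSTS for two years on every response
    add_header Strict-Transport-Security max-age=63072000;

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # hand everything to the Node.js webapp
        proxy_pass http://127.0.0.1:3000;
    }
}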

Now we tried to look at backlinks in Majestic, and as you know this requires site verification.

OK, no problem, so we add the site (the https URL, btw), but then it complains that http://www.example.com/robots.txt does not exist. Which is strictly true, since it's 301'd to https.

We also want to submit to Google Chrome's HSTS preload list, [hstspreload.appspot.com...] and one requirement is that all traffic should be on https.

How might one go about verifying the site with Majestic? And should we keep a 301 on the http version of robots.txt, or make an exception (and possibly break the ALL_HTTPS rule)?

Any of you bright folks out there have some wisdom to offer on this?
12:06 pm on Mar 5, 2015 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3494
votes: 380


I try to make my SEO vendors happy unless it conflicts with Google. I try to make Google happy unless it conflicts with my users. I try to make my users happy unless it conflicts with my profit & business goals.

In other words, I wouldn't bother making Majestic happy, and I would try to comply with the all-https rule for Google, because it could provide a better user experience and hopefully lead to bigger profits.
2:12 pm on Mar 5, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


Untested, but something like this should work in NGINX.

server {
    listen 80;
    root /var/www;

    set $mj12 '0';
    if ($http_user_agent ~ 'MJ12bot') {
        set $mj12 '1';
    }

    location = /robots.txt {
        if ($mj12 = '1') {
            # try_files isn't allowed inside "if", so rewrite to the
            # MJ12bot-specific file instead
            rewrite ^ /robots-mj12.txt break;
        }
    }
}

Edit: grr. code tags don't preserve indentation. switch to pre tags.
8:33 pm on Mar 5, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


i don't think olly needs a unique robots.txt file served to MJ12bot, but rather to handle MJ12bot's http request.
9:27 pm on Mar 5, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


i don't think olly needs a unique robots.txt file served to MJ12bot, but rather to handle MJ12bot's http request.


Well, one thing he's saying is that MJ12bot will not follow the 301 redirect for robots.txt. It wants to be able to pull robots.txt via http.

One way around that is to have the nginx proxy serve MJ12bot actual content for that one URL, rather than a 301. That allows them to open a tiny http hole for the broken MJ12BOT, versus doing something more drastic. It would leave the site "https only" for all other traffic, which seems to be desired.
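Roughly, and untested, a variant of the earlier sketch using nginx's map directive instead of set/if (the robots-mj12.txt root path and hostname are just placeholders):

# in the http block: flag Majestic's UA once
map $http_user_agent $is_mj12 {
    default  0;
    ~MJ12bot 1;
}

server {
    listen 80;
    server_name www.example.com;
    root /var/www;

    location = /robots.txt {
        # everyone except MJ12bot still gets the 301 to https
        if ($is_mj12 = 0) {
            return 301 https://$server_name$request_uri;
        }
        # MJ12bot falls through and is served /var/www/robots.txt over http
    }

    location / {
        return 301 https://$server_name$request_uri;
    }
}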
12:37 am on Mar 6, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


That allows them to open a tiny http hole for the broken MJ12BOT

can't you do that without internally rewriting to a unique filename?
what olly really needs is to exclude robots.txt requests by MJ12bot from the 301 redirect from http to https.

olly, let's see your nginx configuration directives for the 301 redirect from http to https.
5:04 am on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


can't you do that without internally rewriting to a unique filename?
what olly really needs is to exclude robots.txt requests by MJ12bot from the 301 redirect from http to https.

There are many ways to solve it. I offered one. If it so happens that the nginx proxy and the end https server run on the same machine, then it could serve up the same robots.txt file.
5:23 am on Mar 6, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


If it so happens that the nginx proxy and the end https server run on the same machine,

i don't have much nginx experience, but i would assume an internal rewrite (try_files in this specific case) would only work on the same machine.
6:33 am on Mar 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15818
votes: 853


<tangent>
Do you really want to encourage the MJ12bot's quirks? On my site it's got a solid record of requesting garbled filenames and imaginary directories; if your robots.txt file is https they can jolly well ask for it in that form. See if you can make contact with a human. Yours can't possibly be the only site with this issue.
</tangent>

[edited by: engine at 8:50 am (utc) on Mar 6, 2015]
[edit reason] fixed typo at poster request [/edit]

7:19 am on Mar 6, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


hey, thanks everyone for the ideas, looks like we have something to think about.

phranque, here is our nginx.conf. (ps: cheers for the tip about pre tags, rish3)


http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Default is 1024, Digital Ocean suggests 2048
    types_hash_max_size 2048;

    # Increase default request size from 1mb to 100mb, to account for large documents
    client_max_body_size 100m;

    #log_format main '$remote_addr - $remote_user [$time_local] "$request" '
    #                '$status $body_bytes_sent "$http_referer" '
    #                '"$http_user_agent" "$http_x_forwarded_for"';

    #access_log logs/access.log main;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;

    #gzip on;

    # reduce the data that needs to be sent over the network
    gzip on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml;
    gzip_disable "MSIE [1-6]\.";

    # don't display server version for security
    server_tokens off;

    # request timed out -- default 60
    client_body_timeout 10;

    # Define your "upstream" servers - the
    # servers requests will be sent to
    upstream app_example {
        least_conn;              # Use Least Connections strategy
        server 127.0.0.1:3000;   # NodeJS Server 1
        # server 127.0.0.1:9001; # NodeJS Server 2
        # server 127.0.0.1:9002; # NodeJS Server 3
    }

    server {
        listen 80;
        server_name www.example.com;
        return 301 https://$server_name$request_uri;
    }

    # Define the Nginx server
    # This will proxy any non-static directory
    server {
        listen 443;
        server_name localhost;

        access_log /var/log/nginx/example.com-access.log;
        error_log /var/log/nginx/example.com-error.log error;

        ## Google PageSpeed Configuration ##

        # DISABLED UNTIL NEEDED

        # pagespeed on;
        # pagespeed ForceCaching on;
        # pagespeed FileCachePath /var/cache/pagespeed;

        # Ensure requests for pagespeed optimized resources go to the pagespeed handler
        # and no extraneous headers get set.
        # location ~ "\.pagespeed\.([a-z]\.)?[a-z]{2}\.[^.]{10}\.[^.]+" {
        #     add_header "" "";
        # }
        # location ~ "^/pagespeed_static/" { }
        # location ~ "^/ngx_pagespeed_beacon$" { }


        ## SSL Configuration ##
        ssl on;

        ssl_certificate /etc/nginx/ssl/example-bundle.pem;           # example-digicert-v2.pem + DigiCertSHA2SecureServerCA.pem
        ssl_certificate_key /etc/nginx/ssl/example-digicert-v2.key;  # private key; no password
        ssl_trusted_certificate /etc/nginx/ssl/oscp.pem;             # Contains DigiCertSHA2SecureServerCA.pem only

        # Strictest ciphers. Disabled for now
        # ssl_ciphers 'AES128+EECDH:AES128+EDH:!aNULL';

        # Ciphers for backwards compatibility
        ssl_ciphers "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";

        # Disable SSLv2 and SSLv3 - considered insecure
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_session_cache shared:SSL:10m;

        # enables server-side protection from BEAST attacks
        # http://blog.ivanristic.com/2013/09/is-beast-still-a-threat.html
        ssl_prefer_server_ciphers on;

        ssl_dhparam /etc/nginx/ssl/dhparam.pem;

        # OCSP Stapling
        # As per discussion here [raymii.org...]
        ssl_stapling on;
        ssl_stapling_verify on;
        resolver 8.8.8.8 8.8.4.4 valid=300s;
        resolver_timeout 5s;

        # Set HSTS for two years
        add_header Strict-Transport-Security max-age=63072000;

        # Prevent loading in a frame to deny clickjacking attempts
        # [developer.mozilla.org...]
        add_header X-Frame-Options DENY;


        # Browsers and robots always look for these
        # Turn off logging for them
        #location = /favicon.ico { log_not_found off; access_log off; }
        #location = /robots.txt { log_not_found off; access_log off; }

        # Handle static files so they are not proxied to NodeJS
        # You may want to also hand these requests to other upstream
        # servers, as you can define more than one!
        location ~ ^/(images/|img/|javascript/|js/|css/|stylesheets/|flash/|media/|static/|robots.txt|humans.txt|favicon.ico) {
            root /var/www/example-web; # note this contains the /src/dist folder in the web root
            expires 365d;
        }

        # pass the request to the node.js server
        # with some correct headers for proxy-awareness
        location / {
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_set_header X-NginX-Proxy true;

            proxy_pass http://localhost:3000/;
            proxy_redirect off;

            # Handle WebSocket connections
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }

[edited by: phranque at 9:46 am (utc) on Mar 6, 2015]
[edit reason] inserted line breaks [/edit]

11:51 am on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


i don't have much nginx experience, but i would assume an internal rewrite (try_files in this specific case) would only work on the same machine.

Huh? If they aren't the same machine, the idea is that you're placing a single file on the proxy machine...one that gets served up only for a single url from a single matching UA. All other requests go however they were going.

If it happens to be the same machine that file could be the original robots.txt.
12:14 pm on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


Ok, so with the context of the original question, what's wanted is:

  • 100% https, other than serving a 301 from every http request to the same URL, but as https

The issue here is that most crawlers are okay with the 301 that's returned when they request robots.txt from the http (not https) server. The Majestic crawler is not, at least for verification. It wants to get the robots.txt file via http, without being redirected.

Your choices, then, are either:

  • Break your 100% https rule, and serve up robots.txt to every robot, without doing a 301, when they ask for it over unsecure http
    --or--
  • Make an exception where the robots.txt file is served up over unsecure http, but only for requests where the UA looks like the MJ12 bot, and only when it requests robots.txt.

Based on your config above, both the http and https server are running on the same machine. The configuration for the http server is very brief:

server {
    listen 80;
    server_name www.example.com;
    return 301 https://$server_name$request_uri;
}

Either direction would consist of adding a "location = /robots.txt" section to that, and handling the request in some way other than returning a 301. You could either restrict that to just the MJ12 bot's UA or not. You could use try_files, or, as shown in the existing https section of your config, set the root directory, but only for that specific "location = /robots.txt" section.
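A rough, untested sketch of the second (UA-restricted) option, reusing the root path from your https config:

server {
    listen 80;
    server_name www.example.com;

    location = /robots.txt {
        # serve the real robots.txt over http, but only to MJ12bot
        root /var/www/example-web;
        if ($http_user_agent !~ MJ12bot) {
            return 301 https://$server_name$request_uri;
        }
    }

    # the catch-all 301 moves into a location block so that the
    # robots.txt exception above can match first
    location / {
        return 301 https://$server_name$request_uri;
    }
}

Note the server-level "return 301" has to move into "location /", otherwise it runs before any location is matched and the exception never fires.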
1:46 pm on Mar 6, 2015 (gmt 0)

Senior Member

joined:Mar 8, 2002
posts: 2897
votes: 0


Just set up the verification file and put in a support ticket. They can manually authorize it via Majestic support. :)
9:06 pm on Mar 6, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


Ok, so with the context of the original question, what's wanted is:
100% https, other than serving a 301 from every http request to the same URL, but as https


This is great, thanks rish3.

While I'm not a fan of breaking the https-only rule, I imagine reducing the possibility space to a minimum (using the neat conditionals hack) would be preferable.

Just set up the verification file and put in a support ticket. They can manually authorize it via Majestic support. :)


Oh wow, how did I not think of this? Hoping we can take the easy route here :)
