


Should robots.txt be 301'd on an HTTPS-only site?

     
1:05 am on Mar 5, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


Bit of a tricky one here, so going to try my best to explain it clearly!

Basically we launched an HTTPS-only site, [example.com...]
This site uses an NGINX reverse proxy in front of the webapp, with a 301 from http to https on all URLs. No http is allowed, and we send HSTS headers on every response.
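For reference, the front end boils down to roughly this (simplified sketch; real hostnames, certificate paths and cipher settings differ):

# Simplified sketch of the HTTPS-only front end: plain http is 301'd,
# https traffic gets HSTS and is proxied to the webapp. Paths are placeholders.
server {
    listen 80;
    server_name www.example.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl;
    server_name www.example.com;

    ssl_certificate     /etc/nginx/ssl/example.pem;   # placeholder paths
    ssl_certificate_key /etc/nginx/ssl/example.key;

    # HSTS for two years on every response
    add_header Strict-Transport-Security max-age=63072000;

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # hand everything to the Node.js webapp
        proxy_pass http://127.0.0.1:3000;
    }
}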

Now we tried to look at backlinks in Majestic, and as you know this requires site verification.

OK, no problem, so we add the site (the https URL, btw), but then it complains that http://www.example.com/robots.txt does not exist. Which is strictly true, since it's 301'd to https.

We also want to submit to Google Chrome's HSTS preload list, [hstspreload.appspot.com...] and one requirement is that all traffic should be on https.

How might one go about verifying the site with Majestic? And should we keep a 301 on the http version of robots.txt, or make an exception (and possibly break the ALL_HTTPS rule)?

Any of you bright folks out there have some wisdom to offer on this?
12:06 pm on Mar 5, 2015 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 21, 2004
posts:3494
votes: 380


I try to make my SEO vendors happy unless it conflicts with Google. I try to make Google happy unless it conflicts with my users. I try to make my users happy unless it conflicts with my profit & business goals.

In other words, I wouldn't bother making Majestic happy, and I would try to comply with the all-https rule for Google, because it could provide a better user experience and hopefully lead to bigger profits.
2:12 pm on Mar 5, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


Untested, but something like this should work in NGINX.

server {
    listen 80;
    root /var/www;

    set $mj12 '0';
    if ($http_user_agent ~ 'MJ12bot') {
        set $mj12 '1';
    }

    location = /robots.txt {
        if ($mj12 = '1') {
            # try_files isn't allowed inside "if", so rewrite to the
            # MJ12bot-specific file instead
            rewrite ^ /robots-mj12.txt break;
        }
    }
}

Edit: grr. code tags don't preserve indentation. switch to pre tags.
8:33 pm on Mar 5, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


i don't think olly needs a unique robots.txt file served to MJ12bot, but rather to handle MJ12bot's http request.
9:27 pm on Mar 5, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


i don't think olly needs a unique robots.txt file served to MJ12bot, but rather to handle MJ12bot's http request.


Well, one thing he's saying is that MJ12bot will not follow the 301 redirect for robots.txt. It wants to be able to pull robots.txt via http.

One way around that is to have the nginx proxy serve MJ12bot actual content for that one URL, rather than a 301. That allows them to open a tiny http hole for the broken MJ12BOT, versus doing something more drastic. It would leave the site "https only" for all other traffic, which seems to be desired.
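Roughly, and untested, a variant of the earlier sketch using nginx's map directive instead of set/if (the robots-mj12.txt root path and hostname are just placeholders):

# in the http block: flag Majestic's UA once
map $http_user_agent $is_mj12 {
    default  0;
    ~MJ12bot 1;
}

server {
    listen 80;
    server_name www.example.com;
    root /var/www;

    location = /robots.txt {
        # everyone except MJ12bot still gets the 301 to https
        if ($is_mj12 = 0) {
            return 301 https://$server_name$request_uri;
        }
        # MJ12bot falls through and is served /var/www/robots.txt over http
    }

    location / {
        return 301 https://$server_name$request_uri;
    }
}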
12:37 am on Mar 6, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


That allows them to open a tiny http hole for the broken MJ12BOT

can't you do that without internally rewriting to a unique filename?
what olly really needs is to exclude robots.txt requests by MJ12bot from the 301 redirect from http to https.

olly, let's see your nginx configuration directives for the 301 redirect from http to https.
5:04 am on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


can't you do that without internally rewriting to a unique filename?
what olly really needs is to exclude robots.txt requests by MJ12bot from the 301 redirect from http to https.

There are many ways to solve it. I offered one. If it so happens that the nginx proxy and the end https server run on the same machine, then it could serve up the same robots.txt file.
5:23 am on Mar 6, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11824
votes: 238


If it so happens that the nginx proxy and the end https server run on the same machine,

i don't have much nginx experience, but i would assume an internal rewrite (try_files in this specific case) would only work on the same machine.
6:33 am on Mar 6, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15818
votes: 853


<tangent>
Do you really want to encourage the MJ12bot's quirks? On my site it's got a solid record of requesting garbled filenames and imaginary directories; if your robots.txt file is https they can jolly well ask for it in that form. See if you can make contact with a human. Yours can't possibly be the only site with this issue.
</tangent>

[edited by: engine at 8:50 am (utc) on Mar 6, 2015]
[edit reason] fixed typo at poster request [/edit]

7:19 am on Mar 6, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


hey, thanks everyone for the ideas, looks like we have something to think about.

phranque, here is our nginx.conf. (ps: cheers for the tip about pre tags, rish3)


http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Default is 1024, Digital Ocean suggests 2048
    types_hash_max_size 2048;

    # Increase default request size from 1mb to 100mb, to account for large documents
    client_max_body_size 100m;

    #log_format main '$remote_addr - $remote_user [$time_local] "$request" '
    #                '$status $body_bytes_sent "$http_referer" '
    #                '"$http_user_agent" "$http_x_forwarded_for"';

    #access_log logs/access.log main;

    sendfile on;
    #tcp_nopush on;

    #keepalive_timeout 0;
    keepalive_timeout 65;

    #gzip on;

    # reduce the data that needs to be sent over the network
    gzip on;
    gzip_min_length 10240;
    gzip_proxied expired no-cache no-store private auth;
    gzip_types text/plain text/css text/xml text/javascript application/x-javascript application/xml;
    gzip_disable "MSIE [1-6]\.";

    # don't display server version for security
    server_tokens off;

    # request timed out -- default 60
    client_body_timeout 10;

    # Define your "upstream" servers - the
    # servers requests will be sent to
    upstream app_example {
        least_conn;              # Use Least Connections strategy
        server 127.0.0.1:3000;   # NodeJS Server 1
        # server 127.0.0.1:9001; # NodeJS Server 2
        # server 127.0.0.1:9002; # NodeJS Server 3
    }

    server {
        listen 80;
        server_name www.example.com;
        return 301 https://$server_name$request_uri;
    }

    # Define the Nginx server
    # This will proxy any non-static directory
    server {
        listen 443;
        server_name localhost;

        access_log /var/log/nginx/example.com-access.log;
        error_log /var/log/nginx/example.com-error.log error;

        ## Google PageSpeed Configuration ##

        # DISABLED UNTIL NEEDED

        # pagespeed on;
        # pagespeed ForceCaching on;
        # pagespeed FileCachePath /var/cache/pagespeed;

        # Ensure requests for pagespeed optimized resources go to the pagespeed handler
        # and no extraneous headers get set.
        # location ~ "\.pagespeed\.([a-z]\.)?[a-z]{2}\.[^.]{10}\.[^.]+" {
        #     add_header "" "";
        # }
        # location ~ "^/pagespeed_static/" { }
        # location ~ "^/ngx_pagespeed_beacon$" { }


        ## SSL Configuration ##
        ssl on;

        ssl_certificate /etc/nginx/ssl/example-bundle.pem;           # example-digicert-v2.pem + DigiCertSHA2SecureServerCA.pem
        ssl_certificate_key /etc/nginx/ssl/example-digicert-v2.key;  # private key; no password
        ssl_trusted_certificate /etc/nginx/ssl/oscp.pem;             # Contains DigiCertSHA2SecureServerCA.pem only

        # Strictest ciphers. Disabled for now
        # ssl_ciphers 'AES128+EECDH:AES128+EDH:!aNULL';

        # Ciphers for backwards compatibility
        ssl_ciphers "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";

        # Disable SSLv2 and SSLv3 - considered insecure
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_session_cache shared:SSL:10m;

        # enables server-side protection from BEAST attacks
        # http://blog.ivanristic.com/2013/09/is-beast-still-a-threat.html
        ssl_prefer_server_ciphers on;

        ssl_dhparam /etc/nginx/ssl/dhparam.pem;

        # OCSP Stapling
        # As per discussion here [raymii.org...]
        ssl_stapling on;
        ssl_stapling_verify on;
        resolver 8.8.8.8 8.8.4.4 valid=300s;
        resolver_timeout 5s;

        # Set HSTS for two years
        add_header Strict-Transport-Security max-age=63072000;

        # Prevent loading in a frame to deny clickjacking attempts
        # [developer.mozilla.org...]
        add_header X-Frame-Options DENY;


        # Browsers and robots always look for these
        # Turn off logging for them
        #location = /favicon.ico { log_not_found off; access_log off; }
        #location = /robots.txt { log_not_found off; access_log off; }

        # Handle static files so they are not proxied to NodeJS
        # You may want to also hand these requests to other upstream
        # servers, as you can define more than one!
        location ~ ^/(images/|img/|javascript/|js/|css/|stylesheets/|flash/|media/|static/|robots.txt|humans.txt|favicon.ico) {
            root /var/www/example-web; # note this contains the /src/dist folder in the web root
            expires 365d;
        }

        # pass the request to the node.js server
        # with some correct headers for proxy-awareness
        location / {
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_set_header X-NginX-Proxy true;

            proxy_pass http://localhost:3000/;
            proxy_redirect off;

            # Handle WebSocket connections
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }

[edited by: phranque at 9:46 am (utc) on Mar 6, 2015]
[edit reason] inserted line breaks [/edit]

11:51 am on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


i don't have much nginx experience, but i would assume an internal rewrite (try_files in this specific case) would only work on the same machine.

Huh? If they aren't the same machine, the idea is that you're placing a single file on the proxy machine...one that gets served up only for a single url from a single matching UA. All other requests go however they were going.

If it happens to be the same machine that file could be the original robots.txt.
12:14 pm on Mar 6, 2015 (gmt 0)

Preferred Member

5+ Year Member Top Contributors Of The Month

joined:May 24, 2012
posts:648
votes: 2


Ok, so with the context of the original question, what's wanted is:

  • 100% https, other than serving a 301 from every http request to the same URL, but as https

The issue here is that most crawlers are okay with the 301 that's returned when they request robots.txt from the http (not https) server. The Majestic crawler is not, at least for verification. It wants to get the robots.txt file via http, without being redirected.

Your choices, then, are either:

  • Break your 100% https rule, and serve up robots.txt to every robot, without doing a 301, when they ask for it over unsecure http
    --or--
  • Make an exception where the robots.txt file is served up over unsecure http, but only for requests where the UA looks like the MJ12 bot, and only when it requests robots.txt.

Based on your config above, both the http and https server are running on the same machine. The configuration for the http server is very brief:

server {
    listen 80;
    server_name www.example.com;
    return 301 https://$server_name$request_uri;
}

Either direction would consist of adding a "location = /robots.txt" section to that, and handling the request in some way other than returning a 301. You could either restrict that to just the MJ12 bot's UA or not. You could use try_files, or, as shown in the existing https section of your config, set the root directory, but only for that specific "location = /robots.txt" section.
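A rough, untested sketch of the second (UA-restricted) option, reusing the root path from your https config:

server {
    listen 80;
    server_name www.example.com;

    location = /robots.txt {
        # serve the real robots.txt over http, but only to MJ12bot
        root /var/www/example-web;
        if ($http_user_agent !~ MJ12bot) {
            return 301 https://$server_name$request_uri;
        }
    }

    # the catch-all 301 moves into a location block so that the
    # robots.txt exception above can match first
    location / {
        return 301 https://$server_name$request_uri;
    }
}

Note the server-level "return 301" has to move into "location /", otherwise it runs before any location is matched and the exception never fires.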
1:46 pm on Mar 6, 2015 (gmt 0)

Senior Member

joined:Mar 8, 2002
posts: 2897
votes: 0


Just set up the verification file and put in a support ticket. They can manually authorize it via Majestic support. :)
9:06 pm on Mar 6, 2015 (gmt 0)

Junior Member

10+ Year Member

joined:June 9, 2009
posts: 110
votes: 0


Ok, so with the context of the original question, what's wanted is:
100% https, other than serving a 301 from every http request to the same URL, but as https


This is great, thanks rish3.

While I'm not a fan of breaking the https-only rule, I imagine reducing the possibility space to a minimum (using the neat conditionals hack) would be preferable.

Just set up the verification file and put in a support ticket. They can manually authorize it via Majestic support. :)


Oh wow, how did I not think of this? Hoping we can take the easy route here :)
