
Hosting robots.txt on one server and web content on another server.

Taking control of how my partnership website is being indexed.

     
3:57 pm on May 10, 2016 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 12, 2003
posts:1106
votes: 5


I need to host the robots.txt file on a server that I control, while the actual web content lives on a server whose content I don't control.

I have full control of the domain name in question and have full access to DNS settings.

What I'm doing is legal. I am trying to prove a point to another business that we are in a partnership with, and I just need some hard numbers to convince them of the facts. Please keep this discussion technical.

I'm looking for a simple but efficient way of doing this.
4:25 pm on May 10, 2016 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


So if I understand correctly: it is the same domain (the same host name), but two different physical servers?

And the request for http://www.example.com/robots.txt would go to the server you control, while the request for http://www.example.com/ (i.e. the domain root) would go to that other server you do not control?
4:29 pm on May 10, 2016 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member andy_langton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 27, 2003
posts:3332
votes: 140


You can proxy the website requests so that you can modify them before they're delivered to the user (think Cloudflare). You then modify requests for robots.txt only and pass through requests for everything else.
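For example, with Apache and mod_proxy the split takes just a couple of directives (a rough sketch; the backend address 203.0.113.10 is a placeholder):

    # Rough sketch (Apache with mod_proxy/mod_proxy_http loaded).
    # Keep the Host header so the content site sees the original domain.
    ProxyPreserveHost On

    # The "!" exclusion must come before the general ProxyPass: it tells
    # mod_proxy NOT to forward /robots.txt, so the local copy in
    # DocumentRoot is served instead.
    ProxyPass /robots.txt !

    # Everything else passes straight through to the content server.
    ProxyPass        / http://203.0.113.10/
    ProxyPassReverse / http://203.0.113.10/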
4:36 pm on May 10, 2016 (gmt 0)

Moderator from US 

WebmasterWorld Administrator lifeinasia is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 10, 2005
posts:5847
votes: 195


Offhand, the only way I see of doing it is to use a reverse proxy (like Pound) to send web traffic to the main site and requests for robots.txt to your server (you would change the DNS settings to point to your proxy server instead of to the IP of the content site).
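A rough sketch of what that Pound config might look like (all addresses are placeholders; Pound tries Service blocks in order, so the robots.txt service is listed first):

    # Rough sketch of a Pound config; all addresses are placeholders.
    ListenHTTP
        Address 0.0.0.0
        Port    80

        # Tried first: requests for robots.txt go to the server we control.
        Service
            URL "^/robots\.txt$"
            BackEnd
                Address 10.0.0.2       # our server
                Port    80
            End
        End

        # Catch-all: everything else goes to the content site.
        Service
            BackEnd
                Address 203.0.113.10   # the content server
                Port    80
            End
        End
    End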

However, unless the content site is set up correctly, the IP of your proxy may be logged instead of the actual IP addresses of the visitors. And changing the IP of the domain sometimes causes dips in the SERPs.

"What I'm doing is legal. Please keep this discussion technical."
Technically, it may sound legal. But unless you have the express permission of all parties involved (and a statement of their understanding that there may be some negative consequences of this "test"), I wouldn't call it that.
5:01 pm on May 10, 2016 (gmt 0)

Administrator from CA 

WebmasterWorld Administrator bakedjake is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 8, 2003
posts:3883
votes: 61


If you're using a web framework that supports routing handlers (like bottle.py in Python, for example), you could simply have a route for /robots.txt that uses a different handler than the rest of the website.

I do this a lot for static content of all types, using routes whenever I need to manipulate the content for some reason rather than reading it directly off the filesystem. Another (perhaps more practical?) reason for doing something like this is to serve the proper images to retina and non-retina clients without JavaScript trickery or loading images twice.

You'd just configure the /robots.txt handler to point to a script/function that either reverse-proxies the request directly, fetches the robots.txt with a client library and regurgitates it, or queries a remote DB you maintain and have it stored in. Really, you can generate the response any way you want.
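For instance, a minimal sketch of the reverse-proxy option in bottle.py (the backend address and the robots.txt body are placeholders; error handling, query strings, and POSTs are skipped for brevity):

    # Rough sketch with bottle.py: a dedicated route overrides /robots.txt,
    # a catch-all route regurgitates everything else from the backend.
    import urllib.request
    from bottle import Bottle, HTTPResponse, run

    BACKEND = "http://203.0.113.10"  # placeholder for the content server

    app = Bottle()

    @app.route("/robots.txt")
    def robots():
        # The handler we control: return whatever rules we need.
        return HTTPResponse(body="User-agent: *\nAllow: /\n",
                            status=200,
                            headers={"Content-Type": "text/plain"})

    @app.route("/")
    @app.route("/<path:path>")
    def passthrough(path=""):
        # Fetch the same path from the backend and pass it along.
        with urllib.request.urlopen(BACKEND + "/" + path) as upstream:
            ctype = upstream.headers.get("Content-Type", "text/html")
            return HTTPResponse(body=upstream.read(),
                                status=upstream.status,
                                headers={"Content-Type": ctype})

    if __name__ == "__main__":
        run(app, host="0.0.0.0", port=8080)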
5:49 pm on May 10, 2016 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 12, 2003
posts:1106
votes: 5


Wow, that was quick. Several good ideas I didn't even think of.

Nothing sinister about what we are doing. We are testing out display ads, and display ads require that the regular Googlebot be able to crawl the image and landing page, unlike regular text AdWords (where the special Google ads bot simply ignores the robots.txt file). The manufacturer is hosting our website, due to security concerns and an unhealthy dose of paranoia, so even getting simple changes done takes forever. We are not sure product display ads are going to be a success, so we don't want to create unnecessary work for the manufacturer.
9:41 pm on May 10, 2016 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 12, 2003
posts:1106
votes: 5


Eureka!

I have decided to go with NGINX with the optional ngx_http_sub_module. This way I can remove the NO from NOFOLLOW and NOINDEX, as well as modify the robots.txt file.

I will also run it in transparent reverse proxy mode and only make changes for Googlebot.
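Something along these lines (a rough sketch; the backend IP and file paths are placeholders, and sub_filter's replacement string can contain variables as of nginx 1.9.4, which is what confines the rewrite to Googlebot):

    # Rough sketch; backend IP and paths are placeholders.
    # The map blocks live at the http{} level: they pick the replacement
    # string per User-Agent, so only Googlebot sees the rewritten tags.
    map $http_user_agent $meta_index {
        default      'NOINDEX';   # everyone else gets the page unchanged
        ~*googlebot  'INDEX';     # the NO is stripped for Googlebot only
    }
    map $http_user_agent $meta_follow {
        default      'NOFOLLOW';
        ~*googlebot  'FOLLOW';
    }

    server {
        listen      80;
        server_name www.example.com;

        # Serve our own robots.txt instead of the manufacturer's copy.
        location = /robots.txt {
            root /var/www/override;          # placeholder path
        }

        # Transparent reverse proxy for everything else.
        location / {
            proxy_pass http://203.0.113.10;  # placeholder backend
            proxy_set_header Host      $host;
            proxy_set_header X-Real-IP $remote_addr;

            # sub_filter only works on uncompressed responses.
            proxy_set_header Accept-Encoding "";

            sub_filter_once off;
            sub_filter 'NOINDEX'  $meta_index;
            sub_filter 'NOFOLLOW' $meta_follow;
        }
    }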
 
