


robots.txt using host: and config for cdn

     
4:08 pm on Jul 26, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


Over the last few days I have configured my site to use CloudFront as a CDN. I purchased an SSL cert for cdn.mydomain.com and everything is serving correctly; performance is OK.
The issue I have is that GSC has started reporting that my site is blocking resources for the images at cdn.mydomain.com. It is just a few right now, but I expect many more to appear over time as GSC catches up, so I want to address this ASAP.
My robots.txt is in the CDN cache, so it should be read by Google, BUT it had an entry that started host: www.mydomain.com. I have now removed this (so there is no host entry at all), updated the CDN cache so it serves the changed version, and told Google about the new robots.txt.

Would the host entry be the problem that was causing the resources to appear blocked to Google on the CDN domain? There are no entries that block those directories, and there was no issue with Google crawling the same resources on the main domain with the same robots.txt.

Update: I have just added an Allow: /media/* entry near the top of robots.txt to see if that helps.
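
For reference, the top of the file now looks roughly like this - the Disallow line is just a placeholder for my existing rules, and the old host: line is gone completely:

User-agent: *
Allow: /media/*
Disallow: /example-private-dir/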
5:00 pm on July 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1177
votes: 275


I purchased an SSL cert

TLS

By the way, what happens if you try to "Fetch as Google" for a file on your CDN subdomain? You'll see if the resource is blocked, and "why".
6:44 pm on July 26, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


By the way, what happens if you try to "Fetch as Google" for a file on your CDN subdomain? You'll see if the resource is blocked, and "why".


It's not possible to do that, as GSC automatically puts the main domain in when using Fetch as Google and you just fill in the sub-path of the resource/page to test. Of course I could verify the subdomain as its own property in GSC and test from there, but that isn't really a test against the errors in my main domain's GSC, which is where I have the issues.
8:16 pm on July 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Would the host entry be the problem

No. Googlebot does not support the host directive in robots.txt.

Host is a Yandex thing, and it's rather pointless. If you have mirrors of the same content on separate (sub)domains, use rel canonical to indicate what your preferred domain is.
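
For example, something along these lines in the <head> of each page on the mirror, pointing at the preferred domain (the URL here is just an illustration):

<link rel="canonical" href="https://www.mydomain.com/some-page/">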

The message about blocked resources should tell you the URL of the robots.txt file that blocks Googlebot's access to those resources. If that's the robots.txt file on the cdn subdomain, then make sure it's now been correctly updated locally and then purge it from the CDN nodes.
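
With CloudFront you can force that with an invalidation, e.g. something like this from the AWS CLI (the distribution ID is a placeholder):

aws cloudfront create-invalidation --distribution-id YOUR_DISTRIBUTION_ID --paths "/robots.txt"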

You could also do a search for something like robots.txt testing tool to validate your syntax and to verify that no URLs are inadvertently blocked from being accessed by Googlebot.
12:06 am on July 30, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


Well, the robots.txt on the CDN domain is the same one as on the main domain (it syncs up to the CDN), and it does not block any of the resources that are being reported as blocked. It's weird. So I specifically put an Allow entry in to try to force the issue. Today even more are being reported as blocked.
I've decided to give up on the CDN, as it seems to be causing more problems: since implementing it, my traffic fell, I fell in the SERPs, and conversions all but disappeared. I turned it off last night and reverted to the normal setup, and traffic seems to be recovering today; conversions are back. I'm going to look at other performance enhancements rather than a CDN, certainly for the time being, as it has already been a tough few months with Google.
8:56 am on July 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Sounds like it was a buggy implementation of the CDN. It can be challenging to get everything right, depending on the complexity of the website and your technical know-how. I believe CloudFront is an unmanaged service, so you probably won't get any support if you need it. Maybe look into some alternatives if you do feel a CDN would increase performance.
11:34 pm on Aug 1, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


For now I'm just going to leave it, try to let things recover, and see what this update brings in the next week or so. With so many things going on, I find it gets confusing if you make too many changes at once. I've been doing some performance tweaks on the server (replacing memcached with Redis, for example) and have improved PageSpeed scores a bit more, so that pretty much everything is green for mobile and desktop, so I don't see a CDN as urgent currently. It would be nice to have it to reduce bandwidth costs, but I have also lowered bandwidth usage by about 50% by blocking bots, so again it's not as urgent as it was.
I thought about setting up the CDN on the staging server and leaving it for a bit, but then realised that as I block it from being indexed, that wouldn't be a test!
10:23 am on Aug 2, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Reduce bandwidth costs? In my experience, CDNs tend to be much more expensive per GB transferred.

A $10 VPS gets you 2TB of traffic these days. That would cost you $150-200 (15-20x) on CloudFront.
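
(Rough numbers, assuming CloudFront's published rate of roughly $0.085/GB for the first 10 TB in the US/EU: 2 TB is about 2,048 GB x $0.085 ≈ $174 before request charges, which is where that 15-20x figure comes from.)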
12:09 am on Aug 28, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


I actually returned to using CloudFront again and, looking at my billing, using CloudFront has reduced my costs for sure. By implementing bot blocking at server level in nginx and moving to a CDN, I have reduced the associated costs by about 60-70%. I will also look at adding anti-bot measures on the CDN soon, but I definitely want to get past these Google crawling issues first before introducing too many variables.
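
For anyone curious, the nginx-level bot blocking is roughly along these lines - a map on the User-Agent plus a 403; the bot names here are only examples, not my actual list:

# in the http {} block
map $http_user_agent $blocked_bot {
    default                           0;
    ~*(AhrefsBot|SemrushBot|MJ12bot)  1;
}

# inside the server {} block
if ($blocked_bot) {
    return 403;
}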

An update on the blocked resources issue, though: after re-enabling the CDN it continued to report blocked resources, and the number kept creeping up bit by bit. I tried turning off HTTP/2 on the CDN in case there was some confusion there, as Googlebot doesn't use HTTP/2 yet. My thinking was that maybe Googlebot arrives at my server and requests over HTTP/1.1, and somehow the serving of the CDN resources gets confused and goes out over HTTP/2, so Google can't read them and reports them as blocked. That didn't work.

So next I decided to force the CDN to serve a different robots.txt from the server one that is synchronised to it. That's not a straightforward thing to do, and I had to use a Lambda@Edge function that is triggered when Googlebot visits the CDN and returns a basic robots.txt of just User-agent: * and Disallow: and nothing else. Yesterday I saw, for the first time, a drop in the reported blocked resources - only a few URLs, but if it keeps reducing then I might have found a solution. The next few weeks will tell, I guess.
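
The function itself is only a few lines. Roughly something like this - a simplified sketch of the idea rather than my exact code, written here as a Python handler and assuming a viewer request trigger so it runs before CloudFront checks its cache:

def handler(event, context):
    # CloudFront hands the incoming request to Lambda@Edge in this structure
    request = event['Records'][0]['cf']['request']
    headers = request.get('headers', {})
    user_agent = headers.get('user-agent', [{'value': ''}])[0].get('value', '')

    # When Googlebot asks for robots.txt on the CDN domain, answer directly with
    # a wide-open robots.txt instead of the synchronised main-domain file.
    if request.get('uri') == '/robots.txt' and 'Googlebot' in user_agent:
        return {
            'status': '200',
            'statusDescription': 'OK',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/plain'}],
            },
            'body': 'User-agent: *\nDisallow:\n',
        }

    # Anything else passes through to the cache / origin as normal
    return request

An origin request trigger would need a slightly different setup, but the User-Agent check and the returned body are the important parts.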
7:25 pm on Aug 29, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


To update again, in case this is useful to anyone in the future: it seems the Lambda@Edge function on CloudFront has fixed the issue. Today GSC has reported a drop of about 50% in the blocked resources, so I expect that to keep falling until there are none.

So, for some reason, when you have your main domain's robots.txt synchronised across to CloudFront and it is then servable from cdn.domain.com, this somehow causes Google to see the CDN domain's robots.txt as blocking everything. I'm not sure why this is - whether it is something to do with CORS, or something to do with the way CloudFront responds, I don't know. But setting a Lambda@Edge function to return a basic, fully open robots.txt whenever the User-Agent header is Googlebot fixes the issue.