


robots.txt using host: and config for cdn

     
4:08 pm on Jul 26, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


Over the last few days I have configured my site to use CloudFront as a CDN. I purchased an SSL cert for cdn.mydomain.com and everything is serving correctly; performance is OK.
The issue I have is that GSC has started reporting that my site is blocking resources for the images at cdn.mydomain.com. It is just a few right now, but I expect many more to appear over time as GSC catches up, so I want to address this ASAP.
My robots.txt is in the CDN cache, so it should be read by Google, BUT it had an entry that started host: www.mydomain.com. I have now removed this (so there is no host entry at all), updated the CDN cache so it serves the changed version, and told Google about the new robots.txt.

Would the host entry be the problem that was causing the resources to appear blocked to Google on the CDN domain? There are no entries that block those directories, and there was no issue with Google crawling the same resources on the main domain with the same robots.txt.

Update: I have just added an Allow: /media/* entry near the top of robots.txt to see if that helps.
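
For reference, the top of the file now looks roughly like this - the Disallow line is just a placeholder for my existing rules, and the old host: line is gone completely:

User-agent: *
Allow: /media/*
Disallow: /example-private-dir/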
5:00 pm on July 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Nov 13, 2016
posts:1177
votes: 275


I purchased an SSL cert

TLS

By the way, what happens if you try to "Fetch as Google" for a file on your CDN subdomain? You'll see if the resource is blocked, and "why".
6:44 pm on July 26, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


By the way, what happens if you try to "Fetch as Google" for a file on your CDN subdomain? You'll see if the resource is blocked, and "why".


It's not possible to do that, as GSC automatically puts the main domain in when using Fetch as Google and you just fill in the sub-path of the resource/page to test. Of course I could verify the subdomain as its own property in GSC and test from there, but that isn't really a test against the errors in my main domain's GSC, which is where I have the issues.
8:16 pm on July 26, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Would the host entry be the problem

No. Googlebot does not support the host directive in robots.txt.

Host is a Yandex thing, and it's rather pointless. If you have mirrors of the same content on separate (sub)domains, use rel canonical to indicate what your preferred domain is.
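
For example, something along these lines in the <head> of each page on the mirror, pointing at the preferred domain (the URL here is just an illustration):

<link rel="canonical" href="https://www.mydomain.com/some-page/">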

The message about blocked resources should tell you the URL of the robots.txt file that blocks Googlebot's access to those resources. If that's the robots.txt file on the cdn subdomain, then make sure it's now been correctly updated locally and then purge it from the CDN nodes.
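
With CloudFront you can force that with an invalidation, e.g. something like this from the AWS CLI (the distribution ID is a placeholder):

aws cloudfront create-invalidation --distribution-id YOUR_DISTRIBUTION_ID --paths "/robots.txt"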

You could also do a search for something like robots.txt testing tool to validate your syntax and to verify that no URLs are inadvertently blocked from being accessed by Googlebot.
12:06 am on July 30, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


Well, the robots.txt on the CDN domain is the same one as on the main domain (it syncs up to the CDN), and it does not block any of the resources that are being reported as blocked. It's weird. So I specifically put an Allow entry in to try to force the issue. Today even more are being reported as blocked.
I've decided to give up on the CDN, as it seems to be causing more problems: since implementing it, my traffic fell, I fell in the SERPs, and conversions all but disappeared. I turned it off last night and reverted to the normal setup, and traffic seems to be recovering today; conversions are back. I'm going to look at other performance enhancements rather than a CDN, certainly for the time being, as it has already been a tough few months with Google.
8:56 am on July 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Sounds like it was a buggy implementation of the CDN. It can be challenging to get everything right, depending on the complexity of the website and your technical know-how. I believe CloudFront is an unmanaged service, so you probably won't get any support if you need it. Maybe look into some alternatives if you do feel a CDN would increase performance.
11:34 pm on Aug 1, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


For now I'm just going to leave it, try to let things recover, and see what this update brings in the next week or so. With so many things going on, I find it gets confusing if you make too many changes at once. I've been doing some performance tweaks on the server (replacing memcached with Redis, for example) and have improved PageSpeed scores a bit more, so that pretty much everything is green for mobile and desktop, so I don't see a CDN as urgent currently. It would be nice to have it to reduce bandwidth costs, but I have also lowered bandwidth usage by about 50% by blocking bots, so again it's not as urgent as it was.
I thought about setting up the CDN on the staging server and leaving it for a bit, but then realised that as I block it from being indexed, that wouldn't be a test!
10:23 am on Aug 2, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 25, 2005
posts:2091
votes: 370


Reduce bandwidth costs? In my experience, CDNs tend to be much more expensive per GB transferred.

A $10 VPS gets you 2TB of traffic these days. That would cost you $150-200 (15-20x) on CloudFront.
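
(Rough numbers, assuming CloudFront's published rate of roughly $0.085/GB for the first 10 TB in the US/EU: 2 TB is about 2,048 GB x $0.085 ≈ $174 before request charges, which is where that 15-20x figure comes from.)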
12:09 am on Aug 28, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


I actually returned to using CloudFront again and, looking at my billing, using CloudFront has reduced my costs for sure. By implementing bot blocking at server level in nginx and moving to a CDN, I have reduced the associated costs by about 60-70%. I will also look at adding anti-bot measures on the CDN soon, but I definitely want to get past these Google crawling issues first before introducing too many variables.
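
For anyone curious, the nginx-level bot blocking is roughly along these lines - a map on the User-Agent plus a 403; the bot names here are only examples, not my actual list:

# in the http {} block
map $http_user_agent $blocked_bot {
    default                           0;
    ~*(AhrefsBot|SemrushBot|MJ12bot)  1;
}

# inside the server {} block
if ($blocked_bot) {
    return 403;
}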

An update on the blocked resources issue, though: after re-enabling the CDN it continued to report blocked resources, and the number kept creeping up bit by bit. I tried turning off HTTP/2 on the CDN in case there was some confusion there, as Googlebot doesn't use HTTP/2 yet. My thinking was that maybe Googlebot arrives at my server and requests over HTTP/1.1, and somehow the serving of the CDN resources gets confused and goes out over HTTP/2, so Google can't read them and reports them as blocked. That didn't work.

So next I decided to force the CDN to serve a different robots.txt from the server one that is synchronised to it. That's not a straightforward thing to do, and I had to use a Lambda@Edge function that is triggered when Googlebot visits the CDN and returns a basic robots.txt of just User-agent: * and Disallow: and nothing else. Yesterday I saw, for the first time, a drop in the reported blocked resources - only a few URLs, but if it keeps reducing then I might have found a solution. The next few weeks will tell, I guess.
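
The function itself is only a few lines. Roughly something like this - a simplified sketch of the idea rather than my exact code, written here as a Python handler and assuming a viewer request trigger so it runs before CloudFront checks its cache:

def handler(event, context):
    # CloudFront hands the incoming request to Lambda@Edge in this structure
    request = event['Records'][0]['cf']['request']
    headers = request.get('headers', {})
    user_agent = headers.get('user-agent', [{'value': ''}])[0].get('value', '')

    # When Googlebot asks for robots.txt on the CDN domain, answer directly with
    # a wide-open robots.txt instead of the synchronised main-domain file.
    if request.get('uri') == '/robots.txt' and 'Googlebot' in user_agent:
        return {
            'status': '200',
            'statusDescription': 'OK',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/plain'}],
            },
            'body': 'User-agent: *\nDisallow:\n',
        }

    # Anything else passes through to the cache / origin as normal
    return request

An origin request trigger would need a slightly different setup, but the User-Agent check and the returned body are the important parts.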
7:25 pm on Aug 29, 2018 (gmt 0)

Full Member

Top Contributors Of The Month

joined:June 28, 2018
posts: 310
votes: 144


To update again, in case this is useful to anyone in the future: it seems the Lambda@Edge function on CloudFront has fixed the issue. Today GSC has reported a drop of about 50% in the blocked resources, so I expect that to keep falling until there are none.

So, for some reason, when you have your main domain's robots.txt synchronised across to CloudFront and it is then servable from cdn.domain.com, this somehow causes Google to see the CDN domain's robots.txt as blocking everything. I'm not sure why this is - whether it is something to do with CORS, or something to do with the way CloudFront responds, I don't know. But setting a Lambda@Edge function to return a basic, fully open robots.txt whenever the User-Agent header is Googlebot fixes the issue.