Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Accessible files coming back as blocked by robots.txt


sfgirl

4:46 pm on Apr 27, 2017 (gmt 0)

10+ Year Member



Here's a perplexing one.

Our css files are on a different domain. There is no robots.txt file on it. Any file/page that doesn't exist returns a 403 (access denied) instead of a 404. Google's documentation says a robots.txt returning a 4xx response is treated as if it doesn't exist - in other words, go ahead and crawl the site. Files that do exist, like the css files on that domain, return a 200.
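For reference, the documented behavior can be sketched as a small decision table - how a crawler maps the robots.txt fetch status to a crawl policy. This is an illustrative sketch of Google's published rules (the function name and labels are mine, not Google's code):

```python
# Illustrative sketch of Google's documented robots.txt handling:
# 2xx = parse the file, 4xx = as if no robots.txt exists (allow all),
# 5xx = treat the whole site as temporarily disallowed.
def robots_policy(status_code):
    if 200 <= status_code < 300:
        return "parse rules"      # a real robots.txt: obey its directives
    if 400 <= status_code < 500:
        return "allow all"        # 403/404 etc.: treated as no robots.txt
    if 500 <= status_code < 600:
        return "disallow all"     # server error: crawling is deferred
    return "allow all"            # simplified; redirects etc. omitted

print(robots_policy(403))  # allow all
```

By that table, a 403 on robots.txt should mean "allow all" - which is exactly why the "blocked by robots.txt" report is so odd.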

So until recently (pre April 19), the css files on that domain were accessible. Nothing has changed on our end, but now the css files come back as "blocked by robots.txt" in Fetch as Google and the mobile-friendly test. That also makes no sense, since there is no robots.txt on that domain.

This started happening sometime between 4/19 and 4/24 and is still going on.

So,
1. The css files are suddenly reported as blocked, when they are in fact still accessible (200).
2. Google says the css files are blocked by a robots.txt file that doesn't exist (though requests for it do return, and always have returned, a 403 response, if that's relevant to anything).

Any thoughts/ideas around this? Similar weirdnesses? Especially seeing a difference in how Google treats a 403?

lucy24

6:11 pm on Apr 27, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



since there is no robots.txt on that domain.
Wouldn't it be less vexatious simply to put a robots.txt on that domain and see what happens? If you really allow carte blanche crawling, that's
User-Agent: *
Disallow:
The Googlebot also recognizes the "Allow" directive and wildcard patterns in robots.txt, so you have the option of disallowing most files while still allowing it to crawl css.
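For instance (paths illustrative), a robots.txt that blocks everything except css files could read:

User-Agent: *
Disallow: /
Allow: /*.css$

(Google documents support for Allow and for the * and $ wildcards; not every crawler honors them.)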

This doesn't answer the real question, though, which is why Googlebot's requests for robots.txt are getting blocked. A 403 response means the server doesn't care whether the file exists; the request is blocked regardless. You need to figure out where the 403 is coming from. (On an Apache server this can be difficult, because error logs never say anything but “denied by server configuration”, thank you very much Apache, that much I could have figured out unaided.) Clearly it's not an accidental 403 applied to everyone, or human visitors would be seeing your pages without css.
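To see why a blanket 403 is so unhelpful to diagnose, here is a self-contained demonstration (hypothetical server, standing in for the css domain): a server that answers 403 to every request. From the outside, whether robots.txt "exists" on such a server is unknowable - every path, real or not, gets the same answer.

```python
# Demo: a server that returns 403 for everything, like the poster's
# css domain apparently does. A 403 says nothing about file existence.
import http.server
import threading
import urllib.request
import urllib.error

class DenyAll(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(403)    # blanket denial, regardless of path
        self.end_headers()
    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), DenyAll)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def status_of(path):
    try:
        urllib.request.urlopen(f"http://127.0.0.1:{port}{path}")
        return 200
    except urllib.error.HTTPError as e:
        return e.code

robots_status = status_of("/robots.txt")
missing_status = status_of("/no-such-file")
print(robots_status, missing_status)  # 403 403 -- indistinguishable
```

The practical fix is the same either way: find the access-control rule producing the 403 and either exempt robots.txt (and the css paths) from it, or confirm that Googlebot's requests aren't being caught by it.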

Edit: You said that files that don't exist return a 403 instead of a 404. Did you mean that the site is intentionally coded to do this? Why, for heaven's sake? It can be useful in some cases to return a manual 404 when you really mean 403, but the reverse doesn't make sense.

sfgirl

4:38 pm on Apr 28, 2017 (gmt 0)

10+ Year Member



Thanks @lucy24.

The primary consideration for us is whether I want to ask another team to pull their engineers off of their projects for a non-trivial amount of work to implement a 404 and/or create a robots.txt (due to the complex way things are routed).

Google specifies that a robots.txt returning any 4xx is treated as if it didn't exist, i.e. all crawling is allowed - which doesn't suggest that rerouting engineers to work on this will have any effect.

lucy24

6:36 pm on Apr 28, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't understand why you'd need to “implement” a 404. The 404 response is already the server default for when a file doesn’t exist; to get any other response you’d have to code it explicitly. So we’re talking about deleting code, not adding it.

And creating a robots.txt file that consists of precisely two lines, under a host that already exists, does not strike me as needlessly complex.

Any chance you could send one of the engineers over here to explain what the problem is? I’d really prefer not to think that they’re going home each night saying “You would not believe what I got the boss to swallow today!” ;)