Forum Moderators: Robert Charlton & goodroi


More peculiar robots.txt behavior from Google

         

JamesSC

3:16 pm on Dec 7, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



In a recent post I believe I mentioned receiving a notice in GSC saying that Google had stopped crawling my site because it couldn't fetch my robots.txt file, a situation Google shortly thereafter resolved on its own, without help from me.

Well, it's back:

Despite Google having successfully requested and accessed my robots.txt file only yesterday, GSC now reports:

robots.txt fetch failed
You have a robots.txt file that we are currently unable to fetch. In such cases we stop crawling your site until we get hold of a robots.txt, or fall back to the last known good robots.txt file. Learn more.


Testing my robots.txt against all crawlers just now, all are allowed, yet the robots.txt Google previously had on file is gone. When I attempted to resubmit it, I got:

It didn't go through. Try again later.


My only recourse is to wait until Google solves the problem internally, as it did previously.

How common is this?

lucy24

6:25 pm on Dec 7, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did you cross-check your access logs to see what, if any, Google requests showed up in the seconds immediately following your GSC action? Details on alternative approaches would depend on how large your site is.
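If it helps, pulling those entries out of a combined-format access log is a one-liner's worth of work. A sketch (the crawler marker and the log layout are assumptions about a typical Apache/nginx combined-log setup):

```python
import re

# Matches the request path and status fields of a combined-format access log line.
LINE_RE = re.compile(r'"(?:GET|HEAD) (/robots\.txt) [^"]*" (\d{3})')

def robots_fetches(lines, ua_marker="Googlebot"):
    """Yield (path, status) for every robots.txt request by the given crawler."""
    for line in lines:
        if ua_marker not in line:
            continue
        m = LINE_RE.search(line)
        if m:
            yield m.group(1), int(m.group(2))

# Example with two log lines in the format seen later in this thread:
sample = [
    '66.249.64.92 - - [08/Dec/2018:14:50:58 -0800] "GET /robots.txt HTTP/1.1" 200 4395 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.64.120 - - [08/Dec/2018:14:58:10 -0800] "GET /robots.txt HTTP/1.1" 301 4001 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(list(robots_fetches(sample)))  # → [('/robots.txt', 200), ('/robots.txt', 301)]
```

If everything in that output is a 200 during a window when GSC claims a fetch failure, the failure is on Google's side of the wire.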

How common is this?
If you interpret “this” broadly as “GSC messages purporting to report problems that either have no existence outside G’s fevered imagination, or that are not ‘problems’ in the first place” the answer is: Extremely common.

JamesSC

7:18 pm on Dec 7, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



A seemingly passive-aggressive fevered imagination to boot: "Because we have been unable to fetch your robots.txt file, clearly through no fault of our own, I'm afraid you leave us no choice but to break off our relationship".

But, no, there weren't really any seconds immediately following. Yesterday I began to notice the absence of Google in my access logs except for regular periodic 200 GET requests for my robots.txt, thought aha, they're doing this again, and checked GSC...no apparent problems. Today, still no Google crawlers, checked GSC - the results detailed in my post above, with the additional datum that Fetch as Google returned "temporarily unreachable".

Just now: Google still 200-requesting robots.txt, main page, and 304-requesting previously crawled pages; a robots.txt now exists where earlier today the existing one had mysteriously absented itself and its attempted replacement "It didn't go through. Try again later."; and now Fetch as Google is returning "request indexing". And there is now another triangular stegosaurus dorsal plate in my otherwise flat for years robots.txt fetch errors plot.

Will Google and I begin dating again? Too soon to say. I should add to this data blob that I recently received one of Google's mobile-first crawling notices, or whatever they call them, and this previously absent phenomenon (Google, and only Google, claiming robots.txt is unfetchable and the site therefore uncrawlable) falls squarely within that window, not before it.

Anyway, thanks for reassuring me that the unknowable mortal sin may not be entirely mine. I'll update if I can add more useful data to this narrative.

JamesSC

3:29 pm on Dec 8, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



And now back to where this thread began, with no interference from me.

Odd to think I would be the only one experiencing this pattern of behavior. For example, the Google-responsible "temporarily unreachable" phenomenon has already been documented, and perhaps that, whatever it is, is the true root of the problem.

lucy24

6:33 pm on Dec 8, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



At the outset, I mentioned cross-checking access logs. I asked about this because it would be interesting to see some hard proof that Google tried and failed to reach robots.txt: How, exactly? They can't mean a simple 404, because they will happily crawl sites that don't have a robots.txt. And you'd think that if they met a 403, they would say so unambiguously. Some other 400-class error, then? A 500-class error? All of those ought to be visible in logs.

Or are they blaming us for some technical issue that is in fact happening within their own computers, having nothing to do with our servers?

:: detour to check something ::

Well, I'm not going to check back to the dawn of time, but within the present calendar year, there are no (zero) legitimate robots.txt requests receiving anything but a 200 (sometimes 301* on non-https sites, sometimes 304** on the test site). Facebook likes to receive a 206 (Partial Content), meaning it sent a Range request asking for only part of the file rather than the whole thing. A few malign robots got a 403, which shouldn't happen, but who the heck cares what happens to a malign robot. And one lone 500 ... from my own IP, where I was confirming that an htaccess change was valid. (Apparently it wasn't :))

* Domain-name-canonicalization redirect. Once a site goes https, I add an exemption for robots.txt, so they all get the file at the originally requested protocol and hostname.
** Until earlier this year, my test site had a genuine hard-coded robots.txt, which is capable of returning a 304; everyone else gets rewritten to a robots.php.
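The exemption described in the first footnote might look something like this in htaccess (a hypothetical sketch of the technique, not lucy24's actual rules; example.com stands in for the real hostname):

```apache
# Skip the canonicalization redirect for robots.txt, so every crawler
# gets the file at whatever protocol/hostname it originally requested.
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]
```

The first condition ANDs with the ORed pair below it, so the redirect fires for any non-canonical request except robots.txt.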

JamesSC

11:53 pm on Dec 8, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



This afternoon one IP+UA-verified Googlebot requested my robots.txt and got a 200 (what is that if not a fetch successfully fulfilled on my end?), and then literally four minutes later its sibling logged a 301 redirect from an IP only 28 higher in the final octet.

Meanwhile, bingbot and others are happily gobbling their way through like baleen whales through a krill ball.
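For anyone reading along, "IP+UA-verified" means the double DNS lookup Google documents for confirming a real Googlebot: reverse-resolve the IP, check that the hostname is under googlebot.com or google.com, then forward-resolve that hostname to confirm it maps back to the same IP. A sketch in Python (the helper names are mine; the verify step needs network access):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_host(hostname):
    """Pure check used below: Google's crawler hosts live under googlebot.com or google.com."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then forward-resolve to confirm.
    Requires network access."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        return socket.gethostbyname(host) == ip
    except OSError:
        return False
```

The forward lookup matters: anyone can fake the UA string, and anyone controlling their own reverse DNS can fake the PTR record, but not the forward record for googlebot.com.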

JamesSC

11:55 pm on Dec 8, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Later, I did get a 404 returned for a since-expired and replaced cache file (whose robots.txt Disallow syntax I've apparently not yet mastered). I've also often seen 304s from Googlebot.
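For the record, Disallow rules are prefix matches against the URL path, so a whole cache directory can usually be fenced off with a single line. A sketch, with a purely hypothetical path (a common WP caching-plugin location, not necessarily this site's):

```
User-agent: *
Disallow: /wp-content/cache/
```

That blocks crawling of anything under the prefix, though already-discovered stale URLs may still be re-requested for a while.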

JamesSC

12:09 am on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Here's what I'm talking about (sorry, it was eight minutes later, not four); because it's Googlebot, I haven't redacted the IP:

66.249.64.92 - - [08/Dec/2018:14:50:58 -0800] "GET /robots.txt HTTP/1.1" 200 4395 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.120 - - [08/Dec/2018:14:58:10 -0800] "GET /robots.txt HTTP/1.1" 301 4001 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

JamesSC

12:17 am on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Here's a typical Google 304 from earlier (same IP, mobile crawler):

66.249.64.92 - - [08/Dec/2018:11:54:34 -0800] "GET /2015/05/19/URL/ HTTP/1.1" 304 3559 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

lucy24

1:00 am on Dec 9, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A 301 on robots.txt means that your site is example.com and they requested www.example.com/robots.txt -- or vice versa. Search engines do this habitually, even if they know perfectly well which form a given site prefers.

:: back to raw logs ::

Legitimate googlebot: robots.txt requests run about 6:1 200 vs. 301, meaning that they know which form is correct for the site, but they’ll check periodically anyway.
Yandex: even more skewed, over 8:1
bingbot:
. . .
yowzuh. I make it 500:1 200 vs. 301. I was NOT expecting to see that. Unfortunately there's only one site I can do this particular test on. It happens to be without-www, so I don't know whether some search engines have a built-in preference and that's what I'm seeing.
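Those per-bot ratios can be tallied mechanically rather than eyeballed. A sketch, assuming the combined log format shown elsewhere in this thread (the bot names are just substring markers):

```python
from collections import Counter

def robots_status_counts(lines, bots=("Googlebot", "bingbot", "YandexBot")):
    """Count robots.txt responses per (bot, status) pair."""
    counts = Counter()
    for line in lines:
        if "/robots.txt" not in line:
            continue
        for bot in bots:
            if bot in line:
                # The status code is the field right after the quoted request.
                status = line.split('" ')[1].split()[0]
                counts[(bot, status)] += 1
    return counts
```

Divide the 200 count by the 301 count per bot and you have the ratios above, for as many months of logs as you care to feed it.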

When I mentioned 304 I meant specifically on robots.txt. On my sites, this can no longer occur, because dynamic files (regardless of surface URL) can't return a 304. Or, at least, they can't on my server. Is
/2015/05/19/URL/
on your site a static file?

JamesSC

1:19 am on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Is /2015/05/19/URL/ on your site a static file?


It's a standard WordPress site. It shouldn't be, although I do use a caching plugin, so perhaps it is.

lucy24

2:04 am on Dec 9, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, that would explain it. Normally a WP file is anything but static--but if it's cached, then yeah, that could be a 304.
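For anyone following along, the 304 mechanics are simple: the crawler sends an If-Modified-Since header (or an ETag), and if the cached copy is still current the server answers 304 Not Modified with no body at all, which is why a static or cached file can produce one and a freshly generated dynamic page generally can't. A sketch of the date-based decision (the function name is mine):

```python
from email.utils import parsedate_to_datetime

def conditional_status(if_modified_since, last_modified):
    """Return 304 if the client's If-Modified-Since date is at least as new
    as the resource's Last-Modified timestamp, else 200 (full response)."""
    if if_modified_since is None:
        return 200
    try:
        ims = parsedate_to_datetime(if_modified_since)
        lm = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Malformed date: ignore the precondition and send the full body.
        return 200
    return 304 if ims >= lm else 200
```

A cached WP page has a stable timestamp between cache rebuilds, so Googlebot's revalidation requests legitimately come back 304 until the cache expires.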

JamesSC

5:50 am on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



I just noticed the little drop-down arrow on the old version of the GSC robots.txt tester. Here's the OCR of a delayed screen capture of today's antics:

robots.txt Tester
Edit your robots.txt and check for errors. Learn more.

Latest version seen on 12/8/18, 8:21 PM Failure
Version seen on 12/8/18, 7:05 PM OK (200) 1,045 Bytes
No content 12/8/18, 6:35 PM Failure
Version seen on 12/8/18, 5:24 PM OK (200) 1,045 Bytes
No content 12/8/18, 3:53 PM Failure
Version seen on 12/8/18, 2:50 PM OK (200) 1,045 Bytes
No content 12/8/18, 1:43 PM Failure
Version seen on 12/8/18, 12:41 PM OK (200) 1,045 Bytes
No content 12/8/18, 11:55 AM Failure
Version seen on 12/8/18, 10:45 AM OK (200) 1,045 Bytes
No content 12/8/18, 10:11 AM Failure
Version seen on 12/8/18, 9:08 AM OK (200) 1,045 Bytes
No content 12/8/18, 8:51 AM Failure
Version seen on 12/8/18, 7:48 AM OK (200) 1,045 Bytes
No content 12/8/18, 7:25 AM Failure
Version seen on 12/8/18, 7:22 AM OK (200) 1,045 Bytes
No content 12/8/18, 7:07 AM Failure
Version seen on 12/7/18, 12:58 PM OK (200) 1,045 Bytes
Version seen on 12/7/18, 9:16 AM OK (200) 1,043 Bytes
No content 12/6/18, 6:43 AM Failure
Version seen on 12/4/18, 5:13 AM OK (200) 1,043 Bytes


I also reread

robots.txt fetch failed
You have a robots.txt file that we are currently unable to fetch. In such cases we stop crawling your site until we get hold of a robots.txt, or fall back to the last known good robots.txt file. Learn more.


and remain wondering why Google in fact did not fall back "to the last known good robots.txt file" and why therefore there is currently no robots.txt file at all on file at Google.

lucy24

6:18 am on Dec 9, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the correct procedure at this point is for you and me--and anyone else following this thread--to chant “wtf?” in close harmony.

JamesSC

4:04 pm on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yep. About all I can think of at this point is hitting Amazon up for rum, a cigar, a straight razor, and one terrified chicken before the Christmas crunch really kicks in.

JamesSC

10:15 pm on Dec 9, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Just more of the same, Success/Failure/Success/Failure, cycling with automaton-like regularity, with the occasional page fetched during the Success minutes. Same "temporarily unreachable" during the Failure minutes, although bingbot and others can both retrieve robots.txt and crawl at will. A successful header check looks normal. A successful header check as Googlebot looks normal. A successful header check, as Googlebot, of example.com/robots.txt looks normal. Active human malice in the form of this Google-only poltergeist defies reason.
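For anyone who wants to repeat those header checks, a minimal sketch (the URL is a placeholder) that builds a HEAD request carrying Googlebot's desktop UA string:

```python
import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def googlebot_head_request(url):
    """Build a HEAD request with Googlebot's UA string.
    Pass the result to urllib.request.urlopen() to see the live headers."""
    return urllib.request.Request(url, method="HEAD",
                                  headers={"User-Agent": GOOGLEBOT_UA})

# e.g.:
# with urllib.request.urlopen(googlebot_head_request("https://example.com/robots.txt")) as r:
#     print(r.status, dict(r.getheaders()))
```

Bear in mind this only shows what your server hands to that UA string from your own network; Google's real fetches originate from its own IPs, which is exactly why the access-log cross-checks earlier in this thread matter.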

And yet it moves.

JamesSC

12:58 am on Dec 11, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



And now this evening everything is back to normal, as if it never happened.