
Forum Moderators: Robert Charlton & goodroi

New GSC feature (bug?) - emailing incorrect "warning".

     
5:45 pm on Sep 15, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


Today I received an email from Google:

"Search Console has identified that your site is affected by 1 Coverage issues:

Top Warnings
Warnings are suggestions for improvement. Some warnings can affect your appearance on Search; some might be reclassified as errors in the future. The following warnings were found on your site:
Indexed, though blocked by robots.txt
We recommend that you fix these issues when possible to enable the best experience and coverage in Google Search."

I checked on GSC, and it reports the issue on one page in a folder blocked by robots.txt. All pages in that folder - including the one in the warning - have <meta name="robots" content="noindex"> in the <head> section.

The issue is flagged in GSC on 11, 12, 13 and 14 September.

I'm not at all worried about it, but curious about what might be happening. Any thoughts?
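The setup described amounts to something like this (the folder name is hypothetical; the real path isn't given in the thread), with every page inside the folder also carrying the `<meta name="robots" content="noindex">` tag mentioned above:

```
# robots.txt (hypothetical folder name, last modified in March)
User-agent: *
Disallow: /consent-folder/
```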
9:35 pm on Sept 15, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2740
votes: 839


Indexed, though blocked by robots.txt

You blocked the folder and added the meta tag. Since the folder is blocked from crawling, Google doesn't/hasn't seen the "noindex" directive, so your page remains indexed despite being blocked, just as the message suggests. To solve the problem, unblock the folder, let Google crawl and remove the pages from the index, and then, at some point in the future if needed, you can block the folder again.
10:24 pm on Sept 15, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


@NickMNS - Thanks, I'll try that.

However, what brought this about in the first place? The pages were never meant to be indexed: internal links to them are nofollow, and they don't contain anything of SEO value (they request consent for users proceeding to site sections that collect personal data). They were last modified in April (I don't remember when they were first posted), and robots.txt was last modified in March. I realise from what you say why Google can't now see that they are noindex, but what happened on 11 September? I suppose someone somewhere might have linked to the page, but it really doesn't have anything on it anyone would want to link to.

Edit: I have now allowed Google access to the folder in robots.txt, and hit the Validation button in GSC, so we'll see what happens.
12:28 am on Sept 16, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


internal links to them are nofollow
In spite of what G said when it first instituted the “nofollow” label (there's a recent thread somewhere hereabouts), “nofollow” never really meant “Pretend you have not seen this link”. It only means “Don't tell them I sent you.” That's why GSC's “who links to you” list shows a random mix of follow and nofollow links.

Editorial comment: “Indexed” is a very fuzzy term. The choice is between
(a) letting search engines crawl where you don’t want them to crawl--using part of their crawl budget to do so--in order to see your "noindex" directive, and
(b) the hypothetical possibility that some combination of search terms might cause a page to show up in some SERP somewhere--with the boilerplate about “robots.txt won’t let us show you”--even though you wanted it to remain private.

Only you can decide which of the two is a greater concern.
8:37 am on Sept 17, 2019 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator dixonjones is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 8, 2002
posts:2947
votes: 25


Further to this, Google have just announced that nofollow will now be considered a suggestion to Google, not a directive. Google also discovers URLs through multiple other routes, from sitemaps to links from external sites, including random scrapers. Lucy24's solution of opening up the page so the "noindex" can be seen, though, makes sense.
10:35 am on Sept 17, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


@lucy24 - I think I go for (a).

@DixonJones - Thanks for that update.

However, it still leaves the question of how you tell Google "I don't want you to visit or index this page" when you still need to link to it from a user-accessible page, and I'm still a bit hazy about why Google would index a page that robots.txt has never let it see. What, exactly, is in the index?
12:48 pm on Sept 17, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4562
votes: 364


why Google would index a page that robots.txt has never let it see
That's because robots.txt does not prevent Google from following links, but it does prevent Google from crawling the page to evaluate the noindex directive on it. It probably is not actually indexed in SERPs, just not noindexed. Confusing? Yes.

6:00 pm on Sept 17, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


I'm still a bit hazy about why Google would index a page that robots.txt has never let it see. What, exactly, is in the index?
That's why I believe most of the time it's a non-issue: sure, in some abstract hypothetical sense the unseen page is “indexed”--but will it ever crop up in any actual SERP seen by any actual human? Consider your Contact page. There are millions (literally) of pages on the internet whose link is the word “Contact” or similar, so it is not likely that anyone searching for “contact” will be offered pages the search engine has not seen. But if your linking text is some extremely unusual phrase, then a person searching for that phrase might arrive at “we are unable to show you”.

Real-life example: My test site is 100% roboted-out, because it’s a test site. Today's internal link might be gone tomorrow. Now, the site's domain name happens to be a phrase that someone might really utter. Because of this, I do get the rare human visitor who is curious enough to go beyond the “we are unable”. (And they are sorry when they do, because--being a test site--I have made it as ugly as possible.) I could have chosen to permit crawling while sticking noindex tags on all pages. But instead I chose not to permit crawling; the rare human visit is just so much white noise.
8:03 am on Sept 18, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


That's because robots.txt does not prevent Google from following links but it does prevent Google from crawling the page


That may well be so, but Google's advice is clear and specific:

"If you need to prevent Google from following a link to a page on your own site, use the robots.txt Disallow rule" ([support.google.com]).

According to Google, robots.txt should prevent Google from following links.

My own link, therefore, should not have given rise to this, unless - a possibility - blocking the folder isn't enough (i.e. the page itself, rather than the folder, must be the subject of the Disallow rule). However, similar links elsewhere on the website to other pages in the blocked folder haven't suffered the same fate, and on balance my view is that my own link is not the culprit.
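On the narrower question of whether blocking the folder is enough: robots.txt Disallow rules are prefix matches, so a rule on the folder should cover every URL beneath it, with no page-specific rule needed. A quick check with Python's standard-library parser (paths hypothetical):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /blocked-folder/",
])

# Every URL whose path begins with /blocked-folder/ is disallowed,
# including individual pages and deeper subdirectories.
for url in (
    "https://example.com/blocked-folder/",
    "https://example.com/blocked-folder/consent-page.html",
    "https://example.com/blocked-folder/sub/deeper.html",
):
    print(url, parser.can_fetch("*", url))
```

All three checks come back disallowed, which supports the view that the folder-level block was sufficient and the link itself isn't the culprit.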

On thinking further, therefore, about why anyone else might link to the blocked page, I decided to run a Copyscape check, and found that most of my linking page (including my own internal links) has been copied and pasted to a page on another website. If there were an easy way to monetise the theft of my intellectual property I would give up everything else.

The sentence including that particular link has - uniquely - been edited, and the link is no longer there, but in the context of the plagiarist's use of it I think the sentence was probably edited some time after the copied content was originally posted, and that Google has followed the link from there. DixonJones' observation that "Google have just announced that nofollow will now be considered a suggestion to Google, not a directive" is pertinent.

Unblocking the folder on 15 September should get the Google warning removed eventually, although further reports of the problem on 16 and 17 September are now showing in GSC, so it looks like a cached version of robots.txt is still in play. I'll deal with the plagiarist later today.
5:06 pm on Sept 18, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


I think when not2easy said “following links” she meant “learning that an URL exists”. (It threw me too.)