

'Indexed, though blocked by robots.txt' warning - how to approach it?

     
9:06 am on Oct 16, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


Hello,

A fairly big e-commerce website that I administer went through a redesign that changed the structure of the URLs belonging to the faceted navigation (filters, prices, brands, etc.).

Right after the change, Google started to increase the number of URLs in the "Indexed, not submitted in sitemap" section within Google Search Console. Most of these URLs were faceted navigation pages, which we didn't want indexed, so we decided to exclude them via robots.txt. We wrote the rules and added them to the robots.txt file. The number of URLs in "Indexed, not submitted in sitemap" started to drop; however, we got more than 8mil URLs in the "Valid with warnings" section with the warning "Indexed, though blocked by robots.txt". That's 8mil URLs, kind of scary.
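
Rules like these can be sanity-checked offline before deploying; here is a minimal sketch using Python's standard-library robotparser (the paths are made-up stand-ins, not our real faceted URLs):

from urllib import robotparser

# Hypothetical rules of the kind described above; the real site's
# faceted-navigation paths will differ.
rules = """\
User-agent: *
Disallow: /filter/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A faceted-navigation URL and a plain category URL (both made up).
print(rp.can_fetch("Googlebot", "https://www.example.com/filter/brand-acme"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/shoes"))              # True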

We also noticed a drop in the number of impressions that started at the moment GSC warned us about the 8mil URLs that were indexed, though blocked by robots.txt.

Questions:
1. Is this warning a real threat? Why did the number of impressions start to drop?
2. Are there any steps to be taken to fix this? Or should we ignore the message since it is only a warning?
3. What is the best approach to exclude these faceted navigation URLs from being indexed and crawled?

Thanks in advance for your help!
12:31 pm on Oct 16, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


I had the same problem recently ([webmasterworld.com ]) on a single page. The page has always had the Noindex Meta Tag, so how it ever became "indexed" is a mystery. However, it looks likely that some pages that have never been crawled can nevertheless be indexed.

My "solution", suggested by NickMNS, was to remove the block in robots.txt so that Google could find the Noindex tag.

I did that on 15 September, and Google is still reporting the problem. Just in case I missed something, I have checked both the function of robots.txt and the page itself in GSC since removing the block, so the page definitely isn't blocked. How Google is still finding the problem is therefore also a mystery.

Blocking pages you don't want crawled in robots.txt conforms with Google guidelines ([support.google.com ]), as does using the Noindex Meta Tag, so my view is that this is a problem with how GSC warnings are generated - not with the pages themselves - and my own intention is to ignore it.

As you don't want the pages with warnings indexed anyway, I wouldn't get too worried about it. I suppose it is possible that some site-wide negative weighting could apply if such "warnings" are too common and never heeded, but I think it unlikely.
12:33 pm on Oct 16, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2738
votes: 837


Is this warning a real threat?

Yes, as these pages will/may continue to appear in the SERPs with a description to the effect "there is no information about this page because it is blocked from crawling".

Are there any steps to be taken to fix this?

Yes, add the meta noindex tag to each page and remove the pages from robots.txt to allow Google to crawl them. If the pages aren't removed from robots.txt, Google will not crawl them and will never see the noindex tag, so nothing will happen. If at some point in the future you see that the pages are no longer indexed, you can then block them from crawling in robots.txt again, but this could take time, six months to a year.
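
The end state is easy to verify once the block is lifted. A minimal sketch, using only the Python standard library and a hypothetical URL, that fetches a page and reports any noindex carried in the robots meta tag or the X-Robots-Tag header:

from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    # Collects the content attribute of any <meta name="robots"> tag.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")

url = "https://www.example.com/category?brand=acme"  # hypothetical faceted URL
with urlopen(url) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    x_robots = resp.headers.get("X-Robots-Tag", "")

finder = RobotsMetaFinder()
finder.feed(body)
print("meta robots:", finder.directives)   # expect something like ['noindex, follow']
print("X-Robots-Tag:", x_robots or "(none)")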

What is the best approach to exclude these faceted navigation URLs from being indexed and crawled?

See above.
12:52 pm on Oct 16, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4556
votes: 362


That can happen because disallowing crawling does not remove URLs from the index. To remove URLs from the index you need to allow crawling and add noindex. Every URL is liable to be indexed unless it is noindexed, and a noindex added after bots have been blocked will never be read. The bots need to be able to crawl the URLs to understand that they should not be indexed.

If this is the same site that you had previously asked about in September: [webmasterworld.com...] then the canonical URLs will work eventually as they are consumed and sorted by Google's bots but not if you have disallowed crawling. There is an older thread here that spells out the best practices for faceted navigation: [webmasterworld.com...]
1:22 pm on Oct 16, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


The bots need to be able to crawl to understand that the URLs should not be indexed.


I'm curious about how Google treats robots.txt, as it clearly isn't in real time. If Google is using a cached version, how do we get Google to refresh it?

Obviously we can't allow the bots to crawl pages - other than "eventually" - if they are still reading historical blocks.
1:32 pm on Oct 16, 2019 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4556
votes: 362


In the old GSC it was a simple matter of uploading a new file, submitting it and fetching. Now I have no clue how to get them to refresh.
4:49 pm on Oct 16, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


Now I have no clue of how to get them to refresh.


Well, you're supposed to do this:

[support.google.com ]

However, having found the testing tool (nowhere obvious in GSC: I found it using Help > Search and clicking on the link in the results), you are supposed to amend the file there, download it, upload it to your server and then click Submit.

I did all that weeks ago, but GSC is still reporting "Indexed but blocked" - on multiple occasions since it was fixed - on a noindex page that isn't blocked. My suspicion, fuelled by other erroneous warnings I have received and several other threads here, is that the new GSC is buggy, and "warnings" should be viewed in that light.

these pages will/may continue to appear in the SERPs with a description to the effect "there is no information about this page because it is blocked from crawling"


Yes, they might, but I wouldn't expect them to be anywhere near page 1 for any search term, and if they haven't been crawled it is difficult to predict what search terms they might appear for.
5:45 pm on Oct 16, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15928
votes: 884


Oh, good grief. This question again?
they might, but I wouldn't expect them to be anywhere near page 1 for any search term, and if they haven't been crawled it is difficult to predict what search terms they might appear for
... and that's why I count this particular warning among the many, many GSC warnings that are best dealt with by ignoring them.

Exercise: When next you see an “indexed though blocked in robots.txt” warning, experiment with the search engine and try to come up with a search query that will bring up that page within, say, the first 100 results without resorting to the site: operator.

Google does not own your site. You do not have to let them see anything you don't want them to see. Full stop.
5:00 am on Oct 17, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10557
votes: 1118


This is just a g tactic to see all your stuff, even the parts you do not want visible. As lucy24 indicates, play only if you wish, else ignore it.

These "dire warnings" are merely a method to get webmasters panicked.
11:11 am on Oct 17, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


Just so you understand the context.

Until the new design was implemented, we had nearly 50k pages in the index and there was no warning about indexed pages that were blocked by robots.txt.

One month after the new site was launched, we noticed an increase in the section "Indexed, not submitted in sitemap" in GSC (about 500k pages). Most of these pages were faceted navigation links. We had excluded them through robots.txt, and the number began to drop, but we received another message in the "Valid with warnings" section that nearly 50mil pages were being "Indexed, though blocked by robots.txt".

Now, is it wise to unblock the pages from robots.txt and set noindex on them?
11:43 am on Oct 17, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


One month after the new site was launched, we noticed an increase in the section "Indexed, not submitted in sitemap" in GSC


A question you might address is how/why Google indexed the pages - for either type of warning - if they are blocked in robots.txt. Either:

1. Google has changed the way it is following links and/or treating robots.txt, or
2. Something about links to those pages has changed.

As Google has recently announced that nofollow will henceforth be treated as "advisory", it is quite possible that nofollow links to those pages are now being treated as plain ("dofollow") links, and the pages are being indexed for that reason.

However, as you have recently launched a new site, it is also possible that something about the links has changed, which would be worth looking into if you haven't done it already.

As for remedial action, removing the block in robots.txt and using Noindex is about all you can do. I wouldn't have any worries about removing the block: not all bots respect robots.txt, so it doesn't prevent access to pages in any meaningful way, and it isn't your problem if it means googlebot has more to do.
3:00 pm on Oct 17, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


@Wilburforce

The URLs that appeared in the "Indexed, not submitted in sitemap" section in GSC weren't blocked by robots.txt. We did the rule to block them after we notice the warning in GSC. And after we completed the robots.txt file, the number from "Indexed, not submitted in sitemap" began to drop, but a new warning appeared in "Valid with warnings" section that nearly 50mil pages were being "Indexed, though blocked by robots.txt".

So, I am guessing that there is a cycle here: first Google noticed the URLs and started to index them (they placed the URLs in their index). Then, after blocking the crawl with robots.txt, they did not crawl the URLs any longer, but they were still in the index. And now, they continue to index faceted navigation links even though they are blocked by robots.txt.

My concern is, how do they index the pages if they don't crawl them?
4:05 pm on Oct 17, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


My concern is, how do they index the pages if they don't crawl them?


Google's explanation ([google.com ]) isn't particularly illuminating, but if

1. There is a link to a URL
2. That URL responds with a 200
3. There is a searchable term in link anchor-text and/or the URL

then there is at least something to index.

Why they might index pages with such scant information we can only guess, but the inference I draw from the fact that they do is that they'll index anything that isn't behind a bolted door.

I'm assuming - if you can excuse the multiple negatives - that the not-submitted-in-sitemap pages weren't originally Noindex. If they were Noindex that raises other questions.

What I'm at a loss to work out is how to get GSC to read the revised (block removed) robots.txt, which I submitted a month ago and have resubmitted a couple of times since. I'm still getting the warning, both daily on the "Valid with warnings" page, and on checking the page manually.
4:07 pm on Oct 17, 2019 (gmt 0)

New User

joined:Aug 23, 2018
posts:9
votes: 1


Faceted navigation is a pain. It's unlikely you can add a noindex meta tag to those URLs if they are generated by the faceted nav. So the solution is...

1. Remove the block from robots.txt so Google can crawl them.
2. Use mod_rewrite in your .htaccess to add a noindex X-Robots-Tag header to the faceted navigation URLs. For example:

<IfModule mod_rewrite.c>
# Requires RewriteEngine On somewhere in this .htaccess.
# If the query string starts with one of the facet parameters,
# flag the request with an environment variable.
RewriteCond %{QUERY_STRING} ^(limit|mode|dir|SID)=([a-zA-Z0-9&+%_=.*]*)$
RewriteRule .* - [E=NOINDEX_HEADER:1]
</IfModule>

<IfModule mod_headers.c>
# Send "X-Robots-Tag: noindex" only when the flag was set above.
Header set X-Robots-Tag "noindex" env=NOINDEX_HEADER
</IfModule>

3. Use a server header checker tool (such as tools.seobook.com) to make sure those pages now have an X-Robots-Tag of noindex (or check it yourself, as in the sketch after this list).

4. Some time later, once the URLs are de-indexed, add rel="nofollow" to your faceted nav links.
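
If you prefer to check from the command line rather than a third-party tool, here is a minimal sketch (Python standard library, hypothetical URLs standing in for your real faceted parameters):

import urllib.request

# Hypothetical URLs; substitute a real faceted URL and a normal page.
urls = [
    "https://www.example.com/widgets?limit=24",   # should now carry X-Robots-Tag: noindex
    "https://www.example.com/widgets",            # normal page, for contrast
]

for url in urls:
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        print(url, "->", resp.headers.get("X-Robots-Tag", "(no X-Robots-Tag header)"))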
4:34 pm on Oct 17, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15928
votes: 884


2. That URL responds with a 200
But how can they know the response if they haven't crawled (requested) the page? We'd have to postulate a plainclothes Googlebot, operating from a non-Googlebot range, checking up on new URLs while keeping G's visible hands clean.
6:11 pm on Oct 17, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


But how can they know the response if they haven't crawled (requested) the page?


By using HEAD instead of GET.
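
A minimal sketch of the difference, if it helps (Python standard library, hypothetical URL):

import urllib.request

url = "https://www.example.com/some-page"  # hypothetical URL

# HEAD returns only the status line and headers: enough to learn that the
# URL exists and answers 200, without fetching any of its content.
head = urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
print("HEAD:", head.status, head.headers.get("Content-Type"))

# GET retrieves the whole document, which is what an ordinary crawl does.
get = urllib.request.urlopen(url)
print("GET:", get.status, len(get.read()), "bytes of body")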
9:11 pm on Oct 17, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15928
votes: 884


HEAD instead of GET
Good idea: that lets you verify that a page exists, without seeing its content. But if you're roboted-out, you're not supposed to be making HEAD requests either.

:: detour to raw logs to see if Googlebot has ever made a HEAD request ::

Nope, HEAD.+Googlebot turns up only fakers, except for that weird string of
"HEAD /amp_preconnect_polyfill_404_or_other_error_expected._Do_not_worry_about_it?1550188800000 HTTP/1.1"
from last February that I'm sure must have been discussed hereabouts somewhere.
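
(If anyone wants to run the same check, here is a minimal sketch that scans a combined-format access log for HEAD requests claiming to be Googlebot; the log path and format are assumptions.)

import re

# Match a HEAD request on any log line whose user-agent mentions Googlebot.
pattern = re.compile(r'"HEAD [^"]*".*Googlebot')

# Adjust the path for your server.
with open("/var/log/apache2/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip())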
9:49 pm on Oct 17, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 104
votes: 17


Google Webmaster Tools needs a way to mark "I don't want these indexed". If some idiot decides to backlink all of your URLs that are disallowed by robots.txt, you will start having hundreds of pages indexed with no information. The only solution seems to be removing the robots.txt line and letting Googlebot slam tons of irrelevant links. That is not easy to defend against.
9:51 pm on Oct 17, 2019 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 7, 2006
posts: 1137
votes: 140


Nope, HEAD.+Googlebot turns up only fakers


On checking, my own logs concur with that.

Interestingly, a log search for my indexed-but-blocked page doesn't turn up anything from Googlebot either (no requests at all: GET, HEAD or otherwise), so - on the face of it - it looks like they are indexing pages without even checking they are there.

On that basis my guess is that this "warning" derives from the recent change to nofollow, and that "indexed" doesn't mean indexed (at least, not as anyone here would understand it), but linked.

@andreicomoti - what do your logs tell us? Has Googlebot requested all the pages for which either warning was generated? Some (most/only a few) of them? None of them?
11:34 pm on Oct 17, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15928
votes: 884


Honestly I don’t think “indexed” means anything beyond “we know that this page exists”.
12:23 am on Oct 18, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10557
votes: 1118


In recent reaffirmations that g will pay attention to robots.txt it appears they are now trying to frighten sites into removing anything in robots.txt that will prevent them from indexing EVERYTHING YOU HAVE ... (wearing a bright, shiny, newly-folded tin foil hat).

"blocked by robots.txt" makes you, the webmaster, the villain.
:)
11:53 am on Oct 18, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


Hello everybody,

Here is my approach on this, and your opinions are all welcome:

All faceted navigation links currently have a canonical tag pointing to the main category URL, plus index, follow directives. They are blocked from crawling by robots.txt. Here are the steps:

1. Remove the canonical tag so that it does not interfere with the meta robots directive
2. Add noindex to all these pages so that Google will remove them from the index
3. Remove the robots.txt rule that prevents Google from crawling the URLs
4. After Google de-indexes the pages, add the rule back to robots.txt to prevent crawling

What do you think?
12:17 pm on Oct 18, 2019 (gmt 0)

New User

joined:Aug 23, 2018
posts:9
votes: 1


1. Yes.
2. Yes.
3. Yes.
4. Yes

And add something like this (in this example, colour/size/type are your faceted parameters) to .htaccess:

<IfModule mod_rewrite.c>
RewriteCond %{QUERY_STRING} ^(colour|size|type)=([a-zA-Z0-9&+%_=.*]*)$
RewriteRule .* - [E=NOINDEX_HEADER:1]
</IfModule>

<IfModule mod_headers.c>
Header set X-Robots-Tag "noindex" env=NOINDEX_HEADER
</IfModule>

Maybe also create a sitemap for them all and add it to GSC so Google can discover them quicker.
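
A throwaway sitemap like that is easy to generate; a minimal sketch (Python standard library, made-up URLs - your real list would come from the platform or the server logs):

from xml.sax.saxutils import escape

# Hypothetical faceted URLs awaiting de-indexing.
urls = [
    "https://www.example.com/widgets?colour=blue",
    "https://www.example.com/widgets?size=large",
]

with open("facet-sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        f.write("  <url><loc>%s</loc></url>\n" % escape(url))
    f.write("</urlset>\n")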

[edited by: seo21 at 1:19 pm (utc) on Oct 18, 2019]

1:13 pm on Oct 18, 2019 (gmt 0)

New User

joined:Feb 22, 2019
posts:16
votes: 2


follow, noindex does the trick. It got rid of all these problems.
7:40 am on Oct 19, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


Hello guys, here is an update:

We rolled out the new design on desktop, but the mobile version of the site remained on the old structure. So some of the old faceted navigation links remained in the index, and Google now mixes the old faceted navigation links with the new ones (which have a completely new structure).

I think the reason Google did not index all the old faceted navigation links is the canonical tag on those pages, which points to the main category URL. Since the canonical is what is keeping the old faceted navigation links out of the index, we can't simply remove it in order to noindex, nofollow the new faceted navigation links: I'm afraid that would solve the problem for the new URLs but give Google the opportunity to index all the old faceted navigation links (from the old site, which are still accessible through the mobile version of the site). The new faceted navigation links have common parameters that we can use to add the noindex, nofollow tag, but the old faceted navigation links are mixed and can generate millions of URLs without any common parameters (random filters).

What do you think is the best way to approach this problem? The number of URLs reported with warning in GSC is still rising.
8:20 am on Oct 19, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10557
votes: 1118


Er .... RESPONSIVE? One rule for them all?

The days of maintaining two versions of a site disappeared YEARS ago.

A redirect might be in order. :)
11:30 pm on Oct 20, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:May 1, 2018
posts: 104
votes: 17


Andre, that's what I did. Get ready for 5x the server load, and even after you put the robots rule back it will take some days. It affected traffic as well. Then, as soon as the block is back, those pages start appearing again because they are backlinked by these virus-like links. Doesn't anybody else see the thousands and thousands of spun articles on what appear to be hacked sites? I'll paste a few here if it's allowed. Every day in my niche I can find 200 of them, all scraping our articles and adding bad links, including to pages in robots.txt, with keyword tags.