Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Duplicate content issues - crawl budget optimization

         

andreicomoti

8:33 am on Oct 30, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



Hello,

I have an ecommerce website and multiple physical stores in different cities. To cope with stock differences, I have a general store that I let Google index, and multiple city stores that are blocked from crawling with robots.txt. Here is an example so that you can better understand the setup:

1. https://www.example.com/countryabbreviation/extendible-sofas/c/12 (allowed in robots.txt, meta robots index/follow, self-referencing canonical)

2. https://www.example.com/city1/extendible-sofas/c/12 (blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

3. https://www.example.com/city2/extendible-sofas/c/12 (blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

So, as you can see, all cities are blocked by robots.txt and canonical to the country abbreviation page (that we want to index). After reaching our site, the users are asked to select a country in order to make a purchase.
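For clarity, here is a minimal sketch of our robots.txt, assuming the city slugs really are path prefixes as in the examples above (names are placeholders):

```
User-agent: *
Disallow: /city1/
Disallow: /city2/
# ...one Disallow per city, roughly 50 in total
Allow: /countryabbreviation/
```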

My questions are:

What do you think about this strategy?

What do you think about the crawl budget? We have nearly 50 cities, so every page on our site is duplicated about 50 times: the original version of a page plus 50 store versions. Even though we block the 50 stores through robots.txt, I believe we are wasting crawl budget (Google still crawls a page even if it is blocked by robots.txt).

Would it be OK to "noindex/nofollow" the URLs that contain stores? Would this optimize crawl budget, or is a better strategy needed?

Thanks for your opinions.

phranque

10:42 am on Oct 30, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



(blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

What do you think about this strategy?

the meta robots and link rel canonical elements are irrelevant if googlebot is excluded from crawling that url.

(Google still crawls a page even if its blocked by robots.txt).

what evidence do you have of googlebot crawling urls excluded by robots.txt?
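one quick sanity check is to run your rules through a standards-compliant parser and confirm the city urls really are excluded — a sketch using python's urllib.robotparser, with hypothetical rules and paths taken from the first post:

```python
from urllib.robotparser import RobotFileParser

# hypothetical rules mirroring the setup described in the first post
rules = """\
User-agent: *
Disallow: /city1/
Disallow: /city2/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

country = "https://www.example.com/countryabbreviation/extendible-sofas/c/12"
city = "https://www.example.com/city1/extendible-sofas/c/12"

print(parser.can_fetch("Googlebot", country))  # → True  (crawlable)
print(parser.can_fetch("Googlebot", city))     # → False (excluded)
```

this only tells you what a well-behaved parser would do with your file, not what googlebot actually did — for that you'd check your server access logs.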

Would this optimiza crawl budget? Or a better strategy is needed?

how many urls overall?

andreicomoti

11:20 am on Oct 30, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



@phranque

Hello, thanks for your interest.

It's a strange situation. The URLs were not in the index before we redesigned our website. After the redesign, Google started to show the URLs with cities in the "Valid with warnings" section of the Coverage report (Indexed, though blocked by robots.txt). This did not happen before the redesign.

The URLs with cities were always excluded through robots.txt; however, somehow, I don't know how, Google started picking those URLs up after the redesign.

We are talking about 160 million URLs in total.

lucy24

4:33 pm on Oct 30, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Valid with warnings"

That makes it sound as if they are not crawling. “Indexed though blocked by robots.txt” simply means “We know that this URL exists, although we haven’t officially visited”. There have been a number of recent threads about this particular issue. Or non-issue, depending on your viewpoint.

The canonical tag would only become an issue if the URL you name as canonical is one that you don't allow in robots.txt. Otherwise the search engine crawls its authorized page, glances at the canonical and says something more like “Well, yeah, it’s the only URL for this content that I’ve ever visited, so I should hope it’s the canonical.”

notoriusbean

4:45 pm on Nov 1, 2019 (gmt 0)

5+ Year Member



Don't canonical an alternate page to a main page if you're also going to put noindex on the alternate page. This sends mixed signals to Google, because they can interpret it as a noindex on the canonical url. You are risking Google noindexing your canonical. See John Mueller's explanation: [searchenginejournal.com...]
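To make the conflict concrete, this is the combination to avoid on a city page (hypothetical markup, using the URLs from the first post):

```html
<!-- on https://www.example.com/city1/extendible-sofas/c/12 -->
<meta name="robots" content="noindex">
<link rel="canonical" href="https://www.example.com/countryabbreviation/extendible-sofas/c/12">
<!-- if Google folds the two URLs together as duplicates,
     the noindex may be applied to the canonical target -->
```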

If it's regional differences in the URL and the content is the same, you can look into using rel=alternate hreflang tags: [support.google.com...]
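For illustration, the annotations might look like this in the head of each variant (region codes and paths are hypothetical):

```html
<link rel="alternate" hreflang="en-us" href="https://www.example.com/us/extendible-sofas/c/12">
<link rel="alternate" hreflang="en-gb" href="https://www.example.com/gb/extendible-sofas/c/12">
<link rel="alternate" hreflang="x-default" href="https://www.example.com/extendible-sofas/c/12">
```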

I can't guess as to why exactly you're getting "Indexed though blocked by robots.txt" warnings. Did you submit the urls in your sitemap?

lucy24

6:04 pm on Nov 1, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't guess as to why exactly you're getting "Indexed though blocked by robots.txt" warnings.

If a URL exists, search engines will find it. If they couldn’t, how would your human visitors ever get there? And that’s all the word “indexed” means.

NickMNS

6:35 pm on Nov 1, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That makes it sound as if they are not crawling. “Indexed though blocked by robots.txt” simply means “We know that this URL exists, although we haven’t officially visited”

I read this differently, to me it means, "At some point in the past we (Google) added this page to our index, now you asked us not to go there, so we haven't gone there (and do not plan to go) to see if anything has changed". The error message in GSC is provided as a courtesy (a potentially annoying one!) to confirm that this is really what you intended.

Why is this coming up again here as this was already discussed at length in this thread: [webmasterworld.com...]

The bottom line is that if you block something in robots.txt, you're basically telling Google "don't go there". One cannot then add a directive to those blocked pages and expect Google to follow it (canonical links, noindex tags, etc.). If they're not going to go there, they could never see the directive, so how can one expect it to be followed? It would be worrisome if they did follow the directive.

If you want:
- content removed from the index, then noindex it,
- Google to show one page over another, then use a canonical link (Google does not guarantee that this will be followed),
- Google to stop crawling a page, then block it in robots.txt.

Choose the one you want, but you can't do all three at once.
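Concretely, the three directives look like this, one per goal (paths are hypothetical; note that the two HTML tags only work on pages Google is allowed to crawl):

```
# 1. Stop crawling — in robots.txt:
User-agent: *
Disallow: /city1/

# 2. Remove from the index — meta tag in the page head:
<meta name="robots" content="noindex">

# 3. Prefer another URL — canonical link in the duplicate's head:
<link rel="canonical" href="https://www.example.com/countryabbreviation/extendible-sofas/c/12">
```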

lucy24

8:26 pm on Nov 1, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



to me it means, "At some point in the past we (Google) added this page to our index, now you asked us not to go there, so we haven't gone there (and do not plan to go) to see if anything has changed"

I re-checked GSC. Every single item on the list--not just pages, but extensions like .midi that I don't want crawled--has always been disallowed in robots.txt. What I find more mysterious is that every one of those listed items has a “Last crawled” date--generally in September of this year--which is manifestly untrue. They're not sneaking around in disguise; a few of the listed items are so obscure, nobody at all visited on the specified date.

Choose the one you want, but you can't do all three at once.

My position is that 999 times out of 1000, it doesn’t matter if uncrawled content is putatively indexed.

phranque

11:02 pm on Nov 1, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What I find more mysterious is that every one of those listed items has a “Last crawled” date--generally in September of this year--which is manifestly untrue. They're not sneaking around in disguise; a few of the listed items are so obscure, nobody at all visited on the specified date.

under the circumstances, i might take "last crawled" to mean the last time that url was checked for robots.txt exclusion of googlebot.

lucy24

4:55 pm on Nov 2, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



i might take "last crawled" to mean the last time that url was checked for robots.txt exclusion

If so, that sheds an interesting light on how G uses robots.txt: they don't just crawl it and stash its information in a database, but instead visit with a specific shopping list and make notes about what robots.txt has to say about items on that specific list. G may not be as robots.txt-obsessed as some crawlers one could name, but they have certainly read it more recently than September 2019!

Otherwise, “last crawled” means “last date on which we would have crawled, were we allowed to do so” which is a pretty goofy definition of “crawled”. Not that, er, “goofy” and “Google” are necessarily mutually exclusive.