

Duplicate content issues - crawl budget optimization

8:33 am on Oct 30, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


Hello,

I have an ecommerce website and multiple physical stores in different cities. To cope with all the stock problems, I have a general store that I index in Google, and multiple city stores that are blocked with robots.txt. Here is an example so that you can better understand the issue:

1. https://www.example.com/countryabbreviation/extendible-sofas/c/12 (allowed in robots.txt, meta robots tag index/follow, canonical to itself)

2. https://www.example.com/city1/extendible-sofas/c/12 (blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

3. https://www.example.com/city2/extendible-sofas/c/12 (blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

So, as you can see, all city versions are blocked by robots.txt and canonical to the country-abbreviation page (the one we want indexed). After reaching our site, users are asked to select a country in order to make a purchase.
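
To make this concrete, the relevant pieces look roughly like this (simplified to two cities; the real robots.txt lists all of them):

    # robots.txt (simplified)
    User-agent: *
    Disallow: /city1/
    Disallow: /city2/

    <!-- <head> of a city page such as /city1/extendible-sofas/c/12 -->
    <meta name="robots" content="index, follow">
    <link rel="canonical" href="https://www.example.com/countryabbreviation/extendible-sofas/c/12">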

My questions are:

What do you think about this strategy?

What do you think about the crawl budget? We have nearly 50 cities, so every page on our site is duplicated about 50 times: the original version of a page plus 50 store versions. Even though we are blocking the 50 stores through robots.txt, I believe that we are wasting crawl budget (Google still crawls a page even if it's blocked by robots.txt).

Would it be ok to "noindex/nofollow" the URLs that contain stores? Would this optimize crawl budget? Or is a better strategy needed?

Thanks for your opinions.
10:42 am on Oct 30, 2019 (gmt 0)

Administrator

phranque

joined:Aug 10, 2004
posts:11870
votes: 244


(blocked by robots.txt, meta robots tag index/follow, canonical to number 1)

What do you think about this strategy?

the meta robots and link rel canonical elements are irrelevant if googlebot is excluded from crawling that url.
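
a minimal sketch of why, using python's stdlib robotparser (robots.txt contents assumed from your description):

    # a compliant crawler consults robots.txt BEFORE fetching a url, so it
    # never downloads an excluded page and never sees the meta robots or
    # link rel=canonical elements inside it.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse("""User-agent: *
    Disallow: /city1/
    Disallow: /city2/""".splitlines())

    url = "https://www.example.com/city1/extendible-sofas/c/12"
    print(rp.can_fetch("Googlebot", url))  # False - the page's <head> is never read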

(Google still crawls a page even if it's blocked by robots.txt).

what evidence do you have of googlebot crawling urls excluded by robots.txt?
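
to check, you could scan your raw access logs for googlebot requests to excluded paths - a rough sketch, assuming a combined-format log at the usual apache location:

    # print googlebot hits on paths that robots.txt excludes
    blocked = ("/city1/", "/city2/")  # illustrative - use your real prefixes

    with open("/var/log/apache2/access.log") as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            # request field looks like: "GET /city1/... HTTP/1.1"
            try:
                path = line.split('"')[1].split()[1]
            except IndexError:
                continue
            if path.startswith(blocked):
                print(line.rstrip())  # candidate hit - verify the ip is really googlebot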

Would this optimize crawl budget? Or is a better strategy needed?

how many urls overall?
11:20 am on Oct 30, 2019 (gmt 0)

New User

joined:Jan 10, 2018
posts: 25
votes: 0


@phranque

Hello, thanks for your interest.

It's a strange situation. The URLs were not in the index before we redesigned our website. After the redesign, Google started to show the URLs with cities in the "Valid with warnings" section of the coverage report (Indexed, though blocked by robots.txt). This did not happen before the redesign.

The URLs with cities were always excluded through robots.txt; however, somehow, I don't know how, Google started picking those URLs up after the redesign.

We are talking about 160mil URLs in total.
4:33 pm on Oct 30, 2019 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:15934
votes: 887


"Valid with warnings"
That makes it sound as if they are not crawling. “Indexed though blocked by robots.txt” simply means “We know that this URL exists, although we haven’t officially visited”. There have been a number of recent threads about this particular issue. Or non-issue, depending on your viewpoint.

The canonical tag would only become an issue if the URL you name as canonical is one that you don't allow in robots.txt. Otherwise the search engine crawls its authorized page, glances at the canonical and says something more like “Well, yeah, it’s the only URL for this content that I’ve ever visited, so I should hope it’s the canonical.”
4:45 pm on Nov 1, 2019 (gmt 0)

New User

joined:Aug 30, 2019
posts: 2
votes: 0


Don't canonical an alternate page to a main page if you're also going to put noindex on the alternate page. This sends mixed signals to Google, because they can interpret it as a noindex on the canonical URL. You are risking Google noindexing your canonical. See John Mueller's explanation: [searchenginejournal.com...]
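
In other words, this is the combination to avoid on the alternate (city) pages - a sketch using the URLs from the example above:

    <!-- on /city1/extendible-sofas/c/12: mixed signals, don't do this -->
    <meta name="robots" content="noindex">
    <link rel="canonical" href="https://www.example.com/countryabbreviation/extendible-sofas/c/12">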

If it's regional differences in the URL and the content is the same, you can look into using rel=alternate hreflang tags: [support.google.com...]
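
Those annotations go in the <head> of each variant and look like this (country codes are illustrative; note hreflang is meant for language/region variants, not per-city stores):

    <link rel="alternate" hreflang="en-us" href="https://www.example.com/us/extendible-sofas/c/12">
    <link rel="alternate" hreflang="en-gb" href="https://www.example.com/gb/extendible-sofas/c/12">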

I can't guess as to why exactly you're getting "Indexed though blocked by robots.txt" warnings. Did you submit the URLs in your sitemap?
6:04 pm on Nov 1, 2019 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:15934
votes: 887


I can't guess as to why exactly you're getting "Indexed though blocked by robots.txt" warnings.

If a URL exists, search engines will find it. If they couldn’t, how would your human visitors ever get there? And that’s all the word “indexed” means.
6:35 pm on Nov 1, 2019 (gmt 0)

Senior Member


joined:Apr 1, 2016
posts:2738
votes: 837


That makes it sound as if they are not crawling. “Indexed though blocked by robots.txt” simply means “We know that this URL exists, although we haven’t officially visited”

I read this differently, to me it means, "At some point in the past we (Google) added this page to our index, now you asked us not to go there, so we haven't gone there (and do not plan to go) to see if anything has changed". The error message in GSC is provided as a courtesy (a potentially annoying one!) to confirm that this is really what you intended.

Why is this coming up again? This was already discussed at length in this thread: [webmasterworld.com...]

The bottom line is that if you block something in robots.txt, you're basically telling Google "don't go there." One cannot then add a directive to those blocked pages and expect Google to follow it (canonical links, noindex tags, etc.). If they're never going to go there, they can never see the directive, so how can one expect it to be followed? It would be worrisome if they did.

If you want:
- content removed from the index, then noindex it;
- Google to show one page over another, use the canonical link (Google does not guarantee that this will be followed);
- Google to stop crawling a page, block it in robots.txt.

Choose the one you want, but you can't do all three at once (sketched below).
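
For reference, the three signals look something like this (values illustrative); note the last two only work if the page stays crawlable:

    # robots.txt - "stop crawling this"
    User-agent: *
    Disallow: /city1/

    <!-- page <head> - "remove this from the index" -->
    <meta name="robots" content="noindex">

    <!-- page <head> - "prefer this other URL" -->
    <link rel="canonical" href="https://www.example.com/countryabbreviation/extendible-sofas/c/12">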
8:26 pm on Nov 1, 2019 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:15934
votes: 887


to me it means, "At some point in the past we (Google) added this page to our index, now you asked us not to go there, so we haven't gone there (and do not plan to go) to see if anything has changed"

I re-checked GSC. Every single item on the list--not just pages, but extensions like .midi that I don't want crawled--has always been disallowed in robots.txt. What I find more mysterious is that every one of those listed items has a “Last crawled” date--generally in September of this year--which is manifestly untrue. They're not sneaking around in disguise; a few of the listed items are so obscure, nobody at all visited on the specified date.

Choose the one you want, but you can't do all three at once.

My position is that 999 times out of 1000, it doesn’t matter if uncrawled content is putatively indexed.
11:02 pm on Nov 1, 2019 (gmt 0)

Administrator

phranque

joined:Aug 10, 2004
posts:11870
votes: 244


What I find more mysterious is that every one of those listed items has a “Last crawled” date--generally in September of this year--which is manifestly untrue. They're not sneaking around in disguise; a few of the listed items are so obscure, nobody at all visited on the specified date.

under the circumstances, i might take "last crawled" to mean the last time that url was checked for robots.txt exclusion of googlebot.
4:55 pm on Nov 2, 2019 (gmt 0)

Senior Member from US 

lucy24

joined:Apr 9, 2011
posts:15934
votes: 887


i might take "last crawled" to mean the last time that url was checked for robots.txt exclusion

If so, that sheds an interesting light on how G uses robots.txt: they don't just crawl it and stash its information in a database, but instead visit with a specific shopping list and make notes about what robots.txt has to say about items on that specific list. G may not be as robots.txt-obsessed as some crawlers one could name, but they have certainly read it more recently than September 2019!

Otherwise, “last crawled” means “last date on which we would have crawled, were we allowed to do so” which is a pretty goofy definition of “crawled”. Not that, er, “goofy” and “Google” are necessarily mutually exclusive.