Forum Moderators: Robert Charlton & goodroi


Time to insert the old URLs into robots.txt after 301 consolidation?


guarriman3

8:31 am on Jun 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi there,

I manage a website with 200,000+ products stored in a database. Initially, I had 5 different URLs per product (one with the general description, another with photos, etc.), with the consequent risks of thin content, wasted crawl budget, etc.

Some months ago I started consolidating my content by merging the 5 different URLs for each product into a single one, using 301 redirects:
- mydomain/photos/productname --> mydomain/main/productname#photos
- mydomain/comments/productname --> mydomain/main/productname#comments
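In case it's useful, here is a sketch of how redirects like these can be set up, assuming Apache mod_rewrite (my actual configuration may differ; paths are illustrative):

```apache
# Sketch: 301-redirect the old per-product URLs to the consolidated page.
# Assumes Apache mod_rewrite; hypothetical paths.
RewriteEngine On
# NE (noescape) stops Apache from encoding the "#" fragment as %23
RewriteRule ^photos/(.+)$ https://mydomain/main/$1#photos [R=301,L,NE]
RewriteRule ^comments/(.+)$ https://mydomain/main/$1#comments [R=301,L,NE]
```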


I assumed that, after so many months, Google would have already decided "hey, these URLs no longer exist and the information is in the new ones". However, I'm still finding that Googlebot is crawling the old URLs, and I'm afraid this activity may be wasting my crawl budget. Currently, in Google Search Console (Coverage > Excluded), I've got 380k URLs under the "Page with redirect" section.

My question is: is it time to insert all the old URLs into robots.txt to tell Googlebot to stop crawling them?

User-Agent: *
Disallow: /photos/
Disallow: /comments/
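(For reference: robots.txt Disallow rules match by URL-path prefix. A minimal sketch of how those two rules would classify a path, ignoring wildcards, Allow lines, and longest-match precedence:)

```shell
# Simplified sketch of robots.txt prefix matching for the two Disallow rules above.
# Real robots.txt parsing also handles wildcards, "Allow:", and precedence rules.
check_blocked() {
  case "$1" in
    /photos/*|/comments/*) echo "blocked" ;;
    *) echo "allowed" ;;
  esac
}
```

For example, `check_blocked /photos/productname` prints "blocked", while `check_blocked /main/productname` prints "allowed".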


Thank you.

not2easy

11:12 am on Jun 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Do your pages include a canonical meta tag? If so, did you make any adjustments in that tag for consolidating the pages to a single URL?

Have you verified that the 301 is functioning as expected? Not just that you land on the 'right' URL when you enter the old one, but that the server is actually returning a 301 response? Have your sitemaps excluded the old URLs?

I would not use robots.txt to prevent crawling until at least verifying why they continue to crawl those URLs - although Google never forgets any URL it has seen once and will keep on trying it from time to time.

Google offers tips on the best way to consolidate content: [developers.google.com...]

guarriman3

1:56 pm on Jun 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi @not2easy, thank you very much for your nice answer.

Have you verified that the 301 is functioning as expected?


Yes, I verified the URLs using 'curl' in my Linux shell:
# curl -iL https://mydomain/photos/productname


HTTP/1.1 301 Moved Permanently
Date: Mon, 07 Jun 2021 13:46:12 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Location: https://mydomain/main/productname#photos
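To spot-check many URLs the same way, a small helper can assert both the status code and the Location target (a sketch only; it reads raw response headers on stdin, e.g. from 'curl -sI'):

```shell
# Verify that raw HTTP response headers (on stdin) show a 301 to the expected URL.
# Sketch only; pipe in the output of "curl -sI https://mydomain/photos/productname".
check_redirect() {
  expected="$1"
  status="" location=""
  while IFS= read -r line; do
    line=${line%$'\r'}                       # strip trailing CR from header lines
    case "$line" in
      HTTP/*) status=$(printf '%s\n' "$line" | awk '{print $2}') ;;
      [Ll]ocation:*) location=${line#*: } ;;
    esac
  done
  if [ "$status" = "301" ] && [ "$location" = "$expected" ]; then
    echo "OK"
  else
    echo "FAIL (status=$status location=$location)"
  fi
}
```

Usage: `curl -sI https://mydomain/photos/productname | check_redirect "https://mydomain/main/productname#photos"`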


Have your sitemaps excluded the old URLs?


Yes, they are excluded

Do your pages include a canonical meta tag?


I don't quite understand this point. The canonical meta tag is included in the new URLs, but the old URL just returns a 301 redirect to the new URL. As far as I know, there is no way to include HTML meta tags in that response, right?

not2easy

2:26 pm on Jun 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I do not understand very well this point. The canonical metatag is included in the new URLs, but the old URL shows just the 301 contents to redirect to the new URL.
I was asking whether, if the old pages (URLs) had canonical meta tags, those tags had been updated to point to the new canonical URL. If the old pages had a canonical tag but now contain no tags at all, then no, the meta tag was not updated to the new URL.

If you read the information on that Google page I linked to about consolidating content, it might explain that situation better.

lucy24

4:31 pm on Jun 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I assumed that, after so many months, Google would have already decided "hey, these URLs no longer exist and the information is in the new ones".
Google never forgets an URL. But they do slow down eventually.

As not2easy says, do not use robots.txt to deny access to redirected pages. They won't see the redirect unless they're allowed to make the request. Do make sure that all your internal links are up-to-date--and also any external links that are in your power to change--so the incorrect URLs aren't being reinforced.

Now, you might eventually decide that all those redirects are more trouble than they're worth--but that's a subject for many years in the future. For example, in late 2013 I spun off three-quarters of one site to a new site, where I continued tweaking some URL paths, so additional redirects had to be added in both places. In mid-2019--i.e. five and a half years later--I decided enough was enough and started returning 410s on requests for moved material. Result: htaccess dropped to 1/4 of its former size, making correspondingly less work for the server. Again, that's five and a half years: more than enough time for any and all humans to have updated their bookmarks. And yes, search engines do still send in a scattering of requests every day, but at this point all I can say is ### ’em, they know where to find the material.
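(For the record, switching from redirects to 410s can be just a couple of htaccess lines, assuming Apache mod_rewrite; a sketch using this thread's hypothetical paths, not my own site's:)

```apache
# Sketch, assuming Apache mod_rewrite: stop redirecting, answer 410 Gone instead.
# The [G] flag makes Apache return "410 Gone" for matching requests.
RewriteEngine On
RewriteRule ^photos/ - [G,L]
RewriteRule ^comments/ - [G,L]
```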

guarriman3

12:49 pm on Jun 8, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



@not2easy, thank you again for your nice answer. You did explain the situation, but I hadn't even clicked on the Google tips you provided. My apologies.

I've been reading Google's document and your comments. Yes, the 'rel=canonical' meta tag does not make sense in this case and I should use 301-redirects because I "want to get rid of existing duplicate pages, but need to ensure a smooth transition before you retire the old URLs".

I did want to remove the old pages. 60% of them (e.g. photos of a very unpopular product) had not received a single visit in years, but they were being crawled by Googlebot and generating duplicate content ("photos of very-similar-product-name"). So I consider that merging all the information into a single page is better for users (one page with lots of data) and for my crawl budget.

I've got two questions:

1) For similar situations in the future, would it be a good idea to insert the canonical meta tags first for, say, two weeks? Then I would remove the old URLs from the sitemaps and add the 301 redirects.

2) I read in the Google document that it's possible to include the canonical URL in an HTTP header. Does it make sense to include the canonical URL in the 301 response itself?

HTTP/1.1 301 Moved Permanently
Location: https://mydomain/main/productname#photos
Link: <https://mydomain/main/productname#photos>; rel="canonical"


@lucy24, thank you very much for your answer too. I will not use 'robots.txt' to prevent Googlebot from crawling the old pages. I will wait a few more years :-P