Forum Moderators: Robert Charlton & goodroi


Disallow in robots.txt or rely on noindex?


LinkedUp

8:51 am on Jul 11, 2023 (gmt 0)



A lot of product pages lead to our booking form, which has a different URL for each product and for each product selection, e.g. www.example.com/product-1/selection-1, www.example.com/product-1/selection-2, etc. These pages contain only form inputs and no content, but the URLs are static.

We have always blocked these pages in robots.txt to save some crawl budget, but they have to be linked from our product pages. So Google still finds and indexes them, because the robots.txt block means it can't read the noindex tag on the booking form pages.

So I was wondering: how would you handle these pages? Leave them disallowed in robots.txt to save crawl budget, or remove the disallow rule so that they can be crawled and the noindex tag keeps them out of Google's index?
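For reference, the two options are mutually exclusive because a disallowed page is never fetched, so its noindex tag is never seen. The two setups look roughly like this (the path pattern is taken from the example URLs above and is illustrative only):

```text
# Option A: keep the disallow -- crawl budget is saved, but any
# noindex tag on the blocked pages is invisible to Google.
User-agent: *
Disallow: /product-1/

# Option B: remove the disallow and instead put this in the <head>
# of each booking form page, so Google crawls it and drops it
# from the index:
#   <meta name="robots" content="noindex, follow">
```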

Plus, is there still a way to find out the crawl budget? I've seen screenshots from the old Search Console where it was shown, but I can't seem to find that information in the newer version.

RareBit

1:55 pm on Jul 11, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Settings > Crawl stats in GSC will show your crawl requests. As for your other question, we just canonicalise back to the original URL; Google now treats this only as a hint, but in our case those pages never get indexed.
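For what it's worth, the canonicalisation RareBit describes is just a link element in the head of each selection URL pointing back at the parent product page (URLs here follow the example from the question):

```html
<!-- On www.example.com/product-1/selection-1, selection-2, etc. -->
<link rel="canonical" href="https://www.example.com/product-1/">
```

Note that this only works if the selection pages are crawlable; like noindex, a canonical hint on a robots.txt-blocked page is never seen.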

lucy24

6:14 pm on Jul 11, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You’re right that in most cases Disallow and noindex are mutually exclusive. (Exception: As a belt-and-suspenders, I put noindex tags on my error documents, although they are located in a Disallowed directory that will never be crawled on purpose.) But...
"So Google still finds and indexes them"
They're indexed in theory, and will be listed that way in GSC. But realistically, it's very rare--though not unheard-of--for a roboted-out page to show up in actual human search results.
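You can confirm the "mutually exclusive" behaviour locally with Python's standard urllib.robotparser, which mimics a compliant crawler's decision: if can_fetch() returns False for a URL, the page is never requested, so any noindex tag on it is never read. A minimal sketch, with the disallow pattern assumed from the example URLs in the question:

```python
from urllib import robotparser

# Simulate a robots.txt that blocks the booking form URLs
# (pattern assumed from the question, not a real site's file).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /product-1/",
])

# A compliant crawler checks can_fetch() before requesting a page;
# if it returns False, the page's <meta name="robots"> is never seen.
print(rp.can_fetch("Googlebot", "https://www.example.com/product-1/selection-1"))
print(rp.can_fetch("Googlebot", "https://www.example.com/products.html"))
```

The first check prints False (blocked, noindex invisible) and the second True (crawlable, so a noindex there would take effect).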

To some extent it's a judgment call. How many pages are involved, and what proportion of your site does this number represent? And, conversely: Of the pages that you do want indexed, how often do they change, and would this be impaired if G### had to spend more time crawling other pages?

Finally: “crawl budget” is a nebulous thing. They don’t allocate a flat number of crawls to every site; if a site gets bigger, there will be more crawling. If all pages in a given directory carry a noindex, then after a while they will not be crawled very often.