Forum Moderators: goodroi


Blog Tag and Author page blocking with robots.txt

         

seojustin

11:30 pm on Dec 2, 2016 (gmt 0)

5+ Year Member



I have a blog on my site with author (/blog/author/name) and tag (/blog/tag/tagname) pages starting to add up. There are close to 200 out of about 1200 total pages. In my opinion these add little value and eat up crawl budget.

What do you think about blocking those subfolders in the robots.txt file? And to do that would I just use Disallow: /blog/author/ and Disallow: /blog/tag/ in the robots.txt file?

In addition to blocking them from the crawl, I would also noindex them in WordPress so they stop showing up in SERPs.

How do you all handle Author and Tag pages?

Any thoughts would be greatly appreciated. Thanks

keyplyr

12:14 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi seojustin and welcome to WebmasterWorld [webmasterworld.com]
would I just use Disallow: /blog/author/ and Disallow: /blog/tag/ in the robots.txt file?
Yes, that's correct; whichever works toward your needs.
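For reference, a minimal robots.txt along those lines might look like this (the directory paths are the ones from the question; adjust to your own structure):

```text
User-agent: *
Disallow: /blog/author/
Disallow: /blog/tag/
```

Note that Disallow is prefix matching, so this covers everything under those two directories.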

What "crawl budget" are you referring to?

As for the WP noindex question, I'll yield to those more knowledgeable with WP.

not2easy

2:37 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is this a WordPress blog? It's a good idea to limit the types of pages that are indexed since WP offers so many ways to find the same content. The problem with disallowing the pages you don't want to index is that robots will follow links from one part of your site to another. So you can tell Google not to crawl your /tag/ directories but that does not prevent them from being indexed.

Are you using a plugin to create your sitemaps and if so does your plugin give you control of what taxonomy is being submitted?

phranque

5:29 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The problem with disallowing the pages you don't want to index is that robots will follow links from one part of your site to another. So you can tell Google not to crawl your /tag/ directories but that does not prevent them from being indexed.

Another way of stating this is that when the crawler is blocked by robots.txt, it never sees the noindex.
It does, however, still know the URL and perhaps some anchor text and context...
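For context, the noindex a blocked crawler never gets to see is the robots meta tag in the page's head, along these lines (a sketch; "follow" is optional and just keeps link equity flowing):

```html
<!-- Emitted on each /blog/tag/ and /blog/author/ page:
     the page may be crawled, but should not be indexed -->
<meta name="robots" content="noindex, follow">
```

For this tag to have any effect, the page must NOT be disallowed in robots.txt, since the crawler has to fetch the page to read it.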

keyplyr

6:27 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which is why crawler access to those pages should still be allowed. So don't block the crawler in robots.txt; only noindex the specific files/pages.
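One way to noindex whole sections without touching the page templates is an HTTP response header. A sketch for Apache, assuming mod_headers and mod_setenvif are enabled and using the path patterns from the question:

```text
# Mark tag and author pages: blocked from the index, not from the crawl
SetEnvIf Request_URI "^/blog/(tag|author)/" NOINDEX_SECTION
Header set X-Robots-Tag "noindex, follow" env=NOINDEX_SECTION
```

Google treats the X-Robots-Tag header the same as an on-page robots meta tag, and again it only works if the URLs remain crawlable.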