Forum Moderators: goodroi


Blog Tag and Author page blocking with robots.txt

         

seojustin

11:30 pm on Dec 2, 2016 (gmt 0)

5+ Year Member



I have a blog on my site with author (/blog/author/name) and tag (/blog/tag/tagname) pages starting to add up. There are close to 200 out of about 1200 total pages. In my opinion these add little value and eat up crawl budget.

What do you think about blocking those subfolders in the robots.txt file? And to do that would I just use Disallow: /blog/author/ and Disallow: /blog/tag/ in the robots.txt file?

In addition to blocking them from the crawl, I would also noindex them in WordPress so they stop showing up in SERPs.

How do you all handle Author and Tag pages?

Any thoughts would be greatly appreciated. Thanks

keyplyr

12:14 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi seojustin and welcome to WebmasterWorld [webmasterworld.com]
would I just use Disallow: /blog/author/ and Disallow: /blog/tag/ in the robots.txt file?
Yes, that's correct; whichever works toward your needs.
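For reference, a minimal robots.txt along those lines might look like this (the directory paths are the ones from the question; adjust to your own structure):

```text
User-agent: *
Disallow: /blog/author/
Disallow: /blog/tag/
```

Note that Disallow is prefix matching, so this covers everything under those two directories.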

What "crawl budget" are you referring to?

As for the WP noindex question, I'll yield to those more knowledgeable with WP.

not2easy

2:37 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Is this a WordPress blog? It's a good idea to limit the types of pages that are indexed since WP offers so many ways to find the same content. The problem with disallowing the pages you don't want to index is that robots will follow links from one part of your site to another. So you can tell Google not to crawl your /tag/ directories but that does not prevent them from being indexed.

Are you using a plugin to create your sitemaps and if so does your plugin give you control of what taxonomy is being submitted?

phranque

5:29 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The problem with disallowing the pages you don't want to index is that robots will follow links from one part of your site to another. So you can tell Google not to crawl your /tag/ directories but that does not prevent them from being indexed.

Another way of stating this is that when the crawler is blocked by robots.txt, it never sees the noindex.
It does, however, still know the URL and perhaps some anchor text and context...
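For context, the noindex a blocked crawler never gets to see is the robots meta tag in the page's head, along these lines (a sketch; "follow" is optional and just keeps link equity flowing):

```html
<!-- Emitted on each /blog/tag/ and /blog/author/ page:
     the page may be crawled, but should not be indexed -->
<meta name="robots" content="noindex, follow">
```

For this tag to have any effect, the page must NOT be disallowed in robots.txt, since the crawler has to fetch the page to read it.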

keyplyr

6:27 am on Dec 3, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which is why crawler access to those pages should still be allowed. So don't block the crawler in robots.txt; only noindex the specific files/pages.
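One way to noindex whole sections without touching the page templates is an HTTP response header. A sketch for Apache, assuming mod_headers and mod_setenvif are enabled and using the path patterns from the question:

```text
# Mark tag and author pages: blocked from the index, not from the crawl
SetEnvIf Request_URI "^/blog/(tag|author)/" NOINDEX_SECTION
Header set X-Robots-Tag "noindex, follow" env=NOINDEX_SECTION
```

Google treats the X-Robots-Tag header the same as an on-page robots meta tag, and again it only works if the URLs remain crawlable.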