There was already an old thread about this, but it's now closed, and I think the matter is still not very straightforward, so I would like to start a new one.
Responses in the old thread pretty much state that you should include in sitemap.xml only those URLs you want indexed by Google. I find no such statement from Google on their help pages about sitemaps: [support.google.com...]
In fact, the closest they get to explaining how sitemaps are used is this:
Google doesn't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future
Arguably, the no-indexed pages are still a part of the site's structure (they exist for some reason), and the particular focus on "structure" makes me a bit worried, because the specific issue I'm dealing with is this: I work on a site where all of the content pages are linked from category listing pages, only occasionally from other content pages, and temporarily (until they become too old) from the homepage.
So, the most reliable (meaning, every content page has it) internal linking structure is this:
home -> category listing -> content
All category listing pages are no-indexed but still included in the sitemap.xml. This isn't a WP install, but I believe this is also the default (or at least a recommended) setting in some of the WP SEO plugins.
The rationale here (my understanding, limited as it may be) is that the category listing pages do not have their own content, hence they are "low quality". Panda and perhaps Penguin do not like low quality, so you'd get slapped site-wide for having a considerable number of those indexed. However, they are still needed for discovering links to content pages, because some of the content pages have internal links ONLY from the category listing pages.
Does this rationale make any sense to you guys? I went into the sitemap.xml to remove some of the content pages that may be considered low quality and were no-indexed, and that looks like a no-brainer to me. However, once I stumbled upon those category listing pages, no-indexed but still in the sitemap, I got myself thoroughly confused and am now looking for advice here.
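To make the cleanup step concrete, here is a rough sketch of how one could list the sitemap entries that are also no-indexed. It assumes the list of no-indexed URLs is already known from somewhere else (a CMS export, a crawl); the function name and the example.com URLs are made up for illustration and are not from my actual site.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def noindexed_urls_in_sitemap(sitemap_xml, noindexed):
    """Return the sitemap URLs that also appear in `noindexed`, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if url in noindexed]


# Hypothetical sitemap: home page, one category listing, one content page.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/category/widgets</loc></url>
  <url><loc>https://example.com/content/blue-widget</loc></url>
</urlset>"""

# Hypothetical set of pages carrying a noindex directive.
EXAMPLE_NOINDEXED = {"https://example.com/category/widgets"}

if __name__ == "__main__":
    for url in noindexed_urls_in_sitemap(EXAMPLE_SITEMAP, EXAMPLE_NOINDEXED):
        print("no-indexed but in sitemap:", url)
```

This only reports the overlap; whether those URLs should actually be removed is exactly the question I'm asking.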
Actually, I'm not even sure now that those category pages should be no-indexed in the first place. Doing it this way seems to break the internal flow of PR pretty badly, since there are a number of pages on this site that have no internal link from an indexed page. And yet the category pages really don't have any content of their own: each is pretty much just a list of titles from the content pages. This circular logic hurts my brain :(
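For anyone who wants to check the same thing on their own site, the "no internal link from an indexed page" condition can be sketched roughly like this. It assumes you already have the internal link graph as a dict and a set of no-indexed pages; the names and paths are hypothetical, and it says nothing about how Google actually treats links on no-indexed pages.

```python
def pages_without_indexed_inlink(links, noindexed):
    """Return pages that receive no internal link from any indexed page.

    links: dict mapping each page to the list of pages it links to.
    noindexed: set of pages carrying a noindex directive.
    """
    # Every page seen either as a link source or as a link target.
    all_pages = set(links) | {t for targets in links.values() for t in targets}

    # Collect every page that an indexed page links to.
    linked_from_indexed = set()
    for source, targets in links.items():
        if source not in noindexed:
            linked_from_indexed.update(targets)

    return sorted(all_pages - linked_from_indexed)


# Hypothetical site following the home -> category listing -> content structure.
EXAMPLE_LINKS = {
    "/": ["/category/a", "/content/new"],            # home links to recent content only
    "/category/a": ["/content/new", "/content/old"],
    "/content/new": ["/"],
    "/content/old": ["/"],
}
EXAMPLE_NOINDEXED_PAGES = {"/category/a"}

if __name__ == "__main__":
    print(pages_without_indexed_inlink(EXAMPLE_LINKS, EXAMPLE_NOINDEXED_PAGES))
```

In this made-up example the older content page is the only one reachable solely through the no-indexed category listing, which is the situation I'm worried about.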
I would appreciate any response or comment you can offer on the matter.