|Should I include noindex/noarchive pages in sitemap.xml?|
There was already an old thread about this but it's now closed, and I think the matter is still not very straightforward, so I would like to start a new one.
Responses in the old thread state pretty much that you should include in sitemap.xml only those URLs you want indexed by Google. I find no such statement from Google on their help pages about sitemaps: [support.google.com...]
In fact, the closest they get to explaining how sitemaps are used is this:
|Google doesn't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future |
Arguably the no-indexed pages are still a part of the site's structure (they exist for some reason), and the particular focus on "structure" makes me a bit worried because of the specific issue I'm dealing with. I work on a site where all of the content pages are linked from category listing pages, only occasionally from other content pages, and temporarily (until they become too old) from the homepage.
So, the most reliable (meaning, every content page has it) internal linking structure is this:
home -> category listing -> content
All category listing pages are no-indexed but then still included in the sitemap.xml. This isn't a WP install but I believe this is also the default (or a recommended, anyway) setting in some of the WP SEO plugins.
The rationale here being (my understanding, limited as it may be) that the category listing pages do not have their own content, hence they are "low quality". Panda and perhaps Penguin do not like low quality, so you'd get slapped site-wide for having a considerable number of those indexed. However, they are still needed for discovering links to content pages because some of the content pages have internal links ONLY from the category listing pages.
Does this rationale make any sense to you guys? I went into the sitemap.xml to remove some of the content pages that may be considered low quality and were no-indexed, and that looks like a no-brainer to me. However, once I stumbled upon those category listing pages, no-indexed but still in the sitemap, I got myself thoroughly confused and am now looking for advice here.
Actually, I'm not even so sure now that those category pages should be no-indexed in the first place. It looks like doing it this way breaks the flow of PR internally in a pretty bad way, since there are a number of pages on this site that have no internal link from an indexed page. And yet the category pages really don't have any content of their own - each is pretty much just a list of titles from the content pages. This cyclical logic hurts my brain :(
I would appreciate any response or comment you can offer on the matter.
You've got two unrelated questions.
#1 Should I no-index such-and-such page?
#2 Does a no-index page belong on the sitemap?
The answer to #2 is: You need it on the sitemap if and only if the sitemap is the only way to find those no-indexed pages, and your no-indexed pages are in turn the only way to find indexed pages deeper in the site. That's assuming that search engines continue to treat "noindex" and "nofollow" as unrelated concepts. Meanwhile it's quite a lot of ifs.
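For anyone who wants to test that rule against an actual link graph, here is a minimal Python sketch. The `links` dict (page -> list of pages it links to), the `indexed` set and the function names are all hypothetical, and it assumes crawlers follow links from noindexed pages, as stated above.

```python
from collections import deque


def reachable(links, start):
    """Return every page that can be reached from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def needs_sitemap_entry(links, home, page, indexed):
    """True only if `page` cannot be found by crawling from `home` AND
    it leads to at least one indexed page that crawling alone would miss."""
    crawlable = reachable(links, home)
    if page in crawlable:
        return False  # link-following already finds it
    beyond = reachable(links, page) - crawlable
    return any(p in indexed for p in beyond)


# Hypothetical site: "/cat" is not linked from anywhere, but it links to a post.
links = {"/": ["/about"], "/cat": ["/post-1"]}
indexed = {"/", "/about", "/post-1"}
print(needs_sitemap_entry(links, "/", "/cat", indexed))
```

If the category page is linked from the homepage, the same function returns False, which is the "quite a lot of ifs" point: in most normally linked sites the condition is not met.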
Thanks, lucy24. I think I get the idea. It would be difficult for me to imagine the SE not being able to get to these no-index pages (category listings) given that the first of them is linked from the homepage and after that they are all linked from neighboring category pages. In other words, it does not look like they need to be in the sitemap just to be discovered.
Question #1, although not directly related to sitemap.xml, is still unclear to me though. Am I hurting the site by no-indexing those category listings (even though they by themselves are "low quality" content), thus insulating some of the pages from any possible PR down-flow from the homepage? I should clarify: there are no tag listings, date listings or other alternative listings - those no-indexed category listings are the *only* way to get to some (a considerable number) of the content pages.
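To put a number on "a considerable amount", one could list the pages whose every internal inbound link comes from a noindexed page. This is only a sketch: the `links` dict and the URL paths are made up, and it says nothing about how search engines actually treat such pages.

```python
def only_linked_from_noindexed(links, noindexed):
    """Return pages (sorted) whose internal inbound links all come from noindexed pages.

    links:     dict mapping a page to the list of pages it links to (hypothetical data)
    noindexed: set of pages carrying a noindex directive
    """
    inbound = {}
    for source, targets in links.items():
        for target in targets:
            inbound.setdefault(target, set()).add(source)
    return sorted(
        page
        for page, sources in inbound.items()
        if page not in noindexed and sources <= noindexed
    )


# Hypothetical structure: home -> category listing -> content
links = {
    "/": ["/cat", "/post-new"],         # homepage links to fresh content for a while
    "/cat": ["/post-new", "/post-old"],  # category listing links to everything
}
print(only_linked_from_noindexed(links, {"/cat"}))
```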
If those noindex "category" pages could also be called "index" or "table of contents" for the linked pages, then you probably should not restrict them. My experience has been noindex,follow pages lose PR but if you change to index,follow the pages show PR quickly.
I don't no-index category pages (but I do try to add content to them where I can) and I don't put no-indexed pages in my sitemaps.
@1script, you haven't mentioned how you are handling subpages of these category pages.
But here is my take.
Adding noindex to category pages doesn't mean they block the flow of PR through them. They can still pass on PR. You are confusing noindex with nofollow. It is the nofollow meta tag that essentially prevents the flow of PR. If you just added the noindex meta tag, then PR can continue to flow through those category pages. The default is "follow".
But whether to noindex or not is your own call. I would suggest that if the category pages have some great collection of topic-relevant products/articles, you can allow them to be indexed. But if I remember right, Google somewhere suggested not getting tag pages or search result pages indexed. Do you find these category pages different from those tag or search result pages? If yes, allow them to be indexed.
Coming to sitemaps, it doesn't really matter if you have noindex pages in them. Yes, I know that one WP SEO plugin spread that myth, but in my experience there is no harm in either including them or leaving them out.
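To make the noindex/nofollow distinction concrete, here is a small Python sketch that reads the robots meta tag from a page and reports the two settings separately. The class and function names are hypothetical; it only looks at `<meta name="robots">` (not X-Robots-Tag headers or robots.txt) and treats `none` as shorthand for `noindex, nofollow`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_flags(html):
    """Report indexing and link-following separately; both default to allowed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    d = parser.directives
    return {
        "indexable": "noindex" not in d and "none" not in d,
        "links_followed": "nofollow" not in d and "none" not in d,
    }


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(robots_flags(page))
```

A page with only `noindex` comes back as not indexable but with links still followed, which is the default "follow" behaviour described above.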
Thanks, indyank. As far as subpages (?) - the category listings are paginated - is this what you meant? - and there is a navigational bar to page 1, 2, 3 ... etc. but as far as structure, the last level down from the category listing pages are the content pages themselves.
I guess you are right, I am a bit confused about PR flow here - it seemed to me that if the page is not even in the index, no PR can flow through a page that does not exist. I was under the impression that the "follow" attribute (as in lack of "nofollow") refers to discovery of links, not passing PR to them.
I do see one disadvantage in having the noindexed URLs in the sitemap - it seems like you would not want to waste your crawl budget on URLs you don't want indexed anyhow. My guess is, however, that the sitemap is not quite what Google uses for discovery of new URLs, and if there's a link to a URL somewhere, it follows the link and has to read and process the URL just to see the noindex meta in its code, so it all may just be a moot point.
I am personally leaning towards removing the noindexed URLs from the sitemap, at least to see how many of the ones that I actually want indexed are actually indexed (as reported in WMT).
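A rough Python sketch of that cleanup, assuming the sitemap follows the standard sitemaps.org schema and that the set of noindexed URLs is already known. The function name and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def drop_noindexed(sitemap_xml, noindexed_urls):
    """Return the sitemap XML with every <url> whose <loc> is in noindexed_urls removed."""
    # Keep the sitemap namespace as the default one when serializing.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    for url_el in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url_el.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and (loc.text or "").strip() in noindexed_urls:
            root.remove(url_el)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/category/widgets</loc></url>"
    "<url><loc>https://example.com/articles/blue-widget</loc></url>"
    "</urlset>"
)
print(drop_noindexed(sample, {"https://example.com/category/widgets"}))
```

With only the pages you want indexed left in the file, the submitted vs. indexed counts reported for the sitemap become easier to interpret.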
There's a common saying in website design (named after a book) "Don't Make Me Think" (or alternately, Keep It Simple, Stupid (KISS)).
That's kind of how I deal with Google. I don't want to make them have to think - so that's why I keep noindexed URLs out of my sitemaps.
|All category listing pages are no-indexed but then still included in the sitemap.xml This isn't a WP install but I believe this is also the default (or a recommended, anyway) setting in some of the WP SEO plugins. |
The reason WP SEO plugins allow the option to disable indexing of category pages is that those pages are often only an alternate method of reaching the same pages or posts, so they could be seen as duplicate content, not because they are thin. Categories in WP are used differently than the way you are using them. Generally a WP category is a virtual page listing posts or pages that have been tagged with that category, and those may also be listed separately under monthly or yearly archives. The default behavior of WP gives many ways to find the same content, but a static site doesn't usually have that problem unless it was structured so that the same content can be found multiple ways.
As mentioned, if your category pages are the way visitors find the pages they list, then you probably do want them indexed.
As with most things SEO related... it depends.
I've never been a big fan of sitemap.xml files except for large sites (10s of thousands of pages or more typically). I would rather Google infer the importance/priority of the various URLs on my site by evaluating my internal linking structures and external inbound links. They are quite good at this.
If I encounter a site with crawlability issues, I would rather fix the issues preventing the site from being crawled... instead of putting a bandaid on it with a sitemap.xml.
If I have a brand new site, getting it indexed w/ a sitemap.xml is pretty much worthless. It's not going to rank for anything significant beyond any initial honeymoon period without links. New sites with no links that get indexed are like being all dressed up with nowhere to go. Rather than wasting time creating, managing priorities for, and submitting sitemap.xml files for a new site to get it indexed... I'd rather spend that time building links which will both get the site indexed naturally AND provide it backlinks to assist with rankings once it is indexed.
However, if you choose to use a sitemap.xml file, there are times when you might want to include in your sitemap.xml certain URLs flagged as NOINDEX, NOFOLLOW, or NOARCHIVE using a meta robots element.
A page flagged NOINDEX can still accumulate and pass PageRank/link juice out to other pages to which it links. It simply won't be shown in the SERPs. There may be times when you want to make sure such pages are crawled so that the engines will discover other pages ONLY linked to from such NOINDEXed pages by giving the NOINDEX URL a high priority.
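As a sketch of what that could look like, the following Python builds a sitemap in which a NOINDEXed listing page gets a higher `<priority>` than a content page. The URLs and priority values are made up, the function name is hypothetical, and `<priority>` is only a hint in the sitemaps.org protocol, so there is no guarantee any engine acts on it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, priority) pairs, with priority between 0.0 and 1.0."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, priority in entries:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


entries = [
    ("https://example.com/category/widgets", 0.9),     # NOINDEXed hub page we still want crawled
    ("https://example.com/articles/blue-widget", 0.5),  # content page reachable only via the hub
]
print(build_sitemap(entries))
```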
You may want a page flagged NOFOLLOW to still be indexed though you don't want crawlers to follow and pass link juice to its outbound links.
You may want a page flagged NOARCHIVE to still be indexed though you do not want a cached copy to be maintained at Google or a history of your pages to be maintained by the Wayback Machine at archive.org.
If there were an issue with submitting URLs containing meta robots NOINDEX or meta robots NOARCHIVE elements, you'd likely get a warning when you submitted the sitemap to Google... similar to the way they warn you when you submit URLs that redirect to other URLs. I seriously doubt that including them will ever hurt your site, and in certain situations it may be useful in getting your site crawled and/or indexed better.