Downside to noindex?

Forum Moderators: Robert Charlton & goodroi

realmaverick

4:28 am on Mar 16, 2011 (gmt 0)

After revisiting the structure of my website after several years, I'm finding more and more absolutely useless stuff indexed. I want them deindexed and claim back the link juice and crawl allowance.

An example: a new user signs up and appears in the timeline with their username hyperlinked. Google follows this to their empty profile, which then links to several empty pages where their content would be, should they create any. Of course, many users don't.

I've now un-hyperlinked the username in this instance, removed the links in the profile if no content exists, and added noindex, follow to the pages that follow.

I hate using these tags, just as I hated using nofollow way back when. But sometimes, it seems necessary.

Is there any downside to what I've done with the noindex, follow? Is Google likely to give a crap that I've just told it not to index half a million pages?
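For anyone wanting to picture the change described above, here is a minimal Python sketch. The function name `robots_meta_for_profile` and the `item_count` parameter are hypothetical and not taken from the site in question; the sketch only assumes that a profile page knows how many items its user has created.

```python
def robots_meta_for_profile(item_count: int) -> str:
    """Return a robots meta tag for a profile page, or an empty string.

    Hypothetical helper: empty profiles get "noindex, follow" so the page
    stays out of the index while links on it can still be followed.
    """
    if item_count > 0:
        # Pages with content need no tag; being indexable is the default.
        return ""
    return '<meta name="robots" content="noindex, follow">'


# An empty profile gets the tag, a populated one does not.
print(robots_meta_for_profile(0))
print(repr(robots_meta_for_profile(3)))
```

The tag would be emitted inside the page's head section; how that is wired into a template depends entirely on the platform.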

falsepositive

5:18 am on Mar 16, 2011 (gmt 0)

Good question. Over the last few days, I've used noindex, follow on several pages I don't want indexed also, such as category and archive pages I'd like to refurbish first before indexing. I expect that the rest of my site will be indexed since the hub pages (categories, etc) will be followed even if they are not indexed.

realmaverick

5:56 am on Mar 16, 2011 (gmt 0)

Yes that's the theory. I'm concerned that I've been making such huge changes on a several year old domain with several million indexed pages. I'm pretty sure the changes I've been making are for the better. But because of the grand scale, I'm having doubts.

tedster

6:54 am on Mar 16, 2011 (gmt 0)

I want them deindexed and claim back the link juice and crawl allowance.

Deindexed - Yes
That's the exact function of the noindex meta tag.

Link juice - No
Link juice still accumulates for the URL as long as there is a link pointing to it. Even if you make the link nofollow, the other links do not reclaim any share of link juice whatsoever.

Crawl allowance - Just a bit
In order to even see the noindex meta tag, googlebot must crawl the page. It may crawl less frequently after it verifies the noindex a few times, but it must continue to crawl.

I doubt there will be any downside to this action whatsoever, unless for some reason those pages were getting significant search impressions or traffic. And from what Google has been saying, there may well be a positive effect if all these profiles were somehow tagged as low quality and that bled over into the rest of the site.

If that did happen for such a standard thing as a profile page, then this algo has more trouble than I thought.

realmaverick

10:06 am on Mar 16, 2011 (gmt 0)

Tedster, that's true regarding the fact G will still have to crawl the pages. I wonder if I can utilise robots.txt a little here.

It makes sense that this change would be a positive one. I get a little anxious when playing with architecture on a big established site.

However I feel much better after reading your advice. Thanks so much as always. It always amazes me at the quality of the replies and seeds of wisdom you leave members on a daily basis.

aakk9999

1:46 pm on Mar 16, 2011 (gmt 0)

Crawl allowance - Just a bit
In order to even see the noindex meta tag, googlebot must crawl the page. It may crawl less frequently after it verifies the noindex a few times, but it must continue to crawl.

What would be the best way to recover crawling allowance? Is it robots.txt exclusion, or perhaps returning 301 redirect or perhaps returning 410 Gone (if appropriate)?

Let's say that there was a mistake where a large number of dynamic URLs have been exposed to Google (e.g. a development script error). This is subsequently fixed; however, Google knows about these URLs and will be requesting them periodically. What would be the best way to "recover crawl budget"?

Where I am heading to is that on this forum we mostly talk about "crawling / indexing / ranking" but I see another step, which is "URLs TO DO" list somewhere.

So the accidental leakage of dynamic URLs may substantially increase this "URLs TODO" list for the site. Would G. ever drop URLs from "TODO" list once it knows about it, regardless whether it actually requests the page, crawls the page or not?

The way I see it is:

You may have URLs on that "TODO" list that

a) should not be even requested (because of robots.txt exclusion) and will not be crawled

b) Will be requested, but will not be crawled (e.g. the response is 404, 301, 302, 410, 5xx etc), but it seems these will remain on "TODO" list for later.

I am wondering whether the size of this "TODO" list also affects the depth and frequency of crawling important pages and, if so, how to reduce this list? Or, in the case of mistakenly flooding the site with URLs that are subsequently removed/redirected, how to tell Google: stop checking these, do not waste your time? Or perhaps it does not matter at all how big this "TODO" list is?

For instance, providing there are no links (internal or external) to a page that returns 410, will G. drop that URL for good from "TODO" list? Similarly, would 404 be dropped if no links to that page (even though it may take longer to be dropped). With regards to 301 - perhaps 301 will be requested less and less often the longer it keeps returning 301 response? Or even less frequent if no links pointing to URL that responds with 301? Any chances for this URL to be completely dropped from G. "TODO" list?

Would be interested in any thoughts on all this.

indyank

3:04 pm on Mar 16, 2011 (gmt 0)

aakk9999, the URLs on the to-do list will drop off if there are no internal or external links to them. The dynamic URLs that accidentally got indexed are a good example.

For those pages that return 404 or 410, I don't see Google crawling them, as they don't exist anymore. So they too wouldn't remain on the to-do list unless there are links pointing to them.

But noindex pages are a different case. Google needs to crawl them to see the noindex meta tag.

You can instead block them via robots.txt. But the real issue with blocking them via robots.txt is that whatever incoming links those pages had accumulated would be lost. At least that is my understanding. Not sure what others feel about this. Would love to know.

So I do prefer the noindex, follow meta tag, as whatever juice these pages accumulate can still be spread to the URLs followed on those pages. You can at the same time work on improving them and then eventually release them into Google's index.

aakk9999

3:19 pm on Mar 16, 2011 (gmt 0)

But noindex pages are a different case. Google needs to crawl them to see the noindex meta tag.

I know that, and I know this will still impact crawl budget (i.e. not recover it significantly, as per Tedster's reply above).

My question was about what I called the "URLs TODO" list.

For those pages that return 404 or 410, I don't see Google crawling them, as they don't exist anymore. So they too wouldn't remain on the to-do list.

I am not sure the above is correct, because I am seeing some URLs that return 404 being requested for a very long time. It seems that if there is an external link to them, they will be re-requested "forever". Without any links pointing to them, I am not sure how long they would be kept in the "TODO" list.

I know they can be blocked by robots.txt. The questions are: will they drop off the "TODO" list if there are no links pointing to them, and if so, what is the best way: robots.txt, returning 404, returning 410 or returning 301?

I have to say it makes me a bit uncomfortable to see nnn URLs blocked by robots.txt in WMT!

indyank

3:24 pm on Mar 16, 2011 (gmt 0)

aakk9999, I edited my post above while you were probably replying to me.

I am seeing some URLs that return 404 being requested for a very long time

Yes, they will, if there are links pointing to them. Maybe not so frequently, as suggested by Tedster.

So one has to be very sure that a page neither gets search traffic nor has a good number of external links before deciding to return 404 or 410. If not, I would prefer a noindex, follow in such cases. You can then work on them.

TheMadScientist

3:32 pm on Mar 16, 2011 (gmt 0)

For those pages that return 404 or 410, I don't see google crawling them as they don't exist anymore.

Google operates a fairly compliant bot and 404 and 410 are two different things.

A 404 is a Not Found error, and is default server behavior, meaning the situation may be temporary or may be permanent, so the URLs will likely be requested for years into the future while serving a 404, especially if there are links to the page(s).

Example of Temporary: The FTP program that does your uploading deletes the file on your server and then saves your local copy there ... Your upload stalls out after the file is deleted ... Anyone who requests the file will get a 404 Not Found error until the page is re-uploaded ... A 404 does not in any way indicate 'removed' or 'permanent' or 'no longer exists'; it means exactly what it says: Not Found.

A 410 Gone does indicate Permanent and is NOT a default behavior. It MUST be intentionally set, so if a page is intentionally removed and will not be replaced it is the correct code to use to slow down GBot from crawling the page. If I remember correctly, when it was first introduced they treated it much like a 404 in terms of request frequency, but have since adjusted GBot to not request the page as often, even though they will still occasionally check to see if it is still 'Gone', because, yes, it's tough to believe, but webmasters do make mistakes, and sometimes just plain change their mind, so they don't want to 'write off' a URL, even if it's Gone. (They always double check, repeatedly.)

Anyway, I think the short answer to the above question is: 410 Gone if they are gone and you don't ever want them indexed.
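The distinction can be sketched in a few lines of Python. This is only an illustration: `status_for_missing_url` and the `intentionally_removed` set are hypothetical names, and a real site would make this decision in its server or framework configuration.

```python
from http import HTTPStatus


def status_for_missing_url(path: str, intentionally_removed: set) -> HTTPStatus:
    """Pick a status for a URL that has no content to serve.

    `intentionally_removed` is a hypothetical set of paths the site owner
    has deliberately retired for good.
    """
    if path in intentionally_removed:
        # 410 Gone has to be set on purpose: the resource was removed
        # and is not expected to come back.
        return HTTPStatus.GONE
    # 404 Not Found only says the resource could not be found right now.
    return HTTPStatus.NOT_FOUND


removed = {"/profile/olduser"}
print(status_for_missing_url("/profile/olduser", removed).value)
print(status_for_missing_url("/some-typo", removed).value)
```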

indyank

3:42 pm on Mar 16, 2011 (gmt 0)

TheMadScientist, I agree with you. But people have to be careful not to return a 410 for pages that have obtained a substantial number of external links.

They may have obtained those links, despite being thin content pages, as they give away the answers straight. But Google seems not to favor those pages any more because they are thin pages.

So, one has to think about working on them to make the story longer :)

TheMadScientist

3:49 pm on Mar 16, 2011 (gmt 0)

Yeah, I think we're talking about user profiles that were intentionally removed though ... I only skimmed the original post before now, so I agree with you in terms of 'a regular page' and would not recommend using Gone for those, but if the pages were removed for a reason, will not ever exist and they were (most likely only) linked from within the site, Gone would seem to be the right answer.

In reading the OP's situation in a bit more detail, Gone may or may not be correct. Personally, unless they want those pages indexed at some time in the future, I would probably use it ... Even if someone adds content to the pages, what value do they add to the results? I think they would be for 'on-site-use-only' if they do ever exist, and Google probably doesn't need to index those, so I'd personally be inclined to lean toward the use of Gone in this situation.

Good point on 'regular pages' and again I would not recommend using it for pages with significant in-bound links, I'd definitely find a way to redirect those and 'capture' the link weight if I could.

aakk9999

4:13 pm on Mar 16, 2011 (gmt 0)

I agree with both of you and this is pretty much what I am seeing.

But my question is whether the long "URLs TODO" list has impact on crawling budget? And if so, how to reduce it?

Or to give an example:

Let's say you made a mistake and exposed thousands of dynamic URLs to Google. Now, let's assume that these dynamic URLs have a friendly URL version, and that the mistake was so bad that there are 10-20 dynamic URLs that resolve to the same friendly URL (I am inventing a really bad case here!). Obviously, you never wanted G. to come across these dynamic URLs, but a mistake has been made and now you need to find the best way to fix it (here comes Shadow's saying in another thread, "You cannot uncook the egg..", but let's try to do the best we can here).

Let's assume these URLs have been exposed for a short time and have not gained external links; then you fixed the problem, and they are no longer linked from anywhere within the site. However, let's assume they were visible long enough for G. to find them and put them in its "URLs TODO" list for crawling.

If such mistakenly exposed URLs resolve, then G. will want to crawl them, therefore reducing your crawling budget.

You can:

a) noindex, follow (will be crawled)
b) noindex, nofollow (will be crawled)
c) set canonical to friendly (will be crawled)
d) stop them via robots.txt (will not be crawled, but you will end up with a long list of URLs stopped by robots in WMT)
e) set up 301 redirect to its friendly version
f) return 404 (not recommended)
g) return 410 Gone

In the cases of a, b, c, they will definitely stay in the "TODO" list and will impact crawl budget.

In cases of d, e, f, g these will not be crawled. However, I am wondering:
1) whether a large "URLs TODO" list impacts crawling or impacts the site negatively in any way, even if the URLs are stopped from being crawled?
2) whether in any of these cases G. will drop them from the "URLs TODO" list eventually?

indyank

4:22 pm on Mar 16, 2011 (gmt 0)

aakk9999,

The same thing did happen to our WordPress site because of using the Simple Tags plugin. The new WordPress release had somehow broken the plugin, which in turn broke our category navigation, and we didn't even notice it until March 3. A big mistake on our part for not testing the changes thoroughly. All the category pages resolved to the home page, creating some sort of infinite URL space and duplicate content.

we don't index those category pages but we do follow the links in them.

since this happened at the same time as the panda update, it made our situation somewhat complex.

aakk9999

4:36 pm on Mar 16, 2011 (gmt 0)

Indyank, unfortunately exposing URLs that should not have been exposed is a fairly common mistake. I think that when something like this happens, there is one big difference that impacts how these could potentially be handled, and that is whether the URL should still be exposed to the user or not.

If such a URL should be exposed to the website visitor, then I think that noindex,follow is the best approach.

But if these URLs should not be exposed to the website visitor either, then I think that d, e, f, g from my post above are better.

What I am wondering though is which of d, e, f, g (see my post above) will bring me the closest to the point of "uncooking the egg"!

indyank

4:45 pm on Mar 16, 2011 (gmt 0)

If we were to listen to Google employees, John Mu has already given a good response to your questions.

[google.com...]

pageoneresults

4:46 pm on Mar 16, 2011 (gmt 0)

I hate using these tags, just as I hated using nofollow way back when. But sometimes, it seems necessary. Is there any downside to what I've done with the noindex, follow? Is Google likely to give a crap that I've just told it not to index half a million pages?

I strongly advocate the use of noindex, or nofollow, or noindex, nofollow. The follow directive is the default behavior and is not part of the protocol. You use the metadata to prevent something from happening, not to allow it to happen.
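To illustrate the point about defaults, here is a rough Python sketch. It assumes the commonly documented behaviour that a page is indexable and its links followable unless a directive says otherwise, and that `none` is shorthand for noindex, nofollow. The function name `effective_directives` is made up for this example.

```python
def effective_directives(content: str) -> dict:
    """Reduce a robots meta content string to what it actually restricts.

    Assumption: "index" and "follow" are defaults, so only the negative
    tokens ("noindex", "nofollow", "none") change the outcome.
    """
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    return {
        "index": "noindex" not in tokens and "none" not in tokens,
        "follow": "nofollow" not in tokens and "none" not in tokens,
    }


# Under these assumptions, "noindex, follow" and plain "noindex" are equivalent.
print(effective_directives("noindex, follow") == effective_directives("noindex"))
```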

In the case of aakk9999, I would think you'd want to 301 the majority of this back to an upper level where it belongs? Like to the point where the bot SHOULD have stopped? This way you are able to maintain some of the equity but I wouldn't think there is much there if the indexing was brief. My educated guess would be that you'll just go through another recalculation process after a new indexing reveals the updated directives.

There also might be further delays in the short term while whatever trickle down effects are worked out. Technical glitches can wreak pure havoc on indexing and crawling. In fact, I've seen Googlebot do its best to warn site owners that there is a problem (via GWT) and then boom, their pages start to disappear. In the mean time, GWT shows the bot activity and the early warning signs but no one knew what they were seeing. You have to pay close attention to all this stuff. ;)

301 for content that has an equivalent replacement.
410 for content that is Gone.

Those are really your only two options. Google will treat that 410 like a URI Removal Request and it will be Gone shortly after the bot receives instructions. It is rather quick. In your case, you really need to instruct that bot to forget about the previous indexing directives and to permanently change those instructions. That means a 301 for most of what you have going on.
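As a rough Python sketch of that two-way choice: the `REPLACEMENTS` mapping, the example paths and the function name `respond_to_retired_url` are all invented for illustration, and the sketch assumes it is only ever called for URLs that have already been retired.

```python
from http import HTTPStatus
from typing import Optional, Tuple

# Hypothetical mapping from leaked dynamic URLs to their friendly equivalents.
REPLACEMENTS = {
    "/index.php?page=widgets&id=7": "/widgets/7/",
}


def respond_to_retired_url(path: str) -> Tuple[HTTPStatus, Optional[str]]:
    """Return (status, redirect target) for a URL that has been retired."""
    target = REPLACEMENTS.get(path)
    if target is not None:
        # An equivalent replacement exists: redirect permanently to it.
        return HTTPStatus.MOVED_PERMANENTLY, target
    # Nothing replaces this URL: report it as Gone.
    return HTTPStatus.GONE, None


print(respond_to_retired_url("/index.php?page=widgets&id=7"))
print(respond_to_retired_url("/index.php?page=junk"))
```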

Simsi

7:58 pm on Mar 16, 2011 (gmt 0)

Fwiw, in my experience, the use of the NOINDEX META tag doesn't actually stop the pages being indexed, they just won't show in the first set of results shown to a searcher. They will however often appear in the "repeat the search with the omitted results included" results.

pageoneresults

8:40 pm on Mar 16, 2011 (gmt 0)

Fwiw, in my experience, the use of the NOINDEX META tag doesn't actually stop the pages being indexed, they just won't show in the first set of results shown to a searcher. They will however often appear in the "repeat the search with the omitted results included" results.

Hmmm, I've not seen that behavior. I just double checked too because you never know when Google will break protocol. But, they haven't in this instance. Using noindex will keep that document out of their index, no matter what you use as a query to try and find it. Or at least that has been my experience up until about 5 minutes ago. It does just as it says on the tin. ;)

Show me a search that displays/reveals a document with a noindex directive and I'll change my thinking on this.

Added: This is what I get when attempting to locate documents that I have which contain a noindex directive...

Your search - site:example.com/noindexed-document - did not match any documents.

Suggestions:

Make sure all words are spelled correctly.
Try different keywords.
Try more general keywords.

It has been like that for as long as I can remember.

aakk9999

3:43 am on Mar 17, 2011 (gmt 0)

Thanks P1R and Indyank for your opinions and reasoning. I usually use noindex if the page is intended for the visitor to see but I do not want it indexed (e.g. product category pagination pages past page 1), and 301 or 410 in the case of disaster recovery on leaked unwanted URLs (depending on circumstances), but there was this nagging doubt about whether my choice was the best one.

Sgt_Kickaxe

5:17 am on Mar 17, 2011 (gmt 0)

If you have these pages to begin with it stands to reason that Google will find them and that the pages which link to them will pass value to them. A three part remedy is in order to minimize the effects.

1) Minimize the number of such pages, if you cannot add value to them somehow.
2) Minimize the number of links to these pages.
3) Block flow as high up the chain as you can. If, for example, the only way to reach these pages is through a "user tools" page you could noindex/nofollow etc that page as well.

Another trick to ensuring that these pages have a low value and don't hog rank is to place links on them back to all the important places of your site.

Simsi

5:44 pm on Mar 17, 2011 (gmt 0)

Fwiw, in my experience, the use of the NOINDEX META tag doesn't actually stop the pages being indexed, they just won't show in the first set of results shown to a searcher. They will however often appear in the "repeat the search with the omitted results included" results.

Pageone has kindly pointed out that what I am seeing is a result of me disallowing the 'offending' page in the robots.txt file. I must admit I had previously thought that robots.txt stopped Google crawling the 'disallowed' pages.

pageoneresults

5:53 pm on Mar 17, 2011 (gmt 0)

I must admit I had previously thought that robots.txt stopped Google crawling the 'disallowed' pages.

It does. What you are seeing is the result of a robots.txt entry discovery by Google. It is a URI only entry and is usually invoked through specific search queries such as site:example.com. This is one of the reasons why I feel using robots.txt is not the best option to prevent URIs from getting indexed. I say URIs because in this instance, that is all it is, a URI only entry. I've seen sites with tens of thousands of them when performing site: searches.

The document in this example contained the noindex, nofollow directive. Unfortunately it was Disallowed via robots.txt. Anything Disallowed via robots.txt will override that which resides at the page level. In this case, the robots.txt entry needs to be removed so Googlebot can get to the page in question and see the noindex, nofollow directive.

TheMadScientist

5:57 pm on Mar 17, 2011 (gmt 0)

I must admit I had previously thought that robots.txt stopped Google crawling the 'disallowed' pages.

It does ... It stops them from crawling the page, so if you use a noindex directive on the page they don't ever know it's there ... Robots.txt does not remove pages from the index.

Robots.txt exclusion and noindex are two totally different things and mutually exclusive ... When you disallow in robots.txt Google DOES NOT (contrary to popular belief) crawl the pages, which means they do not know what is on the page, or whether the page contains a noindex directive or not, so they use external information, such as links and link text to try to determine the topic of the page and generally include the page(s) in the index, which is especially noticeable when conducting a site: search.

NoIndex is the only directive which tells them to not index the page, but it cannot be used for disallowed pages, because when a page is disallowed in robots.txt they follow the instructions and Do Not crawl the page to see the noindex directive.

You can only use one or the other effectively, and if you try to use both the robots.txt disallow will take precedence and the page will often be included in the index, usually as URL only.
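One way to check for this conflict is to test whether a page carrying a noindex tag is also disallowed in robots.txt. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt rules, the example.com URLs and the function name `noindex_is_visible` are made up for illustration, and the standard-library parser only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for the site being checked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def noindex_is_visible(url: str, robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler may fetch `url`.

    If this returns False, a noindex tag on the page can never be read,
    because the crawler is not allowed to request the page at all.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The disallowed page hides its noindex tag; the other page does not.
print(noindex_is_visible("https://example.com/private/page.html", ROBOTS_TXT))
print(noindex_is_visible("https://example.com/public/page.html", ROBOTS_TXT))
```

If the first check comes back False for a page you want out of the index, the Disallow rule would need to be lifted so the noindex directive can be seen.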

TheMadScientist

6:00 pm on Mar 17, 2011 (gmt 0)

LMAO ... I think we agree on this one P1R ... How often do two members post at the same time, quote exactly the same text and start a reply with exactly the same response? LOL

pageoneresults

6:06 pm on Mar 17, 2011 (gmt 0)

How often do two members post at the same time, quote exactly the same text and start a reply with exactly the same response?

You know the old saying... "Great minds think alike."

We've been hanging around each other too long. :)

Also, there is only one right answer for this.

TheMadScientist

6:12 pm on Mar 17, 2011 (gmt 0)

Also, there is only one right answer for this. :)

True enough, and we may not always agree when there are nuances or multiple approaches to a solution, but I think the discussions we have on occasion probably provoke some thought and possibly new insight for people, so I think they're a good thing. I know you always make me think a bit, and contrary to popular belief I'm not quite as hard-headed and un-teachable as I may come across.

Okay, well maybe I really am that hard headed, but I do still learn, generally on a daily basis. ;)

Simsi

7:26 pm on Mar 17, 2011 (gmt 0)

Thanks Guys ... but I'm confused LOL:

When you disallow in robots.txt Google DOES NOT (contrary to popular belief) crawl the pages, which means they do not know what is on the page, or whether the page contains a noindex directive or not, so they use external information

If robots.txt stopped Google crawling the page, how come it is in Google's index with a proper page title? That's the bit I don't follow.

TheMadScientist

7:45 pm on Mar 17, 2011 (gmt 0)

Most likely because the title text appears in a link to the page somewhere.

When they can't crawl a page they take whatever resources they can find and try to figure out what it's about. Example: If there are 10 links to your page across the web and 7 are unique, but 3 are your title, the most likely text for the page is the title of the page ... If there's only one link and that happens to be the title of the page, then that's probably the title they're going to use for it.

aakk9999

8:10 pm on Mar 17, 2011 (gmt 0)

If robots.txt stopped Google crawling the page, how come it is in Google's index with a proper page title? That's the bit I don't follow.

I have also seen this and do suspect that on occasion pages are fetched.

At one point last year my WMT data showed so many "Duplicate titles" for the pages stopped by robots.txt that it would be impossible that someone had linked to each of them with this title.

Anyway, there was a discussion about this in the past [webmasterworld.com ]

I also saw this question being asked on Google Webmaster's Forum, but cannot find a link to this right now.

Would you expect that a URI stopped via robots.txt is not even requested by the bot? E.g., why should it be requested if access to it by bots is stopped via robots.txt?
