| 1:07 pm on Nov 8, 2012 (gmt 0)|
How about pages with query parameters? Mobile switches? Print/email versions? Check for www vs. no-www. Check for trailing slash vs. no trailing slash: one should return 200, the other 301.
My approach has been to get rid of all that, or replace it with a solution that doesn't add to the number of pages Googlebot has to crawl.
Currently I have 7,000 pages not selected. I'm trying to get it under 1,000, which is approximately how many pages should be indexed. So far it's been dropping very slowly, only about 100 pages per week over the last 8 weeks.
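To audit those duplicate-generating variants, here's a rough sketch of the canonicalization logic the redirects should implement. The no-www choice, trailing-slash rule, and parameter list below are my assumptions, not a recommendation; the point is only that every variant must map to exactly one canonical URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: no-www host, trailing slash on directory-style
# paths, and tracking/duplicate query parameters stripped. Adjust to
# your own site -- what matters is picking ONE form and 301ing the rest.
STRIP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "replytocom", "print"}

def canonical(url: str) -> str:
    scheme, host, path, query, _frag = urlsplit(url)
    if host.startswith("www."):
        host = host[4:]          # choose no-www (arbitrary; just be consistent)
    if not path:
        path = "/"
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"              # add trailing slash to directory-style URLs
    kept = "&".join(p for p in query.split("&")
                    if p and p.split("=", 1)[0] not in STRIP_PARAMS)
    return urlunsplit((scheme, host, path, kept, ""))
```

Any URL this function changes is one Googlebot can reach in at least two forms, i.e. a candidate for a 301 or a rel=canonical.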
| 1:25 pm on Nov 8, 2012 (gmt 0)|
Lovely. Will try this and let you know. Thanks!
| 1:47 pm on Nov 8, 2012 (gmt 0)|
I have 790,000 pages "Not Selected" and only 700 pages indexed. WTF?
| 2:54 pm on Nov 8, 2012 (gmt 0)|
Do all of you with a high ratio of "not selected" vs "indexed" have Panda problems? Can someone with a high ratio jump in here and say "I don't have a Panda problem and my ratio is really high"?
The biggest problem with the "not selected" count is that if Google would just show us some of those URLs, I'm sure the fix would be easy. But since they don't, I've spent months trying to reduce this ratio, looking in hidden corners of my site, URL parameters, etc., and found nothing. I continue to spend way too much time doing so, and Google could help so much on this front.
| 3:32 pm on Nov 8, 2012 (gmt 0)|
Not sure what you consider a high ratio, but I have a site with 13,000 not selected and 3,000 indexed, and no Panda problems.
| 5:11 pm on Nov 8, 2012 (gmt 0)|
I agree. That would be extremely helpful.
Looking at the other thread as well, I think we might be on to something when it comes to "not selected" and Panda.
From the image above, you can see where the "not selected" growth started. Right after that, I was hit by the early June Panda update.
My idea is that if I can get the green line back down to normal, as it was in May, then I recover.
| 12:11 am on Nov 9, 2012 (gmt 0)|
It's not only about canonicalization. It's also about duplicate content. Any pages that show in the "similar pages" link on page 100 go into the not selected index. At least, that's what I've noticed for my site. As well, it includes all redirects. So if you redirect from www to no-www then it counts as 2 pages. Same thing with any redirect. I'm not exactly sure how much of an effect it has on anything unless you have huge similar pages problems.
If you want to try it (which I don't think you do), simply take off your redirects and the number will go down.
However, I'd be more aware of the "total indexed", especially if it is higher than the total number of pages you have. That number seems the hardest to bring down.
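For the www-to-no-www case mentioned above, a minimal .htaccess sketch (assuming Apache with mod_rewrite, and assuming you pick no-www as canonical; the reverse rule works just as well if you prefer www):

```apache
# 301 every www.example.com request to example.com, so only one
# hostname ever returns 200. As discussed above, the redirect itself
# will show up under "not selected" -- that's expected and harmless.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^ http://%1%{REQUEST_URI} [R=301,L]
```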
| 1:55 am on Nov 9, 2012 (gmt 0)|
My ratio is pretty high. I have 10x more "not selected" than selected. I wasn't affected by Panda, but with Penguin I lost a lot of traffic. I have a lot of duplicate content since I use an open-source CMS and articles/comments/etc. can be viewed a number of different ways, but the articles do have canonical links.
| 2:40 am on Nov 9, 2012 (gmt 0)|
The "not selected" list doesn't refer to your current pages. It's the difference between everything they've ever crawled, since the dawn of time, and your currently indexed pages.
The Official Docs specifically say that "not selected" includes redirects. So if you're redirecting anything non-canonical to canonical-- which is exactly what you should be doing-- that would make the list shoot through the ceiling right there.
So the list will never get shorter. Just ignore it-- deselect, haha, the option of viewing it at all --and make sure your current pages are all in order.
| 3:09 am on Nov 9, 2012 (gmt 0)|
The more I think about it, I think Lucy is pretty accurate... although the "not selected" number can decline. But at the same time my "not selected" number was rising, I was also working to clean up a bunch of bad URLs in the index. So as my indexed pages declined by design (through 301s, 404s, etc.), I guess those moved to the "not selected" column, causing an increase.
If this is the case, the term "not selected" is a poor choice of words, as it implies something of greater consequence. A little more transparency in this area would go a long way too.
| 3:33 am on Nov 9, 2012 (gmt 0)|
So has anyone seen major changes in panda/penguin after total index went up or down (in a major way)?
I've constantly had ranking changes due to that. Anyone else?
| 1:32 pm on Nov 9, 2012 (gmt 0)|
|However, I'd be more aware of the "total indexed", especially if it is higher than the total number of pages you have. That number seems the hardest to bring down. |
On one of my sites, my submitted sitemap has 38 URLs but Webmaster Tools shows the "total indexed" as 41. Those three extra indexed pages are on the server, but from the beginning I've always blocked them from being crawled in robots.txt and omitted them from the submitted sitemap, because I didn't want them crawled or indexed. But it seems that Google has indexed them anyway. I just did a site:domain check and Google shows them in the results, but says
"A description for this result is not available because of this site's robots.txt – learn more."
So despite my efforts to prevent these pages from being indexed, Google indexed them anyway. I think it's because people have pointed some external backlinks to them from other sites.
Edit P.S. I forgot to say that those three extra indexed pages also have noindex metatags in the header, but since they are blocked from being crawled, Google can't see the noindex tags.
| 4:43 pm on Nov 9, 2012 (gmt 0)|
I feel the same. It took me a year to clean up my Panda-infested site. But I have come to realize that the cause is that my noindex meta tags and robots.txt are not being taken into consideration when they evaluate for Panda.
Everything else is super clean and metrics have increased greatly. Still Pandalised though.
| 5:14 pm on Nov 9, 2012 (gmt 0)|
Please see the very thorough thread:
Pages are indexed even after blocking in robots.txt [webmasterworld.com]
| 5:16 pm on Nov 9, 2012 (gmt 0)|
|noindex meta tag and robots.txt|
You can use one or the other, not both. Again, see thread referenced above.
| 6:02 pm on Nov 9, 2012 (gmt 0)|
I understand all of that. But in this case I intentionally used both robots.txt and the noindex tag on purpose. I used robots.txt to block googlebot because I don't want it to even see what's on these pages at all (they have some content that duplicates some content on other pages.). But I also used the noindex tag as extra insurance just in case the robots.txt file somehow accidentally got deleted or was corrupted.
| 6:20 pm on Nov 9, 2012 (gmt 0)|
I don't think my site was affected by Panda, but from discussions last year, I do think it's possible that the Google algorithm takes content on noindexed pages into account in its overall evaluation of a site's quality.
| 10:09 pm on Nov 9, 2012 (gmt 0)|
|I also used the noindex tag as extra insurance just in case the robots.txt file somehow accidentally got deleted or was corrupted |
You would have been better off if it had been deleted, because g### would have seen the noindex tag. The point of the long discussion in shaddows's linked thread is that blocking a page in robots.txt does not prevent it from being indexed. It only prevents its content from being displayed.
Read the thread and you will see you are not the only one who had trouble wrapping their brain around this idea ;)
robots.txt + noindex is NOT the same as belt + suspenders
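For anyone following along, the sequence that actually deindexes a blocked-but-indexed URL, sketched here for a hypothetical page: first remove that URL's Disallow line from robots.txt so Googlebot can fetch it, then let the page itself carry:

```html
<!-- Googlebot can only obey this tag if robots.txt does NOT block
     the URL -- it has to fetch the page to see the tag at all. -->
<meta name="robots" content="noindex">
```

Leaving the tag in place (and the URL crawlable) is what keeps the page out of the index permanently, even with external links pointing at it.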
| 11:04 pm on Nov 9, 2012 (gmt 0)|
I don't want to get into any more discussion about this. I know exactly what I'm doing. Because I blocked googlebot with robots.txt, it's never crawled the page. Therefore, it's never seen the noindex tag. If in the unlikely event the robots.txt is accidentally deleted in the future, then googlebot will crawl the page and see the noindex tag. That could happen in the future, though unlikely, but even if it does, the noindex tag will still prevent the page from being indexed even then. That's what I meant by insurance.
| 12:59 am on Nov 10, 2012 (gmt 0)|
@aristotle, but you see, you said previously that the pages are still getting indexed. If you remove them from robots.txt and leave the noindex, they will drop out of the index and will no longer show "A description for this result is not available because of this site's robots.txt – learn more."
| 1:21 am on Nov 10, 2012 (gmt 0)|
Yes I know they are indexed, probably because of external links that other people pointed at them. But that doesn't matter because they still haven't been crawled, and so the google algorithm doesn't know their content. That's my basic goal -- to prevent the Google algorithm from knowing their content, so at this point I've achieved my purpose.
If in the future, they do get crawled, then the noindex tag will be found. So in that case, the google algorithm will learn their content, but hopefully the noindex tag will prevent a duplication penalty. (Parts of these pages contain the same content as parts of other pages on the site). So that's the reason for the noindex tag.
| 2:01 am on Nov 10, 2012 (gmt 0)|
|and so the google algorithm doesn't know their content. That's my basic goal -- to prevent the Google algorithm from knowing their content |
That's a whole nother thread there. And it's a thread I don't remember seeing. Insert boilerplate about waning memory.
If a page is marked noindex, does any area of google -- including the right hand that notoriously doesn't know what the left hand is doing -- do anything with its content? Other than follow its links, which would be a separate tag.
| 4:02 am on Nov 10, 2012 (gmt 0)|
If your site uses extensive redirects, as many affiliate sites do, the number of "not selected" urls can be high naturally. Robots.txt files do nothing to help that and a noindex meta tag on the redirect page is of no help since it won't be read.
I'd be interested to hear how you redirect your links to an affiliate site if GWT is NOT reporting a high number of unchosen URLs.
Otherwise I wouldn't worry about it.
| 4:29 am on Nov 10, 2012 (gmt 0)|
Another big contributor to not selected:
If you run WordPress and have a lot of comments with replies enabled, that will generate a lot of URLs with the "replytocom" parameter. For example, if you have a post with 50 comments, that creates 50 unique URLs that Google will crawl and assign as "not selected".
| 10:53 pm on Nov 11, 2012 (gmt 0)|
^^^ You should either block pages with the replytocom parameter or put noindex on them; otherwise you are creating lots of thin pages.
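One way to put noindex on those replytocom variants without touching the theme, sketched for Apache 2.4 (the `<If>` directive and mod_headers are assumed available; this is one approach, not the only one):

```apache
# Serve an X-Robots-Tag header on any URL whose query string carries
# a replytocom parameter; the clean post URL itself is unaffected.
<If "%{QUERY_STRING} =~ /(^|&)replytocom=/">
    Header set X-Robots-Tag "noindex, follow"
</If>
```

Unlike a robots.txt Disallow, this still lets Googlebot crawl the variant and see the noindex, which, per the discussion above, is what actually keeps it out of the index.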