|Help with "Not Selected" please?|
Hi everyone, long time lurker here, just signed up. This is a great community, congratulations to everyone involved.
I am the webmaster for a 10-year-old website; it has approximately 250,000 unique pages but over 1.2 million "not selected" URLs.
Can you please point me to how to discover what these URLs are? Why doesn't GWT tell us which URLs they are? I would like to fix them to improve Googlebot's crawl of the site; it would be helpful if Google let us know what the URLs were! Can you experts give me any hints on how to find out which URLs are "not selected" so we can fix them?
This site is 100% in-house, no PHP forums or WordPress installations. There is a blog, but it's hosted outside the main domain, so it's probably not the source.
Thanks again and best wishes to all.
welcome to WebmasterWorld, johnsirella!
you could get a list of urls crawled by googlebot from your server access log file and compare that to urls reported in GWT.
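if it helps, here's a rough sketch of pulling googlebot's URLs out of a combined-format access log (the log format, file contents and regex are just illustrative - adjust to whatever your server actually writes):

```python
# Rough sketch: collect the unique URLs Googlebot requested,
# assuming a combined log format where the user agent is logged.
import re

GOOGLEBOT_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+".*Googlebot')

def googlebot_urls(log_lines):
    """Return the set of unique paths requested by Googlebot."""
    urls = set()
    for line in log_lines:
        match = GOOGLEBOT_LINE.search(line)
        if match:
            urls.add(match.group(1))
    return urls

# Two made-up log lines: one Googlebot hit, one ordinary visitor.
sample = [
    '1.2.3.4 - - [03/Dec/2012:10:00:00 +0000] "GET /page-a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '5.6.7.8 - - [03/Dec/2012:10:00:01 +0000] "GET /page-b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(sorted(googlebot_urls(sample)))  # only /page-a
```

once you have that set, comparing it against your known-good URL list shows you what googlebot is finding that you didn't intend to publish.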
You could use something like Xenu and see what it uncovers...
Thanks so much for your replies.
phranque: thanks for the welcome, glad to be here!
You mention URLs reported in GWT; where do I look for that? I can certainly get a report of the URLs Googlebot hits from the logs, but how do I cross-reference that with GWT data?
lexipixel: thanks for that, I had never heard of it. Is it safe to run against live sites?
you can get some urls from the Traffic/Search Queries/Top Pages list but i'm not sure what the upper limit is on that - i doubt they will show you 250,000 urls.
are you using sitemaps?
it's really a matter of getting a list of "good" or canonical urls and comparing those to what googlebot is crawling.
Xenu Link Sleuth is a good tool for crawling your site.
you might also try Screaming Frog but the free version will only crawl a limited number of urls.
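the "compare canonical urls to crawled urls" step is just set arithmetic once you have both lists - here's a rough sketch, assuming you've exported your canonical URLs (e.g. from your sitemap) and the crawled URLs (e.g. from your logs) as simple lists (the example URLs are made up):

```python
# Rough sketch: split crawled URLs into known-good ones and unexpected extras.
def compare_url_sets(good, crawled):
    """Compare a canonical URL list against what was actually crawled."""
    good_set, crawled_set = set(good), set(crawled)
    return {
        "expected": crawled_set & good_set,    # crawled and canonical
        "unexpected": crawled_set - good_set,  # crawled but not in your list
        "uncrawled": good_set - crawled_set,   # canonical but never fetched
    }

result = compare_url_sets(
    ["/a", "/b", "/c"],                 # canonical URLs
    ["/a", "/b", "/b?session=123"],     # URLs Googlebot fetched
)
print(result["unexpected"])  # the URLs worth investigating
```

the "unexpected" bucket is where your mystery "not selected" urls will usually show up.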
Hi everyone. Still wrestling with the "Not Selected" dilemma here.
Following the advice I got here, I am studying the logs to see exactly what Googlebot is pulling from the server, to try to identify the source of the 1.2 million "not selected" pages (on a site with 250k to 300k unique pages at most). Still no success, but I'm not giving up yet.
Here's a question for those of you with more experience: do the "Not Selected" URLs ever decline, or is that a cumulative number that only grows?
If I fix whatever is wrong, will that graph decline, or will it always show the maximum "not selected" ever reached? It seems that no matter what I do, a few thousand URLs are added to it every week. Sorry if this is a dumb question.
Thanks in advance for your wisdom.
[edited by: tedster at 5:33 pm (utc) on Dec 3, 2012]
|do the Not Selected URL's ever decline, or is that a cumulative number that only grows? |
The only way they should decline is if the URLs get indexed.
I would very definitely not worry about them, except that the constant increase may indicate some URLs are being generated that shouldn't be, so I might try to find the source of that. The "not selected" number itself is not something I would lose any sleep over at all, personally.
I have a WordPress install that somehow generated tons of automatically generated URLs that were "not selected" by Google. I still cannot find the source of the bug, but just this past weekend I added the string those generated URLs share to robots.txt as a disallow. Now those pages appear with "A description for this result is not available because of this site's robots.txt".
The gibberish URL is something like ?gibberish/page2/page3/page2, and it goes on and on. For some reason my WordPress install recognizes it as a valid URL, complete with a robots tag of index and everything, although the pages are "not selected" by Google (supplemental index) because they have exactly the same content as my archive pages. Removing the generated pages themselves is beyond my capability.
It will take a while to see whether Google recognizes the fix and removes those URLs accordingly; I just hope it does. I will report back if my "not selected" count goes down in the future, or at least stops rising.
I do think you may have to worry if the "not selected" count continues to rise; it may be a bug in your code that is generating gibberish URLs and feeding them to Google.
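In case it helps anyone else, a disallow rule for that kind of pattern looks roughly like this (the gibberish string here is just a stand-in for whatever your generated URLs share; Googlebot supports the * wildcard in robots.txt, though not every crawler does):

```
User-agent: *
Disallow: /*?gibberish
```

This blocks any URL whose query string starts with that pattern from being crawled, which is why Google shows the "description not available" message instead.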
A couple weeks ago, we removed a big chunk of URLs from the site. Since the removal, the number of "Not Selected" URLs has plummeted from about 15,000,000 to about 1,000,000. I'm not certain what it was on those pages that was causing the issue, but I know those pages were the source.
We have not seen any change in our organic traffic since, but this change was made the day of the last Panda refresh. Fingers crossed.
I got a bit panicky about this "not selected" issue. I heard of it, dug around a bit, and ultimately consider it worthless information for troubleshooting possible issues. At first I was hopeful it would provide insight into possible Panda etc. issues, but at the end of the day I consider it a red herring. If your head hurts, you have a headache; if you tell me you have a headache, I can tell you your head hurts. That's not helpful information I just gave you. That's my view of "not selected". If I missed something during my investigation, I'm all ears. My experience with this just says it's not something I'm analyzing further, or even checking, for that matter. Again, just one geek's opinion, for what it's worth (or not).
My situation was similar to frankleeceo's, except that it was not a WordPress blog but a static HTML site. Somehow Google and other bots created 1,000,000 rubbish URLs out of my 1,600 pages, and my server recognized them as valid. My "not selected" line spiked and my rankings dropped; I wouldn't say they tanked. Once the .htaccess file was reconfigured to make the server return 410s, the "not selected" line stopped rising. Now it's dropping by about 10 pages a week. Rankings have still not returned.
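For anyone wanting to do something similar, a mod_rewrite rule in .htaccess that returns a 410 for a junk URL pattern looks roughly like this (the junk/ prefix is only an example; match whatever your rubbish URLs have in common):

```apache
# Return "410 Gone" for any URL under the junk pattern.
# The [G] flag requires mod_rewrite to be enabled.
RewriteEngine On
RewriteRule ^junk/ - [G]
```

A 410 tells Google the page is gone for good, which tends to get URLs dropped faster than a 404.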
|The only way they should decline is if the URLs get indexed. |
And I think there are a few other ways for this number to decline. Here is my experience:
- If you return 404/410 for a "Not selected" page, the number will decline
- If a "Not selected" page gets indexed, the "Not selected" number will decline (as TMS said)
- If you redirect "Not selected" pages, it will NOT decline
- If you noindex a page that was previously indexed, the number will increase
- If you noindex a page that is "Not selected", the "Not selected" number will stay the same
- I am guessing that if the page is blocked by robots.txt, the "Not selected" number should decrease
- I am not sure what happens if the page has a canonical link element pointing to another page, but if it is treated the same as redirects, it will not have an impact on "Not selected"
I wish Google would break "Not Selected" into two buckets, because I would like to see a separate number for "pages not blocked in robots.txt, not redirecting, not noindexed, and without a canonical pointing elsewhere, which we have not selected simply because we do not like them". That would be a really useful figure.
Ah, I thought it was all URLs, but 404/410 & NoIndex/robots.txt might do it.
AFAIK redirecting and some of the others should not (I'm fairly certain redirecting will actually increase the number). I'd have to look into it some more to know for sure, but I really don't have time to spend on something I don't care much about right now, lol.
According to Google's article, "Not selected" includes 301 redirects, so redirecting will certainly not decrease the number, and I agree that in some cases it may increase it.
So theoretically, if you redirect one of the "not selected" URLs to an existing indexed URL, there should be no change to the "Not selected" number.
But if you introduce a new URL structure and then redirect a "not selected" URL to a new URL, and the new URL also ends up "Not selected", then the count would increase.
I am pretty certain about noindexed pages increasing the "Not selected" pot, as I saw this six months ago when I dropped 6,000 pages by noindexing them on one of my sites: the "indexed" count went down and "Not selected" went up in parallel, and the graphs were symmetrical.
I am also pretty certain about 404/410 reducing "Not selected", as I am in the process of sorting out a mess of 80,000 "Not selected" URLs on a site with 8,000 pages indexed but only 1,500 unique pages worth indexing (a huge amount of duplication owing to dates in URLs, capitalisation, parameter order, and other classic URL mistakes). We are redirecting only about 2,000 URLs and letting all the others go 404/410.
The new URL structure went live last week, and Google has already dropped 1K URLs from the "Not Selected" count.
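For anyone tackling similar duplication, here is a rough sketch of the kind of URL normalisation involved: lowercasing and sorting query parameters so duplicates collapse to one canonical form. It is illustrative only, not the exact fix used on the site above, and lowercasing paths is only safe if your server really treats them case-insensitively:

```python
# Rough sketch: normalise URLs so case and parameter-order duplicates
# collapse to a single canonical form.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalise(url):
    """Lowercase scheme/host/path, sort query parameters, drop fragments."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),  # only safe on case-insensitive servers
        query,
        "",                  # discard any #fragment
    ))

a = normalise("http://Example.com/News/Story?b=2&a=1")
b = normalise("http://example.com/news/story?a=1&b=2")
print(a == b)  # True: both collapse to the same canonical URL
```

Running every crawled URL through a function like this and counting the distinct results gives you a quick estimate of how much of the "Not selected" pile is pure duplication.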
|the page not blocked in robots, not redirecting, not noindexed, does not have canonical pointing elsewhere, but we have not selected it because we do not like it |
"blocked by robots.txt" is already a separate category. Everything else ... yah. At a minimum, there's a difference between pages that can't be indexed (noindex meta, redirect) and pages that could be indexed but aren't ("I dunno, there's just something about this page we don't like").
Anyone see any over optimization penalties in the past 2 days? Is this something that happens randomly to sites, or do many see it happen at once?
Hello everyone, I was on the road and could not log in (yes, I'm old-style, I only log in from my PC!) to thank you all for your help. I still have to digest all this, but my questions have been answered: the "not selected" number is not cumulative; that is, if we make the correct changes, it WILL go down.
Thanks so much for the help, appreciate it.