For one, my sitemap shows that 303 webpages were submitted but just 291 were indexed. So, I'm assuming that means that these 12 pages are blocking Googlebot or Googlebot simply can't index them?
These could also be just filtered out by Google.
How can I see which 12 pages these are, so I can check the settings and make sure that is not the case? I just want to make sure that I am not blocking an important page.
The easiest way is to crawl the list of URLs in your sitemap with a tool that honours robots.txt and also reports each page's meta robots (index/noindex) and canonical settings. You can then check whether the tool was able to crawl every page and, for the pages it did crawl, whether their meta robots and canonical settings allow Google to index them.
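If you would rather script the check than use an off-the-shelf crawler, here is a minimal sketch in Python using only the standard library. The function and class names (`sitemap_urls`, `HeadSignals`, `audit`) are my own, and it assumes you have already downloaded the sitemap, robots.txt and each page's HTML yourself. It is not a replacement for a real crawler, just an illustration of the three things worth checking per URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> value from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


class HeadSignals(HTMLParser):
    """Collect the meta robots directive and the rel=canonical URL of a page."""

    def __init__(self):
        super().__init__()
        self.robots = ""
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def audit(url, html, robots_txt, agent="Googlebot"):
    """Return a list of reasons why this URL might not be indexed as submitted."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    head = HeadSignals()
    head.feed(html)

    problems = []
    if not rules.can_fetch(agent, url):
        problems.append("blocked by robots.txt")
    if "noindex" in head.robots:
        problems.append("meta robots noindex")
    if head.canonical:
        # Resolve relative canonical hrefs against the page URL before comparing.
        canonical = urljoin(url, head.canonical)
        if canonical != url:
            problems.append("canonical points to " + canonical)
    return problems
```

Run `audit` over every URL returned by `sitemap_urls`; any URL that comes back with a non-empty list is a candidate for one of your 12 missing pages.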
How Google crawls
Google does not crawl a site by following links the way you have described. Instead, it crawls a page, collects all the links on that page and puts them into a "TODO" list for later, sorted by the domain each link points to. For example, if your page has 10 internal links and 3 external links pointing to three different external websites, it will add 10 links to the "TODO" list for your website and one link to the "TODO" list of each external website.
When Google returns, it takes x URLs from the TODO list for your site, crawls each one, and adds every link it finds to the "TODO" list for the domain that link points to.
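To make the "TODO" list idea concrete, here is a toy model in Python. This is only my illustration of the behaviour described above, not Google's actual implementation: the `Frontier` class, the `visit` function and the `fetch_links` callback are all made-up names, and for simplicity the sketch queues each URL only once and ignores recrawl scheduling entirely.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class Frontier:
    """Per-domain "TODO" lists of URLs waiting to be crawled."""

    def __init__(self):
        self.todo = defaultdict(deque)  # domain -> queue of URLs
        self.seen = set()               # simplification: queue each URL once

    def add(self, url):
        # File the URL under the domain it points to.
        if url not in self.seen:
            self.seen.add(url)
            self.todo[urlparse(url).netloc].append(url)

    def take(self, domain, x):
        # Remove and return up to x URLs from one domain's TODO list.
        queue = self.todo[domain]
        return [queue.popleft() for _ in range(min(x, len(queue)))]


def visit(frontier, domain, x, fetch_links):
    """One crawler visit: crawl up to x queued URLs and queue the links found.

    fetch_links is a hypothetical callback that returns the links on a page.
    """
    crawled = frontier.take(domain, x)
    for url in crawled:
        for link in fetch_links(url):
            frontier.add(link)
    return crawled
```

The point to notice is that a link discovered on a page is not crawled right away. It just waits in the queue for its domain until a later visit picks it up.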
How often each URL from the TODO list is crawled depends on a number of factors, such as (my guess, as only Google knows):
- the history of how often the page changes
- how deeply the page is buried within the site (click path)
- how many other pages link to it internally, and how important they are
- how many external links point to the page, and how important they are [Google did say PR is a crawl factor]
- perhaps the authority of the domain
- perhaps some visitor usage statistics
- perhaps the value of the sitemap.xml priority element as some kind of hint
- and almost certainly some other factors
The canonical issue you have may be the reason why you are not seeing all of the pages in your sitemap indexed. For example, if your sitemap contains:
/categories/
but Google decides to index this one:
/categories/?items_per_page=10
Then the "indexed" count for your sitemap in WMT will be one lower, even though Google did index the content of your page, just under a different (non-canonical) URL.
How to find out which ones Google missed?
As you have only 303 pages in your sitemap, you *could* use the "site:example.com" command to see which ones turn up when you paginate through search results. But this method is not practical for any bigger site.
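If you do collect the indexed URLs by hand, a few lines of Python can do the comparison for you. This is a rough sketch that assumes you have two plain lists: the URLs from your sitemap and the URLs you copied out of the search results. The function names `strip_query` and `compare` are my own. It also flags the canonical case described above, where a page is indexed under a parameterised variant of the sitemap URL.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, e.g. ?items_per_page=10."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def compare(sitemap_urls, indexed_urls):
    """Map each sitemap URL that is not indexed as submitted to any
    indexed variants that share its path (an empty list means none)."""
    indexed = set(indexed_urls)
    variants = {}
    for url in indexed_urls:
        variants.setdefault(strip_query(url), []).append(url)

    report = {}
    for url in sitemap_urls:
        if url in indexed:
            continue
        report[url] = variants.get(strip_query(url), [])
    return report
```

A sitemap URL that maps to a non-empty list is probably a canonical mismatch, while one that maps to an empty list is worth checking for robots.txt or noindex problems.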