|Understanding Sitemaps and Crawl Stats in GWT|
I'm a bit baffled, as I try to make sense of certain features in Google Webmaster Tools.
For one, my sitemap shows that 303 webpages were submitted but just 291 were indexed. So, I'm assuming that means that these 12 pages are blocking Googlebot or Googlebot simply can't index them? How can I see which 12 pages these are, so I can check the settings and make sure that is not the case? I just want to make sure that I am not blocking an important page.
I can't figure out how to do that in GWT.
Also, under Crawl > Crawl Stats, it shows me Googlebot activity over the past 90 days. It looks like 836 pages were crawled on 10/16, 429 on 11/19, and 569 on 12/7. I see this up-and-down rollercoaster of Googlebot activity. Why? It's my understanding that the way crawling works is, Google crawls one page with links, then that page spots a link (even an external one), and then it crawls the linked page's content, where it finds more links, and so on. Before you know it, it has made its way around your site from top to bottom. Does this up-and-down crawl rate mean that Google is landing on particular pages some days and getting stuck after just a few crawls, limiting its number of crawled pages?
I will admit some of my articles and posts don't have links in the body of the article, but this is a shopping cart which has a main category menu at the top of the site that repeats itself on every page. So, in other words, every page has links which should allow Google to bounce across my site from top to bottom. Again, I'm a bit baffled. :)
I can't figure out how to look into GWT and see where the problem might lie. I want to see where the crawl gets stuck; maybe that page has a Googlebot block on it, or something else is wrong?
I did write in another post here about how I was having issues getting Canonical URLs to work for some reason. I don't know if it was related to Cloud Flare or something else. The post was here: [webmasterworld.com ].
I explained that the cart created extensions such as:
The ?items_per_page=10 is activated when somebody wants to view 10 products per category page. There are other options for 20 products, 30, etc. Instead of treating ?items_per_page=10 as a view of the same page, Google sees it as two pages. The canonical URL should be fixing this, but it is not.
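For what it's worth, the intended behavior can be sketched in a few lines: every ?items_per_page variant should collapse back to one clean category URL, which is what the rel=canonical tag is supposed to tell Google. A minimal sketch (the items_per_page parameter name is from the post; the URLs and the extra "sort" parameter are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that only change presentation, not content. "items_per_page"
# comes from the post; "sort" is a hypothetical second example.
PRESENTATION_PARAMS = {"items_per_page", "sort"}

def canonical_url(url):
    """Strip presentation-only parameters so every variant maps to one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

# Both per-page variants collapse to the same canonical URL:
print(canonical_url("http://example.com/widgets?items_per_page=10"))
print(canonical_url("http://example.com/widgets?items_per_page=20"))
```

If the cart's rel=canonical tag pointed at exactly this stripped URL on every variant, Google should fold the variants together; the fact that it isn't suggests the tag is missing, wrong, or being overridden on those pages.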
That being said, COULD this stop the crawl and create the Crawl Stats issue?
|For one, my sitemap shows that 303 webpages were submitted but just 291 were indexed. So, I'm assuming that means that these 12 pages are blocking Googlebot or Googlebot simply can't index them? |
These could also be just filtered out by Google.
|How can I see which 12 pages these are, so I can check the settings and make sure that is not the case. I just want to make sure that I am not blocking an important page. |
The easiest way is to crawl the list of URLs in your sitemap with a tool that honours robots.txt and, as it crawls, also reports the index and canonical directives. Then you can check whether all pages are reachable by the tool and, if they are, what their meta robots and canonical values are (i.e. whether Google is allowed to index them).
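The core of such a check is small enough to sketch: pull the URL list out of the sitemap, then inspect each fetched page for its meta robots and rel=canonical values. This is a rough sketch, not a finished tool; the URLs are hypothetical, the regexes assume the common attribute order (name/rel before content/href), and a real crawler must also fetch and honour robots.txt before requesting anything:

```python
import re
from xml.etree import ElementTree

# Sitemap protocol namespace (sitemaps.org schema).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml):
    """Extract all <loc> values from a sitemap.xml document."""
    root = ElementTree.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(NS + "loc")]

def page_report(html):
    """Pull meta robots and rel=canonical out of a fetched page.
    Naive regexes: they assume name= / rel= appears before content= / href=."""
    robots = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)', html, re.I)
    canonical = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', html, re.I)
    return (robots.group(1) if robots else None,
            canonical.group(1) if canonical else None)

sample = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
          '<url><loc>http://example.com/a</loc></url></urlset>')
print(sitemap_urls(sample))
```

Any page whose report comes back "noindex", or whose canonical points at a URL that is not in the sitemap, is a candidate for one of your 12 missing pages.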
How Google crawls
Google does not crawl pages by following links the way you have described. What Google does is crawl your page, collect all the links on that page, and put them into a "TODO" list for later, sorted by the domain the links point to. So, for example, if your page has 10 internal links and 3 external links pointing to three external websites, it will add 10 links to your website's "TODO" list and one link to the "TODO" list of each external website.
When Google returns, it takes x URLs from the TODO list for your site, crawls each one, and for each link found it again puts the link on the "TODO" list for that domain.
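The two paragraphs above can be sketched as a per-domain frontier queue. This is only an illustration of the described model, not Google's actual implementation; the URLs are hypothetical, and a real crawler would also track already-seen URLs so nothing is enqueued twice:

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

# domain -> queue of URLs waiting to be crawled (the "TODO" lists)
todo = defaultdict(deque)

def enqueue_links(links):
    """File each discovered link under the TODO list of its domain."""
    for url in links:
        todo[urlsplit(url).netloc].append(url)

def next_batch(domain, x):
    """Take up to x URLs from one domain's TODO list for the next pass."""
    queue = todo[domain]
    return [queue.popleft() for _ in range(min(x, len(queue)))]

# One crawled page yields 3 internal links and 2 external links:
enqueue_links([
    "http://example.com/a", "http://example.com/b", "http://example.com/c",
    "http://other.com/x", "http://third.org/y",
])
print(next_batch("example.com", 2))
```

The external links end up on the other domains' lists and are crawled on those sites' own schedules, which is why a burst of new links on your pages does not translate into an immediate burst of crawling.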
How often each URL from the TODO list is crawled depends on a number of factors, such as (my guess, as only Google knows):
- the history of how often the page changes
- how deeply the page is buried within the site (click path)
- the number and importance of other pages that internally link to it
- the number and importance of external links to the page [Google did say PR is a crawl factor]
- perhaps the authority of the domain
- perhaps some visitor usage statistics
- perhaps the value of the sitemap.xml priority element as some kind of hint
- and almost certainly some other factors
The canonical issue you have may be the reason why you are not seeing all of the pages in your sitemap indexed. For example, if your sitemap contains:
but Google decides to index this one:
Then the "indexed" number for your sitemap in WMT will be one less, even though Google indexed the content of your page, just under a different (non-canonical) URL.
How do you find out which one Google missed? As you have only 303 pages in your sitemap, you *could* use the "site:example.com" command and see which ones turn up as you paginate through the search results. But this method is not practical for any bigger site.
|The ?items_per_page=10 is activated when somebody wants to view 10 products per category page. There are other options for 20 products, 30, etc. Instead of treating ?items_per_page=10 as a view of the same page, Google sees it as two pages. The canonical URL should be fixing this, but it is not. |
Go to the "parameters" area of WMT. There are two options:
--ignore certain parameters. Your "items per page" is a good example: the parameter value doesn't fundamentally affect the page content.
--don't crawl URLs that contain a certain parameter at all. This one is less common, but "printer-friendly" is a likely example.
You will probably find that your main parameters are already present, even if you've never set foot in this area before. This is also a good chance to look for parameters the search engine was never supposed to know about, as when you rewrite a long messy URL to a short pretty one.
You could also, er, ahem, nag at google to swipe an idea from Yandex's wmt: list un-indexed URLs, along with standard reasons for not indexing. (One is "unsupported language", which neatly takes care of my whole list ;))
I'm in URL Parameters and I see what you're talking about. Thank you.
It shows me that items_per_page shows up on 2,488 monitored URLs. As for crawling, it says "Let Googlebot decide". Should I change that to say that yes, it changes, reorders, or narrows page content?
Just because you include a URL in your sitemap.xml does not guarantee that the URL will be indexed. 96% of your submitted URLs are indexed.
Straight from Google Webmaster Tools "About Sitemaps" help page regarding sitemap.xml files:
|Google doesn't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future. In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it. |
Oh yes, and: if the number of indexed pages is lower than the number in your sitemap, it really doesn't sound as if Google is indexing multiple versions of the same content.
I think the sitemap is best seen as "In case you'd overlooked any of these..." On some well-established sites it might give more meaning. For example if you say something changes weekly, and they know from experience that this is accurate for your site, they may go by what the sitemap says. If, like me, you just say "I dunno, maybe every month or so" they'll probably decide for themselves how often to crawl.
I do find the number of pages indexed from the sitemap an interesting metric to follow. If there is a big gap, then pages you think are important are either not important, not crawled yet, or Google chose a different URL with the same content to index instead.
Yes, Google does not guarantee that they will crawl or index all of the URLs in the sitemap, but if the number of sitemap URLs that are indexed is low, then this could indicate some problems (either with the sitemap or with the site).