Forum Moderators: Robert Charlton & goodroi
Questions:
1. Does a very complex interlinking structure, such as the one described, potentially limit the number of pages that Google will spider and index?
2. Does having Google Sitemaps accessible via Webmaster Tools and robots.txt mitigate any negative impact of the site's interlinking structure described above?
3. What is the latest on the virtues of using sitemaps with Google? We manage content for a number of independently owned sites, and it seems that those without sitemaps are more fully indexed than those with them. The sites are very similar in content and structure.
4. What is the best method these days for finding out the number of pages in the Google index? The numbers never match up between Webmaster Tools and the site: operator. Sometimes the number is higher in Webmaster Tools, and other times with the site: operator. Any thoughts on this?
We have two XML Sitemap templates:
Sitemap.xml = categories, subcategories, and any store pages (manufacturer pages only)
Sitemap_index.xml = lists all the sitemap links within the site
Sitemap1, 2, 3... = all our product pages, starting at the subcategory level, prioritized by item number (so it's 1, 2, 3, 4... etc.)
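For reference, a sitemap index like the Sitemap_index.xml described above is just a list of `<loc>` entries pointing at the child sitemaps. A minimal sketch of generating one (the domain and filenames here are placeholders, not the actual site's):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemap_urls):
    """Build a sitemap index document listing each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sm = ET.SubElement(root, "sitemap")
        ET.SubElement(sm, "loc").text = url
    return ET.tostring(root, encoding="unicode")

# Hypothetical child sitemaps, mirroring the structure described above.
index_xml = build_sitemap_index([
    "https://www.example.com/Sitemap.xml",
    "https://www.example.com/Sitemap1.xml",
    "https://www.example.com/Sitemap2.xml",
])
print(index_xml)
```

The real file would also carry optional `<lastmod>` elements per child sitemap, which Google uses as a hint for re-download scheduling.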
166,262 URLs submitted Jul 24, 2008; last downloaded Sept 7. 63,900 indexed, roughly 38%.
Average index refresh time: 6-7 days.
7 days between downloads is 41% of our indexed amount.
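For what it's worth, the roughly-38% figure checks out from the raw numbers above:

```python
submitted = 166_262  # URLs submitted per Webmaster Tools
indexed = 63_900     # URLs Google reports as indexed

ratio = indexed / submitted
print(f"{ratio:.1%}")  # → 38.4%
```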
My basic assumptions:
1. URL duplication is not a problem. Pages are clearly linked with one category/sub-category. We solved these problems about a year ago.
2. Content sucks... we do have a number of issues with similar content around the web. We get a lot of our content from eTilize, a data subscription; I'd say it's 90%. But in our defense, so does everyone else in our game. The world of fresh content is a challenge we face daily! So I'm not going to rule this out as the key factor when Google chooses whether to index our pages.
3. We have multiple sitemap files and use a sitemap index file. So we’re following size restrictions and are safely within the bounds of XML rules.
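The size restriction in question is the sitemap protocol's cap of 50,000 URLs per sitemap file, which is why a catalog this size needs multiple files plus an index. A minimal sketch of the splitting step (the URL pattern is hypothetical):

```python
def chunk_urls(urls, max_per_file=50_000):
    """Split a URL list into sitemap-sized chunks (protocol cap: 50,000 URLs per file)."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]

# e.g. 166,262 product URLs fit in 4 sitemap files
chunks = chunk_urls([f"https://www.example.com/product/{n}" for n in range(166_262)])
print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be written out as its own Sitemap1.xml, Sitemap2.xml, etc., and listed in the index file.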
4. Crawl rate is set to “Recommended” – I fear touching this value.
5. We have all 15 of the allocated parameter-handling rules set, and we use canonical meta tags as well.
Questions: 1 & 2
You’re basically asking whether Googlebot first recognizes the data from the XML feed, or whether it crawls independently? Does the spoon-feeding actually happen? I suspect there’s a correlation, but our data doesn’t give us enough to prove it either way.
Question 3
Yes! It’s a good point! What do the sitemaps in Google actually do for us these days? Before XML sitemaps, we had traditional spidering. Ask yourself: technically, are you penalized for not having a sitemap? I don’t think so... nonetheless, I don’t have the guts to risk pulling my sitemaps.
Question 4
I would also like to validate our process for counting indexed pages on Google. I don’t think there is a real easy way. We count the totals reported for the indexed XML files, but we also do a random test to validate the feed data. I can’t go through 520 sitemaps manually. Any other thoughts?
My Conclusion
I think the sitemap picture is more complicated when dealing with larger dynamic websites. Past a certain point, Google doesn’t seem to care... Based on our issues, I bet there are similar organic-traffic correlation issues with BASE as well. But hey, that’s another conversation.
Right. I’m going to try an experiment with our sitemap structure. Basically, I'll rewrite our XML feed to accommodate these three options based on our product catalog:
Option 1 – Mixed Tiers
Option 2 – Hierarchy Based
Option 3 – Treasure Map Based
And I'll get back to you with some results.
More specifics on sites we manage content for:
Main site:
121,140 total URLs in 8 total sitemaps - 12,550 indexed = 10%
maps 1 & 2 = unique health condition and ingredient pages: 4,375 URLs, 1,290 indexed
(static ColdFusion pages woven into site navigation based on relevancy to pages)
map 3 = subcategory pages: 2,128 URLs, 1,592 indexed
(static ColdFusion pages woven into site navigation based on relevancy to pages)
map 4 = SKU-level product pages: 14,068 URLs, 9,127 indexed
(according to G-sitemap data; we can only find 4,060 when using the site: operator)
(static ColdFusion pages with links to other relevant content)
maps 5, 6 & 7 = relevant-mix product pages: 100,200 URLs, 170 indexed
(dynamic .php pages interlinked to relevant product pages)
map 8 = vendor pages: 412 URLs, 263 indexed
(static ColdFusion pages with relevant links)
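Tabulating those per-map numbers makes the imbalance obvious: the 100k relevant-mix pages are almost entirely unindexed, while the smaller static maps do far better (figures copied from the list above):

```python
# (submitted, indexed) per sitemap group, figures from the post above
maps = {
    "maps 1 & 2 (conditions/ingredients)": (4_375, 1_290),
    "map 3 (subcategories)": (2_128, 1_592),
    "map 4 (SKU product pages)": (14_068, 9_127),
    "maps 5-7 (relevant-mix pages)": (100_200, 170),
    "map 8 (vendor pages)": (412, 263),
}
for name, (submitted, indexed) in maps.items():
    print(f"{name}: {indexed / submitted:.1%} indexed")
```

Subcategory and vendor maps land in the 60-75% range while the relevant-mix pages sit under 1%, which is what drags the site-wide figure down to 10%.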
All maps are updated monthly. We had set the revisit frequency for Google to monthly, but increased it to daily to see if that would impact indexing. Google hits them almost daily, but as yet there's been no impact on indexing.
We know Google states that having sitemaps doesn't assure the pages will be crawled or indexed, but we are trying to understand why other sites with similar content and structure, without sitemaps, are coming in with higher numbers for the #4 SKU product-type pages.
One thing that is different for the main site above (the largest we manage content for) is that it has the 100k dynamic pages woven into the architecture. These pages have been refined recently; there were some quality issues with them, so we're hoping the refinements may make a difference. The other sites don't have the internal menu with these links built in.
A few of the numbers for the SKU product pages on sites without the sitemaps:
site 1 = 29,900 URLs (several of the sites have more than one version of each page, resulting in totals exceeding the actual number available)
site 2 = 11,100 URLs
site 3 = 9,730 URLs
site 4 = 12,800 URLs
site 5 = 14,300 URLs
We are working on improvements to the site, such as the refinement of the relevant-mix product pages, and hope this will improve our Google quality and trust profile so that more pages get indexed. So we'll continue with the maps for now, but we are considering pulling them if things don't improve.
[edited by: Robert_Charlton at 8:59 pm (utc) on Sep. 30, 2009]
[edit reason] fixed formatting [/edit]
This also begs the question of whether Google is going to index the inventories of millions of commercial websites selling the same products sans user-beneficial articles. Just a few thoughts. We concentrate on top-level entry through articles that lead to sales, rather than trying to get the obviously cookie-cutter manufacturer product descriptions to rank in the search engines.