Forum Moderators: Robert Charlton & goodroi


Site structure and sitemaps - effect on indexing

         

latimer

8:55 pm on Sep 16, 2009 (gmt 0)

10+ Year Member



We have 14,000 products spread over hundreds of product categories, many of which overlap. Our efforts to provide a rich user experience and focused page content, with keywords in links, result in a very complex site structure that interlinks various levels of the site.

Questions:

1. Does a very complex interlinking structure such as the one described potentially limit the number of pages that Google will spider and index?

2. Does having Google sitemaps accessible via Webmaster Tools and robots.txt mitigate any negative impact of the site's interlinking structure described above?

3. What is the latest on the virtues of using sitemaps with Google? We manage content for a number of independently owned sites, and it seems that those without sitemaps are more fully indexed than those with them. The sites are very similar in content and structure.

4. What is the best method these days for finding the number of pages in the Google index? The numbers from Webmaster Tools and the site: operator never match up. Sometimes the number is higher in Webmaster Tools, and other times it's higher with the site: operator. Any thoughts on this?

seogio

12:43 am on Sep 17, 2009 (gmt 0)

10+ Year Member



Good topic. I'll join in with my own issues and hopefully we'll create some buzz around this enterprise-level discussion. I'm working with 160,000 SKUs (mine's bigger than yours, baby!). Long-term player in ecommerce; established site that's been around since 1999. I'm seeing poor performance on our current sitemap, and I have good data in Excel from trying to decipher the crawl rates. Here's my info.

We have two XML Sitemap templates:

Sitemap.xml = categories, subcategories, and any store pages (manufacturer pages only)

Sitemap_index.xml = lists all the sitemap files within the site.

Sitemap1.xml, Sitemap2.xml, Sitemap3.xml... = all our product pages, split out at the subcategory level and ordered by item number (so it's 1, 2, 3, 4... etc.). The sketch just below shows this layout.
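For anyone reading along, here's roughly what that layout amounts to. A minimal sketch in Python; the domain, file names, and the product_urls_by_subcategory() helper are made-up placeholders, not our production code. It writes one urlset file per subcategory, builds the sitemap index that references them, and prints the robots.txt line that points crawlers at the index, keeping each file under the protocol's 50,000-URL cap.

# sketch_sitemaps.py -- illustrative only; the domain, file names, and
# the product_urls_by_subcategory() helper are hypothetical placeholders.
import datetime
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
DOMAIN = "http://www.example.com"  # placeholder domain
MAX_URLS_PER_FILE = 50000          # sitemaps.org per-file limit

def product_urls_by_subcategory():
    """Placeholder: yield (subcategory, [product URLs]) from the catalog."""
    yield "widgets", [f"{DOMAIN}/widgets/item-{i}" for i in range(1, 4)]

def write_urlset(filename, urls):
    """Write one sitemap file holding up to MAX_URLS_PER_FILE entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in urls[:MAX_URLS_PER_FILE]:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = u
        ET.SubElement(entry, "lastmod").text = datetime.date.today().isoformat()
        ET.SubElement(entry, "changefreq").text = "monthly"
    ET.ElementTree(urlset).write(filename, encoding="utf-8", xml_declaration=True)

# One sitemap file per subcategory, plus an index that lists them all.
index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
for n, (subcat, urls) in enumerate(product_urls_by_subcategory(), start=1):
    filename = f"sitemap{n}.xml"
    write_urlset(filename, urls)
    ref = ET.SubElement(index, "sitemap")
    ET.SubElement(ref, "loc").text = f"{DOMAIN}/{filename}"
ET.ElementTree(index).write("sitemap_index.xml", encoding="utf-8", xml_declaration=True)

# robots.txt can then point crawlers at the index:
print(f"Sitemap: {DOMAIN}/sitemap_index.xml")

Splitting at the subcategory level keeps each file well under the cap, and a change in one subcategory only touches one file.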

166,262 URLs submitted Jul 24, 2008; last downloaded Sept 7. 63,900 indexed, roughly 38%.

Average index refresh time: 6-7 days.

Seven days between downloads works out to about 41% of our indexed amount.

My basic assumptions:
1. URL duplication is not a problem. Each page is clearly aligned with one category/sub-category. We solved those problems about a year ago.
2. Content sucks... we do have a number of issues with similar content around the web. We get a lot of our content from eTilize, a data subscription; I'd say it's 90%. But in our defense, so does everyone else in our game. The world of fresh content is a challenge we face daily! So I'm not ruling this out as the key factor when Google chooses whether to index our pages.
3. We have multiple sitemap files and use a sitemap index file, so we're following the size restrictions and are safely within the bounds of the XML rules (see the quick check sketched after this list).
4. Crawl rate is set to "Recommended"; I fear touching this value.
5. We have the allocated 15 parameter-handling rules set. We use canonical metas as well.
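On point 3, the quick check I mean looks something like this; a sketch, assuming the sitemap files sit locally and match a sitemap*.xml pattern (both assumptions for illustration). Under the 0.9 protocol, each file has to stay at or under 50,000 URLs and 10MB uncompressed.

# check_limits.py -- sketch; the glob pattern and file locations are assumptions.
import glob
import os
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for path in sorted(glob.glob("sitemap*.xml")):  # placeholder pattern
    url_count = len(ET.parse(path).getroot().findall("sm:url", NS))
    size_mb = os.path.getsize(path) / 1_000_000
    status = "OK" if url_count <= 50000 and size_mb <= 10 else "OVER LIMIT"
    print(f"{path}: {url_count} URLs, {size_mb:.1f} MB -- {status}")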

Questions: 1 & 2
You’re basically asking if G Bot first recognize the data from the XML feed? Or is it independent? Is spoon feeding happened for real? I suspect there’s a correlation but our data doesn’t give enough to prove either way.

Question 3
YES! It's a good point! What do sitemaps in Google actually do for us these days? Before the sitemap XML scan came along, we had traditional spidering. Ask yourself: technically, are you penalized for not having a sitemap? I don't think so... nonetheless, I don't have the guts to risk pulling my sitemaps.

Question 4
I would also like to validate our process for counting indexed pages on Google. I don't think there's a really easy way. We count the indexed totals across the XML files, but we also run a random spot-check to validate the feed data. I can't go through 520 sitemaps manually. Any other thoughts?
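For the record, our random test boils down to something like this; a Python sketch where INDEX_URL and the sample size of 50 are placeholders. It pulls every <loc> out of the sitemap index and each child sitemap, counts them, and prints a random handful small enough to verify by hand (e.g. with a site: query).

# spot_check.py -- sketch; INDEX_URL and the sample size are placeholders.
import random
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
INDEX_URL = "http://www.example.com/sitemap_index.xml"  # placeholder

def locs(url, tag):
    """Return the text of every <tag>/<loc> element in one sitemap file."""
    root = ET.parse(urllib.request.urlopen(url)).getroot()
    return [e.text for e in root.findall(f"sm:{tag}/sm:loc", NS)]

# Walk the index, then every child sitemap it references.
all_urls = []
for sitemap_url in locs(INDEX_URL, "sitemap"):
    all_urls.extend(locs(sitemap_url, "url"))

print(f"{len(all_urls)} URLs across all sitemap files")
for u in random.sample(all_urls, min(50, len(all_urls))):
    print(u)  # verify each by hand, e.g. with a site: query

Fifty spot-checks out of 166k won't be precise, but it's enough to tell whether the Webmaster Tools count is in the right ballpark.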

My Conclusion
I think the sitemap story is more complicated when dealing with larger dynamic websites. Past a certain point, Google doesn't seem to care... Based on our issues, I bet there are similar organic-traffic correlation issues with BASE as well. But hey, that's another conversation.

Right. I'm going to try an experiment with our sitemap structure: basically, rewrite our XML feed to accommodate these 3 options based on our product catalog:

Option 1 – Mixed Tiers
Option 2 – Hierarchy Based
Option 3 – Treasure Map Based

And I'll get back to you with some results.

latimer

4:51 am on Sep 17, 2009 (gmt 0)

10+ Year Member



Thanks for jumping in, seogio.

More specifics on sites we manage content for:

Main site:

121,140 total URLs in 8 sitemaps; 12,550 indexed = roughly 10%

maps 1 & 2 = unique health condition and ingredient pages: 4,375 URLs, 1,290 indexed
(static ColdFusion pages woven into site navigation based on relevancy to pages)

map 3 = subcategory pages: 2,128 URLs, 1,592 indexed
(static ColdFusion pages woven into site navigation based on relevancy to pages)

map 4 = SKU-level product pages: 14,068 URLs, 9,127 indexed
(according to G-sitemap data; can only find 4,060 when using the site: operator)
(static ColdFusion pages with links to other relevant content)

maps 5, 6 & 7 = relevant-mix product pages: 100,200 URLs, 170 indexed
(dynamic .php pages interlinked with relevant product pages)

map 8 = vendor pages: 412 URLs, 263 indexed
(static ColdFusion pages with relevant links)
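Running the ratios per map makes the pattern easier to see; a quick Python sketch using the figures above:

# Index coverage per sitemap, figures copied from the breakdown above.
maps = {
    "maps 1 & 2 (health/ingredient)": (4375, 1290),
    "map 3 (subcategories)":          (2128, 1592),
    "map 4 (sku products)":           (14068, 9127),
    "maps 5-7 (relevant mix)":        (100200, 170),
    "map 8 (vendors)":                (412, 263),
}
for name, (submitted, indexed) in maps.items():
    print(f"{name}: {indexed}/{submitted} = {indexed / submitted:.1%}")

Seen that way, maps 5, 6 & 7 (the dynamic relevant-mix pages) index at about 0.2%, while the conventional category, product, and vendor maps run roughly 30-75%; it's those 100k dynamic pages that pull the site-wide figure down to 10%.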

All maps are updated monthly. We had set the revisit frequency for Google to monthly, but increased it to daily to see if that would impact the indexing. Google hits the maps almost daily, but as yet there's been no impact on indexing.

We know Google states that having sitemaps doesn't assure that pages will be crawled or indexed, but we're trying to understand why other sites with similar content and structure, without sitemaps, are coming in with higher numbers for the #4 SKU product-type pages.

One thing that's different for the main site above (the largest we manage content for) is that it has the 100k dynamic pages woven into the architecture. These pages have been refined recently, as there were some quality issues with them, so we're hoping the refinements may make a difference. The other sites don't have the internal menu with these links built in.

A few of the numbers for the SKU product pages on the sites without sitemaps:

site 1 = 29,900 URLs (several of the sites have more than one version of each page, so the totals exceed the actual number of pages available)

site 2 = 11,100 URLs

site 3 = 9,730 URLs

site 4 = 12,800 URLs

site 5 = 14,300 URLs

We are working on improvements to the site, such as the refinement of the relevant-mix product pages, and hope this will improve our G-quality and trust profile so that more pages get indexed. So we'll continue with the maps for now, but we are considering pulling them if things don't improve.

[edited by: Robert_Charlton at 8:59 pm (utc) on Sep. 30, 2009]
[edit reason] fixed formatting [/edit]

tangor

5:08 am on Sep 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What is the end goal? Displaying a product checkout page, or having visitors to the site who, if interested, will find that product page all by themselves? Not suggesting that Google looks at "creamy smooth widgets" on umpteen billion pages as "duplicated all over the place" content, though that might indicate why "we (G) do not crawl any deeper".

It also begs the question of whether G is going to index the INVENTORIES of millions of commercial websites selling the same products SANS USER-BENEFICIAL ARTICLES. Just a few thoughts. We concentrate on top-level entry to articles that lead to sales, rather than trying to get the obviously cookie-cutter manufacturer descriptions of products to rank at the search engines.

seogio

8:03 pm on Sep 30, 2009 (gmt 0)

10+ Year Member



Sorry it's taking so long. I'll get back with results in the near future. You're all right about content exclusivity. Before I start changing our content in any big way, we'll test the sitemap change, and at the very least we'll see what happens.