
Sitemaps, Meta Data, and robots.txt Forum

    
HTML sitemaps for very large sites
DiscoStu
6:13 pm on Aug 7, 2009 (gmt 0)

So you can have a maximum of 50K URLs in an XML sitemap file. But what about the HTML sitemap? Do the search engines use the HTML sitemap at all if there's an XML sitemap in place? I.e. should an HTML sitemap be strictly for users, or should it be used as a way to get spiders to crawl the site more thoroughly?

For smaller sites (say <100 pages) the XML and HTML sitemaps can look the same: list everything on one page. This works both for search engines and for users. But let's say you have a site with 20K pages. That works fine for the XML sitemap, but not for an HTML sitemap (not user friendly, and having an HTML page with 20K links on it *seems* like a bad idea in general?).

My understanding is that XML sitemaps should always be exhaustive, covering every single URL on the site. But on very large sites (1 mil+ pages), does it make sense to even try to cover every URL in the HTML sitemaps? Or should you forget about the search engines here and just make a user-friendly directory of your site (and let the search engines use the exhaustive XML sitemaps to aid their indexing, leaving the HTML sitemap(s) for humans only)?
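For what it's worth, here's a rough sketch of the small-site case where one URL list drives both files (Python; the URLs and file names are made up for illustration, not taken from any real setup):

```python
# Rough sketch: for a small site, one URL list can drive both the XML sitemap
# and the HTML sitemap page. URLs and file names below are placeholders.
urls = [
    "http://www.example.com/",
    "http://www.example.com/about.htm",
    "http://www.example.com/contact.htm",
]

# XML sitemap per the Sitemaps protocol: one <url><loc> entry per page.
xml = ['<?xml version="1.0" encoding="UTF-8"?>',
       '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
xml += ["  <url><loc>%s</loc></url>" % u for u in urls]
xml.append("</urlset>")
with open("sitemap.xml", "w") as f:
    f.write("\n".join(xml))

# HTML sitemap for visitors: the same links as a plain list.
html = ["<ul>"] + ['  <li><a href="%s">%s</a></li>' % (u, u) for u in urls] + ["</ul>"]
with open("sitemap.htm", "w") as f:
    f.write("\n".join(html))
```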

 

pageoneresults
6:36 pm on Aug 7, 2009 (gmt 0)

Do the search engines use the HTML sitemap at all if there's an XML sitemap in place? I.e. should an HTML sitemap be strictly for users, or should it be used as a way to get spiders to crawl the site more thoroughly?

The search engines are going to use any links they find for discovery. If the HTML Sitemap is discovered during crawling, which it most likely will be if it's linked to, then anything it links to will also be discovered, barring any restrictions on indexing via a Robots META tag or other methods. I'd probably noindex the Sitemap in some instances; that allows the links to be crawled but keeps the page itself out of the indices. Most Sitemaps are reached by users through a recognizable link, e.g. <a href="/sitemap.htm">Sitemap</a> in a footer element, etc.
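To make the noindex idea concrete, a minimal sketch (Python; the link targets and file name are hypothetical) of an HTML Sitemap page carrying a "noindex, follow" Robots META tag, so its links can be crawled while the page itself stays out of the indices:

```python
# Sketch only: build an HTML Sitemap page with a "noindex, follow" Robots META
# tag -- the page stays out of the index, but the links on it can be crawled.
# The link targets and file name below are hypothetical.
links = [
    ("/widgets/", "Widgets"),
    ("/gadgets/", "Gadgets"),
]

page = ["<html><head>",
        '  <meta name="robots" content="noindex, follow">',
        "  <title>Sitemap</title>",
        "</head><body><ul>"]
page += ['  <li><a href="%s">%s</a></li>' % (href, text) for href, text in links]
page += ["</ul></body></html>"]

with open("sitemap.htm", "w") as f:
    f.write("\n".join(page))
```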

For smaller sites (say <100 pages) the XML and HTML sitemaps can look the same: list everything on one page. This works both for search engines and for users. But let's say you have a site with 20K pages. That works fine for the XML sitemap, but not for an HTML sitemap (not user friendly, and having an HTML page with 20K links on it *seems* like a bad idea in general?).

I've seen quite a bit of Ajax and other technologies being used to dynamically generate Sitemaps. I surely wouldn't rely on a Sitemap to serve the average consumer, not with 20k links. I think most sites rely on a OneBox (search) approach when serving that much content to visitors. On-site search is the key here.

My understanding is that XML sitemaps should always be exhaustive, covering every single URL on the site. But on very large sites (1 mil+ pages), does it make sense to even try to cover every URL in the HTML sitemaps?

Yes, it does. You want to use whatever protocols are available to you to influence the crawling activities of the bots that adhere to the Sitemaps Protocol. I put Sitemaps in the Metadata category: they are information about information, same concept.

Even in a perfect world, you WOULD still need Sitemaps. It doesn't matter that the architecture of your site is perfect and allows proper indexing: those Sitemaps are a map for the bots and are also great for saving crawl resources. Yes, the bots are going to crawl everything anyway. But if I have a GWT account and I connect a Sitemap, I'm giving the bot a little more information and assistance in crawling my website. At the same time, I'm getting feedback from the bot, which is of great use in many instances.
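As a side note, the Sitemaps Protocol itself also gives you two submission routes besides a GWT account; a rough sketch (the sitemap location is a placeholder, and the Google ping endpoint shown is the one documented at the time, so double-check it before relying on it):

```python
# Sketch of two submission routes from the sitemaps.org protocol (placeholder
# URLs): autodiscovery via a "Sitemap:" line in robots.txt, and an HTTP ping.
from urllib.parse import quote
from urllib.request import urlopen

SITEMAP_URL = "http://www.example.com/sitemap.xml"  # hypothetical location

# 1. Autodiscovery: append a Sitemap directive to robots.txt.
with open("robots.txt", "a") as f:
    f.write("\nSitemap: %s\n" % SITEMAP_URL)

# 2. Ping the engine to say the Sitemap has changed (endpoint as documented then).
urlopen("http://www.google.com/ping?sitemap=" + quote(SITEMAP_URL, safe=""))
```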

Or should you forget about the search engines here and just make a user-friendly directory of your site (and let the search engines use the exhaustive XML sitemaps to aid their indexing, leaving the HTML sitemap(s) for humans only)?

Let's take your 20k page example. I'll break it down into 10 categories, 2k pages per category, so I'm going to have 10 category-specific Sitemaps. Now, if each of those 10 categories breaks down further into 10 sub-categories of 200 pages each, I'm going to have sub-category-specific Sitemaps as well.

Sitting at the root level of the site will be the Mother of ALL Sitemaps. It will contain just the links to the category and sub-category Sitemaps. You'll create one solid indexing haven for the bots while at the same time providing user-friendly, category-specific Sitemaps for your visitors.
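A rough sketch of that layout (Python; the category names, page counts, and URL patterns are invented just to show the shape, not a drop-in script):

```python
# Sketch of the structure described above: one XML Sitemap per category, plus a
# sitemap index at the root that links only to those category Sitemaps.
from collections import defaultdict

HEAD = '<?xml version="1.0" encoding="UTF-8"?>'
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

def write_urlset(path, urls):
    rows = ["  <url><loc>%s</loc></url>" % u for u in urls]
    with open(path, "w") as f:
        f.write("\n".join([HEAD, "<urlset %s>" % NS] + rows + ["</urlset>"]))

# Hypothetical input: (category, url) pairs covering the whole 20k-page site.
pages = [("cat%d" % c, "http://www.example.com/cat%d/page-%d.htm" % (c, i))
         for c in range(10) for i in range(2000)]

by_cat = defaultdict(list)
for cat, url in pages:
    by_cat[cat].append(url)

# One Sitemap per category...
for cat, urls in sorted(by_cat.items()):
    write_urlset("sitemap-%s.xml" % cat, urls)

# ...and the root sitemap index that only points at the category Sitemaps.
rows = ["  <sitemap><loc>http://www.example.com/sitemap-%s.xml</loc></sitemap>" % cat
        for cat in sorted(by_cat)]
with open("sitemap_index.xml", "w") as f:
    f.write("\n".join([HEAD, "<sitemapindex %s>" % NS] + rows + ["</sitemapindex>"]))
```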

I still feel strongly that the OneBox or OmniBox (search) is the best approach for consumers finding information on larger sites.

P.S. You don't need to serve two. Just serve the XML Sitemaps with a stylesheet and then go extensionless. Visitors and bots alike will be jazzed. :)
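If it helps, a sketch of the stylesheet trick (Python; the /sitemap.xsl file is assumed to already exist, and the xml-stylesheet processing instruction only affects how browsers render the file, not what the bots read):

```python
# Sketch: write sitemap.xml with an xml-stylesheet processing instruction so a
# browser renders it through /sitemap.xsl (assumed to exist) as a readable page.
urls = ["http://www.example.com/", "http://www.example.com/about.htm"]

doc = ['<?xml version="1.0" encoding="UTF-8"?>',
       '<?xml-stylesheet type="text/xsl" href="/sitemap.xsl"?>',
       '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
doc += ["  <url><loc>%s</loc></url>" % u for u in urls]
doc.append("</urlset>")

with open("sitemap.xml", "w") as f:
    f.write("\n".join(doc))
```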

P.P.S. I'm also an amateur when it comes to this Sitemap stuff, so I'm hoping one of my Peers will come to my rescue if I've given any incorrect information. I've read and I follow the Sitemaps Protocol, but there may be something I'm missing. ;)

DiscoStu
6:55 pm on Aug 10, 2009 (gmt 0)

You don't need to serve two. Just serve the XML Sitemaps with a stylesheet and then go extensionless.

Thanks for the reply. But if you look at big sites like Yelp and yellowpages etc., their HTML and XML sitemaps look massively different. They use an XML sitemap index file to point to a straightforward list of sitemaps with 50K URLs per sitemap, whereas the HTML sitemaps are much more based on intuitive navigation.
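For reference, a rough sketch of that index-plus-chunks pattern (Python; the URL pattern and counts are placeholders, and the 50,000-URL cap per Sitemap file is the protocol limit being discussed):

```python
# Sketch: split a large URL list into 50,000-URL Sitemap files and write one
# sitemapindex that points at them. URLs and counts here are placeholders.
HEAD = '<?xml version="1.0" encoding="UTF-8"?>'
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
LIMIT = 50000  # max URLs per Sitemap file under the Sitemaps protocol

all_urls = ["http://www.example.com/listing-%d.htm" % i for i in range(120000)]

chunks = [all_urls[i:i + LIMIT] for i in range(0, len(all_urls), LIMIT)]
for n, chunk in enumerate(chunks, 1):
    rows = ["  <url><loc>%s</loc></url>" % u for u in chunk]
    with open("sitemap-%d.xml" % n, "w") as f:
        f.write("\n".join([HEAD, "<urlset %s>" % NS] + rows + ["</urlset>"]))

rows = ["  <sitemap><loc>http://www.example.com/sitemap-%d.xml</loc></sitemap>" % n
        for n in range(1, len(chunks) + 1)]
with open("sitemap_index.xml", "w") as f:
    f.write("\n".join([HEAD, "<sitemapindex %s>" % NS] + rows + ["</sitemapindex>"]))
```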

Even in a perfect world, you WOULD still need Sitemaps. It doesn't matter that the architecture of your site is perfect and allows proper indexing: those Sitemaps are a map for the bots and are also great for saving crawl resources.

I understand that sitemaps are useful; my point is just that, given that an exhaustive XML sitemap is already in place, won't an HTML sitemap be redundant for indexing purposes? Is there anything wrong with using the XML sitemap strictly to aid indexing/crawling, and the HTML sitemap strictly for usability?
