|Bing Publishes Sitemaps Best Practice, Including Large Sites|
That's a useful reminder for those who already know, and an education for those creating their first sitemap.
|Interestingly some sites these days, are large… really large… with millions to billions of URLs. Sitemap index files or sitemap files can link up to 50,000 links, so with one sitemap index file, you can list 50,000 x 50,000 links = 2,500,000,000 links. If you have more than 2.5 Billion links… think first if you really need so many links on your site. In general search engines will not crawl and index all of that. It’s highly preferable that you link only to the most relevant web pages to make sure that at least these relevant web pages are discovered, crawled and indexed. Just in case, if you have more than 2.5 billion links, you can use 2 sitemap index files, or you can use a sitemap index file linking to sitemap index files offering now up to 125 trillion links: so far that’s still definitely more than the number of fake profiles on some social sites, so you’ll be covered.
Bing Publishes Sitemaps Best Practice, Including Large Sites [blogs.bing.com] |
|Best practices if you want to enable a sitemaps |
They could at least grammar-check the blog posts before publishing, sheesh.
|Sitemaps are a waste of time. They won't help you with indexing. |
They aren't supposed to help you with indexing.
They're supposed to help you with CRAWLING, by exposing pages that the spider might not be able to find on its own. Giving them a list of every page on your site doesn't push those pages higher in the crawl queue or give them more weight in the index; it just means the crawler knows where they are if it decides it wants them.
That was before headless browsers and today's easy ability to render a site exactly as a human sees it, which eliminates most of that guesswork.
I've never needed a sitemap, as I never use wacky menuing schemes or over-the-top site architecture that confuses the crawler. Everything is crawled without issue.
Like the post above, I wonder about the true value of sitemaps (I've never had to use one). However, IF a site is very large (millions of URLs), a sitemap listing the 1,000 ESSENTIAL URLs might have value in directing crawlers more accurately.
|Smart sitemap downloading checks that and does not hammer sites by downloading the same damned sitemaps over and over again. |
Remember, this is bing-- the same people who request robots.txt 35 times in a row, who request pages that were 410'd in 2006 not just a few times a year but week in and week out, who never give up hope that a page 301'd in 2011 might yet recur at its old URL, who can be trusted to find a typo in a link three minutes after you posted it and two minutes before you correct it...
But that's all about crawling. It's got nothing to do with their algorithm.
|Everyone uses Google. No one uses Bing. |
Therefore, anyone who is not Google should put up their shutters, close up shop and go home?
Query: Under what conceivable circumstances would someone have two and a half billion pages, each with unique content that is best discovered by using a general-purpose search engine covering the entire Internet? ("I'm looking for a page devoted exclusively to white furry single-use left-handed size-12 metric-calibrated Brazilian-made three-pronged widgets. If I wanted a white furry single-use left-handed size-14 metric-calibrated Brazilian-made three-pronged widget, I'd say so.") I don't believe even Amazon has a billion unique pages.
Some sites are quite deep and could have that many pages. My own has the hosting history of gTLD domains back to 2000 and the stats for over 5.5 million hosters. It would have about 400 million pages on the domain name pages alone. Amazon has pages for books in print, eBooks and books out of print. It also has product pages so it is possible that it is that deep. Facebook would also, theoretically, have large numbers of pages.
|I don't believe even Amazon has a billion unique pages. |
Bing does seem to have problems. The thing about large websites and sitemaps is that site operators do tend to work on their sitemaps to ensure that only the most recently changed sitemaps are updated. However, when search engines ignore the sitemaps and the lastmod fields, it causes unnecessary downloads and increases costs for the site. But that doesn't seem to matter to Bing.
I nuked about 8 msgs in this thread.
Ok, let's leave the flaming and BS for another time-n-place. This thread is about Bing's new sitemap post.
This is where I think that the blog post is wrong about the use of sitemaps by large websites.
Operators of large websites typically have a well-defined sitemaps strategy that prioritises changed content and additions over unchanged sitemaps. The sitemap index files are used, in effect, to signal to search engines which sitemaps have changed. Thus, after the initial download of sitemaps, the search engine only needs to download and process the changed sitemaps.
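That changed-sitemaps check is simple to implement. Here's a rough sketch (the function name and the assumption that the index XML has already been fetched are mine, not from any search engine's actual code): compare each sitemap's lastmod in the index against the time of the last crawl, and re-download only the ones that have changed.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree

# Namespace defined by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_sitemaps(index_xml, last_crawl):
    """Return sitemap URLs whose <lastmod> is newer than last_crawl,
    so only those files need to be re-downloaded."""
    changed = []
    for entry in ElementTree.fromstring(index_xml).findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            changed.append(loc)  # no lastmod hint: re-fetch to be safe
            continue
        modified = datetime.fromisoformat(lastmod)
        if modified.tzinfo is None:  # date-only values parse as naive
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            changed.append(loc)
    return changed
```

One pass over a small index file replaces re-downloading gigabytes of unchanged sitemap XML, which is exactly the saving the lastmod field exists to provide.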
|The main problem with extra-large sitemaps is that search engines are often not able to discover all links in them as it takes time to download all these sitemaps each day. |
A Social Science number. Sounds impressive but it is not based on reality. When you have a large site with large numbers of sitemaps, you count bytes.
|Search engines cannot download thousands of sitemaps in a few seconds or minutes to avoid over crawling web sites; the total size of sitemap XML files can reach more than 100 Giga-Bytes. |
These are the important things in a sitemap file:
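Something like this (the URL and dates are placeholders), per the sitemap protocol:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- the page's URL (required) -->
    <loc>https://www.example.com/widgets/page-1.html</loc>
    <!-- when the page last changed (optional) -->
    <lastmod>2013-04-01</lastmod>
    <!-- how often it's likely to change; a hint, not a demand (optional) -->
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```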
|Between the time we download the sitemaps index file to discover sitemaps files URLs, and the time we downloaded these sitemap files, these sitemaps may have expired or be over-written. |
They tell the search engine when the sitemap was last updated and when to check again for updates. Both are optional, and changefreq is a hint rather than a demand. With a large site, there is a reliance on sitemap index files, and these prioritise the use of lastmod. That means they already tell the search engine which sitemap files have changed, so all the search engine has to do is hit the sitemap index file to find out which file(s) to download. Unless the search engine has completely banjaxed its parsing and misunderstood the sitemap protocol, this works well for the site owner and the search engine.
This is what the important data in a sitemap index file looks like:
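For example (hypothetical URLs and timestamps), each entry carries a loc and a lastmod:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/sitemap-0001.xml.gz</loc>
    <lastmod>2013-04-01T09:30:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/sitemap-0002.xml.gz</loc>
    <lastmod>2012-11-15T17:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

A search engine that respects the lastmod values here only needs to re-download sitemap-0001, not both files.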
The whole concept of lastmod time is crucial for large websites.
|Additionally search engines don’t download sitemaps at specific time of the day; they are so often not in sync with web sites sitemaps generation process. |
Huh? One of the key aspects of building large sitemaps is to split the pages into sitemap files and maintain them in those files. That way there is continuity, and new content gets included in new files where necessary. The sitemaps grow in sync with the architecture of the website.
|Having fixed names for sitemaps files does not often solve the issue as files, and so URLs listed, can be overwritten during the download process. |
The Bing sitemap post is not a reliable guide to sitemap practices for large websites and it is quite wrong in critical places. It would be better for people to read and understand the sitemap protocol rather than relying on Bing's sitemap post.
And the cluelessness continues with Bing hammering away at sitemaps almost daily even though the lastmod has not been updated. The sooner Bing gets a clue, the sooner webmasters will take it seriously as a competitor to Google.