Bing Search Engine News Forum

    
Bing Publishes Sitemaps Best Practice, Including Large Sites
engine




msg:4678724
 11:47 am on Jun 10, 2014 (gmt 0)

That's a useful reminder for those that know, and an education for those creating their first sitemap.

Interestingly some sites these days, are large… really large… with millions to billions of URLs. Sitemap index files or sitemap files can link up to 50,000 links, so with one sitemap index file, you can list 50,000 x 50,000 links = 2,500,000,000 links. If you have more than 2.5 Billion links… think first if you really need so many links on your site. In general search engines will not crawl and index all of that. It’s highly preferable that you link only to the most relevant web pages to make sure that at least these relevant web pages are discovered, crawled and indexed. Just in case, if you have more than 2.5 billion links, you can use 2 sitemap index files, or you can use a sitemap index file linking to sitemap index files offering now up to 125 trillion links: so far that’s still definitely more than the number of fake profiles on some social sites, so you’ll be covered.

Bing Publishes Sitemaps Best Practice, Including Large Sites [blogs.bing.com]
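
For anyone building one of these for the first time, here's a rough Python sketch of how the 50,000 limit quoted above plays out: chunk the URL list into sitemap files of at most 50,000 entries each and list those files in a single sitemap index, which is where the 50,000 x 50,000 = 2.5 billion ceiling comes from. The file names and example.com domain are made up, and this is not Bing's or anyone's production tooling.

# Minimal sketch: split a large URL list into sitemap files of up to
# 50,000 URLs each and reference them from one sitemap index.
# File names and example.com are placeholders; real code should also
# XML-escape the URLs.
import gzip
from datetime import date

MAX_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol
XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def write_sitemaps(urls, base="https://www.example.com"):
    """Write gzipped sitemap files; return the sitemap index XML as a string."""
    index_entries = []
    for i in range(0, len(urls), MAX_PER_SITEMAP):
        name = f"sitemap_{i // MAX_PER_SITEMAP + 1:05d}.xml.gz"
        rows = [f"  <url><loc>{u}</loc></url>" for u in urls[i:i + MAX_PER_SITEMAP]]
        with gzip.open(name, "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                    f'<urlset xmlns="{XMLNS}">\n' + "\n".join(rows) + "\n</urlset>\n")
        index_entries.append(f"  <sitemap><loc>{base}/{name}</loc>"
                             f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<sitemapindex xmlns="{XMLNS}">\n'
            + "\n".join(index_entries) + "\n</sitemapindex>\n")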

 

incrediBILL




msg:4678811
 6:18 pm on Jun 10, 2014 (gmt 0)

Best practices if you want to enable a sitemaps


They could at least grammar check the blog posts before publishing, sheesh.

Sitemaps are a waste of time. They won't help you with indexing.


They aren't supposed to help you with indexing.

They're supposed to help you with CRAWLING, to expose pages that the spider might not be able to find on its own. Just because you give them a list of every page on your site doesn't make those pages any higher in the crawl queue or give them more weight in the index; it just means they now know where the pages are in the event the crawler decides it wants them.

I think it was originally a hack to get around issues with menus in JavaScript, AJAX data, etc., so that you could tell the spider where everything is located and it doesn't miss anything trying to guess.

This was before headless browsers and the current easy ability to access a site exactly as a human sees it, which eliminates all that guesswork.

I've never needed a sitemap as I never use wacky menuing schemes or over the top site architecture that confuses the crawler. Everything is crawled without issue.

tangor




msg:4678987
 5:28 am on Jun 11, 2014 (gmt 0)

Like the post above, I wonder about the true value of sitemaps (never had to use one). However, IF one has a very large site (millions of URLs), a sitemap of the 1,000 ESSENTIAL URLs might have value in directing crawlers more accurately.

lucy24




msg:4678993
 6:15 am on Jun 11, 2014 (gmt 0)

Smart sitemap downloading checks that and does not hammer sites by downloading the same damned sitemaps over and over again.

Remember, this is bing-- the same people who request robots.txt 35 times in a row, who request pages that were 410'd in 2006 not just a few times a year but week in and week out, who never give up hope that a page 301'd in 2011 might yet recur at its old URL, who can be trusted to find a typo in a link three minutes after you posted it and two minutes before you correct it...

But that's all about crawling. It's got nothing to do with their algorithm.

Everyone uses Google. No one uses Bing.

Therefore, anyone who is not google should put up their shutters, close up shop and go home?

Query: Under what conceivable circumstances would someone have two and a half billion pages, each with unique content that is best discovered by using a general-purpose search engine covering the entire Internet? ("I'm looking for a page devoted exclusively to white furry single-use left-handed size-12 metric-calibrated Brazilian-made three-pronged widgets. If I wanted a white furry single-use left-handed size-14 metric-calibrated Brazilian-made three-pronged widget, I'd say so.") I don't believe even Amazon has a billion unique pages.

jmccormac




msg:4679021
 9:15 am on Jun 11, 2014 (gmt 0)

I don't believe even Amazon has a billion unique pages.
Some sites are quite deep and could have that many pages. My own has the hosting history of gTLD domains back to 2000 and the stats for over 5.5 million hosters. It would have about 400 million domain name pages alone. Amazon has pages for books in print, eBooks and books out of print. It also has product pages, so it is possible that it is that deep. Facebook would also, theoretically, have large numbers of pages.

Bing does seem to have problems. The thing about large websites and sitemaps is that site operators do tend to work on the sitemaps issue to ensure that only the most recently changed sitemaps are updated. However, when search engines ignore the sitemaps and their lastmod fields, it causes unnecessary downloads and increases costs for the site. But that doesn't seem to matter to Bing.

Regards...jmcc

Brett_Tabke




msg:4679036
 10:40 am on Jun 11, 2014 (gmt 0)

I nuked about 8 msgs in this thread.

Ok, let's leave the flaming and BS for another time-n-place. This thread is about Bing's new sitemap post.

jmccormac




msg:4679051
 11:09 am on Jun 11, 2014 (gmt 0)

This is where I think that the blog post is wrong about the use of sitemaps by large websites.
The main problem with extra-large sitemaps is that search engines are often not able to discover all links in them as it takes time to download all these sitemaps each day.
Operators of large websites typically have a well-defined sitemaps strategy that prioritises changed content and additions over unchanged sitemaps. The sitemap index files are used, in effect, to signal to search engines which sitemaps have changed. Thus, after the initial download of the sitemaps, the search engine only needs to download and process the changed sitemaps.

Search engines cannot download thousands of sitemaps in a few seconds or minutes to avoid over crawling web sites; the total size of sitemap XML files can reach more than 100 Giga-Bytes.
A Social Science number. Sounds impressive but it is not based on reality. When you have a large site with large numbers of sitemaps, you count bytes.

Between the time we download the sitemaps index file to discover sitemaps files URLs, and the time we downloaded these sitemap files, these sitemaps may have expired or be over-written.
These are the important things in a sitemap file:

<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>

[sitemaps.org...]

They tell the search engine when the sitemap was last updated and when to check again for updates. Both are optional, and the changefreq is a hint rather than a demand. With a large site, there is a reliance on sitemap index files, and they prioritise the use of lastmod. That means that they already tell the search engine which sitemap files have changed, so that all the search engine has to do is hit the sitemap index file to find out which file(s) to download. Unless the search engine has completely banjaxed its parsing and misunderstood the sitemap protocol, this works well for the site owner and the search engine.

This is what the important data in a sitemap index file looks like:

<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
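
To make that workflow concrete, here is a hypothetical crawler-side sketch (in Python, with a made-up index URL; this is not how Bing actually crawls): fetch only the sitemap index, compare each lastmod against the time of the previous visit, and download just the sitemap files that have changed since then.

# Hypothetical crawler-side sketch of the workflow described above:
# fetch the sitemap index, then download only the sitemap files whose
# <lastmod> is newer than the last visit.
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_sitemaps(index_url, last_visit):
    """Return sitemap URLs from the index whose lastmod is after last_visit."""
    with urllib.request.urlopen(index_url) as resp:
        root = ET.fromstring(resp.read())
    changed = []
    for entry in root.findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        # No lastmod means we cannot tell, so fetch it to be safe.
        # Assumes timezone-aware ISO timestamps like the example above.
        if lastmod is None or datetime.fromisoformat(lastmod) > last_visit:
            changed.append(loc)
    return changed

# Example (hypothetical URL): fetch only sitemaps touched since the last crawl.
# to_fetch = changed_sitemaps("https://www.example.com/sitemap_index.xml",
#                             datetime(2014, 6, 10, tzinfo=timezone.utc))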


Additionally search engines don’t download sitemaps at specific time of the day; they are so often not in sync with web sites sitemaps generation process.
The whole concept of lastmod time is crucial for large websites.

Having fixed names for sitemaps files does not often solve the issue as files, and so URLs listed, can be overwritten during the download process.
Huh? One of the key aspects of building large sitemaps is to split the pages into sitemap files and maintain them in those files. That way there is continuity, and new content gets included in new files where necessary. The sitemaps grow in sync with the architecture of the website.
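
One possible way to read that "fixed names plus continuity" point, as a hypothetical sketch (made-up names, assuming URLs have stable numeric IDs): each URL always lives in the same bucket file, so existing files are never overwritten by new additions, and only the files containing changed URLs get regenerated and a fresh lastmod in the index.

# Hypothetical sketch of the continuity scheme described above: assign each
# URL to a fixed sitemap file by its stable numeric ID, so existing files
# keep their names and content, and new URLs simply spill into new files.
MAX_PER_SITEMAP = 50_000

def sitemap_file_for(url_id):
    """IDs 1-50,000 -> sitemap_00001.xml.gz, 50,001-100,000 -> sitemap_00002, ..."""
    return f"sitemap_{(url_id - 1) // MAX_PER_SITEMAP + 1:05d}.xml.gz"

def files_to_regenerate(changed_ids):
    """Only buckets containing changed or new URLs need rewriting; only those
    entries get a new <lastmod> in the sitemap index."""
    return sorted({sitemap_file_for(i) for i in changed_ids})

# Example: three changed URLs touch just two sitemap files.
# files_to_regenerate([7, 49_999, 50_001])
#   -> ['sitemap_00001.xml.gz', 'sitemap_00002.xml.gz']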

The Bing sitemap post is not a reliable guide to sitemap practices for large websites and it is quite wrong in critical places. It would be better for people to read and understand the sitemap protocol rather than relying on Bing's sitemap post.

Regards...jmcc

jmccormac




msg:4681707
 7:02 pm on Jun 21, 2014 (gmt 0)

And the cluelessness continues with Bing hammering away at sitemaps almost daily even though the lastmod has not been updated. The sooner Bing gets a clue, the sooner webmasters will take it seriously as a competitor to Google.

Regards...jmcc
