|How do huge sites get such complete index coverage?|
| 1:01 am on Apr 5, 2012 (gmt 0)|
Is there a preferred method for getting truly huge sites, in terms of page volume, fully indexed and up-to-date? How do the big boys do it (IMDB, Facebook, Reddit, StackOverflow, et al)? Are we presuming some sort of relationship with Google? Some special higher-than-10-per-second crawl rate? Or some other technique?
One thing I considered with, say, StackOverflow is that the StackExchange network has many different sites running essentially the same product - and they just wisely broke up the site among different URLs that Google might crawl as though they were independent and at their own respective rates.
Any thoughts or resources?
| 2:13 am on Apr 5, 2012 (gmt 0)|
The main asset that the big boys have is incredible user engagement. That includes their natural link profile, yes, but user engagement is a lot more than that. It's not an easy thing to build quickly, but it really is what a business needs to generate to not only get indexed but to have those pages get search traffic.
| 11:01 am on Apr 5, 2012 (gmt 0)|
I do SEO for such huge sites. Here is my take:
- You don't need a special relationship with Google. I haven't worked on a big site yet that has one other than a large AdWords budget (and that doesn't get you beans in terms of SEO).
- If you have millions of pages, Googlebot can do a lot of crawling. On a site with tens of millions of pages, at one point, 40% of the pages we served were going to Googlebot.
- Google seems to be moving away from PageRank as a ranking signal, but Googlebot still uses it to determine what to crawl. If you want to have a large site, you need high PageRank to get it all crawled. If you have an important page, you need to have enough internal links to it that it gets recrawled frequently.
- Googlebot has at least two crawl modes. "Freshbot" will greedily crawl all new pages. I experimented with creating chains of pages starting from a PR 5 page (each page links to the next and so on). In this mode, Googlebot may crawl a chain thousands of pages deep.
- "Recrawlbot" will come back and recrawl pages with a frequency based on pagerank. A PR 7 page may be crawled hourly. A PR6 page may be crawled twice a day. A PR 5 page may be crawled every day. A PR 4 page every couple days. A PR 3 page every week. A PR 2 page every two weeks.
- Sitemaps can help, but use internal links to highlight your best content.
- If you are dealing with user generated content in large volumes you will have to separate your good content from your bad content. If you highlight all your content, even the poor quality stuff, you won't have a very compelling site. Wikipedia spends a lot of effort interlinking their articles such that every good article has many links from other articles. Stack overflow has a robust voting system such that important questions filter up to the top.
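To put rough numbers on the recrawl frequencies in the list above, here is a minimal Python sketch. The intervals are only the anecdotal figures quoted in this post, not documented Googlebot behaviour, and the `site` page counts are entirely hypothetical.

```python
# Recrawl intervals in days by toolbar PageRank, taken from the anecdotal
# figures quoted above. They are one poster's observations, not a documented rule.
RECRAWL_INTERVAL_DAYS = {
    7: 1 / 24,  # roughly hourly
    6: 0.5,     # twice a day
    5: 1,       # daily
    4: 2,       # every couple of days
    3: 7,       # weekly
    2: 14,      # every two weeks
}


def estimated_recrawls_per_day(pages_by_pr):
    """Sum the expected daily recrawl fetches for a {pagerank: page_count} mapping."""
    return sum(
        count / RECRAWL_INTERVAL_DAYS[pr] for pr, count in pages_by_pr.items()
    )


# Hypothetical site: a few strong pages and a long tail of weak ones.
site = {7: 1, 6: 10, 5: 100, 4: 1_000, 3: 10_000, 2: 100_000}

print(f"Estimated recrawl fetches per day: {estimated_recrawls_per_day(site):.0f}")
```

Under these assumptions nearly all of the recrawl volume comes from the long tail of low-PR pages, even though each of them is visited rarely.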
| 2:05 pm on Apr 5, 2012 (gmt 0)|
In my experience the crawl rate is also heavily influenced by how frequently pages are updated or new pages are added to a site.
Googlebot can crawl a massive number of pages, though. I've seen it get up to 130,000 a day. But like any site, if the page authority is low and/or the page isn't updated very often, Googlebot is unlikely to index it very often.
| 3:04 pm on Apr 5, 2012 (gmt 0)|
I got the impression the OP was asking how they do it physically.
:: business with calculator ::
1 page/second = 86400 pages a day. So if you've got millions of pages, do you let them crawl like mad, as fast as they like, or resign yourself to some pages being several weeks out of date?
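The same calculator exercise as a small Python sketch. The 10-million-page site size is a hypothetical figure; the three rates are the ones mentioned in this thread (1 per second, the 10 per second from the opening post, and the 130,000 a day seen above), and a constant crawl rate is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def days_to_crawl(total_pages, pages_per_day):
    """Days needed for one full pass over the site at a constant crawl rate."""
    return total_pages / pages_per_day


site_pages = 10_000_000  # hypothetical site size

rates = {
    "1 page/second": 1 * SECONDS_PER_DAY,
    "10 pages/second": 10 * SECONDS_PER_DAY,
    "130,000 pages/day": 130_000,
}

for label, per_day in rates.items():
    print(f"{label}: {days_to_crawl(site_pages, per_day):.1f} days")
```

At one page per second a single pass over 10 million pages takes about 116 days, so without a faster crawl some pages would be months out of date, not weeks.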
| 3:17 pm on Apr 5, 2012 (gmt 0)|
I've also worked on a number of larger sites, and would second deadsea that this isn't about a special relationship with Google. It's primarily about scale. Good sites with millions of pages also have a huge number of entry points - so they are being crawled from a number of different angles all at once. As long as internal link structure is reasonable, this can result in incredible amounts of crawler activity, and extensive coverage of pages.
Sites with a large userbase can also generate huge amounts of activity around new content very quickly, across all the mechanisms Google uses to find content - they have everything from new links and social mentions to toolbar activity.
But don't believe these sites are immune to the problems that smaller sites see. They're still agonising over analytics and monitoring indexing etc. etc. Typically, a large amount of the content they produce doesn't deliver any real search engine "return" because it doesn't get the engagement it needs - and no amount of users, links or social activity will get everything to rank. They just have different scales of problems ;)
| 5:38 pm on Apr 5, 2012 (gmt 0)|
|deadsea: If you are dealing with user generated content in large volumes you will have to separate your good content from your bad content. If you highlight all your content, even the poor quality stuff, you won't have a very compelling site. Wikipedia spends a lot of effort interlinking their articles such that every good article has many links from other articles. Stack overflow has a robust voting system such that important questions filter up to the top. |
That last point seems pretty central to me - coming to some sort of crowdsourced determination of the best content (which, of course, only works when the content is worthwhile and the product is engaging, as tedster pointed out).
|Andy Langton: As long as internal link structure is reasonable, this can result in incredible amounts of crawler activity, and extensive coverage of pages. |
I feel as though I frequently struggle with this concept. What would constitute an "unreasonable" internal link structure?
Thanks for all the help, guys! My career would look very different if not for the great folks here.
| 6:23 pm on Apr 5, 2012 (gmt 0)|
|What would constitute an "unreasonable" internal link structure? |
The value of external links is distributed from each entry point - and no site of any size has external links to every page.
So, the "job" of a good link structure is to distribute the available benefit in a way which:
- Reflects the relative importance of each page (there's no sense throwing as much weight at a lowly review page as at a top level category)
- Ensures each page receives "enough" benefit. "Enough" depending on the role of the pages. As far as Google is concerned, this means "enough to rank" - which might be as simple as not getting binned as low quality.
The two most common problems I've seen with link structure reflect those two points. Or, to put it another way, unreasonable link hierarchies can:
- Waste link juice on lowly pages that don't need that weight. This results in lower rankings, especially for competitive areas.
- Fail to provide enough links, or leave too little link juice by the time they get to longer tail content. This results in pages being regarded as of low quality.
To "visualise" somewhat, there are plenty of smaller sites that don't have well designed navigation, and rely on a large HTML (or even XML) sitemap. These sites tend to have a handful of very strong pages, linked to frequently, and then the rest are very low value, since all their juice arrives from a sitemap. This also occurs on larger sites, but tends to be a series of smaller "sitemaps" or pages with large numbers of links to content that isn't linked elsewhere. I sometimes call this the "running out of menu" problem ;)
Another example would be sites with huge drop down menus, that link to everything from everywhere. This results in equal weight being given to pages at the top and bottom of the hierarchy - when the distribution should instead follow the relative importance of each page. This is the "too much menu" problem ;)
Information hierarchy as it applies to internal links and navigation isn't the simplest area to get your head around, especially when lots of pages are involved. Hope that helps a little though!
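For anyone who wants to play with the "too much menu" idea, here is a toy Python sketch. It uses the textbook PageRank iteration with a 0.85 damping factor as a stand-in for "link juice"; that is an assumption for illustration only, since nobody outside Google knows how internal weight is really calculated. The seven-page site and its page names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Every page must have at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)  # split weight evenly
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical hierarchy: home -> categories -> products, with links back up.
hierarchical = {
    "home": ["cat_a", "cat_b"],
    "cat_a": ["home", "a1", "a2"],
    "cat_b": ["home", "b1", "b2"],
    "a1": ["home", "cat_a"],
    "a2": ["home", "cat_a"],
    "b1": ["home", "cat_b"],
    "b2": ["home", "cat_b"],
}

# "Too much menu": every page links to every other page.
mega_menu = {p: [q for q in hierarchical if q != p] for p in hierarchical}

for name, graph in (("hierarchical", hierarchical), ("mega menu", mega_menu)):
    ranks = pagerank(graph)
    print(name)
    for page in ("home", "cat_a", "a1"):
        print(f"  {page}: {ranks[page]:.3f}")
```

In this toy model the hierarchy gives the home page the most weight, categories less and products least, while the mega menu flattens every page to the same value, which is the equal-weight outcome described above.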
| 6:32 pm on Apr 5, 2012 (gmt 0)|
If your site has 10 million pages and pagerank 7 then:
Googlebot will have crawled every single page on your site at least once.
Google will index several hundred thousand pages.
100,000 pages would get at least one referral from google search in any given month.
A few thousand pages might consistently get referrals from google every day.
Even the really big sites may not have as many pages indexed as you might think.
| 7:30 pm on Apr 5, 2012 (gmt 0)|
| 10:57 pm on Apr 5, 2012 (gmt 0)|
|The value of external links is distributed from each entry point - and no site of any size has external links to every page. |
This is a key point and is worth emphasizing. The nature of the site affects what those entry points are likely to be. The contextual linking that you see in Wikipedia or the New York Times, e.g., works only because the articles that link out contextually are themselves entry points. They attract external links and thus have incoming link juice to redistribute.
One needs to be careful not to redistribute the link juice indiscriminately. In a large site, prioritization is essential. A large site generally needs multiple entry points and multiple types of prioritization to reach all important pages, and navigation should consider where the link juice is flowing from each entry point.
On sites where the home page is the most natural entry point, top-down prioritization is a core structural consideration. In a top-down hierarchical structure, the trick is to make the site neither too flat nor too narrow. Too many deep links to individual pages or subcategories from home can often siphon off the link juice better left for major categories. Such a structure also makes prioritization impossible. In my mind's eye, a well-designed navigation structure looks like an inverted tree, or a group of related inverted trees, with the link juice flowing along branches that get thinner as they "branch out" from the "root". You don't want to send too much or too little link juice to any one page.
All too often, I see smaller ecommerce sites randomly linking to a large number of product pages or subcategories from home, without consideration of how that juice is being distributed and of how much juice is needed for ranking. Considering (for discussion purposes) top-down linking only... what becomes a very difficult balancing act is that, in a hierarchical structure, the more link emphasis you provide directly from home (or from higher levels) to individual pages or categories, the less link juice you have left over for all the rest of your nav structure. If you have too many links from home, or from any entry page, you will in effect need to link to everything from home, and on a large site that simply doesn't scale.
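The flat-versus-narrow trade-off can be sketched with a little arithmetic in Python. This assumes an idealised site where every page links down to the same number of child pages and splits its weight evenly among them; real sites are messier, and the 10-million-page figure is hypothetical.

```python
def levels_needed(total_pages, links_per_page):
    """Smallest depth at which a uniform top-down tree can hold total_pages."""
    if links_per_page < 1:
        raise ValueError("links_per_page must be at least 1")
    reachable = 1   # the home page sits at depth 0
    level_size = 1
    depth = 0
    while reachable < total_pages:
        depth += 1
        level_size *= links_per_page
        reachable += level_size
    return depth


site_pages = 10_000_000  # hypothetical site size

for links in (10, 100, 1000):
    depth = levels_needed(site_pages, links)
    share = 1 / links  # fraction of a page's weight passed through each link
    print(f"{links} links per page: {depth} levels deep, "
          f"each link carries {share:.3%} of the page's weight")
```

With 10 links per page the site has to be 7 levels deep, with 100 it needs 4, and with 1000 it needs 3. Fewer levels means each individual link carries a much thinner slice, which is the balancing act described above.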