| This 55 message thread spans 2 pages: < < 55 ( 1  ) || |
|For Mega-Site SEO, Structure is King - not Content|
Interesting WebProNews interview with Derrick Wheeler, in-house SEO for Microsoft. There's both a video and an article. [webpronews.com...]
I appreciate that many members here may not be able to relate to a mega-site, where changing one process can remove a couple million active URLs and make them 404. But sometimes blowing things up really big (considering SEO for microsoft.com can certainly do that) reveals details that are helpful for anyone.
"With mega SEO it's about making small strides over time that [when] grouped together they have a really big impact."
I've worked with a number of large sites - though not quite on the scale that microsoft.com deals with. I thought it was very interesting that even Microsoft has some awareness ticking away in the back of their mind about what may or may not trigger a Google penalty :)
Recursive semantic reinforcement, taxonomic siloing, with deep-level inline cross-silo interlinking. Think about it.
Also, there is a danger of thinking megasites are just normal sites, writ large. Nothing could be further from the truth.
microsoft.com does lots of different things. Lots. It has no focus; it is a site designed to meet particular disparate needs. Holistic approaches are nonsensical.
Want an example of an Internet behemoth, integral to every aspect of your online experience, a multi-billion dollar entity with huge cash reserves? An employer of tens of thousands worldwide, who was there at the beginning, but has singularly the WORST site I have ever seen, in terms of SEO. A manufacturer, a service provider, synonymous with Trust and Authority, whom you would assume could only achieve their dire web presence through active deoptimisation and wanton disregard for usability. A shining example of disjointed thinking, ad hoc development and labyrinthine navigation. I give you...
|disjointed thinking, ad hoc development and labyrinthine navigation |
Exactly the challenge. I recently helped just one business unit of a mega-site combine 42 disconnected pieces into one "unit" that is hosted on one subdomain. Two years of work on that - and it still doesn't line up with any ideal. It is definitely improved, however, and the new structure and technical infrastructure is now more extensible as the business evolves.
The big challenge is that major enterprise website teams are not agile enough - they cannot quickly reflect new business developments and market needs. And so their sites just grow like kudzu vines instead of a well tended garden.
Top management usually does NOT consider the website with the same intensity that they would consider physical world properties. There's the need.
Are there really that many differences when dealing with 1,000 documents or 1,000,000,000?
The solutions at the 1000 level are not much different than those at the 1000000000 level. Or are they? I think the same concepts apply to any number of documents.
|A place for everything, everything in its place. |
For me, I focus mainly on the "final destination" documents. Anything in-between is subject to noindex or noindex, nofollow. Using just noindex allows me to direct the flow of equity between entry and final destination. I want the bot to follow all internal links (where applicable) but I only want a specific path indexed to the final destination.
Think of it like a map. You have your "shortest route" and that is the way I view this. There may be plenty of "other routes" that lead you to the same destination and those are "alternatives" to the "shortest route". But, the shortest is always the one that leads the bot to the final destination hence the use of noindex in many areas.
Typically most efficient structures are going to be somewhat shallow and very broad horizontally. Each entrance into a horizontal structure stands on its own. Think of it like having thousands of websites that all make up the "pyramid". Not only do you go horizontally, you also travel upwards and work at the host name level.
This is not a one size fits all solution although the basic concepts are the same. I'm looking at it from an ecommerce perspective and for me, it all comes down to how you manage the "equity" within the site. Heck, if you've got it just right, you can put a link on one of the top level horizontal (or host name) documents and that would probably have just as much, if not more importance than most external links.
<meta name="robots" content="noarchive">
The above is mandatory for all sites that we do. We typically serve it as an X-Robots-Tag in the server headers.
<meta name="robots" content="noindex">
<meta name="robots" content="noindex, nofollow">
The above are used judiciously throughout all the documents we work with. Most of the documents have default indexing directives. Some are controlled by the user based on the content being published.
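For anyone who wants to see the default-directive idea spelled out, here is a minimal Python sketch. The role names (`final`, `intermediate`, `blocked`) and the helper functions are assumptions made up for illustration, not the poster's actual system. Every document gets `noarchive`, in-between documents add `noindex`, and paths that should not pass anything on add `noindex, nofollow`.

```python
# Hypothetical helper: map a document's role to its robots directives.
# The role names are assumptions for illustration only.
def robots_directives(role: str) -> str:
    directives = ["noarchive"]  # applied to every document
    if role == "intermediate":
        # Keep the page out of the index but let bots follow its links.
        directives.append("noindex")
    elif role == "blocked":
        directives += ["noindex", "nofollow"]
    elif role != "final":
        raise ValueError(f"unknown role: {role}")
    return ", ".join(directives)


def as_header(role: str) -> tuple[str, str]:
    """Return the directives as an X-Robots-Tag response header pair."""
    return ("X-Robots-Tag", robots_directives(role))


def as_meta(role: str) -> str:
    """Return the same directives as a robots meta element."""
    return f'<meta name="robots" content="{robots_directives(role)}">'
```

Calling `as_header("intermediate")` would give the header pair to attach to a response, while `as_meta` covers templates where headers cannot be set.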
This is all "basic stuff" too. There's much more at the "leeching level" that needs to be addressed, that's one thing I guess stands out between the 1,000 and 1,000,000,000 levels. You really have to control bot access to the site. Error reporting routines become much more robust at the larger document levels. But these are things you also need to worry about if you only have 100 documents. You need to protect and control access to everything. Ask IncrediBILL, he's the Master of grabbing a bot by the balls and giving them instructions on what they can and cannot do. :)
|<meta name="robots" content="noarchive"> |
The above is mandatory for all sites that we do.
Structure rejiggering has made a huge difference to me. I've seen it impact rate of crawl (this I believe) and ranking.
|Are there really that many differences when dealing with 1,000 documents or 1,000,000,000? |
Maybe not in theory - if you assume that somewhere someone in an enterprise starts out by saying "we've got these 3.87 million documents to organize and publish on our website." However, mega sites don't often launch knowing what the total will be, or even what the Information Architecture needs to be to support future needs.
Many times there are hundreds or thousands of stakeholders in "the website" and their particular needs can be all over the map. They may not even use the same CMS and no one is going to GET them to use a unified CMS very easily, because that project will have at least a 7 or 8 figure price tag. The business systems are so large and complex that unforeseen consequences are dripping from every decision or proposal.
And then there's training and employee churn - many times a web team finally gets up to speed only to be dismantled by personnel changes. This is another area where management can severely undervalue the potency of a web presence. It just seems too intangible a lot of the time.
So in a way you're right that the final goals are not so different. But the steps to get to those goals are quite challenging for the enterprise, whereas the entrepreneur or smaller team can just sit down and work out the changes over a couple of weeks or months.
A site that seems to have worked well since inception, with around 40 million pages / URLs, is the BBC [google.co.uk...]
This site also shows that you don't need to have TBPR to rank your articles if your brand and architecture are strong and consistent. Having said that, it probably attracts a lot of links, but certainly not to every nook and cranny within the structure.
There's an interesting note here on "mothballing" which indicates how large sites can deal with legacy content and data. [bbc.co.uk...]
It would be an interesting case study to compare the BBC with Microsoft to demonstrate where the challenges could be more easily aligned.
[edited by: Whitey at 11:31 pm (utc) on Dec 6, 2010]
|<meta name="robots" content="noarchive"> The above is mandatory for all sites that we do. > Why? |
Because I think it is one more area that, if left unchecked, can create fragmentation of your content, not to mention a host of other issues that "I" think occur on a regular basis. There's no need for anyone to have access to a potentially older version of a web document. And the argument that "if the site is down" doesn't hold much water these days. If your site is down so often that you need to rely on cache, then your woes go much deeper.
Take away cache and Archive.org and you've eliminated two major points of untethered access. Take control in those areas and add a robust bot blocking routine and you've taken a major step forward. It's a matter of protecting your assets.
|Are there really that many differences when dealing with 1,000 documents or 1,000,000,000 |
p1r, sounds like you've never tried to move a billion documents before. Trust me, there is a big difference. My site caching system creates a fully-functional HTML file for every page. Trying to move these files will choke a server for hours, trying to recreate them can take weeks.
How about checking for link-rot on 1,000 pages vs. 1,000,000,000 pages? Most methods which work on a 1,000 page site would take a full year on a billion page site.
Analytics only allows ads on 20,000 pages to be tracked. That's only 0.002% of a billion page site.
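To put rough numbers on the two claims above, here is a small Python sketch. The one-check-per-second rate is purely an assumed figure for illustration. At that rate 1,000 pages finish in under 17 minutes, a billion pages would take around 31.7 years, and finishing a billion within a year needs roughly 32 checks per second, sustained.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_check(pages: int, checks_per_second: float) -> float:
    """Days needed to check every page once at a steady rate."""
    return pages / checks_per_second / SECONDS_PER_DAY


def rate_for_deadline(pages: int, days: float) -> float:
    """Checks per second needed to cover every page within `days`."""
    return pages / (days * SECONDS_PER_DAY)


def tracked_share_percent(tracked: int, total: int) -> float:
    """Share of pages covered by a tracking limit, as a percentage."""
    return 100 * tracked / total


if __name__ == "__main__":
    print(days_to_check(1_000, 1.0))                     # small site, in days
    print(days_to_check(1_000_000_000, 1.0) / 365)       # mega site, in years
    print(rate_for_deadline(1_000_000_000, 365))         # checks per second
    print(tracked_share_percent(20_000, 1_000_000_000))  # percent tracked
```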
My guess is that most mega sites (like mine) are comprised mostly of user generated content. There are a whole host of issues that this introduces which can easily be monitored on a 1,000 page site, and not so easily on a million page site, much less a billion page site.
These are all issues which have to be taken into account on a mega site, and they are just the tip of the iceberg.
|p1r, sounds like you've never tried to move a billion documents before. |
Not even close. The most I can claim is approximately 45 million documents. It took a year of deep diving through 6 layers of programming logic.
|Trust me, there is a big difference. |
The only difference is the number of documents and the time involved to do what it is you are doing. Moving a billion documents would be a monumental task. It's the time factor that gets extended.
I would guess there are about a handful of folks around this neck of the woods who can lay claim to managing 1,000,000,000+ documents. That's a mouthful. :)
|How about checking for link-rot on 1,000 pages vs. 1,000,000,000 pages? Most methods which work on a 1,000 page site would take a full year on a billion page site. |
I would think all of that is dynamic and pretty much real time. It's not something you're going to do in large sweeping tasks, it is done proactively and on a regular basis.
|Analytics only allows ads on 20,000 pages to be tracked. That's only 0.002% of a billion page site. |
Ah, we were talking structure and taxonomy, analytics is another topic in itself.
|My guess is that most mega sites (like mine) are comprised mostly of user generated content. There are a whole host of issues that this introduces which can easily be monitored on a 1,000 page site, and not so easily on a million page site, much less a billion page site. |
I guess it all comes down to the system and the personnel who manage it. I fully understand the scalability of man/woman power in this instance. UGC is another factor that in itself requires a bit of human intervention to manage. It's all scalable, you just add zeros to the numbers. :)
From the original article...
|With mega SEO it's about making small strides over time that [when] grouped together they have a really big impact. |
If you're managing a billion documents, you are of course taking those small strides. In all the years I've been reading here at WebmasterWorld (since 1999), I don't think we've discussed the moving of 1,000,000,000 documents. Maybe millions, but not a billion. That's a very large number. :)
Thanks Robert, here is my lost post again.
>> Could you elaborate on what you would consider a "clear and unambiguous website structure"? <<
Planet13, this is a good website structure the way I see it.
First Level Page (Home Page)
The index page contains only the general description of website (or call it introduction) plus short descriptions of the most important departments (also called silos, parts, info blocks, etc.) of the website with links only to these specific departments.
www.yoursite.com/products/ | www.yoursite.com/services/ | www.yoursite.com/about-us/ | etc.
Second Level Page(s)
Each page on this level contains information describing and introducing visitors to the particular department (like line of products in this example) plus provides links only to the group of products pages which are part of this department.
www.yoursite.com/products/tools/ | www.yoursite.com/products/fastners/ | www.yoursite.com/products/lumber/ | etc.
Third Level Page(s)
Each page on this level describes groups of tools in the products department plus provides links only to the particular tools pages which are part of this department.
Website structure can be a complicated issue, so please, anyone, correct any details you think are not correct.
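The level scheme above can be checked mechanically. This is only a sketch in Python, assuming that a page's level is simply the number of path segments plus one and that each page should link only to pages exactly one level below it within its own branch. The function names are made up for illustration.

```python
from urllib.parse import urlparse


def segments(url: str) -> list[str]:
    """Split a URL's path into its non-empty segments."""
    # Accept scheme-less URLs such as "www.yoursite.com/products/".
    parsed = urlparse(url if "://" in url else "//" + url)
    return [part for part in parsed.path.split("/") if part]


def level(url: str) -> int:
    """Home page is level 1; each path segment adds one level."""
    return 1 + len(segments(url))


def links_down_one_level(source: str, target: str) -> bool:
    """True if target sits directly below source in the same branch."""
    src, dst = segments(source), segments(target)
    return len(dst) == len(src) + 1 and dst[:len(src)] == src
```

Under these assumptions `www.yoursite.com/products/` may link to `www.yoursite.com/products/tools/`, but the home page linking straight to `www.yoursite.com/products/tools/` would be flagged.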
Here's a random and somewhat unique MEGA SEO issue.
Google webmaster tools reports 160k "not found" URLs on www.microsoft.com. These 160k URLs have about 1.5 million inbound links. :)
Embarking on a large-scale 301 redirect mission that could lead nowhere or generate millions of visits. You never know for sure what the impact will be.
What is the street value of 1.5 million links?
Welcome to the forums, Derrick. Good to hear your experience with backlinks.
Backlink reclamation is usually one of the first SEO steps I take on when working with huge sites. One example I remember with shudders was a site that had 750,000 backlinks for a URL that was now 404. Seriously. Any small site SEO would break down and weep!
That was a relatively easy win, however. When the 404 backlinks point all over the place, then it can be a major challenge just to know where those URLs should redirect. It's even better if they can resolve 200 again, even if they use new content that explains why it changed. Then the internal links that appear in the template can do their "circulation" job.
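A rough Python sketch of how that triage might look, assuming you can export the 404 URLs with their inbound link counts and that someone has supplied a partial mapping of old URLs to new ones. The function name, both inputs and the sample paths in the usage note are hypothetical. It ranks dead URLs by links so the big wins come first, and separates the ones that still need a human decision about where to point.

```python
def plan_redirects(
    inbound_links: dict[str, int],
    known_moves: dict[str, str],
) -> tuple[list[tuple[str, str, int]], list[tuple[str, int]]]:
    """Rank 404 URLs by inbound links and split them into two lists.

    Returns (redirects, unresolved): redirects are (old, new, links)
    triples ready for 301 rules; unresolved are (old, links) pairs
    that still need a destination.
    """
    ranked = sorted(inbound_links.items(), key=lambda item: item[1], reverse=True)
    redirects, unresolved = [], []
    for url, links in ranked:
        if url in known_moves:
            redirects.append((url, known_moves[url], links))
        else:
            unresolved.append((url, links))
    return redirects, unresolved
```

For example, `plan_redirects({"/old-a": 750000, "/old-b": 12}, {"/old-a": "/new-a"})` would put `/old-a` at the top of the redirect list and leave `/old-b` in the unresolved pile.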
|You never know for sure what the impact will be. |
Appreciate you stopping by Derrick. Surely Microsoft wouldn't be filtered or penalised, but then I believe even Google has had issues organizing their own site from time to time.
I wonder what they see in WMT.
44 million docs? a billion? Are these purely informational sites we're talking about?
Most of the time, these are multi-purpose sites for major enterprises. As such they have millions of backlinks and expose millions of URLs worth of content. The "billion" discussion was pretty much hyperbole.
|44 million docs? Are these purely informational sites we're talking about? |
Informational - yes, e.g. a shopping comparison search engine.
|The "billion" discussion was pretty much hyperbole. |
Oh, I thought dataguy was serious. :|
You're the first person to mention 1000000000 in this thread, pageone. Have you dealt with a site that big, or were you just typing in lots of zeroes?
|You're the first person to mention 1000000000 in this thread, pageone. |
By golly I was. And then dataguy picked that number up in his response. :)
|Have you dealt with a site that big, or were you just typing in lots of zeroes? |
Heck no, I was just trying to create a pretty half pyramid thingy. 45 million URIs is the most I can lay claim to and that was a few years ago.
Just FYI... at one point we counted as many as 1.4 billion URLs on *.microsoft.com/*. Most of these were "junk" navigational URLs like I mentioned in the video. Many of these "junk" URLs are pages from various "solution finders" or "compatibility checkers" where you can select attributes via links to get search result style pages.
Some search engines are better than others at avoiding them during a crawl and/or filtering them out at indexing.
We do our best to robots them out but some of them are structured so it is difficult to exclude the bad stuff without also excluding the good stuff.
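One way to reduce the risk of blocking the good stuff is to test a draft robots.txt against sample URLs before it goes live. The sketch below uses Python's standard `urllib.robotparser`, which, as far as I know, matches rules by simple path prefix and does not understand wildcard patterns, so it only approximates what a given search engine will do. The `/finder/` paths and the `example.com` host are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Draft rules: block the generated result pages, keep the entry page.
DRAFT_RULES = [
    "User-agent: *",
    "Disallow: /finder/results/",
]


def split_by_access(rules: list[str], urls: list[str], agent: str = "ExampleBot"):
    """Return (blocked, allowed) lists of URLs under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules)
    blocked = [url for url in urls if not parser.can_fetch(agent, url)]
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return blocked, allowed


if __name__ == "__main__":
    sample = [
        "https://www.example.com/finder/",
        "https://www.example.com/finder/results/os-a/cpu-b/",
    ]
    blocked, allowed = split_by_access(DRAFT_RULES, sample)
    print("blocked:", blocked)
    print("allowed:", allowed)
```

Feeding it a list of known-good URLs and checking that none land in the blocked list is a cheap sanity check before waiting for the nasty email.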
I have to convince the "site owner" (if I can find them) to let me block their stuff. This can be difficult so sometimes I block things then wait for someone to send me a nasty email :)
Assuming a CMS isn't generating thousands of pages of junk content (in which case, frankly, a broken CMS is a developer problem, not an SEO's), isn't the solution pretty much always the same?
a) Create sitemap.
b) Put sitemap in footer/submit to SE's.
c) Get the work experience monkey to create a list of 301 redirects.
If no work experience monkey is available, pay an outsourcing agency.
If you can't afford that, just 301 redirect everything to the sitemap. Ideal? No. But considering the amount of time spent doing thousands of redirects, you'd get more bang per hour accepting the reduced link juice and using the time to find more back links.
My experience thus far is that a huge number of indexed URLs, without serious link power and search engine recognised authority = really poor rankings
|My experience thus far is that a huge number of indexed URLs, without serious link power and search engine recognised authority = really poor rankings |
Do you mean really poor rankings for just those specific pages?
Really poor rankings for the site in general?
Really poor rankings for the home page of that site?
Please elaborate, if possible.
|The index page contains... links only to these specific departments. |
That would be somewhat in disagreement with what Matt Cutts has stated in an official video (or two) about the importance of ecommerce sites to put some of their best selling / revenue generating products "front and center" on the home page.
Also, another point someone brought up was about the navigation being intuitive.
Unfortunately, people don't all think the same way. Some people might think that shopping for shoes by brand would be intuitive. Some might think that shopping for shoes by function (dress shoes, running shoes, dancing shoes) would be intuitive.
Other people might think that shopping for shoes by price might be intuitive, while others might think that shopping for shoes by which celebrities are wearing them might be intuitive.
I mean poor rankings in general for any page without solid inbound link power. A huge number of URLs indexed would often mean that the link power coming into the site is very diffuse, and this probably has consequences.