Themeing is all about context. I wrote the original article in late '97 after several long discussions with a programmer at Infoseek about the directions they were going to take after the switch to go.com. Those plans never materialized because go.com faltered. I updated the article in late '98 and early '99 after Google came on the scene and seemed to confirm many of the propositions about themeing.
It's all about context (or Topic Distillation) and how easily a search engine can identify an appropriate set of keywords for your site. The original hypothesis was that it would all be done "on site". They'd take your entire site and index it as one giant page, run a density analysis on it, rank the keywords found, and create a core group of keywords appropriate for your site. You would only be found in the search engine results related to those words. Back then, that would have addressed much of the problem they were having with bait and switch and, in the early days of cloaking, with inappropriate content.
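To make the whole-site idea concrete, here is a minimal sketch of that hypothesis, not anything Infoseek or any other engine is known to have run. The function name `site_keywords`, the tiny stop-word list, and the sample pages are all made up for illustration. It joins every page into one giant document, computes each word's density, and keeps the top few as the site's core keywords.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real engine would use a far larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "for", "is", "on", "our", "we", "with"}

def site_keywords(pages, top_n=5):
    """Treat all pages of a site as one giant page and rank words by density."""
    text = " ".join(pages).lower()
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS]
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    # Density = occurrences of a word / all counted words on the whole site.
    return [(word, count / total) for word, count in counts.most_common(top_n)]

# Invented sample site: two on-topic pages and one stray page.
pages = [
    "Laser printers and inkjet printers for the office",
    "Printer ink and toner for laser printers",
    "We toured California on vacation with our printers team",
]
core = site_keywords(pages, top_n=3)
print(core)
```

Under this scheme the stray vacation page does not pull the site toward "california", because that word's density across the whole site is too low to make the core group.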
Then came all the link and off-page criteria theories. There was a growing realization that external data could define a site as much as on-the-page words. Directory listings and inbound links are the main off-page data that could be used.
There are also the simple semantic relationships between words that can be used to define a site. Google's "Sets" utility at labs.google.com is a prime example of how keywords can be related to one another. It is a working example of what was a very hot topic three years ago on WebmasterWorld: the infamous Term Vectors [www9.org]. A term vector makes it possible to put numeric associations between words.
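For anyone who wants to see what a "numeric association between words" can look like, here is a rough sketch. It is not the method from the www9.org paper and not how Google Sets works internally (that is not public). It assumes the simplest possible term vector, a count of which other words appear in the same document, and compares two vectors with cosine similarity. The documents and the function names are invented.

```python
import math
from collections import Counter, defaultdict

def term_vectors(documents):
    """Build one co-occurrence vector per word from a list of documents."""
    vectors = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors (0.0 = unrelated)."""
    dot = sum(v1[k] * v2[k] for k in v1 if k in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Invented mini-corpus: three "vehicle" documents and two "plant" documents.
docs = [
    "car truck motor battery",
    "truck motor muffler",
    "car motor muffler",
    "trees leaves plants",
    "plants leaves ecosphere",
]
vecs = term_vectors(docs)
car_truck = cosine(vecs["car"], vecs["truck"])
car_leaves = cosine(vecs["car"], vecs["leaves"])
print(round(car_truck, 3), car_leaves)
```

In this toy corpus "car" and "truck" keep the same company and score high, while "car" and "leaves" share no neighbours at all and score zero.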
Remember those fancy IQ tests many of us took in high school, where you were asked to spot the odd man out?
Car is to truck, is to motor, is to battery, is to trees, is to leaves, is to plants, is to ecosphere, is to pollution, is to greenhouse gases, is to muffler, is to truck.
As we can see in Google Sets, all those associations can be given a numeric score and used to either include a word in a list or exclude it. I can't think of another thing Google has ever done that has tipped its hand about what it will do in the future more than that utility. The only thing better would be if Google printed the actual numeric score between the words. (That, and validating the HTML on the 'sets' results.)
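Since Google does not print those scores, the numbers below are pure invention, but they show how a table of pairwise scores could drive both the odd-man-out test and the include/exclude decision. The names `SCORES`, `odd_one_out`, and `build_set`, and the 0.5 threshold, are hypothetical.

```python
# Hypothetical pairwise association scores in [0, 1], invented for illustration.
SCORES = {
    frozenset(("car", "truck")): 0.9,
    frozenset(("car", "muffler")): 0.7,
    frozenset(("truck", "muffler")): 0.7,
    frozenset(("car", "leaves")): 0.1,
    frozenset(("truck", "leaves")): 0.1,
    frozenset(("muffler", "leaves")): 0.05,
}

def score(a, b):
    """Look up the association between two words; unknown pairs score 0."""
    return SCORES.get(frozenset((a, b)), 0.0)

def odd_one_out(words):
    """Return the word with the lowest average association to the others."""
    def average(word):
        others = [o for o in words if o != word]
        return sum(score(word, o) for o in others) / len(others)
    return min(words, key=average)

def build_set(seed, candidates, threshold=0.5):
    """Include a candidate only if its score against the seed clears the threshold."""
    return [c for c in candidates if score(seed, c) >= threshold]

odd = odd_one_out(["car", "truck", "muffler", "leaves"])
related = build_set("car", ["truck", "muffler", "leaves"])
print(odd, related)
```

With these made-up numbers "leaves" is the odd man out, and a set seeded with "car" keeps "truck" and "muffler" while excluding "leaves".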
So how is Google using all that contextual data to rank your site? Details are unknown at this point, of course, but a few techs at recent conferences have indicated that contextual data such as the page titles of linking pages may be in use.
That use of context in its various forms could be very powerful in finally rooting out the dreaded "off context" results that plague other search engines. We've all seen some wildly inappropriate listings in the middle of a results page on other search engines. By using various forms of "context" to make sure that query terms are an appropriate match for a page, search engines can eliminate that occasional bad result.
I don't think we should underestimate how much that one bad listing can cost a search engine. If you are searching for "printers" and run into a page in the results about "vacations in California" because it happens to mention "printers" on the page, what do you do? How many of us do something different at that point? We change the search, hit the back button, or just go to another search engine. That one bad listing poisons the whole page. I still think this is primarily why other search engines have not been successful. People's patience and attention span with web work are very short.
I think it is a no-brainer that context will play a greater and greater role with all the search engines. Every scrap of data they can get their hands on to help define your site will be used. The major contextual items available to search engines are: page title, inbound link text, directory listings, domain names, site directory names, DNS information, whois information, toolbar data, voting data, referral strings, click-through data, and proxy cache data.
After that, we get into some of the real guru stuff: query relationships, search refinement relationships, predictive search terms, personalized search histories, follow-up query prediction, and community identification. Some of that has already come to pass, such as the predictive search terms we can see in the automatic spelling correction and, again, the query relationships in the "sets".
The real challenge is going to be synthesizing all that data down into a usable tool. If you've ever worked with huge data sets, you know they can be either poetry or chaos. It takes serious, slow, long-term testing to synthesize a googol [webmasterworld.com] of data.
If you look at a few of the smaller moves Google has made over the last year, such as the purchase of Outride and the "labs" stuff, I think they point to a major overhaul of Google that is in the works. All these little refinements we've seen over the last year are evolutionary steps toward a complete overhaul of the ranking systems.
As each of the data sets mentioned above is implemented, adjusted, or injected into the mix, there will be small sets of results that change radically. You'll see things like we saw this month, where a wide swath was cut through a group of like sites while other sites saw increases.
Watching and trying to come to terms with those changes is nearly impossible. Just because you can identify something doesn't mean you will be able to adjust anything on your site to your benefit.
That's where the whole theme concept comes into play. It's about staying on topic and on mission throughout everything you do for your site. That translates into two parts to themeing a site: it is part philosophy, that content is king, and part pragmatics, in the way you arrange your site. It's the realization that everything you do online with regard to your site can potentially affect its ranking in the future.
Google said it [google.com] best:
"#2: It's best to do one thing really, really well."
It's coming to terms with the fact that you can only temporarily drive the search engines and the only successful optimization is to let them come to you. That is done by building an excellent site that serves your visitors long term. Focus on the visitors, and the search engines will eventually follow.
In conclusion, although some of the specifics of the themeing theory such as whole site indexing never came to pass, the contextual heart of the theory is stronger than ever.
One problem I see for the moment is that for context ranking to be broadly implemented, it will need an index that is a factor larger than the already large Google index.
For context-related ranking to become valuable you need links to your page from external sites/pages (more independent and therefore usually more authoritative). This could already work for pages in English; however, smaller languages simply have too small a base of interlinking to show the best results, at least in my world.
I would say it makes little difference if your site has several "themes" (the "do one thing well" point), as long as those individual themes are all presented thoroughly. Stanford or MIT would probably be among the best in several themes on their one site. Google or any other developed search engine should be able to distinguish these sub-themes within one site purely on the importance and context of the interlinking of the pages within one theme.
Of course, all these ideas involve having a huge, multilingual thematic thesaurus, so that equivalences between themes in different languages could be established. If necessary, another "link modifier" could be established among the different languages. For example, the modifier for the relationship between two pages about the same theme, one in Spanish and another in Italian, would be higher than the modifier for the same relationship between pages in English and French.
Uff, I think I've been reading WebmasterWorld too much ;)
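As a rough illustration of the cross-language "link modifier" idea from the post above, the sketch below scales a link's weight by a per-language-pair factor. Every number here is invented, and the table `LINK_MODIFIER`, the default value, and the function `link_weight` are hypothetical names, not anything a search engine is known to use.

```python
# Hypothetical modifiers per language pair; the values are invented and only
# preserve the ordering suggested above (Spanish-Italian > English-French).
LINK_MODIFIER = {
    frozenset(("es", "it")): 0.8,
    frozenset(("en", "fr")): 0.5,
}
DEFAULT_MODIFIER = 0.3  # assumed fallback for pairs not in the table

def link_weight(source_lang, target_lang, same_theme, base_weight=1.0):
    """Weight of a link between two pages, scaled by a language-pair modifier."""
    if not same_theme:
        return 0.0  # off-theme links carry no contextual weight in this sketch
    if source_lang == target_lang:
        return base_weight
    pair = frozenset((source_lang, target_lang))
    return base_weight * LINK_MODIFIER.get(pair, DEFAULT_MODIFIER)

print(link_weight("es", "it", same_theme=True))
print(link_weight("en", "fr", same_theme=True))
```

Deciding that two pages in different languages share a theme is the hard part, and that is where the multilingual thesaurus would have to come in; here it is simply passed in as `same_theme`.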
I'm pretty sure Teoma is doing this, and to tell you the truth I'm a little distrustful of it. I had a site that inadvertently had a bunch of mirror sites up on Teoma for a while, and Teoma rankings skyrocketed for certain terms because the links pages on these mirrors, with the same titles, were all pointing to the same places.
Also, I have more confidence in link text and context on referring sites than I do in page titles. A lot of people don't have a clue about titles, whereas relevant link context is more likely to happen accidentally.
And what happens with internal links on a site where there are a thousand pages all having the same title?