|Is Google classifying sites by topic?|
I am wondering if anyone else has seen tectonic shifts in the distribution of traffic on their sites in the last month or so. This would apply mostly to sites with many categories or broad types of content, and not niche sites.
Google Analytics is showing that traffic to most of the categories on my site has dropped 25%-60% across the board, except for the 'main' or most popular category on my site, which has gained traffic. The overall effect is about the same net traffic, though much less narrowly focused than before (I compared September '08 to April '08). In my case the shift happened very quickly.
It strikes me that Google has associated my site with one main topic and is driving traffic based on that topic. Is Google now forcing us to pick one horse and ride it? Is this the end of the broadly-focused website?
We had a thread last year that discussed the possibility of this evolution:
Is Google Classifying 'Types' of Websites and Search Terms? [webmasterworld.com]
There's no doubt at this point that search terms get classified. With this recently suspected "traffic throttling", it is beginning to look like we have something real going on with websites, too - although the details are not yet clear.
If this is what's going on, it certainly doesn't impact every site - not the giants like Wikipedia, at any rate.
I've blogged about this before - it seems clear to me that Google is performing some sort of categorization of sites. Their classification methods could take various forms, although their general philosophy would point toward some sort of automated system to accomplish this.
One very good indicator of Google's work in classification is their Labs project, Google Sets:
Try typing in your brand name or site name and see what other words are reflected back. If the results include names of your competitors, you're likely correctly categorized. If not, you might want to alter your site's keyword content so that you get pigeon-holed more accurately.
For me, some ambiguity comes in because of phrase-based indexing and its interaction with the rest of the algo. If sites are indexed by the phrase-based processes (and apparently they are) then that automatically creates a kind of implicit website taxonomy through the co-occurrence metrics that phrase-based indexing generates.
I'm not sure that Google needs to add any dedicated process beyond that for determining a website taxonomy.
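As a rough illustration of that idea - using purely invented pages and phrases, not anything Google has published - phrase co-occurrence across a site's pages can by itself surface a dominant theme, with no dedicated classification step:

```python
from collections import Counter
from itertools import combinations

# Hypothetical pages, each reduced to the phrases a phrase-based
# indexer might extract (illustrative data only).
pages = {
    "/cake-recipes":   {"chocolate cake", "baking", "frosting"},
    "/cookie-recipes": {"chocolate chip", "baking", "cookie dough"},
    "/pie-recipes":    {"pumpkin pie", "baking", "pie crust"},
}

# Count how often each pair of phrases appears on the same page.
cooccurrence = Counter()
for phrases in pages.values():
    for a, b in combinations(sorted(phrases), 2):
        cooccurrence[(a, b)] += 1

# A phrase that co-occurs with phrases from many pages acts as
# implicit "topic glue": here "baking" pairs with everything else,
# marking the site-wide theme without any explicit category list.
def glue_score(phrase):
    return sum(c for pair, c in cooccurrence.items() if phrase in pair)

site_theme = max({p for pair in cooccurrence for p in pair}, key=glue_score)
print(site_theme)  # "baking"
```

The point of the sketch is that the "taxonomy" falls out of the statistics for free, which is consistent with the suggestion that no extra process is needed.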
Now taking things a bit further, is there some added switch that says "rank this group of co-occurring terms today, and this other group tomorrow?" In some cases, at least, the data are quite suggestive.
Interesting...I had to backtrack and read some of the previous posts as I wasn't familiar with phrase-based indexing. I don't see how this would lead to a specific drop in traffic for just some parts of a site though, as long as those phrases appear in the text on the site and in links pointing to the site. As long as the text is present, wouldn't a large site with many categories simply be associated with multiple groups of co-occurring terms? In my case it seems that there is a fairly uniform drop in every category except for one.
Here's how it could work. If most of the site is over the co-occurrence threshold that is currently considered "over-optimization" or spam, then those queries would show a ranking drop. Other keywords and pages might be under that threshold.
Not to say that this IS the cause of what you're seeing, just that it is a factor that introduces some ambiguity into the analysis. Some of the other threads in this forum recently show similar reports, even to this degree.
So whatever the current situation for Google, let's look at your question, "is this the end of the broadly-focused website?"
I'd say no - but it might be challenging to run a broadly-based commercial website without enough PageRank. The "topical limitations" do not seem to be affecting everyone, and from what I see one difference is whether there is enough backlink strength to support the diversity that the broad-based site offers.
Site topic was unmistakably the reason a site of mine had sitelinks showing certain pages and not others.
Yes I do remember that thread; thanks for bringing it up. Google also says that sitelinks are awarded partly on the basis of traffic. I wonder if that plays into your experience here, in addition to the possibility of topic.
So now a related question comes up for me: does Google sometimes give graybar PR to the parts of a site that are "off-topic", as it currently understands the topic? Or maybe discount internal link juice if it comes from another topical part of the site?
I've experienced an example of Sitelinks which suggests the possibility of topical awareness, but I'm doubting that Google has a classification list in the sense that the word "category" implies.
The Sitelinks example involves a general umbrella-term keyword... call it "gadgets". The site for which they appear isn't actually optimized all that well for "gadgets". What it's optimized for (and ranks well on) are phrases that fit into subcategories of "gadgets": "widgets", "gizmos", and "doodads". Yet the "gadget" phrase for which the site has Sitelinks is a very competitive phrase, and I was quite surprised when I first saw Sitelinks on this search.
More likely, it's due to the implicit website taxonomy that tedster posits, plus the natural occurrence of the term "gadgets" in inbound links describing the site as a whole, along with several appearances of "gadgets" on the page.
The categories are likely to be effectively self-generating, to be somewhat fuzzy, maybe always shifting, and to come out of statistical analysis of the natural use of language (and perhaps of traffic patterns as well).
[edited by: Robert_Charlton at 8:44 pm (utc) on Oct. 4, 2008]
I'd suspect that taxonomies can be dynamically (re)constructed based on co-occurrence data, although the ranking algos aren't necessarily the same as the clustering factors involved in snippet generation.
What made me sit up and take notice in the case I've referred to (which, incidentally, no longer has sitelinks since a change or two was made) was the PR distribution: even though the site's PR is now down a notch, the same pages still show TBPR or graybar, and they're topical.
Nothing has changed for ages with regard to inlinks, but there are periodic changes in outlinks; and there's a uniform internal linking with no concentration on any sub-topic. Figure something like this:
Main overall subject area: baked goods
Topic A: cake
Topic B: cookies
Topic C: pie
The inbound links may carry anchor text primarily for topics A & B, yet if there's a heavier on-site concentration of text, outlinks (and the topics of the sites linked to), keyword-based site navigation, and page titles on pumpkin, apple, mince, cherry, etc., then guess what? Those terms primarily co-occur with pie, not cake or cookies. The pie topic therefore far outweighs cake and cookies, it's of more benefit for users looking for pie, and sitelinks are more relevant for that subject.
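That weighting can be sketched in a few lines. The signal-term vocabularies and frequency counts below are invented for illustration; the only claim is the arithmetic, that terms co-occurring with "pie" can swamp the anchor-text topics:

```python
from collections import Counter

# Assumed co-occurrence vocabularies: terms that typically appear
# alongside each baked-goods topic (made-up lists, not Google data).
topic_signals = {
    "cake":    {"frosting", "layers", "sponge"},
    "cookies": {"oatmeal", "dough", "chocolate chip"},
    "pie":     {"pumpkin", "apple", "mince", "cherry"},
}

# Made-up term frequencies harvested from the site's text, page
# titles, navigation, and outlink targets.
site_terms = Counter({"pumpkin": 9, "apple": 7, "mince": 4, "cherry": 5,
                      "frosting": 2, "chocolate chip": 3})

def topic_weight(topic):
    """Sum the on-site frequency of the terms that co-occur with a topic."""
    return sum(site_terms[t] for t in topic_signals[topic])

dominant = max(topic_signals, key=topic_weight)
print(dominant)  # "pie": its co-occurring terms outweigh cake and cookies
```

On numbers like these, "pie" wins by an order of magnitude even if the inbound anchor text favors the other topics, which matches the sitelinks behavior described above.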
To be clear, in the case mentioned, the "pie" category pages show PR and non-pie category pages do not. They used to all show PR until just a while back. However, the homepage ranks for all three phrases, with "pie" being the least competitive, ranking a bit higher, and having had sitelinks for a while.
I doubt there's fully topical PageRank, but between this case and clues I've seen on other sites, it looks like biased PageRank might be coming into play in recent times.
|Now taking things a bit further, is there some added switch that says "rank this group of co-occurring terms today, and this other group tomorrow?" |
In doing a fresh read of the Historical Data patent (and another pending 2006 application), there are a few places that can be construed to intimate that some of the time-sensitive factors are not unrelated to phrase-based indexing (and co-occurrence), especially in the portion that's been least discussed: the update and freshness factors.
[edited by: Marcia at 9:50 pm (utc) on Oct. 4, 2008]
A patent [patft.uspto.gov] filed in 2004, titled "Automatic taxonomy generation in search results using phrases", was published in September 2008.
Reading the patent, it seems to describe a kind of "powered PhraseRank", with sites clustered according to their topics. The patent also explains how personalization of SERPs works according to user preferences and topics, and finally how the new algorithms can be used to combat spam resources more incisively.
> from what I see one difference is whether there is enough backlink strength to support the diversity that the broad-based site offers.
Google should see IBLs to different category pages, as opposed to the home page only/mostly, as a counterpoint to possible spam wrt broad-topic sites.
Any new topical algo is bound to clash with the old 950/phrase-based spam algo and it could take some time for them to complete full integration with minimal collateral damage.
Google should judge the breadth of a site largely based on rate of site development. It is natural for a site to grow gradually over time and add new categories.
Unfortunately site age wasn't properly considered when the 950 penalty was instituted.
Recently I've seen an odd pattern in Google results. In Brazil there are 4 big web portals (uol.com.br, terra.com.br, ig.com.br and globo.com). If you go to Google AdPlanner, you can see which category a website is listed under, and that's what I found:
uol.com.br = Web Portals
terra.com.br = Lyrics & Tabs
globo.com = News
ig.com.br = File Sharing
In Brazil it's quite common to have web portals partners' redirect their domains to the web portal's subdomains. Ex: www.widgets.com.br redirects to widgets.uol.com.br
Then I dug a little further and found that the category appears to be based on the number of pages indexed. Even though all the domains above should be listed under "Web Portals", they have different categories because a good portion of their indexed pages come from their partners.
terra.com.br = a lot of pages from letras . terra.com.br
ig.com.br = a lot of pages from baixaki . ig.com.br
globo.com = a lot of pages from g1 . globo.com
uol.com.br = good distribution of large subdomains (e.g. vagalume.uol.com.br, superdownloads.com.br)
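If AdPlanner's category really does follow the bulk of indexed pages, the mechanism might look something like this. All page counts and topic labels below are made up for illustration; the sketch just shows how one dominant subdomain could drag the whole domain into its category:

```python
from collections import Counter

# Hypothetical indexed-page counts per subdomain (invented numbers).
indexed_pages = Counter({
    "letras.terra.com.br":   900_000,
    "www.terra.com.br":      120_000,
    "noticias.terra.com.br":  80_000,
})

# Assumed per-subdomain topic labels, as an automated classifier
# might assign them from on-page content.
subdomain_topic = {
    "letras.terra.com.br":   "Lyrics & Tabs",
    "www.terra.com.br":      "Web Portals",
    "noticias.terra.com.br": "News",
}

# If the domain-level category simply follows the majority of
# indexed pages, the biggest subdomain decides it.
dominant = max(indexed_pages, key=indexed_pages.get)
domain_category = subdomain_topic[dominant]
print(domain_category)  # "Lyrics & Tabs"
```

That would explain why a general-purpose portal ends up labeled "Lyrics & Tabs" when one partner subdomain contributes most of the indexed pages.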
But... Here is my question...
terra.com.br ranks really well for music related search terms
ig.com.br ranks really well for software related terms
- Do you think this categorization is automatic?
- Does it affect SERPs? Do you think Google puts the whole domain in a category and gives more weight to search terms of those categories?
- If that's the case, what can I do to compete well with web portals subdomains?
[edited by: tedster at 4:16 pm (utc) on Oct. 13, 2008]
[edit reason] moved from another location [/edit]
Hello DLaf, and welcome to the forums.
That's an excellent set of observations. Normally we do not discuss specific domains or keywords here, but in this case they are major portals so we'll make an exception.
As you can see from the earlier messages in this thread, other members are also suspecting that there is an automatic classification of some kind happening at Google - but we have no definitive understanding at this point.
It's always a challenge to compete with a high PR website, whether it's a portal or just a standalone domain. You may never be able to rank above them unless you become as strong as they are - but you can get on the first page, too, in many cases. It depends on the total picture of all competition on those keywords.
Interesting that the subdomains are affecting the categories of the higher level domain in AdPlanner. Yes, I think it's automated - the data is easily available through Google's phrase-based indexing [webmasterworld.com] algorithms.
Now the question becomes whether a site in one category can rank for a competitive word that's not part of that category. I am beginning to feel that it has become difficult for a site to break into a new class of keywords, but it's not impossible. It seems to take a significant amount of content plus strong backlinks showing up to gain the new rankings - stronger in both these areas than in the past, perhaps, but it's still within possibility.
Do you remember the "Dewey update"? I don't think that name was casual.
Melvil Dewey created the decimal classification system for libraries, so it could be that Matt Cutts was asking in April for spam reports tagged with the "Dewey" keyword, flagging strange SERP rankings, because they were testing this aspect of the algo.