|YP classification taxonomies - are there any joint standards?|
Or does everyone set their own independently?
I'm a bit interested to learn more about the taxonomy systems that are used for YP directories.
We see different terms used on different sites: directory A uses "Automotive" as a big category with subs, while directory B has "Automobiles", C has "Vehicles", etc.
I've been wondering:
How do mid-size directories (like Insider Pages, Truelocal, MyPages) sort their businesses?
- Do they "inherit" the classification from their data provider?
- Perhaps there's a joint standard for business classification, resembling the universal system for classifying books in libraries?
- Maybe they use software for automatic taxonomy (I've seen some names here and there)?
- Or maybe a public gov system for this?
(p.s. I've taken a look at the YPPA top 300 headings - could they have something?)
For those who study and apply local search, the questions that you raise are, as my colleague calls them, "the mother lode".
The issue of taxonomy has far-reaching permutations in local search, from category browsing, business listing accuracy, related-category options, keyword-to-category mapping, synonym drivers and keyword phrase aggregation to monetization strategies, and on and on. You name it in local search, and the issue of taxonomy applies to it.
In direct response to your questions: generally speaking, there are no Internet standards. NAICS and SIC codes are the most widely accepted traditional standards in industry classification, while traditional YP taxonomies are driven largely by look-up and advertiser demand factors. When it comes to the web, however, unstructured queries and traditional business classification don't mix well. Therefore, the major engines usually either buy a base web search taxonomy, scrape and combine several and call the result their own, or organically develop one around their specific usage requirements.
Taxonomies are everywhere on the web. Standards are not, and they won't be for some time.
You bring this issue up in the context of local search, and you are dead on. But in the world of traditional SEM topics like PPC, we speak about these issues in terms of keyword databases/generators/families. My point is that words and their relationships to one another are the fundamental factor of search and search engine marketing.
Your questions are excellent, and they are sitting on the executive desks within most local search organizations today.
Thanks Chicago for the detailed answer!
I definitely agree that the main challenges in this subject surround the domains of SEM and keyword bidding, rather than simple classification for browsing or navigation.
Taxonomies, you've got me started now. Prepare yourself for a long post (and I'm only doing this because it's the Local Search forum).
Taxonomies are problematic due to the need to classify each item into one or more categories (thus making it very resource hungry).
Selecting the categories is the first hurdle, and it's amazing how difficult this is. Every single person would categorise a complex dataset such as a Yellow Pages in a different way (assuming they all had to start from scratch and could be bothered to do so).
The well-crafted taxonomies that work in print, as Chicago points out, are not suited to the web. He should know and I definitely agree with him.
So what do we do?
If we are going to do it for the web then we must think of the endless queries that are used in everyday search, then make a system that goes some way to understanding those queries.
I'm going to break the rules with a couple of URLs if I may be allowed, I have to do so to illustrate the point that an automated classification system is required.
I recently undertook a project which tries to categorise a research paper based on just its title. The categories have been laid out into 15 main panels which are divided further into a total of 67 sub-panels.
Bear with me I'm getting to the point.
Using the data collected from 268,000 research papers I used latent semantic indexing to create an application that can take a sentence and categorise it into the 67 categories based on a percentage score. I have put a reduced version of this online for you to see (which only works with single words, heck I can't show it all to the public).
It's not laid out well but it works like this:
The main panels (A to O) are made up of the sub panels that follow them (e.g. panel A is Cardiovascular Medicine, Cancer Studies, Infection and Immunology, Other Hospital Based Clinical Subjects & Other Laboratory Based Clinical Subjects).
If you click on a main or sub panel it will show you 100 words that it associates with the area (it knows around 50,000), that's not exciting I know.
Now click on one of those words (or enter it in the querystring) and it will come back with a list of 10 probable categories that the word sits in. There are words that don't work too well and I didn't have time to make this flag them (if you see a list that starts with Statistics and Operational Research then Computer Science and Informatics, it's probably a word that it does not know).
None of this has been influenced by hand, it's all computed from data and a bespoke algorithm.
When you ask it to put suggestions up for a sentence or paragraph it gets really good at knowing what you are on about.
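To give a feel for the "percentage score per category" idea, here is a minimal sketch. It is not the LSI application described above: it skips the SVD step entirely and just uses word-to-category counts, and the three training rows and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (title, sub-panel) pairs. The real system used
# 268,000 papers; these rows are invented purely for illustration.
TRAINING = [
    ("heart failure and blood pressure", "Cardiovascular Medicine"),
    ("tumour growth in breast cancer", "Cancer Studies"),
    ("blood markers for cancer screening", "Cancer Studies"),
]


def build_word_weights(training):
    """Count how often each word appears under each category."""
    weights = defaultdict(Counter)
    for title, category in training:
        for word in title.lower().split():
            weights[word][category] += 1
    return weights


def categorise(text, weights, top_n=10):
    """Return up to top_n (category, percent) pairs for a word or sentence."""
    scores = Counter()
    for word in text.lower().split():
        counts = weights.get(word)
        if not counts:
            continue  # unknown word: contributes nothing to any category
        total = sum(counts.values())
        for category, count in counts.items():
            # Each known word spreads one vote across its categories.
            scores[category] += count / total
    grand_total = sum(scores.values())
    if grand_total == 0:
        return []  # nothing recognised, so no suggestion at all
    return [(c, 100 * s / grand_total) for c, s in scores.most_common(top_n)]


weights = build_word_weights(TRAINING)
print(categorise("blood pressure", weights))
```

More words in the query means more votes, which is one plain reason a sentence tends to categorise better than a single word.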
The point of this is that it is dealing with only 67 categories and took many days to calculate. Try doing that with 2000 or 20000 categories and you are into the Petabyte storage range!
So how should we do it (and who is mad enough to try?)
I think that it's possible to categorise data for an IYP and get real English queries to work.
The problem is that the task is Google-esque in size.
Take all of the known categorisation data (YP data, DMOZ etc) and manually cross-reference each schema to the other, using a base schema which would likely be an IYP (which we will throw out later!)
Crawl every site on that list and create a word frequency list for each site.
Create 2000 master word frequency lists based on the YP categories that each site is in. Each of these would retain the 'plumbers' or 'hairdressers' heading (note they are much more generic than a standard search normally is)
Create a term document matrix (TDM) for the 2000 datasets, this gives relative weighting for each word that appears in each dataset. It gets complex from here on.
Use the TDM to do a crude categorisation of each website/page so that we can see the areas that the site/page covers. This would be a statistically backed categorisation using as much pre-structured data as is to hand.
Create an N-dimensional database for the number of terms that are used (usually stemmed and reduced in some other ways), and complicate this by adding word proximity data for each dimension.
Sit back and marvel at a database that is so huge it won't happen soon, but if it was to happen it would be able to understand where to send you for things as diverse as 'drive me to the church in style', 'leaking radiator' or anything else you can think of.
It would be able to pick out documents that are related to the search and even be able to pick out truly similar documents. It could even map the search terms or pages to the YP schema, so you could still ask for the old YP classifications but it would return better results.
Get some sleep.
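The middle steps above (word frequency lists, master lists per heading, the TDM and the crude categorisation) can be sketched on a toy scale. This is only a rough illustration under my own assumptions: the weighting is ordinary smoothed TF-IDF, the matching is cosine similarity, there is no stemming or proximity data, and the sample sites and function names are invented.

```python
import math
from collections import Counter


def word_counts(text):
    """Word frequency list for one site/page (no stemming in this sketch)."""
    return Counter(text.lower().split())


def master_lists(sites):
    """Merge per-site counts into one master list per YP heading.

    sites: iterable of (heading, page_text) pairs.
    """
    masters = {}
    for heading, text in sites:
        masters.setdefault(heading, Counter()).update(word_counts(text))
    return masters


def tfidf_matrix(masters):
    """Term-document matrix: one weighted row per heading."""
    n = len(masters)
    doc_freq = Counter()
    for counts in masters.values():
        doc_freq.update(counts.keys())  # number of headings using each word
    matrix = {}
    for heading, counts in masters.items():
        total = sum(counts.values())
        matrix[heading] = {
            # Smoothed IDF: words found under every heading weigh the least.
            w: (c / total) * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
            for w, c in counts.items()
        }
    return matrix


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_headings(text, matrix):
    """Crude categorisation: headings sorted by similarity, best first."""
    query = {w: float(c) for w, c in word_counts(text).items()}
    return sorted(((cosine(query, row), h) for h, row in matrix.items()),
                  reverse=True)


# Invented sample data: (YP heading, crawled text) pairs.
SITES = [
    ("plumbers", "leaking radiator pipe repair boiler"),
    ("plumbers", "burst pipe emergency boiler service"),
    ("wedding cars", "drive to the church in style limousine hire"),
]
matrix = tfidf_matrix(master_lists(SITES))
print(rank_headings("leaking radiator", matrix)[0][1])
```

Even at this size you can see where the storage goes: every heading carries a weight for every word it has ever seen, before any proximity data is added.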
As you can see, there is a lot to the process, but I'd bet that Google are already working on something similar. P.S. the terminology may be a little off, if I was 100% correct I'd be locked in a room at Google currently trying to do that (I'd love to give it a go).
I hope that wasn't too boring, I just thought I'd try to put it into some kind of perspective.
great stuff as usual inbound.
Interesting thoughts inbound. You are right about it being a mammoth undertaking.
A few comments
Your Step 1:
|Take all of the known categorisation data (YP data, DMOZ etc) and manually cross-reference each schema to the other, using a base schema|
As you noted, this is the crux of the issue. Throughout the disparate sources for local data (both online and off) there are thousands of different taxonomies and no easy method of mapping between them. When you consider that many of those taxonomy structures contain thousands of categories each, you realize the enormity of the task as a manual course of action, let alone accounting for quality assurance.
You could create an application to do the mapping for you, but this requires that you get the taxonomies into a format which is “digestible” by an application, potentially using the very same LSI technology you described to create the best guess of the mapping for you. And even then, formulating a hierarchy mapping rather than just a simple relationship for the data within that type of automated process is very difficult and laden with issues.
So where does that leave us? Well, I guess it comes down to the goal of the exercise. I’ll get philosophical and ask why does one need the map? Depending on your use, you may not even need it. If you are just looking for a navigational hierarchy, you can buy one or use one of the free ones like DMOZ. If you are looking to provide relevant results based on search there are other methods which meet the goal in varying degrees. If you have the LSI type logic you describe, the taxonomy may become a non-issue or at minimum much less of an issue.
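As a rough illustration of what a "best guess" mapping application might look like, here is a small sketch. It assumes each heading already comes with a set of sample terms (say, words from the listings filed under it), and it uses plain Jaccard overlap rather than LSI. The directories, terms, threshold and function names are all hypothetical.

```python
def jaccard(a, b):
    """Share of terms two headings have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_guess_mapping(source, target, min_score=0.2):
    """Map each source heading to its most similar target heading.

    source, target: dict of heading -> iterable of sample terms.
    Returns heading -> (target heading or None, score). Headings scoring
    below min_score map to None and are left for manual review.
    """
    mapping = {}
    for s_heading, s_terms in source.items():
        best, best_score = None, 0.0
        for t_heading, t_terms in target.items():
            score = jaccard(s_terms, t_terms)
            if score > best_score:
                best, best_score = t_heading, score
        if best_score < min_score:
            best = None
        mapping[s_heading] = (best, best_score)
    return mapping


# Invented sample taxonomies for two directories.
DIRECTORY_A = {
    "Automotive": ["car", "repair", "tyres", "garage"],
    "Florists": ["flowers", "bouquet"],
}
DIRECTORY_C = {
    "Vehicles": ["car", "garage", "tyres", "dealer"],
    "Restaurants": ["pizza", "menu"],
}
print(best_guess_mapping(DIRECTORY_A, DIRECTORY_C))
```

Note that this only produces a flat heading-to-heading relationship and a score to prioritise QA. It does nothing about the harder problem of mapping one hierarchy onto another.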
All that being said, I would love to hear of other solutions aside from the manual brute force methods for mapping different taxonomies.
You are right that you could do the LSI without a 'map' and indeed you could serve perfectly good results from most queries. The problem lies in categorising searches so that relevant ads can be shown and 'did you mean' features created. LSI is great at finding relationships, but it's a statistical approach that seems to 'understand' a query when in fact it does not.
Having a taxonomy that could point to similar headings for each query will allow the monetisation of the searches. So Pay Per Click on an LSI based system would not need a whole host of keywords; you just pick the headings (and possibly sub-headings). Remember that this applies to Local Search; product search would still probably be better done in the traditional PPC way.
I might try this in a sector that I know very well if I get the chance, I just don't have the time.
That's *part* of the problem - only about 30% of local businesses have sites identified with their entity at the moment. Therefore you must use some sort of "category synonym" approach.
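A minimal sketch of what I mean by a "category synonym" lookup follows. The headings, phrases and function names are invented for illustration, and a real table would be far larger and maintained by hand or mined from query logs.

```python
# Hypothetical synonym table: heading -> phrases that should resolve to it.
CATEGORY_SYNONYMS = {
    "plumbers": {"plumber", "plumbing", "leaking radiator", "burst pipe"},
    "hardware stores": {"hardware store", "diy shop", "ironmonger"},
}


def build_index(synonyms):
    """Flatten the table into phrase -> heading (headings match themselves)."""
    index = {}
    for heading, phrases in synonyms.items():
        index[heading] = heading
        for phrase in phrases:
            index[phrase.lower()] = heading
    return index


def headings_for_query(query, index):
    """Return the headings whose phrases appear as whole words in the query."""
    padded = " " + " ".join(query.lower().split()) + " "
    found = []
    # Longest phrases first, so multi-word synonyms are checked before
    # any shorter phrase they might contain.
    for phrase in sorted(index, key=len, reverse=True):
        if " " + phrase + " " in padded and index[phrase] not in found:
            found.append(index[phrase])
    return found


index = build_index(CATEGORY_SYNONYMS)
print(headings_for_query("Leaking radiator in Leeds", index))
```

The point is that nothing here needs the business to have a website: the query resolves to a heading, and the heading resolves to listings.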
Also, your approach (which is excellent, btw) seems to favor queries that are looking for information close to home, as opposed to a business close to home. It doesn't address those people who want to find a list of hardware stores near them that sell widget 101A, and 70% of those hardware stores don't have a website.
To address the OP - you know I've talked twice with the Moderator and his organization just this week on these very issues? :)
|- Do they "inherit" the classification from their data provider? |
Sometimes. At TL, we take all of the taxonomies we are aware of and combine them into something called (don't laugh) "JakeCats".
|- Perhaps there's a joint standard for businesses classification, resembling the universal system for classifying books in libraries? |
It's called SIC or NAICS. Unfortunately, it sucks the big one, and even though it's supposed to be universal between data providers, it isn't.
|- Maybe they use a software for automatic taxonomy (I've seen some names here and there)? |
Half a million dollars a year at least to use this software. And it often has the same problems as manual taxonomies do.
|- Or maybe a public gov system for this? |
NAICS was developed by the government, if I remember correctly.
The problem is that any individual taxonomy system will be more relevant to the organization that designed it than to anyone else. SIC has maybe two or three categories dealing with the Internet (and zero dealing with the web), but they have thousands of categories dealing with mining.
JakeCats provide a lot more granularity than SIC codes do in consumer type areas, but we have one category for mining. :)
(One type of mining - should be data mining)
Thanks for chipping in, it is true that the LSI system would suit some types of queries better than others. I do think that it would be able to help people find service related businesses more easily than product based.
It would be O.K. (in a limited sense) for product searches (if the data isn't there to be had, then no-one would be able to do it).
There are a few types of results it could show:
* Probable categories (and hence business listings) for each search
* Best-match websites based on the search
* Best-match websites based on the most probable category
* 'Did you mean' suggestions based on proximity data (like GigaBits but more advanced) e.g. 'stem cells' => 'Try Cancer Research'
Remember that there are many types of task that are best undertaken with different resources. Industrial B2B sales research in the UK is easily done with Kompass (now those are meaty books - the electronic version is clearly better). Because of this I'd hazard a guess that local product search may end up being dominated by a different site in comparison to local service search.
Just wait until we see Froogle Local or Kelkoo Local! (Although TL was able to find a Jimmy Choo (expensive shoes) stockist in Chicago, maybe TL is stealing a march on the non-existent Froogle Local?)
It's an exciting time. Now's the time to wait for it to happen or to go out and make it happen. I choose the latter.
Many thanks for the answers, Jake.
Inbound - great stuff!
|Using the data collected from 268,000 research papers I used latent semantic indexing to create an application that can take a sentence and categorise it into the 67 categories based on a percentage score. I have put a reduced version of this online for you to see (which only works with single words, heck I can't show it all to the public).|
You collect and analyze research paper content to categorize research paper titles.
What would be the relevant data to collect and analyze to categorize YP listings by title? The title will often be proper names with no meaning and no analogy.
The categorisation of websites is easier than the research paper task as there is much more than just a title to go on, we would use the content of a page/site to find the likely categories.
We were up against an odd problem when we did the categorisation by title; 35,000 research papers that did not have an electronic version available (also not categorised by hand in any way).
I'm going to start an experiment next week which should be interesting, I'll report back here with what I can tell you all.
|The categorisation of websites is easier than the research paper task as there is much more than just a title to go on, we would use the content of a page/site to find the likely categories. |
That was my first thought, to crawl reviews and the like, but there are many listings with title and address, and no content. Would they remain unclassified or default to a standard YP classification?
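To make the two options in that question concrete, here is a small sketch of one possible fallback order: classify from content when there is enough of it, otherwise keep the provider's YP heading, otherwise leave the listing unclassified. The `Listing` fields, the `min_words` threshold and the `content_classifier` callable are all assumptions of mine, not anything a particular directory is known to do.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Listing:
    title: str
    address: str
    content: str = ""                       # crawled page/review text, if any
    provider_heading: Optional[str] = None  # heading from the data provider


def classify_listing(
    listing: Listing,
    content_classifier: Callable[[str], Optional[str]],
    min_words: int = 20,
) -> Tuple[Optional[str], str]:
    """Return (heading, source), where source records which route was used."""
    if len(listing.content.split()) >= min_words:
        heading = content_classifier(listing.content)
        if heading:
            return heading, "content"
    if listing.provider_heading:
        # Title and address only: default to the standard YP classification.
        return listing.provider_heading, "provider"
    return None, "unclassified"


def toy_classifier(text: str) -> Optional[str]:
    """Stand-in for a real content classifier (hypothetical)."""
    return "plumbers" if "boiler" in text.lower() else None


bare = Listing("Smith & Sons", "1 High St", provider_heading="Plumbing Contractors")
print(classify_listing(bare, toy_classifier))
```

Keeping the source alongside the heading makes it easy to see later how many listings were never classified from content at all.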
Nevertheless, looking forward to your report on further testing. This has got me excited.