Web Site Categorisation

What is the best way to categorise web content?

         

Dajuroka

7:02 am on Sep 19, 2003 (gmt 0)

10+ Year Member



Apart from the Open Directory categories, Dewey Decimal and Library of Congress, has anyone else come up with a useful / standardised set of classifications for indexing or mapping the web? This to me is the true value of the Open Directory. If we can maintain one public-domain, standardised classification system which is easy to download into databases and the like (to a level of complexity that matches the web they are running), then 'everyone' can place their sites in the 'right place'.

As someone who has tried to grow a classification, I can say it is a nontrivial task and almost impossible (sorry, IS impossible) for one person to maintain.

I would love to hear what others have been doing in this area.

Thanks for listening.

windharp

1:08 pm on Sep 19, 2003 (gmt 0)

10+ Year Member



Just a note from an ODP editor's view:
We are continuously changing the structure, renaming, moving and reorganizing. I don't think a system can be static - too much is changing all the time.

hutcheson

1:39 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DDS and LOC are both copyrighted. The ODP structure has been used by some other projects with official staff blessing, and could probably be used for others.

Dajuroka is surely right about the difficulty of building and maintaining such a beast. DDS and LOC both put out regular updates, and the ODP is always growing.

All of these schemes have on the order of 300,000-500,000 categories, and regular "subcategory structures" that can be plugged in as necessary for newly overgrown categories.

choster

5:45 pm on Sep 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I recall correctly (this may be apocryphal), the original structure for what became ODP was based very loosely on the major headings of the Usenet hierarchy-- arts.*, biz.*, comp.*, news.*, rec.*, soc.* and so on. I suppose this was reasonable enough considering the founders were engineers and programmers, not librarians or KM consultants.

To a certain extent then, most of the "top-level" headings of ODP are not actually subjects so much as notional impressions of subjects. Shopping is Shopping and Regional is Regional but there is considerable fluidity between, say, Computers and Science, or Home and Society. It is almost an exercise in fancy-- if I gave you a word and asked you to group it with one of twelve others, which would it be? The "real" subjects are in the layers below, narrower and more concrete topics which people understand immediately-- religion, movies, drug companies, Canadian football.

This model is extremely flexible, which is important because the world does not stand still. Some libraries still group personal computing with UFOs and psychics because the Dewey Decimal System grouped them with novelties, anomalies, and miscellaneous. :-)

At the same time, it has some, uh, glaring inadequacies. Perceptions of what a thing is (ODP ostentatiously terms this "ontology") vary from person to person, from culture to culture, and from language to language. It can be vague ("Home"? "Society"? And where does "Health" end and "Science" begin?). Lastly, some of the placements are forced-- ODP editors will be the very first to pronounce their astonishment and distaste that Education was fixed as a subcategory of Reference in the directory's infancy.

Dajuroka

12:56 am on Sep 20, 2003 (gmt 0)

10+ Year Member



So what is the answer?

Do we start to develop an open source classification based on the ODP or is that even now too fixed in its origins to cope with a major change? Do we take the basics from all the others and start to grow one?

Is ODP adequate or does the top level need readjusting? I am aware that the Library of Congress is looking at how it categorises the web. The problem is that we are listing not just a document (or book) but commercial sites, blogs, audio, visual.

I like the ODP but I fear that the top level should be larger to make the tree a little more logical.

With Google using it now and thousands of other sites one wonders if it will just live on. Of course if it is too dynamic then information will be lost or impossible to find.

Dajuroka

12:33 pm on Sep 20, 2003 (gmt 0)

10+ Year Member



It's interesting: as I gradually take in the classification, it has dawned on me how USA-centric it is (I know, here come the flames!).

For example, it uses "Math" instead of "Mathematics" and "Kids and Teen" rather than "Children and Teenagers", and then there is the whole use of the regional context.

This will make it harder to standardise sites. It's a bit like the use of .com in its early days being almost identical to .us. Must be hard for editors who find great regional sites that aren't US based.

And other little things like "Health Aging" at the same level as "Health Senior Health".

"Recreation> Antiques> US Civil War" but no other civil wars have antiques of interest?

I understand how these things happen, and I do not for one minute suggest that the USA does not dominate the URIs on the web, but to have a universal directory, even just for English-speaking sites, would seem to me to need some significant shifts.

Certainly on my very little site I will be trying not to regionalise as the ODP has (God knows how, but I will try).

hutcheson

9:24 pm on Sep 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No flames necessary. 'Struth. Of course, DDS is even more so...and if I knew the LOC well enough, I'd guess it leans that way.

Some of it's inevitable. Pop culture worldwide is (for better or for worse--well, if you insist, for worse and for worser) USA-centric: pop music, movies, TV. World/Hindi probably has a suitably India-centric Movies category, and World/French a surprisingly unchauvinistic but still France-centric Literature category.

Some of it's fortunate. A USA-centric Religion category is richer than any other nation's could conceivably be (even if you omit Christianity altogether). American Literature courses still probably spend more time on French authors than the reverse. (But I have a book, an anthology of Middle English literature, that was compiled by a French professor.) "Classical Chinese" music remains culturally limited in a way that "Classical German" music doesn't. Sciences and Engineering aren't culture-specific (although the priorities and emphases are), but the U.S. is large enough to provide a broad knowledge base and a fairly representative sample of research, and English is the closest thing to a world engineering language.

Some of it's probably harmless: a sociological curiosity.

Some of it's fixable, if we can find editors with the complementary knowledge and interests we need. The goal from the beginning was to index the "sum of HUMAN knowledge." It hasn't been achieved yet.

But some of it is going to remain confusing. I have a hard time dealing with British educational sites -- the terms are too different: so are the semesters, probably. The Australian denominations confuse me: ours have to be more confusing to Aussies. Asian musical styles and modes are probably eternally beyond my comprehension. Japanese literature. Italian political parties ... so long as there's something here to confuse everybody, we're probably doing as well as we can.

Dajuroka

10:54 am on Sep 22, 2003 (gmt 0)

10+ Year Member



I find it interesting that NineMSN use the following top-level structure for their home page. I wonder if they used some system or just invented yet another. It is surprising how little correlation there is with the ODP at that top level.
Autos/boats/bikes
Computers & tech
Encyclopedia/study
Entertainment
Finance
Health & lifestyle
Jobs/careers
Kids
Magazines
Mobile central
News/current affairs
People & chat
Property & listings
Shopping
Sports
Travel
TV shows

hutcheson

2:08 pm on Sep 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Choster mentioned (in the other thread) that the top-level ODP categories are for historical reasons not so much USA-centric as geek-centric. It's based on what geeks were talking about on the net back before affiliate spam was invented. (And since they have been frozen from the beginning, there are oddities which nobody would repeat if redesigning from scratch -- geeks have their own blind spots, even besides a predilection for not-affiliate-spammy stuff.)

The ODP-designed taxonomy really doesn't begin until the second level (which has been extensively debated and occasionally modified, and wasn't really frozen till a couple of years ago: we tend to compensate for the oddities of the first level by lots of @links.) I should also mention that the freeze of the ODP's first level means that our licensees can (and some do) create their own different first level layout.

In contrast, it's easy enough to see (and hardly surprising) that NineMSN is laid out like a newspaper's classified ads or a TV station's program categories -- it's definitely designed by marketroids around what kind of content Microsoft hopes to sell to customers. I'd call it "mammon-centered" or "couch-potato-targeted."

[The Dewey Decimal System has its own oddities. IIRC, "Religion and Philosophy" is divided into 8 Christianity categories, 1 Philosophy category, and 1 "all other religions" catchall. That's valid based on what Melville was actually seeing in his library at the time, although a Chinese or Indian library might well benefit from a different breakdown. I edit in "Religion", and I think it interesting that Christianity, Islam, and Scientology are third-level ODP categories -- higher than in any of the other major directories, although not as high as in the DDS, where Religion is a first-level category. Its size would justify top-level at the ODP also, for that matter.]

Comparing online directories, the ODP's layout is more like Yahoo's -- another geek-founded project. But the comparison of MammonM$N and the ODP might be more interesting if you looked at the second-level categories. You could probably create a fairly good correlation even between MM$N and the Geekdirs if you picked and chose from second-to-fourth-level categories.

Dajuroka

6:46 am on Sep 23, 2003 (gmt 0)

10+ Year Member



So is there a place for starting to redefine for the 'future', or do we all just live with many systems and wish for one universal categorisation? It is interesting: in my 'real' life I work in health and am active nationally (Australia) in developing a family of health classifications (as is WHO). Many of them, such as the International Classification of Disease and the International Classification of Functioning, would also apply (at an increasingly atomic level) under 'Health'. I know that the USA and UK have locked on to SNOMED CT for a terminology.

Anyhow life would certainly be easier if the web could use one system. I am sure 70% of currently classified URLs would not be disputed ie Golf is Golf (unless you are shopping then maybe it is Shopping Golf.... damn there it goes again...)

Has anyone created a straight text file of the ODP Categories (ie .txt .doc)? The RDF is very complex and a bugger to download when you have a capped download system; it never seems to stop!

Are there any other forums where these issues are being debated?

tombola

8:08 am on Sep 23, 2003 (gmt 0)

10+ Year Member



Has anyone created a straight text file of the ODP Categories (ie .txt .doc)

I have. It's a .txt file of 35 MB and it contains only the names of the categories (last update: February 2003).

Send me a sticky mail if you have an idea how to make this file available to you ;-)
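
For anyone who would rather build such a plain-text list from the RDF dump themselves, here is a minimal Python sketch. It assumes each category in the ODP structure dump opens with a line like `<Topic r:id="Top/Arts/Movies">`; that pattern, the function names `category_paths` and `write_category_list`, and any file names passed in are assumptions for illustration and should be checked against the actual dump. It reads one line at a time, so the file never has to fit in memory, but it does nothing about the size of the download itself.

```python
import html
import re
from typing import Iterable, Iterator

# Assumption: a category starts with a line such as
#   <Topic r:id="Top/Arts/Movies">
# Verify this against the real dump before relying on it.
TOPIC_RE = re.compile(r'<Topic\s+r:id="([^"]*)"')


def category_paths(lines: Iterable[str]) -> Iterator[str]:
    """Yield one category path per matching <Topic ...> line."""
    for line in lines:
        match = TOPIC_RE.search(line)
        if match and match.group(1):  # skip entries with an empty id
            # Undo XML escaping such as &amp; in category names.
            yield html.unescape(match.group(1))


def write_category_list(rdf_path: str, txt_path: str) -> int:
    """Write one category path per line to txt_path; return the count."""
    count = 0
    with open(rdf_path, encoding="utf-8", errors="replace") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for path in category_paths(src):
            dst.write(path + "\n")
            count += 1
    return count
```

Calling `write_category_list` with the path of the downloaded dump and an output file name would produce a list with one category path per line. A regular expression is used here instead of a full RDF parser only to keep the sketch short; a proper XML parser would be more robust if the layout differs from the assumption above.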