Automating Document Categorization

         

Gibble

9:53 pm on Mar 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As part of a knowledge management project I'm working on, I'm trying to find a way to automate the document categorization of uploaded documents.

i.e., if someone uploads a .doc, .htm, etc. on bicycles, the "script" will know it's about bicycles and file it accordingly.

I've been trying to figure out how to implement a Naive Bayesian Filter, or something else that would work, but everything I read on it is just a broad overview, and nothing seems to simply explain how to implement such a thing.

Does anyone have any ideas, or solid resources on this topic?

Thanks for the help :)

Birdman

2:46 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All you need to do is add the category selection to your upload form. Then in your upload script you use that value to alter the directory/folder the file gets saved to.

Your best bet is to build a dynamic, database driven site using PHP [php.net] / MySQL [mysql.com] combo. Here is a tutorial on building a dynamic website [hotwired.lycos.com].

Gibble

2:48 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hehehe

Believe me, I know how to make dynamic sites, and I could add a category dropdown, but I don't want to. This system is going to be as automated as possible. We have a lot of documents that we want to just send in and have categorized, then get a listing so we can verify it, rather than spending days and days categorizing everything manually.

brotherhood of LAN

3:08 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gibble, hope you're not looking to generate lots of pages ..... :)

Do a search on "wordnet" or "automatic hyponymy", and there are lots of PDFs about this stuff. I've read a few; you'd be lucky to get something that a computer could do all by itself. You need to hand-feed it facts.

wordnet covers a few main topics and has some other useful stuff that might fit into your plans, otherwise you're going to have to autogenerate it by hand :)

this one sounds good "Thesaurus as a Tool for Automatic Detection of Lexical Cohesion"....happy reading :)

Gibble

3:22 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I realize it will need a sample set to work from for each category, and I don't mind feeding it some documents and telling it where to put them to get it started, but in time it should learn enough.

From what I understand it works something like this.

You give it 10 documents as a sample set for the category cars. It notices that the words car, automobile, ford, chevy, appear XX% of the time if the document belongs in this category. You send it another document, it then checks the word frequency in that document of these key words and returns a probability as to whether that document applies to the category.

The problem is converting the logic to code, I know it's been done, but do you think I can figure out HOW to do it myself?
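The logic described above (learn word frequencies per category from a sample set, then score a new document against each category) can be sketched roughly as follows. This is only a minimal illustration of a multinomial Naive Bayes classifier, not a tested production filter. The class name, the `tokenize` helper, the add-one smoothing, and the sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()                  # every word seen during training

    def train(self, category, text):
        """Add one sample document to a category."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category) for each category."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                # Add-one smoothing keeps an unseen word from zeroing the whole product.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[category] = score
        return scores

    def classify(self, text):
        """Return the category with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up sample set, one document per category.
clf = NaiveBayesClassifier()
clf.train("cars", "The Ford car is an automobile. Chevy makes a car too.")
clf.train("bicycles", "A bicycle has two wheels, pedals and a chain.")
print(clf.classify("I bought a new Chevy automobile"))
```

Working with sums of logarithms instead of multiplying raw probabilities avoids underflow on long documents. Calling `train` again with each newly filed document is one way to let the sample set grow over time.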

brotherhood of LAN

3:56 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>HOW

that's what I meant by it's near impossible to do it automatically.

THIS page is about you autogenerating a cat structure, but it wouldn't have mentioned the words I typed if I hadn't posted. If it hadn't mentioned "hyponymy" then you'd have to code the script to know that hyponymy is the same as "category structure".

wordnet has some text files, split into nouns, verbs, adverbs and adjectives. They have pointers between words that the program uses to distinguish relations between the words. If you look at their site, there is actually a PHP version somewhere, go hunt it out! ;)

For categorising the pages as they arrive, your best bet is simply to get a script to count how many times each word is used and, say, use the top 10% of words in the doc as a possible categorization of the doc.
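A rough sketch of that word-counting idea, assuming the document has already been converted to plain text. The function name, the tiny stop-word list, and the choice to take 10% of the distinct words are assumptions for illustration. Filtering stop words is an addition that was not suggested above; without it, words like "the" tend to dominate the counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "for"}


def top_keywords(text, fraction=0.10):
    """Return the most frequent non-stop words, keeping the top `fraction` of distinct words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    if not counts:
        return []
    # Always keep at least one keyword, even for very short documents.
    keep = max(1, int(len(counts) * fraction))
    return [word for word, _ in counts.most_common(keep)]


sample = "The bicycle shop sells bicycle parts and bicycle tyres for the road."
print(top_keywords(sample))
```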

Then when you have those prime 1 or 2 keywords, you can use the likes of wordnet to provide the relevant pointers to the page. Or maybe you could use Brett's Quick Rank [webmasterworld.com]

When you do get the major keywords from "on the page", you'll want to have them in some sort of cat structure, and not a cat for every word you get lumped with.....

An example....we are talking about cat structures, if the keywords on this page amounted to "hyponymy","text","category", then we could relate it to what's already known in the dictionary


The noun hyponymy has 1 sense (no senses from tagged texts)

1. hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)

1 sense of hyponymy

Sense 1
hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
=> semantic relation -- (a relation between meanings)

1 sense of hyponymy

Sense 1
hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
-> semantic relation -- (a relation between meanings)
=> hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
=> hypernymy, superordination -- (the semantic relation of being superordinate or belonging to a higher rank or class)
=> synonymy, synonymity, synonymousness -- (the semantic relation that holds between two words that can (in a given context) express the same meaning)
=> antonymy -- (the semantic relation that holds between two words that can (in a given context) express opposite meanings)
=> holonymy, whole to part relation -- (the semantic relation that holds between a whole and its parts)
=> meronymy, part to whole relation -- (the semantic relation that holds between a part and the whole)
=> troponymy -- (the semantic relation of being a manner of does something)

hope you find this info useful at least ;) I'm in the middle of making something like this myself, but am concentrating on the storage rather than the precision.

Gibble

4:04 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm...maybe I'm misinterpreting what you're saying, but it doesn't seem to be helping.

Let's simplify this. Say I have several directories of files, and the program treats each directory as a category. I send it another file. It reads the documents in each directory, compares the results with the document it's being sent, and returns the probability of the document fitting into each directory. It then places the document in the directory with the highest probability.
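The directory-per-category workflow described here might look something like the sketch below. It assumes plain-text files (a .doc would need converting first) and at least one category directory under the root. All function names are hypothetical, and `overlap_score` is only a placeholder to keep the example short. It is not a probability, and a Naive Bayes score could be passed in through the `score` parameter instead.

```python
import re
import shutil
from collections import Counter
from pathlib import Path


def words_in(path):
    """Read a plain-text file and return its lowercase word tokens."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return re.findall(r"[a-z]+", text.lower())


def category_profiles(root):
    """Build one word-count profile per subdirectory of `root`."""
    profiles = {}
    for directory in Path(root).iterdir():
        if directory.is_dir():
            counts = Counter()
            for file in directory.iterdir():
                if file.is_file():
                    counts.update(words_in(file))
            profiles[directory.name] = counts
    return profiles


def overlap_score(doc_counts, profile):
    """Placeholder score: share of the document's words that also occur in the category."""
    total = sum(doc_counts.values())
    if total == 0:
        return 0.0
    matched = sum(n for word, n in doc_counts.items() if word in profile)
    return matched / total


def file_document(path, root, score=overlap_score):
    """Score `path` against every category directory and move it into the best one."""
    doc_counts = Counter(words_in(path))
    profiles = category_profiles(root)
    scores = {name: score(doc_counts, profile) for name, profile in profiles.items()}
    best = max(scores, key=scores.get)
    shutil.move(str(path), str(Path(root) / best / Path(path).name))
    return best, scores
```

Returning the full `scores` dictionary alongside the chosen directory makes it easy to produce a listing for manual verification afterwards. Because the file is moved into the winning directory, it becomes part of that category's sample set the next time the profiles are rebuilt.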

brotherhood of LAN

4:11 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



ah, so you're going to have directories..geez, I thought you wanted them autogenerated ;)

check out Brett's quick rank, if a "word" or "phrase" ranks high, then it should go into the appropriate category.

apologies, I thought you wanted to autogenerate the categories as well.

Gibble

4:36 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Autogenerated Categories would be nice, maybe in phase two, but not yet.

The problem with a word or phrase, is there can be several, and they could be unexpected, I need a way to read a sample of documents on a single category and come up with a list of words and phrases that indicate something belongs in that category and what the probability is.

The key is probabilities. Something could fit into several categories/directories, and theoretically the directory that calculates to the highest probability is where that document should get stored (which in turn increases the sample size for that category and makes subsequent checks more accurate).
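As a small illustration of comparing a document across several categories: if each category yields a log score (for example from a Naive Bayes calculation), the scores can be normalized into probabilities that sum to 1, and the highest one picked. The function name and the numbers below are made up for the example.

```python
import math


def to_probabilities(log_scores):
    """Turn per-category log scores into probabilities that sum to 1."""
    highest = max(log_scores.values())
    # Subtracting the maximum before exponentiating avoids floating-point underflow.
    weights = {cat: math.exp(s - highest) for cat, s in log_scores.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


scores = {"cars": -42.0, "bicycles": -44.0, "boats": -50.0}  # made-up log scores
probabilities = to_probabilities(scores)
best = max(probabilities, key=probabilities.get)
print(best, probabilities)
```

A document whose top two probabilities are close together could be flagged for manual review rather than filed automatically.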

brotherhood of LAN

4:57 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>probabilities

yes, Brett's quick rank will do that. Don't give words or phrases scores, give them probabilities, same thing to me - all relative right? :)

>The problem with a word or phrase, is there can be several, and they could be unexpected, I need a way to read a sample of documents on a single category and come up with a list of words and phrases that indicate something belongs in that category and what the probability is.

wordnet (again) has a 90-page PDF explaining that this is the biggest barrier to automation. IMHO you can work out all the probabilities in the world, but you won't get a computer to understand that "me" means the same as "I". Someone has to do that sort of donkey work.

my 0.02 is to check out wordnet HTH....

//added
This is a handy reference, somewhat on topic and has a few suggestions for tips and tricks to get what you need.
Automatic Hypertext Link Generation [fundp.ac.be]

Gibble

5:15 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually that document is pretty good.

brotherhood of LAN

5:22 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check out the conclusion though. Maybe one part of a couple-of-pieces jigsaw.

"semi-automated" .... still got some work to go.

I noticed the doc mentioned neural networks, jatar posted a doc about these the other day. Here it is [freebsd.mu]

//
just as a sidenote, the makers of wordnet are "lexicographers", or "psycholinguists"...all 3 docs are somewhat related..of course, it would be nice to have a script to tell you how closely they are related :)

didn't mean to hijack the thread...good subject though.

Gibble

5:24 pm on Mar 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



hijack all you want, it's still on topic and every bit helps.

I do agree though, I am only looking at a small subset of the entire problem. Programming 101: divide and conquer.