i.e. if someone uploads a .doc, .htm, etc. about bicycles, the "script" will know it's about bicycles and file it accordingly.
I've been trying to figure out how to implement a Naive Bayesian Filter, or something else that would work, but everything I read on it is just a broad overview, and nothing seems to simply explain how to implement such a thing.
Does anyone have any ideas, or solid resources on this topic?
Thanks for the help :)
Your best bet is to build a dynamic, database driven site using PHP [php.net] / MySQL [mysql.com] combo. Here is a tutorial on building a dynamic website [hotwired.lycos.com].
Believe me, I know how to make dynamic sites, and I could add a category dropdown, but I don't want to. This system is going to be as automated as possible: we have a lot of documents that we want to just send in and have categorized, then get a listing so we can verify it, rather than spending days and days categorizing everything manually.
Do a search on "wordnet" or "automatic hyponymy"; there are lots of PDFs about this stuff. I've read a few, and you'd be lucky to get something that a computer could do all by itself - you need to hand feed it facts.
wordnet covers a few main topics and has some other useful stuff that might fit into your plans, otherwise you're going to have to autogenerate it by hand :)
this one sounds good "Thesaurus as a Tool for Automatic Detection of Lexical Cohesion"....happy reading :)
From what I understand it works something like this.
You give it 10 documents as a sample set for the category cars. It notices that the words car, automobile, ford, chevy, appear XX% of the time if the document belongs in this category. You send it another document, it then checks the word frequency in that document of these key words and returns a probability as to whether that document applies to the category.
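For what it's worth, here is a minimal Python sketch of that description, not a definitive implementation. It assumes the documents have already been converted to plain text, and it uses add-one (Laplace) smoothing so a word never seen in a category doesn't zero out the whole score. The names `tokenize` and `NaiveBayes` are made up for illustration, and the sample sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Hypothetical word-frequency classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of samples
        self.vocab = set()                       # every word seen in training

    def train(self, category, text):
        """Add one sample document to a category."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_score(self, category, text):
        """log P(category) + sum of log P(word | category) over known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[category] / total_docs)
        denom = sum(self.word_counts[category].values()) + len(self.vocab)
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[category][word] + 1) / denom)
        return score

    def classify(self, text):
        """Return the trained category with the highest score."""
        return max(self.doc_counts, key=lambda c: self.log_score(c, text))


nb = NaiveBayes()
nb.train("cars", "The Ford car is an automobile. Chevy makes a car too.")
nb.train("bicycles", "A bicycle has two wheels, pedals and a chain.")
print(nb.classify("My new car is a Chevy"))
```

Logs are summed instead of multiplying raw probabilities because the product of many small numbers underflows quickly on long documents.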
The problem is converting the logic to code. I know it's been done, but do you think I can figure out HOW to do it myself?
That's what I meant when I said it's near impossible to do automatically.
THIS page is about you autogenerating a cat structure, but it wouldn't have mentioned the words I typed if I hadn't posted. If it hadn't mentioned "hyponymy" then you'd have to code the script to know that hyponymy is the same as "category structure".
wordnet has some text files, split into nouns, verbs, adverbs and adjectives. They have pointers between words that the program uses to distinguish relations between the words. If you look at their site, there is actually a PHP version somewhere, go hunt it out! ;)
For categorising the pages as they arrive, your best bet is simply to get a script to count how many times each word is used and, say, use the most frequent 10% of words in the doc as a possible categorization of the doc.
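A rough Python sketch of that counting idea, under a couple of assumptions: the doc is already plain text, and common filler words are dropped first (otherwise "the" and "and" win every time). The `STOPWORDS` set and the `top_keywords` name are placeholders for illustration, not part of any library.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-made stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "this", "to", "with"}


def top_keywords(text, fraction=0.10):
    """Return the most frequent `fraction` of distinct words in `text`."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    # Always keep at least one keyword, even for very short documents.
    keep = max(1, math.ceil(len(counts) * fraction))
    return [word for word, _ in counts.most_common(keep)]


sample = ("The bicycle shop sells bicycle parts and bicycle tyres "
          "for every bicycle")
print(top_keywords(sample))
```

The 10% cut-off is just the figure suggested above; on real documents it would probably need tuning.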
Then when you have those prime 1 or 2 keywords, you can use the likes of wordnet to provide the relevant pointers to the page. Or maybe you could use Brett's Quick Rank [webmasterworld.com]
When you do get the major keywords from "on the page", you'll want to have them in some sort of cat structure, and not a cat for every word you get lumped with.....
An example: we are talking about cat structures. If the keywords on this page amounted to "hyponymy", "text", "category", then we could relate them to what's already known in the dictionary:
The noun hyponymy has 1 sense (no senses from tagged texts)
1. hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
1 sense of hyponymy
Sense 1
hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
    => semantic relation -- (a relation between meanings)
1 sense of hyponymy
Sense 1
hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
    -> semantic relation -- (a relation between meanings)
        => hyponymy, subordination -- (the semantic relation of being subordinate or belonging to a lower rank or class)
        => hypernymy, superordination -- (the semantic relation of being superordinate or belonging to a higher rank or class)
        => synonymy, synonymity, synonymousness -- (the semantic relation that holds between two words that can (in a given context) express the same meaning)
        => antonymy -- (the semantic relation that holds between two words that can (in a given context) express opposite meanings)
        => holonymy, whole to part relation -- (the semantic relation that holds between a whole and its parts)
        => meronymy, part to whole relation -- (the semantic relation that holds between a part and the whole)
        => troponymy -- (the semantic relation of being a manner of does something)
hope you find this info useful at least ;) I'm in the middle of making something like this myself, but am concentrating on the storage rather than the precision.
Let's simplify this. Say I have several directories of files, and the program treats each directory as a category. I send it another file; it reads the documents in each directory, compares them with the document it's being sent, returns the probability of the document fitting into each directory, and then places it in the directory with the highest probability.
The problem with a word or phrase is that there can be several, and they could be unexpected. I need a way to read a sample of documents on a single category and come up with a list of words and phrases that indicate something belongs in that category and what the probability is.
The key is probabilities: something could fit into several categories/directories, and theoretically the directory that calculates to the highest probability is where that document should get stored (which consequently increases the sample size for that category and makes subsequent checks more accurate).
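To make the directory idea concrete, here is a hedged Python sketch. It assumes every file is plain text (a .doc or .htm would need converting first), treats each subdirectory of a root folder as a category, scores the new file with a smoothed word-frequency model, and turns the scores into probabilities that sum to 1. The function names `category_probabilities` and `file_document` are invented for this example.

```python
import math
import re
import shutil
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def category_probabilities(root, text):
    """Return {directory name: probability} for `text`, using the files in
    each subdirectory of `root` as that category's sample documents."""
    counts, docs = {}, {}
    for folder in Path(root).iterdir():
        if not folder.is_dir():
            continue
        files = [f for f in folder.iterdir() if f.is_file()]
        if not files:
            continue  # an empty directory gives nothing to compare against
        bag = Counter()
        for f in files:
            bag.update(tokenize(f.read_text(errors="ignore")))
        counts[folder.name], docs[folder.name] = bag, len(files)
    if not counts:
        return {}
    vocab = set().union(*counts.values())
    total_docs = sum(docs.values())
    words = [w for w in tokenize(text) if w in vocab]
    log_scores = {}
    for name, bag in counts.items():
        denom = sum(bag.values()) + len(vocab)  # add-one smoothing
        score = math.log(docs[name] / total_docs)
        for w in words:
            score += math.log((bag[w] + 1) / denom)
        log_scores[name] = score
    # Subtract the best score before exponentiating to avoid underflow.
    top = max(log_scores.values())
    weights = {n: math.exp(s - top) for n, s in log_scores.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def file_document(root, path):
    """Move `path` into the most probable category directory under `root`."""
    path = Path(path)
    probs = category_probabilities(root, path.read_text(errors="ignore"))
    if not probs:
        raise ValueError("no category directories with sample files")
    best = max(probs, key=probs.get)
    shutil.move(str(path), str(Path(root) / best / path.name))
    return best, probs
```

Because the moved file lands in the category directory, it is read as a sample on the next call, which gives the growing sample size described above. Returning the full probability table also makes it easy to print a listing for manual verification.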
yes, Brett's quick rank will do that. Don't give words or phrases scores, give them probabilities, same thing to me - all relative right? :)
The problem with a word or phrase is that there can be several, and they could be unexpected. I need a way to read a sample of documents on a single category and come up with a list of words and phrases that indicate something belongs in that category and what the probability is.
wordnet (again) has a 90 page pdf explaining that this is the biggest barrier to automation. IMHO you can work out all the probabilities in the world, but you won't get a computer to understand that "me" means the same as "I". Someone has to do that sort of donkey work.
my 0.02 is to check out wordnet HTH....
//added
This is a handy reference, somewhat on topic and has a few suggestions for tips and tricks to get what you need.
Automatic Hypertext Link Generation [fundp.ac.be]
"semi-automated" .... still got some work to go.
I noticed the doc mentioned neural networks, jatar posted a doc about these the other day. Here it is [freebsd.mu]
//
just as a sidenote, those makers of wordnet are "lexicographers", or "psycholinguists"...all 3 docs are somewhat related..course, it would be nice to have a script to tell you how much they are related :)
didn't mean to hijack the thread...good subject though.