My brain hurts! I'm trying to work out whether a Wiktionary dump could be used as a data source or not.
I'm talking about extracting and utilising the "facts" contained in the dump. So, not the definitions (which I believe would be covered by the
licence [wikimediafoundation.org]) but other information such as pronunciation, part of speech (noun, verb, etc.), plurals and so on.
There are three issues as far as I can see.
* Given that facts aren't covered by copyright, the licence doesn't grant any rights (?)...
* ...but what about errors unique to Wiktionary?
* Some countries do grant certain rights to compilations. How does this factor here?
To try and forestall some unnecessary tangents (not that I've ever seen this work ;) )
* I'm aware that posters can't give legal advice, only their opinion etc, etc.
* I'm not looking to hide the source of the data. Either I can do this legally, or not.
* I'm not looking to put the resulting application behind a paywall. It would be freely usable. However, selections of the data chosen by users may count as potentially identifying data (see the AOL debacle), so data generated by the app couldn't be shared.
* I'm aware of, and already using, both the CMU Pronouncing Dictionary [speech.cs.cmu.edu] and WordNet [wordnet.princeton.edu].
* I'm looking to use the data in the dump, not an analysis of the data.
* I don't need the data to be 100% accurate. Coverage is more important.
* I'm aware that parsing the information is likely to be non-trivial, but I think I'm up to it (plus see the previous point).