Using wiktionary as a data source

8:22 am on Aug 12, 2011 (gmt 0)

5+ Year Member

My brain hurts! I'm trying to work out whether a Wiktionary dump could be used as a data source or not.

I'm talking about extracting and utilising the "facts" contained in the dump. So, not the definitions (which I believe would be covered by the licence [wikimediafoundation.org]) but other information such as word pronunciation, part of speech (noun, verb, etc.) and plurals.

Three issues as far as I can see.

* Given that facts aren't covered by copyright, the licence doesn't grant any rights (?)...
* ...but what about errors unique to wiktionary?
* Some countries do grant certain rights to compilations. How does this factor here?

To try and forestall some unnecessary tangents (not that I've ever seen this work ;) )

* I'm aware that posters can't give legal advice, only their opinion etc, etc.
* I'm not looking to hide the source of the data. Either I can do this legally, or not.
* I'm not looking to put the resulting application behind a paywall. It would be freely usable. However, selections of the data chosen by users may count as potentially identifying data (see the AOL debacle), so data generated by the app couldn't be shared.
* I'm aware of, and already using, both The CMU Pronouncing Dictionary [speech.cs.cmu.edu] and WordNet [wordnet.princeton.edu].
* I'm looking to use the data in the dump, not an analysis of the data.
* I don't need the data to be 100% accurate. Coverage is more important.
* I'm aware that parsing the information is likely to be non-trivial, but I think I'm up to it (plus see the previous point).
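
To give a rough idea of the parsing involved, here is a minimal Python sketch that pulls part-of-speech headings and IPA pronunciations out of the wikitext of a single entry. It assumes entries use `===Noun===`-style section headings and `{{IPA|...}}` templates. The names `extract_facts` and `POS_HEADINGS` are hypothetical, the heading list is deliberately incomplete, and real dump entries vary far more than this handles. Reading the pages out of the dump's XML is left out entirely.

```python
import re

# Assumed, incomplete set of part-of-speech heading names.
POS_HEADINGS = {
    "Noun", "Verb", "Adjective", "Adverb",
    "Pronoun", "Preposition", "Conjunction", "Interjection",
}

# A wikitext heading: the same run of '=' on both sides of the title.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

# An un-nested {{IPA|...}} template; the argument layout is an assumption.
IPA_TEMPLATE_RE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")


def extract_facts(wikitext):
    """Return part-of-speech headings and IPA strings found in one entry."""
    headings = [m.group(2) for m in HEADING_RE.finditer(wikitext)]
    pos = [h for h in headings if h in POS_HEADINGS]

    ipa = []
    for m in IPA_TEMPLATE_RE.finditer(wikitext):
        for arg in m.group(1).split("|"):
            arg = arg.strip()
            # Keep only arguments that look like transcriptions,
            # skipping language codes and named parameters.
            if arg.startswith("/") or arg.startswith("["):
                ipa.append(arg)

    return {"pos": pos, "ipa": ipa}


if __name__ == "__main__":
    sample = (
        "==English==\n"
        "===Pronunciation===\n"
        "* {{IPA|en|/kat/}}\n"
        "===Noun===\n"
        "# A made-up definition.\n"
    )
    print(extract_facts(sample))
```

Since coverage matters more to me than accuracy, a loose pattern match like this seems a reasonable starting point, with anything it can't recognise simply skipped.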


5:31 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

What about a duplicate content issue?

Google keeps track of when a page is first crawled, so your site would be viewed as a duplicate, correct?

brotherhood of LAN

5:33 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

"I'm aware of, and already using, both The CMU Pronouncing Dictionary [speech.cs.cmu.edu] and WordNet [wordnet.princeton.edu]."

OpenCyc is another one.


7:51 am on Aug 15, 2011 (gmt 0)

5+ Year Member

"What about a duplicate content issue?"

I'm not looking to duplicate the pages; I understand the implications of the licence if I were doing that (I think!).

I've got a word-based web app that's hungry for data (particularly pronunciations and word stems).

"OpenCyc is another one."

Not sure if that one would be practical for what I'm looking to do right now. It may, however, be possible to do some very interesting things with it later on. Thanks.


3:09 pm on Aug 15, 2011 (gmt 0)

10+ Year Member

You need to speak with an attorney regarding that issue.
