
Webmaster General Forum

    
Using wiktionary as a data source
My brain hurts!
Status_203




msg:4350683
 8:22 am on Aug 12, 2011 (gmt 0)

My brain hurts! I'm trying to work out whether a Wiktionary dump could be used as a data source or not.

I'm talking about extracting and utilising the "facts" contained in the dump. So, not the definitions (which I believe would be covered by the licence [wikimediafoundation.org]) but other information such as word pronunciation, part of speech (noun, verb, etc.), plurals and so on.

Three issues as far as I can see.

* Given that facts aren't covered by copyright, the licence doesn't grant any rights (?)...
* ...but what about errors unique to wiktionary?
* Some countries do grant certain rights to compilations. How does this factor here?

To try and forestall some unnecessary tangents (not that I've ever seen this work ;) )

* I'm aware that posters can't give legal advice, only their opinion etc, etc.
* I'm not looking to hide the source of the data. Either I can do this legally, or not.
* I'm not looking to put the resulting application behind a paywall. It would be freely usable. However, selections of the data chosen by users may count as potentially identifying data (see the AOL debacle), so data generated by the app couldn't be shared.
* I'm aware of, and already using, both The CMU Pronouncing Dictionary [speech.cs.cmu.edu] and WordNet [wordnet.princeton.edu].
* I'm looking to use the data in the dump, not an analysis of the data.
* I don't need the data to be 100% accurate. Coverage is more important.
* I'm aware that parsing the information is likely to be non-trivial, but I think I'm up to it (plus see the previous point).
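
For anyone wondering what "extracting the facts" might look like in practice, here is a rough Python sketch that pulls part-of-speech headings and IPA pronunciations out of the wikitext of a single entry. It is only an illustration under assumptions: the function name `extract_facts`, the `POS_HEADINGS` set and the sample entry are made up for this post, and it assumes pronunciations appear in an `{{IPA|...}}` template under a level-2 language heading. Real dump entries vary a lot, so treat this as a starting point rather than a working parser.

```python
import re

# Part-of-speech headings to look for (an illustrative, incomplete set).
POS_HEADINGS = {"Noun", "Verb", "Adjective", "Adverb", "Pronoun", "Preposition"}

# Matches wikitext headings such as "==English==" or "===Noun===".
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# Matches a simple, non-nested {{IPA|...}} template and captures its arguments.
IPA_RE = re.compile(r"\{\{IPA\|([^{}]*)\}\}")


def extract_facts(wikitext, language="English"):
    """Collect part-of-speech headings and IPA strings from one language section."""
    facts = {"pos": [], "ipa": []}
    in_language = False
    for line in wikitext.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            level, title = len(heading.group(1)), heading.group(2)
            if level == 2:
                # A level-2 heading starts a new language section.
                in_language = (title == language)
            elif in_language and title in POS_HEADINGS and title not in facts["pos"]:
                facts["pos"].append(title)
            continue
        if in_language:
            for args in IPA_RE.findall(line):
                for arg in args.split("|"):
                    # Keep only arguments that look like transcriptions,
                    # skipping language codes and named parameters.
                    if arg.startswith("/") or arg.startswith("["):
                        facts["ipa"].append(arg)
    return facts


# Invented sample entry, not copied from a real dump.
SAMPLE = """==English==
===Pronunciation===
* {{IPA|/kæt/|lang=en}}
===Noun===
{{en-noun}}
# A domesticated feline.
==Dutch==
===Verb===
# Belongs to another language section and is ignored.
"""

print(extract_facts(SAMPLE))
```

Since coverage matters more than accuracy here, a loose pattern match like this may be enough to begin with; entries it cannot handle simply yield empty lists.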

 

Hoople




msg:4350891
 5:31 pm on Aug 12, 2011 (gmt 0)

What about a duplicate content issue?

G keeps track of when a page is first crawled, so your site would be viewed as a duplicate, correct?

brotherhood of LAN




msg:4350896
 5:33 pm on Aug 12, 2011 (gmt 0)

I'm aware of, and already using, both The CMU Pronouncing Dictionary [speech.cs.cmu.edu] and WordNet [wordnet.princeton.edu].


OpenCyc is another one.
[opencyc.org...]

Status_203




msg:4351526
 7:51 am on Aug 15, 2011 (gmt 0)

What about a duplicate content issue?


I'm not looking to duplicate the pages; I understand the implications of the licence if I were doing that (I think!).

I've got a word-based web app that's hungry for data (particularly pronunciations and word stems).

OpenCyc is another one.


Not sure if that one would be practical for what I'm looking to do right now. It may, however, be possible to do some very interesting things with it later on. Thanks.

eventus




msg:4351621
 3:09 pm on Aug 15, 2011 (gmt 0)

You need to speak with an attorney regarding that issue.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved