TY - JOUR
T1 - WordSeg
T2 - standardizing unsupervised word form segmentation from text
AU - Bernard, Mathieu
AU - Thiolliere, Roland
AU - Saksida, Amanda
AU - Loukatou, Georgia R.
AU - Larsen, Elin
AU - Johnson, Mark
AU - Fibla, Laia
AU - Dupoux, Emmanuel
AU - Daland, Robert
AU - Cao, Xuan Nga
AU - Cristia, Alejandrina
PY - 2020/2
Y1 - 2020/2
AB - A basic task in first language acquisition likely involves discovering the boundaries between words or morphemes in input where these basic units are not overtly segmented. A number of unsupervised learning algorithms have been proposed in the last 20 years for these purposes, some of which have been implemented computationally, but whose results remain difficult to compare across papers. We created a tool that is open source, enables reproducible results, and encourages cumulative science in this domain. WordSeg has a modular architecture: It combines a set of corpora description routines, multiple algorithms varying in complexity and cognitive assumptions (including several that were not publicly available, or insufficiently documented), and a rich evaluation package. In the paper, we illustrate the use of this package by analyzing a corpus of child-directed speech in various ways, which further allows us to make recommendations for experimental design of follow-up work. Supplementary materials allow readers to reproduce every result in this paper, and detailed online instructions further enable them to go beyond what we have done. Moreover, the system can be installed within container software that ensures a stable and reliable environment. Finally, by virtue of its modular architecture and transparency, WordSeg can work as an open-source platform, to which other researchers can add their own segmentation algorithms.
KW - Unsupervised word discovery
KW - First language acquisition
KW - Natural language processing
KW - Cumulative science
UR - http://www.scopus.com/inward/record.url?scp=85064436348&partnerID=8YFLogxK
U2 - 10.3758/s13428-019-01223-3
DO - 10.3758/s13428-019-01223-3
M3 - Article
C2 - 30937845
AN - SCOPUS:85064436348
SN - 1554-351X
VL - 52
SP - 264
EP - 278
JO - Behavior Research Methods
JF - Behavior Research Methods
IS - 1
ER -