TY - JOUR
T1 - Multilingual native language identification
AU - Malmasi, Shervin
AU - Dras, Mark
PY - 2017/3/1
Y1 - 2017/3/1
N2 - We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author's first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
AB - We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author's first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
UR - http://www.scopus.com/inward/record.url?scp=84948975834&partnerID=8YFLogxK
U2 - 10.1017/S1351324915000406
DO - 10.1017/S1351324915000406
M3 - Article
AN - SCOPUS:84948975834
SN - 1351-3249
VL - 23
SP - 163
EP - 215
JO - Natural Language Engineering
JF - Natural Language Engineering
IS - 2
ER -