We present a study of Native Language Identification (NLI) using data from learners of Norwegian, a language not previously used for this task. NLI is the task of predicting a writer’s first language (L1) using only their writing in a learned language. We find that three feature types are useful here: function words, part-of-speech n-grams, and a hybrid n-gram model that mixes part-of-speech tags with function words. Our system achieves an accuracy of 79% in predicting an author’s L1, against a baseline of 13%. The same features can distinguish non-native from native writing with 99% accuracy. We also find that part-of-speech n-gram performance on this data deviates from previous NLI results, possibly due to the use of manually post-corrected tags.
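The three feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function-word list, the tag names, the example sentence, and all function names are hypothetical, and the hybrid feature assumes one common construction in which function words are kept as surface forms while all other tokens are replaced by their part-of-speech tags.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption);
# a real list for Norwegian would be far longer.
FUNCTION_WORDS = {"jeg", "og", "i", "på", "det", "er", "en", "til"}


def ngrams(seq, n):
    """Return all contiguous n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def function_word_counts(tagged):
    """Count occurrences of each function word in a tagged sentence."""
    return Counter(w.lower() for w, _ in tagged if w.lower() in FUNCTION_WORDS)


def pos_ngrams(tagged, n=2):
    """Count n-grams over the part-of-speech tag sequence."""
    return Counter(ngrams([tag for _, tag in tagged], n))


def hybrid_ngrams(tagged, n=2):
    """Count n-grams over a mixed sequence: function words stay as
    lowercased words, every other token becomes its POS tag."""
    mixed = [w.lower() if w.lower() in FUNCTION_WORDS else tag
             for w, tag in tagged]
    return Counter(ngrams(mixed, n))


# Hypothetical tagged learner sentence; the tag names are illustrative only.
tagged = [("Jeg", "PRON"), ("bor", "VERB"), ("i", "ADP"), ("Oslo", "PROPN"),
          ("og", "CCONJ"), ("liker", "VERB"), ("det", "PRON")]

if __name__ == "__main__":
    print(function_word_counts(tagged))
    print(pos_ngrams(tagged, 2))
    print(hybrid_ngrams(tagged, 2))
```

In a classification setup, counts like these would typically be turned into per-document feature vectors and passed to a standard classifier. The abstract does not specify that step, so it is left out here.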
|Number of pages||9|
|Journal||RANLP 2015 : International Conference Recent Advances in Natural Language Processing : proceedings|
|Publication status||Published - 2015|
|Event||International Conference Recent Advances in Natural Language Processing (10th : 2015) - Hissar, Bulgaria|
|Duration||7 Sep 2015 → 9 Sep 2015|