Multilingual native language identification

Research output: Contribution to journal > Article > Research > peer-review

Abstract

We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author's first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
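The last two analyses in the abstract (an oracle upper bound over per-feature classifiers, and pairwise feature diversity from an association measure for binary data) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not name the association measure, so Yule's Q over per-item correctness is assumed here purely as one common choice, and the function names, feature-type names, and toy labels are all hypothetical.

```python
from typing import Dict, Sequence


def oracle_accuracy(predictions: Dict[str, Sequence[str]], gold: Sequence[str]) -> float:
    """Upper bound for combining classifiers: an item counts as correct
    if at least one per-feature classifier predicts its gold label."""
    hits = sum(
        1
        for i, label in enumerate(gold)
        if any(preds[i] == label for preds in predictions.values())
    )
    return hits / len(gold)


def yules_q(preds_a: Sequence[str], preds_b: Sequence[str], gold: Sequence[str]) -> float:
    """Yule's Q on the 2x2 table of per-item correctness (assumed measure).
    Values near +1 mean the two classifiers succeed and fail on the same
    items; values near 0 or below suggest complementary features."""
    both = neither = only_a = only_b = 0
    for a, b, g in zip(preds_a, preds_b, gold):
        a_ok, b_ok = a == g, b == g
        if a_ok and b_ok:
            both += 1
        elif a_ok:
            only_a += 1
        elif b_ok:
            only_b += 1
        else:
            neither += 1
    denominator = both * neither + only_a * only_b
    if denominator == 0:
        raise ValueError("Yule's Q is undefined for this contingency table")
    return (both * neither - only_a * only_b) / denominator


# Hypothetical toy data: gold first languages and two per-feature classifiers.
gold = ["de", "fr", "es", "it", "de", "fr"]
predictions = {
    "word_ngrams": ["de", "fr", "it", "it", "fr", "es"],
    "pos_ngrams": ["de", "es", "es", "it", "de", "es"],
}

upper_bound = oracle_accuracy(predictions, gold)
diversity = yules_q(predictions["word_ngrams"], predictions["pos_ngrams"], gold)
print(f"oracle accuracy: {upper_bound:.3f}, Yule's Q: {diversity:.3f}")
```

The oracle needs the gold labels to pick a winner, so it cannot be used as a classifier. It only bounds what any combination of these per-feature classifiers could reach.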

Language: English
Pages: 163-215
Number of pages: 53
Journal: Natural Language Engineering
Volume: 23
Issue number: 2
DOI: 10.1017/S1351324915000406
Publication status: Published - 1 Mar 2017

Cite this

@article{6e51dd13ed5e485ab7677050ac8d1e88,
title = "Multilingual native language identification",
abstract = "We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author's first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.",
author = "Shervin Malmasi and Mark Dras",
year = "2017",
month = "3",
day = "1",
doi = "10.1017/S1351324915000406",
language = "English",
volume = "23",
pages = "163--215",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "2",
}

Multilingual native language identification. / Malmasi, Shervin; Dras, Mark.

In: Natural Language Engineering, Vol. 23, No. 2, 01.03.2017, p. 163-215.

TY  - JOUR
T1  - Multilingual native language identification
AU  - Malmasi, Shervin
AU  - Dras, Mark
PY  - 2017/3/1
Y1  - 2017/3/1
AB  - We present the first comprehensive study of Native Language Identification (NLI) applied to text written in languages other than English, using data from six languages. NLI is the task of predicting an author's first language using only their writings in a second language, with applications in Second Language Acquisition and forensic linguistics. Most research to date has focused on English but there is a need to apply NLI to other languages, not only to gauge its applicability but also to aid in teaching research for other emerging languages. With this goal, we identify six typologically very different sources of non-English second language data and conduct six experiments using a set of commonly used features. Our first two experiments evaluate our features and corpora, showing that the features perform well and at similar rates across languages. The third experiment compares non-native and native control data, showing that they can be discerned with 95 per cent accuracy. Our fourth experiment provides a cross-linguistic assessment of how the degree of syntactic data encoded in part-of-speech tags affects their efficiency as classification features, finding that most differences between first language groups lie in the ordering of the most basic word categories. We also tackle two questions that have not previously been addressed for NLI. Other work in NLI has shown that ensembles of classifiers over feature types work well and in our final experiment we use such an oracle classifier to derive an upper limit for classification accuracy with our feature set. We also present an analysis examining feature diversity, aiming to estimate the degree of overlap and complementarity between our chosen features employing an association measure for binary data. Finally, we conclude with a general discussion and outline directions for future work.
UR  - http://www.scopus.com/inward/record.url?scp=84948975834&partnerID=8YFLogxK
U2  - 10.1017/S1351324915000406
DO  - 10.1017/S1351324915000406
M3  - Article
VL  - 23
SP  - 163
EP  - 215
JO  - Natural Language Engineering
T2  - Natural Language Engineering
JF  - Natural Language Engineering
SN  - 1351-3249
IS  - 2
ER  -