TY - JOUR
T1 - Topic modeling for native language identification
AU - Wong, Sze-Meng Jojo
AU - Dras, Mark
AU - Johnson, Mark
N1 - Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2011
Y1 - 2011
N2 - Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several pieces of earlier work have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps representing characteristic errors of different native language speakers. This paper looks at the idea of using Latent Dirichlet Allocation as a feature clustering technique over lexical features to see whether there is any evidence that these smaller-scale features do cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
AB - Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several pieces of earlier work have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps representing characteristic errors of different native language speakers. This paper looks at the idea of using Latent Dirichlet Allocation as a feature clustering technique over lexical features to see whether there is any evidence that these smaller-scale features do cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
M3 - Conference paper
SN - 1834-7037
SP - 115
EP - 124
JO - Proceedings of the Australasian Language Technology Association Workshop 2011
JF - Proceedings of the Australasian Language Technology Association Workshop 2011
T2 - Australasian Language Technology Association Workshop
Y2 - 1 December 2011 through 2 December 2011
ER -