Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Earlier work has found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps because they capture characteristic errors made by speakers of different native languages. This paper investigates using Latent Dirichlet Allocation (LDA) as a feature clustering technique over lexical features, to see whether there is any evidence that these smaller-scale features cluster into more coherent latent factors, and examines the effect of the resulting clusters in a classification task. We find that although classification accuracy decreases (not unexpectedly), there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
|Number of pages|10|
|Journal|Proceedings of the Australasian Language Technology Association Workshop 2011|
|Publication status|Published - 2011|
|Event|Australasian Language Technology Association Workshop - Canberra (Duration: 1 Dec 2011 → 2 Dec 2011)|