Topic modeling for native language identification

Sze-Meng Jojo Wong, Mark Dras, Mark Johnson

Research output: Contribution to journal > Conference paper > peer-review

14 Citations (Scopus)
39 Downloads (Pure)


Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several pieces of earlier work have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps representing characteristic errors of different native language speakers. This paper looks at the idea of using Latent Dirichlet Allocation as a feature clustering technique over lexical features to see whether there is any evidence that these smaller-scale features do cluster into more coherent latent factors, and investigates their effect in a classification task. We find that although (not unexpectedly) classification accuracy decreases, there is some evidence of coherent clustering, which could help with much larger syntactic feature spaces.
Original language: English
Pages (from-to): 115-124
Number of pages: 10
Journal: Proceedings of the Australasian Language Technology Association Workshop 2011
Publication status: Published - 2011
Event: Australasian Language Technology Association Workshop - Canberra
Duration: 1 Dec 2011 to 2 Dec 2011

Bibliographical note

Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

