Abstract
We present the first empirical study of distinguishing Persian and Dari texts at the sentence level, using discriminative models. As Dari is a low-resource language, we developed a corpus of 28k sentences (14k per language) for this task. Using character and word n-grams as features, a classifier ensemble distinguishes the two languages with 96% accuracy. Out-of-domain cross-corpus evaluation was conducted to test the discriminative models’ generalizability, achieving 87% accuracy in classifying 79k sentences from the Uppsala Persian Corpus. A feature analysis revealed lexical, morphological and orthographic differences between the two classes. A number of directions for future work are discussed.
Original language | English |
---|---|
Pages | 53-58 |
Number of pages | 6 |
Publication status | Published - 2015 |
Event | International Conference of the Pacific Association for Computational Linguistics (14th : 2015) - Bali, Indonesia. Duration: 19 May 2015 → 21 May 2015 |
Conference
Conference | International Conference of the Pacific Association for Computational Linguistics (14th : 2015) |
---|---|
City | Bali, Indonesia |
Period | 19/05/15 → 21/05/15 |
Keywords
- Language Identification
- Dialect Identification
- Persian
- Farsi
- Dari