Abstract
We present the first empirical study of distinguishing Persian and Dari texts at the sentence level using discriminative models. Because Dari is a low-resource language, we developed a corpus of 28k sentences (14k per language) for this task. Using character and word n-grams as features, a classifier ensemble discriminates the two languages with 96% accuracy. Out-of-domain cross-corpus evaluation was conducted to test the discriminative models’ generalizability, achieving 87% accuracy in classifying 79k sentences from the Uppsala Persian Corpus. A feature analysis revealed lexical, morphological and orthographic differences between the two classes. A number of directions for future work are discussed.
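As a rough illustration of the feature types named in the abstract, the sketch below extracts character and word n-gram counts and combines several classifiers' labels by plurality vote. It is not the authors' implementation: the function names, the whitespace tokenization, the voting rule, and the `fa`/`prs` labels are all assumptions made for this example, since the abstract does not specify how the ensemble is built.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a sentence."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumes whitespace tokenization)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def majority_vote(predictions):
    """Return the label predicted by the most classifiers (plurality vote)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical usage: three classifiers each emit a label for one sentence.
label = majority_vote(["fa", "prs", "fa"])
```

In a full pipeline, the n-gram counts would be fed to one or more trained classifiers, and the vote would be taken over their per-sentence predictions.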
| Original language | English |
|---|---|
| Pages | 53-58 |
| Number of pages | 6 |
| Publication status | Published - 2015 |
| Event | International Conference of the Pacific Association for Computational Linguistics (14th : 2015) - Bali, Indonesia. Duration: 19 May 2015 → 21 May 2015 |
Keywords
- Language Identification
- Dialect Identification
- Persian
- Farsi
- Dari
Title
Automatic Language Identification for Persian and Dari texts