Automatic Language Identification for Persian and Dari texts

Research output: Contribution to conference (Paper)

Abstract

We present the first empirical study of distinguishing Persian and Dari texts at the sentence level using discriminative models. Because Dari is a low-resource language, we developed a corpus of 28k sentences (14k per language) for this task. Using character and word n-grams as features, a classifier ensemble discriminates the two languages with 96% accuracy. Out-of-domain cross-corpus evaluation was conducted to test the discriminative models’ generalizability, achieving 87% accuracy in classifying 79k sentences from the Uppsala Persian Corpus. A feature analysis revealed lexical, morphological and orthographic differences between the two classes. A number of directions for future work are discussed.
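As a rough illustration of the kind of pipeline the abstract describes (character and word n-gram features fed to an ensemble of classifiers), the following is a minimal sketch, not the authors' implementation. The abstract does not say which base classifiers or which ensemble rule were used, so the naive Bayes learner, the plurality vote, the n-gram orders, and all names here (`char_ngrams`, `word_ngrams`, `NaiveBayes`, `ensemble_predict`) are assumptions made for illustration. The training strings are synthetic placeholders with labels "A" and "B", not real Persian or Dari text.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a sentence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return the overlapping word n-grams, each joined with a space."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over one feature type.

    Assumption: chosen only because it is easy to write with the standard
    library; the paper's actual base classifiers are not given in the abstract.
    """

    def __init__(self, extract):
        self.extract = extract                # sentence -> list of features
        self.counts = defaultdict(Counter)    # label -> feature counts
        self.docs = Counter()                 # label -> number of sentences
        self.vocab = set()

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            feats = self.extract(sentence)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, sentence):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.docs):
            total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(self.docs[label] / total_docs)
            for feat in self.extract(sentence):
                score += math.log(
                    (self.counts[label][feat] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def ensemble_predict(models, sentence):
    """Plurality vote over the base classifiers (an assumed ensemble rule)."""
    votes = Counter(model.predict(sentence) for model in models)
    return votes.most_common(1)[0][0]


# Synthetic placeholder data: two artificial "languages" labelled A and B.
train = [
    ("aba abab baba", "A"),
    ("abab aba ab", "A"),
    ("xyx yxyx xyxy", "B"),
    ("yxy xyx yx", "B"),
]
sentences, labels = zip(*train)

# One base classifier per feature type: character 2-grams, 3-grams, word 1-grams.
extractors = [
    lambda s: char_ngrams(s, 2),
    lambda s: char_ngrams(s, 3),
    lambda s: word_ngrams(s, 1),
]
models = [NaiveBayes(ex).fit(sentences, labels) for ex in extractors]

print(ensemble_predict(models, "aba baba"))
```

Training one classifier per feature type and combining their votes keeps each feature space separate, which makes it straightforward to inspect which n-grams drive each base model's decision. A real system would be trained on labelled sentences from both languages and evaluated on held-out and out-of-domain data, as the abstract describes.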
Original language: English
Pages: 53-58
Number of pages: 6
Publication status: Published - 2015
Event: International Conference of the Pacific Association for Computational Linguistics (14th : 2015) - Bali, Indonesia
Duration: 19 May 2015 to 21 May 2015

Conference

Conference: International Conference of the Pacific Association for Computational Linguistics (14th : 2015)
City: Bali, Indonesia
Period: 19/05/15 to 21/05/15

Keywords

  • Language Identification
  • Dialect Identification
  • Persian
  • Farsi
  • Dari
