Discriminating similar languages: Persian and Dari

Research output: Contribution to Newspaper/Magazine/Website › Article

Abstract

Although Language Identification (LID) has been widely studied in recent years, systems for determining the language of input texts often fail to discriminate between similar languages such as Croatian and Serbian or Malay and Indonesian. This has drawn attention to the task of discriminating similar languages, varieties and dialects, including a recent shared task. Persian (also known as Farsi) and Dari (Eastern Persian, spoken predominantly in Afghanistan) are two close variants that have not previously been investigated in LID, and we report the first results on this pair. Dari is a low-resourced but important language, particularly for the U.S. because of its ongoing involvement in Afghanistan, which has led to increasing research interest. We developed a corpus of 28k sentences (14k per language) and, using character and word n-grams, discriminated the two languages with 96% accuracy. To test how well the discriminative models generalize, we conducted an out-of-domain cross-corpus evaluation, achieving 87% accuracy in classifying 79k sentences from the Uppsala Persian Corpus. Feature analysis revealed lexical, morphological and orthographic inter-language differences. Beyond determining document languages, LID has applications in character encoding detection, statistical machine translation, inducing dialect-to-dialect lexicons and authorship profiling in the forensic linguistics domain. In Information Retrieval it can help filter documents (e.g. news articles or search results) by dialect. LID can also be used in other Natural Language Processing tasks, including the adaptation of tools like part-of-speech taggers for low-resourced languages. Since Dari differs too much from Persian for Persian resources to be applied directly, the distinguishing features identified through LID can assist in adapting existing resources.
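
The abstract names character and word n-grams as the features but does not say which classifier was used. The following is a minimal sketch of how such features could drive a sentence-level language identifier, assuming a multinomial naive Bayes model with add-one smoothing. The classifier, the function and class names, the n-gram ranges and the toy training strings are all illustrative assumptions, not the authors' setup or data.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Character n-grams of length n_min..n_max, with one space of padding."""
    padded = f" {text.strip()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def word_ngrams(text, n_max=2):
    """Word n-grams of length 1..n_max over whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def features(text):
    """Bag of character and word n-grams; prefixes keep the two spaces apart."""
    return Counter(["c:" + g for g in char_ngrams(text)]
                   + ["w:" + g for g in word_ngrams(text)])


class NaiveBayesLID:
    """Hypothetical multinomial naive Bayes over n-gram counts (an assumption)."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = {}
        vocab = set()
        for sentence, label in zip(sentences, labels):
            feats = features(sentence)
            self.feature_counts.setdefault(label, Counter()).update(feats)
            vocab.update(feats)
        self.vocab_size = len(vocab)
        self.totals = {y: sum(c.values()) for y, c in self.feature_counts.items()}
        return self

    def predict(self, sentence):
        n_sentences = sum(self.label_counts.values())
        feats = features(sentence)

        def log_score(label):
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.label_counts[label] / n_sentences)
            denom = self.totals[label] + self.vocab_size
            for gram, count in feats.items():
                score += count * math.log(
                    (self.feature_counts[label][gram] + 1) / denom)
            return score

        return max(self.feature_counts, key=log_score)


# Toy placeholder strings, not real Persian or Dari text.
model = NaiveBayesLID().fit(["abab abab", "cdcd cdcd"], ["X", "Y"])
print(model.predict("abab"))  # prints X
```

In practice the training sentences and labels would come from a labelled corpus, and accuracy would be measured on held-out sentences or on a separate out-of-domain corpus, as the abstract describes.
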
Original language: English
Number of pages: 1
Volume: 3
Specialist publication: Tiny Transactions on Computer Science (TinyToCS)
Publisher: Tiny Transactions on Computer Science
Publication status: Published - 2015

Bibliographical note

Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
