Arabic dialect identification using a parallel multidialectal corpus

Shervin Malmasi*, Eshrag Refaee, Mark Dras

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

21 Citations (Scopus)

Abstract

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA), the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization, a method not previously applied to this task. The first experiment is a 6-way multi-dialect classification task, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94%, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26,000 sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
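The two surface feature types named in the abstract, character n-grams and word n-grams, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the choice of n, and the toy transliterated string are all assumptions made for illustration. The `random_baseline` helper additionally assumes balanced classes, which is consistent with the 16.7% figure quoted for the 6-way task.

```python
from collections import Counter


def char_ngrams(sentence: str, n: int) -> Counter:
    """Count overlapping character n-grams, spaces included."""
    return Counter(sentence[i:i + n] for i in range(len(sentence) - n + 1))


def word_ngrams(sentence: str, n: int) -> Counter:
    """Count overlapping word n-grams over whitespace-split tokens (an assumption)."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def random_baseline(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing, assuming balanced classes."""
    return 1.0 / num_classes


# Hypothetical toy input, not drawn from the MPCA or AOC corpora.
example = "shu akhbarak"
print(char_ngrams(example, 3).most_common(3))
print(word_ngrams(example, 1))
print(f"{random_baseline(6):.1%}")
```

Counts like these would typically be turned into sparse feature vectors before being passed to a classifier such as the linear SVM the abstract mentions; that step is omitted here.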

Original language: English
Title of host publication: Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers
Editors: Kôiti Hasida, Ayu Purwarianti
Place of Publication: Singapore
Publisher: Springer, Springer Nature
Pages: 35-53
Number of pages: 19
Volume: 593
ISBN (Print): 9789811005145
Publication status: Published - 2016
Event: 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015 - Bali, Indonesia
Duration: 19 May 2015 to 21 May 2015

Publication series

Name: Communications in Computer and Information Science
Volume: 593
ISSN (Print): 1865-0929
