Arabic dialect identification using a parallel multidialectal corpus

Shervin Malmasi, Eshrag Refaee, Mark Dras

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › Research › peer-review

Abstract

We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously applied to this task. In the first experiment, we conduct a 6-way multi-dialect classification task, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94%, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.

Language: English
Title of host publication: Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers
Editors: Kôiti Hasida, Ayu Purwarianti
Place of Publication: Singapore
Publisher: Springer, Springer Nature
Pages: 35-53
Number of pages: 19
Volume: 593
ISBN (Print): 9789811005145
DOIs: https://doi.org/10.1007/978-981-10-0515-2_3
Publication status: Published - 2016
Event: 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015 - Bali, Indonesia
Duration: 19 May 2015 - 21 May 2015

Publication series

Name: Communications in Computer and Information Science
Volume: 593
ISSN (Print): 1865-0929

Other

Other: 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015
Country: Indonesia
City: Bali
Period: 19/05/15 - 21/05/15

Fingerprint

Classifiers
Experiments
Support vector machines

Cite this

Malmasi, S., Refaee, E., & Dras, M. (2016). Arabic dialect identification using a parallel multidialectal corpus. In K. Hasida, & A. Purwarianti (Eds.), Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers (Vol. 593, pp. 35-53). (Communications in Computer and Information Science; Vol. 593). Singapore: Springer, Springer Nature. https://doi.org/10.1007/978-981-10-0515-2_3
Malmasi, Shervin ; Refaee, Eshrag ; Dras, Mark. / Arabic dialect identification using a parallel multidialectal corpus. Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers. editor / Kôiti Hasida ; Ayu Purwarianti. Vol. 593 Singapore : Springer, Springer Nature, 2016. pp. 35-53 (Communications in Computer and Information Science).
@inproceedings{b2d5da91d1bc47f2b38218818fb653b0,
title = "Arabic dialect identification using a parallel multidialectal corpus",
abstract = "We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously applied to this task. In the first experiment, we conduct a 6-way multi-dialect classification task, achieving 74{\%} accuracy against a random baseline of 16.7{\%} and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94{\%}, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76{\%}). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with 74{\%} accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97{\%} accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.",
author = "Shervin Malmasi and Eshrag Refaee and Mark Dras",
year = "2016",
doi = "10.1007/978-981-10-0515-2_3",
language = "English",
isbn = "9789811005145",
volume = "593",
series = "Communications in Computer and Information Science",
publisher = "Springer, Springer Nature",
pages = "35--53",
editor = "K{\^o}iti Hasida and Ayu Purwarianti",
booktitle = "Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers",
address = "Singapore",

}

Malmasi, S, Refaee, E & Dras, M 2016, Arabic dialect identification using a parallel multidialectal corpus. in K Hasida & A Purwarianti (eds), Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers. vol. 593, Communications in Computer and Information Science, vol. 593, Springer, Springer Nature, Singapore, pp. 35-53, 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Bali, Indonesia, 19/05/15. https://doi.org/10.1007/978-981-10-0515-2_3

Arabic dialect identification using a parallel multidialectal corpus. / Malmasi, Shervin; Refaee, Eshrag; Dras, Mark.

Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers. ed. / Kôiti Hasida; Ayu Purwarianti. Vol. 593 Singapore : Springer, Springer Nature, 2016. p. 35-53 (Communications in Computer and Information Science; Vol. 593).

TY - GEN

T1 - Arabic dialect identification using a parallel multidialectal corpus

AU - Malmasi, Shervin

AU - Refaee, Eshrag

AU - Dras, Mark

PY - 2016

Y1 - 2016

N2 - We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously applied to this task. In the first experiment, we conduct a 6-way multi-dialect classification task, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can yield large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94%, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that, despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-grams are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.

UR - http://www.scopus.com/inward/record.url?scp=84961177974&partnerID=8YFLogxK

U2 - 10.1007/978-981-10-0515-2_3

DO - 10.1007/978-981-10-0515-2_3

M3 - Conference proceeding contribution

SN - 9789811005145

VL - 593

T3 - Communications in Computer and Information Science

SP - 35

EP - 53

BT - Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers

A2 - Hasida, Kôiti

A2 - Purwarianti, Ayu

PB - Springer, Springer Nature

CY - Singapore

ER -

Malmasi S, Refaee E, Dras M. Arabic dialect identification using a parallel multidialectal corpus. In Hasida K, Purwarianti A, editors, Computational Linguistics - 14th International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Revised Selected Papers. Vol. 593. Singapore: Springer, Springer Nature. 2016. p. 35-53. (Communications in Computer and Information Science). https://doi.org/10.1007/978-981-10-0515-2_3