TY - GEN
T1 - Arabic dialect identification using a parallel multidialectal corpus
AU - Malmasi, Shervin
AU - Refaee, Eshrag
AU - Dras, Mark
PY - 2016
Y1 - 2016
N2 - We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a metaclassifier using stacked generalization – a method not previously applied for this task. We first conduct a 6-way multi-dialect classification task in the first experiment, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94 %, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26 k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-g are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
AB - We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a metaclassifier using stacked generalization – a method not previously applied for this task. We first conduct a 6-way multi-dialect classification task in the first experiment, achieving 74% accuracy against a random baseline of 16.7% and demonstrating that meta-classifiers can large performance increases over single classifiers. The second experiment investigates pairwise binary dialect classification within the corpus, yielding results as high as 94 %, but also highlighting poorer results between closely related dialects such as Palestinian and Jordanian (76%). Our final experiment conducts cross-corpus evaluation on the widely used Arabic Online Commentary (AOC) dataset and demonstrates that despite differing greatly in size and content, models trained with the MPCA generalize to the AOC, and vice versa. Using only 2,000 sentences from the MPCA, we classify over 26 k sentences from the radically different AOC dataset with 74% accuracy. We also use this data to classify a new dataset of MSA and Egyptian Arabic tweets with 97% accuracy. We find that character n-g are a very informative feature for this task, in both within- and cross-corpus settings. Contrary to previous results, they outperform word n-grams in several experiments here. Several directions for future work are outlined.
UR - http://www.scopus.com/inward/record.url?scp=84961177974&partnerID=8YFLogxK
U2 - 10.1007/978-981-10-0515-2_3
DO - 10.1007/978-981-10-0515-2_3
M3 - Conference proceeding contribution
AN - SCOPUS:84961177974
SN - 9789811005145
VL - 593
T3 - Communications in Computer and Information Science
SP - 35
EP - 53
BT - Computational Linguistics - 14th International Conference of the Pacific Association for Computaitonal Linguistics, PACLING 2015, Revised Selected Papers
A2 - Hasida, Kôiti
A2 - Purwarianti, Ayu
PB - Springer, Springer Nature
CY - Singapore
T2 - 14th International Conference of the Pacific Association for Computaitonal Linguistics, PACLING 2015
Y2 - 19 May 2015 through 21 May 2015
ER -