Zero-resource Multi-dialectal Arabic Natural Language Understanding

Muhammad Khalifa; Hesham Hassan; Aly Fahmy

doi:10.14569/IJACSA.2021.0120369

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 12 Issue 3, 2021.

Abstract: Fine-tuning pre-trained language models (PLMs) on downstream tasks requires a reasonable amount of annotated data. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate zero-shot performance on Dialectal Arabic (DA) when a PLM is fine-tuned on Modern Standard Arabic (MSA) data only, and identify a significant performance drop when such models are evaluated on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it to named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: it improves zero-shot MSA-to-DA transfer by as much as ~10% F₁ (NER), 2% accuracy (POS tagging), and 4.5% F₁ (SRD). We conduct an ablation experiment and show that the observed performance boost results directly from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero- and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community.
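To make the idea of self-training concrete, the following is a minimal, generic sketch and not the authors' implementation. It assumes a toy one-dimensional nearest-centroid classifier standing in for a fine-tuned PLM, a hypothetical confidence threshold of 0.8, and a hypothetical round limit; the labeled list plays the role of MSA data and the unlabeled list the role of DA data. The abstract does not specify how pseudo-labels are selected, so the thresholding rule here is an assumption.

```python
import math


def fit_centroids(examples):
    """Toy stand-in for fine-tuning: mean feature value per label."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, confidence) from a softmax over negative distances."""
    scores = {y: math.exp(-abs(x - c)) for y, c in centroids.items()}
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Generic self-training loop (threshold and max_rounds are assumptions).

    1. Train on the labeled (source-variety) examples.
    2. Pseudo-label the unlabeled (target-variety) pool.
    3. Move confident predictions into the training set and retrain.
    4. Stop when no prediction clears the threshold or rounds run out.
    """
    train = list(labeled)
    pool = list(unlabeled)
    model = fit_centroids(train)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            label, conf = predict(model, x)
            if conf >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break
        train.extend(confident)
        pool = remaining
        model = fit_centroids(train)
    return model, train, pool


if __name__ == "__main__":
    labeled = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
    unlabeled = [2.0, 3.0, 7.0, 8.0, 5.0]
    model, train, pool = self_train(labeled, unlabeled)
    print(model, len(train), pool)
```

In this toy run the ambiguous point equidistant from both centroids never clears the threshold and stays in the pool, which illustrates why a confidence filter is a common design choice: low-confidence pseudo-labels are kept out of the training set rather than risk reinforcing errors.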

Keywords: Natural language processing; natural language understanding; low-resource learning; semi-supervised learning; named entity recognition; part-of-speech tagging; sarcasm detection; pre-trained language models

Muhammad Khalifa, Hesham Hassan and Aly Fahmy, “Zero-resource Multi-dialectal Arabic Natural Language Understanding,” International Journal of Advanced Computer Science and Applications (IJACSA), 12(3), 2021. http://dx.doi.org/10.14569/IJACSA.2021.0120369

@article{Khalifa2021,
title = {Zero-resource Multi-dialectal Arabic Natural Language Understanding},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0120369},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0120369},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {3},
author = {Muhammad Khalifa and Hesham Hassan and Aly Fahmy}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
