Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on modern standard Arabic (MSA) data only -- identifying a significant performance drop when evaluating such models on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: improving zero-shot MSA-to-DA transfer by as much as $\sim$10\% F$_1$ (NER), 2\% accuracy (POS tagging), and 4.5\% F$_1$ (SRD). We conduct an ablation experiment and show that the performance boost observed directly results from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero- and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community.


I. INTRODUCTION
Neural language models [1], [2] with contextual word representations [3] have become dominant for a wide range of Natural Language Processing (NLP) downstream tasks. More precisely, contextual representations from transformer-based [4] language models [5], [6], pre-trained on large amounts of raw data and then fine-tuned on labeled task-specific data, have produced state-of-the-art performance on many tasks, even when using fewer labeled examples. Such tasks include question answering [7], text classification [6], named entity recognition (NER), and part-of-speech (POS) tagging [8], [9].
Typically, such language models see a huge amount of data during pre-training, which could mistakenly lead us to assume they generalize well even when the language varieties seen at test time differ from those the model was fine-tuned on. To investigate this situation, we first study the impact of using a language model pre-trained on huge Arabic corpora for two popular sequence tagging tasks (NER and POS tagging) and one text classification task (sarcasm detection) when fine-tuned on available labeled data, regardless of language variety (Section VII-A). To test model utility on exclusively dialectal Arabic (DA) tasks, we then remove all dialectal data from the training splits and fine-tune a model on MSA only. Evaluating such a model in a zero-shot setting, i.e., on the Egyptian (EGY), Gulf (GLF), and Levantine (LEV) varieties, we observe a significant performance drop. This reveals the somewhat brittle generalization of pre-trained language models in the absence of dialect-specific fine-tuning.
Unfortunately, the scarcity of labeled DA resources covering sufficient tasks and dialectal varieties has significantly slowed down research on DA [10]. Consequently, a question arises: "How can we develop models nuanced to downstream tasks in dialectal contexts without annotated DA examples?". We apply self-training, a classical semi-supervised approach in which we augment the training data with confidently-predicted dialectal examples. We empirically show that self-training is indeed an effective strategy: we evaluate the zero-shot setting (where no gold dialectal data are included in the training set) both without and with self-training (Sections VII-B and VII-C, respectively).
Our experiments reveal that self-training is always a useful strategy that consistently improves over mere fine-tuning. In order to understand why this is the case (i.e., why combining self-training with fine-tuning yields better results than mere fine-tuning), we perform an extensive error analysis based on our NER data. We discover that self-training helps the model most by reducing false positives (approximately 59.7%). These include cases involving DA tokens whose MSA orthographic counterparts [11] are either named entities or trigger words that frequently co-occur with named entities in MSA. Interestingly, such out-of-MSA tokens occur in highly dialectal contexts (e.g., interjections and idiomatic expressions employed in interpersonal social media communication) or in contexts where the social media platform on which the language (DA) is employed affords more freedom of speech and a venue for political satire. We present our error analysis in Section VIII.
We choose Arabic as our experimental playground since it affords a rich context of linguistic variation: in addition to the standard variety, MSA, Arabic also has several dialects, offering an excellent setting for studying our problem. From a geopolitical perspective, Arabic also has strategic significance: it is the native tongue of 400 million speakers in 22 countries spanning two continents (Africa and Asia). In addition, the three dialects of our choice (EGY, GLF, LEV) are popular dialects that are widely used online, which makes our resulting models highly useful in practical situations at scale. Pragmatically, the ability to develop NLP systems for dialectal tasks with little or no labeled dialect data immediately eases a serious bottleneck. Arabic dialects differ among themselves and from MSA at all linguistic levels, posing challenges to traditional NLP approaches. Developing annotated resources across the various dialects for the different tasks would be quite costly, and perhaps unnecessary. Therefore, zero-shot cross-dialectal transfer is valuable when only some language varieties have labeled resources. We also note that our method is language-independent, and we hypothesize it can be directly applied to other varieties of Arabic or to other languages and their varieties.
Our research contributions in this paper are three-fold: 1) We study the problem of MSA-to-DA transfer in the context of sequence labeling and text classification and show, through experiments, that when training with MSA data only, a wide performance gap exists between testing on MSA and on DA. That is, models fine-tuned on MSA generalize poorly to DA in zero-shot settings. 2) We propose self-training to improve zero- and few-shot MSA-to-DA transfer. Our approach requires little to no labeled DA data. We evaluate extensively on 3 different dialects across the 3 aforementioned tasks, and show that our method indeed narrows the performance gap between MSA and DA by a margin as wide as $\sim$10\% F$_1$ points. Moreover, we conduct an ablation experiment to evaluate the importance of using unlabeled DA rather than MSA data in the zero-shot setting, and we show that unlabeled DA data is indeed much more effective and necessary for adapting the model to DA data during testing. 3) We develop state-of-the-art models for the 3 tasks (NER, POS tagging, and SRD), which we intend to publicly release for the research community.
We now review relevant literature.

II. RELATED WORK
Classical machine learning techniques, such as SVMs and Conditional Random Fields (CRFs) [12] applied to manually extracted, hand-crafted word- and character-level features, were previously employed for various sequence labeling tasks, including NER, POS tagging, and chunking. More recently, however, neural architectures have become the de facto approach for various tasks, including sequence labeling. Such architectures usually include an autoregressive network such as a vanilla Recurrent Neural Network (RNN) [13] or the more sophisticated Long Short-Term Memory network (LSTM) [14]. The network processes the input text word by word and is trained to predict the correct label for each word. In addition, more capacity can be given to such networks by adding a layer that processes the input in a right-to-left fashion [15], [16].
Neural approaches usually make use of both word- and character-level features. Word-level features usually consist of semantic word embeddings, which are trained on a large raw corpus in a self-supervised fashion [17], [18]. Character-level features can be extracted through an additional network such as an LSTM [19] or a CNN [20]. Neural techniques have produced better or comparable results to classical approaches, in addition to alleviating the need to manually hand-craft features.
With respect to NER, most prior work has been in the context of MSA, due to the lack of dialectal NER datasets. For example, [30] applied a CRF layer over n-gram features to perform NER. [31] combined a decision tree [32] with rule-based features. Comparatively little work has focused on NER in the context of social media data, where DA and MSA are usually mixed. For instance, [29] used cross-lingual resources, namely English data, to improve Arabic NER; however, they obtained poor results when evaluating on social media data. More recently, [21] applied bi-directional LSTM networks at both the character and word levels to perform NER on the Tweets dataset of [29]. For the Egyptian dialect specifically, [33] performed NER by applying a CRF tagger to a set of lexical, morphological, and gazetteer-based features. Their approach showed improvements over baselines, but the performance on dialectal data was not on par with that on MSA data, showing the challenges brought by dialectal contexts. To the best of our knowledge, little attention has been given to NER on dialectal Arabic, and no prior work has studied performance when training on MSA data and evaluating on DA data.
As for POS tagging, and similarly to NER, the performance of models trained on MSA drops significantly when they are used on DA [34], [25]. Initial systems for Arabic POS tagging relied on statistical features and linguistic rules crafted by experts [35], [36], or combined machine learning techniques with rules [37]. More recent work adopted classical machine learning models such as SVMs applied to n-gram features [38], [39]. RNNs and their variants were later adapted for the task [40], [25], [41].
Dialectal Arabic POS tagging has received some attention, although usually limited to individual dialects such as Gulf [42], [25] and Egyptian [43], [44]. [45] studied multi-dialectal POS tagging by proposing an annotated DA dataset from Twitter spanning 4 different dialects, namely Gulf, Egyptian, Levantine, and Maghrebi. While their results show a performance drop on DA when training on MSA only, no attempt was made to improve DA performance in that case. Despite both the difficulty and the scarcity of annotated DA data across the different dialects and tasks, most previous work has focused on annotating uni-dialectal datasets rather than attempting to leverage the already abundant MSA datasets. A classical exception is the work of [43], who employed an MSA morphological analyzer with minimal supervision to perform POS tagging on Egyptian data using unlabeled Egyptian and Levantine data.
With respect to Arabic Sarcasm Detection, the majority of research has focused on detecting sarcastic tweets. [63] used Random Forests to identify sarcastic political tweets. [64] organized a shared task on irony detection in Arabic tweets; the submitted systems varied in their approaches, from classical models with count-based features [65], [66] to deep models [67], [68]. [69] highlighted the connection between sentiment analysis and sarcasm detection by showing how sentiment classifiers fail on sarcastic inputs. They also proposed the largest publicly available Arabic sarcasm detection dataset, ArSarcasm, which we use in this work. So far, sarcasm detection methods have been applied to social media data collectively, with no effort made to study the zero-shot performance of state-of-the-art methods across dialects.
Pre-trained Language Models. Sequential transfer learning, where a network is first pre-trained on a relevant task before fine-tuning on the target task, originally appeared in the domain of computer vision and has recently been adopted in NLP. [70] proposed to pre-train an LSTM network for language modeling and then fine-tune it for classification. Similarly, ELMo [3] leveraged contextual representations obtained from a network pre-trained for language modeling to perform many NLP tasks. Related approaches followed, such as BERT [5], which relies not on RNNs but on bidirectional Transformers [4] and on a different pre-training objective, namely masked language modeling. Other variants include RoBERTa [6], MASS [71], and ELECTRA [72]. Fine-tuning these pre-trained models on task-specific data has produced state-of-the-art performance, especially when sufficiently large labeled datasets do not exist. They have been applied to several tasks, including text classification, question answering, named entity recognition [9], and POS tagging [8].
Cross-lingual Learning. Cross-lingual learning (CLL) refers to using labeled resources from resource-rich languages to build models for data-scarce languages. In a sense, knowledge learned about language structure and tasks is transferred to low-resource languages. Cross-lingual learning is of particular importance due to the scarcity of labeled resources in many of the world's languages, some of which are spoken by millions of people (Marathi and Gondi, for example). While our work is better described as cross-dialectal, the techniques used for cross-lingual learning can easily be adapted to settings such as ours. In this work, Modern Standard Arabic (MSA) and Arabic dialects (DA) represent the high-resource and low-resource languages, respectively.
Many techniques have been proposed for CLL, including cross-lingual word embeddings [73], [74], [75], [76], where two monolingual vector spaces are mapped into the same shared space. While cross-lingual word embeddings enable comparing meaning across languages [73], they typically fail when there is not enough data to train good monolingual embeddings. In addition, adversarial learning [77] has played an important role in cross-lingual learning, where an adversarial objective is employed to learn language-independent representations [78], [79], [80], [81]. As a result, the model learns to rely more on general language structure and commonalities between languages, and can therefore generalize across languages. Multilingual extensions of pre-trained language models have also emerged through joint pre-training on several languages; examples include mBERT [5], XLM [82], and XLM-RoBERTa [9]. During pre-training on multiple languages, the model learns to exploit common structure among the pre-training languages even without explicit alignment [83]. These models have become useful in few-shot and zero-shot cross-lingual settings, where there is little or no access to labeled data in the target language. For instance, [9] evaluate a cross-lingual version of RoBERTa [6], namely XLM-RoBERTa, on cross-lingual learning across different tasks such as question answering, text classification, and named entity recognition.
Semi-supervised Learning. Several methods have been proposed for leveraging unlabeled data, including co-training [84], graph-based learning [85], tri-training [86], and self-training [87]. A variety of semi-supervised learning methods have been successfully applied to a number of NLP tasks, including NER [88], [89], POS tagging [90], parsing [91], word sense disambiguation [92], and text classification [93], [94]. Self-training has been applied in cross-lingual settings where gold labels are rare in the target language. For example, [95] proposed a combination of active learning and self-training for cross-lingual sentiment classification. [96] made use of self-training for named entity tagging and linking across 282 different languages. [97] used self-training for cross-lingual word mapping to create additional word pairs for training. [98] employed self-training to improve zero-shot cross-lingual sentiment classification with mBERT [5]; with English as their source language, they improved performance on 7 languages by self-training with unlabeled data in the target languages. Lastly, [99] used the self-labeled examples produced by self-training to create adversarial examples in order to improve robustness and generalization.
We now introduce our tasks.

III. TASKS
Named Entity Recognition (NER) is the information extraction task that attempts to locate, extract, and automatically classify named entities into predefined classes or types in unstructured texts [100]. Typically, NER is integrated into more complex tasks where, for example, entities must be handled in a special way. For instance, when translating the Arabic sentence " " to English, it would be useful to know that " " is a person name, and therefore should not be translated into the word "generosity". Similarly, NER can be useful for other tasks such as question answering, information retrieval, and summarization.
Part-of-Speech (POS) tagging is the task of assigning a word in context its part-of-speech tag. Such tags include adverb (ADV), adjective (ADJ), pronoun (PRON), and many others. For example, given an input sentence " ", our goal is to tag each word accordingly. POS tagging is an essential NLU task with many applications in speech recognition, machine translation, and information retrieval. Both NER and POS tagging are sequence labeling tasks, where we assign a label to each word in the input context (see the illustrative sketch at the end of this section).

Sarcasm Detection is the task of identifying sarcastic utterances, where the author intends a different meaning than what is literally enunciated [46]. Sarcasm detection is crucial for NLU, as failing to detect sarcasm can easily lead to misinterpretation of the intended meaning, and therefore significantly degrade the accuracy of tasks such as sentiment classification, emotion recognition, and opinion mining [69]. For example, the word " " in the utterance " " can erroneously lead sentiment classifiers to predict positive sentiment, although the utterance carries negative sentiment. Sarcasm detection is typically treated as a binary classification task, where an utterance is classified as either sarcastic or not.
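Since the Arabic examples above did not survive typesetting, the following minimal sketch illustrates the shared input/output format of the two sequence labeling tasks using a hypothetical English sentence; the BIO-style NER tags and coarse POS tags shown are common illustrative conventions, not the exact tag sets of our datasets.

```python
# Illustrative sequence-labeling format: exactly one label per token.
# The sentence and tags are hypothetical, not drawn from our datasets.
tokens = ["Omar", "visited", "Cairo", "yesterday"]
ner_tags = ["B-PER", "O", "B-LOC", "O"]      # NER: entity spans vs. "O"
pos_tags = ["NOUN", "VERB", "NOUN", "ADV"]   # POS: one syntactic tag per token

assert len(tokens) == len(ner_tags) == len(pos_tags)
for token, ner, pos in zip(tokens, ner_tags, pos_tags):
    print(f"{token}\t{ner}\t{pos}")
```

Sarcasm detection, by contrast, assigns a single binary label to the whole utterance rather than one label per token.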

IV. METHOD
In this work, we show that models trained on MSA for NER, POS tagging, and Sarcasm Detection generalize poorly to dialectal inputs when used in zero-shot settings (i.e., with no annotated DA data used during training). Across the three tasks, we test how self-training fares as an approach for leveraging unlabeled DA data to improve performance on DA. Self-training involves training a model on its own predictions over a set of unlabeled data distinct from its original training split. Next, we formally describe our algorithm.

A. Self-training for Sequence Labeling

The notation in this section is directed towards sequence labeling (since we experiment with 2 sequence labeling tasks out of 3); however, it is straightforward to adapt it to text classification as in [98]. First, we fine-tune the model M on the labeled MSA training set L. Second, for every unlabeled DA example $u_i$ in the unlabeled set U, we use the model to tag each of its tokens, obtaining a set of predictions and confidence scores $\{(\hat{y}_{ij}, c_{ij})\}_j$, where $\hat{y}_{ij}$ and $c_{ij}$ are the label and confidence score (softmax probability) for the $j$-th token in $u_i$. Third, we employ a selection mechanism to identify examples from U that are added to L for the next iteration.
For the selection mechanism, we experiment with both a thresholding approach and a fixed-size approach [98]. In the thresholding method, a threshold $\tau$ is applied to the minimum confidence per example: we add an example $u_i$ to L (and remove it from U) only if the confidence of every one of its token predictions exceeds $\tau$, i.e., $\min_j c_{ij} > \tau$. In the fixed-size approach, we instead add the $S$ most confident examples per iteration, where $S$ is a hyper-parameter. In both cases, tagging, selection, and re-training are repeated until a stopping criterion is satisfied. We experiment with both approaches and report results in Section VII; a sketch of the resulting procedure follows.
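To make the procedure concrete, below is a minimal, self-contained sketch of the self-training loop with both selection mechanisms. The helper names (`fine_tune`, `predict`) and the use of the minimum token confidence as the ranking score for fixed-size selection are our assumptions for illustration, not the authors' exact implementation.

```python
from typing import Callable, List, Optional, Tuple

def self_train(
    model,
    labeled: List[List[Tuple[str, str]]],  # L: gold (token, label) sequences
    unlabeled: List[List[str]],            # U: raw DA token sequences
    fine_tune: Callable,                   # fine_tune(model, labeled) -> model
    predict: Callable,                     # predict(model, tokens) -> (labels, confs)
    tau: float = 0.9,                      # threshold on per-example min confidence
    fixed_size: Optional[int] = None,      # if set, use fixed-size (S) selection
    max_iters: int = 10,
):
    """Hypothetical sketch of the self-training loop described above."""
    unlabeled = list(unlabeled)
    for _ in range(max_iters):
        model = fine_tune(model, labeled)          # step 1: train on current L
        scored = []
        for idx, tokens in enumerate(unlabeled):   # step 2: tag every u_i in U
            labels, confs = predict(model, tokens)
            # score an example by its least confident token (our assumption)
            scored.append((min(confs), idx, list(zip(tokens, labels))))
        if fixed_size is not None:
            # fixed-size selection: take the S most confident examples
            scored.sort(key=lambda s: s[0], reverse=True)
            selected = scored[:fixed_size]
        else:
            # thresholding: keep examples whose min token confidence exceeds tau
            selected = [s for s in scored if s[0] > tau]
        if not selected:                           # stopping criterion
            break
        labeled.extend(ex for _, _, ex in selected)  # step 3: move from U to L
        taken = {i for _, i, _ in selected}
        unlabeled = [u for i, u in enumerate(unlabeled) if i not in taken]
    return model
```

In practice (Section VI-C), each call to `fine_tune` runs a small, fixed number of epochs over the growing labeled set, and the loop stops early once the development score stops improving.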

B. Self-training for Classification
For sarcasm detection, we follow [98], who select an equivalent number of examples from each class, which we refer to as class balancing. In other words, let $c_{u_i}$ be the confidence of the most probable class assigned to example $u_i$; at each iteration, we rank the self-labeled examples of each class by $c_{u_i}$ and add the same number of top-ranked examples from each class to the training set. A minimal sketch of this selection step follows.
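The following sketch shows class-balanced fixed-size selection, under the assumption that each candidate is an (example, predicted class, confidence) triple; the names are ours, not the authors'.

```python
from collections import defaultdict
from typing import Any, List, Tuple

def balanced_select(
    candidates: List[Tuple[Any, str, float]],  # (example, predicted class, confidence)
    s: int,                                    # S: total examples added per iteration
) -> List[Tuple[Any, str]]:
    """Select an equal number of the most confident self-labeled examples
    from each class (here: sarcastic vs. non-sarcastic)."""
    by_class = defaultdict(list)
    for example, cls, conf in candidates:
        by_class[cls].append((conf, example))
    per_class = s // max(len(by_class), 1)     # e.g., S = 100 -> 50 per class
    selected = []
    for cls, items in by_class.items():
        items.sort(key=lambda t: t[0], reverse=True)
        selected.extend((ex, cls) for _, ex in items[:per_class])
    return selected
```

Without this balancing step, the most confident self-labeled examples overwhelmingly come from the majority (non-sarcastic) class, as we observe in Section VII-C.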

V. PRETRAINED LANGUAGE MODEL
In this work, we turn our attention to fine-tuning pre-trained language models (PLMs) on our three tasks. While self-training can in principle be applied to many other types of models, such as LSTM networks [14], we select PLMs for two reasons. First, PLMs have been shown to outperform models trained from scratch on a wide variety of tasks [5], [70], [82]. Second, we aim to show that even state-of-the-art models still perform poorly in certain low-resource settings, underscoring that we still need methods to handle such scenarios. As a pre-trained language model, we use XLM-RoBERTa [9] (XLM-R for short). XLM-R is a cross-lingual model, and we choose it since it is reported to perform better than mBERT, the multilingual model from Google [5]. XLM-R also uses Common Crawl for pre-training, which is more likely to contain dialectal data than Wikipedia Arabic (used in mBERT), making it better suited to our work. We now introduce our experiments.

VI. EXPERIMENTS
We begin our experiments by evaluating the standard fine-tuning performance of XLM-R models on NER, POS tagging, and SRD against strong baselines. We then use our best models from this first round to investigate MSA-to-DA zero-shot transfer, showing a significant performance drop even when using pre-trained XLM-R. Consequently, we evaluate self-training in zero-shot (NER, POS tagging, SRD) and few-shot (POS tagging) settings, showing substantial performance improvements in both cases. We now introduce our datasets.

A. Datasets

POS Tagging: There are a number of Arabic POS tagging datasets, mostly for MSA [103] but also for dialects such as EGY [104]. To show that the proposed approach works across multiple dialects, we ideally need data from more than one dialect. Hence, we use the multi-dialectal (MD) dataset from [45], comprising 350 tweets from various Arabic varieties: MSA, Egyptian (EGY), Gulf (GLF), and Levantine (LEV). This dataset has 21 POS tags, some of which are suited to social media (since it is derived from Twitter). We show the POS tag set from [45] in Table XIII (in the Appendix). We further evaluate fine-tuning XLM-R for POS tagging on a Classical Arabic dataset, namely the Quranic Arabic Corpus (QAC) [105].
Sarcasm Detection: We use the ArSarcasm dataset provided by [69], which has a total of 10,547 examples split into training and test sets. Each example is labeled with its dialect and a sarcasm label. For our experiments, we set aside 20% of the training data as a development set. Table I shows the sizes of the datasets used. We now introduce our baselines.

B. Baselines
For the NER task, we use the following baselines:
• NERA [31]: A hybrid system of rule-based features and a decision tree classifier.
• WC-CNN [22]: A character-and a word-level CNN with a CRF layer.
• mBERT [5]: A fine-tuned multilingual BERT-Base-Cased model (110M parameters), pre-trained with a masked language modeling objective on the Wikipedia corpora of 104 languages (including Arabic). For fine-tuning, we find (based on experiments on our development set) that a learning rate of $6 \times 10^{-5}$ with a dropout of 0.1 works best.
In addition, we compare to the published results in [28], AraBERT [106], and CAMel [107] for the ANERCorp dataset.
We also compare to the published results in [22] for the 4 datasets.
For the POS tagging task, we compare to our own implementation of WC-BiLSTM (since, as far as we know, there is no published research that uses this method on the task) and run mBERT on our data. We also compare to the CRF results published by [45]. In addition, for the Gulf dialect, we compare to the published results of the BiLSTM with compositional character and word representations (CC2W+W) in [25].

For the Sarcasm Detection task:
• Word-level BiLSTM: A bidirectional LSTM on the word level. We use the same hyper-parameters as in [69].
• Word-level CNN [108]: the network has a single convolutional layer with 10 filters each of sizes 3, 5, and 7.

C. Experimental Setup
Our main models are XLM-R BASE (L = 12, H = 768, A = 12, 270M params) and XLM-R LARGE (L = 24, H = 1024, A = 16, 550M params), where L is the number of layers, H is the hidden size, and A is the number of self-attention heads. For XLM-R experiments, we use the Adam optimizer with a learning rate of $1 \times 10^{-5}$ and a batch size of 16. We typically fine-tune for 20 epochs, keeping the best model on the development set for testing. We report results on the test split of each dataset. For all BiLSTM experiments, we use the same hyper-parameters as [22]. A sketch of this fine-tuning setup follows.
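For concreteness, below is a hedged sketch of how such a fine-tuning run could be set up with the Hugging Face transformers library (token classification shown, as used for NER and POS tagging). The hyper-parameters mirror those quoted above; everything else (checkpoint names, helper structure, dataset preparation) is our assumption rather than the authors' exact code.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
)

def finetune_xlmr(train_ds, dev_ds, num_labels: int, size: str = "base"):
    """Fine-tune XLM-R for token classification; datasets are assumed to be
    already tokenized and label-aligned."""
    name = f"xlm-roberta-{size}"               # "xlm-roberta-base" or "-large"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(
        name, num_labels=num_labels)
    args = TrainingArguments(
        output_dir=f"{name}-finetuned",
        learning_rate=1e-5,                    # Adam, lr 1e-5 (as above)
        per_device_train_batch_size=16,        # batch size 16
        num_train_epochs=20,                   # fine-tune for 20 epochs
        evaluation_strategy="epoch",           # evaluate on dev every epoch...
        save_strategy="epoch",
        load_best_model_at_end=True,           # ...and keep the best checkpoint
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=dev_ds,
                      tokenizer=tokenizer)
    trainer.train()
    return trainer
```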
For all self-training experiments, we use the dialect subset of the Arabic online news commentary (AOC) dataset [109], comprising the EGY, GLF, and LEV varieties, limited to equal sizes of 9K examples per dialect (27K in total). We use the split of AOC from [110], removing the dialect labels and using only the comments themselves for self-training. Each iteration involves fine-tuning the model for K = 5 epochs. As a stopping criterion, we use early stopping with a patience of 10 epochs. Other hyper-parameters are set as listed above.
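For reference, a compact summary of this self-training configuration (the dictionary layout and key names are ours; the values come from the text):

```python
# Self-training setup described above, gathered into one place.
SELF_TRAINING_SETUP = {
    "unlabeled_data": "AOC dialect subset [109], split as in [110]",
    "examples_per_dialect": {"EGY": 9_000, "GLF": 9_000, "LEV": 9_000},  # 27K total
    "epochs_per_iteration": 5,       # K = 5 fine-tuning epochs per iteration
    "early_stopping_patience": 10,   # stopping criterion on the dev score
    "learning_rate": 1e-5,           # other hyper-parameters as in Section VI-C
    "batch_size": 16,
}
```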

VII. RESULTS

A. Fine-tuning XLM-R
We start by showing the results of fine-tuning XLM-R on the NER task on each of the 4 Arabic NER (ANER) datasets listed in Section VI-A. Table II shows the test set macro F$_1$ score on each of the 4 ANER datasets. Clearly, the fine-tuned XLM-R models outperform the other baselines on all datasets, except on NW-2003, where WC-CNN [22] performs slightly better than XLM-R LARGE.
For POS tagging, Table III shows the test set word accuracy of the XLM-R models compared to baselines on the Quranic Arabic Corpus (QAC) and 4 different subsets of the multi-dialectal dataset [45]. Again, the XLM-R models (both base and large) outperform all other models. A question arises as to why the XLM-R models outperform both mBERT and AraBERT. As noted before, for XLM-R vs. mBERT, XLM-R was pre-trained on much larger data: CommonCrawl for XLM-R vs. Wikipedia for mBERT. Hence, the larger pre-training dataset gives XLM-R an advantage over mBERT. As for AraBERT, although the pre-training data for XLM-R and AraBERT may be comparable, even the smaller XLM-R model (XLM-R BASE) has more than twice the number of parameters of the BERT BASE architecture on which AraBERT and mBERT are built (270M vs. 110M). Hence, XLM-R's model capacity gives it another advantage. We now report our experiments with zero-shot transfer from MSA to DA.
For Sarcasm Detection, we fine-tune XLM-R BASE and XLM-R LARGE on the full ArSarcasm dataset and compare their performance against three different baselines in Table IV. The worst performance is given by the CNN, which can be attributed to the way CNNs work: by capturing local n-gram features, the CNN filters fail to learn the wide contextual features required to detect sarcasm. mBERT performs very well compared to the BiLSTM and CNN, but XLM-R BASE and XLM-R LARGE outperform all other baselines on the task with 69.83% and 74.07% macro F$_1$ points, respectively, achieving a new state-of-the-art on the ArSarcasm dataset.

B. MSA-DA Zero-Shot Transfer
As before, we start with the discussion of the NER experiments. To evaluate the utility of our approach, we need DA data labeled for NER. We observed that the dataset from [29] contains both MSA and DA examples (tweets). Hence, we train a binary classifier to distinguish DA from MSA tweets and use it to split this dataset into an MSA portion (Darwish-MSA) and a DA portion (Darwish-DA), which we use for evaluation. For POS tagging, we already have MSA data for training and the three previously used DA datasets, namely EGY, GLF, and LEV, for evaluation. We use those for the zero-shot setting by omitting their training sets and using only the development and test sets.
We first study how well models trained for NER and POS tagging on MSA data only generalize to DA inputs at test time. We evaluate this zero-shot performance with both the XLM-R BASE and XLM-R LARGE models. For NER, we train on ANERCorp (which is pure MSA) and evaluate on both Darwish-MSA and Darwish-DA. For POS tagging, we train on the MSA subset of [45] and evaluate on the corresponding test set for each dialect. As shown in Table V, for NER a significant generalization gap of around 20% F$_1$ points exists between evaluation on MSA and on DA for both models. For POS tagging, the gap is as large as 18.13% accuracy for the LEV dialect with XLM-R BASE. The smallest generalization gap is on the GLF variety, which is perhaps due to the high overlap between GLF and MSA [25].

C. Zero-shot Self-Training
Here, for NER, similar to Section VII-B, we train on ANERCorp (pure MSA) and evaluate on Darwish-MSA and Darwish-DA. Table VI shows self-training NER results employing the selection mechanisms listed in Section IV, with different values for S and τ. The best improvement is achieved with the thresholding selection mechanism with τ = 0.90, where we obtain an F$_1$ gain of 10.03 points. More generally, self-training improves zero-shot performance in all cases, albeit with different F$_1$ gains. Interestingly, we find that self-training also improves test performance on MSA with the base XLM-R model. This is likely attributable to the existence of MSA content in the unlabeled AOC data. It is noteworthy, however, that the much higher-capacity large model deteriorates on MSA when self-trained (dropping from 68.32% to 67.21%). This shows the ability of the large model to learn representations very specific to DA when self-trained. It is also interesting that the best self-trained base model achieves 50.10% F$_1$, outperforming the large model before the latter is self-trained (47.35% in the zero-shot setting). This shows that a self-trained base model, suitable for running on more modest hardware, can outperform a much larger model that has not been self-trained.

As for POS tagging, we similarly observe consistent improvements in zero-shot transfer with self-training (Table VII). The best model achieves accuracy gains of 2.41% (EGY), 1.41% (GLF), and 1.74% (LEV). Again, this demonstrates the utility of self-training pre-trained language models on the POS tagging task even in the absence of labeled dialectal POS data (zero-shot).
For Sarcasm Detection, we follow [98] in balancing the examples selected in each self-training iteration by selecting an equal number of examples from each class (sarcastic and non-sarcastic). Without the balancing step, we find that the selected examples come from the most frequent class (non-sarcastic), which hurts performance since the model then learns only one class. The results for sarcasm detection are shown in Table VIII, where we see that self-training adds 3% and 2.5% (for XLM-R BASE) and 5.9% and 4.5% (for XLM-R LARGE) macro F$_1$ points on the development and test sets, respectively, using the best settings for self-training (S = 100 with class balancing). We also find that selection based on probability thresholds performs much worse than fixed-size selection; hence we omit these results.

D. Ablation Experiment
Here, we conduct an ablation experiment on the NER task in order to verify our hypothesis that the performance boost primarily comes from using unlabeled DA data for self-training. Using an MSA dataset of the same size as our unlabeled DA one (a set of MSA tweets from the AOC dataset mentioned before), we compare the performance of the self-trained model in both settings: MSA and DA unlabeled data. We run 3 different self-training experiments using 3 different values of τ for each type of unlabeled data. Results are shown in Table IX. While we find a slight performance boost from self-training even with MSA unlabeled data, the average F$_1$ score with unlabeled DA is better by 2.67 points, showing that using unlabeled DA data for self-training helps the model adapt to DA data at test time.

VIII. ERROR ANALYSIS

A. NER
To understand why self-training the pre-trained language model improves over mere fine-tuning, we perform an error analysis. We focus on the NER task, where we observe a huge self-training gain, and use the development set of Darwish-DA (see Section VII-C). We compare predictions of the standard fine-tuned XLM-R BASE model (FT) and the best performing self-trained (τ = 0.9) model (ST) on this data. The error analysis leads to an interesting discovery: the greatest benefit of the ST model comes mostly from reducing false positives (see Table X). In other words, self-training helps regularize the model's predictions on dialectal inputs.

To understand why the ST model reduces the false positive rate, we manually inspect the cases it correctly identifies that were misclassified by the FT model. We show examples of these cases in Table XIV (in the Appendix). As the table shows, the ST model is able to identify dialectal tokens whose equivalent MSA forms can act as trigger words (usually followed by a PER named entity). We refer to this category as false trigger words. An example is the word "prophet" (row 1 in Table XIV). A similar example in this category is in row (2), where the model is confused by a token meaning "who" in EGY but "to" in MSA, hence the wrong LOC prediction. A second category of errors is caused by non-standard social media language, such as letter repetitions in interjections (e.g., row (3) in Table XIV). In these cases, the FT model assigns the class PER, but the ST model correctly identifies the tag as "O". A third class of errors arises from out-of-MSA vocabulary. For example, the words in rows (4), (5), and (6) are all out-of-MSA; the FT model, not knowing them, assigns the most frequent named entity label in training (PER). A fourth category of errors occurs when a token that is usually part of a named entity in MSA instead functions as part of an idiomatic expression in DA. Row (7) in Table XIV illustrates this case.
We also investigate errors shared by both the FT and ST models (errors which the ST model also could not fix).
Some of these errors result from the fact that oftentimes both MSA and DA use the same word for both person and location names. Row (1) in Table XV (in the Appendix) is an example where the word "Mubarak", the name of the former Egyptian president, is used as a LOC. Other errors involve out-of-MSA tokens mistaken for named entities. An example is in row (3) of Table XV, where a word meaning "proof" or "basis" in EGY is confused with the word for "emirate" (a location). False trigger words, mentioned before, also play a role here. An example is in row (7), where a token is confused for PER due to the trigger word "Hey!", which is usually followed by a person name. Spelling mistakes are a third source of errors, as in row (4). We also note that even with self-training, detecting ORG entities is more challenging than PER or LOC. The problem becomes harder when such organizations are not seen in training, as in rows (8), (9), and (10), none of which occur in the training set (ANERCorp).
Here we investigate the false negatives produced by the self-trained model, observing a number of named entities that it misclassified as non-entities; these are a side effect of the "regularizing" behavior discussed above. See Table XVI (in the Appendix).
As an example, we take the last name which was classified both correctly and incorrectly in different contexts by the self-trained model. Context of correct classification is " ", while it is " " for the incorrect classification. First, we note that is not a common name (zero occurrences in the MSA training set). Second, we observe that in the correct case, the word was preceded by the first name which was correctly classified as PER, making it easier for the model to assign PER to the word afterwards as a surname.

B. Sarcasm Detection
We also conduct an error analysis on Sarcasm Detection, comparing the predictions of XLM-R BASE with and without self-training. For this, we use the best model on the development set (XLM-R BASE, S = 100 with class balancing). We analyze sample errors that were fixed by the self-trained model; see Table XVII (in the Appendix). The first four examples represent false negatives, which the fine-tuned model assumed to be non-sarcastic. We can see that in such dialectal contexts, the fine-tuned model suffers from the many words unseen during training on MSA. More specifically, the dialect-specific words and idioms in examples (1) through (6) represent language that is not encountered in MSA contexts, and therefore pose a significant challenge in zero-shot settings.
In addition, we show sample errors shared between the fine-tuned and the self-trained models; see Table XVIII (in the Appendix). As to why the self-trained model has not corrected these errors, we hypothesize that the vocabulary used in these inputs was not seen during self-training. In other words, this vocabulary was either not selected by the self-training selection mechanism to be added to the training data or not present at all in the unlabeled examples used for self-training. As a result, the model was not sufficiently adapted to handle these or similar contexts. We expect the performance on such inputs to improve with larger and more diverse unlabeled data for self-training.

IX. CONCLUSION

Even though pre-trained language models have improved many NLP tasks, they still need a significant amount of labeled data for high-performance fine-tuning. In this paper, we proposed to self-train pre-trained language models using unlabeled Dialectal Arabic (DA) data to improve zero-shot performance when training on Modern Standard Arabic (MSA) data only. Our experiments showed substantial performance gains on two sequence labeling tasks (NER and POS tagging) and one text classification task (sarcasm detection) across different Arabic varieties. Our method is dialect- and task-agnostic, and we believe it can be applied to other tasks and dialectal varieties; we intend to test this claim in future research. Moreover, we evaluated fine-tuning of the recent XLM-RoBERTa language models, establishing new state-of-the-art results on all three tasks studied.
