Arabic Chatbots: A Survey

A chatbot is a programmed entity that handles human-like conversations between an artificial agent and humans. Such conversation has attracted researchers interested in human-machine interaction, who aim to make the conversation more rational and hence pass the Turing test. Research on Arabic chatbots is comparatively scarce. This paper reviews the published studies on Arabic chatbots to identify the knowledge gap and to highlight the areas that need more study and research. The study concludes that research on Arabic chatbots is rare and that all available works are retrieval-based.
Keywords: Artificial intelligence; Arabic chatbot; conversational agent; ArabChat; human-machine interaction


I. INTRODUCTION
Artificial Intelligence (AI) is focused on learning processes. The idea of using a human language to communicate with computers holds merit for AI. A chatbot, or chat agent, is an intelligent conversational agent that interacts with human users via natural language and emulates human conversation. This area has attracted growing interest from both research and industry in the past few years [1]. The first chatbot was developed at the Massachusetts Institute of Technology (MIT), where Weizenbaum implemented the ELIZA chatbot in 1966 to emulate a psychotherapist [2].
Nowadays, a variety of chatbots are available online to serve different domains, ranging from customer service and information acquisition to entertainment, where users interact with these applications primarily to make small talk. They range from simple systems that extract answers from datasets when the input matches specific keywords to more advanced ones that utilize Natural Language Processing (NLP) techniques. A chatbot can be programmed to serve almost any human language. Although research on English chatbots is widespread, research on Arabic chatbots is scarce because of the difficulties of the Arabic language. The number of online Arabic users has increased, which motivates building Arabic chatbots. In this paper, we present a survey of Arabic chatbots.
Processing Arabic text poses many challenges [3], such as rich morphology, a high degree of ambiguity, orthographic variations, and the existence of multiple dialects. Moreover, written Arabic text can be classified into three categories. The first is Classical Arabic (CAL), or Quranic Arabic, as found in the Holy Quran. The second is Modern Standard Arabic (MSA), which is the official language of the Arab world and is used in formal written and spoken forms in media such as news, education, and literature [4]. The third is Dialectal Arabic (DA), which is used daily in spoken and written personal communication and in informal settings, where each country and region has its own dialect [2].
The rest of the paper is organized as follows: Section II presents the background. Section III presents the survey methodology. Section IV discusses the research on Arabic chatbots. Section V presents the conclusions.

II. BACKGROUND
A chatbot generally has three components [5]: the interface, which handles the user's input and output; the Knowledge Base, or brain, which holds the content of the conversation and keeps track of the domain; and the Conversation Engine, which manages the semantic context of the conversation.
There are two types of dataset models, which represent the type of knowledge source in chatbots: the retrieval-based model and the generative-based model. In the retrieval-based model, a chatbot uses a pool of predefined responses and employs some heuristic to select the proper response to the input, but it cannot respond properly when no suitable predefined response exists. In the generative-based model, a chatbot uses a set of techniques, based on deep learning and neural networks (NN), to generate new responses, and it may utilize predefined responses as well. In the following, we discuss the literature based on the techniques used, the length of the conversation, the domain of the conversation, and the dataset model.
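The retrieval-based behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any surveyed system: the response pool, the Jaccard word-overlap heuristic, the threshold value, and the names `RESPONSES`, `FALLBACK`, and `select_response` are all hypothetical. English strings are used only for readability.

```python
import re

# Hypothetical pool of predefined (prompt, response) pairs.
# It is not taken from any of the surveyed chatbots.
RESPONSES = [
    ("what are the library opening hours", "The library is open from 8 am to 8 pm."),
    ("how do i register for a course", "You can register through the student portal."),
    ("where is the admissions office", "The admissions office is in the main building."),
]

FALLBACK = "Sorry, I do not have an answer for that."


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_response(user_input, threshold=0.3):
    """Return the stored response whose prompt overlaps most with the input.

    The threshold is an assumed cut-off: below it, no predefined
    response is considered suitable and the fallback is returned.
    """
    query = tokens(user_input)
    best_score, best_response = 0.0, FALLBACK
    for prompt, response in RESPONSES:
        score = jaccard(query, tokens(prompt))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else FALLBACK
```

An input such as "Tell me a joke" shares almost no words with any stored prompt, so the sketch falls back to a default reply. This illustrates the limitation noted above: a retrieval-based chatbot can only answer what its pool already covers.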
In the retrieval-based model, the common techniques for building a conversational agent are pattern matching, Artificial Intelligence Markup Language (AIML), ontologies, parsing, the Markov Chain Model, and ChatScript. In the generative-based model, the techniques are based on neural networks and deep learning; the neural sequence-to-sequence (seq2seq) model is introduced below.
Pattern matching is used mainly in question/answer chatbots, where the system matches the input against a predefined structure to create a response. AIML, a language derived from XML, is widely used in chatbot design. It represents knowledge as objects consisting of topics and categories. An AIML pattern consists of words that contain letters and numerals but no special characters or spaces [6]. Many chatbot applications have emerged in English. A.L.I.C.E. [7] is a retrieval-based chatbot that uses advanced pattern matching and the AIML approach. It was the first AIML-based chatbot and won the Loebner Prize in 2000, 2001, and 2004. A.L.I.C.E. follows a supervised learning approach and is based on categories, each containing a pattern and a template for the response. Category patterns are matched to find the most appropriate response to a user input. AIML tags also provide for context handling and conditional branching when producing responses. Some Arabic studies build on A.L.I.C.E.
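The category idea (a pattern paired with a response template) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: real AIML is written in XML and interpreters use a more elaborate matching procedure, whereas this sketch keeps a plain ordered list and returns the first match. The categories and the `{star}` placeholder are hypothetical stand-ins for AIML's wildcard and its echo of the matched words.

```python
import re

# Hypothetical categories in the spirit of AIML: (pattern, template).
# "*" matches one or more words; "{star}" echoes the matched words.
CATEGORIES = [
    ("HELLO", "Hi there!"),
    ("MY NAME IS *", "Nice to meet you, {star}."),
    ("*", "I do not understand."),  # catch-all category, tried last
]


def normalize(text):
    """Uppercase the input and drop punctuation before matching."""
    return " ".join(re.findall(r"\w+", text.upper()))


def pattern_to_regex(pattern):
    """Translate a wildcard pattern into an anchored regular expression."""
    parts = [r"(.+)" if word == "*" else re.escape(word) for word in pattern.split()]
    return re.compile("^" + " ".join(parts) + "$")


def respond(user_input):
    """Return the template of the first category whose pattern matches."""
    text = normalize(user_input)
    for pattern, template in CATEGORIES:
        match = pattern_to_regex(pattern).match(text)
        if match:
            star = match.group(1).title() if match.groups() else ""
            return template.format(star=star)
    return ""  # only reached for empty input


print(respond("My name is Sara."))  # Nice to meet you, Sara.
```

Because the catch-all `*` category is listed last, more specific patterns are tried first; the ordering is a design choice of this sketch rather than a description of how any particular AIML interpreter ranks categories.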
An ontology [8] is a set of interconnected, hierarchical classes. The knowledge base can be described as a graph of classes, where each class describes concepts and their properties. Classes that have a logical relationship are connected, and these relationships are used to infer new statements (reasoning). Examples of ontologies are OpenCyc and WordNet.
Textual parsing [9] is a method that converts the text into a set of words (lexical parsing) in order to determine its grammatical structure. After a tree is built from these words, the structure can be checked against the rules of the language (syntactical parsing). Syntactical parsers are becoming more complex with the use of natural language processing. The Markov Chain Model [8] depends on the probability of occurrence of a word or letter in the input text; this method helps in building responses that are probabilistically more suitable and hence more likely to be correct. For example, if the input text is "xyyyzxyzyyyzy", then the Markov model of order 0 predicts that the letter "x" occurs with probability 2/13. The Markov model of order 1 assigns each letter a probability that depends on the previous letter.
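The Markov example above can be checked with a short Python sketch. The function names `order0_model` and `order1_model` are chosen here for illustration; the sketch simply counts letters (order 0) and letter-to-letter transitions (order 1) in the sample string and uses exact fractions to avoid rounding.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def order0_model(text):
    """Order-0 model: probability of each letter, ignoring context."""
    counts = Counter(text)
    return {symbol: Fraction(count, len(text)) for symbol, count in counts.items()}


def order1_model(text):
    """Order-1 model: probability of each letter given the previous letter."""
    transitions = defaultdict(Counter)
    for previous, current in zip(text, text[1:]):
        transitions[previous][current] += 1
    model = {}
    for previous, counter in transitions.items():
        total = sum(counter.values())
        model[previous] = {c: Fraction(n, total) for c, n in counter.items()}
    return model


sample = "xyyyzxyzyyyzy"
p0 = order0_model(sample)
p1 = order1_model(sample)
print(p0["x"])       # 2/13, as in the text
print(p1["z"]["y"])  # 2/3: "z" is followed by "y" in two of its three occurrences
```

In the order-1 model, "x" is always followed by "y" in this sample, so the probability of "y" given "x" is 1, whereas the order-0 model gives "y" a probability of 8/13 regardless of context.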
ChatScript [8], [9] aims to be easier to maintain than AIML by offering a better syntax, and it fixes the zero-word matching problem. ChatScript first finds the topic that best matches the user query string and then executes a rule in that topic. Rather than using separate categories for each word as in AIML, ChatScript uses "concepts" to group words that are similar in meaning or part of speech. Suzette (written in ChatScript) won the 2010 Loebner Prize.
On the other hand, a common technique for building a generative-based chatbot is the Recurrent Neural Network (RNN). The seq2seq model [10] is an encoder-decoder model that uses RNNs. It is primarily used for translating from one language to another, but in the context of chatbots, the input is "translated" into a response. The seq2seq model is composed of two main RNNs: an encoder RNN, which takes the input sequence and encapsulates its information into a fixed representation one cell at a time, and a decoder RNN, which takes that representation and generates, also one cell at a time, a variable-length text that best responds to it. Seq2seq encodes only the important information in the sequence and converts a sequence of symbols into a fixed-size feature vector. The cell used in the RNN is the long short-term memory (LSTM) cell, which allows the network to control what information is remembered or updated from the previous cells [10]. On scanning the literature, no studies were found that apply a neural network to Arabic chatbot design.
In addition to the technique used and the type of data being processed, this survey considers the length of the conversation, the domain of the conversation, and the dataset model. Conversation length is classified as short or long. In a short conversation, a single response is produced for a single input, as in a question/answer conversation. In a long conversation, a large amount of information is exchanged during the conversation lifetime, and this information is tracked and may appear in the output. In terms of conversation domain, chatbots are classified into two types, closed and open. A closed-domain chatbot is designed to serve a specific purpose, so the knowledge required to generate a suitable response to an input is limited. An open-domain chatbot resembles human conversation: the domain may change over time, and more than one conversation domain is supported.

III. METHODOLOGY
The survey methodology consists of scanning different literature databases. A similar methodology is used in the literature review in [11]. The literature was collected from highly cited computer science libraries: IEEE, ACM, Springer, Science Direct, and Google Scholar. The search was done using ten keywords, each coupled with the keyword "Arabic". Those keywords are "chatbot", "chatterbot", "ArabChat", "chat agent", "interactive agent", "conversational agent", "conversational robot", "artificial conversational", "dialogue", and "utterance". The search returned 184 papers, as shown in Fig. 1. These papers, published from 2004 to 2017, were evaluated according to their titles and abstracts, and papers that do not present an implementation of an Arabic chatbot were eliminated. After evaluation and elimination, fourteen papers that present an Arabic chatbot application remained; they came from the IEEE and Springer libraries and Google Scholar.

IV. ARABIC CHATBOT RELATED WORK
Although developed Arabic chatbot applications exist, the research available on Arabic chatbots is limited. Examples of such applications include Al-Haj Bot, Rammas, Msa3ed, Theyabi, El-Kahwagy, and others, which provide Arabic chatbots developed for commercial purposes. There are also platforms that provide developers with coding facilities and support, such as Watson by IBM, Messenger Bot by Facebook, Telegram Bot, and PandoraBot. These platforms and applications are excluded from this review because of the lack of published research on them, which makes it difficult to analyze and compare them fairly. The purpose of this survey is to highlight state-of-the-art Arabic chatbot research.
The fourteen collected papers present twelve different Arabic chatbot applications, which are classified and evaluated in a manner similar to that applied in [12], [8], [9]. Based on the type of data a chatbot processes, it is classified into one of two categories: text conversational chatbots and speech conversational chatbots. The classification relies on the chatbot's input and output interaction. In each category, chatbots are further classified by implementation technique into two subcategories, namely the pattern matching approach and the AIML approach. In addition, we discuss chatbot aspects such as the length of the conversation in terms of interaction duration, the domain of the conversation in terms of the topics the chatbot can handle, and the dataset model of the chatbot.

A. Text Conversational Chatbot
In this category, the interaction with the chatbot is through textual input and output. The chatbots in this category are classified by implementation technique into two subcategories: the AIML approach and the pattern matching approach.
1) AIML approach: Among the earliest research studies on Arabic chatbot applications in the collected related work is the Quran chatbot [13] by Shawar and Atwell. It is a chatbot about the Quran, the Islamic holy book. The Quran contains 6236 verses, or "Ayahs", and 114 "Surahs", each of which is a set of verses. The user inputs are Arabic words with "Tashkil", or diacritics, which are used as phonetic guides. The chatbot replies by finding the "Ayahs" from the Quran that contain the user's input. Because the Quran text is non-conversational, a Java program was developed to carry out a learning process. The learning process is based on the most significant word of each "Ayah", which forms the category pattern in the AIML file, while the template is the Ayah itself. The Arabic AIML file is generated by the Java program. The conversation length is short, since the chatbot gives a single response to a single user input. The domain of the conversation is limited to the content of the Quran, so the chatbot dataset is retrieval-based. The interaction and responses of the chatbot are limited by the pool of most significant words of the "Ayahs" extracted by the Java program.
The Quran14-114 chatbot by Shawar and Atwell [6] is a version of the ALICE chatbot [14] combined with a Java program so that it behaves as a Quran chatbot. The conversation in this chatbot is short. The user inputs a question or a statement in English, and the chatbot responds with one or more appropriate "Surahs" and "Ayahs" from the Quran in both English and Arabic. The Java program reads the Quran text from a corpus and converts it to the AIML format to be used by the ALICE chatbot. The domain of the conversation is closed, based on the content of the Quran in both English and Arabic. This is also a retrieval-based model, as the dataset is limited to the content of the Quran. The challenge in this work was to show how the ALICE chatbot can be adapted to learn from non-conversational text.
The Arabic Web Question Answering (QA) chatbot [2], [15] is a web interface chatbot based on an Arabic QA corpus that was built from five different web pages and contains 412 Arabic questions and answers. Those web pages cover topics such as motherhood and pregnancy, dental care, fasting and health, blood diseases such as cholesterol and diabetes, and blood charity. The chatbot supports more than one closed domain and is thus regarded as having a closed conversation domain. The conversation is short: the user inputs a textual question in MSA about one of the supported domains, and the chatbot responds with the answer without using sophisticated NLP. A Java program was developed to convert the text corpus into two AIML files, atomic and default. The atomic file contains the questions and answers that appear in the corpus. The default file is used to ensure that the user question is mapped to the appropriate question stored in the knowledge base. The default file is built using the first word and the most significant word approach. The first word acts as a classifier for the question, and the most significant word is the least frequent word in the question.
The most significant word is found by building a frequency list after applying a tokenization process to the question. The generated list contains the question's words along with their frequencies. The approach then extracts the two most significant words in the list, that is, the two least frequent words, and uses them as keywords to map the question to an answer. The purpose of using the most significant word approach was to increase the rate of expected outputs. The chatbot was tested by entering fifteen questions, and 93% of the answers were correct. The main drawback of this model appears when the structure of the question differs from the one stored in the knowledge base, in which case the chatbot responds with wrong answers. That happens because the chatbot does not use a heuristic to select a proper response and is based on a direct retrieval model. Also, a success rate of 93% is hard to justify with a test set of only fifteen questions.
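The first word and most significant word approach can be sketched as follows. This is a minimal illustration, not the implementation of [2], [15]: it assumes that word frequencies are counted over all questions in the corpus, it uses a tiny hypothetical English corpus with placeholder answers for readability, and the names `CORPUS`, `keywords`, and `answer` are invented here.

```python
import re
from collections import Counter

# Hypothetical question/answer corpus with placeholder answers.
CORPUS = [
    ("what is the normal level of cholesterol", "Answer about cholesterol."),
    ("what is the normal level of blood sugar", "Answer about blood sugar."),
    ("what is the cause of diabetes", "Answer about diabetes."),
]


def tokenize(text):
    """Split a question into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_frequency(corpus):
    """Count how often each word occurs across all stored questions (assumed scope)."""
    freq = Counter()
    for question, _ in corpus:
        freq.update(tokenize(question))
    return freq


def keywords(question, freq, k=2):
    """Return the first word and the k least frequent (most significant) words."""
    words = tokenize(question)
    unique = list(dict.fromkeys(words))  # drop repeats, keep order
    significant = sorted(unique, key=lambda w: freq[w])[:k]
    return words[0], significant


def answer(user_question, corpus, freq):
    """Map a user question to a stored answer via first word plus keywords."""
    user_words = tokenize(user_question)
    for question, reply in corpus:
        first, significant = keywords(question, freq)
        if user_words[:1] == [first] and set(significant) <= set(user_words):
            return reply
    return None  # no stored question matches


FREQ = build_frequency(CORPUS)
print(keywords(CORPUS[0][0], FREQ))  # ('what', ['cholesterol', 'normal'])
```

In this sketch, "What is the normal cholesterol range?" still maps to the cholesterol answer, but "Is cholesterol normal?" returns `None` because its first word differs. This mirrors the drawback described above: changing the structure of the question breaks the mapping.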
The BOTTA chatbot by Ali and Habash [3] is a female chatbot that simulates friendly conversations with users. The chatbot supports the Egyptian Arabic dialect for both input and output. BOTTA is available to the public. It is modeled on the English chatbot Rosie [16], and its knowledge base is made up of Rosie's set of AIML files. Some of Rosie's AIML files were translated directly into Arabic, and the others were modified to suit the Arabic dialect. For each conversation, BOTTA temporarily stores basic information about the user, such as age, gender, and nationality, by asking questions. The conversation is open, since the chatbot can respond to different topic domains, and long, since the chatbot can respond to the user based on previous information in the conversation. However, it does not update the knowledge base or add new responses, so it follows the retrieval-based model: it depends on a pool of predefined responses and uses heuristics to respond with an appropriate output. The chatbot does not perform text normalization on the user input to find a suitable response. Instead, it performs orthographic transformations, which include correcting common spelling mistakes in the user input. With this method, BOTTA was able to resolve 85.1% of the common spelling mistakes in Arabic typing.
Table I shows a summary of the discussed text conversational chatbots that use the AIML approach.
2) Pattern matching approach: Mohammad Hijjawi, Zuhair Bandar, Keeley Crockett, and David Mclean [17] implemented ArabChat, a conversational agent with a web interface. The conversation domain is closed: the chatbot is designed to serve the students of the Applied Science University in Jordan. The user interacts with the chatbot through text in MSA. The conversation remains ongoing until one of the parties terminates it. ArabChat reuses information previously exchanged during the conversation when responding to the user input, creating long conversations. The core components of ArabChat are a scripting engine and a scripting language. The scripting engine is divided into subcomponents that allow it to handle the topics of the conversations. The ArabChat knowledge base contains 1218 utterances, which are classified into contexts; each context contains rules, and each rule consists of patterns and associated textual responses. ArabChat was tested with 174 users, who entered an average of 7 inputs each. The results show that 73.56% of the inputs matched the expected output.
Enhanced ArabChat [18] is an updated version of ArabChat [17] by Hijjawi, Bandar, and Crockett. This version adds features including Utterance Classification and the Hybrid Rule. These improvements were made at the engine level, while some additional improvements need to be made to the scripting language and knowledge base to meet the changed needs. The Utterance Classification feature aims to distinguish between question and non-question utterances. It works by adding extra keywords to the pattern of a question-based rule for use in keyword matching. The Hybrid Rule is the second feature, and it focuses on how to handle and reply to an utterance that requests many topics. Although ArabChat achieved a better Ratio of Matched Utterances to the Total (RMUT) than the enhanced version because of unserious users, manual checking gives more accurate results and showed an improvement in performance. Manual analysis of the logs shows that Enhanced ArabChat deals successfully with 82% of utterances containing two topics, and this ratio decreases as the number of topics in the utterance increases. Manual checking of the utterance classification shows a high percentage of question-based utterances due to three factors: the selected domain; the users' needs, which imply that they are more likely to ask than to discuss; and the difficulty of scripting a large number of rules.
ArabChat with classification methodology [19] is another update of ArabChat [17] by Hijjawi, Bandar, and Crockett, which uses a new classification methodology for Arabic utterances. This approach classifies sentences into questions and non-questions, the latter including assertions and instructions. The benefit of this approach is that the number of patterns required per rule decreases, which improves performance by firing the suitable rule depending on whether the utterance is a question or a non-question. Different topics and a list of function words were used from domains such as politics, religion, sports, education, and business, along with some synthetic non-question sentences and indirect questions. The classification is done by pre-processing the Arabic sentence into equivalent numeric tokens and then importing the tokens into the WEKA machine learning toolkit. In WEKA, a Decision Tree, which achieved the highest classification accuracy on the tokenized numeric dataset, is generated and then converted into standard IF-THEN classification rules to classify utterances.
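The flow described above (function words mapped to numeric tokens, then IF-THEN rules applied to those tokens) can be sketched in Python. Everything specific here is an assumption made for illustration: the function-word table, the numeric IDs, the three-token window, and the two rules are hypothetical and hand-written, whereas in [19] the rules are derived from a Decision Tree trained in WEKA.

```python
# Hypothetical function-word table: each known question word gets a numeric ID.
# The words, IDs, and rules below are illustrative, not those of [19].
FUNCTION_WORD_IDS = {
    "هل": 1,     # yes/no question particle
    "أين": 2,    # where
    "متى": 3,    # when
    "كيف": 4,    # how
    "لماذا": 5,  # why
}
UNKNOWN = 0  # ID assigned to any word outside the table
QUESTION_IDS = set(FUNCTION_WORD_IDS.values())


def to_numeric_tokens(utterance, length=3):
    """Encode the first `length` words as numeric tokens, padding with UNKNOWN."""
    words = utterance.replace("؟", " ").split()  # drop the Arabic question mark
    ids = [FUNCTION_WORD_IDS.get(word, UNKNOWN) for word in words[:length]]
    return ids + [UNKNOWN] * (length - len(ids))


def classify(utterance):
    """Apply IF-THEN rules of the kind a decision tree could be converted into."""
    first, second, _ = to_numeric_tokens(utterance)
    if first in QUESTION_IDS:
        return "question"
    # Assumed rule for indirect questions such as "tell me where ...".
    if first == UNKNOWN and second in QUESTION_IDS:
        return "question"
    return "non-question"
```

For example, `classify("هل المكتبة مفتوحة اليوم؟")` ("Is the library open today?") yields "question", while the same sentence without the leading particle yields "non-question". Knowing the utterance type in advance is what lets the engine restrict matching to question rules or non-question rules.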
Mobile ArabChat [20] is a mobile-based conversational agent built on the original ArabChat [17], and it also works as an advisor for students at the Applied Science University in Amman. It is a light version of ArabChat implemented on Android. Users in the Arab countries face challenges such as slow and unstable internet connections and limited bandwidth, yet this application works even with limited bandwidth. Mobile ArabChat implements a text-based pattern matching approach. The framework consists of the same components as ArabChat: a scripting engine, a scripting language, and a knowledge base. In a subjective evaluation, 96% of users agreed that using Mobile ArabChat on a mobile device is better than using the same system on a desktop. However, Mobile ArabChat needs an internet connection to work.
Abdullah [4] is an Arabic Conversational Intelligent Tutoring System (CITS) that teaches children aged 10 to 12 essential topics about Islam. This online system engages with students in MSA by asking them a series of questions and discussing their answers, and it uses Classical Arabic to give evidence from the Quran and the Hadith, the sayings and traditions of the Prophet of Islam, Muhammad. The system uses images and sound effects to interact with students, and it can determine the student's knowledge level and direct the conversation accordingly. Abdullah CITS can distinguish between the user's questions and answers. The framework is based on a pattern matching approach. It consists of a knowledge base holding the subject topics, the Conversational Agent scripting language, which delivers the tutorial conversation to the learners, and the Tutorial Knowledge Base, which determines the knowledge level of the individual student and the subject.
LANA [21] is another CITS. It was developed to teach science topics in MSA to children with Autism Spectrum Disorder (ASD) aged 10 to 16 who have reached basic competency in the mechanics of Arabic writing. Children with ASD have difficulties with traditional learning because the teacher cannot meet the needs of every individual student. LANA engages children with a science tutorial delivered in MSA. It is similar to Abdullah CITS, but it offers different learning style models, such as visual, auditory, and kinesthetic, enabling children to practice learning skills independently according to their needs, using pattern matching and a short text similarity algorithm. The system also interacts with children using materials such as pictures, audio, or instructions, according to the user's learning style. Table II shows a brief review of the discussed pattern matching text conversational chatbots.

B. Speech Conversation Chatbot
The interaction in a speech conversation chatbot is based on voice as input, output, or both. Textual interaction for input or output is also supported in this type of chatbot alongside the voice interaction. However, research on speech conversation chatbots in Arabic is limited. This section presents two related works, which are classified by approach into AIML and pattern matching.
1) AIML Approach: Hala [22] is a female robot receptionist located at Carnegie Mellon University in Qatar (CMU-Qatar). Hala accepts and speaks English and Arabic. There are three possible input modes: English, MSA, and "Arabizi", which is Arabic written in English letters. Users interact with the chatbot through the keyboard. Hala responds by producing a voice reply, and the text appears next to her face on the screen. The response language depends on the language of the user input. Hala provides information about campus directions, weather, and local events and answers queries about her personal life, in an open-domain, long conversation style. Hala and the user take an equal number of turns in the conversation. When the user leaves, Hala detects that the conversation has ended after a defined timeout. The purpose of the Hala project was to explore the culture of human-robot interaction at CMU-Qatar by studying dialogue patterns in relation to the robot's attributes, the knowledge bases covered, and cultural variation in the community of users.
2) Pattern Matching Approach: IbnSina [23], [24] is a multilingual conversational robot that supports MSA and English. The user interacts with it through text or voice input. The IbnSina robot responds with audio output, where the language of the response matches the language of the user input. IbnSina generates human-like dialogue by accessing the online Wikipedia and a stored Quran database. This allows the IbnSina chatbot to cover a wide range of topics. As a result, it can reply to general questions, translate words, and answer questions using online information or the books stored in its database. It also gives the user feedback when information is missing or the spelling is incorrect. That makes the conversation style of IbnSina open and long. However, it does not generate new responses, as it depends on the information predefined in the dataset to respond with an appropriate output. The IbnSina conversation system is designed around object-oriented classes, such as a Wikipedia class and a Quran class, which allow the robot to reply with the expected response, and a chatterbot class, which enables simple conversation and replies to user inquiries. There is also a chatterbot module that replies to user inputs.
Two modules are supported in the conversation. The first is text-to-speech, where the input is text and the output is audio. The second is speech-to-text, where the input is audio that is converted to text and then processed as in the first module to produce the speech output. In addition, the IbnSina robot supports other features, such as reading text aloud from an image captured by the camera located in the robot's eye area. The robot also interacts with the user through body movements such as real-time lip syncing, eye blinking, face movement, facial expressions, and shaking hands. Table III shows a summary of the discussed related work for both the AIML and pattern matching approaches to Arabic speech chatbots.
From the reviewed studies, we notice that all of the presented work on Arabic chatbot applications employs the retrieval-based model. That is, the chatbot responses are drawn from a data pool of AIML files, databases, or web pages, which can limit the capability and usability of the chatbot. We also notice that all of the related work relies on AIML or pattern matching approaches. This may 1) lead to a small chatbot dataset and a restriction to closed domains, and 2) limit the chatbot's responses, since the user input must match the chatbot dataset to obtain the correct response. Moreover, the complexity of Arabic grammar and users' spelling and grammar mistakes could be among the reasons for the limited capabilities of the Arabic chatbots in the literature, which may explain the limited number of text and speech Arabic chatbot applications.

V. CONCLUSIONS
This paper presents a survey of Arabic chatbots covering twelve different Arabic chatbot studies. They are classified by conversation interaction type into two groups: text and speech conversational chatbots. The studies were presented and evaluated based on the implementation technique, the conversation length and domain, and the model used for the chatbot dataset. The evaluation shows that all of the reviewed chatbots use the retrieval-based dataset model. This raises a flag to focus on studying and formalizing Arabic NLP for conversational agent research. The linguistic complexities hindering Arabic NLP, such as morphological ambiguity, where a word has many meanings, and syntactic ambiguity, where a sentence has more than one structure, along with the diversity of Arabic dialects, are open research challenges that need the cooperation of linguists and computer scientists. Hence, until now, pattern matching and AIML have been the approaches used to build Arabic conversational agents. Additionally, generative-based and deep learning models are challenging to achieve in Arabic because of the lack of available resources for training the learning model, compared with the pool of resources available in English, for example. Moreover, there is a shortage in the published

Fig. 1. Total results from literature databases for search keywords.

TABLE III. A SUMMARY OF SPEECH ARABIC CHATBOTS