A Review on Event-Based Epidemic Surveillance Systems that Support the Arabic Language

With the revolution of the internet, many event-based systems have been developed for monitoring epidemic threats. These systems rely on unstructured data gathered from various online sources. Moreover, some systems are able to handle more than one language in order to cover news reports related to disease outbreaks worldwide. The aim of this paper is to examine existing systems in terms of their support for the Arabic language. The 28 identified systems were evaluated based on different criteria. The results of this evaluation show that only 5 systems support the Arabic language, and that they do so using translation tools; hence, disease outbreaks in news reports written in Arabic are not directly processed. In other words, no existing event-based system in the literature has yet been developed specifically for Arabic health news reports to monitor epidemic diseases.
Keywords: Public health; infectious disease; event extraction; disease surveillance system; Arabic language


I. INTRODUCTION
During the past few years, the spread of many different pandemic diseases has increased worldwide. For example, the disease caused by the Ebola virus was first reported by the World Health Organization (WHO) in Guinea in 2014; it then spread rapidly to many West African countries, causing hundreds of deaths (Guinea 346, Liberia 181, Nigeria 1, Sierra Leone 37) [1], [2]. It was also transmitted to countries outside the African continent, including Italy, the United Kingdom, Spain and the United States of America. The latest information about Ebola can be found at http://www.who.int/csr/don/archive/disease/ebola/en/.
In addition to the outbreak of Ebola in Africa, severe acute respiratory syndrome coronavirus (SARS-CoV) was identified in Asia (2002/2003), an outbreak of the pandemic H1N1 influenza virus occurred worldwide (2009), and the Middle East Respiratory Syndrome (MERS) was found in Saudi Arabia (2012 to date) [3], [4]. The threat that infectious disease outbreaks pose to public health has therefore prompted countries and organizations to develop several early warning surveillance systems [5]. However, traditional surveillance systems, or indicator-based surveillance systems, need public health networks (Sentinel Networks) in order to collect predefined structured data about diseases on a routine basis from indicator sources, such as over-the-counter drug sales and emergency department visits [6], [7]. Consequently, when passive surveillance systems are used, all health facilities are required to submit monthly, weekly or daily reports of disease data on a regular basis. Although implementing this type of system has some advantages, such as its ability to cover all parts of a country and its statistical power, it takes a couple of weeks for disease patterns to be detected and for the results regarding possible outbreaks to be disseminated; furthermore, not all countries have the infrastructure required to implement such a system [3], [7], [8], [9]. On the other hand, as a result of the technological revolution of the internet, another type of surveillance system has emerged, called the event-based surveillance system [10]. Generally, event-based surveillance systems can be described as real-time monitoring of diseases 24/7 through gathering information from informal sources, such as online news. According to Keller et al. [11], WHO obtains the information behind the majority of its disease outbreak investigations from diverse informal online sources.
The remainder of the paper is organized as follows: in Section II, a background to the topic and a review of related work are presented. Section III describes the methods used in performing this work. Section IV explores disease outbreak surveillance systems in Arabic. Section V discusses the state of the art of outbreak surveillance systems. Section VI presents the results and discussion. Finally, the conclusion of this work is presented in Section VII.

II. RELATED WORK
According to Agheneza et al. [12], surveillance systems are classified into two types based on the type of data being processed. The first type is the indicator-based surveillance system (syndromic surveillance). The second type is the event-based surveillance system, which collects and processes unstructured data from formal and informal sources, such as newspapers, reports, and medical websites. The current paper focuses solely on the event-based surveillance system, which utilizes text mining techniques to process media sources (news reports related to disease outbreaks) for text understanding, i.e. detecting and extracting infectious disease outbreak-related information, such as disease type, location name, date, and number of victims, if any. In addition, more focus is placed on systems that are able to process unstructured Arabic texts in the health domain.

A. Indicator-based Surveillance Systems
To date, many public indicator-based surveillance applications have been introduced to detect and track increases in disease incidence rates based on structured predefined information collected from different official sources, such as emergency room visits, telephone calls, and over-the-counter drug sales (syndromic/clinical surveillance data) [6]. Many detection methods are used to perform this task. Tsui et al. [13] categorized these methods into three types: temporal (SPC, regression, time series, and forecast-based methods), spatial (scan statistics), and spatiotemporal surveillance techniques. Detailed information on these types of health surveillance systems can be found in [13], [14]; the most recent reviews of these systems can be found in [15], [16]. In addition, the national communicable disease surveillance systems proposed between 2000 and 2016 in developed countries were reviewed in [17].

B. Event-based Surveillance Systems
On the other hand, many event-based surveillance systems have been developed for processing unstructured data relating to infectious disease outbreaks gathered from informal or non-traditional sources, such as online newspapers, news reports and social media [6]. Keller et al. [11] examined and compared the performance of three existing systems: EpiSPIDER, HealthMap and the Global Public Health Intelligence Network (GPHIN), which were developed to process event-based outbreak information. However, the most comprehensive review was conducted by Velasco et al. [3], who reviewed infectious disease surveillance publications between 1990 and 2011 in detail, yielding 13 event-based systems. These systems were developed between 1994 and 2006 and are listed in Table I, which uses 15 review criteria: system name, system category, country, year started, coordinating organization, purpose, jurisdiction, supporting Arabic, disease type, public access, data processing, dissemination of data, most avid users, system evaluation and homepage.
Choi et al. [5] performed a systematic review of web-based infectious disease surveillance systems, published in 2016. They identified 11 web-based surveillance systems, including GOARN, GPHIN, MedISys, BioCaster, HealthMap, ProMED, and EpiSPIDER, which had already been reviewed by Velasco et al. [3]. However, they also reviewed new systems that were not mentioned in [3]: EpiSimS (now known as Object-oriented Platform for People in Infectious Epidemic, OPPIE) [18], Google Flu Trends [19], GET WELL [20] and Influenzanet [21], as can be seen in Table I.
To the best of the authors' knowledge, the most recent review article was carried out by Yan et al. [22] and was published in October 2017. It evaluated systems described in articles published between 2006 and 2016 in terms of their methods, timeliness and accuracy outcomes.

III. METHODS
Web-based infectious disease surveillance systems were systematically reviewed by focusing on multilingual systems and on systems dedicated to the Arabic language. Several electronic databases, such as Google Scholar, PubMed, Web of Science, IEEE Xplore Digital Library, and CiteSeerX, were searched for English literature published between 1994 and 2018. Many different terms or keywords were used to perform the search; these include "surveillance systems", "infectious disease systems", "event/internet-based surveillance systems", "Arabic surveillance systems", "syndromic surveillance", "biosurveillance", and "Arabic text mining systems".

IV. DISEASE OUTBREAK SURVEILLANCE SYSTEMS IN ARABIC
Some of the developed systems, such as Argus, BioCaster, GOARN, GPHIN, HealthMap, MedISys and PULS, MiTAP, and ProMED, are able to process texts written in languages other than English. As previously mentioned, the aim of this study is to investigate event-based surveillance systems that can process Arabic texts in the domain of disease outbreaks. Al-Mahmoud and Al-Razgan [23] conducted a systematic review of published works on Arabic text mining between 2002 and 2014. The review showed that the topics of the articles were limited to a few domains, such as opinion mining, crime, social networks, Arabic Wikipedia, and Islamic studies. Therefore, it is expected that disease outbreaks in news reports written in Arabic have not been directly processed; i.e. the 5 developed systems that support the Arabic language in Table I use translation tools to translate Arabic online news reports of disease outbreaks into English before processing them to identify the desired information. These systems are examined further below.

1) ProMED-mail
The Program for Monitoring Emerging Diseases (ProMED-mail) is a multilingual early warning system for emerging disease outbreaks developed in 1994 by Hugh-Jones [24], [25]. It monitors human, plant and animal diseases worldwide. Moreover, some surveillance systems, such as Argus, BioCaster and HealthMap, depend on health warning reports produced by ProMED-mail. This system is available to the public and no subscription fees are required. ProMED-mail's data come from reports submitted by its subscribers. Currently, there are more than 70,000 subscribers from over 185 countries. The reports produced by the system are reviewed by a number of experts before dissemination [25], [51]. The system relies on its subscribers to perform the translation task. Although ProMED-mail utilizes translated Arabic news reports, it does not disseminate information in Arabic.

2) The Global Public Health Intelligence Network (GPHIN)
GPHIN is a multilingual internet-based system developed by Health Canada in collaboration with the World Health Organization (WHO). The system collects public health reports from global media sources, such as newswires and websites, on a real-time basis in order to monitor infectious disease outbreaks. It supports eight languages (English, Chinese, Spanish, Portuguese, Russian, Arabic, French and Farsi) and monitors disease outbreaks by using machine translation to translate non-English reports into English, and vice versa. According to Wang and Barry [52], the GPHIN team also supports the Indonesian language. Moreover, GPHIN can track not only events related to disease outbreaks or infectious diseases but also other events, such as animal diseases, chemical incidents, plant diseases, and contaminated food and water. Most of the WHO information is provided by GPHIN. Furthermore, the Centers for Disease Control and Prevention (CDC) and the Food and Agriculture Organization of the United Nations (FAO) use GPHIN on a daily basis [12]. However, GPHIN is not free, and official organizations must pay to subscribe. It also presents ... [55].

PULS extracts the disease name, the number of victims and their condition, the location and the date. According to Agheneza [12], PULS is only able to process reports in the English language. MedISys provides three types of access levels [55]:
• Free access for public

In this review, only a few systems were found in the literature that are able to directly process Arabic health data, i.e. without using translation engines. One such system, the named entity recognition system NAMERAMA, was developed to identify disease-related information, such as diagnosis methods, symptoms, disease names and treatment methods, from textual reports in the Arabic medical domain [57]. This system uses Bayesian Belief Networks (BBN) to extract the aforementioned entities and comprises two stages: the first is the processing stage, which includes preprocessing, data analysis and feature extraction; the second stage uses a BBN to perform the classification task. The AMIRA tool is used for applying Part of Speech (POS) tagging to the data, and an annotated corpus is used to evaluate the proposed system. However, only 27 articles were used for evaluating the performance of NAMERAMA, which is a very small dataset. In addition, the system focused only on identifying cancer-related information, i.e. other types of diseases were not covered. Table II lists the evaluation results.
In [58], two methods for identifying and extracting medical terms from an Arabic medical corpus were proposed. This work forms part of the Multimedica project, funded by the Spanish Ministry of Science and Innovation. The aim of the project is to develop multilingual resources and tools, covering the Spanish, Arabic, and Japanese languages, for processing reports published by news agencies in the health domain. The first proposed approach is based on a gazetteer that contains 3473 Arabic medical terms. These terms were translated from English medical term resources (SNOMED and UMLS) using Google Translate. In contrast, the second approach uses 410 Arabic terms that are the equivalents of Latin prefixes and suffixes commonly used in the medical and health domain. The evaluation results show that the first approach achieved 100% accuracy and outperformed the second approach; however, it only achieved 54% recall, which is relatively low.
With regard to infectious disease outbreaks, Alruily et al. [59] presented a preliminary work on developing a web-based surveillance system to track infectious diseases by extracting disease-related information from Arabic news textual reports. However, no results were reported because it was a foundational study.
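The trade-off reported for the gazetteer approach, where every match is correct but many terms are missed, can be illustrated with a minimal sketch. It assumes that the reported "accuracy" behaves like precision; the function names, the three-entry gazetteer and the toy sentence are hypothetical and are not taken from [58].

```python
def find_terms(tokens, gazetteer, max_len=3):
    """Return the gazetteer entries that occur in `tokens` as word n-grams."""
    found = set()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tokens[start:start + length]
            if len(window) < length:
                break  # ran past the end of the sentence
            candidate = " ".join(window)
            if candidate in gazetteer:
                found.add(candidate)
    return found


def precision_recall(predicted, gold):
    """Precision and recall of a predicted term set against a gold term set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy gazetteer: cancer, fever, hepatitis.
gazetteer = {"سرطان", "حمى", "التهاب الكبد"}
# Toy sentence: "The patient suffers from fever and headache and hepatitis."
# The conjunction is written as a separate token so that whitespace splitting works.
tokens = "يعاني المريض من حمى و صداع و التهاب الكبد".split()
# The gold annotation also contains "headache", which the gazetteer lacks.
gold = {"حمى", "صداع", "التهاب الكبد"}

predicted = find_terms(tokens, gazetteer)
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case both matched terms are correct, so precision is 1.0, while the term missing from the gazetteer lowers recall to 2/3. A fixed term list can only ever find what it already contains, which is one plausible reading of the low recall reported above.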

V. LATEST DISEASE OUTBREAK SURVEILLANCE SYSTEMS
As can be seen in Table I, the most recent system to be developed is that of Alshowaib [44], who used a rule-based approach to extract disease outbreak-related information, i.e. named entities, such as disease name, date and location, location of the reporting authority, and outbreak incident. The rules were created based on an analysis of textual disease outbreak reports. This system is solely dedicated to the English language. The performance of this system can be seen in Table III.
Nguyen and Nguyen [46], [60] developed the Disease Extraction System for Real-time Monitoring (DESRM), which is used for Vietnamese online news. Its approach depends on semantic rules and machine learning to extract infectious disease events. DESRM consists of two components: disease event identification from textual data, and disease event information extraction. In the first component, semantic rules and machine learning (a maximum entropy model) are used for identifying phrases and detecting disease events. In the second component, Named Entity Recognition (NER) rules and a dictionary are utilized for extracting the information related to disease events: time, disease name, and place.
The China Infectious Diseases Automated-Alert and Response System (CIDARS) was developed in 2008 by the Chinese Center for Disease Control and Prevention [42]. CIDARS uses three early warning methods: a spatial-temporal model, a temporal model and a fixed threshold detection method [61]. Although CIDARS is able to detect signals of infectious disease, it produces many false positive signals [62]. Investigations performed in 2017 on surveillance and early warning systems for infectious disease developed between 2012 and 2016 in China can be found in [63].
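The rule-based and dictionary-based extraction described above for the system of Alshowaib and for DESRM can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionaries, the date rule, the function names and the example sentence are hypothetical, and neither system is claimed to work this way internally.

```python
import re

# Hypothetical dictionaries; a real system would use much larger resources.
DISEASES = {"cholera", "ebola", "measles"}
LOCATIONS = {"guinea", "liberia", "sierra leone"}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Rule: a date is "<day> <month name> <4-digit year>".
DATE_RULE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b")


def lookup(text, dictionary):
    """Return the dictionary terms that appear in the text as whole words."""
    lowered = text.lower()
    return sorted(term for term in dictionary
                  if re.search(rf"\b{re.escape(term)}\b", lowered))


def extract_event(text):
    """Apply the dictionary lookups and the date rule to one news sentence."""
    return {
        "disease": lookup(text, DISEASES),
        "location": lookup(text, LOCATIONS),
        "date": DATE_RULE.findall(text),
    }


# Made-up example sentence, used only to exercise the rules.
report = "Health officials in Guinea confirmed new Ebola cases on 23 March 2014."
print(extract_event(report))
```

Rules of this kind are transparent and need no training data, but they only recognize the diseases, places and date formats that were anticipated when the rules were written.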
The EpiCore global surveillance project was established in 2013 by the International Society for Infectious Diseases, the Skoll Global Threats Fund, HealthMap, the Program for Monitoring Emerging Diseases (ProMED-mail) and the Training Programs in Epidemiology and Public Health Interventions Network (TEPHINET) [43], [64]. The EpiCore system is an online platform on which health experts verify informal health alert reports related to potential disease outbreaks. Moreover, a web-based diagnostic system with epidemic alerts was proposed by Okokpujie et al. [48]. It is able to prescribe medications based on symptoms and is used for issuing alerts about outbreaks of epidemic diseases. This system relies on data provided by its users. Hyper Text Mark-up Language, Cascading Style Sheets, JavaScript, Ajax, PHP and MySQL were utilized for developing this system. A medical diagnostic engine called Infermedica was used for analyzing the collected data.
Arsevska et al. [50] developed the Platform for Automated Extraction of Disease Information from the web (PADI-web) to monitor infectious animal diseases. It uses data collected from Google News to extract epidemic disease-related information, such as number of victims, dates and locations. The information extraction process relies on rule-based data mining techniques. For evaluating the performance of PADI-web, 352 news reports were used, and the system achieved F-scores of 95% for diseases, 85% for numbers of cases, 83% for dates and 80% for locations.
Osaghae et al. [65] proposed a web-based grassroots epidemic alert system. However, this is a passive surveillance system, as defined by WHO, because it relies on official data collected from official places, such as health centers, hospitals and registered laboratories. Therefore, it is not covered in this review.
Several existing systems have been developed for specific disease outbreaks, such as influenza epidemics, but these are limited to a specific data source, e.g. Online Social Networks (OSN), such as Twitter. For example, Elhadad et al. [66] investigated social media in order to extract information about food-borne disease outbreaks by monitoring restaurants in New York City. A prototype was developed based on supervised machine learning to detect review comments or discussions about a food poisoning incident, or about the people affected by the incident. The system reportedly yielded high results; however, no specific numbers were reported. Furthermore, the system works only on data collected from the Yelp website. Additionally, Talvis et al. [47] proposed the flutrack system (http://flutrack.org) in 2014 for monitoring the spread of influenza epidemics. Every 20 minutes, the system gathers tweets written in English using the Twitter API to detect potential flu outbreaks. In other words, the aim of this system is to track and visualize influenza epidemics in real time. A list of search tags, namely influenza, flu, chills, headache, sore throat, runny nose, sneezing, fever, and dry cough, was used to extract and track flu-related tweets. The system was evaluated and achieved an accuracy of 92%.
Related systems proposed for epidemics of seasonal influenza include the Google Flu Trends system [19], the MappyHealth application [67], Tracking Flu Infections on Twitter [68], detecting influenza epidemics by analyzing Twitter messages [69], and HealthTweets.org: a Platform for Public Health Surveillance Using Twitter [45]; the ARGO system for monitoring dengue fever epidemics [49] is also related. For further reading on these types of systems, a systematic review of the literature on proposed systems published between 2004 and 2015 that detect and track a pandemic using online social networks was published in 2016 [70]. In addition, the most recent review was presented by Pollett et al. [71] in 2017; it evaluated the performance of internet-based biosurveillance for diseases caused by bacteria, parasites and viruses. Furthermore, other important disease outbreak surveillance systems are listed below.
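The tag-based filtering described for flutrack earlier in this section can be sketched as follows. The tag list is the one quoted above; everything else (the function names, the sample tweets, and the absence of any Twitter API call or 20-minute scheduling) is an assumption made for illustration and does not reproduce the flutrack implementation.

```python
import re

# Search tags as listed in the text for flutrack.
FLU_TAGS = ["influenza", "flu", "chills", "headache", "sore throat",
            "runny nose", "sneezing", "fever", "dry cough"]

# One whole-word, case-insensitive pattern covering all tags.
TAG_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(tag) for tag in FLU_TAGS) + r")\b",
    re.IGNORECASE,
)


def matched_tags(tweet):
    """Return the set of flu-related tags mentioned in one tweet."""
    return {match.lower() for match in TAG_PATTERN.findall(tweet)}


def filter_flu_tweets(tweets):
    """Keep only tweets that mention at least one tag, with the tags found."""
    results = []
    for tweet in tweets:
        tags = matched_tags(tweet)
        if tags:
            results.append((tweet, tags))
    return results


# Made-up tweets standing in for one polling batch.
batch = [
    "Stuck in bed with a fever and sore throat",
    "Great game last night!",
    "Flu season is here",
]
for tweet, tags in filter_flu_tweets(batch):
    print(sorted(tags), "<-", tweet)
```

Keyword matching of this kind is cheap enough to run on every polling cycle, but it cannot tell a sick user from someone merely discussing the flu, which is why systems in this family are usually evaluated against labeled data.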

1) Proteus-BIO
Grishman et al. [30], [31] developed the Proteus-BIO system for creating and automatically updating a database with information on infectious disease outbreaks. The Proteus-BIO system consists of five phases:

2) BioCaster
BioCaster is an automated web service that monitors global online media for detecting infectious disease outbreaks [34], [72]. The system is able to process 1700 Really Simple Syndication (RSS) feeds from different sources: the World Health Organization (WHO) outbreak reports, ProMED-mail, Google News, and the European Media Monitor. The system is limited to seven languages: English, Japanese, Chinese, Spanish, Thai, Vietnamese and French; it comprises four components: event recognition, topic classification, disease/location detection and named entity recognition. In addition, the system provides visualization to the user by plotting the extracted information on a Google map. BioCaster has been evaluated on a gold standard corpus of annotated news articles; for all named entity classes the system achieved an F-measure of 76.97. With regard to topic classification performance, the system achieved 0.89 precision, 0.97 recall, and an F-measure of 0.93. However, BioCaster depends on an ontology, which is a limitation of the system, as it is unable to identify new diseases or locations.

3) Automatic online news monitoring and classification for syndromic surveillance
Zhang et al. [41] developed an automatic online news monitoring and classification system for syndromic surveillance of infectious disease. The system consists of three components:

VI. RESULTS AND DISCUSSION
In general, the aim of this review was to examine existing event-based surveillance systems, focusing on finding systems developed for the Arabic language. It found that 5 systems support Arabic in different ways but were originally created for the English language. These systems, such as MedISys and PULS, GPHIN and HealthMap, use translation engines to translate from non-English languages into English. However, a small number of systems have been developed for specific languages other than English, such as Swedish and Vietnamese. With regard to the Arabic language, this review shows that no existing surveillance system for monitoring infectious disease outbreaks has yet been created to directly process Arabic texts. Relying on translation to identify certain entities in texts is not sufficient; indeed, translation may lead to incorrect information being identified. As Agheneza claimed [12], most event-based surveillance systems are based in the USA and Europe, and a few systems are based in Asia. Only one system was found in the Middle East and North Africa (MENA) region; it was developed by Alshowaib [44], but to date it is not available online and it was developed to handle English health news reports.
Arabic is a Semitic language with a very complex morphology, as it is highly inflectional; therefore, dealing with texts written in Arabic is highly complicated. The Arabic alphabet comprises 28 letters (3 vowels and 25 consonants) that are used to form words. For correct pronunciation, diacritical marks are used and are placed around the letters. Arabic can appear in different forms: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic [73]. Ibrahim et al. [74] claimed that the main dialects in the Arabic world are Gulf, Iraqi, Moroccan, Levantine, Yemeni, and Egyptian. Furthermore, the Arabic medical domain faces a number of challenges. For instance, diglossia is common in the health community in Arabic countries [58]. According to Samy et al. [58], English is the primary language for teaching all health courses at universities in Egypt, Iraq, Jordan, and the Arabic Gulf countries, whereas French is used at universities in the North African Arab countries. Moreover, the languages used by health workers for communication, whether verbal or written, are English and French, even for patients' prescriptions and health reports.
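The diacritical marks mentioned above have a practical consequence for text processing: the same word may be written with or without them, so a plain string comparison can fail. A common preprocessing step, which is not attributed here to any of the reviewed systems, is to strip the marks before matching. The sketch below assumes standard Unicode text, in which the Arabic diacritics are encoded as combining characters; the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (including Arabic diacritics) from a string."""
    # unicodedata.combining() returns a non-zero class for combining marks
    # such as fatha (U+064E) or shadda (U+0651) and 0 for base letters.
    return "".join(ch for ch in text if not unicodedata.combining(ch))


vocalized = "مَرَض"   # "disease" written with two fatha marks
plain = "مرض"         # the same word without diacritics

print(vocalized == plain)                    # the raw strings differ
print(strip_diacritics(vocalized) == plain)  # equal after stripping
```

Removing the marks makes dictionary and gazetteer lookups more robust, at the cost of discarding information that can sometimes disambiguate words with the same consonant skeleton.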

VII. CONCLUSION
The aim of this review was to examine existing event-based surveillance systems, focusing on finding systems developed for the Arabic language. The review shows that no existing early warning surveillance system for monitoring outbreaks of infectious diseases has yet been created to directly process Arabic texts, i.e., without using translation tools. This result might be due to difficulties in the Arabic language itself and in the medical domain in particular. In addition, other event-based surveillance systems developed for detecting and tracking disease outbreaks were presented, and the components and performance of these systems were discussed. Most systems developed in the USA and Europe were built to process data written in their own native language and were then enhanced to serve other languages by utilizing a translation engine, which helps monitor the spread of pandemic diseases worldwide.

TABLE I. DIFFERENT DEVELOPED TYPES OF EVENT-BASED SYSTEMS

/medisys.newsbrief.eu/medisys/homeedition/ar/home.html. However, not all of the information provided is in Arabic, and Google Translate is sometimes used to translate some non-Arabic news reports.
[56]e access level is restricted for public health

5) HealthMap
HealthMap is a multilingual, automated, real-time web-based surveillance system. This system is free and publicly available [37]. It can be browsed in 7 languages: English, French, Portuguese, Russian, Chinese, Arabic and Spanish. Like the systems above, HealthMap uses a translation engine to handle non-English articles. HealthMap comprises five components: data gathering (from diverse online sources: newswires, Really Simple Syndication (RSS) feeds, ProMED-mail, and WHO), classification, database, web backend and web frontend [56]. The system is built on Linux, Apache, MySQL and PHP. It also utilizes free services provided by other developers, such as the Google Translate API, Google Maps, the GoogleMap API for PHP and the xajax PHP AJAX library.

TABLE III
Table IV and Table V present the performance of the system.