An Evaluation of Automatic Text Summarization of News Articles: The Case of Three Online Arabic Text Summary Generators

Digital news platforms and online newspapers have multiplied at an unprecedented speed, making it difficult for users to read and follow all news articles on important, relevant topics. Numerous automatic text summarization systems have thus been developed to address the increasing needs of users around the world for summaries that reduce reading and processing time. Various automatic summarization systems have been developed and/or adapted for Arabic. The evaluation of automatic summarization performance is as important as the summarization process itself. Despite the importance of assessing summarization systems to identify potential limitations and improve their performance, very little has been done in this respect on systems in Arabic. Therefore, this study evaluated three text summarizers, AlSummarizer, LAKHASLY, and RESOOMER, using a corpus of 40 news articles. Only articles written in Modern Standard Arabic (MSA) were selected, as this is the formal and working language of Arab newspapers and news networks. Three expert examiners generated manual summaries and examined the linguistic consistency and relevance of the automatic summaries to the original news articles by comparing the automatic summaries to the manual (human) summaries. The scores for the three automatic summarizers were very similar and indicated that their performance was not satisfactory. In particular, the automatic summaries had serious problems with sentence relevance, which has negative implications for the reliability of such systems. The poor performance of Arabic summarizers can mainly be attributed to the unique morphological and syntactic characteristics of Arabic, which differ in many ways from English and other Western languages (the original language/s of automatic summarizers), and are critical in building sentence relevance and coherence in Arabic.
Thus, summarization systems should be trained to identify discourse markers within the texts and use these in the generation of automatic summaries. This will have a positive impact on the quality and reliability of text summarization systems. Arabic summarization systems need to incorporate semantic approaches to improve performance and construct more coherent and meaningful summaries. This study was limited to news articles in MSA. However, the findings of the study and their implications can be extended to other genres, including academic articles.

Keywords: AlSummarizer; Arabic; automatic summarization; discourse markers; extraction; LAKHASLY; news articles; RESOOMER; sentence relevance


I. INTRODUCTION
The recent unprecedented growth of digital news platforms and online newspapers has considerably changed news production and audience reception. Compared to traditional newspapers, digital news networks and online newspapers are extremely fast and easily accessible. Several reports indicate that online newspapers have displaced traditional newspapers, causing them to lose much of their audience [1]. For this reason, almost all daily and weekly newspapers in the West and in the Arab world now run their own websites and publish electronic editions [2][3][4]. Furthermore, news websites have proliferated free of the traditional restrictions. There is no need for licenses, offices, or even employees and correspondents. Anyone can create a news website in the same way as a personal website. The result is a fast and prolific flow of news.
The extensive number and popularity of online newspapers have changed the way people consume newspapers and magazines in many ways. According to Watson [5], by 2020, more than two-thirds of people in the United Kingdom were reading or downloading online news, newspapers, or magazines. In comparison to 2007, the number of online readers had tripled. As of February 2019, the Guardian and Mail Online were the second and third most popular websites in the UK, respectively, as shown in Fig. 1.
91 | P a g e www.ijacsa.thesai.org
In the same study, Thurman [6] asserts that over recent years online newspapers and news websites have been gaining massive popularity and increasing in an unprecedented manner. In the face of these developments, it is impossible for a normal audience to follow all that is written on important and relevant topics. It is a challenging task for individuals to read this huge content in a limited time span, as the news changes daily if not hourly.
In response, numerous automatic summarization systems have been developed over recent years to generate meaningful summaries that can reduce reading and processing times. Researchers have developed summarization tools and methods that automatically summarize the content of news articles in effective ways. It is even argued that summarization has become an integral part of everyday life [7]. This can be seen in the rapid development of numerous applications and websites, including Summarize Bot, Resoomer, SMMRY, Inshorts, and Text Summarization API (Rapid API), that provide summarization services for news articles to millions of users around the world.
Despite the availability of text summarizers in different languages including Chinese, English, French, and Spanish that provide good summarization services for millions of global users, automatic summarization in Arabic is still very limited. This can be attributed to the unique linguistic system of Arabic where multilingual text summarizers cannot be used with Arabic texts. It is also true that the morphological and syntactic properties of Arabic still pose serious challenges for different Natural Language Processing (NLP) applications, including information retrieval, localization, machine translation, and automatic text summarization [8,9].
Another reason for this limitation is the lack of evaluation studies of automatic Arabic text summarization. Evaluation is an integral part of the automatic summarization process [10][11][12][13][14][15][16]. According to Al Qassem, et al. [17], the evaluation process is one of the main challenges that adversely affect the availability and reliability of automatic Arabic text summarization systems. This can be attributed to the lack of gold standard summaries for Arabic. They add that automatic evaluation of Arabic summarization is further complicated by the lack of Arabic benchmark corpora, lexica, and machine-readable dictionaries.
To address this limitation, this study evaluates three text summarizers: AlSummarizer, LAKHASLY, and RESOOMER. A corpus of 40 news articles was built. The articles were randomly selected from the most popular digital news networks and online newspapers in the Arab world. Care was taken, however, to ensure that the selected news articles cover different themes and subjects, including politics, business, sports, and entertainment. Only articles written in Modern Standard Arabic (MSA) were selected, because MSA is still the formal and working language of Arab newspapers and news networks.
The remainder of this article is organized as follows. Section 2 is a brief survey of automatic summarization literature in general and automatic Arabic text summarization systems in particular. Section 3 defines the research methods and procedures. In this part, the selected summarizers and articles are defined. Procedures of carrying out the study are also defined and established. Section 4 reports the results. It evaluates the performance of the selected summarizers regarding the news articles. Section 5 concludes the study.

II. PREVIOUS WORK
Previously, text summarization was carried out using non-computational, philological methods, where experts and professionals produced summaries based on their own evaluation of the most important concepts in the texts under investigation. With the development of digital technologies, computational approaches have been integrated into text summarization for generating automatic summaries of different text genres. Recent years have witnessed an increasing rate of development of automatic text summarizers. These have been essentially developed to address the increasing needs of users all over the world. Gambhir and Gupta [18] assert that it has now become impossible for traditional or non-computational classification methods to deal with the large amounts of data available today. They indicate that conventional or non-computational summarization methods are no longer effective or reliable in dealing with the sheer volume of digital texts and archives available on the internet. According to Cheng and Lapata [19], the need to access and digest large amounts of textual data has provided strong impetus to develop automatic summarization systems, aiming to create shorter versions of one or more documents, whilst preserving their information content. Much effort in automatic summarization has been devoted to sentence extraction, where a summary is created by identifying and subsequently concatenating the most salient text units in a document.
Soni, et al. [20] agree that, owing to the massive rate at which data on the Internet is growing, automatic text summarization tools have a powerful effect on today's world. It is very difficult for a person to take in and describe all of this material, and summarizing it manually is equally difficult; hence, automation is required. Automatic text summarization can be accomplished using artificial intelligence methods. Nenkova and McKeown [21] add that objectivity and consistency have always been main considerations in automatic summarization. The argument is that automatic summarization is imperative for addressing the inconsistencies and lack of objectivity that were associated with conventional or non-computational methods of text summarization.
The first attempt at computer-based text summarization is attributed to Hans Peter Luhn in 1958 [22][23][24][25]. In his article 'The Automatic Creation of Literature Abstracts', Luhn [26] proposed an algorithm to facilitate quick and accurate identification of the topic of published papers in a way that saves prospective readers time and effort in finding useful information in a given article or report. The underlying principle of Luhn's approach was that the salient points of an author's argument can be identified through the statistical analysis of the most frequent words and phrases occurring in texts. The hypothesis was that authors generally use important words more frequently throughout a paper, so word frequency can serve as a predictor: the sentences in which the keywords recur most often are selected and extracted to generate an automatic summary [27][28][29].
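The frequency hypothesis described above can be illustrated with a minimal sketch. This is a simplified frequency-based scorer in the spirit of that hypothesis, not a reproduction of Luhn's original algorithm; the stop-word list, the function names, and the naive sentence splitter are assumptions introduced here for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on"}

def split_sentences(text):
    # Naive split after sentence-final punctuation, including the Arabic question mark.
    return [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercased word tokens with stop words removed; the pattern also matches Arabic letters.
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Document-wide word frequencies act as the importance signal.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average frequency, so that long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

Averaging rather than summing the frequencies is one possible design choice; a production system would also need proper Arabic tokenization and stemming, which this sketch omits.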
Despite the development of different approaches to automatic text summarization, extractive methods remain the most popular summarization methods. In such methods, automatic summarizers are trained to identify the most important phrases and sentences in the texts under consideration, usually using statistical methods. To generate an automatic summary, the summarizer then assembles the selected phrases and sentences. In this case, therefore, every line and word of the summary actually belongs to the original summarized text [30].
In Arabic, over recent years, a very limited number of automatic text summarization systems have been developed, compared to other languages including English, Spanish, and Chinese [16,17,31]. Al-Saleh and Menai [10] comment that despite the long history of text summarization, studies of the Arabic language in this area have only recently emerged, and they have been negatively influenced by the lack of Arabic gold standard summaries.
The literature indicates that automatic text summarization systems have been largely based on extractive methods. These extractive summarization systems are mainly based on numerical and statistical measures [32][33][34][35][36]. The main hypothesis in these approaches is that the machine can be trained to identify the most important sentences and phrases for building the summaries. This is usually carried out using different statistical measures, including Principal Components Analysis (PCA) and Term Frequency-Inverse Document Frequency (TF-IDF). To simplify, these weighting methods are used as indicators for retaining only the important information within texts and discarding information of secondary or minor importance. Successful implementation of these mechanisms is thus a critical factor for the success of summarization systems [37].
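As an illustration of how such a weighting method can rank sentences, the following minimal sketch scores each sentence by the TF-IDF weights of its terms, treating every sentence as a separate document. The function names and the choice of an unsmoothed logarithmic IDF are assumptions made for this sketch; they are not taken from any of the systems evaluated in this study.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    # Lowercased word tokens; the pattern also matches Arabic letters.
    return re.findall(r"\w+", sentence.lower())

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of TF-IDF weights of its distinct terms."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Terms found in every sentence get idf = log(1) = 0 and contribute nothing.
        scores.append(sum((tf[w] / len(tokens)) * math.log(n / df[w]) for w in tf))
    return scores
```

Under this weighting, a sentence made up of terms that appear everywhere in the text scores close to zero, while a sentence carrying terms found nowhere else scores highest.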
Weighting methods are usually combined with other methods that support the identification of the most important sentences, clauses, and phrases in the texts. One popular method is the position or location of sentences. The premise is that the location of a sentence in a document is related to the amount of information it contains. Other techniques involve grouping and summarizing related texts in a process known as multi-document summarization.
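The position heuristic can be sketched as follows. The linear decay and the blending parameter `alpha` are hypothetical choices made for illustration; the literature surveyed here does not prescribe a particular formula.

```python
def position_scores(n_sentences):
    """Linearly decaying weights: 1.0 for the first sentence down to 1/n for the last."""
    return [(n_sentences - i) / n_sentences for i in range(n_sentences)]

def combine(content_scores, alpha=0.7):
    """Blend content scores (rescaled to at most 1) with position weights.

    alpha is a hypothetical tuning parameter: 1.0 ignores position entirely,
    0.0 ranks sentences by position alone.
    """
    n = len(content_scores)
    # Guard against an empty list or all-zero scores before rescaling.
    top = max(content_scores, default=0.0) or 1.0
    pos = position_scores(n)
    return [alpha * (c / top) + (1 - alpha) * p for c, p in zip(content_scores, pos)]
```

A decay of this kind suits news articles in which the opening sentences carry the main facts, but it would need adjusting for genres where key information appears later.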
The argument is that the majority of automatic Arabic summarization systems are based on statistical methods. One limitation with these techniques is that automatic summaries are based only on those sentences with the highest scores based on statistical measures and techniques. In response, symbolic approaches have been adopted [38][39][40][41]. Unlike the numerical and statistical approaches, symbolic-based extractive summarization systems are based solely on linguistic information and indicators for identifying the most important sentences and expressions within texts. The premise of such approaches is that summaries should be built on a rhetorical structure that considers the rhetorical relations within texts. Work on symbolic-based extractive summarization systems is still very limited.
Over recent years, work on automatic Arabic summarization systems has led to the development of online summarizers that provide summarization services for users in an easy and accessible way. These summarizers build on developments in automatic Arabic summarization research and industry. One major problem, however, is the lack of evaluation studies that can determine the readability and reliability of these summarizers. This study addresses this gap in the literature through an evaluation of three Arabic summarizers, namely AlSummarizer, LAKHASLY, and RESOOMER.

III. METHODS AND RESULTS
This study is based on a corpus of 40 newspaper articles. The articles were selected from nine newspapers published in different Arab countries, as shown in Table 1.
The selected articles and opinions represent different topics including politics, business, and sports, as shown in Table 2.

TABLE I. THE SELECTED NEWSPAPERS

Newspaper Title | Country | Number of articles
Al-Ahram | Egypt | 8
[title missing] | Morocco | 4
Al-Jazeera | Saudi Arabia | 5
Asharq Al-Awsat | Saudi Arabia | 5

For convenience, the selected articles were coded 01-40, as shown in Table 3. The full information pertaining to the selected articles, including transliteration and English translations of the headlines, is given in Appendix No. 1. The selected articles were summarized using three Arabic summarizers: AlSummarizer, LAKHASLY, and RESOOMER. These are currently the most popular Arabic summarizers. All three summarizers are based on extractive summarization methods. AlSummarizer is multilingual software that provides summarization solutions in different languages including Arabic, Dutch, English, Farsi, and Turkish. LAKHASLY is an online summarization tool. It provides automatic summaries for Arabic and English texts. It is widely used all over the world by users interested in Arabic summaries. Finally, RESOOMER is an online summarizer that provides summarization services in different languages including Arabic, English, French, German, Italian, Polish, and Spanish.
For evaluating the performance of the selected summarizers, manual (human) evaluation methods were used. Three examiners were selected to generate manual summaries and to examine the linguistic consistency and relevance of the automatic summaries to the original news articles.

IV. ANALYSIS AND DISCUSSIONS
To evaluate the performance of the three selected Arabic summarizers, the automatic summaries were compared to the manual (human) summaries produced by the experts who participated in the study. Results are shown in Table 4, Table 5, and Table 6.
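The comparison in this study was carried out by human examiners. For readers who wish to approximate such a comparison automatically, the following minimal sketch computes word-overlap precision, recall, and F1 between an automatic summary and a manual one. It is an assumption-laden illustration (the function name and the choice of clipped unigram overlap are introduced here) and is not the scoring procedure used by the examiners.

```python
import re
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram counts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Surface overlap of this kind says nothing about whether the selected sentences are connected to one another, which is why the human judgments of relevance and coherence reported below remain necessary.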
Comparisons to the manual summaries indicate that the scores of the three summarizers are very similar. The scores also indicate that the performance of the three summarizers is not satisfactory, which undermines their reliability. Another problem with the three summarizers is the lack of sentence relevance and coherence. This can be seen in the example taken from LAKHASLY, shown in Box 1. Sentences are not well connected to one another, which harms readability and makes it difficult to understand what the original text is about. Omar [42] explains that if sentences and clauses in the automatic summaries are not connected, the overall argumentative structure of the text is not supported and the thematic significance of the original texts is lost. He adds that in many cases automatic summaries generated in this fashion are misleading for readers and users. Alami, et al. [43] agree that one main reason for the low performance of text summarizers in Arabic is that sentences in the extracted or generated summaries are not relevant; therefore, the main point of the original texts is not clear. The lack of sentence relevance thus impairs the user's ability to grasp the meaning of the original texts. In such a case, automatic summaries do not provide users with concise and relevant information that helps them determine and assess the importance of texts without having to read them in full.
The lack of sentence relevance and coherence in automatic summarization in Arabic can be attributed to the multilingual nature of automatic text summarizers. Many of the summarizers are offered in different languages without considering language-specificity. It is widely agreed that the unique linguistic properties of Arabic are associated with low performance in different natural language processing (NLP) applications, including automatic summarization [10,17][44][45][46]. To put it into context, the morpho-syntactic system of Arabic is different in many ways from English and other Western languages (the original language/s of automatic summarizers). The morphological and syntactic properties of Arabic are indispensable in building sentence relevance and coherence in Arabic. According to Al Qassem, et al. [17], the main challenge in Arabic text summarization is in the complexity of the Arabic language itself: 1) the meaning of a text is highly dependent on the context; 2) there are more inherent variations within Arabic than any other language; 3) the diacritics are usually absent in the texts of news articles and any online content.
To improve the sentence relevance in the automatic summarization of news articles in Arabic, this study proposes that summarization systems should be trained to identify the discourse markers within the texts and furthermore to use these discourse markers in the generation of automatic summaries. The hypothesis is that discourse markers can be gainfully used to create cohesive texts with sentences that are linked together and relevant to one another. In other words, discourse markers can be used to build cohesive and coherent texts through interrelated sentences which will have positive impacts on the quality and reliability of text summarization systems. Arabic summarization systems need to integrate semantic-based methods for improving the quality of summarization performance and generating more coherent and meaningful summaries.
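One simple way to act on this proposal can be sketched as follows. The sketch is rule-based rather than trained, and the short marker list, the function names, and the repair strategy (pulling in the preceding sentence whenever a selected sentence opens with a marker) are all assumptions introduced here for illustration.

```python
# Hypothetical, deliberately small list of MSA discourse markers:
# lakin (but), lidhalik (therefore), wa-ma'a dhalik (nevertheless),
# bil-tali (consequently), 'ilawatan 'ala dhalik (moreover).
DISCOURSE_MARKERS = ("لكن", "لذلك", "ومع ذلك", "بالتالي", "علاوة على ذلك")

def starts_with_marker(sentence):
    # Whole-word match only: prefixed or suffixed forms (e.g. with an attached
    # conjunction or pronoun) are missed, which a real system would have to
    # handle through morphological analysis.
    s = sentence.strip()
    return any(s == m or s.startswith(m + " ") for m in DISCOURSE_MARKERS)

def repair_selection(sentences, selected):
    """Extend a set of selected sentence indices so that no selected sentence
    opening with a discourse marker is left without its preceding sentence."""
    keep = set(selected)
    for i in sorted(selected):
        j = i
        # Walk backwards while the current sentence depends on the previous one.
        while j > 0 and starts_with_marker(sentences[j]):
            j -= 1
            keep.add(j)
    return sorted(keep)
```

For example, if an extractive scorer selects only a sentence beginning with لذلك ("therefore"), `repair_selection` also returns the index of the sentence before it, so that the cause the marker refers to is not lost from the summary.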

V. CONCLUSION
Digital news platforms and online newspapers have multiplied today at an unprecedented speed, making it difficult for users to read and follow all news articles on important relevant topics. Numerous automatic text summarization systems have thus been developed to address the increasing needs of users around the world to access summaries that reduce reading and processing time. In Arabic, different automatic summarization systems have been developed and/or adapted in order to address the increasing need for automatic Arabic summaries. Evaluation of automatic summarization performance is as important as automatic summarization itself.
Despite the importance of evaluating automatic summarization systems in order to identify their limitations and improve summarization performance, very little has been done on the evaluation of automatic text summarization systems in Arabic. Therefore, an evaluation of three text summarizers, AlSummarizer, LAKHASLY, and RESOOMER, was carried out. A corpus of 40 news articles was built. Only articles written in Modern Standard Arabic (MSA) were selected, because MSA is still the formal and working language of Arab newspapers and news networks. Manual (human) evaluation methods were used to assess the performance of the selected summarizers. Three examiners were selected to generate manual summaries and to examine the linguistic consistency and relevance of the automatic summaries to the original news articles. The automatic summaries were compared to the manual (human) summaries produced by the experts who participated in the study. Results indicated that the scores of the three summarizers were very similar and that their performance was not satisfactory. Furthermore, the automatic summaries had a serious problem with sentence relevance, which adversely affects the reliability of such systems.
It can be concluded that the poor performance of Arabic summarizers can be mainly attributed to the unique morphological and syntactic characteristics of Arabic. The morphological-syntactic system of Arabic is different in many ways from English and other Western languages (the original language/s of automatic summarizers). The morphological and syntactic properties of Arabic are indispensable in building sentence relevance and coherence in Arabic. Thus, this study proposes that summarization systems should be trained to identify the discourse markers within the texts and furthermore to use these discourse markers in the generation of automatic summaries. This will have positive impacts on the quality and reliability of text summarization systems. Arabic summarization systems need to incorporate semantic approaches for improving the quality of summarization performance and building more coherent and meaningful summaries.
This study was limited to news articles in MSA. However, the findings of the study and their implications can be extended to other genres, including academic articles. Further research is recommended to address the performance of automatic Arabic text summarization on social media language and colloquial Arabic dialects.