News Aggregator and Efficient Summarization System

A news aggregator is online software that collects news stories and events from various sources around the world in one place. It plays an important role in saving time, since news that would otherwise have to be explored across several websites is gathered in a single location. Summarizing the aggregated content saves the reader further time. The proposed system uses the TextRank algorithm, which showed promising results for summarization. This paper presents the main goal of the project: developing a news aggregator that collects the articles relevant to an input keyword or key phrase and, after enhancing the text, summarizes them to give the reader an understandable and efficient summary.


I. INTRODUCTION
In the last few years, the rate at which news is published has grown enormously [1]. People live in a time full of information, data, and news [2], so news holds an important position within the community: people read it daily to keep up with the most recent information. This information may concern technology, sports, weather, food, celebrities, or many other fields [3].
With the development of the Internet and the large number of websites that provide the same data and information, getting to this information has become simpler [4]. As a result, users frequently find it difficult to decide which of these websites provides the required information in the most valuable and effective way [2]. The conventional business model of daily newspapers has been threatened by the Internet, which has reduced their advertising income and introduced new online media such as web-only news, blogs, and news aggregators [5]. The online social system is a valuable instrument for collecting, aggregating, and consuming specific or general content for different purposes in a certain period of time. Daily newspapers are in competition with present-day online media, as shown in Fig. 1. Among online media sites, news aggregators appear to be the more significant [5] [6].
According to an Outsell report (2009) [5], 57% of news media users go to digital sites, and they are also more likely to turn to an aggregator (31%) than to a newspaper site (8%) or other news sites (18%). A news aggregator combines news from various sites, newspapers, and agencies, and usually summarizes it in a format and design that is convenient for the reader [2]. News aggregators are frequently included in lists such as "Websites Each Engineer Ought to Visit" [7].
Despite the benefits of having so much information available through the Internet, it creates another problem: information overload. The user is confronted with too much information, much of which may not match his or her interests [8]. The proposed system addresses this problem. A news aggregator acts as a gateway that integrates different news websites and organizes their feeds by subject [9]. It is a site that takes data and news from numerous sources and displays them in a single place [10], which reduces readers' search and reading time by gathering content based on viewing history [11]. News aggregation is one of the best ways to stay on top of the news and topics one wants to follow, and it offers convenience and time-saving features [12].
A major requirement of the News Aggregator system is summarization: articles from various sources that discuss the same event are summarized, and the content of the event is written on one summarized page that covers all perspectives [13]. Summarization creates a shorter form of a text while preserving its meaning and the key substance of the original content [14] [15]. It therefore has many benefits, such as reducing the user's reading time and presenting only the useful news. Summarization methods can be categorized into extractive and abstractive summarization [16]. Extractive summarization extracts a few parts of a text, such as phrases and sentences, and gathers them together to form a summary; identifying the right sentences is therefore of the utmost importance in an extractive method [17]. Abstractive summarization, in contrast, uses advanced NLP methods to generate a completely new summary, parts of which may not appear in the original text at all [17]. In this paper, we follow the extractive summarization technique, which gives better output and selects the right sentences for the summary.
The rest of the paper is structured as follows: Section II reviews related work on news aggregators based on summarization. The proposed system is presented in Section III. Section IV presents the experimental results on summarization and a performance analysis of the system. Section V discusses the results. Finally, Section VI concludes the paper.

II. RELATED WORK
In this section, we discuss other news aggregator systems that are based on summarization. In [18], the authors focused on gathering news using matrix-based analysis (MNA) in five main steps. The first and second steps are data gathering and article extraction, in which the articles are extracted from the websites and saved in the database. The third step is grouping, in which the articles are categorized. The last two steps are summarization and visualization, which present the important articles to the user. Before the grouping step, they added the matrix-based analysis, in which the rows of the matrix are entities and the columns are the states about those entities. When the analysis starts, the user defines what he or she is looking for, and MNA provides default values for this purpose. The matrix is then initialized by spanning it over the two chosen dimensions and looking up the documents that belong to each cell. The summarization phase consists of the following steps: topic summary, cell summary, and a summary of both, using TF-IDF for each cell in the matrix.
According to [11], the authors aimed to accumulate content from diverse websites, such as articles and news headlines from blogs and websites. Their premise is that Rich Site Summary (RSS) provides short, summarized data, which makes it preferable for a news aggregator, and that RSS feeds remain a successful solution for indexing articles. Because this reduces the time needed to visit individual websites, subscribed users can quickly use RSS feeds without wasting time going to numerous websites. The first step in the application is handling the HTTP requests that the web server receives from clients. The system then uses Python to download the RSS feeds and extract articles from them according to the input. At regular intervals, the web server issues requests for the subscribed users, and any updates that are found are downloaded and stored.
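The keyword-based extraction from an RSS feed described above can be illustrated with a minimal Python sketch. It is not the implementation of [11]: the sample feed, the function name `find_articles`, and the choice to match the keyword against the title and description are assumptions made for illustration, and the feed is parsed from a string rather than downloaded.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration; a real system would
# download the feed from a news site instead of embedding it.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Election results announced</title>
      <link>https://example.com/election</link>
      <description>Final election results were published today.</description>
    </item>
    <item>
      <title>Local team wins cup</title>
      <link>https://example.com/cup</link>
      <description>The final match ended 2-1.</description>
    </item>
  </channel>
</rss>"""


def find_articles(feed_xml, keyword):
    """Return the feed items whose title or description contains the keyword."""
    root = ET.fromstring(feed_xml)
    needle = keyword.lower()
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        # Case-insensitive substring match on title and description.
        if needle in title.lower() or needle in summary.lower():
            matches.append({
                "title": title,
                "link": item.findtext("link", default=""),
                "summary": summary,
            })
    return matches


if __name__ == "__main__":
    for article in find_articles(SAMPLE_FEED, "election"):
        print(article["title"], article["link"])
```

A simple substring match keeps the sketch short; an aggregator that has to handle key phrases and word variants would need a more careful matching step.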
The author in [19] aimed to use Rich Site Summary (RSS) integrated with HTML, using wrapper programs and a parser to extract information from a specific source and then arrange it according to news categories and personalized web views through a web-based interface. They explain how the content scanner works using HTML and RSS. The first step is wrapping (with an HTML/RSS wrapper), which involves identifying the URL addresses of the news items from the source together with the category of the news; each address is stored in the database for its category pair, together with the corresponding wrapper. The second step of wrapping is getting information from the news items, which is used for retrieving and indexing the articles; for each article, they obtain the first sentence and pass it to the corresponding HTML page.
According to [9], the authors aimed to collect news from multiple sites, newspapers, magazines, and television and merge it all into one summarized website. This improves the quality of the results because the content is brief and summarized. Their work is based on an RSS fetcher that retrieves RSS documents from specific websites at certain times. They also use web crawling (scraping) alongside RSS to get more accurate results. Web scraping is a method used to collect large amounts of information from websites.
From all the above-mentioned research on news aggregators, the quality of the aggregator system remains an open area for improvement.

III. PROPOSED SYSTEM
In this section, the basic structure of the proposed system is described. The system aggregates online news from a cloud service and summarizes its content to reduce the user's reading time.
2) Pre-processing stage: Consists of applying specific steps to the aggregated articles, such as:
a) Lowercasing: Reduces the size of the vocabulary by removing the multiple copies of the same word that differ only in letter case.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 6, 2020
b) Stop-word removal: Removes words that carry little information so that the focus is on the important words.
c) Lemmatization and stemming: Remove inflection and map each word to its original (root) form.
3) Processing stage: Consists of applying the summarization algorithm on the aggregated articles.
In this paper, we use the TextRank algorithm [20], a graph-based ranking algorithm, to summarize the articles. Fig. 3 shows the steps of the TextRank algorithm, which are as follows:
a) A set of articles is collected and combined into one original article.
b) The original text is tokenized into sentences, as shown in Fig. 3.
c) The third step is vectorization. Each word is represented by a vector based on the co-occurrence of the word with the other words in a single sentence, using the Global Vectors algorithm (GloVe) [21]. Each sentence is then represented by a vector calculated as the mean of the word vectors in the sentence.
d) A similarity matrix is obtained for all sentences using cosine similarity [22]. Similarity here refers to the content that sentences have in common. For two sentence vectors $A = (a_1, \dots, a_n)$ and $B = (b_1, \dots, b_n)$,
$$\mathrm{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|},$$
where $A \cdot B = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$ is the dot product of the two vectors.
e) The similarity matrix obtained in the previous step is converted into a graph in which the vertices are the sentences and the edges are determined by the similarity relation between them. The edges are used to obtain the vertex weights: the importance of a sentence is based on its edges and is represented as a score for each vertex, as shown in Fig. 4, using the PageRank algorithm [23]. Let $G = (V, E)$ be a directed graph, where $V$ is the set of vertices and $E$ is the set of edges. The score of vertex $V_i$ is defined as follows:
$$S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{1}{|\mathrm{Out}(V_j)|} \, S(V_j),$$
where $\mathrm{In}(V_i)$ is the set of vertices that point to $V_i$, $\mathrm{Out}(V_j)$ is the set of vertices that $V_j$ points to, and $d$ is a damping factor between 0 and 1, usually set to 0.85, which accounts for the probability of jumping from a given vertex to another random vertex in the graph (a jump occurs with probability $1 - d$).
4) Output stage: A summary that contains the key ideas of the topic is generated.
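The pre-processing stage and steps b) to e) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: sentences are vectorized with simple word counts instead of GloVe embeddings, the stop-word list is a tiny hand-picked set, lemmatization and stemming are omitted, and the ranking uses the similarity values as edge weights (the weighted variant of the ranking formula). All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "to", "and", "or", "it", "this", "that", "for", "with", "as", "by"}


def split_sentences(text):
    """Step b: naive sentence tokenizer that splits after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def preprocess(sentence):
    """Pre-processing: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cosine(u, v):
    """Step d: cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise similarities; the diagonal is zero (no self-loops)."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def pagerank(sim, d=0.85, iterations=50):
    """Step e: weighted PageRank scores computed by repeated updates."""
    n = len(sim)
    scores = [1.0] * n
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each vertex
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - d) + d * rank)
        scores = new_scores
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    vectors = [Counter(preprocess(s)) for s in sentences]  # step c (word counts)
    scores = pagerank(similarity_matrix(vectors))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    article = ("The city council approved the new budget. "
               "The budget funds new roads and schools. "
               "Council members debated the budget for hours. "
               "A cat slept on the sofa.")
    print(summarize(article, k=2))
```

Word counts were chosen here only to keep the sketch free of external data; replacing `Counter(preprocess(s))` with the mean of pretrained word vectors would bring it closer to step c) as described.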
The proposed system lets the user explore news from more than one source. Its main feature is a readable summary of all the aggregated content on the same topic. In the next section, an experiment compares two summarization algorithms to decide which of them fulfills the requirements of a good and accurate summary.

IV. EXPERIMENTAL RESULTS
To validate the effectiveness of the proposed News Aggregator system, a set of experiments was conducted with different keywords and key phrases. The efficiency of the TextRank algorithm was also tested against a commonly used algorithm, Word Frequency [24]. Fig. 6 shows the original article to be summarized, and Figs. 7 and 8 show the output summaries after applying the summarization algorithms. The Word Frequency algorithm summarizes the input text in the following steps:
• A word frequency table is built first, and then the text is tokenized into sentences.
• Each sentence is then scored. The score is computed by dividing the frequencies of the non-stop words in the sentence by the total number of words in the sentence.
• The average score of the sentences is computed, and each sentence is compared with this average. If the score of the sentence is larger, the sentence becomes part of the summary of the input article.
In this paper, we used a summary evaluation tool named ROUGE to compare the TextRank and Word Frequency summaries, but the tool turned out to have a drawback: it depends on the availability of an expert who knows the rules of summarization. ROUGE works by comparing the expert's summary with the system's summary, and such an expert may not always be available for consultation. We therefore used another evaluation criterion: a survey within our social network in which respondents rated each of the two summaries, produced by the two different algorithms, on a scale from 1 to 5, where 1 refers to a very bad summary and 5 to an excellent one.
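The Word Frequency baseline described in the bullets above can be sketched as follows. This is a minimal illustration, assuming that a sentence score is the sum of the table frequencies of its non-stop words divided by its total word count; the stop-word list and the function name `word_frequency_summary` are hypothetical and not taken from [24].

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "to", "and", "or", "it", "this", "that", "for", "with", "as", "by"}


def word_frequency_summary(text):
    """Keep the sentences whose score is above the average sentence score."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []

    # Frequency table over all non-stop words of the text.
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if w not in STOP_WORDS)

    # Score: summed frequency of non-stop words / total words in the sentence.
    scores = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        total = sum(freq[t] for t in tokens if t not in STOP_WORDS)
        scores.append(total / len(tokens) if tokens else 0.0)

    average = sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score > average]


if __name__ == "__main__":
    article = ("The city council approved the new budget. "
               "The budget funds new roads and schools. "
               "Council members debated the budget for hours. "
               "A cat slept on the sofa.")
    for sentence in word_frequency_summary(article):
        print(sentence)
```

Because the selection threshold is the average score, the length of the summary is not controlled directly; it depends on how the sentence scores are distributed.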
The TextRank summary was rated 4 out of 5 by 60% of the people who read it and 5 out of 5 by 30%, as shown in Fig. 10. The Word Frequency summary received poor ratings from the readers: 60% rated it 2 out of 5, as shown in Fig. 9. These results indicate that TextRank produces more efficient summaries.
To further confirm this, the output summary of the TextRank algorithm was reviewed by an expert in addition to the readers' ratings. According to the expert's feedback, the output summary fulfills all the requirements of a good summary:
• Its length is about 10% of the original.
• It consists of short paragraphs that contain the key ideas and are to the point.
• It can be read and clearly understood without referring to the original article.
• It uses appropriate language, like that used in the original article.

V. DISCUSSION
The proposed system produced a very good and understandable summary, as confirmed by an expert in this field. The TextRank summary was more acceptable in the survey results from the readers' perspective, which indicates that users are more comfortable reading the TextRank summary. Also, the system works fully online, so the aggregated articles are retrieved on the spot and are chosen from trusted, predetermined sources.

VI. CONCLUSION
After testing two summarization algorithms, TextRank was chosen over Word Frequency as the approach for the summarization system, because it gives the reader a more efficient summary. The system generates an output summary from online sources that contains the key ideas of the article topic and can be understood without referring to the original articles.

ACKNOWLEDGMENT
We would like to express our special thanks to our advisor, Dr. Walaa Hassan, for her support, patience, and motivation throughout our project; her guidance helped us in writing this paper. We would also like to thank Dr. Sama Dawood (Associate Professor at the Faculty of Alsun, Misr International University) for her help in the summarization phase in determining which summarization algorithm produces the more accurate output.