Analysis, Visualization and Classification of Summarized News Articles: A Novel Approach

Advances in technology mean that an enormous amount of data is generated every day, and users are easily overloaded by its volume. Effective methods are therefore needed to help users comprehend large amounts of data. This research work proposes methods to extract and represent such data. Summarization gives a brief overview of a text, and sentiment analysis computationally determines the emotions expressed in it. A combination of text summarization and sentiment analysis is proposed for BBC news articles. A pronoun replacement based text summarization method is developed, and the VADER sentiment analyzer is used to determine sentiment information. 3-D visualization schemes are provided to represent the sentiment information. Sentiment analysis and classification are performed on the original BBC news articles as well as on the summarized articles using Logistic Regression, Random Forest and AdaBoost classifiers. The highest classification rate observed on the original news articles is 84.93%; with summarization ratios of 25%, 50% and 75%, the highest classification rates are 78.73%, 83.06% and 83.23%, respectively.

Keywords—Summarization; sentiment analysis; 3-D visualization; sentiment classification


I. INTRODUCTION
A huge amount of data is generated every day in the form of social media content, blogs, web sites, Wikipedia, online newspapers, etc. The widespread use of social media such as Facebook, Twitter, Yahoo!, etc. has enormously increased the amount of data produced. Wikipedia alone contains five million articles, and thousands of new articles are generated every day. Various web sites publish newspapers online on a daily basis. One of the main challenges of such huge data is that the user gets overloaded and requires an effective way to absorb the large volume of data. Effective data extraction and representation techniques are needed to help the user comprehend it. Text summarization is a technique intended to produce a brief overview of the input text, which also reduces the amount of data. Moreover, sentiment analysis is a computational technique that deduces the emotion expressed in the text. Sentiment analysis has been effectively applied in various fields such as product reviews [1], [2], news articles [3], political debate [4], Twitter sentiment data analysis [5], [6], the stock market [7], [8], etc.
The goal of text summarization is to obtain a brief summary of the text [9]. Text summarization can be utilized in different applications, namely searching for documents related to a particular subject and obtaining an overview, gathering headlines from newspaper articles, assimilating emails, obtaining summaries of medical information, and producing briefs of scientific articles [10], [11]. Text summarization involves several steps: topic identification, interpretation and summary generation [12]. A notable work by Benno Stein et al. on topic identification is presented in [13], with a framework for topic identification and its applications. A Wikipedia graph centrality method for topic identification is presented in [14]. During text interpretation, the meaning of the text is obtained. For text interpretation, researchers have focused on various methods such as ontology based interpretation [15] and text interpretation [16]. Summary generation produces an abstract or synopsis of a single document or of multiple documents. J. Alan et al. [11] have presented a text summarization method based on novelty detection at the sentence level. The literature review by Lloret et al. [17] notes that there are two kinds of summarization methods: abstraction and extraction. In the abstraction method, semantic representations are constructed from the text to produce a brief overview [13], [18]. The extractive summarization methods discussed in [19]-[21] choose words, sentences and phrases from the given text to form the summary. Forming a summary based on the frequency of words related to the topic has found suitable application in several areas of text analysis [22], [23]. It is observed that the words occurring most frequently in a given document indicate the subject on which the text is centered. Rafael Ferreira et al. [24] have assessed sentence scoring techniques for text summarization. In their work, it was noted that obtaining the frequency of important words and extracting sentences accordingly is one of the effective methods of preparing a summary. Pronouns, which are often used in text, are placeholders for proper nouns. In the process of filtering and stopword removal, pronouns are also eliminated, which affects the frequency of proper nouns. In this research work, a summarization technique is proposed in which pronouns are first replaced with proper nouns and the frequency of words is then computed, thereby enhancing the frequency information related to proper nouns and generating an improved text summary.
Sentiment analysis has found applications in healthcare [25], [26], tourism [27], fraud detection [28], finance [29], politics [30] and business [31]. Additional areas of application are found in [32]. The sentiment analysis of online news articles is presented in [33]. The prediction of positive and negative sentiment in financial news is carried out in [34]. An opinion mining engine for news articles is presented in [35], which uses knowledge from ConceptNet and SenticNet. The sentiment classification described in [36] uses an informatics and theoretic approach. A. Mudinas et al. have presented a notable work [37] on lexicon and concept-level sentiment analysis. T. H. A. Soliman et al. [38] have carried out mining of online customer reviews utilizing support vector machines, and a similar work on sentiment analysis based on the As-LDA model has been reported in [39]. Sentiment analysis based on machine learning techniques is reported in [40]. Sentence level sentiment analysis has been carried out using cloud machine learning techniques in [41]. Sentiment analysis using different types of lexicon dictionaries is described in [42], [43].
With the motivation of helping users comprehend large volumes of data, this research work performs summarization on news articles, followed by sentiment analysis and representation. An extractive text summarization method is developed based on [21] to produce a brief overview of news articles. The VADER [42] sentiment analyzer is applied to the original and the summarized news articles to deduce the sentiment expressed in the text. Using VADER, sentiment information is collected in the form of negative, neutral, positive and compound scores and counts of sentiment words. The sentiment information is then represented using several visualization schemes in three dimensions, such as column plots, surface plots and scatter plots. These 3-D visualization methods give a clear scheme for portraying the sentiment information. Further, sentiment analysis and classification are carried out on the original and the summarized news articles using Logistic Regression, Random Forest and AdaBoost classifiers. The experiments are carried out on BBC news articles, and classification performance is tested with 10-fold cross validation. Section 2 describes the method of text summarization with pronoun replacement, Section 3 presents sentiment visualization and classification, and Section 4 gives an example of summarization and sentiment analysis. Experimental results are presented in Section 5, followed by the conclusion in Section 6.

II. PRONOUN REPLACEMENT BASED TEXT SUMMARIZATION
Text summarization generates a brief summary of a given text. Before the summary is generated, the text is preprocessed. Preprocessing involves noise elimination, lowercasing, tokenization, and the identification and removal of stopwords such as that, a, the, etc. [44]. During preprocessing, pronouns, which are placeholders for proper nouns, are also eliminated. In this research, a summarization technique is developed in which pronouns are first replaced with proper nouns and extractive summarization is then carried out. Extractive methods [24] generate a summary by looking for keywords, i.e. the most important words, and their frequency in the text. The approach for identifying the important words is to eliminate the stopwords and take the remaining words as important words. As part of stopword elimination, pronouns are also eliminated, so the frequency information they carry about proper nouns is lost. In this research, the summarization method is developed based on [21], as depicted in Fig. 1. For a given input text, the Part-of-Speech (POS) tagger of [45], [46] is employed to recognize the parts of speech in a sentence. The pronouns are recognized and replaced with proper nouns: the proper noun that occurs before the pronoun and closest to it is taken as the replacement. The pronoun-replaced text serves to compute word frequencies, whereas the original input text is used to produce the final text summary. The next step is to eliminate the stopwords from the text and determine the frequency of the remaining words. The weightage of keywords, or important words, is computed as follows. Let $n_k$ be the number of keywords and $n_o$ the number of stopwords in a sentence, so that the sentence has $n_s = n_k + n_o$ words in total. Let $f_k$ be the frequency of the $k$-th keyword and $n_t$ the number of keywords in the entire text. The weightage of the $k$-th keyword is calculated as given in (1), and the sentence weightage is computed as the summation of the weightage of its words, as given in (2). Sentences containing keywords with more weightage therefore have a higher sentence weightage.
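The pronoun replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text has already been POS-tagged with Penn Treebank tags (NNP/NNPS for proper nouns, PRP/PRP$ for pronouns), and the function name `replace_pronouns` is hypothetical.

```python
# Assumed tag set: Penn Treebank (NNP/NNPS = proper noun, PRP/PRP$ = pronoun).
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
PRONOUN_TAGS = {"PRP", "PRP$"}


def replace_pronouns(tagged_tokens):
    """Replace each pronoun with the closest proper noun that precedes it.

    tagged_tokens: list of (word, tag) pairs, e.g. the output of a POS tagger.
    Pronouns with no preceding proper noun are left unchanged.
    """
    last_proper = None
    result = []
    for word, tag in tagged_tokens:
        if tag in PROPER_NOUN_TAGS:
            last_proper = word          # remember the most recent proper noun
            result.append(word)
        elif tag in PRONOUN_TAGS and last_proper is not None:
            result.append(last_proper)  # substitute the pronoun
        else:
            result.append(word)
    return result


# Hand-tagged toy input (hypothetical sentence, not taken from the dataset).
tagged = [("Rooney", "NNP"), ("scored", "VBD"), ("twice", "RB"), (".", "."),
          ("He", "PRP"), ("left", "VBD"), ("early", "RB"), (".", ".")]
replaced = replace_pronouns(tagged)
print(" ".join(replaced))
```

A rule this simple also substitutes pronouns such as "it" that do not refer to the preceding proper noun, so it can inflate some counts. It is only meant to show how the frequency of proper nouns is restored before stopword removal.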
The priority order of the sentences is determined from the sentence weightage and indicates the order in which sentences are extracted to form the summary. The user specifies the summary ratio $S_r$ to decide the required length of the summary. For a text with $m_t$ lines and a given summary ratio $S_r$, the length of the summary $m_s$ is calculated using (3). The text summary is generated by extracting lines in priority order up to the required length $m_s$.
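The scoring and extraction steps of this section can be sketched as below. Equations (1) to (3) are not reproduced in this text, so the sketch assumes one reading consistent with the definitions: keyword weightage $w_k = f_k / n_t$ with $n_t$ counted as total keyword occurrences, sentence weightage as the plain sum of $w_k$ over the keywords in the sentence, and summary length $m_s = \lceil S_r \, m_t / 100 \rceil$ with $S_r$ in percent. The stopword list, the tokenizer and the function names are illustrative only.

```python
import math
import re
from collections import Counter

# Toy stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "that", "is", "in", "of", "to", "was"}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarize(sentences, ratio_percent, scoring_sentences=None):
    """Return the top sentences in priority order (highest weightage first).

    sentences: original sentences, used for the output.
    scoring_sentences: optional pronoun-replaced versions used for scoring.
    """
    if scoring_sentences is None:
        scoring_sentences = sentences
    keywords = [[w for w in tokenize(s) if w not in STOPWORDS]
                for s in scoring_sentences]
    freq = Counter(w for kws in keywords for w in kws)
    n_t = sum(freq.values())                      # assumed meaning of n_t
    if n_t == 0:
        return []
    weight = {w: f / n_t for w, f in freq.items()}                   # assumed (1)
    sentence_weight = [sum(weight[w] for w in kws) for kws in keywords]  # (2)
    m_t = len(sentences)
    m_s = math.ceil(ratio_percent * m_t / 100)                       # assumed (3)
    # Stable sort: sentences with equal weightage keep their text order.
    priority = sorted(range(m_t), key=lambda i: sentence_weight[i], reverse=True)
    return [sentences[i] for i in priority[:m_s]]


sentences = [
    "Arsenal beat Chelsea in the cup final.",
    "The weather was cold.",
    "Arsenal fans praised the Arsenal defence.",
    "Chelsea will play again next week.",
]
for ratio in (25, 50, 75):
    print(ratio, summarize(sentences, ratio))
```

Because the sentence weightage is an unnormalized sum under this reading, longer sentences tend to score higher. Sorting the selected indices before output would restore document order if that is preferred.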

III. SENTIMENT VISUALIZATION AND CLASSIFICATION
VADER [42] is a simple rule based sentiment analyzer. It consists of a list of lexical features and their associated sentiment measures. Several rules based on the grammatical and syntactical usage of the language are used to determine the sentiment of the text. A lexicon is essentially a list of words in which each word is assigned a semantic orientation value, either positive or negative [47]. In the VADER lexicon, features are assigned values in the range -4 to +4, where -4 is extremely negative and +4 is extremely positive. Table I shows a few words from the VADER lexicon.
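To illustrate how a valence lexicon of this kind is used, the sketch below looks up each token in a small dictionary and counts positive, negative and neutral words. The lexicon entries and their values are made up for illustration and are not taken from the VADER lexicon. The sketch also omits the grammatical and syntactical rules that VADER applies on top of the lookup.

```python
# Hypothetical lexicon: values are illustrative, within VADER's -4..+4 range.
TOY_LEXICON = {"great": 3.1, "win": 2.8, "injury": -1.8, "defeat": -2.0}


def lexicon_counts(tokens, lexicon):
    """Count positive, negative and neutral tokens and sum their valences."""
    positive = negative = neutral = 0
    valence_sum = 0.0
    for token in tokens:
        valence = lexicon.get(token.lower(), 0.0)  # unknown words are neutral
        valence_sum += valence
        if valence > 0:
            positive += 1
        elif valence < 0:
            negative += 1
        else:
            neutral += 1
    return {"positive_words": positive, "negative_words": negative,
            "neutral_words": neutral, "valence_sum": valence_sum}


counts = lexicon_counts("A great win despite the injury".split(), TOY_LEXICON)
print(counts)
```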
It is interesting to perform sentiment analysis of news articles, as has been done in several studies such as [33]-[35]. A news article written by an author or journalist provides an opinion on the subject of the article, and sentiment analysis provides a sentiment evaluation of that article. In this research, VADER is utilized to perform the sentiment analysis of the BBC news articles. A schematic diagram of the sentiment analysis is depicted in Fig. 2. The news articles are subjected to preprocessing such as word tokenization and stopword removal. VADER is then applied to compute the sentiment score of each news article. VADER utilizes its lexicon list and computes sentiment information such as compound, neutral, negative and positive scores. It also gives the count of positive, negative and neutral words.
In this research, novel 3-D visualizations of the sentiment information obtained from VADER are presented. Visualization schemes in three dimensions, such as column plots, surface plots and scatter plots, are developed. These 3-D visualizations provide better insight into the sentiment information gathered from news articles.
Furthermore, sentiment classification is performed on summarized news articles as shown in Fig. 3. As a significant amount of data is generated every day, it is becoming important to provide techniques that help users comprehend the data effectively. Text summarization provides a brief overview of the input text and enables the user to focus on a reduced version of it. Applied to news articles, it gives a brief overview of the news that retains the inherent subjective information. News articles usually provide an elaborate discussion of the subject, hence it is appropriate to summarize them to obtain the important points of discussion. The procedure in Fig. 3 is as follows. The news articles are preprocessed by tokenizing words and removing stopwords. The articles are then summarized to generate overviews, and sentiment classification is performed on the summarized versions. Feature vectors of N-gram size are created using a bag of words [48]. Logistic Regression, Random Forest and AdaBoost classifiers are then used for classification, with Logistic Regression used as the base classifier in AdaBoost.
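The feature construction for classification can be sketched as follows: a vocabulary is built from the N most frequent words in the corpus, each article is mapped to a vector of word counts over that vocabulary, and each article is labelled by the sign of its compound score (the rule stated in Section V). This is a small standard-library illustration under those assumptions. The function names and the toy documents are hypothetical, and the classifiers themselves are not shown.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, n_words):
    """Return the n_words most frequent words across all documents."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    return [w for w, _ in counts.most_common(n_words)]


def vectorize(doc, vocabulary):
    """Map one tokenized document to a count vector over the vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocabulary]


def sentiment_label(compound):
    """Label an article from its compound score: >0 positive, 0 neutral, <0 negative."""
    if compound > 0:
        return "positive"
    if compound == 0:
        return "neutral"
    return "negative"


# Toy corpus (already tokenized, stopwords removed).
docs = [["arsenal", "win", "final"],
        ["chelsea", "lose", "final"],
        ["arsenal", "draw"]]
vocab = build_vocabulary(docs, 2)          # the paper uses N = 2000, 3000, 4000
vectors = [vectorize(d, vocab) for d in docs]
labels = [sentiment_label(c) for c in (0.6, -0.4, 0.0)]  # made-up compound scores
print(vocab, vectors, labels)
```

The resulting vectors and labels are the kind of input a Logistic Regression, Random Forest or AdaBoost classifier would be trained on.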

IV. SUMMARIZATION AND SENTIMENT ANALYSIS EXAMPLE
Summarization and sentiment analysis are briefly explained with an example in this section. An input news article on Football from BBCSport is shown in Fig. 4. In this article, pronouns such as 'he', 'it' and 'they' are used several times as placeholders for proper nouns. Text summarization is performed on this text using the pronoun replacement method described in Section 2. Once the pronouns are replaced, the stopwords are eliminated from the text to identify the important words or keywords. Next, the weightage of each keyword is computed using (1). Table II shows the computed weightage for a few keywords from the input article. The sentence weightage is computed using (2) based on the weightage of the keywords present in that sentence. Each sentence is also assigned a priority number based on its sentence weightage: a lower priority number is assigned to a sentence with a higher weightage. Table III shows, for a few sentences, the sentence (partially depicted), its weightage and its priority number. The user provides the summarization ratio with which the summary is generated. The number of sentences to be included in the summary is found using (3). The summary is formed by extracting the sentences from the article in priority order. Summaries of the text in Fig. 4 are generated with ratios of 25%, 50% and 75%, and the results are shown in Table IV. Further, sentiment analysis using VADER is performed on each summarized text. VADER computes sentiment information such as negative, neutral, positive and compound scores, which are also given in Table IV.

V. EXPERIMENTAL RESULTS
The experiments are conducted on BBC news articles collected from [49]. The BBCSport dataset comprises 737 documents from the BBC Sport web site, covering five topical areas (Athletics, Cricket, Football, Rugby and Tennis) from 2004 to 2005. The VADER sentiment analyzer is applied to the news articles in the BBCSport dataset. Moreover, the POS tagger of [45], [46] is utilized to determine the parts of speech in sentences. Proper nouns and their occurrences in each article are gathered. Table V shows the top three nouns with maximum occurrence in the articles, along with their frequencies. The VADER sentiment analyzer gives negative, neutral, positive and compound scores, which are given in columns 3, 4, 5 and 6 of Table V, respectively. The counts of negative, neutral and positive words are given in columns 7, 8 and 9, respectively.
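Gathering the most frequent proper nouns of an article, as reported in Table V, can be sketched as below. The example assumes POS-tagged input with Penn Treebank tags. The function name and the tagged tokens are hypothetical and are not drawn from the dataset.

```python
from collections import Counter


def top_proper_nouns(tagged_tokens, k=3):
    """Return the k most frequent proper nouns as (noun, count) pairs."""
    counts = Counter(word for word, tag in tagged_tokens
                     if tag in {"NNP", "NNPS"})   # assumed proper-noun tags
    return counts.most_common(k)


tagged_article = [
    ("Henry", "NNP"), ("scored", "VBD"), ("as", "IN"), ("Arsenal", "NNP"),
    ("beat", "VBD"), ("Everton", "NNP"), (".", "."),
    ("Henry", "NNP"), ("and", "CC"), ("Pires", "NNP"), ("combined", "VBD"),
    ("for", "IN"), ("Arsenal", "NNP"), (".", "."),
    ("Henry", "NNP"), ("was", "VBD"), ("substituted", "VBN"), (".", "."),
]
top = top_proper_nouns(tagged_article)
print(top)
```

Nouns with equal counts are returned in the order they first appear in the article.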

Sentiment Visualization:
A novel 3-D visualization of the sentiment information obtained from VADER is presented in Fig. 5. Twenty news articles on Football and Athletics from the BBCSport dataset are considered. For each article, the number of occurrences of proper nouns is determined. In Fig. 5(a), the negative sentiment score versus the positive sentiment score is represented for each article. In this figure, the proper noun with maximum frequency is shown along the x-axis, the negative score along the y-axis and the positive sentiment score along the z-axis. Fig. 5(b) shows the maximum occurring proper noun and the count of that noun along the x-axis and y-axis against the compound sentiment score along the z-axis. Fig. 5(b) highlights the compound score of an article with respect to the noun having maximum occurrence and its count, hence showing the importance of the noun as a subject in that article. Fig. 5(c) provides a 3-D visualization of negative score versus positive score for Athletics articles. In Fig. 5(d), noun occurrences versus compound score are represented for 20 Athletics articles.
Novel 3-D visualizations are developed to represent the compound sentiment score, as shown in Fig. 6. In these figures, the compound score versus the count of positive and negative sentiment words is shown. Fig. 6(a) shows the 3-D representation of the compound sentiment of all Football articles from the BBCSport dataset. In Fig. 6(a), the highest compound score of 2.927 is observed for an article with 6 negative words and 14 positive words. Fig. 6(b), 6(c) and 6(d) represent the compound scores for articles on Cricket, Athletics and Rugby, respectively. These 3-D visualizations show how the compound score changes as the count of positive or negative sentiment words varies.
The news articles are subjected to the sentiment analyzer VADER, which provides various sentiment scores and also gives the count of negative, positive and neutral words in the articles. The sentiment of an article is positive for a compound score greater than zero, neutral for a compound score of zero, and negative otherwise. Fig. 7 provides a 3-D visualization of the count information obtained from VADER. In Fig. 7(a), the 3-D scatter plot is depicted for news articles on Football. More positive sentiments than negative or neutral ones are observed in Fig. 7(a). The 3-D scatter plots for Cricket, Athletics and Rugby are shown in Fig. 7(b), Fig. 7(c) and Fig. 7(d), respectively.
In Fig. 8, ten words with positive sentiment are depicted, and in Fig. 9, ten words with negative sentiment. In each graph, each word is shown with its sentiment score and its percentage contribution. In Fig. 8(a), the graph shows positive sentiment words for Football articles. Fig. 8(b), 8(c) and 8(d) show words with positive sentiment for Cricket, Athletics and Rugby news articles. Fig. 9 depicts the top ten negative sentiment words for news articles.

Summarization and Classification:
Sentiment classification is carried out on the news articles. The BBCSport news article dataset consists of 737 articles related to Football, Cricket, Athletics, Rugby and Tennis. Each article is also summarized with ratios of 25%, 50% and 75%, so that the dataset, including the original articles, consists of 2948 articles (737 × 4). Sentiment analysis is performed on each article using VADER. The Logistic Regression, Random Forest and AdaBoost classifiers are used for sentiment classification. Feature vectors of N-gram size are constructed from the news articles by preparing a bag of words as given in [48]. Word occurrences are collected from the BBCSport articles, and the bag of words is prepared by taking the 'N' most frequent words. Here 'N' is taken as 2000, 3000 and 4000. Table VI shows the 10-fold cross validation results on the BBCSport articles without summarization. A maximum classification rate of 84.93% is observed for the AdaBoost classifier with N as 3000.
Next, the news articles are subjected to summarization using the method described in Section 2. Summarization ratios of 25%, 50% and 75% are applied to each article. Using the sentiment analyzer VADER, the sentiment type of each article is determined. 10-fold cross validation is performed with the three classifiers, Logistic Regression, Random Forest and AdaBoost, with 'N' varied as 2000, 3000 and 4000. The 10-fold cross validation results are presented in Table VII. It is observed that better sentiment classification rates are obtained as the summarization ratio increases. With a summarization ratio of 25%, a maximum classification rate of 78.73% is observed for the AdaBoost classifier with 'N' equal to 4000. For a summarization ratio of 50%, a maximum classification rate of 83.06% is obtained with the AdaBoost classifier and 'N' equal to 4000. For a summarization ratio of 75%, a maximum classification rate of 83.23% is observed for the AdaBoost classifier with 'N' equal to 3000.

VI. CONCLUSION
In recent years, a significant amount of data has been generated in numerous forms such as social media, web blogs, web sites, Wikipedia, news articles and many more. Due to this, the end user is overloaded with data, and there is a growing need for effective methods to help users absorb it. Data extraction and representation methods are highly desirable to assist users in comprehending huge data. Text summarization is one of the effective methods of obtaining a brief overview, and sentiment analysis and classification are used to determine the opinion expressed in a text. In this research, text summarization and sentiment analysis are combined on BBC news articles. The articles are collected from [49], which consists of 737 news articles on various sports topics. An extractive text summarization method is developed, in which pronouns are replaced with proper nouns before the text summary is formed. The sentiment analysis of the BBCSport news articles is carried out with VADER, which provides positive, negative, neutral and compound scores along with counts of positive, negative and neutral words in the text. Novel three-dimensional visualizations are provided to depict the sentiment information obtained on BBCSport. Text summarization is then carried out on the news articles with summarization ratios of 25%, 50% and 75%. Feature vectors are formed from the news articles using a bag of words of N-gram size. Sentiment classification is carried out first on the news articles without summarization and then on the text summarized at 25%, 50% and 75%. Three classifiers, Logistic Regression, Random Forest and AdaBoost, are employed with N varied as 2000, 3000 and 4000. Without summarization, the highest classification rate observed is 84.93%. For text summarized at 25%, 50% and 75%, maximum classification rates of 78.73%, 83.06% and 83.23%, respectively, are obtained.

Fig. 5. The 3-D representation of sentiment information for 20 news articles. (a) Negative versus positive score for Football articles. (b) Noun occurrences versus compound score for Football articles. (c) Negative versus positive score for Athletics articles. (d) Noun occurrences versus compound score for Athletics articles.

Fig. 6. The 3-D representation of compound score versus count of positive and negative sentiment words. (a) For articles on Football. (b) For articles on Cricket. (c) For articles on Athletics. (d) For articles on Rugby.

Fig. 7. The 3-D visualization of sentiment count information. (a) For articles on Football. (b) For articles on Cricket. (c) For articles on Athletics. (d) For articles on Rugby.

TABLE I. EXAMPLE FROM VADER LEXICON

TABLE II. KEYWORDS AND WEIGHTAGE

TABLE IV. NEWS SUMMARIZATION AND SENTIMENT ANALYSIS

TABLE V. NEWS ARTICLE WITH NOUN FREQUENCY IN THE ARTICLE ALONG WITH SENTIMENT SCORES

TABLE VI. PERFORMANCE OF SENTIMENT CLASSIFICATION