A Hybrid Ensemble Word Embedding based Classification Model for Multi-document Summarization Process on Large Multi-domain Document Sets

Contextual text feature extraction and classification play a vital role in the multi-document summarization process. Natural language processing (NLP) is one of the essential text mining tools used to preprocess and analyze large document sets. Most conventional single-document feature extraction measures are independent of the contextual relationships among the different contextual feature sets used for the document categorization process. Moreover, conventional word embedding models such as TF-IDF, ITF-IDF and GloVe are difficult to integrate into the multi-domain feature extraction and classification process due to their high misclassification rates and large candidate sets. To address these concerns, an advanced multi-document summarization framework was developed and tested on a number of large training datasets. In this work, a hybrid multi-domain GloVe word embedding model, a multi-document clustering model and a classification model were implemented to improve the multi-document summarization process for multi-domain document sets. Experimental results show that the proposed multi-document summarization approach achieves better accuracy, precision, recall, F-score and run time (ms) than the existing models.

Keywords—Word embedding models; text classification; multi-document summarization; contextual feature similarity; natural language processing


I. INTRODUCTION
Machine learning (ML) has become a key approach to problem solving and data prediction. Machine learning allows a classifier to learn a set of rules, or decision criteria, from a set of labelled data that an expert has annotated. This approach enables better scaling and reduced time when classifying topic-domain data compared to a system that relies only on manual input. Most of the machine learning research on the classification of multi-domain document data has been done with binary classifiers. In many fields, machine learning for pattern mining plays an important role in decision-making systems. In the text classification (TC) process [1], a set of input documents is split into two or more classes, with each document belonging to one or more classes depending on its contents. Document clustering [2] is the method of categorizing text documents into hierarchical clusters or categories, so that documents within the same cluster are similar while documents in different clusters are dissimilar. It is one of the vital text mining processes. In particular, text mining has gained significant importance and involves various tasks, such as the development of granular taxonomies and document summarization, to produce higher-quality knowledge from text. A supervised strategy is utilized when the problem has a predetermined class or classes. A decision tree is a prediction-based model: it is distinguished by a tree-like system of rules and is mostly used to solve classification problems. The decision tree is built from training data; with this strategy, a tree is constructed to represent the classification problem. The majority of previous works [3] used single-document summarization, with approaches based on extracting sentences from the document.
Most single-document summarization systems employ a simple method for summary generation, which consists of extracting the first sentence from each paragraph and placing the sentences in the order in which they were written. The presence of multiple sources delivering the same information causes problems for news providers' end users, who must read the same material repeatedly. As a result, recent work [4] has centred on multi-document summarization. To combine information held in distinct documents, multi-document summarization requires additional procedures. This usually means that some operations, such as key matching, term matching, sentence position and sentence length analysis, must be performed below the sentence level. In summary [5], multi-document summarization can successfully address these concerns by generating shorter summaries that include the important points of the original documents, using criteria for decreasing redundancy and maximising variety in the selected articles. Most extractive summary optimisation algorithms score sentences based on their value before reordering them into the document's original sequence. Without access to the real summary analysis mechanism, it is not always possible to build partial rank lists of sentences using only the original document and the summary. The two major types of text summarization are abstraction and extraction [6]. During the document extraction process, the sentence with the highest score among the other sentences is chosen. Abstraction, in contrast, entails employing linguistic techniques to create new text, which may or may not be present in the source, and substituting it for the summary without altering the original meaning. In the extractive summarization task, the entire collection is searched for important objects, with no changes to the objects themselves. Conciseness, accuracy and objectivity are three qualities of a good summarizer.
The goal of this paper's proposed methodology [7] is to create an extractive text summarizer that can generate variable-length summaries. According to [8], the summary frequently includes sentences that are not closely related to one another. This can be handled by generating the sentence set with a sufficient threshold; as a result, one of our issues is deciding on a sufficient threshold. The order of the sentences in the summary is the next problem. Another challenge with news summarization systems is how to handle huge feature sets, as the complexity of weight adjustment increases exponentially with the number of features. As a result, higher-performance systems with more useful features are required. Among the three types of summarization systems, extractive summarization is perhaps the most investigated. Although the phrase most commonly refers to sentence extraction and reordering, numerous extractive approaches also focus on sub-sentence extraction. An extractive system can be topic-based, centrality-based, or a combination of the two. Topic-based systems prioritize the relevance of particular words or phrases.
Although specialized machine learning techniques such as neural networks (NN) and support vector machines (SVM) are used in many fields to classify data into one or more classes, traditional models must be improved for large datasets with high dimensionality. Linear regression, logistic regression, decision trees and SVMs are some demonstrations of supervised learning. Classification [9] can be defined as the procedure of sorting objects of interest into different previously defined categories or classes.
Recently, extractive single-document summaries have been generated using machine learning methods. Naïve Bayes, hidden Markov models (HMM) and log-linear models are some of the methodologies that fall within machine learning approaches. Automatic text summarization using artificial intelligence and neural networks has been the subject of a few studies. Given a set of features, the hidden Markov model estimates [10] the posterior probability that each sentence is a summary sentence or not. This model has fewer independence assumptions than the naïve Bayesian approach. Typical features include the number of terms in the sentence, the likelihood of the terms given the baseline of terms (baseline term probability) [11], and the likelihood of the terms given the document terms (document term probability).
Wrapper techniques use a single learner as a black box to evaluate feature subsets on the basis of their predictive effectiveness. Embedded techniques select the features during the training phase and are generally particular to one individual learner. PSO and neural action provide a possible optimization solution [12]. Each particle accelerates during each iteration towards the best global location discovered by the representative points. Scalability is inefficient at identifying the globally optimal solution. Dynamic goals and connectivity are taken as tasks rather than restrictions. The Multi-Objective Data Relations (MODP) approach is used to resolve all existing problems in order to improve anomaly relations. Further work can be undertaken in the future to significantly reduce the normalized root-mean-square error. Recently, ensemble learning models have become popular and widely accepted for high-dimensional and imbalanced datasets. Most of the traditional ensemble classification models are processed with limited feature space and small data size. As the size of the feature space increases, traditional ensemble classifiers select a predefined number of features for classification. The main objective of feature selection based ensemble learning models is to classify high-dimensional features on high-dimensional datasets with high computational efficiency and a high true positive rate [13]. Severe problems with performance and scalability may result from learning classification models with all of their high-dimensional features. Many textual content classifiers [14] have been proposed in the literature, including those that use machine learning techniques, probabilistic models, and so on. Decision trees, naïve Bayes, rule induction, neural networks, nearest neighbours [15][16][17] and, most recently, support vector machines are some of the techniques used.
The main contributions in this paper are: 1) Proposed a hybrid multi-domain GloVe optimization model on large document sets.
2) Proposed a multi-document clustering method for the document summarization process.
3) Implemented a hybrid multi-document Bayesian approach based document summarization process on large document sets.
The main sections presented in this paper are as follows: Section II describes the overall literature on word embedding models and multi-document summarization. In Section III, hybrid word embedding measures are proposed in order to classify the multi-domain features for the multi-document summarization process; a hybrid multi-document cluster based classification model is also proposed in Section III. In Section IV, the experimental results are presented and discussed. Finally, the conclusion of the work is presented in Section V.

II. RELATED WORK
Wu et al. proposed key extraction by combining multi-dimensional information, naming their proposed system MIKE. They used two datasets, from the ACM World Wide Web and the ACM Knowledge Discovery and Data Mining collections, and compared their results to the TF-IDF and TextRank algorithms to assess performance [18]. LAKE is a key-phrase based summarizer system that extracts relevant key phrases from documents using statistical analysis. In terms of extractive text summarization methods, neural networks outperform other traditional methods in handling semantics and redundancy, but fall short in coherence when compared to abstractive methods. There are various approaches to abstractive summarization, including linguistic-based approaches, semantic graph-based approaches and hybrid extractive/abstractive approaches [19]. Syntactic representations and tree structures are used in linguistic-based approaches, but semantic meanings are not abstracted. As discussed in a previous study, semantic graph-based approaches focus on semantic role labelling to abstract the input to its core meaning, forming graphs that filter out redundancy, followed by a text generator that builds the summaries. Hybrid approaches use extractive methods to obtain an output summary that is fed into a text generator to build non-key words and phrases, improving sentence coherence and readability.
SUMMARIST [20] is a key-phrase summarizer used to find the boundaries of extraction using a rank-based abstraction approach. The FEMsum summarizer creates summaries using a graph representation to identify the relationships between the candidate sentences, as well as a syntactic and semantic representation of the phrases. The data structure required for recognizing topics in document sets and creating various forms of summaries is built using a fuzzy co-reference cluster graph technique [21]. The intra- and inter-document co-reference chain families generated by a co-reference method under various (fuzzy) clustering criteria are given as input to this algorithm. In other words, each cluster assigns a topic to each document: some themes appear in all documents (common topics), while others appear in only a subset or a single document (contrastive/distinctive topics). In [22], a set of distance functions for assessing structural similarity between online documents is analyzed. The authors analysed three distinct ways of defining similarity: Tag Frequency Distribution Analysis (TFDA), parametric functions, and the edit distance between documents. [23] proposed a label discovery technique that uses a hierarchical structure to express the relationship between text data in online documents collected from the web. Their programme correctly classifies web pages by discovering similar labels that describe the same type of content. [24] utilised a model that combined documents from various taxonomies; for the classification challenge, their model used the Naïve Bayes algorithm. Content-based classifiers are used by some research tools, such as NewsDude, to select valuable articles and to remove articles that appear to be excessively repetitious of previously read articles. [25] proposed employing a support vector machine (SVM) classifier to identify web pages based on both text and context features.
They tested their online classification methods using the WebKB dataset, and the results demonstrate that using context features, particularly hyperlinks, can greatly enhance classification performance.
Conventional statistical methods have been included in many models. Their main drawback is their rigidity in dynamic situations and, therefore, the difficulty of optimal modelling. [26] proposed a novel discretization approach to continuous attributes for decision tree learning. The main issue with traditional decision tree models is that each attribute is assumed to be either nominal or categorical. To overcome this issue, a dynamic discretization model on the continuous label is applied to each attribute during the tree construction process. Traditional decision tree models such as CART and C4.5 use discretization methods in the preprocessing phase along with noise removal methods. However, the main limitation of this model is that the data should be of a continuous type; mixed types are not supported.
Feature selection is a process that selects an optimal feature subset based on a particular requirement, specified as a criterion for measuring feature subsets. The criterion is chosen according to the purpose for which the features are selected. For example, an optimal subset can be a minimal subset, or one that provides the best estimate of predictive accuracy. In some circumstances [27], a subset of a specified size that meets the criterion can be sought. Rough Set Attribute Reduction (RSAR) is a filter-based tool for feature reduction used to extract data and maintain information while reducing the amount of knowledge involved. Rough set analysis is performed on the basis of the provided data alone, and no external parameters are required [28]. It makes use of the granularity structure of the data. It does, however, continue to assume that some information is available for every item in the discourse universe that truly and accurately reflects the real world. The ideal criterion for rough set selection is to find the shortest or minimal reducts while obtaining high grades for the selected features; on this basis, the redundancy of a feature or feature subset is determined. A feature is declared relevant if it is predictive of the decision feature; otherwise it is irrelevant. Principal Component Analysis (PCA) achieves dimensionality reduction by building principal components that are linear combinations of the original predictor or explanatory variables. The PCA approach is based on the supposition that large variance in characteristics provides useful information while, in contrast, small variance is considered less useful. Orthogonal linear combinations are designed to maximize the variance captured from the explanatory variables. Fuzzy ELM (F-ELM) [29] has two basic stages, called preparation and prediction. P. Verma and H. Om [30] proposed the Correlation-based Feature Selection (CFS) method. This approach calculates the correlation between each attribute and the class, with the hypothesis that an optimal collection of features should be strongly correlated with the class but uncorrelated with the other features. This ensures that redundancy and the number of features [31][32][33][34] are reduced (explaining the pattern with as few features as possible while still maintaining high performance). Artificial intelligence is a notion that today has a lot of excitement around it. The rotation forest algorithm trains decision trees using a rotated feature space. In this method, samples from the main datasets [35][36][37] are obtained; these samples form a new subset which is fed into a new feature space.

III. PROPOSED MODEL
Initially, the document sets are taken as input for text preprocessing. In the preprocessing phase, each document is preprocessed using the Stanford NLP library. This library is used to perform various operations such as document tokenization, stemming and stop-word removal on the different domain fields. After performing the data pre-processing operations on the large document sets, a word embedding model is used to convert the documents to word vectors. In this work, a hybrid multi-domain word embedding model is proposed in order to optimize the word embedding keywords on large document sets.
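The preprocessing step can be sketched in plain Python. This is a minimal illustration, not the Stanford NLP pipeline the paper uses: the small stop-word list and the crude suffix-stripping stemmer below are simplified stand-ins for CoreNLP's tokenizer, stop-word filter and stemmer.

```python
import re

# Small illustrative stop-word list; a real pipeline (e.g. the Stanford NLP
# library used in the paper) ships a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

def naive_stem(token):
    # Crude suffix stripping for illustration only; CoreNLP performs
    # proper stemming/lemmatization.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    # Tokenize, lowercase, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", document.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The documents are summarized using clustering"))
# ['document', 'summariz', 'using', 'cluster']
```

Each document in the collection is reduced to such a token list before the word embedding step.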
The proposed multi-domain GloVe optimization model is designed to find the main key features and their contextual key features on large document sets. Multi-document contextual features are extracted using the main words of the GloVe model. A boosting contextual similarity is computed based on the main words, contextual words, string hash similarity and multi-document contextual similarity features to filter the essential top-k voted features in the document sets. In the next step, a multi-document clustering approach is developed on the filtered top-k contextual voted features for the multi-document summarization process. In the multi-document clustering process, an efficient KNN distance measure is used to compute the nearest clusters by using the structural similarity between the main and contextual scores. Each document and its key features are labelled with the cluster class for the multi-document summarization process. Finally, a hybrid Bayesian probability based classification approach is developed to perform the multi-document summarization process, as shown in Fig. 1.

A. Multi-Document GloVe Optimization Word Embedding Model
In the multi-document GloVe optimization model, each preprocessed document is given as input to compute the word co-occurrence matrix. Let X_ij represent the word co-occurrence matrix used to compute the main word and contextual word statistics on the large document set, and let W_i and W_j represent the main and contextual word vectors, respectively.
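Building the co-occurrence matrix X_ij can be sketched as below. The window size and the distance-based weighting are illustrative assumptions borrowed from the standard GloVe convention, not parameters stated in the paper.

```python
from collections import defaultdict

def cooccurrence_matrix(tokenized_docs, window=2):
    """Count how often a context word w_j appears within `window`
    positions of a main word w_i, accumulated over all documents."""
    X = defaultdict(float)
    for tokens in tokenized_docs:
        for i, w_i in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # GloVe-style weighting: closer context words count more.
                    X[(w_i, tokens[j])] += 1.0 / abs(i - j)
    return X

docs = [["text", "mining", "uses", "text", "features"]]
X = cooccurrence_matrix(docs)
print(X[("text", "mining")])  # 1.5: one adjacent pair (1.0) plus one pair at distance 2 (0.5)
```

The resulting sparse matrix is the input to the cost-function optimization described next.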

2) The multi-document weighted cost function over the co-occurrence matrix is computed as:

J = Σ_i Σ_j f(X_ij) · ( (W_i · W_j) / √(||W_i|| · ||W_j||) − log X_ij )²
3) The proposed multi-document word embedding model is optimized by taking the partial derivatives of the cost function with respect to the main word and contextual word vectors.
In the above multi-domain GloVe optimization model, the cost function and its constraints are improved in order to find the essential key contextual features among the multiple-domain document sets. Here, the multi-weight factor is used to find the weighted document features among the main and contextual feature vectors. Finally, the multi-document cost function is based on the multi-weights and the main and contextual feature vectors over the large contextual co-occurrence matrix.
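The derivative-based optimization can be sketched with plain SGD on a GloVe-style objective. This uses the standard GloVe cost and weighting function f(x) = (x/x_max)^α as a stand-in; the paper's multi-weight term is not reproduced here, and the learning rate and toy counts are illustrative assumptions.

```python
import math
import random

def glove_step(W, Wc, b, bc, X, lr=0.05, x_max=100.0, alpha=0.75):
    """One SGD epoch on the GloVe cost
    J = sum_ij f(X_ij) * (W_i . Wc_j + b_i + bc_j - log X_ij)^2,
    updating main vectors W, context vectors Wc and biases b, bc in place."""
    total = 0.0
    for (i, j), x in X.items():
        f = (x / x_max) ** alpha if x < x_max else 1.0
        dot = sum(wi * wj for wi, wj in zip(W[i], Wc[j]))
        diff = dot + b[i] + bc[j] - math.log(x)
        total += f * diff * diff
        g = 2.0 * f * diff  # gradient of the squared term
        for k in range(len(W[i])):
            wi, wj = W[i][k], Wc[j][k]
            W[i][k] -= lr * g * wj
            Wc[j][k] -= lr * g * wi
        b[i] -= lr * g
        bc[j] -= lr * g
    return total

random.seed(0)
dim, vocab = 4, 3
W = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab)]
Wc = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab)]
b, bc = [0.0] * vocab, [0.0] * vocab
X = {(0, 1): 4.0, (1, 2): 2.0, (0, 2): 1.0}  # toy co-occurrence counts
losses = [glove_step(W, Wc, b, bc, X) for _ in range(50)]
print(losses[0] > losses[-1])  # the cost decreases over the epochs
```

On the toy counts above, the cost falls steadily, mirroring the partial-derivative updates on the main and contextual vectors.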

B. Boosting Voting based Word Embedding Contextual Similarity
In this phase, a voted boosting method is used to compute the best similarity measure based on the multi-document GloVe main and contextual key vectors. Hash-based similarity, string similarity and the proposed multi-document main and contextual similarity measure are used, and the majority-voted similarity over the GloVe main and contextual feature vectors is chosen. The proposed main and contextual similarity measure is computed by using the following formula.
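The majority-voting step can be sketched as follows, with plausible stand-ins for the three measures: Jaccard overlap for the hash-based similarity, a character-bigram Dice coefficient for the string similarity, and a hypothetical averaged overlap of main/contextual word sets for the proposed measure (whose exact formula is not given above). The 0.5 voting threshold is also an assumption.

```python
def jaccard(a, b):
    """Set-overlap similarity, standing in for the hash-based measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def string_sim(s, t):
    """Character-bigram Dice coefficient as a simple string similarity."""
    def grams(x):
        return {x[i:i + 2] for i in range(len(x) - 1)}
    g, h = grams(s), grams(t)
    return 2 * len(g & h) / (len(g) + len(h)) if g and h else 0.0

def contextual_sim(main_a, ctx_a, main_b, ctx_b):
    # Hypothetical stand-in for the paper's main/contextual measure:
    # the average overlap of the main-word sets and the contextual-word sets.
    return 0.5 * (jaccard(main_a, main_b) + jaccard(ctx_a, ctx_b))

def voted_similarity(scores, threshold=0.5):
    """Majority vote: the pair is similar iff most measures pass the threshold."""
    votes = sum(1 for s in scores if s >= threshold)
    return votes > len(scores) / 2

print(voted_similarity([0.8, 0.6, 0.3]))  # True: two of the three measures agree
```

Feature pairs that win the vote are kept as the top-k voted features for the clustering stage.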

C. Multi-Document Clustering based on KNN Approach
In this phase, a hybrid multi-document clustering based KNN approach is developed on the main and contextual key similarity features. This approach is used to group the multi-documents based on the domain main and contextual similarity vectors. Let k define the user-defined number of k-nearest objects for grouping.
1) Read the input data MD_t.
2) Initialize k clusters for KNN and perform the traditional k-means document clustering algorithm.
3) In the proposed document clustering approach, instead of using the conventional distance measures, a hybrid weighted distance measure is proposed between the main and contextual word vectors.

4) The weighted multi-document pair distance is computed between the main and contextual word vectors.
The k-score is used to find the document classification score in each domain field for class label prediction on the new test data. The k-score measure is computed using the following formula.
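Steps 1-4 can be sketched as a small k-means-style clusterer under one assumption: the hybrid weighted distance is taken as a weighted sum of the distances between the main-word vectors and between the contextual-word vectors, with illustrative weights 0.6/0.4 (the paper does not state the weights). The deterministic initialization is for reproducibility only.

```python
import math

def weighted_distance(x, y, w_main=0.6, w_ctx=0.4):
    """Hybrid weighted distance between two documents represented as
    (main_vector, contextual_vector) pairs; the weights are assumptions."""
    def euclid(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return w_main * euclid(x[0], y[0]) + w_ctx * euclid(x[1], y[1])

def kmeans(points, k, iters=20):
    # Deterministic initialization (first k points) keeps the sketch reproducible.
    centers = list(points[:k])
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each document to its nearest center under the weighted distance.
        groups = [[] for _ in range(k)]
        for p in points:
            c = min(range(k), key=lambda i: weighted_distance(p, centers[i]))
            groups[c].append(p)
        # Recompute each center as the component-wise mean of its group.
        for i, g in enumerate(groups):
            if g:
                centers[i] = tuple(
                    tuple(sum(vec[d] for vec in part) / len(part)
                          for d in range(len(part[0])))
                    for part in zip(*g)
                )
    return centers, groups

# Two well-separated toy "documents" per cluster.
points = [
    ((0.0, 0.0), (0.0, 0.0)), ((5.0, 5.0), (5.0, 5.0)),
    ((0.1, 0.1), (0.1, 0.1)), ((5.1, 5.1), (5.1, 5.1)),
]
centers, groups = kmeans(points, 2)
print([len(g) for g in groups])  # [2, 2]
```

Each resulting group supplies the cluster class labels used by the Bayesian classification stage that follows.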

D. Multi-Document Conditional Bayesian Estimation based Classification
In the multi-document summarization phase, the clustered training data generated in the previous section are taken as input to the multi-document based multi-domain classification process. The proposed Bayesian classification model is used to classify the key phrases for the multi-document summarization process. In this phase, two optimizations are performed on the traditional Bayesian text classification model. In the first optimization, a hybrid prior multi-document probability is developed to predict the multi-domain phrases on the large textual document sets. In the second optimization, a hybrid posterior probability is proposed on the main and contextual word vectors in each class category. The main steps used in the proposed multi-document summarization are:
1) Read the contextual- and main-word clustered labelled document sets as input.
2) Compute prior multi-document classification probability as:

P(c_k) = |D_{c_k}| / |D|

where D_{c_k} is the set of clustered training documents labelled with class c_k and D is the complete set of training documents.
3) Predict the posterior multi-document estimation by maximizing over the class labels.
4) For each document in the training document sets, merge the phrases with high posterior probability and high contextual-main word similarity scores to form the summary.
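The two Bayesian quantities above (a prior estimated from the cluster labels and a smoothed posterior over the main/contextual words) can be sketched as a standard multinomial Naïve Bayes classifier. The class names, Laplace smoothing and toy training data below are illustrative assumptions, not the paper's exact hybrid formulation.

```python
import math
from collections import Counter

class NaiveBayesSummarizer:
    """Sketch of the Bayesian step: priors from cluster labels,
    word likelihoods with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, c in zip(docs, labels):
            self.counts[c].update(tokens)
        self.vocab = {w for cnt in self.counts.values() for w in cnt}
        return self

    def log_posterior(self, tokens, c):
        # log P(c) + sum log P(w | c); smoothing keeps unseen words
        # from zeroing out the posterior.
        total = sum(self.counts[c].values())
        lp = math.log(self.prior[c])
        for w in tokens:
            lp += math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
        return lp

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_posterior(tokens, c))

nb = NaiveBayesSummarizer().fit(
    [["stock", "market"], ["market", "trade"], ["gene", "protein"], ["protein", "cell"]],
    ["finance", "finance", "biomedical", "biomedical"],
)
print(nb.predict(["market", "trade"]))  # finance
```

A summary is then assembled per document by keeping the phrases whose posterior score, combined with the contextual-main word similarity score, is highest.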

IV. EXPERIMENTAL RESULTS
The performance has been evaluated on multiple multi-domain datasets. In the experimental study, word embedding features, classification metrics and multi-document summarization ROUGE metrics are used to evaluate the performance of the proposed model against the conventional models. Table 1 illustrates the performance evaluation of the proposed multi-domain GloVe optimization model against the conventional approaches for contextual similarity computation on the DUC 2002 dataset. As represented in the table, the proposed approach achieves better contextual similarity between the main and contextual word vectors than the previous models. Table 2 illustrates the corresponding evaluation on the multi-news dataset, and Table 3 on the multi-biomedical dataset; in both cases, the proposed approach again achieves better contextual similarity between the main and contextual word vectors than the previous models. From the figure (test categories TestCat2-1 to TestCat2-20), it is observed that the proposed approach outperforms the previous models in filtering contextual keywords from the large candidate key feature space.
Table 4 illustrates the performance evaluation of the proposed multi-domain GloVe optimization model against the conventional approaches for filtering the optimal key feature count for the document clustering process on the multi-news dataset; Table 5 presents the same comparison on the biomedical document sets. As represented in the tables, the proposed approach outperforms the previous models in filtering contextual keywords from the large candidate key feature space. For experimental evaluation, we use ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to measure the performance of the proposed multi-document summarization process against the various traditional models. Table 9 represents the performance evaluation of the proposed multi-document based Bayesian summarization model against the conventional models on average ROUGE metrics for the DUC 2004 domain database. From the table, it is noted that the proposed multi-document based Bayesian summarization approach improves the average ROUGE metrics over the previous approaches on the different DUC 2004 document sets.
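The ROUGE-1 scores used above can be computed from unigram overlap as below. This is a plain re-implementation sketch for illustration, not the official ROUGE toolkit, and it omits stemming and stop-word handling.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 recall/precision/F1 on clipped unigram overlap between a
    generated summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts via multiset intersection
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores["recall"])  # 5 of the 6 reference unigrams are covered
```

Averaging such scores over all document sets gives the average ROUGE metrics reported in the tables.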

A. Results Interpretation
In this work, different multi-document features and their correlated main and contextual words are used to analyze the multiple documents for summarization. From the experimental results, it is noted that the average accuracy, recall and precision of the proposed multi-document summarization are better than those of the conventional models, with nearly 1% improvement. Also, the contextual features of the proposed GloVe model provide better optimization for the word-to-vector generation process.

V. CONCLUSION
Multi-document summarization plays a vital role in multi-domain document sets due to variation in the feature space and inter- and intra-document cluster variations. Most of the conventional multi-document summarization models produce large numbers of candidate feature sets for the document clustering and classification process. In this work, a hybrid multi-document based GloVe optimization model is proposed in order to filter the key features on multi-domain document sets. Also, a hybrid document clustering and multi-document Bayesian classification model for the multi-domain document summarization process is proposed on large document sets. The experimental evaluation shows that the proposed Bayesian multi-document summarization approach improves the ROUGE evaluation metrics over the previous models by nearly 2-3% on large multi-domain document sets. In future work, this approach can be extended to improve the multilevel dynamic multi-domain feature extraction and summarization process using a parallel processing framework.