A Hybrid Document Features Extraction with Clustering based Classification Framework on Large Document Sets

As document collections grow day by day, finding the essential document clusters for a classification problem is a major challenge because of high inter- and intra-document variation. In addition, most conventional classification models, such as SVM, neural network and Bayesian models, have a high true negative rate and error rate for document classification. To improve the computational efficacy of the traditional document classification models, a hybrid feature extraction-based document clustering approach and a classification approach are developed for large document sets. A hybrid GloVe feature selection model is proposed to improve the contextual similarity of the keywords in a large document corpus. A hybrid document clustering similarity index is then optimized to find the essential key document clusters based on the contextual keywords. Finally, a hybrid document classification model is used to classify the clustered documents in a large corpus. Experiments on different datasets show that the proposed document clustering-based classification model has a higher true positive rate and accuracy, and a lower error rate, than the conventional models.
Keywords: Classification; document feature extraction; document similarity


I. INTRODUCTION
Machine learning algorithms are now widely used in data mining, of which text classification is a part. Work is currently being conducted on using machine learning methods to improve efficiency as the complexity of computations rises. Approaches to machine learning in text classification are reviewed in [1]. The benefit of the solution suggested there was that it factored in both local and global characteristics and had the potential to be noise resistant. Comparative studies on different datasets showed that it worked better than the traditional SVM methodology. Document representation in a vector model is also an essential part of document clustering algorithms. Several text representation models, such as n-gram models, bag-of-words and feature filtering, have been widely used for large collections of documents. Generally, distributed documents share some common context for clustering and categorization [2]. These contexts are represented by the documents' key terms. Textual data are represented using words and phrases as features in a high-dimensional vector space. Document clustering groups a large number of documents into a set of useful clusters, where each cluster represents a specific topic or context. Documents within a cluster should have a high degree of similarity, while the similarity between distinct document clusters should be low. Traditional clustering methods cluster documents without paying much attention to the contextual information of the document set. For example, if two or more documents describe the same topic using different but semantically equivalent terminology, the documents are placed in different clusters. This kind of clustering may lead to inefficient information retrieval.
Document clustering has therefore become increasingly important for improving document sharing and communication in distributed environments. It has many applications in information retrieval and data mining [3]. For this purpose, different document clustering techniques have emerged to overcome the limitations of the traditional ones.
Clustering methods can be divided into two main categories: generative (model-based) approaches and discriminative (similarity-based) approaches. Model-based techniques are used to develop extended models from the peer document sets, with each model randomly assigning one particular document cluster. In the similarity-based approach [4], a function of the pairwise document similarities is optimized, with the aim of optimizing the average cluster similarities within the peer overlay clusters.
Moreover, the different levels of analysis are not disjoint. For instance, semantics plays an important role in syntactic analysis. NLP is a subfield of artificial intelligence and linguistics. In IR, NLP is often used as a pre-processing step: when a system needs to find the most important information in a text and then retrieve it, it first has to identify the most important parts. It is important to note that text mining, IR and NLP are different fields. Sophisticated NLP techniques are frequently used in IR to represent the content of a text precisely (e.g. with noun and verb phrases being the most important units) and to extract the main points of interest, depending on the domain of the IR service. However, NLP is used not only for parsing documents but also for handling user queries, from which the important information has to be parsed in a similar way [5]. NLP techniques are used in almost every aspect of the text mining process, for example in Named Entity Recognition.
The remainder of the paper is organized as follows. Section 2 presents related work and advances in the field. The proposed hybrid clustering-based classification is presented in Section 3. Experimental results are discussed in Section 4. Section 5 concludes the paper.
II. RELATED WORKS
Two model architectures are available in Word2Vec for representing text in vector form. The first (CBOW) predicts the current word from the terms in its immediate surroundings while ignoring the order of those terms; the second (skip-gram) uses the current word to predict the surrounding context words. CBOW is quicker to train, whereas skip-gram is slower but better in terms of word weighting. Google's Word2Vec vectors are trained on 100 billion words from Google News documents and are publicly available. These vectors have 300 dimensions and are trained using a continuous bag-of-words model. WordNet is a large lexical database of English [6]. Terms that are connected with each other are grouped into a collection known as a synset. Lexical categories such as nouns, verbs and adjectives form different synsets and are related through conceptual-semantic and lexical relationships. WordNet is a kind of combined dictionary and thesaurus that provides meanings for terms and comparisons with other terms. Words are thus like nodes, and the connections reflect the relations between them. Word forms found in different synsets have different meanings. The current online edition of WordNet is 3.1. Similar to WordNet, ConceptNet is a broad semantic network composed of concepts relevant to everyday life. The concepts capture commonsense knowledge, and this information is derived from the experiences of ordinary people over the Internet. It is the largest shared knowledge base accessible to the public, consisting of over 250,000 relationships. These approaches are generally implemented in a domain-independent way in order to obtain better optimization solutions. Different approaches such as genetic algorithms, rough sets and SVM are used to classify the document sets of a large corpus.
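The difference between the two Word2Vec architectures described at the start of this section is easiest to see in the training examples each one generates. The following is a minimal sketch, assuming a symmetric context window; the function names and the `window` parameter are illustrative and are not taken from the paper or from any Word2Vec library.

```python
def cbow_examples(tokens, window=2):
    """CBOW: predict the centre word from its (unordered) context words."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Sorting makes explicit that CBOW ignores the order of context words.
        examples.append((sorted(context), target))
    return examples


def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the centre word."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                examples.append((centre, tokens[j]))
    return examples


tokens = ["document", "clustering", "improves", "retrieval"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW yields one example per position, while skip-gram yields one example per (centre, context) pair, which is one reason skip-gram is slower to train on the same text.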
Genetic algorithms are used to find complex patterns and classification rules in huge datasets [7]. In some hybrid approaches, genetic algorithms are integrated with decision tree schemes to generate an optimized decision tree. Classification models such as Naïve Bayes, together with ensemble decision tree models, namely CART, C4.5, Bayesian tree and random forest, are used for document classification and feature extraction. The authors concluded that no single traditional model can handle uncertainty in document prediction with a large attribute set. Hadoop is a software framework for efficient, scalable and parallel programming applications in Java and is responsible for processing huge amounts of data. It operates in a distributed environment on specific clusters and provides fault-tolerant results. It can integrate the computation and storage of multiple cluster nodes efficiently. Traditional document clustering algorithms are compared on distributed biomedical repositories for efficient document feature extraction [8]. An advanced three-layer biomedical framework has been implemented to cluster sets of documents [9]. This framework is based on a multi-layer neural structure of neighborhood peers. Overlay peers that act as representative objects of their lower neighborhoods are clustered to form higher-level clusters. The basic limitation of this model is the selection of an optimal threshold for a dynamically sized overlay network. It is also very hard to balance the structure size against the peer documents.
A model using a parallel approach is implemented to cluster multiple document collections [10]. The key issue is finding document clusters automatically in a large text corpus, where comparing documents in a high-dimensional vector space is very costly. The algorithm tries to minimize the distance computations and the cluster size in the training dataset documents, called pivots. The parallel algorithm is used to optimize a complex data structure that affords efficient indexing, searching and sorting. Traditional probability estimation techniques such as Naïve Bayes, the Markov model and the Bayesian model [11] are used to find the highest probability estimation variance among genes and their related disease sets in biomedical document sets. Classification is the process of finding and extracting the main contextual meaning of the gene or disease patterns from distributed document sources, and it has become an integral part of day-to-day activities in domains such as cloud, forums, social networking and medical repositories. Automatic text classification fulfills certain goals by applying classification techniques at the user end to find relevant summaries of large document sets. Document summaries consist of sentences or phrases extracted from different sources without any subjective human intervention or editorial touch, which makes the end product completely unbiased. Classification is a highly interdisciplinary field involving areas such as information extraction, text mining, information retrieval, natural language processing and medical databases. Many scholars at home and abroad have studied text classification using key methods such as conventional machine learning and the currently common deep learning.
The authors of [12] consider the problem of retrieving multiple documents relevant to the individual subtopics of a Web query, termed "full-subtopic retrieval". To solve this problem, they present a new search results clustering algorithm that generates clusters labeled with key phrases [12]. The key phrases are extracted from a generalized suffix tree built from the search results and are merged through an enhanced hierarchical agglomerative procedure. They also present a new measure for assessing full-subtopic retrieval performance, namely "subtopic search length under k document sufficiency". Using a test collection explicitly designed for assessing subtopic retrieval, they found that their algorithm outperformed both the other existing search results clustering algorithms and a search results re-ranking method that emphasizes the diversity of the results (at least for k>1, that is, when more than one relevant document per subtopic is to be retrieved).
Kostkina et al. [13] suggested a new approach that expands the features of short texts based on Wikipedia and Word2vec. The first phase was the creation of semantically related definition sets from Wikipedia. The semantic relationship between the target and related concepts was measured, and the authors obtained articles that were highly relevant to the Wikipedia concepts. The authors then added the relevant concept sets to the short texts, and it was noted that this methodology could achieve greater semantic relatedness than traditional similarity calculations based on a statistical approach. Experiments showed that the precision of classification could be improved by extending the features of short texts.
Mishra et al. [14] presented a new system, Word Embedding Function Extension for Short Text (WEFEST), which extends short texts using word embeddings for classification. The proposed WEFEST was embedded in a deep language model in which word corrections were used to learn a new embedding space, from which the new feature vector space was selected. Each short text in the training dataset was enriched using pre-trained word embeddings, and the authors used the nearest neighbor algorithm to classify the short texts. The effectiveness of the suggested technique was validated by empirical results on title datasets from Chinese news websites. They applied various feature extraction methods, feature representation methods and text classification approaches. The proposed work focused on forensic autopsy knowledge in order to find suitable methods for feature extraction, feature representation and text categorization. The empirical findings showed that unigram features outperformed bigram, trigram and combined unigram, bigram and trigram variants. Compared with normalized TF-IDF structures, the TF and TF-IDF value representation approach works efficiently. LDA was used to extract the thematic details. The authors could add topic-relevant features to the feature set that defines a document in order to enhance text classification. The authors of [15] explored various forms of term frequency and topic-related data, which were used as features for a support vector machine. The experimental results on three corpora showed that the accuracy of text classification could be improved by the combined features. Unlike supervised selection techniques, which use the category information in the training data, Park et al. [16] proposed an unsupervised feature selection technique that needs no category information.
This allowed the framework to cover more scenarios, since labeled data was both expensive and not very reliable. Like other unsupervised methods, this technique used word embeddings to identify terms with virtually the same semantic meaning. The word embedding maps terms to vectors while preserving the semantic relationships between terms. To prevent redundancy, many of the words were not used as features; the authors chose the most suitable word among those with similar semantic meaning. Sinoara et al. [17] proposed a feature selection technique based on Kullback-Leibler (KL) divergence. The purpose of this technique is to evaluate the association between each class and subclass through KL divergence. Mutual information was used to calculate the correlation between each feature and subclass, and term frequency probability was used to measure the importance of subclass characteristics, so that a more discriminative set of features could be selected for the parent class node. The authors used hierarchical feature selection techniques and SVM classifiers on two corpora for hierarchical text classification tasks. Experiments showed that the proposed algorithm was successful compared with chi-square statistics (CHI), information gain (IG) and mutual information (MI) used directly to select hierarchical features.
Jiang et al. [18] proposed a novel text classification algorithm based on Ant Colony Optimization (ACO). It exploited the discreteness of text document features and the strength of ACO in addressing discrete problems. The behavior of an ant population carrying the class information was used to classify the text by finding a suitable matching route during the iterations of the algorithm. The connectedness score between two concepts was high if there were several paths between them (consisting of direct or indirect hyperlinks). TD and WD were then concatenated vertically to form the TD&WD matrix, which was used for classification. For the final grouping, majority voting was used. The scheme achieved classification accuracies of 0.9331 on Reuters, 0.7563 on 20 Newsgroup and 0.5198 on RCV1.
Song et al. [19] used NPE for feature selection and applied a PSO classifier for document classification. They stated that NPE is a better feature-selection scheme than Latent Semantic Indexing (LSI). LSI has proven to be a powerful tool for various information retrieval tasks, but it may not be a successful discriminating feature selector for classifying documents into different categories. Feature extraction is a core concept in the text classification process. To build lexical chains, Ravindran used semantic resources such as synonyms and identity [20]. Based on lexical chains, a two-pass algorithm generates feature vectors, first generating all possible lexical chains and then selecting the longest chains. By removing unimportant chains, they achieve a reduction of 30 per cent in feature vector dimensions and an increase of 74 per cent in execution time. Apart from English, word2vec is also used for sentiment classification of Chinese text. Likewise, it was used with recurrent neural networks, and the findings outperform the standard techniques in both cases. A further attempt was made to classify documents using word2vec in combination with the LDA method [21], which also provided better results. In addition to text classification, word2vec has been used in many other application areas, such as improving medical knowledge through unsupervised learning on medical corpora [22] and selecting answers from a candidate collection rated good or poor in a question-answering method [23]. Word2Vec is an unsupervised model that captures the semantic context associated with the text. The authors of [24] developed a framework that uses Information Retrieval (IR) strategies to extract information in the biomedical domain. Their framework can rank documents according to their degree of relevance, extract relevant documents and diversify the information. The authors presented two labelling methods and merged some IR models. They validated their approach by experimenting on TREC Genomics datasets and reported enhanced performance.
www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 7, 2020
Ma et al. [25] presented an algorithm for classifying biomedical documents using Medical Subject Headings (MeSH) and MEDLINE indexing. The authors considered 50 articles from MEDLINE and classified them with this algorithm, which uses a natural language processing scheme. After calculating precision and recall manually for the individual documents, they calculated the average precision and recall. The authors also identified three major flaws in this approach:
1) Precision and recall decrease significantly because of the exact matching.
2) The algorithm is unable to classify an abstract of a biomedical document.
3) Because blog terminology does not consist of MeSH headings, the algorithm cannot be applied to blog articles.
Further research is necessary to overcome these flaws. Park et al. [26] proposed an algorithm that returns the top search results for IR queries. Their approach uses a Boolean interface without ranking functions. Ranked documents are produced from conjunctive queries using a relevance metric. The researchers built their algorithm on probabilistic modelling. The algorithm sets a minimum cut-off for documents to be categorized as high ranking. They argued that their technique supports monotonic ranking of various keywords and that the corresponding interface uses Boolean expressions of keywords. The authors validated their methodology by experimenting on the PubMed database and the TREC dataset and obtained enhanced results.
Rashid et al. [27] introduced two new deep learning approaches. They computed word embeddings and compared them with other modelling approaches. Compared with other deep learning approaches to biomedical document mining, their algorithm gives better performance. They divided their research into three sub-categories:
1) Various domain-specific representations are analyzed and a new word embedding approach is proposed.
2) A DBN-based DDIE model and an RNN-based NER model are introduced, through which word embedding is performed, and these are compared with skip-gram, CBOW, GloVe, etc.
3) The technique shows significant results in word embedding, with better recall.
Extraction of keywords from text data is an important technique used by search engines and indexing services to quickly categorize and locate relevant data based on keywords that are explicitly or implicitly provided. In this section, the literature review covers the various methods used to locate and identify keywords in individual papers, social networking sites, lecture audio archives, speech transcripts, website databases, etc. It is important to note that most of the algorithms considered in the analysis used an external corpus of documents to check and assess their performance. Similarly, these algorithms relied on a weighted function that combined some measure of the presence of a word or phrase within a text with a similar measure from the corpus. The most common measurements were word frequency, word distance, position of terms in the document, co-occurrence with other terms, word-to-word relationships (lexical chains) and key phrases. One group of authors suggested a keyword extraction system that works on individual documents.
They followed a text-oriented approach in which the same keywords are extracted from a text irrespective of the current state of a corpus. The DIKpE algorithm was evaluated for effectiveness and performance on a publicly accessible keyphrase extraction dataset containing 215 full-length documents from various computer science subjects. DIKpE was evaluated by counting the matches between the key phrases attached to the text and the automatically extracted key phrases. DIKpE clearly outperformed the other two algorithms in extracting key phrases, although no training was undertaken. Other authors discussed several techniques for automated (unsupervised) keyword extraction from voice transcripts. They focused on the multiparty meeting domain in particular and explored whether algorithms that had been used successfully for automated keyword extraction from written text were suitable for meeting transcripts. To test these keyword extraction algorithms, they used transcripts from the ICSI meeting corpus. They also integrated POS filtering, word clustering and sentence salience scores into the TF-IDF system and evaluated the outcomes. The accuracy and efficiency of thematic classification were also determined. Two unsupervised discriminative term weightings were used to automatically classify highly imperfect transcriptions. Term Frequency-Inverse Document Frequency (TF-IDF) with the Gini purity criterion was used to identify the transcription themes. Other authors used Wikipedia page redirects to automatically obtain language-independent morphological variants. Four languages were used for the research, with 383,000 Arabic documents, 50 million English documents, 50,000 Hungarian documents and 211,000 Portuguese documents. Normalized discounted cumulative gain and mean average precision were used to measure performance.
The authors in [29] reported results for Arabic stemming and substantial progress in English retrieval, outperforming words and stems. News articles were classified using the KNN machine learning approach. A Naïve Bayes term graph model combined with K-Nearest Neighbours (KNN) was used by the authors as a hybrid approach to obtain accurate results. The authors used the Reuters dataset with 21578 articles, of which 9603 were training documents and 3299 were test documents. Specific pre-processing methods were applied to the documents to achieve better results. A vector space model was developed and the documents relevant to a query were collected. The authors described the methods used in text classification, such as K-Nearest Neighbors, regression models, decision trees, decision rules, Naïve Bayes and Bayesian networks. News articles were divided into various categories by document classification, and the relevant document was displayed when keywords were entered. For each document, an association rule mining algorithm was applied to find the frequently co-occurring terms, which were then mapped to a weighted, directed graph. Unsupervised approaches typically assign each candidate phrase a saliency score by considering various features. Supervised machine learning algorithms have been proposed to classify a candidate phrase as a key phrase or not, using features such as occurrence frequency, POS details and the position of the term in the text. Both kinds of method use only the document text to produce key phrases and cannot (as-is) be used to produce label-specific key phrases. A keyword extraction model was developed using both statistical and pattern features inside words. The algorithm is language independent and does not require a semantic dictionary to obtain the semantic features. An improved keyword extraction method (Extended TF) was suggested in [30].
The document clustering by consensus and classification (DCCC) model is implemented in [28] to perform cluster-based classification on high-dimensional datasets of limited size. Fig. 1 describes the proposed cluster-based document classification framework for large document sets. Initially, document pre-processing is applied to the input document sets. Here, doc-1, doc-2, …, doc-n represent the input documents for document filtering and feature extraction. Each document is filtered using the Stanford NLP library. In the pre-processing phase, each document is tokenized for word vector generation, stop word removal, stemming and n-gram processing. After the pre-processing phase, each filtered document is given to the hybrid GloVe optimization model. The main and contextual key phrases produced by the GloVe optimization function are given to the similarity measures for key phrase ranking. These ranked main and contextual key phrases are given as input to the clustering-based KNN model and the hybrid probabilistic naïve Bayesian model. These models are used to improve the prediction rate, or equivalently to minimize the error rate, on large document sets. In this work, an advanced cluster-based classification model is designed to improve the document cluster quality and the classification accuracy. Most traditional document clustering-based classification models do not address multi-class document classification because of their high computational time and limited accuracy. In this work, a hybrid clustering measure for the document classification problem is proposed to minimize the runtime (ms) and improve the classification accuracy.
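The pre-processing phase described above (tokenization, stop word removal, stemming and n-gram processing) can be sketched as follows. The paper uses the Stanford NLP library; this sketch deliberately avoids any external dependency, so the tiny stop word list and the naive suffix stripper below are illustrative assumptions and stand in for a real stop list and stemmer.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Tokenize, drop stop words, stem, and build n-grams."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return tokens, ngrams(tokens, n)


tokens, bigrams = preprocess("The documents are clustered for classification.")
print(tokens)
print(bigrams)
```

In a real pipeline the stemmer and stop list would come from an NLP toolkit; only the order of the steps is meant to mirror the description.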

III. A HYBRID DOCUMENT CLUSTERING BASED CLASSIFICATION FRAMEWORK
Most of the information available on the internet is in the form of text, so handling text data is imperative. The extraction of useful and non-trivial information and knowledge from unstructured text is commonly referred to as text mining. Text categorization is a key area of study within the field of text mining. The basic purpose of categorizing text is to identify, understand and organize volumes of text data or documents. The key problems in solving this classification task are the complexity of natural languages and the extremely high dimensionality of the document feature space. Machine learning thus has a dual role. Firstly, an efficient data representation is needed to store and process the vast amount of data, together with an effective learning algorithm to solve the problem. Secondly, the accuracy and efficiency of the learning model should be high in order to identify unknown documents. The aim is to reduce the curse of dimensionality so as to improve classification accuracy and reduce the time consumed by excessive processing. For the classification of text documents, feature subset selection methods employ an evaluation mechanism that is applied to each single word (term). There are a variety of factors in assessing a classifier's performance, such as training time, testing time, precision and recall. The proposed model selects a document class by analyzing the words in the text, which consist mostly of nouns, verbs and the adjectives associated with those nouns. A similar method has been proposed using POS (part-of-speech) tagging, where the POS tagger automatically classifies the terms in documents via the tags attached to them. A drawback of the Doc2Vec model is the high computational cost of constructing the model for each document compared with Word2Vec, GloVe and fastText. The Doc2Vec model creates a one-time-only model for the n-gram text representation. Every term of the document representation vector is a collection of two or more adjacent words in a repository of documents. Another similar approach is to use fixed-length sequences of letters, in which individual letter sequences form the elements of every document's feature vector.
The GloVe optimization model is used to extract n-gram local word vectors from the filtered data. This model extracts the main words and their associated contextual features. Finally, these main word and contextual keyword vectors are given to the adaptive contextual similarity and string similarity measures. GloVe encodes meaning as vector offsets in an embedding space. The model measures the frequency of word co-occurrences within a specific window over a large text corpus to produce linear structures of meaning, and it combines global matrix factorization with local context window methods. The model also has a local cost function and a weighting function to offset uncommon co-occurrences.
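The co-occurrence statistics that GloVe starts from can be illustrated with a short sketch. This is a minimal example that assumes plain counts inside a symmetric window; the function name and `window` parameter are illustrative, and the sketch omits refinements such as distance-based down-weighting that some implementations apply.

```python
from collections import defaultdict


def cooccurrence_counts(documents, window=2):
    """Count how often word j appears within `window` positions of word i.

    `documents` is a list of token lists; the result maps (i, j) -> count.
    """
    counts = defaultdict(float)
    for tokens in documents:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return dict(counts)


X = cooccurrence_counts([["gene", "disease", "gene"]], window=1)
print(X)
```

Only non-zero entries are stored, which matters in practice because the full vocabulary-by-vocabulary matrix is extremely sparse.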
The proposed GloVe algorithm consists of the following steps:
1) Parameter initialization: Let X be the word co-occurrence matrix, where each element X_ij represents how often word i appears in the context of word j, and let θ denote the model parameters, including the bias terms b.
2) Define soft constraints for each word pair (i, j).
The proposed GloVe model is optimized by using a cost function expressed as a double summation (∑∑) over the word pairs.
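For orientation, the standard GloVe objective (not necessarily the hybrid variant proposed here) imposes the soft constraint $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}$ on each word pair and minimizes $J = \sum_i \sum_j f(X_{ij})\,(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$, where $f$ is a weighting function that damps rare co-occurrences. The sketch below evaluates that standard loss; the dictionary-based data layout and function names are assumptions made for illustration, and the defaults `x_max=100` and `alpha=0.75` are the values commonly used with standard GloVe.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting: damp rare pairs, cap frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, w, w_ctx, b, b_ctx):
    """Weighted least-squares loss over the non-zero co-occurrence counts.

    X        : dict mapping (i, j) -> co-occurrence count X_ij > 0
    w, w_ctx : dicts mapping a word to its main / context vector
    b, b_ctx : dicts mapping a word to its main / context bias
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        dot = sum(a * c for a, c in zip(w[i], w_ctx[j]))
        residual = dot + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * residual ** 2
    return total
```

Training would minimize this loss with respect to the vectors and biases, for example by stochastic gradient descent; that step is omitted here.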
In the contextual similarity measure, the similarity between the GloVe features is evaluated to find the contextual phrases in biomedical or any other textual document sets. Here, the dissimilarity index is used to compute the non-correlated features among the large number of candidate patterns. Finally, the contextual GloVe similarity index is computed using the dissimilarity measure.

Hash based Similarity Measure
The hash-based string similarity is defined as follows. Let p and q be randomly selected big integers, k a big prime integer and x the input word vector; then the hash integer of each word in the word vector is given as

H(x)=((px+q).m)xor(k)
The hash-based similarity measure is used to find the connecting string similarity between the key phrases. Thus, if a sentence starting with a connecting word is included in the key phrase, its preceding sentence is also included in the key phrase regardless of its rank.
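The hash function above can be sketched in Python as follows. Several details are assumptions made only for illustration: the text does not define m, so the constant `M` is hypothetical; words are encoded as integers from their UTF-8 bytes; and two phrases are compared by the Jaccard overlap of their word hashes, which is a stand-in for the similarity computation rather than the method used in this work.

```python
import random

# Hypothetical constants: the text specifies only that p and q are large
# random integers and k is a large prime. m is not defined there, so the
# value of M below is purely an illustrative assumption.
_rng = random.Random(42)
P = _rng.getrandbits(64) | 1
Q = _rng.getrandbits(64)
M = 31
K = 2**61 - 1  # a Mersenne prime


def word_to_int(word):
    """Encode a word as an integer from its UTF-8 bytes (an assumption)."""
    return int.from_bytes(word.encode("utf-8"), "big")


def word_hash(word):
    """Compute H(x) = ((p*x + q) * m) xor k for one word."""
    x = word_to_int(word)
    return ((P * x + Q) * M) ^ K


def hash_similarity(phrase_a, phrase_b):
    """Jaccard overlap of the word-hash sets of two phrases (an assumption)."""
    a = {word_hash(w) for w in phrase_a.lower().split()}
    b = {word_hash(w) for w in phrase_b.lower().split()}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(hash_similarity("protein binding site", "protein binding domain"))
```

Comparing integer hashes instead of raw strings keeps the phrase comparison cheap when the candidate set is large.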

Hybrid Cluster based KNN
Input: Let k be the number of nearest neighbor documents and D_t be the input training document set.

Procedure:
1) Read the value of k and the input training set D_t.
2) Apply the k-means document clustering algorithm using the optimal weighted term distance.
3) Compute the weighted term distance from each contextual key phrase of the hybrid GloVe method to the test documents.
4) For each test document, compute the classification score with respect to the training data.
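The procedure can be approximated by the following minimal sketch, which clusters the training vectors with k-means under a weighted Euclidean distance and then runs a k-nearest-neighbor vote inside the cluster closest to the test document. The function names, the deterministic seeding of centroids, the per-term `weights` argument, and the vote-fraction score are all assumptions; in particular, the score is a stand-in and not the classification score formula of the proposed model.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two term vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def kmeans(points, n_clusters, weights, iterations=20):
    """Plain k-means; the first n_clusters points seed the centroids."""
    centroids = [list(p) for p in points[:n_clusters]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(n_clusters),
                key=lambda c: weighted_distance(p, centroids[c], weights))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def cluster_knn_classify(train_x, train_y, test_x, k, n_clusters, weights):
    """Classify test_x by a k-NN vote inside its nearest k-means cluster.

    Returns (label, score), where score is the fraction of the considered
    neighbors that voted for the label (a stand-in scoring rule).
    """
    centroids, assignment = kmeans(train_x, n_clusters, weights)
    nearest = min(range(n_clusters),
                  key=lambda c: weighted_distance(test_x, centroids[c], weights))
    candidates = [(x, y) for x, y, a in zip(train_x, train_y, assignment)
                  if a == nearest]
    if not candidates:
        # Fall back to all training documents if the cluster is empty.
        candidates = list(zip(train_x, train_y))
    candidates.sort(key=lambda xy: weighted_distance(test_x, xy[0], weights))
    votes = Counter(y for _, y in candidates[:k])
    label, count = votes.most_common(1)[0]
    return label, count / min(k, len(candidates))
```

Restricting the neighbor search to one cluster reduces the number of distance computations per test document compared with a search over the full training set.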

Hybrid Bayesian Estimation model
In the proposed probabilistic model, Naïve Bayes is used to predict the main contextual word of the hybrid GloVe method in the given cluster C. In most text mining models, the main contextual words are independent of each document cluster. Also, as the feature space grows in size, the prior estimates are computed for this model.

One of the simplest machine learning algorithms is the nearest neighbor algorithm. The purpose of the training process is to store the feature vectors and class labels of the training documents. In the training phase, documents are converted into vector representations suitable for text classification. The vector space model is the most frequently used document representation. Each document in this model is represented by a vector in which each entry gives the weight of one word in the document. One weighting approach is tf-idf (term frequency-inverse document frequency), which defines the weight w_ij.
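As a minimal sketch of the tf-idf weighting mentioned above, the snippet below computes a weight for every word in every document. The function name `tfidf_vectors`, the use of raw counts for term frequency, and the unsmoothed log(N / df) form of the inverse document frequency are assumptions; other tf-idf variants are common and the exact variant used in this work is not stated.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with weight = tf * log(N / df).

    documents is a list of token lists; tf is the raw count of a term in a
    document and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    vocabulary = sorted({term for doc in documents for term in doc})
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


if __name__ == "__main__":
    docs = [["gene", "protein", "gene"], ["protein", "binding"]]
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    print(vecs)
```

Words that occur in every document receive a weight of zero under this variant, so they do not influence distances between document vectors.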

IV. EXPERIMENTAL RESULTS
The experiments are conducted on various datasets for main sentence extraction. Every dataset is pre-processed by removing stop words and applying word stemming. In this article, we use the GloVe vectors trained on Wikipedia 2014 and Gigaword with 5 billion vocabulary tokens. The developers of GloVe provide the term embedding vectors at http://nlp.stanford.edu/projects/glove/. We used a window size of 15 and a minimum size of 10 to learn the GloVe vectors. The similarity between objects is computed as a dot product of the input vectors. Co-occurrence is a strong foundation that encompasses many forms of element similarity. For word similarity, we used the stringsim and contextual similarity measures to demonstrate our model's capacity on a well-known dataset for the evaluation of English word similarity. For the evaluation of the proposed model, experiments are run on text document collections such as real-time biomedical databases and the ChEBI, biocause, and PHAEDRA corpora. The cluster score and main contextual feature obtained in iteration 1 for ChEBI are given in Appendix-I.

Table I compares the proposed clustering-based document classification model with the conventional classification algorithms using the true positive rate. In these results, the proposed hybrid cluster based naïve Bayesian model has a better true positive rate than the conventional models on different text document datasets.

Table II compares the proposed clustering-based document classification model with the conventional classification algorithms using the accuracy measure. In these results, the proposed hybrid cluster based naïve Bayesian model has a better average accuracy rate than the conventional models on different text document datasets. The proposed hybrid cluster based KNN approach also has a better average accuracy rate than the conventional models on different text document datasets.

Fig. 4 compares the proposed clustering-based document classification model with the conventional classification algorithms using the error rate measure. In these results, the proposed hybrid cluster based naïve Bayesian model has a lower average error rate than the conventional models on different text document datasets.

Table III compares the proposed clustering-based document classification model with the conventional classification algorithms using the error rate measure. In these results, the proposed hybrid cluster based KNN approach has a lower average error rate than the conventional models on different text document datasets.
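The three measures reported above can be computed as in the following minimal sketch. The function name `binary_metrics` and the one-class-versus-rest treatment of the true positive rate are assumptions; the example labels are made up for illustration and are not results from the experiments.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return true positive rate, accuracy, and error rate for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    total = len(pairs)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / total if total else 0.0
    return {"tpr": tpr, "accuracy": accuracy, "error_rate": 1.0 - accuracy}


if __name__ == "__main__":
    # Made-up labels, used only to show the call.
    truth = ["a", "a", "b", "b"]
    predicted = ["a", "b", "b", "b"]
    print(binary_metrics(truth, predicted, positive="a"))
```

Because the error rate is defined here as one minus accuracy, a model that improves accuracy necessarily lowers the error rate by the same amount.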

V. CONCLUSION
Document classification is one of the major problems in large, high-dimensional feature spaces. As the number of contextual features in the document sets increases, it becomes difficult to classify the documents using the traditional GloVe, TF-IDF, and word2vec methods. In this paper, an advanced document clustering-based classification model is implemented on large document sets with high inter- and intra-document feature variation. In this work, a hybrid document clustering similarity index is optimized to find the essential key document clusters based on the contextual keywords. Experimental results show that the clustering-based document classification models achieve better statistical performance than the conventional approaches on large document sets.