Document Classification Method based on Graphs and Concepts of Non-rigid 3D Models Approach

Text document classification is an important research topic in the field of information retrieval, and so is the way in which the information extracted from the documents to be classified is represented. There exist document classification methods and techniques based on the vector space model, which does not capture the relations between words, relations that are considered important for a better comparison and, therefore, classification. For this reason, two significant contributions are made. The first is the way the feature vector for document comparison is created, which uses adapted concepts from non-rigid 3D model comparison and graphs as the data structure to represent such documents. The second contribution is the classification method itself, which uses the representative feature vectors of each category to classify new documents.

Keywords—Document classification; graphs; non-rigid 3D models; Universidad Nacional de San Agustín de Arequipa (UNSA)


I. INTRODUCTION
Nowadays, with the increasing use of technology, a great amount of textual information is generated, along with the need for innovative methods and techniques for its analysis, comparison, and classification. The latter is defined as the assignment of a category to an unclassified document by finding similarities between it and the documents of the different known categories.
There is a wide variety of document classification algorithms, many of them based on similarity comparison techniques [1]. Some are based on the vector space model [2], which treats words independently and does not capture the semantic relations between documents; others do consider these relations important and create graphs from the relations between words inside a document [3], [4], [5], [6].
The efficiency of these methods depends mainly on the representation of the documents to be classified, so in this paper we decided to follow the path of [5] in the use of graphs as the structure to represent said documents.
Graphs are data structures used to represent complex non-structured information about entities and the interactions between them. Documents can also be represented as graphs by taking into account word frequencies and the relationships between words. This graph can then be used to apply techniques similar to those used on three-dimensional meshes for classification.
Other areas of computer science can also provide ideas and concepts applicable to the information retrieval field. For example, the approach on which this method is based uses notions adapted from computer graphics to perform a better document similarity comparison, relating the definition of isomorphism to document semantic similarity. This method takes into account both the individuality of words and the relations between them, which is useful for document classification: documents belonging to one category have a very high similarity to one another because, when talking about the same topic, a large number of words appear in many documents of that category, as do consecutive word pairs. This is detailed in Section IV.
In this paper we propose two significant contributions: the first is the modification of the work of [5] to obtain feature vectors, and the second is the classification method itself, which is based on obtaining representative feature vectors per category.
The general objective of this work is to develop a new method of document classification based on non-rigid 3D model analysis concepts from geometry processing. The steps to follow are:
• Select documents to create the training and testing sets.
• Adapt the document comparison approach proposed by [5] to obtain a feature vector representing each category.
• Analyze the new document to obtain its feature vector.
• Apply the proposed classification method to the feature vector of the new document using the feature vectors of all the categories.
• Identify the category the new document belongs to.
• Experiment with the testing set.
The rest of the paper is organized as follows. Section II presents previous concepts. Section III provides an overview of the state of the art. Section IV describes the methodology. Section V evaluates experimental results, and we present conclusions in Section VI.

II. PREVIOUS CONCEPTS
For a better understanding of the problem and the proposed solution, we define the following concepts.
• Keypoint: In 3D models, a keypoint is a point which is distinctive in its locality and is present in all different instances of the object [7].
• Keyword: In documents, a keyword is a term which is distinctive inside a neighborhood, such that its frequency and the degree to which it is related to its neighbor words are high [5].
• K-rings and neighborhood: In 3D models, a k-ring R_k(v) of depth level k centered on the vertex v is defined by:

R_k(v) = {v' : |C(v', v)| = k}

where C(v', v) is the shortest path from vertex v' to v and |C(v', v)| is the size of the path C(v', v). It is important to mention that the size of an edge is always 1 [8]. We then adapt the concept of k-ring so that in documents it is called neighborhood.
• Document graph: According to the work of [5], a document graph G(N, A, W) is a representation in which the vertices N are the terms of a document, the outgoing edges A of each node represent the existing relations between them, and W are the weights of the edges, which indicate the importance of a relation. Fig. 1 shows an example of a document graph.
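Since every edge has size 1, the k-ring of depth k can be computed with a breadth-first search over the graph. The sketch below is ours, not part of [8]; the adjacency structure and node names are illustrative:

```python
from collections import deque

def k_ring(adj, v, k):
    """Return the k-ring R_k(v): all vertices whose shortest-path
    distance to v is exactly k (every edge has length 1)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {u for u, d in dist.items() if d == k}

# Toy undirected graph: a path a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

The neighborhood used later for documents follows the same idea, but with weighted distances and a radius instead of an exact depth.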

III. STATE-OF-THE-ART
In the past years, the document classification task has been widely studied, including machine learning approaches such as Bayesian classifiers, decision trees, K-nearest neighbors (KNN), Support Vector Machines (SVMs), and neural networks [9], [10], [11], [12], [13], [14], among others. This paper focuses on supervised classification, which requires a learning or training process by the classifier. The main idea of supervised classification techniques is to build a pattern from each class or category and then find the similarity between this pattern and the new document to be classified.
To perform a better text document classification, documents can be represented in many ways; this is done to reduce their complexity and make them easier to handle. The most commonly used representation is the vector space model [2], in which documents are represented as vectors of words. This model does not capture the relationships among words, nor the semantic relations between them; for this reason, there exist term-weighting methods that build a matrix, as shown in Fig. 2 [15]. A big problem of this representation is that, because each entry represents a word of the document and not all words appear in every document to be classified, it becomes highly dimensional, resulting in a very large sparse matrix [15].
Likewise, documents can be represented using structures like graphs, which have been shown to better capture the relations among words or terms through the edges between their vertices. There are several related works that use this representation [5], [1], [3], [16], [17], [18], [4].

A. Subgraphs and Term Graphs
In the work of [17], the authors state that a document D_i is represented as a vector of terms D_i = <w_1i, ..., w_|T|i>, where T is the ordered set of terms that appear at least once in a document inside a collection of documents. Each weight w_ij represents how much a term t_j contributes to the semantics of the document. The weight of each term inside a collection of documents is found by building a term graph. The relations between terms are captured using the frequent itemset mining method.
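As a simplified, hypothetical illustration of building the vectors D_i = <w_1i, ..., w_|T|i> over an ordered term set T, the sketch below uses TF-IDF as one common choice of the weight w_ij; the actual work cited derives weights from a term graph and frequent itemset mining, so the function and toy corpus here are our own examples:

```python
import math

def tfidf_vectors(docs):
    """Build one vector per document over the ordered term set T,
    weighting each term by term frequency times inverse document
    frequency (TF-IDF)."""
    terms = sorted({t for d in docs for t in d})  # ordered set T
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in terms}  # document frequency
    vectors = []
    for d in docs:
        vectors.append([d.count(t) * math.log(n / df[t]) for t in terms])
    return terms, vectors

# Toy corpus: three tokenized documents
docs = [["graph", "node", "edge"], ["graph", "vector"], ["vector", "space"]]
terms, vecs = tfidf_vectors(docs)
```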
In the work of [3], the authors also use a graph-based approach to classify documents. Their algorithm W-gSpan (a weighted subgraph mining algorithm) is applied to identify the subgraphs with frequent weights in the documents. These subgraphs are then used to generate a set of binary feature vectors (one per document), which serve as input to the TFPC classifier (a classification association rule mining algorithm), the Naive Bayes classifier, and the C4.5 decision tree classifier, achieving more than 84% classification precision. Two further classification methods are described as follows.
The first classification method consists of treating each term of a graph as a web page in order to compute a PageRank score. PageRank is based on the idea that if a web page is pointed to by several other web pages, or by pages with a high score, then its ranking will be high. A vector of rankings representing the document is then created, and the document is assigned to the category whose ranking correlation coefficients (found with the Spearman algorithm) are highest with respect to this test document. Used with an SVM, this vector obtains an average precision of 92%.
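A minimal power-iteration sketch of the PageRank idea described above follows; the damping factor d = 0.85 and the toy term graph are illustrative assumptions, not taken from the cited work:

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank over a directed term graph.
    adj maps each node to the list of nodes it points to."""
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = adj.get(u, [])
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# "hub" is pointed to by both other terms, so it should rank highest
adj = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
r = pagerank(adj)
```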
The second method is based on the term distance matrix and the distance-weight similarity function. Given a set of distance matrices {T_1, T_2, ..., T_n} representing the categories {C_1, C_2, ..., C_n} and a test document D, the document will be classified into the category C_i if and only if the distance-weight similarity of C_i and D is the largest among all the categories. This method obtained an average precision of more than 60%.
Also, there exist other methods that combine these subgraph and term graph approaches to perform the classification task [19], [20], [21].

B. Graphs and Graph-Kernels
In the work of [18], the authors consider the text classification task as a graph classification problem, modeling text documents as graphs-of-words: graphs in which the vertices represent the unique terms of the document and the edges represent co-occurrences between terms inside a fixed-size window. An example of this graph is shown in Fig. 3.
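A graph-of-words as described can be sketched as follows; the window size and tokens are illustrative, and undirected, count-weighted edges are one common variant:

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Build an undirected graph-of-words: vertices are unique terms,
    and an edge links two terms that co-occur inside a sliding window
    of `window` consecutive tokens; edge weights count co-occurrences."""
    weights = defaultdict(int)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edge = tuple(sorted((tokens[i], tokens[j])))
                weights[edge] += 1
    return dict(weights)

g = graph_of_words(["a", "b", "c", "a"], window=3)
```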
They then used linear SVMs to perform the classification, because the objective was discovering and exploiting new features. To perform the feature extraction, they used gSpan (graph-based Substructure pattern mining) to obtain frequent subgraphs; the minimum quantity of these depends on a parameter known as support, whose optimal value can be learned through cross-validation to maximize the prediction precision of the classifier, turning the whole process into a supervised one. When reducing the graphs, it is necessary to keep the densest parts, for which they extract the main cores. This method obtained results of up to more than 90% precision.
The authors of [4] present a similarity measure based on the definition of a graph kernel between pairs of documents, using the terms contained in the documents and the relations among them, representing them as graphs-of-words. Specifically, they capitalize on the kernel and modify it to compare the graph representations of a pair of documents.
The method takes as input a pair of documents and automatically computes how similar they are to one another based only on their content. It was tested on text categorization, for which an SVM classifier was used, taking as input the kernel matrix of the training set and showing results of up to 77% precision on one database and more than 91% on the other three.

IV. METHODOLOGY
Given the wide variety of techniques that have been developed to solve the document classification problem, this paper adopts the innovative approach of Lorena et al. [5] for document similarity comparison using graphs and concepts from 3D models, and applies it to document classification.
When obtaining feature vectors from the document graphs, what we look for is to better capture the relations among words inside a document, thereby extracting a semantic representation of it, which can then be used with the classification method.
A general diagram of the document classification process for a new unclassified document is shown in Fig. 4.
As mentioned previously, this paper modifies the approach of a previous work. We now explain its general functioning, as well as the modification made to obtain the feature vectors and the classification method. The steps performed are enumerated according to Fig. 4.

a) Preprocessing: For the preprocessing phase, first we perform the cleaning step, which consists of the elimination of stop words. Then the Porter algorithm is applied for the stemming step, which preserves only the roots of the words to avoid tense, gender, and number variations. Because there will be repeated roots, we proceed to the ID assignment step, which assigns a numeric ID to each root, to later be inserted into the list L.

b) Graph construction: After the preprocessing step, we proceed to build the graph G(N, A, W), where N are the nodes of the graph, which represent the elements of the list L; A indicates the edges, which are the existing relations between the elements of the list L; and W are the weights of the edges. The outgoing edges of the nodes represent the degree to which they are related to their neighbors, as shown in Fig. 1.
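The preprocessing and graph construction steps can be sketched as follows. This is a simplified illustration: the stop-word list is a tiny stand-in, strip_suffix is a naive substitute for the Porter stemmer, and weighting edges by consecutive co-occurrence is one plausible reading of the word relations described in Section I:

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list

def strip_suffix(word):
    """Naive suffix stripper standing in for the Porter stemmer."""
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    """Cleaning, stemming, and ID assignment: returns the list L of
    numeric IDs plus the root-to-ID table."""
    tokens = re.findall(r"[a-z]+", text.lower())
    roots = [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
    ids, L = {}, []
    for r in roots:
        if r not in ids:
            ids[r] = len(ids)
        L.append(ids[r])
    return L, ids

def build_graph(L):
    """Document graph G(N, A, W): nodes are the IDs in L, and each edge
    (a, b) is weighted by how often the two IDs appear consecutively."""
    W = defaultdict(int)
    for a, b in zip(L, L[1:]):
        if a != b:
            W[(a, b)] += 1
    return W

L, ids = preprocess("the graphs of the graph models model graphs")
G = build_graph(L)
```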

c) Comparison:
Following the approach of [5], to perform the comparison between two document graphs G_1 and G_2, first we obtain a list of keywords (L_kw) for each graph, formed by the µ nodes with the greatest weights. Then we find the intersection of both lists, which contains the common keywords between both graphs, as shown in Equation 3:

KW(G_1, G_2) = max_µ(G_1) ∩ max_µ(G_2)     (3)

where max_µ represents the µ highest-weighted nodes, G_1 and G_2 are the graphs that represent two different documents, and KW(G_1, G_2) is the set of common keywords between G_1 and G_2. Given that w is the number of times that a relation between two words (a, b) appears in the text, the distance between the nodes that represent these words is found by applying Equation 4:

d(a, b) = 1 / w     (4)
Then we use Equation 5 to find the neighborhood:

V(G_1, G_2) = ∪_j F_ρ(L_kwj)     (5)

where F_ρ(L_kwj) = {n ∈ G_1, G_2 : D(n, L_kwj) ≤ ρ}, D denotes the shortest distance between the node n and the keyword L_kwj, found by applying the Dijkstra algorithm, and n ranges over all the nodes whose distance D is shorter than a radius ρ.
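Under the reading above (keywords are the µ heaviest nodes; edge lengths are derived from the relation count w, here assumed to be 1/w so that frequent relations bring words closer), the keyword selection and neighborhood computation can be sketched as:

```python
import heapq

def top_keywords(node_weights, mu):
    """L_kw: the mu nodes with the greatest weights."""
    return set(sorted(node_weights, key=node_weights.get, reverse=True)[:mu])

def neighborhood(adj, source, rho):
    """F_rho: all nodes whose shortest distance to `source` is <= rho,
    found with Dijkstra; each edge length is taken as 1/w (assumption),
    so frequent relations (large w) make words closer."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    reach = set()
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if d <= rho:
            reach.add(u)
        for v, w in adj.get(u, {}).items():
            nd = d + 1.0 / w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return reach

# Toy weighted graph: x - y (w=2), y - z (w=1)
adj = {"x": {"y": 2}, "y": {"x": 2, "z": 1}, "z": {"y": 1}}
```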
d) Representative vectors: Subsequently, instead of obtaining a comparison coefficient for each pair of documents as the authors do in [5], we perform the comparison between them following Equation 6, obtaining the comparison vectors B, which are the union of the keywords in common plus their neighborhood, thereby keeping more information than just a coefficient:

B = KW(G_1, G_2) ∪ V(G_1, G_2)     (6)

This is performed for every document inside each category. To obtain the representative vectors Γ_1, Γ_2, ..., Γ_n, where n is the number of categories, we initially considered applying the intersection of the vectors B to obtain the common IDs of all vectors; however, because of the low probability of a word being considered a keyword and also appearing in every document inside a category, this idea was dismissed. Moreover, in the experimentation step the result of the intersection came to be empty, or the size of the resulting vector was too small.
Instead, we obtain the occurrence frequencies δ of each word of the dictionary across all vectors B. This frequency vector is then ordered decreasingly to obtain the words with the highest frequencies according to a threshold φ, which is passed as a parameter. Finally, the resulting vectors Γ are obtained using Equation 7; each vector represents a category and contains the IDs of its most representative words:

Γ_i = {id : δ(id) ≥ φ},  i = 1, ..., n     (7)

where n is the number of obtained feature vectors.
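A sketch of building one representative vector Γ from a category's comparison vectors B, assuming δ counts in how many vectors each ID occurs and φ is an absolute frequency threshold:

```python
from collections import Counter

def representative_vector(B_vectors, phi):
    """Gamma for one category: compute the occurrence frequency delta of
    every ID across the category's comparison vectors B, then keep the
    IDs whose frequency reaches the threshold phi, ordered by
    decreasing delta (ties broken by ID)."""
    delta = Counter(id_ for B in B_vectors for id_ in B)
    ranked = sorted(delta.items(), key=lambda kv: (-kv[1], kv[0]))
    return [id_ for id_, freq in ranked if freq >= phi]

# Three toy comparison vectors B (sets of word IDs) for one category
B_vectors = [{1, 2, 3}, {2, 3}, {3, 4}]
gamma = representative_vector(B_vectors, phi=2)
```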

e) Feature vector of a new document: In order to obtain the feature vector Z of a new document, first we perform the preprocessing and graph construction steps. Then, each obtained ID is placed as a position of the vector, as shown in Equation 8, before performing the classification method:

Z = (id_1, id_2, ..., id_t)     (8)

where t is the total number of IDs obtained from the document.

f) Classification method: Once the vector Z of the new document and the representative vectors Γ of each category are obtained, we find the intersection of this vector with all the vectors Γ to get the belonging grade X with respect to each category. To obtain X, two methods are proposed:

a) Method 1: X is the number of elements of the intersection between Z and Γ.

b) Method 2: X is the sum of the frequencies of the words in Γ that are in the intersection with Z.

Finally, the category to which the new document will belong is the one with which it obtained the highest belonging grade X.
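The two proposed methods can be sketched as follows; the category names, word IDs, and per-category frequency tables δ are illustrative:

```python
def method_1(Z, gammas):
    """Method 1: X is the number of elements of the intersection of Z
    with each representative vector Gamma."""
    return {cat: len(set(Z) & set(g)) for cat, g in gammas.items()}

def method_2(Z, gammas, delta):
    """Method 2: X is the sum of the category frequencies delta of the
    words of Gamma that also appear in Z."""
    return {cat: sum(delta[cat][i] for i in set(Z) & set(g))
            for cat, g in gammas.items()}

def classify(X):
    """Assign the category with the highest belonging grade X."""
    return max(X, key=X.get)

# Illustrative data: two categories, word IDs, and frequency tables
gammas = {"books": [1, 2, 3], "baby": [3, 4, 5]}
delta = {"books": {1: 9, 2: 5, 3: 2}, "baby": {3: 4, 4: 8, 5: 7}}
Z = [1, 2, 4]  # feature vector of the new document
X1 = method_1(Z, gammas)
X2 = method_2(Z, gammas, delta)
```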

V. EXPERIMENTS AND RESULTS
For the experimentation phase, we used the Amazon database [22], from which we randomly chose 4 categories and 8000 documents, 2000 per category. The set of documents was then divided into 1600 training documents and 400 testing documents. After this, the next steps were performed. First, we obtain the vectors B from the training set. Then, by analyzing the obtained results, we can assign the value of the threshold φ, which controls how many IDs will be extracted for the classification method. The results of the category vectors B showed values of δ greater than 3000 and 10000, so these became the values assigned to the threshold φ to then obtain the representative vectors Γ per category.

In Tables I and II, the values of the diagonals represent the percentage of correctly classified documents, and the remaining values the error percentage, that is to say, documents assigned to an incorrect category. For this we used different input parameters: ρ = 2 and ρ = 3, number of keywords kw = 15 and kw = 10, and grade k = 2. We can observe in Fig. 5 that the results achieved using Method 1 with different input parameters φ, ρ, and k tend to have less variation between them in most categories compared with the results obtained with Method 2. This behavior persists if we vary the number of categories, as shown in Fig. 6.

Also, in Fig. 5 we can see that the results of Method 2 were higher in some experiments than those of Method 1. These results vary if we change the input parameters; for example, the results for the category baby range from 67% up to 89.25%, as shown in Table I.

VI. CONCLUSIONS
In this paper, we presented a text document classification method based on a similarity comparison approach, which adapts concepts taken from the analysis of non-rigid three-dimensional models and uses graphs as the structure to represent the documents. Furthermore, when obtaining the comparison vectors B per category, their sizes can differ, as can the size of the representative vector Γ, because, unlike in the vector space model, in this method it is not necessary to pad these vectors to unify their sizes according to the dictionary of words.
It is worth noting that the words obtained in these representative vectors Γ are those that keep the most information about the category.