Reducing Dimensionality in Text Mining using Conjugate Gradients and Hybrid Cholesky Decomposition

Data mining in large datasets faces several limitations in identifying the records relevant to a given query. These limitations include: lack of interaction in the required objective space; inability to handle datasets with discrete variables, especially in the presence of missing values; inability to classify the records for a given query; and poor generation of explicit knowledge for a query, which increases the dimensionality of the data. Hence, this paper aims at resolving the problem of increasing data dimensionality in datasets using a modified non-negative matrix factorization (NMF). Further, the increased dimensionality arising from the non-orthogonality of NMF is resolved with Cholesky decomposition (cdNMF). Initially, the datasets are structured into a well-defined geometric form. The complex conjugate values are then extracted, and the conjugate gradient algorithm is applied to reduce the sparse matrix from the data vector. The cdNMF extracts the feature vector from the dataset, and the data vector is linearly mapped by the upper triangular matrix obtained from the Cholesky decomposition. The experiment is validated against accuracy and normalized mutual information (NMI) metrics over three text databases of varied patterns. The results show that the proposed technique scales better to larger instances in finding the documents relevant to a query than NMF, neighborhood preserving non-negative matrix factorization (NPNMF), multiple manifolds non-negative matrix factorization (MMNMF), robust non-negative matrix factorization (RNMF), graph regularized non-negative matrix factorization (GNMF) and hierarchical non-negative matrix factorization (HNMF).

Keywords—Data mining; non-negative matrix factorization; Cholesky decomposition; conjugate gradient algorithm


INTRODUCTION
Computing applications in several fields generate numerous data over many instances. To extract knowledge from such instances, data mining tools are conventionally used. However, large datasets with numerous instances pose severe challenges that lead to improper processing of such huge data volumes. Either reducing the dataset or improving the mining algorithm can overcome these challenges [1]. Removing improper values from the datasets has the greater impact and increases the performance of processing large data [2]; hence, improving the mining approach alone is not useful in some cases [3].
Data reduction is the process of reducing the size or dimensionality of the data while retaining its representation. Instance selection is one effective way to reduce the data by reducing the total number of instances. In spite of many efforts to deal with such instances, data mining algorithms still undergo severe challenges due to the non-applicability of datasets with large instances. Hence, the computational complexity of the system increases with larger instances [3], [4], leading to problems in scaling, increased storage requirements and reduced clustering accuracy. Other problems associated with larger data instances include improper association or interaction in the feature space, inability to handle large datasets with discrete variables, inability to classify the data, poor knowledge generation for a given query, and poor computation due to missing variables.
Recently, there have been significant developments in NMF for the various clustering problems in data mining outlined above. NMF factorizes the input matrix into two matrices of non-negative variables of lower rank [5]-[8]. Applications of NMF include chemometrics, environmetrics, pattern recognition, text mining and summarization [9], multimedia data analysis [10], analysis of DNA gene expression [11], financial data analysis [12], and social network analysis [13]. Several algorithms have been designed to overcome the problems associated with objective functions [14], classification [15], collaborative filtering [16] and computational methodologies.
Though NMF is widely used for data analysis, recent work has improved its discoverability and learning ability in data mining to solve the problems associated with larger datasets. To avoid the limitations associated with large datasets, the following considerations are made in the present study: the proposed method uses NMF to learn the feature vector of a text document, and Cholesky decomposition is used to avoid the non-orthogonality problem in NMF. Further, to avoid poor decomposition with Cholesky decomposition, the conjugate gradient method is used, which avoids the rapid multiplication by the gradients in the feature space.
Since the NMF algorithm learns both the data and feature vectors in the dataset feature space, the proposed method makes the following contributions:
• First, the metric using the data matrix is estimated in the feature space using the trained feature vectors.
• Second, Cholesky decomposition is applied over the metric to identify the upper triangular matrix.
• Third, the upper triangular matrix is used as a linear mapper for the associated data vectors.
• Finally, the conjugate gradient method is applied to reduce the sparse matrix through reduced multiplication, which avoids the NP-hard problem and significantly reduces the computational complexity [17], [18].
The outline of the paper is as follows: Section 2 discusses the related works. Section 3 provides the NMF model for clustering larger datasets. Section 4 provides the modifications to NMF using Cholesky decomposition. Section 5 provides experimental verification of the proposed system over the associated datasets, and Section 6 concludes the paper with future work.

III. IMPROVED NMF METHOD
NMF is a non-negative low-rank approximation method with constraints relating to the non-negative elements in the data and feature vectors. Here, a non-orthogonality problem exists due to the presence of non-negative elements between the vectors, and the addition of linear combinations results in parts-based representations. This interpretable and intuitive method for representing the text data elements is divided into two parts: 1) data vector representation using Cholesky decomposition (CD) with conjugate gradients (CG); and 2) feature vector representation using NMF.
The details of these are given in the following sections.

A. Fitness Function for NMF
In NMF, it is assumed that the matrices contain non-negative elements; hence, the factorization is approximated. Let the input data matrix be X = (x_1, x_2, ..., x_n), which carries n input data vectors. The data matrix is decomposed into two matrices,

X ≈ FG^T

where X ∈ ℝ^{p×n}, G ∈ ℝ^{n×k}, F ∈ ℝ^{p×k}, ℝ denotes the non-negative real numbers, G = (g_1, g_2, ..., g_n) and F = (f_1, f_2, ..., f_k). In general p < n, and the rank of the F and G matrices is less than that of X, i.e. k ≪ min(p, n). F and G are generated using a minimization fitness function, and the sum of squared errors is used to evaluate the fitness function, represented as:

min_{F,G ≥ 0} ‖X − FG^T‖_F^2

The matrix normalization is obtained using the Frobenius norm, and the values of F and G are non-negative with non-orthogonal column vectors in Euclidean space. The non-deficiency cases for the ranks of F and G are generated using the I-divergence fitness function:

min_{F,G ≥ 0} Σ_{i,j} [ X_{ij} log( X_{ij} / (FG^T)_{ij} ) − X_{ij} + (FG^T)_{ij} ]

Here, for I(x) = x log x − x + 1 ≥ 0, the inequality holds for x ≥ 0 and equality holds when x = 1. Hence, the I-divergence under the inequality condition is expressed as above.
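As an illustrative sketch only (not the paper's implementation), the sum-of-squared-errors fitness function above can be minimized with the standard Lee-Seung multiplicative updates; the matrix sizes, iteration count and random initialization below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 6, 8, 3
X = rng.random((p, n))             # non-negative data matrix (p terms x n documents)

F = rng.random((p, k))             # random non-negative initialization of F and G
G = rng.random((n, k))

eps = 1e-9                         # guard against division by zero
for _ in range(200):
    # Lee-Seung multiplicative updates for min ||X - F G^T||_F^2
    G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    F *= (X @ G) / (F @ (G.T @ G) + eps)

err = np.linalg.norm(X - F @ G.T, "fro")
```

The updates keep F and G non-negative by construction, which is why no explicit projection step is needed.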

B. NMF Clustering
Initialization in NMF is an important process in clustering, similar to k-means clustering. However, the fitness function, as a minimization function, often suffers from the local minimum problem [42], [43]. Due to this constraint, even if the minimization function is convex, the intrinsic alternating function is non-convex. If random initialization is used, the factor matrices are initialized as random matrices, and this becomes ineffective due to slow convergence to the local minima. If a clustering process is used in NMF, the initialization is obtained from fuzzy clustering [44], divergence-k-means [45] or spherical k-means [46], [47]. However, the proposed method considers a simple strategy for document clustering, discussed below: NMF is applied to cluster the documents, and the number of feature vectors in the documents of each dataset is set as the total number of clusters [48], [49]. Each instance is assigned to the cluster whose element in its representation g is maximum:

c* = arg max_c g_c

where g_c is the c-th element of g.
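The cluster-assignment rule above is a one-line operation on the representation matrix; a minimal sketch with a toy G (the values are invented for illustration):

```python
import numpy as np

# toy representation matrix G: one row g per document, k = 3 clusters
G = np.array([
    [0.1, 0.8, 0.05],
    [0.7, 0.1, 0.10],
    [0.2, 0.3, 0.90],
])

# assign each document to the cluster c with the largest element g_c
labels = G.argmax(axis=1)
```

Here the first document goes to cluster 1, the second to cluster 0 and the third to cluster 2.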

C. NMF Representation Learning
Representation learning on G is carried out by many supervised or unsupervised methods using NMF, since it reduces the dimensionality in an effective manner. Certain other techniques use the Euclidean space to conduct learning on G [50]. However, the non-orthogonality problem during the representation learning process has not been dealt with; hence, the proposed study addresses this problem to reduce the dimensionality in large datasets.

IV. CHOLESKY DECOMPOSITION
The main reason for the non-orthogonality problem during representation learning (G) is the formulation of the squared distance between paired instances (g_i, g_j) as (g_i − g_j)^T (g_i − g_j). This squared distance implicitly assumes that g_i lies in Euclidean space. In general, the vectors (f_1, ..., f_q) learned using NMF are non-orthogonal to each other, and the use of the squared Euclidean distance is therefore not appropriate during representation learning by G. To solve this, a generalized squared distance using a Mahalanobis metric M is used to address the non-orthogonality of the feature vectors:

d_M(g_i, g_j) = (g_i − g_j)^T M (g_i − g_j)
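The generalized squared distance can be sketched as follows; the vectors and the feature matrix F are invented for illustration, and with M equal to the identity the formula reduces to the ordinary squared Euclidean distance:

```python
import numpy as np

def mahalanobis_sq(gi, gj, M):
    """Generalized squared distance (g_i - g_j)^T M (g_i - g_j)."""
    d = gi - gj
    return float(d @ M @ d)

gi = np.array([1.0, 2.0])
gj = np.array([0.0, 0.0])
I = np.eye(2)                      # with M = I this is the squared Euclidean distance
F = np.array([[1.0, 0.5],
              [0.0, 1.0]])         # hypothetical non-orthogonal feature matrix
M = F.T @ F                        # metric induced by the non-orthogonal features
```

With the identity metric the distance is 1² + 2² = 5; with the induced metric M the same pair of points is further apart because the features overlap.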


The NMF property is exploited to decompose the data matrix X into: a) F, whose column vectors (f_1, f_2, ..., f_k) span the feature space of the matrices, and b) G, which provides the feature space representation.
With this decomposition property, the cdNMF proceeds as follows: 1) Initially, the NMF metric is estimated in the feature space using the trained feature vectors.
2) Then, Cholesky decomposition is applied over the NMF metric to find the upper triangular matrix.
3) Finally, the upper triangular matrix is used to linearly map the data vectors.
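The three steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: F and G are random stand-ins for the trained NMF factors, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 5, 7, 3
F = rng.random((p, k))        # stand-in for the trained feature vectors from NMF
G = rng.random((n, k))        # stand-in for the learned data representations

# 1) estimate the metric in feature space from the trained feature vectors
M = F.T @ F                   # k x k, symmetric positive semi-definite

# 2) Cholesky decomposition of the metric: M = T^T T with T upper triangular
T = np.linalg.cholesky(M).T   # NumPy returns the lower factor; transpose it

# 3) use T as a linear map for the data vectors, g -> T g
G_mapped = G @ T.T
```

Note that `np.linalg.cholesky` requires M to be positive definite; a rank-deficient F would need a small diagonal shift first.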

A. NMF Metric Estimation
In NMF, the data matrix (X) is approximated; its representation in the feature space is G, and the feature representation in the data space is F. Normalization [8] of f gives f^T f = 1, and the metric M is estimated as the Gram matrix F^T F of the feature vectors.


The metric estimation does not use label information for estimating M; the data vector is approximated over the feature space through u_1, ..., u_q, and it is seen that M = F^T F can be used to estimate the feature space metric.

B. Cholesky Decomposition over NMF Metric
The estimation of the metric in (5) guarantees that M is a symmetric positive semi-definite matrix. Linear algebra then guarantees that M decomposes into an upper triangular matrix T using Cholesky decomposition:

M = T^T T    (7)

By substituting (7) into (5), the Cholesky function representing the upper triangular matrix T is given as:
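A property worth noting here: mapping vectors by the upper triangular factor T turns the generalized (Mahalanobis) distance under M into an ordinary Euclidean distance, since (g_i − g_j)^T M (g_i − g_j) = ‖T(g_i − g_j)‖². A small NumPy check, with a random stand-in metric assumed positive definite:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
A = rng.random((4, k))
M = A.T @ A                         # random symmetric positive definite metric

T = np.linalg.cholesky(M).T         # upper triangular factor with M = T^T T

gi, gj = rng.random(k), rng.random(k)
d = gi - gj
maha = float(d @ M @ d)             # generalized squared distance under M
eucl = float(np.sum((T @ d) ** 2))  # plain squared Euclidean distance after mapping
```

The two quantities agree to machine precision, which is exactly why the upper triangular matrix can serve as a linear mapper for the data vectors.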

C. Conjugate Gradients (CG)
Assuming the upper triangular elements are sparse, a linear representation of the data vectors is not considered valid. CG is used to remove the sparse values in the matrices by solving the associated set of linear equations; it is applied on the upper triangular matrix to remove the sparse values. The proposed method utilizes the representation trained using cdNMF without any modification of the learning algorithms. Eliminating the sparse matrix avoids rapid multiplication, and clustering such data leads to an increased convergence rate with faster association of the elements in the dataset. Here, M = (TG^T TG)^{-1} is the preconditioner used to enhance the multiplication process in the case of incomplete Cholesky decomposition, where M = TG^T TG defines the incomplete Cholesky decomposition.
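Preconditioned CG on a symmetric positive definite system can be sketched with SciPy as follows. This is only an illustrative sketch: the system matrix is a random stand-in, and a simple Jacobi (diagonal) preconditioner is used in place of the paper's incomplete-Cholesky-based preconditioner M = (TG^T TG)^{-1}.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(3)
n = 20
Q = rng.random((n, n))
A = Q @ Q.T + n * np.eye(n)   # symmetric positive definite system matrix
b = rng.random(n)

# Jacobi (diagonal) preconditioner standing in for the
# incomplete-Cholesky-based preconditioner used by cdNMF
diag = np.diag(A)
precond = LinearOperator((n, n), matvec=lambda v: v / diag)

x, info = cg(A, b, M=precond, atol=1e-10)   # info == 0 signals convergence
```

A good preconditioner clusters the eigenvalues of the preconditioned system, which is what gives CG its faster convergence rate.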
Algorithm 1 cdNMF(X, NMF, q, parameters)
1: Input X ∈ ℝ^{p×n}, NMF, q and the NMF parameters
2: F, G := run NMF on X with the parameters and q // metric estimation
3: Apply CD; once the linear coordinates change, x = TGy with det(TG) ≠ 0
6: Use CG to solve TG^T A TG y = TG^T b
7: Set x = TG^{-1} y
8: Set the preconditioner M = (TG^T TG)^{-1}
9: Multiply TG by TG^{-1}
10: Compute x = TG^{-1} y
11: Return M, x, TG

This algorithm reduces the amount of multiplication and increases the convergence rate. The computation of x = TG^{-1} y is carried out only at the end, after multiplying TG by TG^{-1}, and the computation process is multiplied by M.

V. EXPERIMENTAL RESULTS
In the proposed system, cdNMF is used to cluster the documents and is compared with other algorithms to prove its effectiveness. The cdNMF system for evaluating the datasets is compared with conventional algorithms including NMF [51], GNMF [5], NPNMF [6], MMNMF [7] and RNMF [8].

A. Text Mining Datasets
The proposed cdNMF with conjugate gradient is evaluated on three text datasets: 20 Newsgroups data (Table 1), Reuters 21578 data (Table 2) and R52 data (Table 3). Each document is represented as a standard vector model [1] that contains the occurrence of classes and terms in a document. Each document is represented as a single line in the file, using a word or document class delimited from the terms by a TAB character and spaces. A total of 5 classes are used from each dataset, with a set of training documents, test documents and other documents. A cluster is created with 5 classes each of the 20 Newsgroups, Reuters 21578 and R52 data.
Hence, three clusters are used in this study, comprising 4248, 21453 and 5045 documents, respectively, for the 20 Newsgroups, Reuters 21578 and R52 data. The clusters, with sub-clusters as classes, are used to create the samples; a total of 100 documents from each sub-cluster of all the classes forms a sample. In this way, 20 such samples are created from the text datasets.
Each text sample undergoes pre-processing operations that include the trunc5 stemmer [52], POS tagging [53] and stop-word removal; finally, a total of 30000 words with the largest mutual information are selected. The selection of sub-clusters for sample formation is shown in Table 4.

B. Clustering Metrics
The clustering performance is evaluated using two metrics, accuracy (acc) and normalized mutual information (NMI). The parameter acc is used to estimate the overall performance of the cluster and is defined as the fraction acc = n_t / n_ov, where n_t is the number of correctly clustered document samples and n_ov is the overall number of samples. Mutual information (MI) finds the interdependency between the variables; if the text variables are independent, MI is zero. The normalized form is defined as:

NMI(x, y) = MI(x, y) / √(E(x) E(y))

where E(x) and E(y) are the entropies of documents x and y.
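Both metrics can be sketched directly from these definitions. A minimal sketch (not the paper's evaluation code); note that the accuracy helper assumes the predicted cluster labels have already been aligned to the true classes, whereas a full clustering evaluation would first search for the best label permutation:

```python
import numpy as np
from collections import Counter

def accuracy(true, pred):
    """acc = n_t / n_ov, assuming cluster labels are aligned to classes."""
    true, pred = np.asarray(true), np.asarray(pred)
    return float((true == pred).mean())

def entropy(labels):
    """Shannon entropy E(x) of a labeling, in nats."""
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """NMI(x, y) = MI(x, y) / sqrt(E(x) * E(y))."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = sum((c / n) * np.log(c * n / (px[a] * py[b]))
             for (a, b), c in joint.items())
    denom = np.sqrt(entropy(x) * entropy(y))
    return mi / denom if denom > 0 else 0.0
```

Two labelings that induce the same partition give NMI = 1 even when the label names differ, while independent labelings give NMI = 0.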

C. Evaluation and Comparisons
NMF is the baseline algorithm; GNMF uses a KNN graph with a regularization term for preserving the geometric structure; NPNMF uses local linear embedding, and the graph approach in NMF uses a trained regularization term; MMNMF uses an ℓ1 graph for exploring the multiple-manifold data structure; RNMF adds noise to NMF; and HNMF encodes the geometry into the matrix factorization using a hypergraph. These systems are tested for accuracy and normalized mutual information (NMI) over the sample datasets.
The acc results in Table 5 show that cdNMF performs better than the conventional schemes. The performance of cdNMF increases gradually from samples 1 to 20. It is inferred that when more documents come from the same dataset, the accuracy is higher, and it reduces when the 20 sample documents are equally distributed over similar clusters. The overall accuracy of cdNMF is slightly higher than that of HNMF. The average NMI results show that the interdependence of documents belonging to the same cluster during testing is also high. It is further seen that when the documents are equal in number in a sample dataset, the interdependency is lower for the other conventional algorithms, whereas cdNMF still performs well. The average acc and NMI test results for the individual datasets are shown in Tables 7 and 8. It is seen that the proposed cdNMF clusters the documents with better accuracy than the conventional methods. Finally, the comparison with baseline NMF proves that the proposed cdNMF has better acc and NMI rates for the individual datasets.

VI. CONCLUSIONS
In this paper, we presented a new matrix factorization method called Cholesky decomposition based non-negative matrix factorization (cdNMF). The Cholesky decomposition collects the data vectors; specifically, it avoids the non-orthogonality of non-negative matrix factorization due to its local representation. The non-negative constraints are finally handled with an upper triangular matrix representation for mapping the data vectors. Further, the sparse matrix is eliminated using conjugate gradients, which take hold of the complex conjugate values from the data vectors. Finally, better accuracy and normalized mutual information are obtained during the experimental validation, enabling better learning of the text data elements with reduced redundancy.
In future, we would like to improve the proposed approach with a graph-based NMF framework that could generate better patterns to improve the learned representations for text mining.

TABLE I .
ATTRIBUTES OF NEWSGROUPS DATA

TABLE II .
ATTRIBUTES OF REUTERS 21578 DATA

TABLE IV .
SELECTION OF 20 SAMPLES FROM THE DATASETS

MI provides information related to the amount of uncertainty shared between documents x and y, in that one document reduces the uncertainty of the other. No information is shared between the documents when the value of MI is zero, and vice versa. The normalized MI, or NMI, is denoted as NMI(x, y) = MI(x, y) / √(E(x) E(y)).

TABLE VII .
AVERAGE ACCURACY OF PROPOSED METHOD VS. EXISTING METHODS USING THREE DATA SETS