Exploiting Document Level Semantics in Document Clustering

Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods work by extracting textual features such as terms, sequences, and phrases from documents. These features are independent of each other and do not capture the meaning behind the words in the clustering process. In order to perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) to define a similarity measure based on lexical, syntactic and semantic features such that it assigns higher numerical values to document pairs that have a stronger syntactic and semantic relationship. In this paper, we propose a representation of a document that extracts three different types of features from it: lexical (α), syntactic (β) and semantic (γ) features. A meta-descriptor for each document is built from these three features: first lexical, then syntactic and finally semantic. A document-to-document similarity matrix is produced in which each entry contains a three-value vector, one value each for the lexical α, syntactic β and semantic γ similarity. The main contributions of this research are: (i) a document-level descriptor using three different kinds of text features: lexical, syntactic and semantic; (ii) a similarity function using these three features; and (iii) a new candidate-based clustering algorithm that uses the three components of the similarity measure to guide the clustering process in a direction that produces more semantically rich clusters. 
We performed an extensive series of experiments on standard text mining data sets with external clustering evaluation measures such as F-Measure and Purity, and have obtained encouraging results.
Keywords: Document Clustering; Text Mining; Similarity Measure; Semantics


I. INTRODUCTION
Document clustering [9] can be defined as an unsupervised learning approach that clusters a document repository into smaller, meaningful and manageable sub-collections. The resulting sub-collections have high intra-cluster similarity (that is, the documents in a single cluster are similar in some sense) and low inter-cluster similarity (documents in two different sub-collections are largely dissimilar). It has found its niche in the management of large document repositories. Implicitly learning the common features for grouping is the main spirit of document clustering. Traditionally, document clustering algorithms have used simple features present in the documents (words, phrases and sequences of words) to cluster the documents. These simple features are independent of the document context, and thus the semantics of the document cannot be incorporated into the clustering process. In order to perform semantically viable clustering, we believe that the problem of document clustering has two main aspects: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs that have a stronger semantic relationship. In general, the degree of similarity between documents is measured by the words and sentences, or the meaning, of the documents. A similarity measure should also address the problem of partial matching of documents. Several efforts have been made to address partial document matching using lexical features of a document such as keywords, concepts, nouns and verbs.
Documents are modeled in such a way that the similarity methods can compute the partial contribution of these individual units to the similarity value. The overall document similarity is obtained as a function of those partial measures. We have observed that, in using partial document matching for the similarity function, two problems have not been handled in previous work. (i) Word order problem: in human spoken languages, the selection of words is mainly based on the contextual information contained in the document. Similarly, the order in which the words appear in the text influences its meaning. For example, the sentences "A hires B" and "B hires A" are composed of the same words, but the order completely changes their meaning. (ii) Semantic matching: sentences may have the same meaning but use different words. For example, the sentences "Joe is an intelligent boy" and "Joe is a smart lad" have similar meaning, provided the context in which they appear does not change much.
Following the work in [5], this paper proposes a three-layer representation of documents. Unlike [5], which operates on sentences, we apply the three-layer transformation to an entire document. These layers are the lexical, syntactic and semantic layers. Each layer extracts specific features from the same document. The lexical analysis is performed in the first layer, in which we extract the bag-of-words vectors from the documents. Documents are preprocessed by removing stop words and stemming. The syntactic layer uses relations (predicates) extracted from the RDF of the documents to handle the word order problem. The RDF of the documents is generated using an online tool, Alchemy API [13]. The semantic layer employs Semantic Role Annotation (SRA) to handle the semantics problem. The SRA analysis is done using the FRED API [11], which returns the meaning of the actions, the actor who performs the action, and the object/actor on which the action is performed. FRED is a tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences. The method is based on Combinatory Categorial Grammar, Discourse Representation Theory, Linguistic Frames, and Ontology Design Patterns. Results are enriched with Named Entity Resolution (NER) and Word-Sense Disambiguation (WSD). Below is an example based on a simple sentence; the three representations are presented next to each other. D1 = "Pakistani boys love to play cricket and hockey"

We believe that the major confusion in the clustering process arises because lexical features are considered alone. The syntactic and semantic features may guide the decision of whether or not to merge two documents, so that a good clustering arrangement is learned. We have carried out an extensive set of experiments with standard text mining data sets. Our proposed approach clearly surpasses the traditional document clustering methods on evaluation measures such as F-Score and Purity. The paper is organized as follows: Section 2 discusses the related work in the area of text document clustering, specifically in the realm of semantics-based document clustering. Section 3 describes our proposed approach along with some examples. Section 4 presents the experimental setup, data sets, comparative algorithms, and evaluation measures. Section 5 discusses the experimental results. The conclusion is presented in Section 6.

II. THE LITERATURE REVIEW
Data clustering [9] is an unsupervised technique that creates succinct sub-groups from the data for discovering valuable knowledge. Document clustering is a specialized data clustering problem in which the objects are documents. The objective of the clustering process is to group similar documents and separate dissimilar ones. The difficult part of this unsupervised task is to learn how many such groups exist in a given data set. Document clustering aims to discover natural groupings among documents such that documents within a cluster are similar to one another (high intra-cluster similarity) and dissimilar to documents in other clusters (low inter-cluster similarity). Exploring, analyzing, and correctly classifying data of unknown nature without supervision is the major requirement of a document clustering method. Clustering is an effective method for search computing [1]. It offers possibilities such as grouping similar results [3], comprehending the links between the results [8], and creating a succinct representation and display of search results [3,4]. Document clustering has three main steps: (i) a document representation model, (ii) a similarity measure between a pair of documents in the selected representation, and (iii) a clustering algorithm that produces the final clustering arrangement. The task of document clustering is very sensitive to the document representation. Traditionally, document clustering algorithms mainly use features such as words [7], phrases [2], and sequences [6,10] from the documents to perform clustering. These algorithms generally apply simple feature extraction techniques that are mainly based on counting features and their frequency distribution. The approach in [6] proposed a frequent itemset-based representation of documents for clustering (FIHC). Motivated by the idea of market basket analysis, the authors considered a document as a basket and the terms used in the 
document are considered as the items present in it. A representation based on frequent items (frequent phrases) is proposed. The work in [10] proposed two document representations: (i) frequent word sequences (CFWS) and (ii) frequent word meaning sequences (CFWMS). Both approaches first parse a given document to obtain the frequent word sequences of some chosen length (2-word sets in their experiments). The first uses the frequent word sequences directly, and the second uses an external lexical database, WordNet [4], to annotate the words with their meanings in order to cover word-meaning problems such as synonymy, polysemy, and hyponymy/hypernymy. Their experimental studies have shown an improvement in F-Measure for both CFWS and CFWMS over FIHC. Although these approaches use phrases or word order in the representation of documents, their results still fall short on the semantics of the clusters produced. These techniques simply perform clustering independent of the context. A document written in human language has a context, and its words largely depend on it. A more recent approach, document clustering based on dependency graphs (DGDC), was proposed in [14]; each document is parsed to form a dependency graph. This dependency graph captures a semantic representation of the document and thus offers more semantically rich clustering. The work also introduced a novel similarity measure based on the common features of the graphs of two documents. Another recent approach to capturing semantics in the document representation model is introduced in [12], in which the authors proposed a topic-map-based representation, using the online tool Wandora to extract topics from a document. They also reported encouraging results for document clustering based on semantic notions. We conclude that there are features like frequent item sets, common frequent sequences or word meaning sequences, dependency graphs, and topic maps that can be used to reduce the 
dimensionality of the document space and at the same time offer more semantics in the representation than a simple bag-of-words (BOW) model; however, these approaches still fail to incorporate semantics on a larger scale. Phrases and sequences are a good measure for identifying the semantics of the text. We believe that a sentence-specific measure will be more semantically rich, and that extending a sentence-level similarity to a complete document is a challenging aspect of semantics-oriented document clustering. A sentence similarity can easily capture similarity between phrases and sequences, but it should also address the issue of partial information: for instance, when one sentence is split into two or more short texts, or when a phrase spans two or more sentences, it should assign a partial score to the matched phrases or sequences. The score should be directly proportional to the number of such units found in the two sentences. In [5] the authors describe a sentence similarity measure that uses a three-layer sentence meta-descriptor to capture semantics in the similarity measure. We have been motivated by this idea and have extended it to a full document, with the eventual goal of clustering. A document is transformed into three meta-representations based on lexical, syntactic and semantic layers. A similarity measure for each representation is defined based on the features extracted from that layer. Cosine similarity is used for each pair of documents in all three layers; hence, we get three similarity values per pair, one from each of the lexical, syntactic and semantic meta-descriptors. Final document clustering is performed on an N × N matrix (containing the vectors <α, β, γ> from the three layers) by the candidate-based document clustering algorithm. We have conducted an extensive set of experiments with standard text mining data sets. Our proposed approach clearly surpasses the traditional document clustering methods on evaluation measures such as F-Score and Purity.

A. Document Representation
In this paper, we propose a representation of each document based on three levels, namely the lexical, syntactic and semantic levels. Each level produces a separate document-to-document similarity score. We generate a vector of similarity scores based on these three levels, i.e.

$$\mathrm{Vector}(D_a, D_b) = \langle \alpha, \beta, \gamma \rangle$$
where α, β and γ are the lexical, syntactic and semantic similarity scores between document a and document b.

1) Extraction of lexical features: A bag of words is extracted as the features of the document. From all the documents in the set, a dictionary of words (vector) is built. For each document, a vector is built that contains the tokens common to the document and the dictionary. The vector contains values formed by calculating TF*IDF for each token. TF*IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Stop word removal rules out words with little representative value to the document, e.g. articles and pronouns, as well as punctuation. Stemming is a pre-processing step that reduces tokens to their base form. For instance, plural words are made singular, and all verb tenses and persons are replaced by the verb infinitive.
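The lexical step can be illustrated with a minimal sketch. The TF*IDF variant used here (term frequency normalized by document length, IDF as the natural log of N/df) is an assumption, since the exact weighting is not stated, and the function name `tfidf_vectors` is hypothetical. Documents are assumed to be already tokenized, stop-word filtered and stemmed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a shared dictionary and one TF*IDF vector per document.

    docs: list of token lists (assumed already stop-word filtered and stemmed).
    ASSUMPTION: tf = count / document length, idf = ln(N / df).
    """
    vocab = sorted({t for d in docs for t in d})      # dictionary of words
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))     # document frequency per token
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Every vector has the size of the dictionary; absent tokens get 0.
        vectors.append([(tf[t] / len(d)) * idf[t] if d else 0.0 for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([["boy", "love", "cricket"],
                                ["boy", "love", "hockey"]])
```

With this weighting, tokens that occur in every document (here "boy" and "love") receive weight 0, so only the discriminating tokens contribute to the vector.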
2) Extraction of syntactic features: Alchemy API has been used on all the documents to extract the subject, action and object of each sentence in every document. The syntactic analysis represents an order relation among the extracted features of the documents. It describes the syntactic structures of the language and decomposes the text into syntactic units in order to determine the arrangement of syntactic elements. Such relations can be used in applications such as automatic text summarization, text categorization, information retrieval, etc.

Fig. 3. DLS-DC approach
3) Extraction of semantic features: The semantic representation is built using the FRED API. The API gives a semantically rich representation of a document in terms of RDF. FRED is a tool that automatically transforms knowledge extracted from text into RDF and OWL, i.e. it is a machine reader for the Semantic Web. It is event-centric; therefore it natively supports event extraction. The API uses word sense disambiguation. FRED has precision, recall, and accuracy considerably better than other tools attempting event extraction. Semantic annotating features are then extracted using SPARQL queries and saved as triples. For example, in a certain document, the object Event is the annotating feature for its subjects call, do, see and think, which means that all these words point to the object Event in the graph.

B. Similarity Measures
1) Lexical similarity:
The similarity is calculated using the cosine similarity measure, which takes two document vectors and returns a similarity score between the documents. The vector of each document has the same size as the dictionary vector. The similarity is calculated using the formula below:

$$\alpha(D_a, D_b) = \frac{V_a \cdot V_b}{\lVert V_a \rVert \, \lVert V_b \rVert}$$

where V is the TF*IDF vector.
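As a small illustration of the lexical similarity, the cosine measure between two equal-length TF*IDF vectors can be sketched as follows. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # ASSUMPTION: a document with no weighted tokens is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that point in the same direction score 1 regardless of their length, which is why the measure is insensitive to document size.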
2) Syntactic similarity: The word order problem is handled by assigning equal weights to each of the three features given by Alchemy. The predicate is checked before the subject and object. For example, in the sentences "Joe killed Mary" and "Mary killed Joe", the predicate 'killed' is the same. As a result, it is assigned a weight of 1 * 0.33, whereas 'Joe' as the subject of the first sentence does not match 'Mary' as the subject of the second, and the same goes for 'Mary' as the object of the first sentence and 'Joe' as the object of the second. Consequently, they are assigned a weight of 0 * 0.33 each. This gives 0.33 + 0 + 0 = 0.33 as the similarity score between the two sentences.
This is the sentence-to-sentence measure. For document-to-document comparison, the similarity is calculated through the following formula:

$$\beta(D_a, D_b) = \frac{1}{\max(m, n)} \sum_{i=1}^{m} \max_{1 \le j \le n} \mathrm{sim}(S_{ai}, S_{bj})$$

where $S_{ai}$ is the i-th sentence of document a, $S_{b1}, \ldots, S_{bn}$ are all the sentences in document b, and m and n are the numbers of sentences in documents a and b. Every sentence of document a is matched against every sentence in document b, and the maximum similarity scores of all sentences are summed and divided by the number of sentences in whichever document has more sentences.
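The syntactic measure can be sketched as below, assuming each sentence has already been reduced to a (subject, predicate, object) triple. The function names are hypothetical, and treating the three slots as independent exact-match checks is an assumption: the text says only that the predicate is checked first.

```python
def triple_similarity(t1, t2, weight=0.33):
    """Sentence-to-sentence score: each matching slot contributes `weight`.

    t1, t2: (subject, predicate, object) tuples.
    ASSUMPTION: slots are compared independently by exact string match.
    """
    return sum(weight for a, b in zip(t1, t2) if a == b)

def syntactic_similarity(doc_a, doc_b):
    """Document-to-document score from lists of sentence triples.

    For every sentence of doc_a take its best match in doc_b, then divide the
    sum of these maxima by the sentence count of the longer document.
    """
    if not doc_a or not doc_b:
        return 0.0
    best = [max(triple_similarity(s, t) for t in doc_b) for s in doc_a]
    return sum(best) / max(len(doc_a), len(doc_b))

# The word-order example from the text: only the predicate matches.
score = triple_similarity(("Joe", "killed", "Mary"), ("Mary", "killed", "Joe"))
```

Note that with a slot weight of 0.33 two identical triples score 0.99 rather than 1, so the syntactic component never quite reaches 1 under this reading.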
3) Semantic similarity: The semantic representation of a document takes the form Event {Call, Do, See, Think}, where Event is the annotating feature and the words in the braces are its subjects, i.e. all the incoming nodes to the object.
Similarly, there are other objects (annotating features) generated by FRED, such as Activity, For and others, depending on the document. To build the semantic representation of a document, all the annotating features along with their subjects are extracted using SPARQL.
When comparing two documents for similarity, the algorithm first checks whether an object (annotating feature) of one document matches one in the other. If it matches, the similarity is computed on the subjects of both objects using the cosine similarity measure. All the subjects in all annotating features are matched based on the condition above, and the similarities of all the features are summed up. The document-to-document similarity score is calculated by dividing the final similarity sum by the average of the total number of objects in the two documents. The above representation also captures word sense disambiguation. For example, take D3: "I am doing research on Semantics." and D4: "My research is on Semantics." The semantic score for these documents will be 1 using this representation. Below is an algorithm for document-to-document similarity calculation.
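A minimal sketch of the semantic comparison described above is given here. It assumes each document has already been reduced to a mapping from annotating feature to its list of subject words. The function names, the dictionary layout and the sample data are hypothetical, and no FRED or SPARQL call is shown.

```python
import math
from collections import Counter

def bag_cosine(words_a, words_b):
    """Cosine similarity between two bags of subject words."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(c * c for c in ca.values()))
    nb = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_similarity(feat_a, feat_b):
    """feat_a, feat_b: dict mapping annotating feature -> list of subjects.

    Sum the cosine scores of features present in both documents, then divide
    by the average number of annotating features in the two documents.
    """
    if not feat_a or not feat_b:
        return 0.0
    total = sum(bag_cosine(feat_a[f], feat_b[f]) for f in feat_a if f in feat_b)
    return total / ((len(feat_a) + len(feat_b)) / 2)

# Illustrative data only, modeled on the Event {Call, Do, See, Think} example.
doc_x = {"Event": ["call", "do", "see", "think"]}
doc_y = {"Event": ["call", "do", "see", "think"], "Activity": ["research"]}
```

Two documents with identical annotating features and subjects score 1, while an unmatched feature in either document lowers the score through the averaged denominator.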

C. Candidate Based Clustering Algorithm
Here is an algorithm for candidate based document clustering.
The algorithm takes a similarity matrix of size D × D in which each entry is a vector of size three. M_prev and M_curr are the matrices used to keep a record of the updated matrix after the merge in each iteration. DocPair is a pair of documents extracted from the matrix containing the vectors; an example of a DocPair is 103124-102616. C_pair is the final cluster pair chosen after making the candidate-based decisions. C_pair is sent to the MERGE and UPDATE function, which reduces the matrix by one row and one column and updates the values of the matrix using the average linkage strategy. Clusters is a list of clusters that initially contains D clusters; the two clusters chosen by the decision are merged using the ClustersUpdate function, and this is repeated until the final K clusters are obtained.
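The merge-and-update loop can be sketched as follows. This is not the published algorithm: the candidate-based decision rule is not spelled out in the text, so the sketch substitutes a placeholder rule (pick the pair with the highest mean of the three components), which is purely an assumption. The average linkage update of the three-component vectors follows the description above. All names are hypothetical.

```python
from itertools import combinations

def pair_score(vec):
    # ASSUMPTION: placeholder for the candidate-based decision; the mean of
    # (alpha, beta, gamma) is used only to make the sketch runnable.
    return sum(vec) / 3.0

def candidate_based_clustering(matrix, k):
    """matrix[i][j] is an (alpha, beta, gamma) tuple; returns k clusters."""
    n = len(matrix)
    clusters = {i: [i] for i in range(n)}            # initially D singleton clusters
    sims = {(i, j): tuple(matrix[i][j]) for i, j in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > max(k, 1):
        a, b = max(sims, key=lambda p: pair_score(sims[p]))   # chosen cluster pair
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        new_sims = {}
        for c in clusters:
            va = sims[(min(a, c), max(a, c))]
            vb = sims[(min(b, c), max(b, c))]
            # Average linkage: size-weighted mean, applied per component.
            new_sims[(c, next_id)] = tuple(
                (size_a * x + size_b * y) / (size_a + size_b)
                for x, y in zip(va, vb)
            )
        # Drop the row and column of the merged clusters, add the new one.
        sims = {p: v for p, v in sims.items() if a not in p and b not in p}
        sims.update(new_sims)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())
```

Each iteration removes two clusters and adds one, so the stored similarities shrink by one row and one column per merge, as in the MERGE and UPDATE step. A different decision rule can be swapped in by replacing `pair_score`.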

A. Implementation of Algorithm
The proposed approach has been compared with a number of recently proposed document clustering algorithms on the popular standard data sets NEWS20 and Reuters21578. DLS-DC is implemented in the Java programming language. The experiments were executed on a Dell 5547 notebook with an Intel Core i7 processor, 8 GB of RAM and 1 TB of hard disk storage.

B. Datasets
We have used the popular text data sets NEWS20 and Reuters21578 for our experiments.

C. Evaluation
We justify the effectiveness of our proposed method using the following standard cluster quality measures.

1) F-Measure: The F-measure uses a combination of the precision and recall values of clusters. The F-measure, F(i, j), of a class i with respect to a cluster j is defined as

$$F(i, j) = \frac{2 \cdot prec(i, j) \cdot rec(i, j)}{prec(i, j) + rec(i, j)} \qquad (4)$$

The F-measure for the entire clustering result is defined as

$$F = \sum_{i} \frac{n_i}{n} \max_{j} F(i, j)$$

where $n_i$ is the number of documents in class i and n is the total number of documents.

2) Purity: Purity can be defined as the maximal precision value for each cluster j. We compute the purity for a cluster j as:

$$Purity(j) = \max_{i} \, prec(i, j)$$

We then define the purity of the entire clustering result as:

$$Purity = \sum_{j} \frac{n_j}{n} \, Purity(j)$$

where $n_j$ is the number of documents in cluster j.

3) Baseline: The baseline for this experiment is set using the bag-of-words representation for documents. We use a TF*IDF-based representation for document vectors and the cosine measure to create a clustering arrangement for our baseline.
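The two evaluation measures can be computed with a short sketch like the one below. It assumes the usual class-weighted overall F-measure and cluster-weighted overall purity, which is an assumption about the exact forms intended here. The function name `evaluate` and its argument layout are hypothetical, and the input lists are assumed to be non-empty and of equal length.

```python
from collections import Counter

def evaluate(labels, assignments):
    """Return (overall F-measure, overall purity) for a clustering.

    labels[d] is the true class of document d, assignments[d] its cluster.
    """
    n = len(labels)
    class_sizes = Counter(labels)                 # n_i
    cluster_sizes = Counter(assignments)          # n_j
    overlap = Counter(zip(labels, assignments))   # n_ij

    def f(i, j):
        n_ij = overlap[(i, j)]
        if n_ij == 0:
            return 0.0
        prec = n_ij / cluster_sizes[j]
        rec = n_ij / class_sizes[i]
        return 2 * prec * rec / (prec + rec)

    # Each class contributes its best-matching cluster, weighted by class size.
    f_measure = sum(
        class_sizes[i] / n * max(f(i, j) for j in cluster_sizes)
        for i in class_sizes
    )
    # Each cluster contributes the size of its dominant class.
    purity = sum(
        max(overlap[(i, j)] for i in class_sizes) for j in cluster_sizes
    ) / n
    return f_measure, purity

f_score, purity = evaluate(["a", "a", "a", "b"], [0, 0, 1, 1])
```

In this small example one document of class "a" lands in the cluster dominated by "b", which lowers both scores below 1.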

D. Comparative Work
We compare our proposed approach with three recent approaches that claim to produce semantically rich clustering. The first, from [6], is a frequent itemset-based representation of documents for clustering (FIHC); the second is from [10], from which we compare only with frequent word sequences (CFWS); and the third is from [12], where the authors used a topic-map-based representation of documents. We have implemented these approaches as described in [6,10,12].

V. RESULT & DISCUSSION
In this paper, we present a new approach to clustering documents based on semantically rich features using a vector representation. The knowledge inferred from the three representations is used to define the similarity measures between pairs of documents. These measures are stored in a matrix in which each entry is a vector of size three, and the set of documents is clustered using the Candidate Based Clustering Algorithm (CBCA). First, we discuss the F-measure of hierarchical clustering on the individual levels against CBCA. The higher purity values obtained by the candidate-based clustering algorithm indicate high-quality clusters, which is again due to the fact that a combined representation scheme is used for clustering. The experimental results show that DLS-DC performs better than the comparative algorithms of this study in terms of the quality of the clusters produced. The increased cluster purity indicates that the features extracted from the three representations capture the semantics of the documents. The three approaches FIHC [6], CFWS [10] and TMHC [12] produced F-measure values for the data sets (see Table IV). The proposed approach showed improvement in most of the test cases. This is because the multiple representations of the documents in the collection capture the semantics better and are able to produce a high F-Measure, which indicates balanced precision and recall (see Figure 6). Similarly, another evaluation measure that is instrumental in identifying better clustering is purity. The proposed approach produces better purity values than the comparative algorithms (see Table V). The purity obtained with the proposed approach, DLS-DC, indicates that our idea of different representations of the same document (each with a different focus) has produced a better understanding at the representation level. Hence, the automatic clustering process implicitly identifies the common attributes to produce better purity values (see Figure 7). In most cases, DLS-DC performs well, as is evident in the purity and F-Measure results. The dataset classes D1 to D5 were created manually from the complete NEWS20 data set. Due to confusion between documents in dataset classes D2 and the graph shows slightly lower purity F-measure values.

CONCLUSION
We propose an approach that exploits document-level semantics in document clustering. The representation of a document comprises three levels, namely lexical, syntactic and semantic, that are defined for the document. In the lexical representation, we use only lexical features. The syntactic representation comprises syntactic features obtained through transformation using Alchemy API. We also address the word order problem in the syntactic analysis. The semantic representation is defined by the FRED API and RDF-based annotated structures that are extracted from each document using SPARQL queries, and the similarity is calculated using the algorithm defined. Clustering is performed using a candidate-based clustering approach. The proposed approach clearly surpasses recently proposed approaches (FIHC, CFWS and TMHC) in purity and F-measure, which is an indication of better clustering results. We would like to extend this research in a number of ways. First, we would like to introduce document constraints into the clustering approach. Secondly, we would like to introduce methods to increase the weight given to the semantic representation when making decisions in the clustering algorithm. Moreover, we would like to further investigate the combined representation of a document using the three representations, because it appears to be a more challenging aspect of good clustering.

Fig. 7. Purity of different approaches on test datasets

Fig. 1. Three-layer representation of a sample document

Below is an example document from the NEWS20 data set; the meta-descriptor clearly contains the three types of features, one after another. Here is an example from document 103122: D2 = "Most people who go fast wear goggles. So do most of helmetless motorcyclists"

Fig. 2. Three-layer representation of an example document 103122

TABLE I. SAMPLE DATA SETS FROM NEWS20 AND REUTERS

TABLE V. PURITY FROM DIFFERENT APPROACHES ON TEST