GPLDA: A Generalized Poisson Latent Dirichlet Topic Model

The earliest modifications of Latent Dirichlet Allocation (LDA) with respect to word or document attributes relaxed its exchangeability assumption, which enters through the Bag-of-Words (BoW) matrix. Several authors have proposed modifications of the original LDA that assume the current topic depends on the words from the previous topic. Most earlier work ignored the document length distribution on the assumption that its effect vanishes at the modelling stage. In this paper, the Poisson document length distribution of the LDA model is replaced with the Generalized Poisson (GP) distribution, whose main strength is its ability to capture both overdispersed (variance larger than the mean) and underdispersed (variance smaller than the mean) count data. The Poisson distribution used by LDA relies on the assumption that the mean and variance of document lengths are equal. This assumption is often unrealistic for real-life text data, where the variance of document length may be greater or less than the mean. Approximate estimates of the GPLDA model parameters were obtained by applying the Newton-Raphson technique to the log-likelihood. A comparison of GPLDA with LDA in terms of accuracy and F1 showed improved results.

Keywords: Bag-of-words; generalized Poisson distribution; topic model; latent Dirichlet allocation


I. INTRODUCTION
In recent years, topic models, a class of stochastic generative models, have been widely used in computer science, particularly in text mining and information retrieval. Since they were first proposed, they have been applied by many researchers in fields such as text mining [2], computer vision [1], population genetics, and social networks [3].
Topic modelling can be traced to latent semantic indexing (LSI) [4], the basis from which topic models developed. However, LSI is not a probabilistic model, so uncertainty cannot be quantified. In the search for a realistic probabilistic model, probabilistic latent semantic analysis (PLSA) [5] was developed and served as the basis of modern topic models. As an extension of PLSA, [6] proposed latent Dirichlet allocation (LDA), which was described as a complete generative stochastic model. There is now a growing number of probabilistic models that build on LDA by adapting it to particular tasks.
Since the introduction of topic models, researchers have brought the approach into text mining. Because of its strength in analysing large-scale document collections, better results have been obtained in fields such as text mining [7] and clinical informatics [8][9]. However, most of these studies follow the classic text-mining use of a topic model.
In LDA, let $d$ denote the document index, $k$ the topic and $w$ the word, so that $N_d$ is the number of words in a specific document $d$. Also, define $\theta_d$ as the conditional distribution of topics in document $d$ and $\phi_k$ as the conditional distribution of words in topic $k$. The two conditional distributions, $\theta_d$ and $\phi_k$, are presumed to be multinomial, such that the topic distributions of all documents share a common Dirichlet prior $\alpha$ and the word distributions of all topics share a common Dirichlet prior $\beta$ [10]. After appropriate prior hyperparameters $\alpha$ and $\beta$ have been selected, for a document $d$ the multinomial distribution of topics with parameter $\theta_d$ is drawn from the Dirichlet distribution $\mathrm{Dir}(\alpha)$. Likewise, for a specific topic $k$, the multinomial distribution of words with parameter $\phi_k$ is drawn from the Dirichlet distribution $\mathrm{Dir}(\beta)$. The Dirichlet prior is chosen because of the conjugacy between the multinomial and Dirichlet distributions, which makes statistical inference for LDA easy.
www.ijacsa.thesai.org
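The generative process described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' implementation: the function name `generate_lda_corpus`, the symmetric hyperparameters `alpha` and `beta`, and the Poisson rate `mean_length` are assumptions introduced for the example.

```python
import numpy as np


def generate_lda_corpus(n_docs, n_topics, vocab_size, alpha, beta, mean_length, seed=0):
    """Sample a toy corpus from the standard LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Document length: Poisson, as assumed by standard LDA.
        n_words = rng.poisson(mean_length)
        # Topic proportions of this document, drawn from a symmetric Dirichlet(alpha).
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Draw a topic for each word position, then a word from that topic.
        topics = rng.choice(n_topics, size=n_words, p=theta)
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in topics], dtype=int)
        docs.append(words)
    return phi, docs


phi, docs = generate_lda_corpus(n_docs=5, n_topics=3, vocab_size=20,
                                alpha=0.5, beta=0.5, mean_length=30)
```

Each entry of `docs` is an array of word indices; counting them per document gives one row of the BoW matrix.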

II. RELATED WORK
The earliest modification of LDA with respect to word attributes relaxed the exchangeability assumption that LDA imposes through the BoW matrix [11]. Wallach proposed a model in which the current topic depends on the words from the previous topic. The method is hierarchical and combines n-gram statistics with latent topic models. Specifically, Wallach [11] extended the unigram topic model to include the properties of a hierarchical Dirichlet bigram model. The author reported that the hybrid model outperforms both the unigram topic model and the Dirichlet bigram model, based on two datasets of 150 documents each. The model was supported by [12], who argued that imposing exchangeability of words is unrealistic because word order matters when dealing with word contexts. Hu et al. [8] criticized the whole class of models, both those that keep the exchangeability assumption and those that relax it, claiming that these models are not interactive but instead employ several a priori fixes that are unrealistic. In addition, Inouye et al. [13] showed that this class of models does not incorporate word dependencies within a topic but rather incorporates inter-topic word correlation, which is the major strength of models such as the bigram language model of [11].
Reisinger et al. [14] addressed LDA's weakness in handling word absences. Their algorithm improved the accuracy of LDA by making rare words easier to model. The procedure replaces the multinomial draws with the von Mises-Fisher distribution for topics.
Most existing modifications target one loophole or another in LDA, but none has considered the overdispersion or underdispersion inherent in text data. Thus, in this paper, the Poisson document length distribution of the LDA model is replaced with the Generalized Poisson (GP) distribution, which is able to capture these more complex structures. The new model, referred to as GPLDA, was tested on the 20-Newsgroup dataset to facilitate comparison with LDA.

III. GENERALIZED POISSON DISTRIBUTION
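Suppose a count $N$ (here, the document length) follows a Poisson distribution with rate $\lambda$. The probability mass function of observing $N = n$ is given by [15]:

$$P(N = n) = \frac{\lambda^{n} e^{-\lambda}}{n!}, \quad n = 0, 1, 2, \ldots \tag{1}$$

The Generalized Poisson (GP) distribution [15][16][17][18], which extends (1), can be defined in terms of an additional dispersion parameter $\delta$ as:

$$P(N = n) = \frac{\lambda (\lambda + \delta n)^{n-1} e^{-(\lambda + \delta n)}}{n!}, \quad n = 0, 1, 2, \ldots \tag{2}$$

It is obvious from (2) that the GP reduces to the Poisson when $\delta = 0$. The sign of the dispersion parameter indicates the direction of the disparity: if $\delta < 0$, underdispersion is suspected, and if $\delta > 0$, overdispersion is suspected.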
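To make the role of the dispersion parameter concrete, the following Python sketch evaluates a Generalized Poisson probability mass function. It assumes Consul's parameterization with rate `lam` and dispersion `delta`, $P(N = n) = \lambda(\lambda + \delta n)^{n-1} e^{-(\lambda + \delta n)} / n!$, under which the mean is $\lambda/(1-\delta)$ and the variance is $\lambda/(1-\delta)^3$ for $0 \le \delta < 1$. The function names are hypothetical.

```python
import math


def gp_log_pmf(n, lam, delta):
    """Log-pmf of the Generalized Poisson (assumes lam > 0 and lam + delta * n > 0)."""
    return (math.log(lam) + (n - 1) * math.log(lam + delta * n)
            - (lam + delta * n) - math.lgamma(n + 1))


def gp_pmf(n, lam, delta):
    """Probability of observing the count n."""
    return math.exp(gp_log_pmf(n, lam, delta))


def gp_moments(lam, delta):
    """Theoretical mean and variance, stated here for 0 <= delta < 1."""
    mean = lam / (1.0 - delta)
    variance = lam / (1.0 - delta) ** 3
    return mean, variance


# With delta = 0 the pmf coincides with the Poisson pmf.
poisson_value = 2.0 ** 3 * math.exp(-2.0) / math.factorial(3)
gp_value = gp_pmf(3, lam=2.0, delta=0.0)

# With delta > 0 the variance exceeds the mean (overdispersion).
mean, variance = gp_moments(lam=2.0, delta=0.3)
```

Working on the log scale avoids overflow in the factorial and power terms when document lengths are large.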

A. Generalized Poisson Latent Dirichlet Allocation Model (GPLDA)
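The GPLDA assumes the same structure as LDA except for the change in the document length distribution. Mathematically, for a document of length $N$, the joint distribution of $N$, the topics $\mathbf{z}$, the words $\mathbf{w}$ and the topic mixture $\theta$ is defined as:

$$p(N, \theta, \mathbf{z}, \mathbf{w} \mid \alpha, \phi, \lambda, \delta) = p(N \mid \lambda, \delta)\, p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \phi)$$

where $p(N \mid \lambda, \delta)$ is the GP document length distribution with rate $\lambda$ and dispersion $\delta$, $\alpha$ is the Dirichlet prior on $\theta$, and $\phi$ denotes the topic-word distributions. The marginal distribution of a document $D$ can be obtained by integrating over $\theta$ and summing over the topics:

$$p(D \mid \alpha, \phi, \lambda, \delta) = p(N \mid \lambda, \delta) \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \phi) \right) d\theta$$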
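The change in document length distribution can be illustrated by sampling one document under GPLDA. The sketch below is an assumption-laden example rather than the authors' code: it assumes Consul's form of the Generalized Poisson with rate `lam` and dispersion `delta` restricted to $0 \le \delta < 1$, draws the length by inverting the cumulative distribution, and uses the hypothetical helper names `sample_gp_length` and `generate_gplda_document`.

```python
import math

import numpy as np


def sample_gp_length(lam, delta, rng, max_n=10_000):
    """Draw a document length from the Generalized Poisson by CDF inversion."""
    u = rng.random()
    cdf = 0.0
    for n in range(max_n):
        log_p = (math.log(lam) + (n - 1) * math.log(lam + delta * n)
                 - (lam + delta * n) - math.lgamma(n + 1))
        cdf += math.exp(log_p)
        if u <= cdf:
            return n
    # Fallback cap, reached only if rounding keeps the CDF below u.
    return max_n


def generate_gplda_document(phi, alpha, lam, delta, rng):
    """Sample (length, topic mixture, topics, words) for one document."""
    n_topics, vocab_size = phi.shape
    # The only change from LDA: the length is Generalized Poisson, not Poisson.
    n_words = sample_gp_length(lam, delta, rng)
    theta = rng.dirichlet(np.full(n_topics, alpha))
    z = rng.choice(n_topics, size=n_words, p=theta)
    w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z], dtype=int)
    return n_words, theta, z, w


rng = np.random.default_rng(1)
phi = rng.dirichlet(np.full(10, 0.5), size=2)  # 2 topics over a 10-word vocabulary
n_words, theta, z, w = generate_gplda_document(phi, alpha=0.5, lam=20.0, delta=0.4, rng=rng)
lengths = [sample_gp_length(20.0, 0.4, rng) for _ in range(2000)]
```

With `delta = 0.4` the sampled lengths are overdispersed, so their empirical variance is well above their empirical mean, which a Poisson length model cannot reproduce.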

B. Parameter Estimation of GPLDA
In this section, we present an approximate procedure for estimating the parameters of the GPLDA model. The Newton-Raphson technique is applied to the log-likelihood of the distribution of the corpus of words in document $D$. The procedure requires the first and second partial derivatives of the log-likelihood, which are analytically intractable.

Fig. 1 shows the behavioural patterns for both underdispersed and overdispersed scenarios. All analyses were carried out using R.
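As an illustration of the Newton-Raphson idea, the sketch below fits only the document length component. It assumes that the lengths follow Consul's Generalized Poisson with rate `lam` and dispersion `delta`, that the data are overdispersed (sample variance larger than sample mean), and that this factor of the likelihood can be maximized separately from the topic terms. These are assumptions made for the example, not the estimation procedure of the paper; the function names are hypothetical and a step-halving safeguard has been added to keep the parameters in the valid region.

```python
import numpy as np


def gp_loglik(n, lam, delta):
    """GP log-likelihood up to the additive constant -sum(log(n_i!))."""
    s = lam + delta * n
    if lam <= 0 or np.any(s <= 0):
        return -np.inf
    return float(np.sum(np.log(lam) + (n - 1.0) * np.log(s) - s))


def fit_gp_newton(lengths, max_iter=50, tol=1e-10):
    """Maximum-likelihood estimates of (lam, delta) by damped Newton-Raphson."""
    n = np.asarray(lengths, dtype=float)
    mean, var = n.mean(), n.var()
    # Moment-based starting values, using var / mean = 1 / (1 - delta)^2.
    delta = 1.0 - np.sqrt(mean / var)
    lam = mean * (1.0 - delta)
    for _ in range(max_iter):
        s = lam + delta * n
        # First partial derivatives of the log-likelihood.
        grad = np.array([
            np.sum(1.0 / lam + (n - 1.0) / s - 1.0),
            np.sum((n - 1.0) * n / s - n),
        ])
        # Second partial derivatives (Hessian).
        h_ll = np.sum(-1.0 / lam ** 2 - (n - 1.0) / s ** 2)
        h_ld = np.sum(-(n - 1.0) * n / s ** 2)
        h_dd = np.sum(-(n - 1.0) * n ** 2 / s ** 2)
        step = np.linalg.solve(np.array([[h_ll, h_ld], [h_ld, h_dd]]), grad)
        # Halve the step while it would lower the log-likelihood.
        scale, current = 1.0, gp_loglik(n, lam, delta)
        while (gp_loglik(n, lam - scale * step[0], delta - scale * step[1]) < current
               and scale > 1e-8):
            scale *= 0.5
        lam, delta = lam - scale * step[0], delta - scale * step[1]
        if np.max(np.abs(scale * step)) < tol:
            break
    return float(lam), float(delta)


lengths = [12, 35, 8, 50, 22, 17, 41, 9, 28, 63, 15, 30]
lam_hat, delta_hat = fit_gp_newton(lengths)
```

At the maximum the two score equations imply `lam_hat = mean(lengths) * (1 - delta_hat)`, which gives a quick check that the iteration has converged.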

V. PERFORMANCE EVALUATION USING 20-NEWSGROUP DATASET
The performance of the GPLDA algorithm was evaluated on the 20-Newsgroup dataset [19][20][21][22]. The dataset contains 18846 documents spread across 20 different topic categories. The classes include sports, politics, religion, etc., which makes the dataset sufficiently diverse. Precision (P) was used as a class-specific index, while Recall (R), also known as sensitivity, is the proportion of the total number of relevant cases that were actually retrieved [23][24][25][26][27][28][29]. The $F_1$ score is a measure of accuracy on the test dataset and is defined as:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
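The evaluation measures can be computed directly from true and predicted class labels. The sketch below is a plain-Python illustration with a hypothetical function name, `classification_scores`; it uses the usual definitions of per-class precision and recall, $F_1 = 2PR/(P+R)$, Macro $F_1$ as the unweighted mean of per-class $F_1$, and Micro $F_1$ from counts pooled over classes.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return per-class (P, R, F1), Macro F1, Micro F1 and accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def prf(tp_c, fp_c, fn_c):
        precision = tp_c / (tp_c + fp_c) if tp_c + fp_c else 0.0
        recall = tp_c / (tp_c + fn_c) if tp_c + fn_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in labels}
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)
    micro_f1 = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    accuracy = sum(tp.values()) / len(y_true)
    return per_class, macro_f1, micro_f1, accuracy


y_true = ["sport", "sport", "politics", "politics", "religion"]
y_pred = ["sport", "politics", "politics", "politics", "religion"]
per_class, macro_f1, micro_f1, accuracy = classification_scores(y_true, y_pred)
```

For single-label classification such as this, Micro $F_1$ equals accuracy, whereas Macro $F_1$ gives every class the same weight regardless of its size.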

VI. RESULTS AND DISCUSSION
The plot in Fig. 1 confirms that when the dispersion parameter is zero, the Generalized Poisson distribution reduces to the Poisson distribution and, consequently, GPLDA reduces to LDA. The underdispersed situation yields observations with a higher probability of taking values close to zero than the Poisson, while the overdispersed situation yields observations with a lower probability of taking values close to zero than the Poisson. The graph also confirms that the Poisson distribution only occupies the midpoint position by averaging the two scenarios, which may be appropriate in some cases but not in all. The predictive classification results in Fig. 2 show that GPLDA achieves high precision, recall and $F_1$ scores in 14 of the 20 classes and average scores in the other 6. The comparison in Table I shows that the algorithm improves significantly over LDA: relative to the LDA result, GPLDA shows an increase of about 83.3% in accuracy, 45% in Micro $F_1$ and 75% in Macro $F_1$.

VII. CONCLUSION
This paper considered a new class of LDA that uses the Generalized Poisson distribution to model the length of a document. The Poisson distribution assumed by LDA carries stringent assumptions that are often violated in real-life data. Thus, we proposed the Generalized Poisson LDA (GPLDA) in order to provide a better fit. Estimation was carried out using the Newton-Raphson procedure, and the model was calibrated on the 20-Newsgroup dataset. The simulation results show that the Poisson distribution only occupies the midpoint position by averaging the underdispersed and overdispersed scenarios, which is not always correct. The classification results on the 20-Newsgroup dataset showed that GPLDA predicts better than LDA. The results also established that the added flexibility of the Generalized Poisson over the Poisson resulted in significant improvement. GPLDA can be combined with a distributed representation learning system such as word2vec [10] to form a hybrid system like lda2vec [30].