On Exhaustive Evaluation of Eager Machine Learning Algorithms for Classification of Hindi Verses

Abstract: Implementing supervised machine learning on a Hindi corpus for the classification and prediction of verses is a largely unexplored and useful area. Classification and prediction benefit many applications, such as organizing a large corpus and information retrieval. The metalinguistic facilities provided by websites have made Hindi a major language in today's digital domain of information technology. Text classification algorithms, together with Natural Language Processing (NLP), facilitate fast, cost-effective, and scalable solutions. Evaluating the performance of these predictors is a challenging task. Classifying text data is important because it reduces the manual effort and time spent reading documents. In this paper, 697 Hindi poems are classified into four topics using four eager machine-learning algorithms. In the absence of any other technique that achieves prediction on a Hindi corpus, the misclassification errors of the classifiers are compared to identify the best-performing technique. The support vector machine performs best among them.


I. INTRODUCTION
Most past and contemporary research has targeted the classification and prediction of English-corpus documents. In online and offline systems, documents are generated, stored, and accessed in large volumes every day. Classifying text according to its content produces groups based on the tokens present in the text. Most work on text classifiers focuses on English corpora, but Hindi text on the web has come of age since the advent of Unicode standards for Indic languages. Hindi content has been growing by leaps and bounds and is now easily accessible on the web. Researchers who have worked on Hindi text have generally focused only on Natural Language Processing (NLP) activities such as word identification, stemming, and summarization [1].
Classification, or supervised learning, groups labeled data based on the features of the data. The data is partitioned into training and testing sets. Classifiers are broadly divided into eager and lazy learners. Eager learners require a long period for training and less time for prediction. Lazy learners spend little time on training but take more time to make a prediction. Eager classifiers give better results than lazy classifiers for text data, so eager classifiers are chosen here. Naive Bayes, Support Vector Machines, Neural Networks, and Decision Trees are popularly used eager classifiers. A decision tree is a classifier that generates a set of rules, and these rules are arranged in the form of a tree.
An artificial neural network (ANN) has a minimum of three layers: input, hidden, and output. The network, which consists of nodes, is trained on the inputs given and their respective outputs. The nodes in adjacent layers are interconnected and pass on the values generated through their functions; that is, every node in layer n is connected to nodes in layer n-1, from which it receives its inputs, and to nodes in layer n+1. The output nodes indicate the class to which a particular input object belongs.
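The layer-to-layer flow described above can be sketched in a few lines of Python. This is only an illustrative forward pass, not the network used in this study: the sigmoid activation, the layer sizes, and all weight values are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    """Logistic activation (an assumed choice; the paper does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to node j."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Weighted sum of the values arriving from layer n-1, plus a bias.
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs

def forward(inputs, layers):
    """Pass the input through every layer in turn and return the output scores."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 4 inputs -> 3 hidden nodes -> 4 output nodes (one per class).
layers = [
    ([[0.2, -0.1, 0.4, 0.0], [0.5, 0.3, -0.2, 0.1], [-0.3, 0.8, 0.1, 0.2]],
     [0.0, 0.1, -0.1]),
    ([[0.6, -0.4, 0.2], [0.1, 0.9, -0.5], [-0.7, 0.2, 0.3], [0.3, 0.3, 0.3]],
     [0.0, 0.0, 0.0, 0.0]),
]
scores = forward([0.9, 0.0, 0.3, 0.5], layers)
# The output node with the highest score gives the predicted class index.
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Training would adjust the weights and biases so that the highest-scoring output node matches the labeled class; that step is omitted here.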
Classification and regression can both be carried out with the Support Vector Machine (SVM), a supervised machine learning algorithm. Each data item is plotted as a point in n-dimensional space, where the features of the object are the coordinates of the point. SVM then separates the points of different classes with a hyperplane.
Naive Bayes works well for text classification. Every unique term is treated as a feature while processing text. Naive Bayes is an eager learner and a simple algorithm, and it is regarded as a strong performer for text classification. Naive Bayes works best when the features are independent of each other [2].
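As a minimal sketch of how Naive Bayes treats each term as a feature, the following Python code trains a multinomial model with add-one (Laplace) smoothing. The smoothing choice, the function names, and the transliterated toy tokens are assumptions made for illustration and are not taken from the study.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the token list."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue                               # ignore unseen terms
            # Each term contributes independently (the "naive" assumption).
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data with transliterated tokens.
docs = [["ram", "bhakti", "prabhu"], ["prabhu", "ram", "naam"],
        ["desh", "veer", "tiranga"], ["desh", "bharat", "veer"]]
labels = ["bhajan", "bhajan", "deshbhakti", "deshbhakti"]
model = train_nb(docs, labels)
```

Because the per-term log probabilities are simply added, the model implicitly assumes that terms occur independently of one another given the class.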
To apply any classification algorithm to text data, the text first needs to be converted into a structured form. Several techniques, such as bag of words and term frequency-inverse document frequency (TF-IDF), select important terms from the corpus based on the frequency of the terms [3].
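To make the conversion to a structured form concrete, the sketch below builds a bag-of-words document-term count matrix in plain Python. The function name and the toy documents are illustrative assumptions; tokenization is assumed to have been done already.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    vocab = sorted({t for tokens in docs for t in tokens})
    matrix = []
    for tokens in docs:
        counts = Counter(tokens)
        # One column per vocabulary term; absent terms count as 0.
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix

# Toy, made-up documents with transliterated tokens.
docs = [["ram", "naam", "ram"], ["desh", "naam"]]
vocab, matrix = bag_of_words(docs)
```

Each row of the matrix is one document and each column one term, which is the tabular form that a classifier expects as input.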

II. BACKGROUND
Although Hindi is used for communication by a large number of people in the world, much of the research in text classification [4][5][6] focuses on English. One reason may be that processing a Hindi corpus is a difficult task.
Topic models have been built on a Hindi corpus using Latent Semantic Indexing (LSI), Non-negative Matrix Factorization (NMF), and Latent Dirichlet Allocation (LDA). Several tree-based visualizations were used to present the analysis and results. The topic models built on Hindi text gave better results than some of those generated on English corpora [7].
To apply any classification technique, the data should be in tabular form. Various techniques are available to store such data, for example bag of words [8]. However, bag of words suffers from the curse of dimensionality, because all terms in the corpus are considered, and high dimensionality degrades the performance of the algorithm. To reduce the number of dimensions, only significant words need to be considered; classification executes in less time if only the top significant words are selected. To improve the classification process, the text is preprocessed by removing stop words and similar steps [9][10]. TF-IDF is a popular technique that transforms text data into matrix form. The measure represents the significance of a token with respect to a text document, considering the entire corpus. In document processing, it acts as a weighting unit. A term's TF-IDF weight increases in proportion to its count within a document but is offset by the number of documents in the corpus that contain it, so the most commonly occurring words receive low weights [11]. Accuracy and misclassification error are used to evaluate classifiers.
Hindi is a morphologically rich language. Hindi words have many morphological variants that represent the same concept but differ in tense, plurality, and so on. A lightweight stemmer has been proposed for Hindi, which conflates terms by means of a suffix list. The stemmer has been evaluated by computing under-stemming and over-stemming figures for a corpus of documents [12][13][14].
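For reference, the TF-IDF weighting discussed in this section can be written as follows. This is one common formulation, given here as an assumption; the cited works may use a different normalization or logarithm base. Here $f_{t,d}$ is the count of term $t$ in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

A term that appears in every document has $n_t = N$, so $\mathrm{idf}(t) = 0$ and its weight vanishes, which is how the most commonly occurring words are discounted.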
Optimization methods such as simulated annealing, genetic algorithms, and differential evolution have been used to search for the required solution. The differential evolution algorithm uses multi-parent mutation and crossover operations. The results of these methods are fed to Naive Bayes classifiers and their variants [15][16][17][18]. The proposed algorithm is reported to work well for text classification compared with other existing algorithms.
ANN has been used to classify Arabic text, with an ANN model generated for an Arabic corpus. Different document representation methods, along with feature weights [26], are used to identify important terms. Each Arabic document is represented using a term weighting scheme. To choose the most significant terms and avoid the curse of dimensionality, singular value decomposition (SVD) is used.
A back-propagation neural network (BPNN) and a modified back-propagation neural network (MBPNN) have been proposed to categorize text. To avoid the curse of dimensionality and improve the efficiency of the algorithm, an efficient feature selection method is used.
Training a BPNN is slow, so it is modified to speed up training. Instead of a vector space model based on term frequency, latent semantic analysis (LSA) is used. LSA uses only important terms, considers the semantic relationships between terms, and builds a concept space. A news dataset is used to demonstrate the efficacy of the proposed technique [27].
Different machine learning algorithms have been used to classify questions by their text. Two approaches, bag-of-words and bag-of-grams, are used to construct the vector space. Syntactic terms present in the question are identified using a kernel function, and a comparative analysis of algorithm performance is carried out [28]. Classification of Hindi text documents involves dividing the documents into training and testing corpora and applying classifiers to the labeled text. In one line of work, handwritten and printed text documents are partitioned into specific classes. The algorithm is implemented on Hindi text that contains both printed and handwritten material, and the system is useful for discriminating between handwritten and printed text [19][20][21].
Text has also been classified based on the emotional features present in it. There are nine categories of emotional features, and each category represents one class. Term frequency is used to handle overlapping features. Naive Bayes and support vector machines are executed on a set of 55 poems containing 10531 words [22][23][24][25].
This research is unique because:
1) Prediction of the category of a Hindi poem using four eager classifiers is achieved.
2) Performance evaluation of the classifiers is carried out.
3) Scalability is achieved by processing 697 poems.

III. RESEARCH METHODOLOGY
The proposed approach begins by removing stop words from the corpus and then finds the top N frequent terms using TF-IDF weights on the corpus of poems, which has four groups. The value N is called a threshold and is set to 50% of the maximum TF-IDF weight. Stemming and lemmatization are not used. This step effectively removes uninformative words. Among the different classifiers available in the literature, the proposed approach applies eager classifiers to the document-term matrix, and a model is built using each classifier. Naive Bayes and random forest algorithms are applied. Their performance is evaluated using accuracy. The support vector machine performs best in comparison with the remaining algorithms. Fig. 1 depicts the research methodology.
In this paper, the terms dimensions, words, and tokens are used interchangeably. The paper is organized as follows. Section II presents the work done by other researchers on the topic as background. Section III presents the methodology, and Section IV presents the results and discussion. The paper ends with conclusions and future directions. Table I shows the steps in the proposed approach.

1) Corpus collection and preparation:
The proposed approach begins with data collection and preparation, which includes generating, loading, and preprocessing the corpus. The corpus of Hindi text is preprocessed to remove stop words and is then partitioned into training and validation sets. The corpus comprises poems belonging to four categories: "Bal geet" (children's poems), "Updesh geet" (poems teaching life lessons), "भजन" ("bhajans", devotional songs), and "Desh Bhakti" (patriotic songs). The corpus contains 697 poems, downloaded from different websites [29].
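The preparation step can be sketched as follows. The stop-word list shown is a short hypothetical sample, and the 80/20 split ratio and fixed random seed are assumptions, since the paper does not state the list or the proportions that were used.

```python
import random

# Hypothetical sample of Hindi stop words; a real run would use a fuller list.
STOP_WORDS = {"का", "की", "के", "है", "और", "में", "से", "को"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split a poem on whitespace and drop the tokens found in the stop-word list."""
    return [tok for tok in text.split() if tok not in stop_words]

def split_corpus(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into training and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

tokens = remove_stop_words("राम का नाम")
# 697 placeholder poem identifiers stand in for the actual corpus.
train_ids, validation_ids = split_corpus(range(697))
```

Whitespace tokenization is the simplest possible choice; punctuation handling and Unicode normalization would normally be added before this step.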
2) Converting unstructured data into structured data:
Converting the unstructured data into structured data is the next step. The corpus of poems is converted into a vector space model. TF-IDF is applied to the set of documents, and a weight is calculated for each token. Terms or tokens with a weight greater than or equal to the threshold are retained. This step selects the significant tokens in the corpus, which are then used to form the vector space model. The resulting document-term matrix (DTM) is the input to the classifier algorithms.
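A minimal sketch of this step is given below. It assumes a plain TF-IDF formulation (relative term frequency times the logarithm of inverse document frequency) and assumes that a term's corpus-level weight is its maximum weight over all documents; the paper does not specify either detail. The function names and toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dictionary per tokenized, non-empty document."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))            # document frequency of each term
    return [
        {t: (c / len(tokens)) * math.log(n_docs / df[t])
         for t, c in Counter(tokens).items()}
        for tokens in docs
    ]

def select_terms(weights, fraction=0.5):
    """Keep terms whose best weight is >= fraction * maximum weight in the corpus."""
    best = {}
    for doc_weights in weights:
        for term, w in doc_weights.items():
            best[term] = max(best.get(term, 0.0), w)
    threshold = fraction * max(best.values())
    return sorted(t for t, w in best.items() if w >= threshold)

def document_term_matrix(weights, terms):
    """One row per document, one column per selected term."""
    return [[doc_weights.get(t, 0.0) for t in terms] for doc_weights in weights]

# Toy, made-up documents with transliterated tokens.
docs = [["ram", "naam", "ram", "prabhu"],
        ["desh", "veer", "desh", "naam"],
        ["ram", "desh", "naam", "naam"]]
weights = tf_idf(docs)
terms = select_terms(weights)
dtm = document_term_matrix(weights, terms)
```

In this toy corpus the token "naam" occurs in every document, so its weight is zero and it falls below the threshold, while the rarer tokens are retained as columns of the DTM.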

3) Model training using different classifiers and evaluation:
The labeled corpus is used to train models on the input values and their corresponding outputs. Eager classifiers are applied to the DTM, and models are generated and trained using the training corpus. A confusion matrix is computed for each of the four algorithms, and the misclassification error is used to evaluate their performance. The best-performing classifier is selected to predict the category of a new poem. It was observed that the support vector machine predicts the class of a poem most accurately.
Fig. 2 shows a decision tree for the Hindi poem corpus along with token weights. The corpus of 697 poems is used to build the model. The decision tree generates each token's significance with respect to each category. The figure depicts a particular node representing the "Bal geet" category. Rules based on the weighted tokens are generated for each category.
Fig. 3 shows the Naive Bayes classification. The plot is for weighted token 4; the Y axis represents the density of weighted token 4 for the different categories of poems. The graph clearly represents the four categories of poems, namely Bal geet, Bhajan, Updesh geet, and Desh Bhakti geet. ... and Updesh geet are classified as Bhajans.
Fig. 5 shows the confusion matrix along with the prediction of the type of poem carried out using SVM. The class accuracy is 0.96. The actual and predicted results are also shown; for example, 11 poems that actually belong to the Bhajan class are classified as Updesh geet. All categories of poems can be seen in the plot, represented by different colours.
Fig. 6 represents the neural network generated for all categories of poems. Four significant tokens act as inputs to the network. The weights applied by the two hidden layers are shown in the figure. The network is trained to identify the tokens most helpful for accurate classification. The input-weight products are summed, and the sum is then passed through a node's activation function. The accuracy of the prediction comes out to be 0.88 for 500 poems, as depicted. Blue coloured lines represent hidden layers. Table II
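The evaluation described in this section can be sketched as follows: a confusion matrix is built from actual and predicted labels, and the misclassification error is the share of off-diagonal entries. The label lists below are made-up toy data for illustration and are not the results of this study.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def misclassification_error(matrix):
    """Fraction of items that fall off the diagonal (1 - accuracy)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

classes = ["Bal geet", "Bhajan", "Updesh geet", "Desh Bhakti"]
# Toy labels: two of the ten poems are misclassified.
actual = ["Bal geet", "Bal geet", "Bhajan", "Bhajan", "Bhajan",
          "Updesh geet", "Updesh geet", "Desh Bhakti", "Desh Bhakti", "Desh Bhakti"]
predicted = ["Bal geet", "Bal geet", "Bhajan", "Updesh geet", "Bhajan",
             "Updesh geet", "Bhajan", "Desh Bhakti", "Desh Bhakti", "Desh Bhakti"]
matrix = confusion_matrix(actual, predicted, classes)
error = misclassification_error(matrix)
```

Computing the same error for each classifier on the same validation set gives a direct basis for comparing them, with the lowest error indicating the best-performing classifier.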

V. CONCLUSIONS
The current study achieves prediction of the class of a Hindi poem, unlike other published research, which has focused on the classification of English text. An additional contribution of this study is the exhaustive evaluation of eager classifiers. The formation of the classes was achieved through TF-IDF. Government and non-government agencies can use the approach to classify reports, initiatives, schemes, and so on. Experiments were conducted on a corpus of 697 poems. The current work is the first of its kind in the world to employ prediction and performance evaluation for a Hindi corpus comprising verses.