Feature based Entailment Recognition for Malayalam Language Texts

Textual entailment is a relationship between two text fragments, namely, text/premise and hypothesis. It has applications in question answering systems, multi-document summarization, information retrieval systems, and social network analysis. In the era of the digital world, recognizing semantic variability is important in understanding inferences in texts. The texts are either in the form of sentences, posts, tweets, or user experiences. Hence understanding inferences from customer experiences helps companies in customer segmentation. The availability of digital information is ever-growing with textual data in almost all languages, including low resource languages. This work deals with various machine learning approaches applied to textual entailment recognition or natural language inference for Malayalam, a South Indian low resource language. A performance-based analysis using machine learning classification techniques such as Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, AdaBoost, and Naive Bayes is done for the MaNLI (Malayalam Natural Language Inference) dataset. Different lexical and surface-level features are used for this binary and multiclass classification. With the increasing size of the dataset, there is a drop in the performance of feature-based classification. A comparison of feature-based models with deep learning approaches highlights this inference. The main focus here is the feature-based analysis with 14 different features and its comparison, essential to any NLP classification problem.
Keywords: Textual entailment; natural language inference; Malayalam language; machine learning; deep learning


I. INTRODUCTION
Textual entailment (TE), also called natural language inference (NLI), is a relationship between a pair of sentences. It identifies the similarity between the sentences based on their inferential semantic content. A text is said to entail another sentence, called a hypothesis, if the hypothesis has its semantic content derived from the text. In the same way, the text contradicts the hypothesis if the semantic content of the hypothesis is the opposite of that of the text. Both sentences remain neutral to each other if the hypothesis derives no information from the text.
A classical definition of entailment is that a text t entails a hypothesis h if h is true in every circumstance of a possible world in which t is true. This definition is too strict to apply to real-world applications. An applied definition says that a text t entails a hypothesis h if a human reading t would infer that h is most likely true. A mathematically computable definition states that hypothesis h is entailed by text t if P(h is true | t) > P(h is true), where P(h is true | t) is the Entailment Confidence [1].
Semantic variability in expressions is an essential factor in any natural language processing application. NLI is also a necessary sub-task for almost all NLP applications such as multi-document summarization, question answering systems, information extraction, and information retrieval. In multi-document summarization, the redundant sentences are identified using entailments, and those sentences can be removed. In a question answering system, the answer to a question can be evaluated based on its entailment to the reference answer. In information extraction and retrieval systems, the text should entail the extracted information.
Natural language inference also finds application in analysis of user tweets, posts and experiences in social networks, where people share their thoughts, experiences in the form of texts in various languages. These texts are useful to relate between users by analysing inferences (entailment, contradiction and neutral) between the texts. This helps in customer segmentation, product analysis from the customer viewpoint as well as in recommender systems.
As information is available in digital text form in almost all languages, recognizing entailment is important for almost all languages. Text entailment is recognized in various languages, namely, English, French, Spanish, Italian, Japanese, Hindi, Swahili, Urdu. Very few works are reported for the Malayalam language.
In this work, we classify entailments for Malayalam, a South-Asian language from the Dravidian family. Malayalam is the language officially used and spoken in the state of Kerala. This language has its origin from the Dravidian scripts of Tamil. The language has various dialects, agglutinations, and inflectional word forms used in different parts of the state. This language also has very few resources in terms of datasets and other language processing applications and falls in the class of low resource languages.
The main contributions of this work include:
1) The application of machine learning methods to Malayalam language textual entailment recognition, which has not been attempted so far and is required for the current literature and future research in this area.
2) A comparison between machine learning and deep learning approaches for Malayalam language entailment recognition.
3) The limitations of feature-based methods with increasing dataset size.
4) An inference that deep learning without explicit feature engineering gives more accurate classification for datasets of larger size.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 2, 2022
The rest of the article is organized as follows: Section II describes the related literature in English and other languages. Challenges and contributions in the Malayalam language for entailment are provided in Section III. Textual entailment for the Malayalam language using the feature set is detailed in Section IV. The experimental evaluations are in Section V. Section VI discusses the results and Section VII concludes the work.

II. RELATED WORK
Textual entailment has its inception in 2005 as PASCAL (Pattern Analysis, Statistical Modelling, and Computational Learning) challenge programme 'Recognizing Textual Entailment (RTE)' to develop systems that can recognize inferences from text fragments across various applications like multidocument summarization, information retrieval, information extraction and question answering systems.
In 2008, PASCAL RTE became a track at the Text Analysis Conference organized by NIST (National Institute of Standards and Technology), which brought different NLP communities to work on the textual entailment application scenarios. The earliest approaches for determining textual entailment include bag of words, logic-based reasoning, lexical entailment, machine learning methods, and graph matching [2].
The English language: The challenge started with the English language, and all major works are implemented for English using the RTE (Recognizing Textual Entailment), SNLI (Stanford Natural Language Inference) [3], MNLI (Multi-genre Natural Language Inference) [4] and XNLI (Cross-lingual Natural Language Inference) datasets. Lexical and syntactic similarity based entailment classification is done using rule-based similarity features such as unigram, skip-gram, longest common subsequence, stemming, subject-subject comparison, subject-verb comparison, and object-verb comparison [5].
RTE datasets were used to train and test these systems. Entailment recognition is also attempted by resolving anaphora in sentence pairs [6]. Similarity-metric-based recognition of entailment is reported in [7], where features such as cosine similarity, unigram match, Jaccard similarity, Dice similarity, overlap, harmonic mean, and the machine translation evaluation metrics BLEU and METEOR are used for machine learning. The other approaches are as follows:
Bag of Words: In this approach, both the text (T) and the hypothesis (H) are represented as collections of words. Every word from the hypothesis collection is compared with every word from the text collection. If the match between T and H exceeds a preset threshold, the sentence pair is classified as entailment; otherwise, it is classified as not entailment. This approach ignores the word order, syntax, and semantics of the sentences.
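The bag-of-words decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of hypothesis-word coverage as the match score, and the threshold value of 0.75 are assumptions, not details taken from the cited systems.

```python
def hypothesis_coverage(text, hypothesis):
    """Fraction of hypothesis words that also occur in the text."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0
    matched = sum(1 for word in hyp_words if word in text_words)
    return matched / len(hyp_words)


def bow_entails(text, hypothesis, threshold=0.75):
    """Label a pair as entailment when the coverage reaches the threshold.

    The threshold value is an assumed placeholder; in practice it would be
    tuned on held-out data.
    """
    return hypothesis_coverage(text, hypothesis) >= threshold


premise = "a man is playing a guitar on stage"
print(bow_entails(premise, "a man is playing a guitar"))
print(bow_entails(premise, "a woman is sleeping"))
```

Because only word overlap is counted, a hypothesis that reuses the premise words in a different order or with a negation would still be accepted, which is the weakness noted above.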
Lexical Entailment: Entailment is determined based on lexical concepts. A hypothesis is valid if its lexical components are true [1]. It is based on a probabilistic model and does not consider syntax and semantics.
Machine Learning approaches: Linear classifiers, logistic regression, and support vector machines are classifiers trained on a dataset of text-hypothesis pairs. This is a feature-based approach using similarity measures on words, stems, POS tags, chunk tags, negation, length ratio, of best partial match [8].
Graph based approaches: Text and hypothesis can be represented as directed graphs (dependency graphs), nodes representing words or phrases, and edges representing the relation between nodes [9]. Entailment is determined in these graphs using a matching cost based on vertex substitution and path substitution.
Deep learning approaches: The entailment recognition attempts in English from 2005 to 2015 are either rule-based or feature-based machine learning approaches. The introduction of the large SNLI dataset in 2015 enabled deep learning techniques for sentence representation using LSTM (Long Short Term Memory), CNN (Convolutional Neural Network) [10], BERT [11], and other transformer models, with classification through deep neural networks [12]. Textual entailment is also used for fake news detection [13].

a) Datasets for Textual Entailment in English:
The current works are mainly carried out on the RTE, SNLI, MNLI, and XNLI datasets and on legal texts [14]. The collection of RTE datasets with their specifications is given in Table I. Other NLI datasets are the SNLI (Stanford Natural Language Inference) dataset, a collection of 570k English sentence pairs collected using Amazon Mechanical Turk [3], and the MNLI (Multi-Genre Natural Language Inference) dataset, a collection of 433k sentence pairs from multiple genres [4].
b) Other languages: Entailment recognition in Japanese, Simplified Chinese, and Traditional Chinese is attempted with the RITE (Recognizing Inference in Texts) dataset [16], which has forward entailment, reverse entailment, bidirectional entailment, contradiction, and independence as different classes for the Chinese sentence pairs. Surface textual features, lexical-semantic features, syntactic features, and linguistic features are used for classification using an SVM model [17].
An Italian dataset was used in the EVALITA 2009 campaign to recognize entailments in Italian text pairs [18]. An Arabic dataset for textual entailment is detailed in [19]. Traditional features and distributed representations are used for recognizing textual entailment in Arabic [20]. The cross-lingual natural language inference dataset (XNLI) derives its collection from the MNLI dataset and contains sentence pairs in 15 languages, namely English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu, of which Hindi and Urdu are Indian languages [21]. Textual entailment for Indo-Aryan languages like Hindi is important to the language community of the northern parts of India.
In this work we focus on the Malayalam language from the Dravidian family. The Dravidian languages are mostly spoken in the southern parts of India and have very few contributions with respect to inference. Work on different language families helps to gather significant contributions, both those specific to a language or language family and those generic to all languages.

III. CHALLENGES AND CONTRIBUTIONS
The Malayalam language is a South Indian Dravidian language for which very little work on textual entailment exists. The automatic and manual translation of SNLI pairs with linguistic corrections by experts forms the basis for the MaNLI (Malayalam Natural Language Inference) dataset. Prior work in Malayalam textual entailment reports the use of different embedding techniques, namely Doc2Vec (paragraph vector), fastText, BERT (Bidirectional Encoder Representations from Transformers) and LASER (Language Agnostic SEntence Representations), for embedding sentence pairs for classification through Densenet [22]. Another attempt uses Siamese networks for binary classification of inference in texts [23]. The accuracy measure in Table II shows that LASER embedding based classification achieves the best results.

A. MaNLI Dataset
The development of language resources for Malayalam is in progress by different organizations and individual contributors. The Malayalam Natural Language Inference dataset is a dataset developed for natural language inference in the Malayalam language. It is created by manual and machine translation of text-hypothesis pairs from the SNLI (Stanford Natural Language Inference) dataset. Certain incorrect translations were corrected through manual efforts. The Olam dictionary [24] is also used to get common substitutes for the English words. The details of the dataset are in Table III. The dataset is created because an adequately annotated and linguistically correct entailment dataset is unavailable in this language. Hence, the translation method with linguistic corrections from language experts is adopted as one method to produce this dataset. This method involves less time and cost than creating an entirely new dataset, which requires more time and human involvement to create sentence pairs and annotations.
The MaNLI dataset [22] [25] is a collection of 12K text-hypothesis pairs classified into entailment, contradiction, and neutral. Translations are done in such a way that the semantic content is preserved. Hence the annotated class labels are reused. It has been manually verified by linguists from the Thunchath Ezhuthachan Malayalam University, Kerala. The sentence length distribution for text and hypothesis sentences from the corpus is shown in Fig. 1. The premise sentences have a word length between 5 and 15, whereas the hypothesis word length varies between 5 and 10 for most cases.
Textual entailment or natural language inference in English is attempted using machine learning and also deep learning approaches. But feature-based machine learning approaches are not reported for the Malayalam language. In this work, we aim to develop systems for the Malayalam language using feature-based machine learning methods, which are essential to understand any classification problem. Also, a comparison of feature-based models with deep learning methods becomes more feasible and realistic.
The design of the proposed work is shown in Fig. 2. Input pairs of text and hypothesis are preprocessed, and various lexical, semantic, and set-based features are extracted. The machine learning module classifies the text hypothesis pairs based on the extracted features using ML algorithms, such as Logistic regression, Support Vector Machine, Decision tree, Random Forest, Multinomial Naive Bayes and Adaboost.

A. Preprocessing
The sentence pairs are split into tokens, and prefixes and suffixes are removed in the preprocessing stage through tokenization and stemming. Tokenization is the process by which the words in the sentences are split into individual units called tokens for processing. The splitting is done using space as a separator. Stemming removes affixes from words. For example, the word 'flowers' can have its stem word as 'flower', removing the suffix 's'. For the Malayalam language, libindic stemmer [26] available online is used. It is a rule-based stemmer using iterative suffix stripping to handle inflectional words.
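The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the suffix list and the `toy_stem` function are hypothetical stand-ins that only handle the English 'flowers' example, whereas the actual system uses the rule-based libindic stemmer for Malayalam.

```python
# Hypothetical suffix list for the English toy example only; it does not
# reproduce the rules of the libindic Malayalam stemmer.
TOY_SUFFIXES = ("ing", "es", "s")


def tokenize(sentence):
    """Split a sentence into tokens using whitespace as the separator."""
    return sentence.split()


def toy_stem(token, min_stem_len=3):
    """Iteratively strip known suffixes while the remaining stem is long enough."""
    changed = True
    while changed:
        changed = False
        for suffix in TOY_SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
                token = token[: -len(suffix)]
                changed = True
                break
    return token


def preprocess(sentence, stem=toy_stem):
    """Tokenize a sentence and stem every token with the supplied stemmer."""
    return [stem(token) for token in tokenize(sentence)]


print(preprocess("flowers bloom"))
```

Passing the stemmer as a parameter keeps the sketch independent of any particular stemming library, so a Malayalam stemmer could be substituted for `toy_stem`.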

B. Feature based Classification
This section details the different features used for the entailment classification. The features fall into different categories, namely lexical features, semantic features, and set-based features.
Semantic features: Semantic features deal with the semantics of the sentences. For this, we have used word vector representations and the term frequency-inverse document frequency (TF-IDF) of sentences.

1) Word embedding similarity: Word vectors from Word2Vec [27] are used to represent the words. The sum of the word vectors of each sentence (text/hypothesis) is computed, and the cosine between the two sums gives a similarity feature value.
2) TF-IDF similarity: Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that evaluates the importance of a word in a collection. It is the product of term frequency and inverse document frequency. Text and hypothesis are represented using the TF-IDF representation, and the cosine similarity is computed between the two.
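Both semantic features reduce to a cosine between two sentence vectors, which the following sketch illustrates. The two-dimensional embeddings are made-up toy values, not Word2Vec output, and the TF-IDF weighting (raw term count times log(N/df) + 1, computed over the supplied documents only) is one common variant chosen here as an assumption; the exact weighting used in the experiments is not specified.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summed_embedding(tokens, embeddings, dim):
    """Sum the word vectors of a sentence, skipping out-of-vocabulary tokens."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        for i in range(dim):
            total[i] += vector[i]
    return total


def tfidf_vectors(docs):
    """TF-IDF vectors over a shared sorted vocabulary (assumed weighting variant)."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * (math.log(n_docs / doc_freq[t]) + 1.0) for t in vocab])
    return vectors


# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "dog": [1.0, 0.0], "runs": [0.0, 1.0],
    "animal": [0.9, 0.1], "moves": [0.1, 0.9],
}
text = ["dog", "runs"]
hypothesis = ["animal", "moves"]

emb_sim = cosine(summed_embedding(text, toy_embeddings, 2),
                 summed_embedding(hypothesis, toy_embeddings, 2))
text_vec, hyp_vec = tfidf_vectors([text, hypothesis])
tfidf_sim = cosine(text_vec, hyp_vec)
print(emb_sim, tfidf_sim)
```

The toy pair shares no surface words, so the TF-IDF cosine is zero while the embedding cosine is high, which shows why the two features carry complementary information.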
Set/Distance based measures: These are different types of similarities, several of which use counts of set unions and intersections. The various similarities are:
1) Dice similarity: It measures the overlap between the two sentences of a pair. For token sets X and Y,
$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
If X and Y are identical, the Dice coefficient is 1; if they share no elements, it is 0.
2) Cosine similarity: It measures the cosine of the angle between the two sentences.
3) Levenshtein similarity: It measures the minimum number of insertions, deletions and substitutions required to transform one word into another.
4) Needleman-Wunsch similarity: It is a sequence alignment based similarity measure. It measures a global alignment score by finding the number of edits required, which is calculated from the alignment matrix.
5) Smith-Waterman similarity: It is a dynamic programming method that uses local alignment as a metric to measure similarity. The alignment matrix is created with no negative entries, and the scores are calculated from it.
6) Jaro-Winkler similarity: It is also a string metric that measures the edit distance between two sequences, giving more weight to sequences that match from the beginning up to a set prefix length.
$$sim_w = sim_j + l\,p\,(1 - sim_j)$$
where $sim_j$ is the Jaro similarity between strings $s_1$ and $s_2$, $l$ is the prefix length, and $p = 0.1$ (constant scaling factor).
$$sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$
where $|s|$ is the string length, $m$ is the number of matching characters, and $t$ is the number of transpositions.
7) Jaccard similarity: This metric is the ratio of the number of elements shared by the two sample sets to the number of elements in their union, $J(X, Y) = |X \cap Y| / |X \cup Y|$.
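Three of these measures are simple enough to sketch directly in Python. The code below is an illustrative re-implementation, not the library code used in the experiments; treating sentences as token sets and normalizing the edit distance by the longer string's length are assumptions made for this sketch.

```python
def dice(x_tokens, y_tokens):
    """Dice coefficient between two token collections, treated as sets."""
    x, y = set(x_tokens), set(y_tokens)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x_tokens, y_tokens):
    """Jaccard coefficient: intersection size over union size."""
    x, y = set(x_tokens), set(y_tokens)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(dice(["a", "b"], ["b", "c"]))
print(jaccard(["a", "b"], ["b", "c"]))
print(levenshtein("kitten", "sitting"))
```

For the token lists above, one shared token out of two per side gives a Dice value of 0.5 and a Jaccard value of 1/3, which illustrates that Dice is never smaller than Jaccard on the same pair.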

C. Machine Learning Approaches
Inference in the Malayalam language is considered as both binary and multiclass classification. The binary classes are entailment and contradiction. The multiclass setting includes entailment, contradiction, and neutral. The following machine learning algorithms are used to evaluate the performance.
1) Logistic Regression: Logistic regression models a dependent variable with two classes. With two predictors x1 and x2 and a binary response variable Y, where p = P(Y = 1), the log-odds of p are modeled as a linear function of the predictors, $\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. Binary classification is done with the liblinear solver and balanced class weights. Multinomial logistic regression is used to predict the different possible outcomes of a categorically distributed dependent variable. The classifier with multinomial class weights and the lbfgs solver is used for multiclass classification.
3) Random Forest: It is an ensemble learning method that constructs many decision trees at training time. For a classification task, the output class is the class selected by the majority of the trees.
4) Decision Tree: It is a predictive modeling approach in which the observations about a sentence pair enter at the root of the tree, traverse the branches, and end in a leaf node that represents the target category for the sentence pair.
5) MultinomialNB: It is a Naive Bayes classifier for multiclass classification. The feature vector consists of frequencies or integer counts. It is based on Bayes' theorem, stated below:
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$
where c is a class and x is the sample instance that is to be classified.
6) AdaBoostClassifier: Also called adaptive boosting, it combines weak classifiers: a classifier is first fitted on the original dataset, and additional copies of the classifier are then fitted on the same dataset with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.
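A rough sketch of how the listed classifiers could be set up in Scikit-Learn is given below. It assumes Scikit-Learn and NumPy are installed, uses randomly generated values in place of the 14 extracted features, and leaves most hyperparameters at library defaults, so it does not reproduce the settings or results of this work. With random labels the reported scores are expected to sit near chance level.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 pairs, 14 similarity-style features in [0, 1),
# and random labels 0/1/2 for entailment/contradiction/neutral.
rng = np.random.default_rng(0)
X = rng.random((300, 14))
y = rng.integers(0, 3, size=300)

# 80:20 train/test split, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

classifiers = {
    # lbfgs fits a multinomial model when there are more than two classes.
    "LogisticRegression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    # MultinomialNB requires non-negative features, which holds for this data.
    "MultinomialNB": MultinomialNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test split

# Binary variant (labels 0 and 1 only) with liblinear and balanced class weights.
mask = y < 2
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X[mask], y[mask], test_size=0.2, random_state=0, stratify=y[mask])
binary_lr = LogisticRegression(solver="liblinear", class_weight="balanced")
binary_score = binary_lr.fit(Xb_train, yb_train).score(Xb_test, yb_test)

print(scores, binary_score)
```

Keeping the classifiers in a dictionary makes it straightforward to run the same split and features through every model, which is the comparison pattern used in the evaluation.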

V. EXPERIMENTAL SETTINGS AND EVALUATION
Implementations are done in the Spyder integrated development environment. The libraries used are the Libindic stemmer for stemming, the NLTK toolkit for extracting bigrams, the text distance library for evaluating the distance between two or more sequences, and Scikit-Learn for machine learning algorithms and classification reports. GridSearchCV is used for SVM classification. Table IV shows the specific settings applied in the Scikit-Learn based classifiers. We have used different combinations of the feature set to arrive at the results. The different feature set configurations are in Table V.
Evaluation Metrics
The classification performance is evaluated using the Scikit-Learn classification metrics, namely accuracy, precision, recall and F1-score.
• Accuracy: Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. $\text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$
• Precision: Precision is defined as the ability of the classifier not to label a negative sample as positive. $\text{Precision} = \frac{tp}{tp + fp}$
• Recall: Recall is defined as the ability of the classifier to find all positive samples. $\text{Recall} = \frac{tp}{tp + fn}$
• F1-score: F1-score is the harmonic mean of precision and recall. $\text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
where tp is true positive, fp is false positive, tn is true negative and fn is false negative.
The results of the classification, evaluated in terms of accuracy, precision, recall, and F1-score, are shown in Table VI with the whole 7989 pairs for binary classification and in Table VII with 12k pairs for multiclass classification, both with the feature set configuration F7 having all the features. The train-test split is 80:20. The performance of the rest of the feature sets (F1 to F6) is low compared to F7; hence we selected the feature set F7 for our study and comparisons. The performance of the other feature sets is detailed in Section VI-B. From Tables VI and VII, it can be inferred that SVM, random forest and AdaBoost classify the Malayalam texts into entailment, contradiction and neutral classes better than the other classifiers.
We have evaluated our system with an increasing size of the data ranging from 2000 to 12000. The variation in the performance is shown in Fig. 3 for binary classification. The plot for multiclass classification is shown in Fig. 4.
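The four evaluation metrics defined in this section can be computed directly from the confusion counts, as in the short sketch below. The counts in the example call are hypothetical and are not results from this work; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts chosen only to illustrate the arithmetic.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))
```

With these counts, accuracy is (40 + 40)/100 and precision and recall are both 40/50, so all four metrics come out to 0.8.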

A. Effect of Increasing Size of Dataset
This section discusses the difference in the performance of deep learning and feature-based machine learning classification for binary and 3-way classification. As the size of the dataset increases from 2000 to 12k, there is a reduction in performance of feature-based classification. The features selected may be suitable for a few samples, but they can be misleading for other samples. Hence the model is not able to generalize with the samples.
LASER-based approach [22]: In the case of deep learning approaches with embeddings that capture the context and place the sentences in a semantic space, the model can generalize much more effectively. Prior work on entailment classification using LASER-based sentence embedding uses a BiLSTM encoder trained for 93 languages, including Malayalam. With character and word level representations, it produces sentence embeddings that are mapped into a semantic space. A feed-forward neural network with sigmoid/softmax activations classifies the dataset into binary/3-class labels. It is a more generic approach, and the size of the dataset is immaterial when using a pretrained model.
In Fig. 3 and Fig. 4, the notation 'LS' denotes the LASER-based deep learning approach, and the rest are the feature-based machine learning methods. From the figures, it can be inferred that when the dataset size is around 2000, the machine learning and deep learning approaches perform similarly. As the data size increases, deep learning based methods become more suitable, as observed through the comparison with this feature-based machine learning implementation. This is also consistent with the fact that earlier works in English with the RTE datasets used feature-based approaches.
With 2000 samples of data, we have obtained good results with feature-based classification. As the sample size increases, deep learning methods become more effective in classification, which agrees with the related works on the SNLI dataset. This work adds to the literature for Malayalam entailment or inference tasks as a baseline for machine learning based on the feature set approach, which is novel with respect to this language. As the dataset is generic in nature, the features become less discriminative, and this can lead to poor classification on large datasets. Thus the performance of feature-based classification is limited by the lack of features that generalize well to datasets of high semantic variability. Hence, the rise in performance of deep learning approaches suggests that these methods can be adopted for both small and large datasets.

B. Ablation Study
The ablation study for this work examines how different features contribute to the classification of inferences in Malayalam text. With the set of features, namely lexical, semantic, and distance measures, we have studied the performance of different feature set combinations, and the results are discussed here. The chosen setting for the experimental results combines lexical, semantic, and distance measures (F7). We have also studied the model performance with only lexical (F1), semantic (F2), distance-based (F3), lexical and semantic (F4), lexical and distance (F5), and semantic and distance-based (F6) features. Based on Table VIII, the feature set that performs well on a majority of classifiers is chosen for analysis and comparison. The feature set performance, shown in Table IX, is evaluated as the ratio of the number of classifiers with maximum F1-score to the total number of classifiers. This justifies the selection of feature set F7, which has the maximum performance, for the experimental evaluations.
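The selection ratio described above can be sketched as follows. The F1 values and classifier names in the example are invented placeholders used only to show the computation and are not taken from Table VIII or Table IX; counting a tie as a win for every tied feature set is also an assumption.

```python
def feature_set_ratio(f1_scores):
    """For each feature set, the share of classifiers on which it attains the best F1.

    f1_scores maps classifier name -> {feature set name -> F1-score}.
    """
    wins = {}
    for per_set in f1_scores.values():
        best = max(per_set.values())
        for feature_set, score in per_set.items():
            wins.setdefault(feature_set, 0)
            if score == best:
                wins[feature_set] += 1
    n_classifiers = len(f1_scores)
    return {fs: count / n_classifiers for fs, count in wins.items()}


# Placeholder scores, NOT results from this work.
example_scores = {
    "clf_a": {"F1": 0.50, "F7": 0.60},
    "clf_b": {"F1": 0.55, "F7": 0.50},
    "clf_c": {"F1": 0.40, "F7": 0.45},
}
ratios = feature_set_ratio(example_scores)
print(ratios)
print(max(ratios, key=ratios.get))
```

In this made-up example the second feature set is best on two of the three classifiers, so its ratio is 2/3 and it would be the one selected.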

VII. CONCLUSION AND FUTURE WORK
In this work, textual entailment is recognized for the Malayalam language with a feature-based approach. A set of classifiers is used to evaluate the performance. The best feature set model is chosen based on the F1-score measures. It is the first feature-based attempt at textual entailment recognition in this language. This method also helped us understand the stronger performance of deep learning methods, which is evident in the comparison. Thus this work on feature-based textual entailment recognition for the Malayalam language is a substantial contribution to the language resources community. The work is also useful in identifying inferences in Malayalam texts for various language processing and social networking applications. Future work can include deep learning models to recognize entailment, and these systems can be used in language processing applications.