Reinforcement Learning-based Answer Selection with Class Imbalance Handling and Efficient Differential Evolution Initialization

Abstract: Answer selection (AS) is the task of selecting the best answer from a given list of candidate answers. Current methods commonly treat AS as a binary classification task over pairs of positive and negative samples. However, negative samples usually far outnumber positive ones, which results in class imbalance, and training on imbalanced data can degrade classifier performance. To address this issue, this study proposes a novel reinforcement learning-based technique. The AS problem is formulated as a sequence of decisions in which an agent classifies each received instance and receives a reward at each step. To handle the class imbalance, the reward assigned to the majority class is lower in magnitude than that assigned to the minority class. The parameters of the policy are initialized using an improved Differential Evolution (DE) technique. To enhance the efficiency of the DE algorithm, a novel cluster-based mutation operator is introduced. This operator uses K-means clustering to identify the winning cluster and employs an update strategy to incorporate potentially viable solutions into the existing population. For word embedding, the DistilBERT model is used, which is 40% smaller than the BERT (Bidirectional Encoder Representations from Transformers) model and runs 60% faster. Despite this reduction, DistilBERT retains 97% of BERT's language comprehension abilities by applying knowledge distillation in the pretraining phase. Extensive experiments are carried out on the LegalQA, TrecQA, and WikiQA datasets to assess the proposed model. The results show that the proposed model outperforms existing AS techniques.


I. INTRODUCTION
Question Answering (QA) systems, a notable application within natural language processing (NLP) and artificial intelligence (AI), facilitate enhanced human-computer interaction by efficiently processing expansive data and information. Two dominant strategies for developing QA systems are the deployment of Generative Adversarial Networks (GANs) [1] and the use of AS techniques. While GANs can generate rich and varied responses, their application comes with challenges related to ensuring the grammatical and semantic accuracy of answers. In contrast, AS focuses on selecting the most apt response from a set of potential answers to a given query, taking into account the inherent variability and complexity of language and the possibility of multiple suitable responses; it has therefore found extensive application in various domains, including machine comprehension [2]. Both methodologies bring their respective benefits and limitations, which influence their applicability to different use cases within the broader QA landscape.
According to the existing literature, both conventional and deep learning techniques offer various methods for AS [3]. While traditional models, such as those based on information retrieval, handcrafted rules [4], and machine learning (ML) methods [5], provide certain utilities, they also exhibit limitations in semantic understanding and generalization due to their reliance on keywords, manual features, or rigid rules. Within ML approaches, SVM classifiers have been used to match question-answer pairs through edit distance and implied matches, yet traditional ML methods often neglect semantic information and generalize poorly. Deep learning methods leverage LSTM or CNN architectures to extract semantic features and use them to gauge the semantic similarity between questions and answers [6]. CNNs model hierarchical sentence structures, while LSTMs ensure that representations contain coherent and pertinent information [7]. Notwithstanding these advancements, deep learning models still face challenges in capturing comprehensive semantic relationships between questions and answers. To address this, newer models such as BERT use next-sentence prediction and masked-word prediction to learn complex linguistic relations, outperforming previous models and widely impacting the NLP field [8].
The success of deep algorithms relies heavily on factors like architecture, learning method, and training features, making network design a sophisticated optimization task. Various researchers [9] have addressed this by training neural networks with fixed topologies using several optimization approaches, such as tabu search, ant colony optimization, genetic algorithms, and simulated annealing [10]. Critical to deep models' performance is the optimization of their parameter sets, which is heavily influenced by initialization [11]. While gradient-based algorithms like Backpropagation (BP) and Levenberg-Marquardt (LM) [12] have been used for weight optimization in deep learning methods for AS, their sensitivity to initial weights may cause them to become trapped in local optima. To address this, pretraining weights using Population-Based Meta-Heuristic (PBMH) algorithms [13], [14] such as DE [15], which incorporates mutation, crossover, and selection steps, has proven effective for optimizing learning processes by avoiding local minima and generating potentially promising solutions [16], [17]. Furthermore, while BERT has established its dominance in NLP tasks due to its deep architecture and ability to capture bidirectional contexts in textual data, its complexity often renders it computationally expensive, especially for real-time applications. Recognizing this challenge, researchers introduced DistilBERT, a distilled version of BERT. The principle behind DistilBERT is knowledge distillation: a smaller model, in this case DistilBERT, is trained to mimic the behavior and performance of its larger counterpart, BERT. By transferring knowledge from the cumbersome BERT model to the more lightweight DistilBERT, the model size is reduced by about 40% [18]. Notably, despite this reduction, DistilBERT retains a substantial portion of BERT's language comprehension capabilities, making it an efficient alternative for applications demanding both speed and accuracy.
www.ijacsa.thesai.org
AS methods proposed in the literature typically rely on binary classification over positive-negative pairs, which poses challenges due to data imbalance, as the positive class tends to be much smaller than the negative class. This imbalance can degrade model performance but can be addressed through data-level and algorithm-level approaches. Data-level strategies manipulate the training data distribution via over-sampling or under-sampling of classes, using methods like the Synthetic Minority Over-sampling Technique (SMOTE) [19] to create new minority-class samples, and NearMiss [20] to under-sample by removing majority-class samples selected according to their distance to minority-class samples. While under-sampling can discard valuable data, over-sampling may increase the risk of over-fitting. Algorithm-level strategies emphasize minority classes through ensemble learning, decision threshold changes, and cost-sensitive learning, which penalizes the misclassification of minority-class samples more heavily. Ensemble learning, in particular, leverages majority voting among multiple classifiers. Additionally, deep RL has shown promise in various applications [21] and can manage data imbalance by assigning higher rewards to minority classes in its reward function [22].
In the realm of AI-driven AS, a pertinent question is whether AI can provide a single, definitive answer. Traditional AS models, by design, sift through potential options to select the most fitting response. However, the rapidly evolving nature of AI and its profound capabilities in understanding intricate data patterns pose a thought-provoking question: can a sophisticated model predict just one conclusive answer, thereby eliminating the need for answer selection? In such a scenario, the fundamental nature of AS would undergo a paradigm shift. The model presented in this paper, with its interplay of RL, DistilBERT word embedding, and enhanced DE, is primarily designed to make the most informed choice from a range of potential answers. While our model is effective in the AS paradigm, it is worth considering its adaptability in a landscape where the aim of AI shifts towards producing a single, precise answer. This perspective not only paves the way for further enhancements to existing AS models but also encourages a rethinking of AI's role in QA systems.
This work introduces an AS model that integrates RL, DistilBERT word embedding, and an enhanced DE method. The model employs two attention-based LSTM networks and a feed-forward network, learns from both positive and negative question-answer pairs, and uses DistilBERT for semantic matching without hand-engineered features. An improved DE algorithm searches the weight space for initial values from which BP training of the LSTMs and the feed-forward network starts, using a selective mutation operator and a novel updating strategy to generate candidate solutions. RL addresses data imbalance in the BP step by treating classification as a sequential decision-making process. The agent observes training examples as environment states, classifies them, and earns rewards based on correct or incorrect classifications, with the reward system favoring the minority class. The efficacy of the method is demonstrated on three benchmark datasets, TrecQA, LegalQA, and WikiQA, on which it outperforms existing models.
Our primary contributions are as follows:
• The adoption of DistilBERT, a state-of-the-art language representation model, to obtain sophisticated word embeddings that enrich the semantic understanding of question and answer texts.
• The introduction of a novel RL-based model designed specifically to mitigate the challenges presented by data imbalance, thereby enhancing the reliability and robustness of the analysis.
• The deployment of an advanced DE algorithm for the crucial task of weight initialization, which is expected to improve the predictive accuracy and computational efficiency of the proposed model.
The remaining sections of this article are organized as follows. In Section II, a summary of the relevant work is provided; in Section III, the required background is presented; in Section IV, the structure of the proposed model is described; and in Section V, evaluation metrics, datasets, and results are provided. In Section VI, the study concludes by detailing the lessons learnt and suggesting further work.

II. RELATED WORK
The early studies on AS marked the initial attempts to tackle the task using feature engineering techniques. These methods, such as counting common words, bag-of-phrases, and bag-of-grams, provided a basic understanding of the structure and content of questions and answers [23]. However, their reliance on surface-level features limited their ability to capture the deeper semantic nuances inherent in natural language. Recognizing the need to overcome this limitation, subsequent research delved into more sophisticated approaches for AS. Linguistic tools like WordNet emerged as valuable resources for incorporating semantic knowledge into the selection process [24]. WordNet enabled researchers to enrich the analysis of questions and answers by considering the meanings and associations conveyed by individual words. Furthermore, researchers sought to exploit the syntactic structure of sentences to enhance AS performance. Techniques like dependency tree analysis and tree distance algorithms were employed to capture the relationships between words and their syntactic roles within a sentence. By considering the hierarchical structure and dependencies encoded in these linguistic representations, researchers aimed to gain deeper insights into the meaning and coherence of questions and answers, enabling more effective selection algorithms. The incorporation of semantic and syntactic analysis in AS research represented a significant shift towards a more comprehensive understanding of language. These approaches recognized that the success of answer selection lies not only in surface-level matching but also in capturing the underlying meaning and context conveyed by questions and answers. As a result, the field witnessed the emergence of more sophisticated methods that combined linguistic tools, syntactic analysis, and semantic knowledge to improve the accuracy and relevance of answer selection algorithms.
In recent years, deep learning models have emerged as powerful tools for AS, leveraging automated feature extraction to improve performance and enhance the understanding of question-answer pairs [25], [26]. When working with question-answer pairs, researchers have explored two main approaches. The first approach computes distinct elements of the question-answer pair, with deep networks generating independent representation vectors for questions and answers. To measure the interdependence between these vectors, various criteria have been employed, enabling the comparison and similarity assessment of question-answer pairs [25], [26]. For instance, Wang and Jiang proposed a comparative model that incorporates multiple indicators to measure similarity, taking into account different aspects of the question-answer relationship [25]. Similarly, Yun et al. showcased the advantages of language-based models, using the language model ELMo to capture contextual information and semantic meaning in question-answer pairs [26]. The second approach treats the question and answer as standalone sentences, allowing researchers to employ specific techniques for their analysis. Severyn and Moschitti used CNNs to assess the similarity between question-answer pairs, exploiting the local dependencies and patterns within the sentences [27]. Wang and Nyberg, on the other hand, used bidirectional LSTM networks, which read the word embeddings in both directions to capture the contextual information of the question and answer [28]. The resulting relation between the answer and the question is fed into a feed-forward network for further processing and classification.
Siamese networks have also gained popularity in QA tasks, providing separate representation vectors for questions and answers [29], [30]. These networks enable the comparison of question-answer pairs by computing distance or similarity metrics in the learned feature space. For instance, Yu et al. proposed a deep learning model for AS tasks, employing CNNs and logistic regression to capture the relevant features for answer selection [29]. Similarly, Dryer et al. implemented a similar approach using CNNs and distributed vector representations, enabling the model to learn more nuanced features for question-and-answer representation [30]. To further enhance candidate response selection, researchers have explored pre-processing operations. One such operation replaces named entities with unique tokens, simplifying the selection process and enabling better identification of potentially correct answers [31], [32]. This pre-processing step can help alleviate the challenges posed by named entities and improve the accuracy of answer selection.
In addition to pre-processing, attention mechanisms have emerged as a valuable strategy in AS research. Initially introduced for machine translation, attention methods have found applications in QA tasks [33]-[35]. These mechanisms allow the model to focus on the most relevant parts of the question and answer by considering the contextual interplay between them. Researchers such as Jan et al. have proposed using Recurrent Neural Networks (RNNs) with attention mechanisms for response selection, effectively capturing the informative parts of question-answer pairs [33]. Tay et al. suggested bidirectional alignment and a generalized RNN-based method to further improve the attention mechanism's performance [34]. Additionally, He et al. demonstrated that combining CNNs with attention mechanisms can lead to improved performance compared to using RNNs alone [35].
Knowledge-based approaches have also been explored, aiming to leverage external knowledge sources to enhance answer selection. Shen et al. developed a knowledge-based approach that combines an attentive bidirectional LSTM network with a knowledge graph (KG) to represent questions and answers, enabling the model to leverage structured knowledge in the understanding and selection process [36]. Other techniques have addressed specific challenges in AS, such as data imbalance. Researchers have used separate LSTM networks for questions and answers, followed by a Multi-Layer Perceptron (MLP) network for classification, and incorporated per-class penalties to tackle data imbalance and improve classification performance [37]. Matsubara et al. used a search engine and a transformer model to select the correct answer, employing models like Jaccard Similarity and Compare-Aggregate to assess the relevance of candidate responses [38]. Furthermore, Kim et al. proposed an architecture based on proximity reference, using an attention mechanism to retain information and autoencoders to reduce the information volume, enhancing the model's efficiency and effectiveness [39].
The recent advancements in deep learning models for AS have showcased the versatility and power of these approaches in capturing the complex relationships and semantics present in question-answer pairs. By leveraging techniques such as attention mechanisms, linguistic tools, knowledge graphs, and pre-processing operations, researchers have made significant strides in improving AS performance, ultimately enabling more accurate and relevant answer selection.
Despite the advantages of automatic feature extraction in deep models for AS, several challenges still affect their performance. Typically, these models employ random weight initialization and are trained using the backpropagation (BP) algorithm, which makes them prone to becoming trapped in local optima. In addition, they face difficulties in learning binary classification tasks, particularly when dealing with the imbalanced datasets that are common in AS.

III. BACKGROUND
In this section, the prerequisites required to study the rest of the paper are briefly reviewed.

A. Long Short-Term Memory (LSTM)
The LSTM architecture, introduced by Hochreiter and Schmidhuber [40], is a type of neural network designed to model dependencies within sequences of variable length. Unlike conventional neural networks, it integrates a memory cell into its hidden layer, which allows it to capture relationships that extend beyond the immediate context. This makes LSTMs particularly well suited to modelling long-range dependencies. At the heart of the LSTM architecture are memory cells designed to retain and modify information over time. Each cell is controlled by three gates: the input gate ($i_t$), the forget gate ($f_t$), and the output gate ($o_t$). These gates manage the flow of information within the LSTM unit, regulating what is stored, discarded, and output. The input gate ($i_t$) determines how much new information is written into the memory cell. It considers the current input ($x_t$) and the previous hidden state ($h_{t-1}$) and, based on their interaction, decides which information is relevant for updating the cell state. The forget gate controls how much of the memory cell's content from the previous time step is preserved. It assesses the current input and the previous state and decides what should be discarded from the cell. The output gate determines how much of the memory cell's content is passed to the output and thus shapes the hidden state of the LSTM: it is applied to the updated cell state to decide what is emitted as output. Through the combination of these gates, the LSTM network can selectively preserve and update information over time, which enables it to capture both short-term and long-range dependencies within sequences. This ability to retain pertinent information at the appropriate time steps makes LSTMs effective in a range of tasks such as language processing, speech recognition, and time series prediction.
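Mathematically, the LSTM equations can be defined as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, $c_t$ is the cell state, $h_t$ is the hidden state, and $W$, $U$, and $b$ are the learnable weights and biases of each gate.
A bidirectional LSTM (BLSTM) extends an LSTM network to process the input in both directions. This can be useful in AS since the answer may be formed by rearranging the words of the question. In a BLSTM network, the forward and backward state vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are generated by reading the input in each direction and are combined as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. LSTM and BLSTM networks treat all input positions as equally important, which can confuse the network. To cope with this problem, an attention mechanism can be used. To this end, each state $h_t$ is accompanied by a coefficient $\alpha_t$, so the final state for a sequence of length $T$ is computed as:

$$h = \sum_{t=1}^{T} \alpha_t h_t$$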
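As a rough illustration of the gate mechanics and the attention-weighted final state described above, the following sketch runs a single-direction LSTM over a short random sequence in plain Python. It assumes the standard LSTM formulation; the dimensions, the random parameters, and in particular the scoring rule used to obtain the attention coefficients are hypothetical choices made only for this example, not the configuration of the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Return W x + U h + b for matrices stored as lists of rows."""
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h)) + bi
            for W_row, U_row, bi in zip(W, U, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, b)]
    i_t = gate("i", sigmoid)      # input gate
    f_t = gate("f", sigmoid)      # forget gate
    o_t = gate("o", sigmoid)      # output gate
    g_t = gate("c", math.tanh)    # candidate cell content
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def attention_pool(states, scores):
    """Softmax-normalise the scores and return (weighted sum, weights)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(states[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, states)) for k in range(dim)]
    return pooled, alpha

rng = random.Random(0)
D, H, T = 3, 2, 4   # input size, hidden size, sequence length (arbitrary)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {g: (rand_matrix(H, D), rand_matrix(H, H), [0.0] * H) for g in "ifoc"}
h, c = [0.0] * H, [0.0] * H
states = []
for _ in range(T):
    x_t = [rng.uniform(-1, 1) for _ in range(D)]
    h, c = lstm_step(x_t, h, c, params)
    states.append(h)

# Hypothetical scoring rule (sum of the hidden units), for illustration only.
scores = [sum(state) for state in states]
pooled, alpha = attention_pool(states, scores)
```

A bidirectional variant would run the same step function over the reversed sequence and combine the two state vectors at each position before pooling.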

B. Differential Evolution
Differential evolution (DE) [41] has gained widespread recognition as a powerful population-based method capable of effectively solving a wide range of optimization problems [42]. DE operates through three essential operations: mutation, crossover, and selection. The DE algorithm commences by initializing a population, usually obtained by sampling from a uniform distribution. This population serves as the foundation for the subsequent evolutionary process. The mutation operator plays a pivotal role in DE, as it generates a mutant vector that introduces diversity and exploration into the population. Through mutation, new candidate solutions are created by perturbing the existing individuals in the population. This perturbation combines the information from multiple individuals into a new candidate solution, typically through vector arithmetic. The mutation operator randomly selects a set of individuals from the population and uses them to compute the mutant vector: the difference between two randomly selected individuals is multiplied by a scaling factor and added to a base individual. The resulting mutant vector represents a potential new solution that explores the search space in an attempt to discover better regions of the optimization landscape. By introducing novel solutions, the mutation operator maintains population diversity and helps DE navigate the optimization landscape and escape local optima. The quality and diversity of the mutant vectors greatly influence the overall performance of DE and its ability to converge to an optimal solution.
The mutation operator creates a mutant vector as follows:

$$v_i = x_{r_1} + F\,(x_{r_2} - x_{r_3})$$

where $x_{r_1}$, $x_{r_2}$, and $x_{r_3}$ are three distinct candidate solutions randomly chosen from the current population, and $F$ is a scale factor.
Mutant and target vectors are combined during crossover. This can be done using the well-known binomial crossover:

$$u_{i,j} = \begin{cases} v_{i,j}, & \text{if } \mathrm{rand}(0,1) \le CR \text{ or } j = j_{rand} \\ x_{i,j}, & \text{otherwise} \end{cases}$$

where $CR$ is the crossover ratio, $j_{rand}$ is a random number selected from $\{1, 2, \ldots, D\}$, and $D$ is the dimensionality of a candidate solution. After crossover, the selection operator keeps the better of the target and trial vectors.
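The three DE operations can be sketched in a few lines of Python. The example below is a minimal illustration of the classic rand/1 mutation with binomial crossover and greedy selection for a minimisation problem; the values of `F` and `CR`, the population size, the sphere test function, and the choice to exclude the target from the random picks are all assumptions made for the example rather than settings taken from the paper.

```python
import random

def de_generation(pop, fitness, F=0.5, CR=0.9, rng=random):
    """One generation of DE (rand/1 mutation, binomial crossover), minimising fitness."""
    D = len(pop[0])
    next_pop = []
    for i, target in enumerate(pop):
        # Mutation: three distinct individuals, here also different from the target.
        r1, r2, r3 = rng.sample([j for j in range(len(pop)) if j != i], 3)
        mutant = [pop[r1][d] + F * (pop[r2][d] - pop[r3][d]) for d in range(D)]
        # Binomial crossover: position j_rand always comes from the mutant.
        j_rand = rng.randrange(D)
        trial = [mutant[d] if (rng.random() <= CR or d == j_rand) else target[d]
                 for d in range(D)]
        # Selection: keep the better of the trial and target vectors.
        next_pop.append(trial if fitness(trial) <= fitness(target) else target)
    return next_pop

def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)

rng = random.Random(1)
pop = [[rng.uniform(-5, 5) for _ in range(4)] for _ in range(12)]
best_start = min(sphere(x) for x in pop)
for _ in range(50):
    pop = de_generation(pop, sphere, rng=rng)
best_end = min(sphere(x) for x in pop)
```

Because selection is greedy, no individual ever gets worse from one generation to the next, so the best objective value in the population cannot increase.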

IV. PROPOSED MODEL
Fig. 1 depicts the general framework of the proposed technique, which has three key stages: pre-processing, word embedding, and prediction. In the first stage, unnecessary words and symbols are eliminated. In the second stage, the embedding vector of each word is retrieved using DistilBERT, and in the third stage the similarity between the two sentences is predicted. The proposed method employs a clustering-based differential evolution technique to determine the initial values of the network weights, while the RL-based algorithm is used to address the class imbalance.

A. Pre-Processing
Pre-processing is a vital part of any NLP system because the characters, words, and sentences identified in this stage are passed to the later stages. The pre-processing output therefore has a significant impact on the quality of the final results.
Common stop-word elimination and stemming techniques are employed in the approach. Stop words are parts of sentences that can be regarded as overhead; the most common ones are articles, prepositions, pronouns, etc. They are removed because they cannot function as keywords. To decrease the dimensionality of the term space, stemming is used to identify the stem of a word. For instance, the terms 'watches', 'watched', and 'watching' can all be reduced to the stem 'watch'. Stemming removes ambiguity and reduces the number of words as well as the time and memory requirements.
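The two pre-processing steps can be illustrated with a small, self-contained sketch. The stop-word set and the suffix-stripping stemmer below are deliberately tiny stand-ins written for this example; they are not the lists or the stemmer used in the experiments, where a library stemmer such as the Porter stemmer available in NLTK would be the natural choice.

```python
import re

# Toy stop-word list (assumption): articles, prepositions, pronouns, and a few auxiliaries.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are", "was",
              "it", "he", "she", "they", "and", "or", "for", "with", "by", "at"}

# Toy suffix list (assumption); a real stemmer applies far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Lower-case, drop symbols and stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("She was watching the watches in the shop!")
print(tokens)  # ['watch', 'watch', 'shop']
```

Both inflected forms collapse to the same stem, which is how stemming reduces the size of the term space.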

B. Word Embedding
Word embedding represents words as semantic vectors that deep learning algorithms can compare. It is an effective technique for producing accurate context-based representations of the words of interest.
Many experiments have sought the most effective way to represent words in neural network models. Recently, pre-trained language models (PLMs), which capture prior knowledge of natural language and are then fine-tuned, have been widely used for NLP tasks. PLMs typically learn their parameters from unlabeled data.
In this article, DistilBERT, one of the more recent PLMs, is used for word embedding. DistilBERT is a contextual language model pre-trained on large corpora, such as Wikipedia, in order to produce contextual representations. It is common practice to fine-tune the linear layers of DistilBERT to address different classification tasks. Fine-tuning adapts the general semantics learned during pre-training to the target classification task. Many earlier language models build unidirectional embeddings, which ignore part of the context. In contrast, DistilBERT uses a bidirectional transformer that conditions its representations on the left and right context simultaneously.

C. Prediction
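Our predictive model comprises two attention-based BLSTMs and one feed-forward network. The two BLSTMs extract embeddings for the question and response sentences. The feed-forward network predicts the degree to which the two sentences are similar. Consider $Q = \{q_1, q_2, \ldots, q_n\}$ and $A = \{a_1, a_2, \ldots, a_m\}$, where $q_i$ and $a_i$ represent the $i$-th word in the question and response, respectively.
Because of the length restriction in the BLSTM, $Q$ and $A$ can include only $n$ and $m$ words, respectively, with both limits fixed in this work. After feeding $Q$ and $A$ into their respective BLSTMs, the attention mechanism computes their embeddings in the following manner:

$$e_Q = \sum_{i=1}^{n} \alpha_i h_i^{Q}, \qquad e_A = \sum_{i=1}^{m} \beta_i h_i^{A}$$

where $h_i^{Q}$ and $h_i^{A}$ represent the $i$-th hidden vectors in the BLSTMs, and $\alpha_i$, $\beta_i$ are the $i$-th attention weights for each unit in the BLSTMs, obtained by normalizing the attention scores $s_i^{Q}$ and $s_i^{A}$ of the hidden vectors:

$$\alpha_i = \frac{\exp(s_i^{Q})}{\sum_{j=1}^{n} \exp(s_j^{Q})}, \qquad \beta_i = \frac{\exp(s_i^{A})}{\sum_{j=1}^{m} \exp(s_j^{A})}$$

During pretraining, the improved differential evolution algorithm is used to determine suitable initial weights. The weights obtained during the pretraining phase then serve as the initial weights for the fine-tuning phase.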

1) Pretraining:
The weights of the LSTM, the attention mechanism, and the feed-forward neural network are initialized at this stage. To achieve this, an improved differential evolution method is introduced, incorporating a clustering scheme and a novel fitness function. A clustering-based mutation and update technique is used in the modified DE algorithm to boost optimization efficiency.
The proposed mutation operator, which was inspired by [40], identifies a promising region of the search space. To do so, the k-means clustering algorithm divides the current population $P$ into $k$ clusters, each representing a distinct region of the search space. The number of clusters is picked at random from $[2, \sqrt{N_P}]$, where $N_P$ is the population size. The cluster with the lowest mean fitness is selected as the winning cluster. The proposed clustering-based mutation is defined as:

$$v_i^{clu} = \overrightarrow{win_g} + F\,(x_{r_1} - x_{r_2})$$

where $\overrightarrow{win_g}$ is the best solution in the promising region, and $x_{r_1}$ and $x_{r_2}$ are two randomly selected candidate solutions from the current population. It should be noted that $\overrightarrow{win_g}$ is not always the best solution in the population. The clustering-based mutation procedure is performed $M$ times.
After the new solutions have been produced through clustering-based mutation, the current population is updated. The steps are as follows:
• Selection: Generate $k$ individuals randomly as initial seeds of the k-means algorithm;
• Generation: Generate $M$ solutions using clustering-based mutation as set $v^{clu}$;
• Replacement: Choose $M$ solutions at random from the current population as set $B$;
• Update: The best $M$ solutions from $v^{clu} \cup B$ are determined as set $B'$. The new population is afterwards calculated as $(P - B) \cup B'$.
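The clustering-based mutation and the population update can be sketched together as follows. This is a minimal illustration under several assumptions: fitness is minimised, the number of clusters is drawn between 2 and the square root of the population size, k-means is a bare-bones implementation written for the example, and the values of `F` and `M` as well as the sphere objective are arbitrary. None of the function names comes from the paper.

```python
import math
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, k, rng, iters=10):
    """Plain k-means seeded with k random individuals; returns a label per point."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def clustering_update(pop, fitness, M, F=0.5, rng=random):
    """Clustering-based mutation followed by the population update (minimisation)."""
    n = len(pop)
    k = rng.randint(2, max(2, int(math.sqrt(n))))
    labels = kmeans_labels(pop, k, rng)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    def mean_fitness(indices):
        return sum(fitness(pop[i]) for i in indices) / len(indices)

    # Winning cluster: lowest mean fitness; win_g is its best member.
    winner = min(clusters.values(), key=mean_fitness)
    win_g = min((pop[i] for i in winner), key=fitness)

    # Generation: M mutants built around win_g.
    v_clu = []
    for _ in range(M):
        r1, r2 = rng.sample(range(n), 2)
        v_clu.append([w + F * (a - b) for w, a, b in zip(win_g, pop[r1], pop[r2])])

    # Replacement and update: best M of (v_clu U B) replace the random set B.
    b_idx = set(rng.sample(range(n), M))
    pool = v_clu + [pop[i] for i in b_idx]
    b_new = sorted(pool, key=fitness)[:M]
    return [pop[i] for i in range(n) if i not in b_idx] + b_new

def sphere(x):
    return sum(v * v for v in x)

rng = random.Random(0)
pop = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(16)]
new_pop = clustering_update(pop, sphere, M=4, rng=rng)
```

Since the replaced set `B` also competes in the pool, the update keeps the population size constant and never discards the best solution found so far.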
The fundamental structure of the proposed model comprises two LSTM networks with their respective attention mechanisms and a feed-forward network. As depicted in Fig. 2, in the proposed DE algorithm, all weights and bias terms are organized into a vector to form a candidate solution.
To assess the quality of a candidate solution, the fitness function is defined as:

$$F = \sum_{i=1}^{N} (y_i - \tilde{y}_i)^2$$

where $N$ is the total number of training samples, $y_i$ is the $i$-th desired target, and $\tilde{y}_i$ is the corresponding model prediction.
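The encoding of all weights and biases into one candidate vector, and the scoring of a candidate against the training targets, might look like the sketch below. The layer shapes, the helper names, and the use of a plain sum of squared errors as the fitness value are assumptions made for illustration; the paper's exact fitness expression may differ.

```python
def flatten(layers):
    """Concatenate every weight matrix and bias vector into one flat list."""
    vector = []
    for weights, biases in layers:
        for row in weights:
            vector.extend(row)
        vector.extend(biases)
    return vector

def unflatten(vector, shapes):
    """Inverse of flatten; shapes lists (rows, cols) for each layer in order."""
    layers, pos = [], 0
    for rows, cols in shapes:
        weights = [vector[pos + r * cols: pos + (r + 1) * cols] for r in range(rows)]
        pos += rows * cols
        biases = vector[pos: pos + rows]
        pos += rows
        layers.append((weights, biases))
    return layers

def sum_squared_error(targets, predictions):
    """Assumed fitness: lower means the candidate fits the targets better."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))

# Two tiny hypothetical layers: a 2x2 layer and a 1x2 layer.
layers = [([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0]),
          ([[7.0, 8.0]], [9.0])]
shapes = [(2, 2), (1, 2)]

candidate = flatten(layers)               # what DE mutates and recombines
restored = unflatten(candidate, shapes)   # what the network is loaded with
error = sum_squared_error([1, 0, 1], [0.9, 0.2, 0.6])
```

In a full pipeline, the predictions passed to the fitness function would come from running the network with the weights decoded from each candidate vector.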
2) Classification: An RL-based algorithm is employed to tackle the imbalance problem caused by the differing numbers of samples in the classes. Each question-answer pair in the training dataset constitutes a state of the environment, and the network is the agent that performs a sequence of classifications over all pairs. When the agent predicts the class label of a pair, it takes an action: the pair seen at time step $t$ is the state $s_t$, and the classification performed is the action $a_t$. In return, the environment provides a reward, $r_t$, to guide the agent. Reward values are assigned such that classifying a sample from the majority class yields a lower absolute value than classifying a sample from the minority class. The reward function is:

$$r_t(s_t, a_t, l_t) = \begin{cases} +1, & a_t = l_t \text{ and } s_t \in D_P \\ -1, & a_t \ne l_t \text{ and } s_t \in D_P \\ +\lambda, & a_t = l_t \text{ and } s_t \in D_N \\ -\lambda, & a_t \ne l_t \text{ and } s_t \in D_N \end{cases}$$

where $l_t$ is the true label of $s_t$, and $D_P$ and $D_N$ are the sets of minority-class and majority-class samples, respectively. Correct/incorrect classification of a sample from the majority class yields a reward of $+\lambda$/$-\lambda$, where $0 < \lambda < 1$.

Fig. 2. Encoding strategy in the proposed algorithm.
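The imbalance-aware reward can be written as a very small function. In this sketch the minority class is assumed to be labelled 1 and the value 0.2 for the majority-class reward magnitude is an arbitrary placeholder; both are illustrative assumptions rather than values reported in the paper.

```python
def reward(action, label, minority_label=1, lam=0.2):
    """Return +/-1 for minority-class samples and +/-lam for majority-class samples.

    The sign is positive when the predicted class (action) matches the true label.
    lam must lie strictly between 0 and 1 so that the majority class counts less.
    """
    magnitude = 1.0 if label == minority_label else lam
    return magnitude if action == label else -magnitude

# Rewards for the four possible (action, label) combinations.
rewards = {
    "minority correct": reward(action=1, label=1),
    "minority wrong": reward(action=0, label=1),
    "majority correct": reward(action=0, label=0),
    "majority wrong": reward(action=1, label=0),
}
```

Choosing a smaller `lam` makes mistakes on the rare positive pairs weigh more heavily relative to mistakes on the abundant negative pairs.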
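The agent's objective in deep Q-learning is to select actions such that the sum of discounted future rewards ($G_t$) is maximized:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

where $\gamma$ is the discount factor, $r_k$ is the immediate reward at time step $k$, and $T$ is the last time step of the episode. Using $\gamma$, more importance is given to rewards in the near future (closer to the current time step $t$) than to those in the distant future. Each episode is terminated if all of the samples are classified correctly or at least one sample from the minority class is misclassified. The expected return of taking action $a$ in state $s$ at time step $t$ and following policy $\pi$ afterwards is computed as:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right]$$

where $Q^{\pi}$ is called the action-value function. At each state $s$, the optimal action is the one that maximizes the action-value function:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$

where the maximization is taken over all possible policies. The recursive form of the action-value function can be written as:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_t + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1})\right]$$

The optimal action-value function can be estimated iteratively using the Bellman equation:

$$Q^{*}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\right]$$

During training, upon observing state $s_t$, the policy network outputs action $a_t$. After executing this action, the environment returns a reward $r_t$, and the next state becomes $s_{t+1}$. The tuple $(s_t, a_t, r_t, s_{t+1})$ is then saved into the replay memory $\mathcal{M}$. Mini-batches $\mathcal{B}$ of these tuples are drawn randomly from the replay memory and used to update the network parameters via gradient descent. The update is based on the following loss function:

$$L(\theta_i) = \sum_{(s, a, r, s') \in \mathcal{B}} \left(y - Q(s, a; \theta_i)\right)^2$$

where $\theta_i$ denotes the network parameters at the $i$-th training iteration, and $y$ is the estimated target for the $Q$ function. The target is equal to the immediate reward for the state-action pair plus the discounted maximum future Q value:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$$

For terminal states, $y$ is equal to $r$. At the $i$-th iteration, the gradient of the loss function is calculated as follows:

$$\nabla_{\theta_i} L(\theta_i) = -2 \sum_{(s, a, r, s') \in \mathcal{B}} \left(y - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)$$

The network weights are updated using the gradient of the loss function as follows:

$$\theta_{i+1} = \theta_i - \eta\, \nabla_{\theta_i} L(\theta_i)$$

where $\eta$ is the learning rate.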
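To make the discounted return, the Q-learning target, and the squared loss concrete, the following sketch evaluates them on a two-transition mini-batch. A small lookup table stands in for the Q network, and the states, Q-values, rewards, and discount factor are all made-up numbers chosen so that the arithmetic is easy to follow; none of them comes from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards discounted by gamma per step, starting at the first reward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def td_targets(batch, q_next, gamma):
    """Target y for each (state, action, reward, next_state, terminal) tuple.

    q_next(next_state) returns the Q-values under the previous parameters.
    For terminal transitions the target is just the reward.
    """
    return [r if terminal else r + gamma * max(q_next(s2))
            for (_, _, r, s2, terminal) in batch]

def squared_td_loss(batch, targets, q_values):
    """Sum over the mini-batch of (y - Q(s, a))^2."""
    return sum((y - q_values(s)[a]) ** 2
               for (s, a, _, _, _), y in zip(batch, targets))

# Hypothetical Q-table standing in for the network: state -> [Q(s, 0), Q(s, 1)].
q_table = {"s0": [0.5, 0.1], "s1": [0.2, 0.4]}

def q_lookup(state):
    return q_table[state]

batch = [
    ("s0", 0, 1.0, "s1", False),   # correct minority-class decision, episode continues
    ("s1", 1, -1.0, None, True),   # minority-class error, episode terminates
]
gamma = 0.9
targets = td_targets(batch, q_lookup, gamma)       # [1.0 + 0.9 * 0.4, -1.0]
loss = squared_td_loss(batch, targets, q_lookup)   # (1.36 - 0.5)^2 + (-1.0 - 0.4)^2
g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

In actual training the loss would be differentiated with respect to the network parameters by an automatic differentiation framework rather than evaluated on a table.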

V. EXPERIMENTAL RESULTS
In this section, the conducted experiments are detailed.

A. Datasets
The following three benchmark datasets are used in the experiments (see Table I for their statistics):
• TrecQA [43] is built from the TREC QA track data. Yao et al. [44] used two training sets, TRAIN and TRAIN-ALL, to construct an extended set of positive and negative pairs. The correctness of the answers in the TRAIN-ALL set is verified automatically by matching pairs against regular expressions, whereas the answers in the TRAIN, DEV, and TEST sets were all assessed manually. The TRAIN-ALL set is used to train the model.
• LegalQA [45] is a dataset of legal question-and-answer submissions from the Chinese community, in which inquiries were answered online by licensed attorneys. LegalQA comprises four fields: Question Title, Question Text, Answer, and Label. The Label field designates the true positive pairs.
• WikiQA [46] is an open-domain question answering dataset in which each question is linked to a Wikipedia page that is regarded as the topic of the question. To avoid bias in the answer sentences, all sentences in the summary section of the page are taken as candidate answers.

B. Metrics
According to earlier research, the most widely used evaluation measures for the answer-selection task are MAP and MRR [47]. MAP evaluates the quality of the entire ranking of candidate answers, whereas MRR depends only on how highly the first correct answer is ranked. The Mean Average Precision (MAP) is computed as:

$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{m_i} \sum_{j=1}^{m_i} \mathrm{Precision}(R_{ij})$$

where $Q$ is the question set, $m_i$ is the number of relevant answers to the $i$-th question, and $R_{ij}$ is the set of top-ranked candidates, from the first down to the $j$-th relevant answer, selected from the available answers. The Mean Reciprocal Rank (MRR) is determined by the position of the first correct answer and is calculated as follows:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ denotes the position of the first correct answer for the $i$-th question.

C. Details of model
The experiments were implemented using Python and PyTorch. For natural language processing in Python, the NLTK package was used. A two-layer LSTM with a hidden size of 64 was employed. Because the output vectors of the two LSTM networks are connected, batch normalization is applied before they are passed to the feedforward neural network. The tests were conducted on a computer with 64 GB of memory, a 64-bit Windows operating system, and a graphics processing unit (GPU). The most effective models for LegalQA, TrecQA, and WikiQA were identified after 50, 60, and 100 epochs, respectively. For the three datasets, training took 5, 20, and 60 hours, respectively.
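Returning to the measures defined in the Metrics subsection, MAP and MRR can be computed from ranked relevance labels as in the following minimal sketch. The function names and the toy rankings are hypothetical; each ranking lists, from best to worst score, whether a candidate answer is correct (1) or not (0). Questions without any correct answer contribute 0 here, which is an assumption, since such questions are often removed before evaluation.

```python
def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant answer."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant answer, or 0 if there is none."""
    for k, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / k
    return 0.0


def mean_over_questions(metric, rankings):
    """Average a per-question metric over the whole question set."""
    return sum(metric(r) for r in rankings) / len(rankings)


# Two toy questions: correct answers at ranks 2 and 4, and at rank 1.
rankings = [[0, 1, 0, 1], [1, 0, 0]]
map_score = mean_over_questions(average_precision, rankings)
mrr_score = mean_over_questions(reciprocal_rank, rankings)
```

For these two toy questions the average precisions are 0.5 and 1.0 and the reciprocal ranks are 0.5 and 1.0, so both `map_score` and `mrr_score` equal 0.75.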
Table III shows the margin by which the proposed method outperforms other methods. The proposed model consistently demonstrates a clear advantage over other widely recognized methods in the domain. For the MRR and MAP metrics on the LegalQA dataset, the proposed model yields improvements ranging from +0.077 to +0.231, with the largest improvement observed against the DARCNN method. This behavior is consistent across all datasets, underscoring the robustness and adaptability of the proposed approach. Notably, even the ablated variants of the proposed model, "Proposed (no RL and DE)" and "Proposed (no RL)", outperform most other techniques. Although these variants lack certain components, they still deliver strong results, which indicates the strength of the underlying model. Another point of interest is the comparison with BERT-Base and its more compact version, DistilBERT. Although BERT-Base is a powerful model in NLP, the margins in Table III show that the proposed model surpasses it, which supports the methods integrated into the new model.
Addressing Imbalance: The margins over models such as P-CNN and DARCNN, in particular the substantial gains of +0.285 for TrecQA (MRR) and +0.231 for LegalQA (MRR), highlight the challenges posed by data imbalance in the AS domain. The resilience of the proposed model to such challenges, coupled with its strong results, underscores its potential for handling imbalanced datasets effectively.
1) Comparison with other metaheuristics: In this section, the enhanced DE algorithm is compared with a variety of metaheuristic optimization algorithms. To this end, each metaheuristic is used to obtain the initial model parameters, while the other components of the model, such as pre-processing, word embedding, LSTM, network structure, and RL, are kept unchanged. Nine algorithms are investigated: (standard) DE [57], grey wolf optimization (GWO) [58], bat algorithm (BA) [59], dragonfly algorithm (DA) [47], salp swarm algorithm (SSA) [60], cuckoo optimization algorithm (COA) [61], human mental search (HMS) [40], whale optimization algorithm (WOA) [62], and artificial bee colony (ABC) [63].
The population size and the number of function evaluations for all algorithms were set to 150 and 4,000, respectively. Table IV lists the parameter settings. Table V displays the results for each of the three datasets. On every dataset, the proposed DE algorithm performs better than all other algorithms, with standard DE as the runner-up.
2) Reward function: The reward function directs the agent toward its goal by assigning appropriate rewards to its actions. ±1 and ±λ were selected as the rewards for the minority and majority classes, respectively. The value of λ is determined by the ratio of the sample size of the majority class to that of the minority class: as the majority/minority ratio rises, the value of λ decreases. To see how changing λ affects the performance of the model, the minority-class reward is held constant at ±1 and λ is varied over a set of values in [0, 1]. The evaluation results for the three datasets are displayed in Fig. 3. The plots in Fig. 3(a), Fig. 3(b), and Fig. 3(c) all show an ascending trend up to a certain value of λ and a descending trend beyond it. For λ = 0, the majority class is disregarded entirely, while for λ = 1, both classes are regarded as equally significant. Even though the minority class is more important, the impact of the majority class should not be ignored.
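The relationship between the class ratio and λ can be made concrete with a small sketch. The exact rule used to derive λ is not given in this section, so the heuristic below is an assumption: it picks the λ at which the total absolute reward available from the majority class equals that available from the minority class. The function names and sample counts are hypothetical.

```python
def balancing_lambda(n_minority, n_majority):
    """Assumed heuristic: the lam for which n_majority * lam equals n_minority * 1."""
    if n_minority <= 0 or n_majority <= 0:
        raise ValueError("both classes must contain at least one sample")
    return min(1.0, n_minority / n_majority)


def reward_mass(n_minority, n_majority, lam):
    """Total absolute reward obtainable from each class in one full pass."""
    return n_minority * 1.0, n_majority * lam


lam_4x = balancing_lambda(n_minority=100, n_majority=400)  # majority is 4x larger
lam_8x = balancing_lambda(n_minority=100, n_majority=800)  # majority is 8x larger
minority_mass, majority_mass = reward_mass(100, 400, lam_4x)
```

Under this heuristic, a fourfold majority gives λ = 0.25 and an eightfold majority gives λ = 0.125, which matches the stated trend that λ decreases as the majority/minority ratio rises. With λ = 0.25 both classes contribute a total absolute reward of 100, so neither class dominates the learning signal.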
3) Examples: A qualitative example is provided to evaluate the efficacy of RL in the model, using the question "Who is the president or chief executive of Amtrak?" from the TrecQA dataset. The top five answers retrieved by the model with and without RL are shown in Fig. 4. As can be seen, the model without RL is more likely to select negative answers, whereas the model with RL assigns the highest score to the correct answer.
Word embeddings:
In this section, the performance of DistilBERT, which is adopted in the method for word embedding, is compared against five other word embedding methods. One-Hot Encoding [64] creates one binary feature per category and, for each instance, sets only the feature corresponding to its category. CBOW and Skip-gram [65] use neural networks to map words to embedding vectors. GloVe [66] is an unsupervised learning algorithm trained on global word co-occurrence statistics. FastText [67] [65] extends the Skip-gram model by representing each word as a set of character n-grams instead of learning a single vector per word. Table VI shows the results of the conducted experiment. As expected, One-Hot Encoding has the lowest performance among the evaluated methods. CBOW and Skip-gram perform similarly, and both outperform GloVe, while FastText gives better results still. However, the best performance is achieved by the DistilBERT model, which is the motivation behind its use in the approach.
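Two of the simpler representations compared above can be illustrated in a few lines of Python. This is a minimal sketch with hypothetical function names and a toy vocabulary: `one_hot` builds the binary vector of One-Hot Encoding, and `char_ngrams` lists FastText-style character n-grams using `<` and `>` as word-boundary symbols. It does not train any embedding and is unrelated to the experimental code behind Table VI.

```python
def one_hot(word, vocabulary):
    """Binary vector with a 1 only at the position of `word` in the vocabulary."""
    return [1 if entry == word else 0 for entry in vocabulary]


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with the boundary symbols < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


vocabulary = ["question", "answer", "pair"]
vector = one_hot("answer", vocabulary)
ngrams = char_ngrams("where", n=3)
```

Here `vector` is `[0, 1, 0]` and `ngrams` is `['<wh', 'whe', 'her', 'ere', 're>']`. One-hot vectors carry no notion of similarity between words, which is consistent with their weak results, whereas sharing character n-grams lets related word forms share parameters.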

4) Discussion:
The proposed model in this study addressed the class imbalance issue in AS by employing a reinforcement learning-based technique. Unlike traditional methods that treat it as a binary classification problem, the proposed approach formulated it as a sequence of sequential decisions. An agent classified each instance and received a reward at each step. To handle class imbalance, the reward assigned to the majority class was intentionally lower than that for the minority class. The parameters of the policy were initialized using an improved DE technique. To improve the efficiency of the DE algorithm, a novel cluster-based mutation operator was introduced. This operator utilized the K-means clustering approach to identify the winning cluster and incorporated potentially viable solutions into the existing population. In terms of word embedding, the model employed the DistilBERT model, which reduced the size of the BERT model. To evaluate the effectiveness of the proposed model, extensive experiments were conducted using LegalQA, TrecQA, and WikiQA datasets. The results demonstrated the superiority of the proposed model compared to existing methods in the field of answer selection. However, it is important to acknowledge certain limitations of the proposed model, which can be considered in future work:
a) Limited Scope: While the article introduces a novel reinforcement learning-based technique to address class imbalance in AS, it is important to acknowledge that class imbalance is a widely recognized challenge in machine learning, and various approaches have been proposed in the literature. A more comprehensive discussion that explores alternative methods, such as data resampling techniques (e.g., oversampling, undersampling), cost-sensitive learning, or ensemble-based methods, would provide a broader perspective on addressing the class imbalance in AS tasks. Comparing the proposed technique with these alternative approaches in terms of effectiveness and applicability would enhance the understanding of its limitations and potential alternatives.
b) Lack of Real-World Application: The evaluation of the proposed model on LegalQA, TrecQA, and WikiQA datasets provides insights into its performance within specific domains. However, it is important to recognize that these datasets might not fully capture the complexities and variations present in real-world AS scenarios. To overcome this limitation, future research should consider evaluating the proposed technique on diverse datasets from different domains, such as medical, finance, or customer support, to assess its generalizability and robustness across various real-world applications. This would provide a more comprehensive understanding of the technique's effectiveness and limitations in practical settings.
c) Performance Metrics and Statistical Significance: While the article claims superiority over existing methods, it is essential to provide a detailed analysis of the performance metrics used for evaluation. Precision, recall, F1-score, and other relevant metrics should be reported, along with the corresponding confidence intervals or statistical tests, to establish the statistical significance of the results. A thorough analysis of these metrics would provide a clearer understanding of the proposed technique's performance and its potential limitations in different AS scenarios.
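As an illustration of the confidence intervals mentioned in item c), a percentile bootstrap over per-question scores is one common option. The sketch below is an assumption about how such an interval could be obtained, not a procedure used in this study; the function name, the number of resamples, and the toy reciprocal-rank values are all hypothetical.

```python
import random


def bootstrap_ci(per_question_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-question scores."""
    rng = random.Random(seed)  # fixed seed so that the interval is reproducible
    n = len(per_question_scores)
    means = []
    for _ in range(n_resamples):
        # Resample the questions with replacement and record the mean score.
        sample = [per_question_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical reciprocal ranks for eight questions.
scores = [1.0, 0.5, 1.0, 0.25, 0.0, 1.0, 0.5, 1.0]
mean_score = sum(scores) / len(scores)
lower, upper = bootstrap_ci(scores)
```

Applying the same resampling to the score differences between two systems on identical questions would give a paired comparison, which is typically more informative than comparing two separate intervals.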
d) Computational Efficiency: While the utilization of the DistilBERT model is mentioned to enhance computational efficiency, it would be beneficial to provide more specific details about the computational resources required by the proposed technique.Comparing the computational requirements, such as memory usage and processing time, with other state-of-the-art AS methods would allow for a more comprehensive assessment.Additionally, considering the scalability of the technique for larger datasets or real-time applications would provide insights into its feasibility and practical utility in various contexts.
e) Interpretability and Explainability: The article lacks discussion on the interpretability and explainability of the proposed technique. In AS tasks, understanding the decision-making process and providing explanations for selected answers are important factors for trust and transparency. Discussing methods or approaches used to interpret and explain the decisions made by the reinforcement learning-based model would enhance its applicability in real-world scenarios. Consideration of techniques like attention mechanisms or post-hoc interpretability methods (e.g., LIME, SHAP) would provide insights into the reasoning behind answer selections and potential biases or limitations associated with the model's decisions.
f) User Feedback and Adaptability: The article does not discuss the potential for incorporating user feedback or adapting the AS system over time.AS models that can learn from user interactions, such as reinforcement learning with online learning or active learning approaches, have the potential to improve their performance based on user preferences and changing information needs.Investigating the integration of user feedback and methods for continuous adaptation would be valuable for enhancing the proposed technique's effectiveness and user satisfaction.
g) Comparison with Human Performance: The article focuses on comparing the proposed model with existing methods, but it does not include a comparison with human performance. AS tasks often involve subjective judgments, and comparing the performance of the proposed technique with human experts or crowd-sourced annotations can provide valuable insights into the model's strengths and limitations. Conducting experiments that involve human evaluations would help contextualize the performance of the proposed technique and highlight areas where further improvements are needed.
h) Ensuring Data Quality and Model Performance: Another aspect warranting discussion is the challenge of recognizing datasets that may potentially misguide the classifier.Any model's efficiency is contingent upon the quality and reliability of its training data.Datasets that contain noisy, inconsistent, or unrepresentative samples can induce biases in the model, leading to flawed predictions.Regular monitoring of performance metrics on validation sets can provide early indications of a model being misguided by its data.A substantial divergence between training and validation performance may hint towards potential dataset issues.Tools like ChatGPT and other advanced language models can offer benefits in this scenario.These models, with their vast training on diverse textual data, can be harnessed to validate the coherence and authenticity of data samples.For instance, they could generate synthetic samples for augmentation, thereby balancing datasets and mitigating risks.They can also be employed to highlight potential anomalies or inconsistencies within a dataset, aiding in its refinement and preprocessing.In future studies, integrating insights from these tools could be an invaluable step for data validation, ensuring models are trained on high-quality, representative datasets.

VI. CONCLUSION
In this paper, an approach for efficient AS is proposed, which employs an enhanced DE algorithm for pretraining and RL for guiding the BP algorithm. The method is based on LSTM with an attention mechanism and DistilBERT word embedding. The proposed model classifies question-answer pairs into positive and negative classes. Because the datasets contain far more negative pairs than positive ones, the classification problem is imbalanced. To address this issue, classification is framed as a sequential decision-making process. At each episode step, correct classification of a minority sample is rewarded with a higher value than correct classification of a majority sample. Each episode continues until a minority sample is misclassified or all samples are correctly classified. The policy weights are initialized using an improved DE algorithm, which clusters the current population and finds promising regions in the search space using a new upgrade strategy. The proposed method was evaluated on the LegalQA, TrecQA, and WikiQA datasets, where it demonstrated superior performance compared to other methods.
In addition to the proposed classification approach, there are several promising avenues for future research in the field of Natural Language Processing (NLP). One area of interest is exploring the utility of the proposed approach in NLP applications beyond answer selection. By applying the same reinforcement learning-based technique to tasks such as sentiment analysis, text summarization, or named entity recognition, insights into its effectiveness and generalizability across different domains can be gained.
Another promising direction for future research is the provision of candidate answers to given questions.While the proposed approach focuses on selecting the best answer from a given list of options, the generation of candidate answers could further enhance the AS process.One potential approach to generating candidate answers is through the use of Generative Adversarial Networks (GANs).GANs have shown promise in generating realistic and coherent text, and their application in generating diverse and plausible candidate answers could greatly enrich the AS process.Further investigation into the integration of GANs with the proposed classification approach could lead to more comprehensive and accurate answer selection systems.

TABLE I.
STATISTICAL INFORMATION FROM THE LEGALQA, TRECQA AND WIKIQA DATASETS

TABLE II.
PERFORMANCE COMPARISON OF THE PROPOSED MODEL WITH EXISTING MODELS ON THREE DATASETS: RESULTS MARKED WITH A DAGGER WERE REPORTED IN EARLIER STUDIES

TABLE III.
MARGIN OF IMPROVEMENT OF THE PROPOSED MODEL OVER OTHER METHODS

TABLE IV.
PARAMETER SETTINGS FOR THE META-HEURISTICS

TABLE VI.
RESULTS OF DIFFERENT WORD EMBEDDINGS ON THE THREE DATASETS