An Ensemble Approach to Question Classification: Integrating Electra Transformer, GloVe, and LSTM

Natural Language Processing (NLP) has emerged as a crucial technology for understanding and generating human language, playing an essential role in tasks such as machine translation, sentiment analysis, and more pertinently, question classification. As a subfield within NLP, question classification focuses on determining the type of information being sought, a fundamental step for downstream applications like question answering systems. This study presents an innovative ensemble approach for question classification, combining the strengths of Electra, GloVe, and LSTM models. Rigorously tested on the well-regarded TREC dataset, the model demonstrates how the integration of these disparate technologies can lead to superior results. Electra brings in its transformer-based capabilities for complex language understanding, GloVe offers global vector representations for capturing word-level semantics, and LSTM contributes its sequence learning abilities to model long-term dependencies. By fusing these elements strategically, our ensemble model delivers a robust and efficient solution for the complex task of question classification. Through rigorous comparisons with well-known models like BERT, RoBERTa, and DistilBERT, the ensemble approach verifies its effectiveness by attaining an 80% accuracy score on the test dataset.


INTRODUCTION
Machine learning has changed how problems are solved in many areas, including healthcare, banking, and natural language processing [1], [2], [3]. It allows computers to learn from data on their own, make decisions, predict trends, and find patterns that are too complex for humans to identify. NLP is the study of how computers process and interact with human language. With the rise of machine learning, NLP has advanced considerably, especially in areas such as sentiment analysis, machine translation, and summarization [4], [5], [6], [7]. One of the most important NLP tasks is question classification, that is, sorting questions into categories. In practice, this task underpins many applications, such as search engines, virtual assistants like Siri or Google Assistant, and customer service bots. Accurate question classification leads to more accurate and useful answers, which improves the service these applications can provide. Consider a medical chatbot that correctly classifies a health question and gives a potentially lifesaving answer, or a virtual tourist assistant that can tell questions about food apart from questions about historical sites. Such benefits are more than a convenience; they often have significant consequences [8], [9], [10]. However, the complexity of human language, which includes subtleties in syntax, semantics, and pragmatics, makes highly accurate question classification difficult [11], [12]. Support Vector Machines, Random Forests, and other machine learning models have been used for this task, but recent developments in deep learning and transformer models such as BERT, RoBERTa, and ELECTRA have been shown to work even better [13]. These models are very good at capturing the meanings and contexts of words and sentences, which is a key part of question classification [1], [14], [15], [16], [17]. Here, we present a new method that combines three strong tools: the ELECTRA model for transformer-based contextual embeddings; Global Vectors for Word Representation (GloVe) for creating semantically rich word vectors; and Long Short-Term Memory (LSTM) networks for capturing sequence dependencies. The Text REtrieval Conference (TREC) dataset, a common benchmark for question classification, is used to train and test our ensemble model. The main contribution of our work is the combination of several different but complementary techniques in a way that makes them classify questions better than current leading models.
This study is organized as follows. Section II reviews earlier work on question classification and related ensemble methods. Section III explains the components used, namely the ELECTRA model, GloVe embeddings, and LSTM networks. Section IV presents the proposed approach. Section V describes the experimental setup, the results, and the reasoning behind our conclusions. Section VI discusses the results, the limitations, and opportunities for further study.

A. Previous Work
In NLP, question classification has been a major area of study for twenty years, with many researchers working on it. Over the years, techniques in this area have changed considerably, from simple machine learning methods to the advanced deep learning models used today. Support Vector Machines (SVM) and other well-known machine learning methods were used in the early stages of this research. For example, Zhang and Lee used SVMs to classify questions [18]. As machine learning matured, deep learning methods emerged and made models more robust. Kalchbrenner et al. were the first to use convolutional neural networks (CNNs) to model sentences and classify questions. After that, researchers studied Recurrent Neural Networks and their variants, such as Long Short-Term Memory networks. Zhou et al. applied LSTMs to capture long-term dependencies in questions and answers and reported promising results [19].
www.ijacsa.thesai.org
The arrival of language models such as BERT, RoBERTa, and ELECTRA was the next big step forward in NLP. These transformer-based systems improved performance on many natural language processing tasks, including question classification. Devlin et al. created BERT and showed that it could capture context-rich embeddings [18]. Liu et al., with RoBERTa, and Clark et al., with ELECTRA, then pushed the limits of efficiency [20], [21].
Individual models have worked well on their own, but ensemble methods have become popular as a way to combine the different strengths of these models. Vaswani et al. proposed an ensemble that combined transformers and LSTMs, which showed a large improvement in performance compared to using a single model [22]. However, ensemble methods designed specifically for question classification have not been widely explored, which points to an interesting area for further study.
The role of word embeddings, especially GloVe, is another part of this changing landscape. After Pennington et al. introduced GloVe, it quickly became a mainstay in many NLP tasks, including question classification [23].
Nguyen and Le examine lexical, syntactic, and semantic features before proposing a new type of feature based on question patterns. The authors also devised a feature selection method suited to different types of questions. They used the TREC dataset and Support Vector Machines (SVM) for classification to show that their approach worked [24].
Chotirat and Meesad use two datasets, TREC-6 (English) and a Thai speech dataset, to test different machine learning models. According to their findings, the combined CNN-BiLSTM model outperformed the other models. These results show that deep learning methods, especially hybrid models, can improve the accuracy of question classification in more than one language. The addition of Part-of-Speech tagging was a key factor in this performance gain [25].
Madabushi et al. provide empirical evidence that their system performs better. They show large improvements in answer selection when fine-grained question classification is paired with deep learning models. Their new taxonomy and entity recognition system outperformed earlier models, which supports their approach. These results show how important it is to include question classification in deep learning systems for tasks such as answer selection [26].

B. Rationale for the Proposed Approach
We describe a new ensemble method for question classification that combines Electra, GloVe, and LSTM. This combination was chosen because its components complement one another in addressing the complex nature of question understanding. With its transformer-based structure, Electra is well suited to complex language tasks and to capturing their full context. GloVe adds detailed word-level semantic representations that reflect how language is used. LSTM models long-term dependencies in text, which is very important for understanding the order in which the parts of a question appear. Together, these models are intended to overcome the limitations of standalone models such as BERT and RoBERTa, especially when handling complicated question forms and shifting contexts. As our test results show, the approach uses the strengths of each model to make question classification more accurate and faster. This combination not only improves performance metrics but also allows questions to be analyzed in a more detailed and complete way, which is a step forward in natural language processing.
Different modeling strategies have their own strengths and weaknesses, and there has been little research on how to combine them into a single model for question classification. Our work introduces a new ensemble method that combines ELECTRA, GloVe, and LSTM. The objective is to establish a new approach for classifying questions into categories.

III. BACKGROUND
This section provides a comprehensive overview of the primary components of our ensemble model: the ELECTRA model, GloVe word embeddings, and LSTM networks.

A. ELECTRA
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a transformer-based model developed for natural language processing tasks. Proposed by researchers at Google Research in 2020, ELECTRA uses a novel training approach known as Replaced Token Detection [17].
Traditional transformer models, such as BERT [1], use masked language modeling as a pre-training task, where some percentage of the input tokens are masked and the model is trained to predict the original tokens. ELECTRA, on the other hand, introduces a different mechanism. It consists of two parts: a generator and a discriminator. The generator is a small masked language model that suggests replacements for some of the tokens in the input. The discriminator is then tasked with predicting whether each token in the sequence was replaced by the generator or not.
This training mechanism can be described with the following steps:
1) The generator G, a small BERT-like model, is used to replace some tokens in the input sequence.
2) The discriminator D, a larger BERT-like model, then attempts to predict for each position whether it contains the original token or a replacement.
The main advantage of this approach is that it allows the entire input sequence to be used during pre-training, as opposed to just a small masked portion, making the training process more efficient and effective.
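The two steps can be illustrated with a minimal sketch of how the discriminator's training targets are derived. It is an illustration under stated assumptions, not the training code used in this study: the function names `corrupt_tokens` and `rtd_labels` are hypothetical, and the generator is replaced by random sampling from a toy vocabulary instead of a small masked language model.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Toy stand-in for the generator G: randomly swap some tokens.

    Assumption: a real generator is a small masked language model; random
    sampling from `vocab` is used only to keep the sketch self-contained.
    """
    rng = random.Random(seed)
    corrupted = []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted


def rtd_labels(original, corrupted):
    """Targets for the discriminator D: 1 if the token differs from the
    original (replaced), 0 if it is unchanged. Every position gets a label,
    which is why the whole sequence contributes to training."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["what", "year", "did", "the", "war", "end"]
corrupted = ["what", "city", "did", "the", "war", "begin"]
print(rtd_labels(original, corrupted))
```

A replacement that happens to equal the original token is labeled 0 here, since the discriminator cannot distinguish it from an untouched token.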

B. GloVe
GloVe is an unsupervised learning algorithm developed by the Stanford NLP Group for obtaining vector representations for words. The primary idea behind GloVe is that the co-occurrence statistics of words in a corpus capture a significant amount of semantic information [23]. To construct the GloVe representations, the following steps are carried out:
1) A global word-word co-occurrence matrix is constructed from the corpus, where each element $X_{ij}$ represents the frequency with which word $i$ appears in the context of word $j$.
2) The objective of GloVe is then to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.
Mathematically, this is represented as:
$$V_i^{\top} V_j = \log P(i \mid j)$$
where $V_i$ and $V_j$ are the word vectors for words $i$ and $j$, and $P(i \mid j)$ is the probability of $i$ appearing in the context of $j$.
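The quantities involved can be made concrete with a small sketch that builds co-occurrence counts from a toy corpus and computes the log co-occurrence probability that the dot product of two word vectors is meant to match. The function names, the symmetric window, and the toy corpus are assumptions made for illustration; the published GloVe objective also includes bias terms and a weighting function, which are omitted here.

```python
import math
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of
    each context word. Keys are (word, context_word) pairs, i.e. X_ij."""
    counts = defaultdict(float)
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[(word, sent[ctx_pos])] += 1.0
    return counts


def log_cooccurrence_prob(counts, i, j):
    """log P(i | j) = log(X_ij / sum_k X_kj): the target value for the
    dot product of the vectors of words i and j."""
    x_ij = counts.get((i, j), 0.0)
    total_j = sum(c for (_, ctx), c in counts.items() if ctx == j)
    if x_ij == 0.0 or total_j == 0.0:
        raise ValueError("words never co-occur; log probability undefined")
    return math.log(x_ij / total_j)


corpus = [["what", "is", "glove"], ["what", "is", "lstm"]]
counts = cooccurrence_counts(corpus, window=1)
print(log_cooccurrence_prob(counts, "what", "is"))
```

In this toy corpus "what" accounts for two of the four context observations of "is", so the target is log(0.5).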

C. LSTM
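LSTM networks are a type of recurrent neural network (RNN) architecture [27], specifically designed to address the vanishing gradient problem of traditional RNNs and to better capture dependencies in sequential data [28]. In an LSTM, the hidden state $h_t$ is updated via a series of gating mechanisms:
1) The input gate $i_t$ determines how much of the new input will be stored in the cell state.
2) The forget gate $f_t$ decides the extent to which the previous cell state $c_{t-1}$ is maintained.
3) The output gate $o_t$ controls how much of the internal state is exposed to the external network.
The state update equations, in their standard form, are as follows:
$$
\begin{aligned}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
$$
Here, $x_t$ is the input at time step $t$, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, $\tilde{c}_t$ is the candidate cell state, $\sigma$ represents the sigmoid function, $\tanh$ is the hyperbolic tangent function, $*$ denotes element-wise multiplication, and $\cdot$ represents matrix multiplication. The variables $W$ and $b$ are the learnable weights and biases, respectively, of the LSTM.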
By employing these gating mechanisms, LSTMs can effectively learn what information to keep or forget over long sequences, making them particularly efficient for tasks involving sequential data.
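The effect of the three gates can be seen in a single time step written out by hand. The sketch below assumes the standard LSTM formulation and uses scalar input, hidden state, and cell state so that it runs with the standard library alone; the function name `lstm_step` and the layout of `params` are hypothetical, and a practical implementation would use vectors and weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for scalar input, hidden state, and cell state.

    `params` maps each gate name ("i", "f", "o", "c") to a tuple
    (w_h, w_x, b) of scalar weights. These are illustrative values,
    not weights from the model in this study.
    """
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    i_t = sigmoid(pre_activation("i"))        # input gate
    f_t = sigmoid(pre_activation("f"))        # forget gate
    o_t = sigmoid(pre_activation("o"))        # output gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # keep part of old, add part of new
    h_t = o_t * math.tanh(c_t)                # expose part of the cell state
    return h_t, c_t


zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("i", "f", "o", "c")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.3, c_prev=2.0, params=zero_params)
print(h_t, c_t)
```

With all weights set to zero every gate evaluates to 0.5, so the step simply halves the previous cell state, which makes the role of the forget gate easy to check by hand.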
The combination of ELECTRA, GloVe, and LSTM in our ensemble model aims to leverage the efficient pre-training and high performance of ELECTRA, the rich semantic information encapsulated by GloVe embeddings, and the sequence modeling capabilities of LSTM. This synergistic integration seeks to enhance the performance of question classification tasks by capturing the semantics, context, and sequence information embedded in the questions [29], [30], [31].

IV. PROPOSED APPROACH
The proposed approach is designed to amalgamate the capabilities of multiple state-of-the-art language models and embeddings, namely Electra, GloVe, and LSTM, to enhance the classification performance on questions from the TREC dataset. The architecture employs a dual-branch neural network with each branch responsible for processing a different type of embedding: Electra for one and GloVe for the other. LSTM layers are then applied to the concatenated embeddings, leading to the final classification output.

A. Source of Data
The experiment was carried out on the TREC question classification dataset, which contains text questions and their associated broad category labels such as "location," "person," etc.

B. Text Standardization
The TensorFlow function tf.strings.lower() was used to convert all of the raw text strings to lowercase.

C. Tokenization and Sequence Padding
The raw texts went through two separate tokenization processes: one for Electra and one for GloVe. Padding was used to keep a fixed sequence length of 512.
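The standardization and padding steps can be sketched in plain Python. This is a simplified illustration, not the pipeline used in this study: `str.lower` stands in for tf.strings.lower(), whitespace splitting stands in for the Electra and GloVe tokenizers, and the padding id 0, unknown-token id 1, and right-side padding are all assumptions.

```python
def standardize(text):
    """Lowercase raw text (a plain-Python analogue of tf.strings.lower)."""
    return text.lower()


def encode(text, vocab, unk_id=1):
    """Whitespace-tokenize and map tokens to ids.

    Assumption: real tokenizers (e.g. subword tokenizers) are more involved;
    unknown tokens are mapped to `unk_id`.
    """
    return [vocab.get(tok, unk_id) for tok in standardize(text).split()]


def pad_or_truncate(token_ids, max_len=512, pad_id=0):
    """Return exactly `max_len` ids by truncating or right-padding."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


vocab = {"what": 2, "year": 3}  # toy vocabulary for illustration
ids = encode("What YEAR was it", vocab)
padded = pad_or_truncate(ids)
print(len(padded), padded[:6])
```

Fixing the length means short questions are mostly padding, so a model would typically also receive a mask marking which positions hold real tokens.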

D. Architectural Elements: In-Depth Exploration
Electra Sub-model: Capturing Contextual Relationships
Electra is the main component used to find complex and detailed patterns in questions. Electra's discriminator is very good at determining what a token means in relation to its context. This is very important for question classification because questions often contain contextual clues that help with classification. For instance, the use of "when" or "what year" may indicate a question about time, a pattern that Electra is well suited to detecting.

E. GloVe Sub-model: Leveraging Global Statistical Information
GloVe is useful because it captures global statistical properties of words based on how often they co-occur. Unlike methods that rely only on local context, GloVe captures broad relationships such as synonymy or related concepts, which can be very helpful for classifying questions. Electra can model how the words in a question interact in complex ways, while GloVe complements it by capturing the broader linguistic properties of the words used.

F. LSTM Layers: Accounting for Sequential Dependencies
After the embeddings are combined, LSTM networks are used to find the sequence-based relationships in the input text. Questions naturally follow a certain order, with "wh" words like "who," "what," and "where" at the beginning and a subject or object at the end. Modeling this order often helps determine what the question is really asking. The gates in LSTMs help them capture long-term dependencies, which makes them well suited to this task. The two LSTM layers, which have 256 and 128 units, are set up to add another level of abstraction and pick up more complex patterns.

G. Classification Layer: Mapping to Categories
The last Dense layer is a classifier that turns the complex feature representations learned by the preceding layers into classification decisions. A softmax activation function is used because the task is multi-class classification. There are 6 units in this layer, and each one represents a different question type in the TREC dataset. The softmax function ensures that the outputs can be interpreted as probabilities that sum to 1. This makes it straightforward to assign each question to one of the six broad categories.
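The mapping from six output units to a category can be illustrated with a small softmax sketch. The logits below are made-up numbers, the function names are hypothetical, and the ordering of the six TREC coarse labels is an assumption chosen for illustration.

```python
import math

# The six coarse TREC classes; the index order here is an assumption.
TREC_COARSE_LABELS = ["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1.
    Subtracting the maximum keeps the exponentials numerically stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=TREC_COARSE_LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


logits = [0.1, 0.2, 0.3, 2.5, 0.0, -1.0]  # illustrative output of the 6 units
label, prob = predict_label(logits)
print(label, round(prob, 3))
```

The predicted category is simply the unit with the highest probability, which in this toy example is the fourth one.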

H. Model Synergy: The Bigger Picture
It is important to note that the architecture is not a random collection of techniques; it is a carefully assembled set of components meant to offset the weaknesses and make the most of the strengths of each part. Electra captures context, GloVe adds semantic breadth, and LSTMs model how information unfolds over the sequence. These components work together to form a complete approach to question classification.
To put it simply, each design element was carefully chosen and combined to form a unified model that is adaptable, captures meaning, and performs well at question classification.

A. Experimental Setup
To thoroughly test how well our proposed ensemble model, Ensemble Electra + GloVe + LSTM, worked, we ran our experiments on Google Colab Pro and used its GPU support to speed up computation. We compared our ensemble model against Electra [17] and other cutting-edge language models: BERT [32], RoBERTa [33], and DistilBERT [34].

B. Mathematical Overview of Models
1) ELECTRA: Electra employs a discriminative training mechanism, where the model learns to distinguish between "real" and "fake" tokens in a sentence. Formally, for a given input $X = [x_1, x_2, \ldots, x_n]$, a generator $G$ proposes replacements $\tilde{x}_i$ for masked tokens, and a discriminator $D$ estimates the probability $P(D(x_i) = 1 \mid X)$ that each token is real. The objective is to minimize $-\log(D(x_i))$ for real tokens and $-\log(1 - D(\tilde{x}_i))$ for fake tokens.
2) BERT: BERT uses a masked language model (MLM) for pre-training, where a certain percentage of input tokens are masked. The model aims to predict these masked tokens based on their context. Mathematically, for an input sequence $X$, the loss $L$ is calculated as $-\log P(x_i \mid X_{-i}; \theta)$, where $\theta$ are the model parameters.
3) RoBERTa: RoBERTa extends BERT but employs dynamic masking and removes the next-sentence prediction objective. Its objective function remains similar to BERT, focusing on masked token prediction.
4) DistilBERT: DistilBERT is a distilled version of BERT, trained to approximate BERT's output. For each token $x_i$ in the input $X$, the model aims to minimize the difference between its output $O(x_i)$ and that of BERT $B(x_i)$, typically using the Kullback-Leibler divergence.
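The three objectives described above can be compared side by side with a short numerical sketch. The function names and toy probabilities are hypothetical, averaging over positions is an assumption, and the direction of the divergence (teacher relative to student) is a common convention in distillation and not something stated in the text.

```python
import math


def discriminator_loss(real_probs, is_real):
    """ELECTRA-style loss: -log D(x_i) for real tokens and
    -log(1 - D(x_i)) for replaced tokens, averaged over positions."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(real_probs, is_real):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y else -math.log(1.0 - p)
    return total / len(real_probs)


def mlm_loss(predicted, targets):
    """BERT-style masked LM loss: mean of -log P(x_i | context) over the
    masked positions. `predicted[pos]` maps candidate tokens to probabilities."""
    losses = [-math.log(predicted[pos][tok]) for pos, tok in targets.items()]
    return sum(losses) / len(losses)


def kl_divergence(teacher, student):
    """KL(teacher || student) for two discrete distributions on the same
    support; assumes student probabilities are positive where teacher's are."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


print(discriminator_loss([0.9, 0.2], [1, 0]))
print(mlm_loss({1: {"year": 0.5, "city": 0.5}}, {1: "year"}))
print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

The discriminator loss is computed at every position, whereas the masked language model loss only covers the masked positions; this difference is the source of the efficiency argument made for ELECTRA.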

C. Evaluation Metrics
We used several metrics to evaluate the performance of each model: Loss, Accuracy, Precision, Recall, and F1 Score.
1) Loss: Represents the error between predicted and actual labels. Lower values are better.
2) Accuracy: Measures the ratio of correctly predicted samples to the total samples.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
3) Precision: Indicates the percentage of positive identifications that were actually correct.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$
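As a worked example of these metrics, the sketch below computes accuracy, precision, recall, and F1 score from raw counts of true and false positives and negatives for a single class. The counts are invented for illustration and the function name is hypothetical; for the six-class TREC task the per-class values would still need to be averaged, and the averaging scheme is not specified here.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts.
    Ratios with an empty denominator are reported as 0.0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts for one class treated as "positive".
metrics = binary_metrics(tp=8, tn=80, fp=2, fn=10)
for name, value in metrics.items():
    print(name, round(value, 4))
```

In this example accuracy is high because true negatives dominate, while recall is under one half, which shows why accuracy alone can hide weak performance on a class.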

D. Results
Our ensemble model, which combines Electra, GloVe, and LSTM, outperformed all other models. The superior performance of our ensemble approach can be attributed to the complementary strengths of the constituent models. Electra, with its discriminator-generator setup, excels at understanding the context of the language. GloVe, on the other hand, captures semantic relationships between words by considering the global word-word co-occurrence statistics. LSTM effectively handles the sequential nature of the language data. Together, they give a complete approach to text classification and lead to strong results on the TREC question classification task. This experimental evidence supports our hypothesis that an ensemble of models can significantly improve question classification performance over standalone models. By leveraging the strengths of each model, we were able to achieve superior results, showing that our proposed ensemble approach works. The results of the experiments are shown in Tables I and II and Figs. 1, 2, 3, 4, and 5. All of the comparison data show that the Ensemble Electra + GloVe + LSTM model does better than the other models on every evaluation metric. This is not just a small step forward; it is a substantial improvement over standalone models.

A. Generalization and Overfitting
The ensemble model's ability to transfer from training data to test data is one of the most interesting results. With a training accuracy of 0.999 and a test accuracy of 0.8, the ensemble model applies learned patterns to data that it has never seen before. We take this as evidence of generalization, although the gap of about 0.2 between training and test accuracy suggests that some overfitting, a common problem in machine learning [35], may still be present.

B. Error Analysis
The ensemble model also stays ahead in terms of Mean Squared Error (MSE). With a training MSE of 0.001 and a test MSE of 1.51, the model's predictions were closer to the true labels than those of the other models. Standalone models, such as Electra, BERT, and others, have much higher MSE values on both the training and test sets, which means they make larger prediction errors.

C. Precision, Recall, and F1 Score
The ensemble model also maintains high scores in precision, recall, and F1 score. A high precision score means that the examples the ensemble model flags as relevant are mostly correct, and a high recall score means that the model finds most of the relevant examples. The F1 score, which balances precision and recall, shows that the model is well-balanced.

D. Comparative Model Analysis
Although RoBERTa seems to do better than the other standalone models, it still falls short of the ensemble model. The ensemble model is the only one that combines Electra's understanding of context, GloVe's semantic depth, and LSTM's sequence modeling at the same time.

E. Synergistic Strength
The success of the ensemble model shows that combining components drawn from established state-of-the-art methods can create something new. On the TREC question classification task, it does very well because it captures both the specific and the broad aspects of the data. The ensemble model does a good job of classifying questions, and these results suggest that it could also help with other natural language processing problems.

VII. CONCLUSION
In conclusion, our results show that an ensemble model with Electra, GloVe, and LSTM does a better job of classifying questions than other models on the TREC dataset. We tested our ensemble method against other advanced models such as BERT, RoBERTa, and DistilBERT and found that it consistently did better than them. It achieved higher accuracy, precision, recall, and F1 score, and lower mean squared error. Electra, GloVe, and LSTM have properties that complement one another in the ensemble model. We found that combining different models and methods into an ensemble can lead to large performance gains, making ensembles a reliable and effective way to handle difficult tasks such as question classification. Even though these results are positive, we know that there is still room for improvement and refinement. For instance, different ensemble combinations and model designs could be explored, along with more advanced training methods. In the future, researchers may look into how this ensemble method can be applied to other natural language processing problems besides question classification. Overall, this study adds to the progress being made in natural language processing and lays the groundwork for more research and development of ensemble methods in question classification and other areas.

VIII. CONFLICT OF INTEREST
The authors declare that there is no conflict of interest in this paper.

Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used Quillbot in order to proofread the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

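4) Recall: Shows the percentage of actual positives that were identified correctly.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$
5) F1 Score: Harmonic mean of precision and recall, a balance between the two.
$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$
Where: TP: True Positive, TN: True Negative, FP: False Positive and FN: False Negative.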

TABLE I. THE ACCURACY AND MSE OF THE MODELS