Research on the Application of Random Forest-based Feature Selection Algorithm in Data Mining Experiments

Abstract: Handling high-dimensional big data presents substantial challenges for Machine Learning (ML) algorithms, mainly due to the curse of dimensionality, which leads to computational inefficiency and an increased risk of overfitting. Various dimensionality reduction and Feature Selection (FS) techniques have been developed to alleviate these challenges. Random Forest (RF), a widely used Ensemble Learning Method (ELM), is recognized for its high accuracy and robustness, and it also has a lesser-known capability for effective FS. While specialized RF models are designed for FS, they often struggle with computational efficiency on large datasets. Addressing these challenges, this study proposes a novel Feature Selection Model (FSM) integrated with data reduction techniques, termed Dynamic Correlated Regularized Random Forest (DCRRF). The architecture operates in four phases: preprocessing, Feature Reduction (FR) using Best-First Search with Rough Set Theory (BFS-RST), FS through DCRRF, and feature efficacy assessment using a Support Vector Machine (SVM) classifier. Evaluated on four gene expression datasets, the proposed model outperforms existing RF-based methods in computational efficiency and classification accuracy. This study introduces a robust and efficient approach to feature selection in high-dimensional big-data scenarios.


I. INTRODUCTION
High-dimensional big data poses significant challenges for Machine Learning (ML) algorithms due to the "curse of dimensionality," a phenomenon where the computational complexity and resource requirements increase exponentially as the number of dimensions (features) grows [1]. Traditional algorithms can struggle to make accurate predictions as they become lost in the vastness of the feature space, leading to issues such as overfitting, where the model captures noise instead of the underlying data structure. To mitigate these problems, various dimensionality reduction techniques like Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders have been developed to compress the feature space while retaining as much of the meaningful information as possible [2]. Additionally, feature selection methods like the Least Absolute Shrinkage and Selection Operator (LASSO), Mutual Information (MI), and Chi-Square Test (CST) are used to identify the most informative features. More sophisticated ML models, such as Deep Learning (DL) models, are also designed to automatically capture hierarchical representations of the data, thereby mitigating some of the challenges posed by high dimensionality.
The Random Forest (RF) technique is an ensemble learning approach that integrates numerous decision trees to build a more robust and precise predictive model. Functionally, each tree in the ensemble is built from a bootstrapped sample of the data, and during the tree-building process, a random subset of features is chosen at each node split [3]. This randomization not only decorrelates the trees but also makes the ensemble less prone to overfitting, enabling it to perform well on unseen data. As a predictor, RF is renowned for its high accuracy and for its capability to process vast, high-dimensional datasets and to handle missing values. One of the lesser-known but advantageous features of an RF model is its innate capability for Feature Selection (FS) [4]. During training, it computes a score for each feature that indicates its importance in making predictions. This feature importance score is often derived from the average reduction in impurity that each feature brings across all trees in the forest. By ranking features based on this score, RF provides a practical and intuitive way to perform FS, helping to improve the performance of not only itself but also other ML models that may be sensitive to irrelevant or redundant features [5].
Many RF models are specialized for FS, each offering unique advantages and disadvantages. Variants like Boruta [6] focus on systematically identifying important attributes by comparing them to randomly shuffled versions of themselves, while Conditional Inference Forest (CIF) [7] aims for unbiased FS through statistical hypothesis tests. Regularized RF [8] applies a regularization term to prioritize a sparse set of features, and Extremely Randomized Trees (RT) [9] add an extra layer of randomness for potentially more robust selections. However, a common drawback of most of these RF-based Feature Selection Models (FSM) is their lack of focus on handling large datasets. These methods often struggle with computational efficiency, facing challenges related to memory space and computing time. To mitigate these issues, high-performance computing environments and parallel architectures are often necessary for effective FS on big datasets. The need for such computational resources can significantly increase hardware and software costs. For example, scalable software frameworks like Hadoop MapReduce are often required for the learning and analysis stages to manage large datasets efficiently. Therefore, while RF-based methods offer numerous avenues for FS, their applicability to big data scenarios often necessitates additional computational resources to overcome inherent limitations.
www.ijacsa.thesai.org
Considering the challenges associated with computational time complexity and classification accuracy in high-dimensional datasets, a novel FSM integrated with data reduction techniques is proposed. The architecture operates in four distinct phases. Initially, high-dimensional data undergo preprocessing to standardize and clean the dataset. Subsequently, in the second phase, the preprocessed data are passed through a Best-First Search with Rough Set Theory (BFS-RST) Feature Reduction Model (FRM). This specialized model aims to reduce the feature size effectively. In the third phase, a novel variant of RF termed Dynamic Correlated Regularized Random Forest (DCRRF) is employed. The DCRRF model incorporates a correlation-based FSM to identify an optimal set of features from the already-reduced set. The final phase involves a rigorous assessment of the quality and efficacy of the FS using a Support Vector Machine (SVM) classifier. Performance benchmarks indicate that the proposed model outperforms existing RF-based FSMs when tested on four gene expression datasets. The architecture aims to mitigate computational inefficiency and enhance classification accuracy, offering a more robust approach to FS in complex data scenarios.
The paper is organized as follows: Section II presents the literature review, Section III presents the methodologies used in the work, Section IV presents the proposed model, Section V analyses the work using different experiments, and Section VI concludes the work.

II. LITERATURE REVIEW
This section delves into various works that have employed RF-based models for FS, offering insights into their efficacy and limitations.
The research in [10][11][12] presents a two-step RF-based FSM. The first step selects features based on variable importance scores, and the second step employs a search process to finalize a feature subset. The approach was tested on the KDD'99 intrusion detection dataset, derived from the DARPA 98 dataset. Notably, the KDD'99 dataset was modified to remove redundant records, resulting in a refined dataset called RRE-KDD for training and testing. Experimental results indicated that this approach reduced the feature set and computational time and enhanced classification accuracy.
The study in [13][14][15] explores the use of the RF classifier for FS in prostate cancer detection. Utilizing an ensemble of Decision Trees (DT) for classification, the study notes that the accuracy improves as more trees are added. The classifier is adept at handling incomplete data attributes and is scalable to large datasets. Emphasizing the pivotal role of FS, the research finds that the method boosts detection accuracy by roughly 87%, underscoring the importance of effective FS in enhancing prostate cancer detection.
The research in [16][17][18] introduces a recursive FSM using RF to enhance protein structural class prediction. The method was evaluated through four experiments and compared to existing prediction techniques. Findings suggest that this feature selection approach significantly bolsters the efficiency of predicting protein structural classes. Remarkably, the method uses fewer than 5% of the features yet boosts prediction accuracy by 4.6-13.3%. Further analysis revealed that features related to predicted secondary structures yielded the best performance, providing insights that could inform the development of even more effective prediction methods for protein structural classes.
The study in [19][20][21][22][23] explores how the number of trees and class separability influence the consistency of variable importance rankings in RF algorithms. The research concludes that stable importance values can be achieved either by incorporating a large number of trees in a single execution of the model or by averaging the values from multiple runs with fewer trees. While the second approach is more economical in computational cost, both methods produce comparable rankings for the variables. The research additionally points out that the ideal number of model iterations varies with class separability and offers recommendations for determining the number of runs or trees needed to achieve stable rankings of variable importance.
The study in [24][25][26][27][28] introduces an explainable Artificial Intelligence (AI) model for COVID-19 diagnosis based on blood test samples. Despite the advancements in AI-based diagnostic models, few effectively integrate human-centered and machine-centered approaches. This research employs human-computer interaction design principles to address this gap. Employing graph analysis for the visualization and optimization of features, the model integrates an interpretable decision forest classifier to categorize COVID-19 cases using existing blood test information. This enables clinicians to leverage DT structures and feature visualizations for better model interpretability. The authors report that the model achieved both better diagnostic accuracy and reduced computation time.
The research in [29][30][31][32][33][34] examines the efficacy of ML algorithms like RF and its variations in selecting Single Nucleotide Polymorphisms (SNPs) for fine-scale genetic population assignment in wildlife conservation. The study, which uses unpublished data for Atlantic salmon and published data for Alaskan Chinook Salmon (ACS), found that ML methods outperformed traditional Fixation Index (FST) rankings in identifying informative genetic markers. Specifically, RF-based methods led to an accuracy improvement of up to 7.8% for Atlantic salmon and 11.2% for ACS, respectively. The findings underscore the potential of ML algorithms in enhancing genetic marker selection for conservation efforts.
The research in [35][36][37][38][39][40] addresses the challenges in intrusion detection systems, such as the scarcity of labeled datasets, computational overhead, and suboptimal accuracy. The research introduces an Auto-Encoder Intrusion Detection System (AE-IDS) that leverages the RF algorithm for improved performance. The approach focuses on creating a robust training set through FS and grouping [41][42][43][44][45]. After training, the model employs an auto-encoder for prediction, significantly reducing detection time and enhancing accuracy. Experimental findings suggest that AE-IDS outperforms conventional ML-based IDS, offering easier training, better adaptability, and higher detection accuracy [46][47][48][49][50].
The research in [51][52][53][54][55] employs an RF algorithm for county-scale cotton mapping, using spectral, vegetation, and texture features. The study found that texture features, particularly the Gray Level Co-occurrence Matrix (GLCM), significantly improve classification accuracy. Compared to other classifiers like SVM and ANN, RF exhibited better stability and higher accuracy. The method that combined multiple features achieved an average accuracy of 93.36%, showing the effectiveness of using RF and multiple features for precise cotton mapping [56][57][58].

III. METHODOLOGIES

A. Random Forest
RF is an Ensemble Learning (EL) algorithm that builds a forest of DT, usually trained with the "bagging" method. The general idea of the Ensemble Learning Method (ELM) is to combine weak learners to create a robust model. In RF, each DT, $T_m$, is trained on a different bootstrap sample $D_m$ drawn from the original dataset $D$. The algorithm performs this operation $M$ times based on the parameter $M$, effectively creating $M$ different trees. A unique aspect of RF is that it considers only a subset of features when making each split, a number specified by the parameter $m_{\text{try}}$. This random subset of features introduces diversity among the trees, leading to a more robust model.
For regression problems, the output of an RF model is the mean prediction of all the trees, mathematically expressed as Eq. (1).

$$\hat{y} = \frac{1}{M} \sum_{m=1}^{M} T_m(x) \tag{1}$$

In classification tasks, the model employs a Majority Voting Scheme (MVS), choosing the mode of the classes predicted by individual trees, given as Eq. (2).

$$\hat{y} = \operatorname{mode}\{T_1(x), T_2(x), \ldots, T_M(x)\} \tag{2}$$

This ELM provides a way to reduce the variance that might be present in a single DT, improving generalization to unseen data. One of the essential aspects of RF is the criterion used for node splitting, often specified by the Gini Impurity (GI) as shown in Eq. (3),

$$GI = 1 - \sum_{k=1}^{K} p_k^2 \tag{3}$$

where $p_k$ is the proportion of samples of class $k$ at a node and $K$ is the number of classes. GI quantifies the "messiness" of the data. The algorithm aims to minimize the weighted sum of the GI of child nodes when making each split. This weighted sum can be calculated as in Eq. (4),

$$GI_{\text{split}} = \sum_{c} \frac{|S_c|}{|S|}\, GI(S_c) \tag{4}$$

where $S$ is the set of samples at the parent node and $S_c$ is the set of samples sent to child node $c$.
In addition to GI, entropy is another criterion which is sometimes used for splitting nodes, defined as $H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k$. The algorithm then selects the split that maximizes the information gain, calculated as in Eq. (5).

$$IG = H(S) - \sum_{c} \frac{|S_c|}{|S|}\, H(S_c) \tag{5}$$

One lesser-known but critical aspect of RF is the Out-of-Bag (OOB) error. This internal error estimate eliminates the need for a separate validation set. Each tree in the forest leaves out some samples during its bootstrap training, called OOB samples. The OOB error for each tree is calculated using its corresponding samples as in Eq. (6).

$$OOB_m = \frac{1}{|O_m|} \sum_{i \in O_m} L(y_i, \hat{y}_i) \tag{6}$$

The overall OOB error for the RF is the average of these individual tree OOB errors as shown in Eq. (7),

$$OOB_{\text{overall}} = \frac{1}{M} \sum_{m=1}^{M} OOB_m \tag{7}$$

where $O_m$ is the set of OOB samples of tree $T_m$ and $L(y_i, \hat{y}_i)$ is a loss function measuring the difference between the true label $y_i$ and the predicted label $\hat{y}_i$.
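The splitting criteria and the voting scheme described above can be illustrated with a minimal Python sketch. It assumes class labels are held in plain lists; the function names are illustrative and do not come from the original work.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions at a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (base 2) of the class proportions at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_child_impurity(children, impurity=gini_impurity):
    """Size-weighted sum of child-node impurities for one candidate split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    return entropy(parent) - weighted_child_impurity(children, entropy)


def majority_vote(tree_predictions):
    """Mode of the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


parent = ["a", "a", "b", "b"]
pure_split = [["a", "a"], ["b", "b"]]
print(gini_impurity(parent))                 # 0.5
print(weighted_child_impurity(pure_split))   # 0.0
print(information_gain(parent, pure_split))  # 1.0
print(majority_vote(["a", "b", "a"]))        # a
```

A split that separates the two classes perfectly drives the weighted child impurity to zero and yields the maximum information gain of one bit for a balanced binary node.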

Algorithm 1 for RF Algorithm
Initialize Parameters: (i) tree count in the forest, $M$; (ii) number of features considered at each split, $m_{\text{try}}$; (iii) each tree's maximum depth; (iv) the sample count needed to split a node; (v) the sample count needed to be a leaf node.
For $m = 1$ to $M$: draw a bootstrap sample from the original dataset and grow a DT on it, choosing each split from a random subset of $m_{\text{try}}$ features, until the depth and sample-count limits are reached.

1) Variable importance scores from RF: RF is known not only for its robust predictive power but also for its built-in FS capabilities. One of the metrics that the algorithm provides for understanding the dataset is the variable importance score for each feature. Understanding variable importance is crucial for improving and interpreting the model's decisions. The variable importance score in an RF algorithm is computed using one of two principal measures:
a) Mean Decrease in Impurity (MDI): This method calculates the average reduction in impurity (Gini impurity or entropy, for example) that each feature brings about when used for node splitting. Mathematically, the Mean Decrease in Impurity for feature $X_j$ is computed as shown in Eq. (8),

$$MDI(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t \in T_m:\, v(t) = j} \Delta i(t) \tag{8}$$

where $v(t)$ is the feature used to split node $t$ and $\Delta i(t)$ is the impurity decrease achieved by that split.
b) Mean Decrease in Accuracy (MDA): Another method, which usually involves the Out-of-Bag error, calculates the decrease in model accuracy when a particular feature is permuted. The idea is to assess how much worse the model performs without each feature. The formula for MDA can be generalized as Eq. (9),

$$MDA(X_j) = \frac{1}{M} \sum_{m=1}^{M} \left( \text{OOBError}_m^{\text{with}} - \text{OOBError}_m^{\text{without}} \right) \tag{9}$$

where the two terms are the OOB error of tree $T_m$ with and without the permutation of $X_j$.
Calculation Steps:
Step 1. Run the RF Algorithm: First, generate the RF model using all variables and calculate the OOB error rate.
Step 2. Permute Each Variable: For each feature $X_j$ in the dataset, randomly permute the values of $X_j$ in the OOB samples and record the new OOB error.
Step 3. Compute Importance: For each feature $X_j$, compute the Mean Decrease in Accuracy or the Mean Decrease in Impurity, depending on the method in use.
Step 4. Normalize Scores: The raw importance scores can be normalized to sum to one, making them easier to interpret and compare.
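The permutation steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure used in the paper: it assumes a generic `predict` callable and a single evaluation sample in place of per-tree OOB samples, and all names are hypothetical.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(predict(row) != y for row, y in zip(rows, labels))
    return wrong / len(rows)


def permutation_importance(predict, rows, labels, seed=0):
    """Increase in error after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        scores.append(error_rate(predict, permuted, labels) - baseline)
    return scores


def normalize(scores):
    """Clip negative scores to zero and rescale so the scores sum to one."""
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    return [s / total for s in clipped] if total > 0 else clipped


# Toy data (assumed): the label equals feature 0, feature 1 is noise.
rows = [[0, 5], [1, 5], [0, 7], [1, 7], [0, 9], [1, 9]]
labels = [row[0] for row in rows]


def predict(row):
    """Toy predictor that looks only at feature 0."""
    return row[0]


scores = permutation_importance(predict, rows, labels)
print(scores[1])  # 0.0, because the predictor ignores feature 1
```

Shuffling a feature the predictor never reads leaves the error unchanged, which is why irrelevant features receive an importance of zero under this measure.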
2) Regularized random forest (RRF): RRF is an advanced extension of the traditional RF algorithm. While RF is already effective in ensemble learning, RRF goes a step further by incorporating regularization techniques aimed at reducing overfitting and improving feature selection. In standard RF models, each DT is trained independently on a bootstrap sample $D_m$, with no explicit mechanism for feature regularization. RRF, however, adds a regularization term to the ELM, effectively penalizing the complexity of individual trees.
The objective function for each tree in RRF can be mathematically represented as in Eq. (10).

$$\text{Objective}(T) = \text{Impurity}(T) + \lambda \cdot \text{Complexity}(T) \tag{10}$$

Here, Impurity refers to the impurity measure, which can be either GI or entropy. Complexity is a function quantifying the complexity of the tree, such as the depth or the number of leaves. $\lambda$ is the regularization parameter controlling the trade-off between impurity and complexity. This parameter is usually determined through cross-validation. In the RRF model, the standard information gain is replaced by a regularized form, $\text{Gain}_R$, which integrates the regularization term:

$$\text{Gain}_R(X_i) = \begin{cases} \lambda \cdot \text{Gain}(X_i), & i \notin F \\ \text{Gain}(X_i), & i \in F \end{cases} \tag{11}$$

Here, $F$ is the set of feature indices already used for splitting in previous nodes. The term $\lambda$ serves as the penalty coefficient. Regularization in RRF can be applied at different stages:
• During Feature Selection: The regularization term is incorporated into the evaluation metric used for selecting the features for node splitting.
• During Tree Pruning: After constructing the trees, they can be pruned to minimize the regularized objective function.
By introducing the regularization term, RRF balances model complexity and fit quality, ensuring a more interpretable and robust ensemble model. This is particularly useful in cases where the dataset contains many irrelevant features or when overfitting is a concern. Therefore, RRF benefits from the inherent advantages of Random Forests while simultaneously mitigating some of their limitations.
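A minimal sketch of the regularized gain may help make the feature-selection stage concrete. It assumes the commonly used RRF formulation in which the gain of a feature not yet used for splitting is multiplied by a penalty coefficient in (0, 1]; the function names and the toy gains are illustrative only.

```python
def regularized_gain(gain, feature, used_features, penalty):
    """Return the raw gain for already-used features, else penalty * gain."""
    if not 0.0 < penalty <= 1.0:
        raise ValueError("penalty must lie in (0, 1]")
    return gain if feature in used_features else penalty * gain


def pick_split_feature(gains, used_features, penalty):
    """Choose the feature with the highest regularized gain and record it."""
    best = max(
        gains,
        key=lambda f: regularized_gain(gains[f], f, used_features, penalty),
    )
    used_features.add(best)
    return best


used = {"x1"}  # features already used in earlier splits (assumed)
gains = {"x1": 0.30, "x2": 0.32, "x3": 0.10}
print(pick_split_feature(gains, used, penalty=0.8))  # x1
```

Although `x2` has a slightly higher raw gain, the penalty makes the already-used `x1` the better choice, so the set of selected features stays small unless a new feature offers a clearly larger gain.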

B. Best-First Search (BFS)
BFS is a tree-based search algorithm that aims to find the best solution by navigating through the state space of possible solutions. In the context of feature selection, each node $n$ in the search tree represents a subset of features $S_n$, and the root node usually represents an empty set or the complete feature set. The primary driving force of the algorithm is an evaluation function $f(n)$, which measures the 'quality' or 'promising nature' of node $n$. Mathematically, the evaluation function can be expressed as in Eq. (12),

$$f(n) = g(n) + h(n) \tag{12}$$

where $g(n)$ is the cost to reach the current node from the root (often equal to the number of features in $S_n$ when feature reduction is the goal), and $h(n)$ is the heuristic estimate of the cost to reach an optimal solution from $n$. The algorithm maintains a priority queue $Q$, initialized with the root node. The nodes are sorted in $Q$ based on their evaluation scores. Until a stopping criterion is met, the algorithm iteratively removes the most promising node from $Q$, expands it into child nodes, scores each child with $f$, and inserts the children into $Q$. The priority queue after $k$ iterations can be represented as in Eq. (13).

$$Q_k = \{n_1, n_2, \ldots, n_j\} \quad \text{s.t.} \quad f(n_1) \le f(n_2) \le \cdots \le f(n_j) \tag{13}$$

By focusing on the most promising subsets of features, BFS achieves a balance between exhaustive search and greedy algorithms. However, it can be computationally intensive, especially when the feature space is large, as the time complexity can go up to $O(b^d)$, where $b$ is the branching factor and $d$ is the depth of the solution.
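The queue-driven search described above can be sketched with Python's `heapq` module. The sketch assumes that lower evaluation scores are better, that the root is the complete feature set, and that each expansion removes one feature; the `evaluate` and `is_goal` callables are placeholders supplied by the caller.

```python
import heapq
from itertools import count


def best_first_search(features, evaluate, is_goal, max_expansions=1000):
    """Best-first search over feature subsets; lower evaluate() is better."""
    root = frozenset(features)
    tie = count()  # tie-breaker so the heap never compares frozensets
    queue = [(evaluate(root), next(tie), root)]
    seen = {root}
    for _ in range(max_expansions):
        if not queue:
            break
        _, _, subset = heapq.heappop(queue)  # most promising node first
        if is_goal(subset):
            return subset
        for feature in subset:
            child = subset - {feature}  # child has one feature fewer
            if child and child not in seen:
                seen.add(child)
                heapq.heappush(queue, (evaluate(child), next(tie), child))
    return None


# Toy problem (assumed): only features "a" and "b" are relevant.
relevant = frozenset({"a", "b"})


def evaluate(subset):
    """g = subset size, h = heavy penalty for each missing relevant feature."""
    return len(subset) + 10 * len(relevant - subset)


result = best_first_search("abcd", evaluate, lambda s: s == relevant)
print(sorted(result))  # ['a', 'b']
```

The `seen` set prevents the same subset from being queued twice, which matters because a given subset can be reached by removing features in different orders.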

C. Rough Set Theory (RST)
FR helps reduce the computational cost, simplify models, and sometimes even improve performance by eliminating irrelevant or redundant features. RST is one approach that can be employed for feature reduction. RST provides a formal mathematical framework to deal with vagueness and uncertainty in data. In the context of Feature Reduction (FR), it helps identify the minimal set of features indispensable for preserving the discernibility between objects. In simpler terms, it helps find the smallest set of features that is necessary and sufficient for classification tasks.
Let $U$ represent the universe of objects or instances in the dataset, and let $A$ denote the set of attributes or features. A decision table may be formed, in which $U$ comprises the rows and $A$ makes up the columns. Additionally, a subset of $A$, $D \subseteq A$, can be introduced as the decision attribute(s) of interest. Using this foundation, the following aspects of RST are discussed:
a) Indiscernibility Relation: The fundamental concept in RST is the indiscernibility relation. For a given subset $B \subseteq A$ of attributes, an indiscernibility relation is defined as follows in Eq. (14).

$$IND(B) = \{(x, y) \in U \times U : \forall a \in B,\ a(x) = a(y)\} \tag{14}$$

Here, $a(x)$ is the value of attribute $a$ for object $x$. The indiscernibility relation groups objects that cannot be distinguished by attributes in $B$.
b) Lower and Upper Approximations: Given a target set $X \subseteq U$, the lower and upper approximations are defined as in Eq. (15) and Eq. (16).
Lower Approximation:

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\} \tag{15}$$

Upper Approximation:

$$\overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\} \tag{16}$$

Here, $[x]_B$ represents the equivalence class of $x$ with respect to $B$.
c) Core and Reduct: The core attributes are those indispensable for maintaining the same lower approximation for every subset $X$ of $U$ as the entire set $A$. Mathematically, Eq. (17).

$$\text{Core}(A) = \{a \in A : \underline{A \setminus \{a\}}X \neq \underline{A}X \ \text{for some } X \subseteq U\} \tag{17}$$

A reduct is a minimal subset $R$ of $A$ such that $\underline{R}X = \underline{A}X$. In other words, $R$ and $A$ give the same lower approximations for each decision class.
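The indiscernibility relation and the two approximations can be computed directly from a small decision table. The sketch below is illustrative only: the table, the attribute names, and the target set are made-up toy values, and objects are identified by integer keys.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Partition object ids by their values on the given attributes."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attributes)].add(obj)
    return list(groups.values())


def lower_approximation(table, attributes, target):
    """Union of equivalence classes fully contained in the target set."""
    classes = equivalence_classes(table, attributes)
    return set().union(*(c for c in classes if c <= target))


def upper_approximation(table, attributes, target):
    """Union of equivalence classes that intersect the target set."""
    classes = equivalence_classes(table, attributes)
    return set().union(*(c for c in classes if c & target))


# Toy decision table (assumed): objects 1 and 2 are indiscernible.
table = {
    1: {"fever": "yes", "cough": "yes"},
    2: {"fever": "yes", "cough": "yes"},
    3: {"fever": "no", "cough": "yes"},
    4: {"fever": "no", "cough": "no"},
}
sick = {1, 3}
attrs = ["fever", "cough"]
print(sorted(lower_approximation(table, attrs, sick)))  # [3]
print(sorted(upper_approximation(table, attrs, sick)))  # [1, 2, 3]
```

Object 1 belongs to the target set but cannot be told apart from object 2, so it appears only in the upper approximation; the gap between the two approximations is the boundary region that makes the set "rough".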

D. Feature Reduction using RST
The overarching goal is to identify all possible reducts and then choose the one with the least number of attributes while preserving the classification power of the original dataset. However, finding all reducts can be computationally taxing. For this reason, heuristic approaches are frequently used to approximate a minimal reduct effectively.

1) Initialize with core attributes: Start by calculating the core attributes, denoted as Core($A$), which are essential for classification. Initialize the reduct set, Reduct, with these core attributes.
2) Iterative refinement: Continue refining the reduct set until it provides the same classification power as the complete attribute set $A$. Specifically, iterate while the lower approximations under Reduct are not equal to those under $A$.
• Evaluate Significance: For each remaining attribute in $A \setminus$ Reduct, evaluate its significance in distinguishing between different classes.
• Select Most Significant Attribute: Add the attribute with the highest significance score to the Reduct set.
By the end of this iterative process, Reduct will contain a minimal set of attributes that retains the original dataset's ability to distinguish between different classes.
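The two-step heuristic above can be sketched as a greedy procedure. The source does not specify the significance measure, so this sketch assumes the rough-set dependency degree (the fraction of rows whose decision is fully determined by the chosen attributes); the data and names are toy values.

```python
from collections import defaultdict


def dependency(rows, attributes, decision):
    """Fraction of rows whose decision is determined by the given attributes."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in attributes)].add(row[decision])
    consistent = sum(
        1 for row in rows
        if len(groups[tuple(row[a] for a in attributes)]) == 1
    )
    return consistent / len(rows)


def core_attributes(rows, attributes, decision):
    """Attributes whose removal lowers the dependency of the full set."""
    full = dependency(rows, attributes, decision)
    return [
        a for a in attributes
        if dependency(rows, [b for b in attributes if b != a], decision) < full
    ]


def greedy_reduct(rows, attributes, decision):
    """Start from the core and add the most significant attribute each round."""
    target = dependency(rows, attributes, decision)
    reduct = core_attributes(rows, attributes, decision)
    while dependency(rows, reduct, decision) < target:
        remaining = [a for a in attributes if a not in reduct]
        best = max(remaining, key=lambda a: dependency(rows, reduct + [a], decision))
        reduct.append(best)
    return reduct


# Toy table (assumed): the label is "a OR b", and "c" duplicates "a".
rows = [
    {"a": 0, "b": 0, "c": 0, "label": "n"},
    {"a": 0, "b": 1, "c": 0, "label": "p"},
    {"a": 1, "b": 0, "c": 1, "label": "p"},
    {"a": 1, "b": 1, "c": 1, "label": "p"},
]
print(core_attributes(rows, ["a", "b", "c"], "label"))  # ['b']
print(greedy_reduct(rows, ["a", "b", "c"], "label"))    # ['b', 'a']
```

Because `a` and `c` carry the same information, neither is in the core, and the greedy step keeps only one of them. A greedy search of this kind approximates a minimal reduct and is not guaranteed to find the smallest one.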

E. Correlation-based Feature Selection (CFS)
CFS is an FSM designed to improve model performance by selecting features that are highly correlated with the target variable and minimally correlated with each other. The process typically begins with data preprocessing to standardize or normalize the features, followed by the calculation of a correlation matrix. Based on this matrix, an initial subset of features is selected either through predefined correlation thresholds or through optimization algorithms. The criterion often maximized to assess the quality of the feature subset is given in Eq. (18),

$$\text{Merit}_S = \frac{k\, \bar{r}_{cf}}{\sqrt{k + k(k-1)\, \bar{r}_{ff}}} \tag{18}$$

where $k$ is the number of features, $\bar{r}_{cf}$ is the average correlation between features and the class label, and $\bar{r}_{ff}$ is the average inter-feature correlation. This subset is then further evaluated using methods like cross-validation.
It is important to note that CFS employs a heuristic search strategy within its multivariate FS algorithm to pinpoint optimal attributes in a given dataset. The criteria for selection are anchored in the correlation strength and statistical significance between a feature and its associated category. This capability has solidified CFS's role as a go-to method for Feature Extraction (FE), especially in large-scale data environments. Moreover, CFS has yielded numerous impactful findings that contribute to improving the efficacy of Decision-Making Systems (DMS).
The advantages of CFS are manifold. It tends to produce simpler and more interpretable models by decreasing the feature size, thus mitigating the risk of overfitting. However, the method is not without limitations. For example, Pearson's correlation, which is commonly used, assumes a linear relationship between variables and does not capture feature interactions. Despite this, CFS remains a powerful FSM, aiming to optimize the model's performance and generalization capabilities. When integrated with techniques like Regularized Random Forest (RRF), CFS can further enhance the FSM, leveraging the regularization capabilities of RRF to produce an even more robust and interpretable model.
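The merit criterion of Eq. (18) can be evaluated with a short Python sketch. It assumes Pearson's correlation and uses absolute correlation values, which is a common convention but is not stated in the source; features are passed as plain lists of numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(columns, target):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(columns)
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (
        sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
        if pairs else 0.0
    )
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


target = [1.0, 2.0, 3.0, 4.0]
informative = [1.0, 2.0, 3.0, 4.0]  # perfectly correlated with the target
redundant = [2.0, 4.0, 6.0, 8.0]    # a rescaled copy of the same feature

print(round(cfs_merit([informative], target), 6))             # 1.0
print(round(cfs_merit([informative, redundant], target), 6))  # 1.0
```

Adding a fully redundant feature leaves the merit unchanged, since the gain in the numerator is cancelled by the inter-feature correlation in the denominator. This is the mechanism by which CFS discourages redundant subsets.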

IV. PROPOSED FSM
In the architecture of the proposed FSM, as shown in Fig. 1, four key steps integrate to provide a holistic solution. Initially, the dataset undergoes a preprocessing phase, which includes tasks like data normalization, formatting, and randomization, preparing the data for rigorous analysis. Following preprocessing, the first significant phase employs the BFS-RST Adaptive Algorithm to reduce the feature set effectively. Utilizing this algorithm allows for a focus on the subset of features that are most relevant to the task, thereby enhancing the model's efficiency. This reduced feature set serves as the input to the second crucial phase, which applies the Dynamic Correlated Regularized Random Forest (DCRRF). DCRRF refines FS dynamically, optimizing performance and interpretability through a combination of Correlation-based Feature Selection (CFS) and Regularized Random Forest (RRF) methodologies. After the optimal feature set has been identified, the final step involves a data analysis phase where the effectiveness of the selected features is rigorously tested using a Support Vector Machine (SVM) classifier. This multi-layered approach enhances the feature selection process and lends itself to detailed performance evaluation, making it a comprehensive solution for complex data analysis scenarios.

A. Data Preprocessing
The first step in the proposed FSM is Data Preprocessing. This phase is crucial because it converts the raw dataset into a cleaner, more manageable form, making it easier to analyze and feed into the subsequent FR and FS stages. A properly preprocessed dataset not only streamlines the FSM but also contributes to the robustness and interpretability of the resulting model. The preprocessing pipeline of the proposed architecture includes the following:
• Shuffling: Shuffling the order of data points enables the FSM, which employs both the BFS-RST and DCRRF algorithms, to learn more objectively, uninfluenced by the sequence in which the data points initially appear.

B. Adaptive Feature Reduction (AFR) using BFS and RST
FR is a vital process in ML pipelines, as it aims to cut down on the data dimensions without significantly affecting the model's performance. While several algorithms aim to do this, each has advantages and disadvantages. BFS is known for its ability to traverse the feature space optimally but can be computationally expensive. On the other hand, RST provides a formal framework to identify indispensable features but can be heuristic and computationally intensive for calculating reducts.
An adaptive approach that combines the strengths of both algorithms is thus conceived to achieve effective FR. The rationale is to employ RST's capability to identify core attributes indispensable for the DMS and then use BFS to navigate the feature space efficiently. In the given dataset $D$ with feature set $F$, each feature subset $S \subseteq F$ is a potential candidate for FR. These subsets are represented as nodes in the search space that BFS navigates. An evaluation function $f(S)$ is used to assess the "quality" of each subset $S$, analogous to how each node in a traditional BFS comes with an associated cost or value.

1) Initialization using core attributes from RST: Rough Set Theory first identifies a set of core attributes $F_{\text{core}}$ from $F$. These are the features that are indispensable for maintaining discernibility among the classes in $D$, as shown in Eq. (19).

$$F_{\text{core}} = \text{Core}(F) \tag{19}$$

2) Evaluation function in BFS: The evaluation function $f(S)$ used in BFS combines a cost function $g(S)$ and a heuristic $h(S)$ to guide the search. $g(S)$ could represent how well $S$ performs in terms of model accuracy or any other metric, and $h(S)$ is a heuristic estimate of the "distance" to the optimal feature subset, see Eq. (20).

$$f(S) = \alpha\, g(S) + \beta\, h(S) \tag{20}$$

Here, $\alpha$ and $\beta$ are weight parameters.
3) Priority assignment using RST: During the BFS traversal, RST is used to identify whether a subset $S$ is a reduct, that is, a minimal set of features with discernibility power comparable to $F$. Such subsets are flagged for higher priority in the BFS queue.

$$f'(S) = \begin{cases} \gamma\, f(S), & \text{if } S \text{ is a reduct} \\ f(S), & \text{otherwise} \end{cases} \tag{21}$$

In Eq. (21), $\gamma$ is a factor that lowers the evaluation function for reducts, effectively giving them a higher priority in the queue. By methodically integrating RST for initial setup and ongoing evaluation with the traversal capabilities of BFS, the algorithm aims to find an optimal and minimal feature subset from $F$ for dataset $D$. In the proposed algorithm, the focus is on reducing features by generating child nodes with fewer attributes, followed by an evaluation of their effectiveness using both BFS and RST techniques. Feature sets that neither improve nor degrade the quality of the model will be pruned. With this understanding now established, Algorithm 3 illustrates the steps involved.

Algorithm 3 for BFS-RST based Adaptive Feature Reduction

In Step 3.3.1, each child node $S'$ has one less feature than its parent $S$. This is where FR is explicitly done. Here, Rough Set Theory is used for two purposes:
(i) It provides a robust starting point $F_{\text{core}}$ that contains indispensable features, ensuring that the essential features are not eliminated in the initial stages.
(ii) It helps to flag high-priority nodes (reducts) during the FR process, guiding the algorithm toward a more meaningful feature subset.
The BFS evaluates these smaller feature sets and prioritizes them in the queue. If a reduced feature set satisfies the stopping criteria, it is output as the optimal set of features $F_{\text{opt}}$. In essence, this algorithm combines the strengths of both RST and BFS to perform feature reduction in a more effective and informed manner.
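The interplay between the weighted evaluation function of Eq. (20) and the reduct-based priority of Eq. (21) can be sketched as follows. The sketch assumes that lower scores are served first, that the reduct factor is multiplicative and lies in (0, 1), and that costs are non-negative; the parameter defaults and subset labels are arbitrary illustrative values, not settings from the paper.

```python
import heapq


def evaluation(cost, heuristic, alpha=1.0, beta=1.0):
    """Weighted evaluation score: alpha * g(S) + beta * h(S)."""
    return alpha * cost + beta * heuristic


def priority(cost, heuristic, is_reduct, alpha=1.0, beta=1.0, gamma=0.5):
    """Scale the score by gamma for reducts so they leave the queue earlier."""
    score = evaluation(cost, heuristic, alpha, beta)
    return gamma * score if is_reduct else score


queue = []
heapq.heappush(queue, (priority(3, 1, is_reduct=False), "S1"))  # score 4.0
heapq.heappush(queue, (priority(4, 1, is_reduct=True), "S2"))   # score 2.5
print(heapq.heappop(queue)[1])  # S2
```

Subset `S2` has the higher raw score, yet it is popped first because it is flagged as a reduct. This is the sense in which RST steers the traversal without replacing the BFS evaluation function.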

C. Dynamic Correlated Regularized Random Forest (DCRRF)
DCRRF is a novel hybrid model that aims to combine the strengths of Correlation-based Feature Selection (CFS) and Regularized Random Forest (RRF) to optimize FS and improve model performance dynamically.By incorporating CFS into the training of each tree within the RRF ensemble, DCRRF aims to maximize model robustness and interpretability.The model takes a reduced feature set as input from the BFS-RST Adaptive algorithm.This feature set is standardized or normalized to make feature values comparable using the following steps: 1) Standardization: In the standardization process, every attribute is adjusted to have a zero mean and a unit standard deviation.This becomes particularly crucial when dealing with features in disparate units or varying in scale.To standardize a given feature , a commonly used mathematical EQU ( 22) is typically employed.(22) 2) Normalization: In normalization, the features are typically scaled to lie in a given range .This is often beneficial when the algorithm involves distance metrics or when the feature has a skewed distribution.Normalization of a feature is generally achieved by Eq. ( 23): The normalized feature set is the foundation for the subsequent feature selection process in the DCRRF model.For each tree in the ensemble, a distinct bootstrap sample is chosen from this processed feature set.A correlation matrix is then computed for each of these samples, expressed as in Eq. ( 24).
Using the CFS criterion, a tailored feature subset $F_t$ is dynamically selected for each tree $t$. The criterion is calculated as in Eq. (25):
$$M_{F_t} = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}} \qquad (25)$$
where $k$ is the number of features in $F_t$, $\bar{r}_{cf}$ is the average correlation between features and the class label for $F_t$, and $\bar{r}_{ff}$ is the average inter-feature correlation for $F_t$. This criterion is used to select an optimal subset of features for each tree. After this dynamic FS, each tree is trained using its respective selected feature subset $F_t$. The training process adopts the regularized objective function shown in Eq. (26).
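To make the CFS criterion of Eq. (25) concrete, the sketch below evaluates the merit of a feature subset from precomputed correlations and uses it in a greedy forward selection. The greedy search strategy, the use of absolute correlations, and the function names are assumptions made for illustration; the text does not specify how the subset is searched for each tree.

```python
import math


def merit(subset, class_corr, feat_corr):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = sum(abs(class_corr[f]) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(feat_corr[a][b]) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(class_corr, feat_corr):
    """Add the feature that raises the merit most; stop when nothing improves it."""
    remaining = list(range(len(class_corr)))
    selected, best = [], 0.0
    while remaining:
        score, f = max((merit(selected + [f], class_corr, feat_corr), f)
                       for f in remaining)
        if score <= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best


# Toy correlations: features 0 and 1 are both predictive but highly redundant.
class_corr = [0.8, 0.7, 0.6]
feat_corr = [[1.0, 0.9, 0.0],
             [0.9, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
selected, best = greedy_cfs(class_corr, feat_corr)
```

In this toy case the redundant feature 1 is skipped even though its class correlation is higher than that of feature 2, which is the behaviour the merit is designed to reward.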
Importantly, this objective function is indexed by the feature subset $F_t$ selected for tree $t$, to signify the dynamically chosen features for that specific tree. The optimal feature set $F_{opt}$ is then determined by an intersection operation over all $T$ dynamically selected feature subsets. This can be formally expressed as in Eq. (27):
$$F_{opt} = \bigcap_{t=1}^{T} F_t \qquad (27)$$
The dynamic FS introduces diversity among the individual trees, making the ensemble model more resilient and adaptable. It also enables optimized FS, thereby potentially improving both performance and interpretability. Algorithm 4 presents the steps involved in the proposed FSM.
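The intersection step can be illustrated with a short Python sketch. The function name and the toy subsets are hypothetical; note that a strict intersection can become very small, or empty, when the per-tree subsets overlap little.

```python
def intersect_subsets(tree_subsets):
    """Return the features that appear in every tree's selected subset."""
    return sorted(set.intersection(*map(set, tree_subsets)))


# Feature indices selected for three hypothetical trees.
tree_subsets = [[0, 2, 5], [2, 5, 7], [5, 2, 9]]
optimal = intersect_subsets(tree_subsets)
```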

D. Data Analysis
The data analysis phase is the final phase of the FSM pipeline. It is significant because it provides the final validation of the feature sets that have been reduced and selected through the preceding stages. The focus here is on evaluating these feature sets within the specific context of the problem, be it classification, clustering, or some other ML task. In this paper, the efficacy of the proposed FR and FS model is examined using an SVM classifier. The reason for choosing SVM is twofold. First, SVMs are known for their effectiveness in high-dimensional spaces, making them a suitable choice for testing the quality of the FS. Second, SVMs are robust to overfitting, especially when the number of dimensions exceeds the number of samples, which further validates the quality of the FS. The features that have passed through the BFS-RST Adaptive Algorithm and the DCRRF are fed into the SVM model. Performance metrics such as accuracy, precision, recall, and F1-score are computed to evaluate the classifier's performance on the selected feature sets.
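For reference, the metrics named above can be derived from the confusion-matrix counts as in the following minimal sketch. It assumes a binary task in which both classes occur in the true and predicted labels; the function name and the toy labels are illustrative and are not taken from the experiments.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard classification metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```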

A. Dataset and Implementation
In the current research, experiments were conducted on four gene expression datasets analyzed by [59], namely: i) Prostate [60], ii) Brain [61], iii) NCI60 [62] and iv) Adenocarcinoma [63]. The number of instances and attributes for each dataset is detailed in Table I. All methods and experimental procedures were executed in a Jupyter Notebook environment using Python 3.6. Computations and tests were carried out on a system running Windows 10, with a 2.8 GHz AMD Ryzen 5 processor and 8 GB of RAM. The data processing, feature selection, and machine learning stages leveraged pre-existing software libraries.
The datasets mentioned above are partitioned in an 80:20 ratio for training and evaluation. The SVM model is calibrated using the hyperparameter settings shown in Table II. The regularization parameter C is set to 1 to maintain a compromise between maximizing the margin and minimizing the classification error. The Radial Basis Function (RBF) kernel is chosen for its ability to handle both linear and nonlinear patterns in the data [64][65][66][67][68][69][70][71][72][73]. The model undergoes 100 iterations during training to ensure convergence. The performance of the proposed feature selection model is compared with RF-based baseline models, namely i) Boruta, ii) RRF, iii) VSURF and iv) GRRF. The effectiveness of the SVM when paired with each of the FSMs mentioned above is assessed through metrics such as accuracy, sensitivity, specificity, precision, and F-score. The results achieved by all the models for the listed performance metrics are shown in Table III.
In both the Prostate and Brain datasets, as shown in Fig. 2 and Fig. 3, DCRRF demonstrates superior performance across multiple metrics. For the Prostate dataset, DCRRF achieves an accuracy of 0.9514, edging out the second-best model, RRF, by 0.37%. It also excels in sensitivity with a score of 0.9582, which is notably higher than RRF's 0.9413. Regarding specificity and precision, DCRRF performs on par with RRF and GRRF, highlighting its balanced efficiency in identifying True Negatives (TN) and minimizing False Positives (FP). The F1-score for DCRRF is the highest at 0.9534, and it further improves to 0.9562 when augmented with BFS-RST, all while requiring only 12 selected features. For the Brain dataset, DCRRF again leads in accuracy and sensitivity, with scores of 0.9073 and 0.8802, respectively. While its specificity score of 0.9292 is not the highest, it still indicates a balanced performance compared to RRF's higher specificity but lower sensitivity. In the precision metric, DCRRF is slightly edged out by VSURF but still performs strongly with a score of 0.8986. Its F1-score stands at 0.8894, and when combined with BFS-RST, it further improves to 0.9003, again with fewer features. These metrics collectively indicate that DCRRF, particularly when enhanced with BFS-RST, offers balanced, efficient, and robust performance across both datasets in FS and classification tasks.
In the NCI60 dataset, as shown in Fig. 4, DCRRF stands out with an accuracy of 0.9283, outperforming the next-best model, RRF, which scores 0.9157. While its sensitivity score of 0.8765 is not the highest, it is balanced by a strong specificity of 0.9405. The model's F1-score is 0.8885, which is superior to both Boruta's 0.8408 and RRF's 0.8823. Its precision score of 0.9008 is commendable, though it is eclipsed by Boruta's 0.9660. Notably, when integrated with BFS-RST, the model's F1-score rises to 0.8960 with a reduced feature count of 53. In the Adenocarcinoma dataset, as shown in Fig. 5, DCRRF maintains its strong performance with an accuracy of 0.9101, closely followed by RRF at 0.9098. DCRRF shines in sensitivity with a score of 0.8977, substantially better than Boruta's 0.8027 and slightly edging out RRF's 0.8933. With well-rounded scores in specificity (0.9367) and precision (0.9274), it also maintains a balanced F1-score of 0.8992. When enhanced by BFS-RST, the F1-score improves to 0.9179 with just 14 selected features, demonstrating the model's efficiency and efficacy in FS and classification.
In a comprehensive review of the results for Out-of-Bag (OOB) error, time consumption, and Area Under the Curve (AUC), as shown in Fig. 6 to Fig. 8, BFSRST+DCRRF consistently delivers outstanding performance across all datasets, excelling in AUC and minimizing OOB errors. For instance, in the Prostate dataset, this model achieves the highest AUC of 0.893, using the fewest features (12) and an OOB error of just 0.11. The computational time, although slightly higher than that of DCRRF alone, remains modest at 0.06 minutes. Similarly, in the Brain and NCI60 datasets, BFSRST+DCRRF again achieves the highest AUC, recording 0.911 and 0.914, respectively, while maintaining low OOB errors and computational times. On the Adenocarcinoma dataset, it achieves the highest AUC of 0.902. VSURF performs well but is computationally expensive, which is particularly noticeable in the Prostate and Adenocarcinoma datasets, where the computational times are 0.08 and 0.1 minutes, respectively. DCRRF alone also shows promise, particularly in the NCI60 and Adenocarcinoma datasets, where it nearly matches the performance of its BFSRST-enhanced version but with more features. Boruta and RRF, although competent, generally lag in AUC and OOB error. Notably, GRRF consistently demands fewer features but does not offer a compelling tradeoff regarding AUC or OOB error. Overall, the BFSRST+DCRRF model demonstrates superior, balanced performance across all four datasets.
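As a reminder of what the AUC values above measure, the sketch below computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, counting ties as one half. This pairwise formulation is one standard way to compute AUC and is shown for illustration only; it is not claimed to be the routine used in the experiments, and the scores are toy values.

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of positive-negative pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(y_true, scores)
```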

VI. CONCLUSION
Handling big data with high dimensions presents unique challenges, particularly regarding computational resources and predictive accuracy. To address these issues, an all-encompassing Feature Selection Model (FSM) has been developed. This system incorporates initial data cleaning and feature reduction through Best-First Search and Rough Set Theory (BFS-RST). It culminates in a specialized Random Forest (RF) algorithm called Dynamic Correlated Regularized Random Forest (DCRRF). Each stage of this four-tiered architecture serves a specific function, from initial data refinement to advanced FS. The final assessment phase employs a Support Vector Machine (SVM) classifier to rigorously evaluate the quality and utility of the selected features. When tested against existing RF-based FSMs on four gene expression datasets, the proposed approach improved computational efficiency and classification accuracy. The system's enhanced performance indicates its potential as a scalable solution for the challenges presented by high-dimensional big data across various applications.
all nodes using Impurity of Parent Node - Weighted Impurity of Child Nodes (8)
Repeat steps 1.2.1 to 1.2.4 for the two child nodes.
Recursive Split
Fig. 1. Proposed FSM.
• Data Randomization: Data randomization is incorporated to mitigate any sequence-based biases in the dataset. Data points can arrive in sequences that reflect some underlying structure or bias, such as time-based or class-based ordering.

TABLE III. PERFORMANCE COMPARISON FOR DIFFERENT BASELINES AGAINST FOUR DATASETS