Is Deep Learning on Tabular Data Enough? An Assessment

Abstract—It is critical to select the model that best fits the data being analyzed. Many researchers have proposed ensemble techniques, such as boosting and Logistic Model Tree ensembles, for classification and regression problems on tabular data. More recently, several deep learning algorithms have been applied to tabular data, with their authors claiming that deep models outperform boosting and model tree approaches. On a range of datasets, including a historical geographical dataset, this study compares recent deep models (TabNet, NODE, and DNF-Net) against a boosting model (XGBoost) to determine whether they should be regarded as a preferred choice for tabular data. We examine how much tuning and computation they require, as well as how well they perform according to metric evaluation and a statistical significance test. According to our study, XGBoost outperforms these deep models across all datasets, including the datasets used in the papers that introduced the deep models. We further show that XGBoost requires considerably less tuning than the deep models. In addition, we confirm that a combination of deep models with XGBoost outperforms XGBoost alone on almost all datasets.


I. INTRODUCTION
Deep learning has gained popularity in a variety of fields in recent years, including medicine, engineering, and agriculture, most likely driven by the exponential growth of data. Deep learning algorithms have proven effective in a variety of domains, including audio [1], images [2], and text data [3], where many architectures are capable of converting raw data into meaningful representations. However, the most common type of data is in tabular format, consisting of rows and columns with a variety of parameters. Such data is used in real-world applications across fields including medicine, agriculture, academia, and geography. Traditional and ensemble machine learning approaches, such as the Logistic Model Tree (LMT), Decision Tree (DT), Random Forest (RF), Gradient Boosted Decision Tree (GBDT), and others, are used to process these tabular datasets, and these models still outperform deep learning on tabular data. When applying a deep learning model to tabular data, a number of issues arise, including missing data, data integrity, i.e., mixed data (nominal, numerical, and categorical), data imbalance, overfitting, and a lack of specific knowledge about the dataset's structure. In line with the "no free lunch" (NFL) theorem [4][5], no single model dominates every problem, and boosting algorithms such as XGBoost have tended to perform better when tabular data is involved. More recently, the authors of [6][7] have applied deep learning to tabular datasets and reported that their deep models outperform GBDT. However, because each study was conducted on different datasets, one of the major flaws in their approach is the absence of a benchmark dataset [8][9]. Based on these papers alone, it is therefore difficult to claim that deep learning always outperforms traditional and ensemble algorithms such as GBDT on tabular data [10].
Although the number of research studies applying deep learning to tabular data is growing, there is no standard benchmark model in deep learning from which we can conclude that deep learning always outperforms traditional machine learning on tabular data. As a result, the main goal of this paper is to determine whether any deep learning model is a good fit for these types of tabular dataset problems. Furthermore, we evaluate the proposed deep learning models on tabular datasets and implement XGBoost on the same datasets, with a focus on a historical geographical dataset from India's Kashmir province [11]. This paper is structured as follows: Section 2 provides a basic background on deep learning and ensemble models for tabular data. Section 3 presents the experimental setup, describing the datasets and detailing the implementation, optimization parameters, and statistical significance test. Section 4 presents the experimental results and model evaluation. Section 5 discusses the overall findings of the paper. Finally, the conclusion and future strategies are suggested in Section 6.

II. REVIEW OF LITERATURE
In this section, we present studies that used deep learning approaches and ensemble approaches to predict rainfall using a tabular geographical dataset. This section is divided into two subsections: Subsection A covers studies that use deep learning models on tabular datasets, and Subsection B covers model ensemble approaches that use the same tabular geographical dataset and record individual performances.

A. Deep Learning on Tabular Geographical Dataset
Salman et al. [12] (2015) use a variety of deep learning techniques, including recurrent neural networks (RNN), convolutional neural networks (CNN), and conditional restricted Boltzmann machines (CRBM), to look for hidden patterns in the dataset. These techniques were applied in the Indonesian region, with data collected from the National Oceanic and Atmospheric Administration's (NOAA) center for environmental prediction. The study used a dataset spanning 1973 to 2009. Initially, an RNN was applied to a dataset containing ENSO variables, and according to the findings, the RNN produced results with a higher level of accuracy.
Emiley et al. [13] (2016) present a deep learning architecture for accumulated daily rainfall prediction. The research employs autoencoders to reduce the dimensionality of non-linear attribute relationships and a multi-layer perceptron (MLP) for prediction. The hybrid architecture was then compared with previously implemented techniques, and the model was found to perform better for daily rainfall prediction in terms of root mean square error (RMSE) and mean squared error (MSE). The research was carried out in the Colombian city of Manizales, where the data was grouped into a daily time series spanning the years 2002 to 2013.
Devi et al. [14] (2017) propose an artificial neural network (ANN) model as a reliable forecasting mechanism. The method was used to analyze spatial and temporal data from the Nilgiris district in Tamil Nadu, India, and performance was measured using a variety of statistical parameters such as the correlation coefficient and MSE. When compared with a time-delay neural network (NN) and other ANN models, the best model in this study was a wavelet Elman model. The research also develops a system for early landslide warnings based on the wavelet Elman model.
According to Geetha et al. [15] (2018), using deep learning techniques on a time series dataset for meteorological purposes significantly improves prediction accuracy. The research uses deep learning architectures such as LSTM and ConvNet to analyze 468 months of time series data from various locations. It was found that increasing the number of hidden layers improves the model's performance for daily rainfall prediction in terms of the RMSE and MAPE statistics.
Yen et al. [16] (2019) proposed applying deep learning models to rainfall prediction using the Echo State Network (ESN) and deep Echo State Network (DeepESN) algorithms. The research uses hourly rainfall data from southern Taiwan covering 2002 to 2014. When the DeepESN algorithm's correlation coefficient was compared with ESN and conventional algorithms such as BPNN and SVR, the study concluded that DeepESN is a reliable algorithm. Based on these results, it was suggested that DeepESN could be used globally on larger datasets to predict rainfall.
Manoj et al. [17] (2020) proposed a hybrid deep learning model (BLSTM-GRU) for monthly rainfall prediction. The experiment was conducted using data obtained from Bhutan's National Center of Hydrology and Meteorology (NCHM). To test the data's predictive capability, various NN algorithms such as LSTM, CNN, BLSTM, and GRU were used. LSTM outperforms the other individual techniques with an MSE of 0.0128, but the hybrid BLSTM-GRU model outperforms LSTM by approximately 41% with an MSE of 0.0075.
Zeelan et al. [18] (2020) claimed that deep learning models can learn from nonlinear data with less error. A multi-layer perceptron (MLP) and an autoencoder NN are used in this study to predict rainfall. The accuracy parameters used were RMSE and MSE, and the implemented models were later compared with other machine learning models on the same set of data, with the study concluding that the MLP and autoencoder NN perform well and can serve as an alternative to the available approaches.
Ari Yari et al. [19] (2021) present a comparative study of rainfall prediction. The authors use deep learning (DL) models alongside simple rainfall estimation approaches based on traditional machine learning algorithms. The study was conducted in five major cities across the United Kingdom (UK), with data collected over roughly 20 years (2000-2020). After evaluation of the proposed models, the bidirectional LSTM network and a stacked LSTM with two hidden layers performed best. One of the study's major flaws was the models' inability to generalize: they over-fit the training data in most cases, which makes it difficult to produce accurate predictions on the testing and validation sets.
Razeef et al. [20][21] (2020, 2022) proposed a neural network approach to predict rainfall on time series data for the UT of J&K, India, using a Grey Wolf-based neural network model. The data in this study spans 30 years, from 1990 to 2020, and includes variables such as maximum temperature, humidity, minimum temperature, wind, vapor pressure, and others. In terms of RMSE, PRD values, and MSE, the model was found to perform better for daily rainfall prediction. The model was later compared with non-linear autoregressive models with exogenous inputs (NARX), and the study concluded that non-linear time series data would be handled better when both models are used together.
The literature reviewed in this study shows that many deep learning models have been utilized for various time-series prediction applications, but none has yet become a standard algorithm in the artificial intelligence arena. The performance of these models must therefore be analyzed on datasets with varied thresholds, and these techniques must be re-evaluated accordingly.

B. Model Ensembles on Tabular Geographical Dataset
Zaman et al. [22] (2019) use an ensemble distributed decision tree (DDT) approach to improve classification accuracy on a historical geographical dataset. The experiment was conducted on a tabular dataset containing approximately 6000 records with five different parameters. According to this study, the DDT approach yielded no improvement in performance.
Patil et al. [23] (2020) use machine learning algorithms, including linear regression and NN, to forecast rainfall based on a variety of variables such as temperature, humidity, wind speed, and rainfall. According to the study, the type of data fed to an algorithm determines its accuracy; that is, datasets with different structures may yield different accuracies and require some modifications. Furthermore, the accuracy of DTs was found to be superior to the other techniques used on the same type of data.
Sheikh et al. [24] (2021) proposed a stepwise machine learning approach on discrete data collected from the Indian Meteorological Department (IMD), Pune, India. The implemented model, known as LMT, employs logistic regression functions at the leaf nodes of a DT. The logistic functions at the leaf nodes combine the final output of the constructed DT into linear models, which showed a significant improvement in accuracy: the accuracy of the constructed DT on the same set of data is 66 percent, but when the logistic functions are applied to the leaf nodes, the accuracy jumps to 87 percent. The dataset used in this study was from J&K's Kashmir province and covered the years 2012 to 2017, with around 6000 data rows.

III. EXPERIMENTAL SETUP

A. Dataset Description
In this study, we employed a variety of tabular datasets from diverse fields that are used in various classification and regression problems. Some of these datasets have heterogeneous features, while others have only homogeneous features. Seven of the tabular datasets have already been used by various researchers in their publications, and we have added one additional dataset that has not yet been used by any researcher. Roughly 80,000 samples were used in the experiments, and the datasets range from 7 to 1,600 parameters. The seven datasets are obtained from the TabNet, NODE, and DNF-Net studies, and each dataset has been well trained and preprocessed in its respective paper. These datasets include Blastchar [34] (Source: Kaggle), Higgs Boson [35] (Source: Kaggle), Microsoft MSLR [36] (Source: MSLR-WEB10K), Forest Cover Type [37] (Source: Kaggle), Epsilon [38] (Source: PASCAL Challenge 2008), YearPrediction [39] (Source: Million Song Dataset), and Gas Concentrations [40] (Source: OpenML). These datasets have also been adjusted and relative values calculated, resulting in standardized data with zero mean and unit variance. As a result, we will not describe these datasets in detail here; instead, we describe only the historical geographical dataset that we implement later. The historical geographical dataset was collected from three different locations in Jammu and Kashmir's UT. These three locations are in the province of Kashmir but are quite far apart. The data spans five years, from 2012 to 2017, and the average annual rainfall at these locations is around 1700 mm. The data consists of 5,491 records with a total of 9 explanatory characteristics, including minimum temperature (°C), maximum temperature (°C), station ID, season, year, humidity at various intervals, and the target parameter rainfall, which gives the quantity of rainfall measured in millimeters [41][42][43].
In Fig. 1, the reader can find a brief description of the data. It has gone through an ETL (Extract, Transform, and Load) process to achieve data integrity, normalization, and standardization.
To normalize the range of each parameter in the dataset, we use function (1), which standardizes each value to zero mean and unit variance:

x' = (x - μ) / σ    (1)

where μ and σ are the mean and standard deviation of the parameter. The data was scaled using the R tool's built-in 'scale' function, and we also use relative values of each attribute to normalize the training data. Tabular (Table I) and graphical (Fig. 2) representations of the dataset are shown. A total of 70% of the data is used for training, 15% for validation, and 15% for testing, i.e., 3844 samples were randomly selected for training, 823 for validation, and the remaining 823 for testing.
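As an illustration, the following Python sketch shows one way to perform the zero-mean, unit-variance standardization of (1) and the 70/15/15 split described above; the file path and column names are hypothetical placeholders, not the actual dataset schema.

```python
# Hedged sketch: standardize features and create a 70/15/15 split,
# mirroring the preprocessing described above. The file name and column
# names are placeholders, not the actual dataset schema.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kashmir_rainfall.csv")       # hypothetical path
features = df.drop(columns=["rainfall"])       # assumed target column name
target = df["rainfall"]

# Zero-mean, unit-variance scaling, equivalent to R's scale().
features = (features - features.mean()) / features.std()

# 70% train, then split the remaining 30% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    features, target, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```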
Thus, the overall description of the tabular datasets used in this paper is shown in Table II.

B. Implementation Details
1) Optimization process:
To select the model hyperparameters during the optimization phase, we used the HyperOpt parameter-tuning package. This technique performs a Bayesian optimization-based search over the hyperparameter space of each dataset used in this study in order to optimize the results on the validation set. There were around 7-9 main hyperparameters, which in the case of a deep learning model include the number of nodes, the number of layers, and, most importantly, the learning rate.
To optimize the hyperparameters, all the datasets used in this study were initially divided into three individual splits: a training split, a testing split, and a validation split. In the partitioning process, we use stratified random sampling to split the data. The tabular representation below (Table III) shows the individual splits of the datasets used to optimize the models.
Around 1000 search steps were performed on each dataset in order to maximize performance on the validation set, and only the set of hyperparameters with the smallest loss was chosen for the final configuration.
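A minimal sketch of the kind of HyperOpt-based search described above is given below; the search space, the objective, and the data variables are illustrative assumptions rather than the exact configuration used in this study.

```python
# Hedged sketch: Bayesian (TPE) hyperparameter search with HyperOpt for
# an XGBoost regressor. The search space and data variables (X_train,
# y_train, X_val, y_val) are illustrative assumptions.
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import mean_squared_error

space = {
    "max_depth": hp.choice("max_depth", [4, 6, 8, 10]),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "n_estimators": hp.choice("n_estimators", [100, 300, 500]),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    # HyperOpt minimizes the returned value, so return the validation RMSE.
    return mean_squared_error(y_val, preds) ** 0.5

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=1000, trials=trials)   # ~1000 search steps, as above
print("Best hyperparameters:", best)
```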

2) Metrics evaluation and statistical significance test:
In the case of classification problems on discretized data, we use the cross-entropy loss to evaluate the datasets; it measures the impurity at each stage of the data and the total entropy loss at the end. When the data is continuous in nature, as in regression problems, statistical parameters such as RMSE, mean signed difference (MSD), and MAE are used. We report the performance of each model on the respective test set of each dataset based on these metrics. In addition to the cross-entropy, RMSE, and MAE measurements, we use Friedman's test for statistical significance. Friedman's test has the advantage of not assuming that the data is normally distributed. Using Friedman's test, we compare all of the classifiers against the baseline classifier. The null hypothesis is rejected at the chosen level of confidence (90 percent in this study) if the p-value for a model pair is less than 0.05; otherwise, the hypothesis is not rejected.
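As a rough illustration, the sketch below computes RMSE and MAE for several models and applies the Friedman test as provided by SciPy; all numeric values are placeholders, not the results reported in this study.

```python
# Hedged sketch: regression metrics plus a Friedman significance test
# across models. All values below are placeholders for illustration only.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder ground truth and per-model predictions.
y_test = np.array([1.2, 0.0, 3.4, 2.1, 0.5])
preds = {
    "XGBoost": np.array([1.0, 0.1, 3.2, 2.3, 0.4]),
    "TabNet":  np.array([0.8, 0.3, 3.9, 1.7, 0.9]),
    "NODE":    np.array([0.9, 0.2, 3.7, 1.8, 0.8]),
}

for name, y_pred in preds.items():
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    print(f"{name}: RMSE={rmse:.4f}, MAE={mae:.4f}")

# Friedman test over per-split scores of each model; the test does not
# assume normally distributed data. Scores below are placeholders.
xgb_scores    = [0.41, 0.39, 0.43, 0.40]
tabnet_scores = [0.45, 0.44, 0.47, 0.46]
node_scores   = [0.46, 0.43, 0.48, 0.45]
stat, p_value = friedmanchisquare(xgb_scores, tabnet_scores, node_scores)
print(f"Friedman statistic={stat:.3f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: model performances differ.")
```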

IV. EXPERIMENTAL RESULTS AND MODEL EVALUATION

A. How Effectively can Deep Learning Models Generalize to Other Datasets?
This study evaluates the performance of the deep learning models on the aforementioned datasets and compares the individual outcomes with the XGBoost technique. The performance of each algorithm on each dataset is presented in Table IV, which shows the mean and standard error of each model's performance on each dataset. The best performance for each dataset is highlighted, and the model with the lowest value is considered to have the best performance. Friedman's test was used to perform a statistical significance test between the models at a 90% confidence level. Several observations follow from the results in the table. First, the models generally perform better on their original datasets than on datasets they have not seen before. On nearly every dataset, the XGBoost model outperformed the deep learning models NODE, DNF-Net, and TabNet: XGBoost outperforms the deep learning models on 5 of the 8 datasets, and on these datasets the p-values were significant (< 0.05). We can also see that the deep learning models have not performed consistently. Their authors claimed that deep learning models outperform other models, but this held only for the datasets included in their own studies; when different datasets are involved, this conclusion is unjustifiable. We can also observe that the ensemble of deep models with XGBoost beats the individual models in the majority of cases, i.e., on 5 of the 8 datasets, and the p-value in these 5 cases was substantially less than 0.05, indicating that the null hypothesis is rejected. To evaluate these models and determine which one is better for a given dataset, we compared the relative performance of each model (NODE [44], TabNet [45], DNF-Net [46][47], and so on) against the best model for that dataset. For example, taking the historical geographical dataset in Table IV and comparing the relative performance of the models, Deep Ensemble & XGBoost had the best relative value of 2.46 percent, with XGBoost second at 3.86 percent, TabNet at 8.67 percent, DNF-Net at 10.55 percent, and NODE at 13.23 percent. The average relative performance deterioration on unseen datasets is shown in Table V.
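The relative performance figures above can be computed as the percentage increase of each model's loss over the best (lowest) loss per dataset, averaged across datasets; a small, hedged sketch with placeholder loss values follows.

```python
# Hedged sketch: average relative performance deterioration of each model
# across several datasets, computed per dataset as the percentage increase
# of the model's loss over the best (lowest) loss for that dataset and then
# averaged. All loss values are placeholders, not the reported results.
losses = {
    # dataset -> {model -> loss}
    "Geographical": {"XGBoost": 10.4, "TabNet": 11.0, "NODE": 11.5, "Ensemble+XGB": 10.2},
    "Higgs Boson":  {"XGBoost": 21.8, "TabNet": 23.1, "NODE": 22.5, "Ensemble+XGB": 22.0},
}

models = next(iter(losses.values())).keys()
for model in models:
    rel = []
    for dataset, scores in losses.items():
        best = min(scores.values())
        rel.append(100.0 * (scores[model] - best) / best)
    print(f"{model}: average relative deterioration = {sum(rel) / len(rel):.2f}%")
```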
With these findings, we discovered that deep learning does not always outperform other methods. When compared to XGBoost and to the ensemble of deep models with XGBoost, the deep learning models perform worst when trained on datasets other than those used in the original studies. Two explanations are possible for the lower performance: either there is a selection bias in the datasets chosen by the original studies, or there is a difference in how the hyperparameters were optimized. Furthermore, the results in the original papers match the results that we report, excluding the possibility that implementation errors were the cause of our observation.

B. Model Evaluation: Is it required to apply both the XGBoost and Deep Models in Combination?
In this section, we examine which model performs better in all scenarios when compared to the other models. We employed four types of models in Table V: the deep models (TabNet, DNF-Net, and NODE), XGBoost, the deep ensemble with XGBoost, and the deep ensemble without XGBoost. When comparing the performance of the deep learning models against XGBoost and the combined deep ensemble & XGBoost models on various data types, we found that the deep learning models perform poorly in most circumstances. The question now is whether we require a combined XGBoost and deep model. To answer this, we can see that in 6 of the 8 cases, the combined ensemble with XGBoost shows significant results. The simple ensemble did not produce improved results, although it was competitive with the deep learning models. Furthermore, when we look at the deep ensemble models without XGBoost, we observe that they did not do well in any situation when compared with any other model. From this analysis, we conclude that for tabular datasets, we require the deep ensemble and XGBoost in combination. In real-world situations, time and resources are limited when it comes to training a model on a new dataset and optimizing its hyperparameters, so it is useful to understand how difficult this is for each model. One way to assess this is to calculate the number of computations the model requires, commonly measured in floating-point operations (FLOPs). However, because each parameter set has a different FLOP count, comparing the models in this way while optimizing model parameters is impractical [47].
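As an illustration of one simple way such a combined ensemble could be formed, the sketch below averages the predictions of fitted deep models and XGBoost; equal weighting and the model variable names are assumptions, not necessarily the scheme used in this study.

```python
# Hedged sketch: combining deep-model predictions with XGBoost by simple
# unweighted averaging. The fitted models and data are placeholders; equal
# weights are an assumption rather than the exact ensembling scheme used here.
import numpy as np

def ensemble_predict(models, X):
    """Average the predictions of several fitted regressors on X."""
    preds = np.column_stack([m.predict(X) for m in models])
    return preds.mean(axis=1)

# Example usage (assumes xgb_model, tabnet_model, node_model are already fit):
# y_pred_deep_only = ensemble_predict([tabnet_model, node_model], X_test)
# y_pred_with_xgb  = ensemble_predict([tabnet_model, node_model, xgb_model], X_test)
```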

V. DISCUSSION
This study was based on deep models that had already been applied by several researchers to tabular datasets [12][13][14]. In their publications, the authors applied deep models to tabular datasets (Forest CoverType, Higgs, Gas Concentration, Epsilon [30], MSLR [31], Year Prediction [32], Blastchar [33], and so on) and argued that deep models exhibit some promising outcomes. However, their evaluations were limited to the datasets chosen in their own studies. We used one additional tabular dataset (the geographical dataset) in this research and implemented all of the deep learning and ensemble models. On all of the datasets utilized in this study, we also investigated various tradeoffs that matter in real-world applications, such as hyperparameter tuning, metric evaluation, and the statistical significance test. Our results show performance similar to what the authors reported in their respective publications, but when we compared the performance of the models on datasets other than those used in their studies, the deep learning results were not as good as on the original datasets. We then looked at XGBoost and at ensembles of deep models with and without XGBoost, and found that the XGBoost model outperforms the deep models. However, as seen in the table, the ensemble of XGBoost with the deep models outperforms the XGBoost model alone. Furthermore, optimizing a new dataset with deep models is a difficult procedure, whereas optimizing a new dataset with an ensemble that includes XGBoost is considerably simpler [48].

VI. CONCLUSION AND FUTURE STRATEGIES
This research demonstrates that applying various deep learning algorithms to tabular data does not by itself improve performance. We also applied XGBoost to these datasets, which produced promising results compared with the deep models, and we used ensembles of deep models with and without XGBoost to see how they affected performance on each dataset. On these tabular datasets, the ensemble of deep models without XGBoost never performed well, but when we examined the overall performance of the ensemble of deep models with XGBoost, the results improved markedly. This ensemble of deep models with XGBoost beats all the other models, and these results pave the way for future study on tabular datasets in terms of comparing performance and helping researchers determine the best technique for optimizing hyperparameters. Our findings will also aid the development of new models (such as CatBoost, where learning rates are uniformly distributed) that are simple to optimize and can compete with the performance of XGBoost and ensembles of deep models.