A Heuristic Feature Selection in Logistic Regression Modeling with Newton Raphson and Gradient Descent Algorithm

Abstract: Decision-making can often be expressed as a binary choice, such as success or failure, acceptance or rejection, high or low, or heavy or light. A classification model predicts an unknown categorical value from known predictor feature values, and logistic regression is a commonly used classification approach in a variety of scientific domains. The goal of this research is to build a logistic regression model with a heuristic approach for selecting input features and to compare the Newton Raphson and gradient descent (GD) algorithms for estimating its parameters. The Chi-square test was applied to data on pregnant women in Malang, Indonesia, and four predictor features met the criterion of being both dependent on the target and independent of one another: age (X1), parity (X2), history of hypertension (X3), and salty food consumption (X6). These four features are significantly associated with the incidence of preeclampsia. The logistic regression model fitted with the gradient descent algorithm had a lower risk of error than the model fitted with the Newton Raphson algorithm. The gradient descent model has an accuracy of 98.54 percent and an F1 score of 97.64 percent, while the Newton Raphson model has an accuracy of 86.34 percent and an F1 score of 72.55 percent.


I. INTRODUCTION
In modern statistical modeling, the choice of model type starts from a simple question: does the data set contain a target feature? A descriptive model is developed if there is no target feature, while a predictive model is developed if there is one. Clustering is the most popular descriptive method. Marji et al. [1] discussed fuzzy subtractive clustering, and Handoyo et al. [2] discussed the performance of an optimal clustering method that hybridizes fuzzy subtractive and fuzzy c-means clustering. Another type of descriptive modeling is the assessment tool that generates a ranking of objects based on their features [3]. Predictive modeling is divided into two types based on the measurement scale of the target feature. A regression model is built when the target feature is continuous (interval or ratio), while a classification model is built when the target feature is discrete (nominal or ordinal). In statistics, regression modeling emphasizes exploring the causal relationship between the target and predictor features [4][5], whereas in the machine learning community it is oriented toward capturing the patterns in a data set in order to obtain a model that predicts unknown values of the target feature with high accuracy [6]. Examples of regression modeling for predictive purposes include the model of Handoyo et al. [7] for predicting the regional minimum wage and the model of Handoyo and Chen [8] for predicting daily soybean prices in Indonesia.
Classification methods receive particular attention because decisions are more measurable and easier to execute when they take the form of discrete choices, and a continuous scale also becomes simpler when it is transformed into a discrete scale [9]. Several researchers have compared the performance of classification models: Widodo and Handoyo [10] compared logistic regression and Support Vector Machine, Nugroho et al. [11] compared logistic regression and Learning Vector Quantization, and Handoyo et al. [12] varied the threshold values to obtain the best performing logistic regression and linear discriminant models. An ideal model involves only predictor features that contribute significantly to the variability of the target feature [13][14]. Thus, feature selection is an important stage in model development. In general, feature selection methods are divided into two approaches, namely the wrapper approach and the filter approach. Wrapper feature selection is computationally expensive because it fits the model on all possible feature combinations [15]. Feature selection with the filter approach is more heuristic in nature: it evaluates both the dependency between the predictor and target features and the independence among the predictor features [16][17]. The Chi-square test can be used for both evaluations when the two features involved are categorical [18].
Parameter estimation has an important role in producing the best model. In statistics, parameter estimates can be obtained by minimizing the sum of squared errors (SSE), known as the least squares (LS) method [19], or by maximizing the log likelihood function, known as maximum likelihood estimation [20][21]. The LS method is generally used for estimating parameters in linear systems, while the maximum likelihood estimation (MLE) method is used for estimating parameters in nonlinear systems. The complexity of nonlinear models has also prompted researchers to use optimization methods such as Particle Swarm Optimization [22][23]. The Newton Raphson algorithm maximizes the likelihood function by treating its first-derivative condition as an equation whose root gives the estimated parameters [24][25]. The gradient descent algorithm, on the other hand, finds the estimated parameters by repeatedly moving against the gradient of the score function until the gradient approaches 0, which indicates that the optimal solution has been reached [26][27].
In the field of public health, many problems must be handled and controlled properly. One of them is preeclampsia, because it is the main cause of maternal death [28]. The immune system plays an important role in promoting the occurrence and development of preeclampsia, and Wang et al. [29] identified significant immune-related genes for predicting its occurrence. Women with preeclampsia are more likely to develop acute kidney injury, placental abruption, and postpartum hemorrhage syndrome before they give birth [30]. Reddy et al. [31] evaluated the application of a broader definition of hypertension and the most appropriate definition of end-organ dysfunction, because the definition used so far remains controversial.
Based on the description above, this study has two aims. The first is to obtain, by using the Chi-square test, predictor features that are independent of one another and have a significant effect on preeclampsia. The second is to compare the logistic regression models fitted with the Newton Raphson algorithm and with the gradient descent algorithm, using popular performance measures for classification models.
The paper consists of five sections. Section 2 describes the proposed method in detail, namely the feature selection method with a filter approach using the Chi-square test, the cost function in predictive modeling, and both learning algorithms, i.e. Newton Raphson and gradient descent. The empirical data, for both the response and the predictor features, are presented in Section 3. Section 4 discusses the logistic regression models generated by the Newton Raphson and gradient descent algorithms and their performance. Conclusions and recommendations for further research are given in Section 5.

II. PROPOSED METHOD
A good model is simple and has high performance. One characteristic of a simple model is that it involves a small number of predictor features. Model parameters are estimated in the training process using an optimization technique such as maximum likelihood. When the log likelihood function is nonlinear in its parameters, a numerical iterative algorithm can be used to obtain an estimator of the model parameters. This section discusses the test of dependency for feature selection, the score function in maximum likelihood, and the Newton Raphson and gradient descent algorithms.

A. Chi Square Test for Feature Selection
In machine learning, each predictor feature is expected to have a relationship (dependency) with the response feature, while any two predictor features are expected to have no relationship with each other [32]. The chi-square test is useful for evaluating the association between two categorical features. The chi-square ($\chi^2$) test has the null hypothesis that the two categorical features are independent, against the alternative hypothesis that the two categorical features are dependent [33]. The null hypothesis is rejected when $P(\chi^2 > \chi^2_{\text{statistic}})$, i.e. the p-value, is less than 0.05; otherwise the null hypothesis cannot be rejected.
The main idea of the chi-square test is to compare the observed values in the data with the theoretically expected values and test whether the values are related to each other. A contingency table of the two categorical features is created to support the calculation of the chi-square value. The formula of the chi-square statistic is the following [34]:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \tag{1}$$

where $\chi^2$ is the chi-square statistic, $O_{i,j}$ is the observed value, and $E_{i,j}$ is the expected value of the two nominal variables in cell $(i, j)$ of a contingency table with $r$ rows and $c$ columns. The chi-square statistic has $(r - 1)(c - 1)$ degrees of freedom (df). The $E_{i,j}$ value can be calculated by the formula:

$$E_{i,j} = \frac{R_i \, C_j}{N} \tag{2}$$

where $R_i$ is the sum of the $i$-th row, $C_j$ is the sum of the $j$-th column, and $N$ is the total number of instances.
When the dependency between a predictor feature and the response feature is evaluated, the desired decision is to reject the null hypothesis, in which case the predictor feature is kept as a candidate predictor. When the dependency between two predictor features is evaluated, on the other hand, the desired decision is to fail to reject the null hypothesis, which means that both categorical features can be kept together as predictor features.
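As an illustration of formulas (1) and (2), the following minimal Python sketch computes the chi-square statistic for a contingency table. It is not the authors' implementation. The function name `chi_square_test` and the example counts are hypothetical, no continuity correction is applied, every expected count is assumed to be positive, and the p-value is computed only for the 2x2 case (df = 1), where the chi-square tail probability has a closed form through the complementary error function.

```python
import math


def chi_square_test(table):
    """Pearson chi-square test of independence for a contingency table.

    `table` is a list of rows of observed counts.
    Returns (statistic, df, p_value); the p-value is only computed
    for df == 1 (a 2x2 table) and is None otherwise.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # formula (2)
            stat += (observed - expected) ** 2 / expected  # formula (1)

    df = (len(table) - 1) * (len(table[0]) - 1)
    # For df = 1, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0)) if df == 1 else None
    return stat, df, p_value


# Hypothetical counts, NOT taken from the paper's contingency tables.
observed = [[90, 50], [20, 45]]
stat, df, p = chi_square_test(observed)
print(f"chi2 = {stat:.3f}, df = {df}, p = {p:.5f}")
print("reject H0 (dependent)" if p < 0.05 else "cannot reject H0")
```

Because all features in this study are binary, every contingency table is 2x2, so the df = 1 shortcut is sufficient here; tables with more categories would need a general chi-square distribution function from a statistics library.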

B. Score Function in Maximum Likelihood Estimation
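The goal of a predictive model is to make the correct prediction of the target value for a previously unseen data item. A score function is a function of the difference between the real answer $y^{(i)}$ and the predicted value $\hat{y}(x^{(i)}; \theta)$ [35]. Consider $n$ instances having the response feature $y^{(i)}$ and the predictor features $x^{(i)} = [x_1, x_2, \cdots, x_p]$ for $i = 1, 2, 3, \ldots, n$. Assume that $y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}$ is a regression model structure having $p$ unknown parameters, where $\varepsilon^{(i)}$ is random noise (error) representing the unmodeled effect. By assuming $\varepsilon^{(i)} \sim N(0, \sigma^2)$, the probability density function of $\varepsilon^{(i)}$ can be stated as in equation (3) [36].

$$p(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\right) \tag{3}$$

The posterior probability with the unknown parameter $\theta$ is

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \tag{4}$$

Equation (4) means that $y^{(i)} \mid x^{(i)}; \theta \sim N(\theta^T x^{(i)}, \sigma^2)$, and it is also called the likelihood function. The likelihood function of $n$ instances is

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$$

The log likelihood is

$$\ell(\theta) = \log L(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{\sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \tag{5}$$

The maximum likelihood estimation method chooses $\theta$ to maximize $\ell(\theta)$ in equation (5) by taking the first derivative with respect to $\theta$ and setting it to 0 [37]. The only term in equation (5) involving the parameter $\theta$ is the numerator of the second part, i.e. the sum of squared errors, which must be minimized for $\ell(\theta)$ to reach its maximum. In other words, obtaining the optimum parameter through MLE is equivalent to minimizing equation (6), also called the score function of the regression model, which is the negative of the log likelihood $\ell(\theta)$ up to terms that do not depend on $\theta$.

$$\text{Minimize } J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2 \tag{6}$$

where $J(\theta)$ is called the loss or cost function of a regression model. A binary classification model has a response feature taking values in $\{0, 1\}$. In the logistic regression case, the classifier model structure is a sigmoid function whose primary task is to separate the two classes, i.e. to act as a boundary curve between them. The sigmoid formula for an instance is stated as follows:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

It is expected that $h_\theta(x) \in [0, 1]$, with $P(y = 1 \mid x; \theta) = h_\theta(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$. The posterior probability of a binary classification follows a binomial distribution:

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1 - y}$$

The likelihood function of $n$ instances is expressed as

$$L(\theta) = \prod_{i=1}^{n} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$

The log likelihood function for binary classification is

$$\ell(\theta) = \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

The score function of a binary classification model is the negative of $\ell(\theta)$, popularly known as the cross-entropy loss function [38]:

$$J(\theta) = -\sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{9}$$

A machine learning model is trained by minimizing the loss function to yield the parameter estimate $\hat{\theta}$.

(Vol. 13, No. 3, 2022, www.ijacsa.thesai.org)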
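To make the cross-entropy loss concrete, the following Python sketch evaluates the sigmoid and the loss of equation (9) for a given parameter vector. It is only an illustration under stated assumptions: the function names, the clipping constant `eps` (added to avoid taking the logarithm of 0), and the toy data are hypothetical and do not come from the paper. The first entry of each instance is assumed to be a constant 1 so that the first parameter acts as the intercept.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """h_theta(x): modelled probability that y = 1 for one instance x."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def cross_entropy(theta, X, y, eps=1e-12):
    """Cross-entropy loss J(theta) summed over all instances, as in eq. (9)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Clip h away from 0 and 1 so that log() stays finite (an assumption).
        h = min(max(predict_proba(theta, x_i), eps), 1.0 - eps)
        total -= y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return total


# Hypothetical toy data: [intercept term, one binary predictor].
X = [[1, 0], [1, 0], [1, 1], [1, 1]]
y = [0, 0, 1, 1]
print(cross_entropy([0.0, 0.0], X, y))   # every h equals 0.5
print(cross_entropy([-2.0, 4.0], X, y))  # parameters that fit better
```

With all parameters equal to 0, every predicted probability is 0.5, so the loss equals $n \log 2$; parameters that push the probabilities toward the true labels give a smaller loss.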

C. Newton Raphson and Gradient Descent Algorithm
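One way to obtain the parameter estimate $\hat{\theta}$ is to maximize the log likelihood function $\ell(\theta)$ by taking the first derivative with respect to $\theta$ and setting it to 0. Because $\ell'(\theta)$ is nonlinear in $\theta$, an analytic (closed-form) solution cannot be obtained, and a numerical iterative method is used to handle the problem. Newton's method was originally intended to find the root of an equation, i.e. a value of $\theta$ for which $f(\theta) = 0$ [39]. Consider that the gradient (slope) of the tangent line at the current iterate $\theta^{(t)}$ is defined as the following:

$$f'(\theta^{(t)}) = \frac{f(\theta^{(t)})}{\theta^{(t)} - \theta^{(t+1)}}, \qquad \text{so that} \qquad \theta^{(t+1)} = \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}$$

Taking $f(\theta) = \ell'(\theta)$ and generalizing to a parameter vector gives

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta \ell(\theta^{(t)}) \tag{10}$$

where the Hessian $H$ is defined as $H = \nabla_\theta^2 \ell(\theta)$ and $\nabla_\theta \ell(\theta) = \ell'(\theta)$. Equation (10) is the iterative formula of the Newton Raphson algorithm [40]. The stopping criterion can be either a number of iterations or a threshold value chosen by the user. The solution of the Newton Raphson algorithm is thus a value of $\theta$ that maximizes the log likelihood function $\ell(\theta)$.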
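The Newton Raphson iteration for logistic regression can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in the study. It assumes the standard logistic regression gradient $\sum_i (y^{(i)} - h^{(i)}) x^{(i)}$ and Hessian $-\sum_i h^{(i)}(1 - h^{(i)}) x^{(i)} x^{(i)T}$, stops when the largest parameter change falls below `tol` (one possible reading of the threshold criterion), and uses hypothetical toy data. The helper `solve` is a small Gaussian elimination routine written for this example.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def newton_raphson(X, y, max_iter=1000, tol=1e-4):
    """Maximize the logistic log likelihood with the update of equation (10)."""
    p = len(X[0])
    theta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x_i, y_i in zip(X, y):
            h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
            for j in range(p):
                grad[j] += (y_i - h) * x_i[j]
                for k in range(p):
                    hess[j][k] -= h * (1.0 - h) * x_i[j] * x_i[k]
        step = solve(hess, grad)                      # H^{-1} * gradient
        theta = [t - s for t, s in zip(theta, step)]  # equation (10)
        if max(abs(s) for s in step) < tol:
            break
    return theta


# Hypothetical toy data: [intercept term, one binary predictor].
X = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [0, 0, 0, 1, 0, 1, 1, 1]
print(newton_raphson(X, y))
```

For this toy data the maximum likelihood solution can be checked by hand: the intercept should approach $\log(1/3)$ and the slope should approach $2 \log 3$, because the fitted probabilities must match the observed proportions 1/4 and 3/4. If the classes were perfectly separable the Hessian would become nearly singular and this sketch would fail, which is a known limitation of the plain Newton Raphson iteration.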
In the machine learning approach, gradient descent (GD) is an algorithm that minimizes the cost function $J(\theta)$ stated in equation (9). The parameters that minimize $J(\theta)$ are obtained with a search algorithm that starts from an "initial guess" of $\theta$ and repeatedly changes it to make $J(\theta)$ smaller, until it converges to a value. The formula of the GD algorithm, which starts with an initial value and is repeatedly updated for every parameter $\theta_j$ with a learning rate $\alpha$, is the following [41]:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
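The GD algorithm can be implemented once the partial derivative of the cost function $J(\theta)$ in equation (9) with respect to each $\theta_j$ is known. Suppose there is only one instance $(x, y)$, so that the summation in the definition of $J(\theta)$ can be dropped. For the sigmoid $h_\theta(x) = 1 / (1 + e^{-\theta^T x})$, the derivative satisfies $\frac{\partial}{\partial \theta_j} h_\theta(x) = h_\theta(x)\left(1 - h_\theta(x)\right) x_j$, so that

$$\frac{\partial}{\partial \theta_j} J(\theta) = -\left(\frac{y}{h_\theta(x)} - \frac{1 - y}{1 - h_\theta(x)}\right) \frac{\partial}{\partial \theta_j} h_\theta(x) = -\left(y\left(1 - h_\theta(x)\right) - (1 - y)\, h_\theta(x)\right) x_j$$

So, it is found that the first derivative of the classification loss function is

$$\frac{\partial}{\partial \theta_j} J(\theta) = \left(h_\theta(x) - y\right) x_j \tag{12}$$

The gradient descent iterative formula is

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \tag{13}$$

Substituting equation (12) into equation (13) leads to the final parameter updating formula of the GD algorithm:

$$\theta_j := \theta_j - \alpha \left(h_\theta(x) - y\right) x_j \tag{14}$$

where $\alpha$ is a learning rate that is set, together with a stopping criterion such as a threshold or a number of iterations, before model training is started.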
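The single-instance update of equation (14) corresponds to stochastic gradient descent, which can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code. The defaults `alpha=0.015` and `n_epochs=1000` mirror the settings reported later in the paper, but interpreting one "iteration" as one full pass over shuffled data is an assumption, as are the function names, the fixed random seed, and the toy data.

```python
import math
import random


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(X, y, alpha=0.015, n_epochs=1000, seed=0):
    """Fit logistic regression with the per-instance update of equation (14)."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the instances in a random order each epoch
        for i in order:
            h = sigmoid(sum(t * v for t, v in zip(theta, X[i])))
            error = h - y[i]
            # theta_j := theta_j - alpha * (h - y) * x_j   (equation 14)
            theta = [t - alpha * error * x_j for t, x_j in zip(theta, X[i])]
    return theta


# Hypothetical toy data: [intercept term, one binary predictor].
X = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [0, 0, 0, 1, 0, 1, 1, 1]
theta = sgd_logistic(X, y)
print(theta)
print("P(y=1 | x=0) =", sigmoid(theta[0]))
print("P(y=1 | x=1) =", sigmoid(theta[0] + theta[1]))
```

Unlike the Newton Raphson iteration, this update needs no Hessian and no matrix inversion, at the cost of a learning rate that has to be tuned and many more, cheaper, iterations.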

III. DESCRIBING DATA
The data used in this study are secondary data comprising 205 instances obtained from the Center of Child Development Studies at the Wira Husada Nusantara Midwifery Academy Malang in 2021. The data set consists of one response feature, namely preeclampsia status, and 7 predictor features, namely factors affecting preeclampsia: age, parity, history of hypertension, pregnancy interval, household harmony, consumption of salty foods, and consumption of fruits and vegetables. The description of the features in the data set is stated in Table I, which lists the class labels of each feature as an ordered pair such as [Yes, No]. The class label in the first position is coded 0, while the class label in the second position is coded 1. In the target feature y, the proportion of class 0 is 68% and the proportion of class 1 is 32%. The distribution of class labels on the predictor features is very similar to the distribution of class labels on the target feature, except that the X6 feature has a distribution of 58% and 42% for class 0 and class 1. Class imbalance on the target feature should receive serious attention in building a classification model. Fortunately, in this data set, both the target and the predictor features have class label distributions that are classified as balanced.

IV. RESULT AND DISCUSSION
This section first discusses feature selection by evaluating the dependencies between the target and predictor features. The predictor features that have significant dependencies are retained as final candidate features, which are then evaluated for their independence from one another. The final predictor features are the candidate features that are independent of each other. The classification model parameters associated with the final predictor features are estimated using the Newton Raphson and gradient descent algorithms. The performance of the two models is evaluated using several measures that are popularly used in classification.

A. Heuristic Feature Selection
Dependencies between two categorical features can be evaluated using Chi-square statistic which is calculated based on the contingency table formed from these two features. The contingency table between the target feature (Preeclampsia) and the Parity feature is presented in Table II.
The values in the cells of the contingency table are the observed counts for each combination of two labels derived from the two features. The observed values are compared with the expected values calculated using formula (2), and the Chi-square statistic is then calculated using formula (1). Table III presents the Chi-square statistic and the associated p-value of the dependency measure between the target and each predictor feature. All p-values in Table III are less than 0.05 (the level of significance), which means that all predictor features have a significant dependence on the target feature. The evaluation between predictor features was based on the Chi-square statistics and the corresponding p-values, which are presented in Table IV and Table V, respectively. The independent features are obtained by using a grid search over Table V. First, feature X1 is used as the search base: the X1 row is scanned for p-values greater than 0.05 (the significance level), and the results show that the p-values of the X2, X3, and X6 features are greater than 0.05, which means that feature X1 is independent of features X2, X3, and X6. Next, feature X2 is used as the search base to check whether the p-values of X3 and X6 in row X2 are greater than 0.05. Lastly, feature X3 is used as the search base to check whether the p-value of X6 in row X3 is greater than 0.05. The p-values in Table V that are greater than the significance level are marked with different colours. Thus the predictor features that have a significant dependence on the target feature and are not significantly dependent on each other are features X1, X2, X3, and X6. These four features are finally used as the predictor features of the logistic regression model to be built.
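The row-by-row search for mutually independent predictors can be approximated by a simple greedy procedure, sketched below in Python. This is one possible reading of the search, not the authors' exact procedure: a feature is kept only if its pairwise p-value with every feature already kept exceeds the significance level. The function name and, importantly, all p-values in the example are hypothetical placeholders chosen for illustration; they are not the values of Table V.

```python
from itertools import combinations


def select_independent(features, p_values, alpha=0.05):
    """Greedily keep features that are pairwise independent of those kept so far.

    `p_values` maps frozenset({a, b}) to the chi-square p-value of the pair.
    A p-value above `alpha` means independence cannot be rejected.
    """
    selected = []
    for candidate in features:
        if all(p_values[frozenset((candidate, kept))] > alpha for kept in selected):
            selected.append(candidate)
    return selected


features = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]

# HYPOTHETICAL p-values: start with every pair marked as dependent ...
p_values = {frozenset(pair): 0.01 for pair in combinations(features, 2)}
# ... then mark the pairs among X1, X2, X3, X6 as independent.
for pair in combinations(["X1", "X2", "X3", "X6"], 2):
    p_values[frozenset(pair)] = 0.30

print(select_independent(features, p_values))
```

The result of a greedy search depends on the order in which the features are visited, so a different starting feature could yield a different independent subset; the sketch visits them in the order X1 to X7, matching the search base order described in the text.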

B. Model with the Newton Raphson Algorithm
The Newton Raphson algorithm is widely implemented in statistical data analysis software, including R and SAS, which are popular among statisticians. With the number of iterations set to 1000 and the threshold value set to 0.0001, the parameter estimates of the logistic regression model are presented in Table VI.
The logistic regression model in equation (15) is formed from the parameter estimates in the second column of Table VI. If a coefficient is positive, it contributes support for class 0; on the other hand, a negative coefficient contributes support for class 1. All of the coefficients except the intercept support class 0, and feature X3 makes the highest contribution to the support for class 0.
The ability of the model to predict the instances used to build the logistic regression model is assessed with the confusion matrix, whose elements state the number of instances that were predicted correctly and the number of instances that were predicted incorrectly by the logistic regression model in equation (15). Table VII presents the confusion matrix of the model in equation (15). Based on Table VII, no instance of class 0 is predicted wrongly. However, 28 of the 65 instances of class 1 are predicted wrongly. The logistic regression model fitted with the Newton Raphson algorithm therefore has low sensitivity: the risk of misclassifying people with preeclampsia is very high, above 40% (28 of 65 instances). The model performance is presented in Table VIII. The accuracy of the model is 86.34%, meaning that the model predicts 86.34% of the instances according to their actual class. The F1 score, which is the harmonic mean of precision and recall for the preeclampsia class, is 72.55%.

C. Model with Gradient Descent Algorithm
As described in Section 2, the gradient descent algorithm works by minimizing the cost function. In this research, the stochastic gradient descent method is applied with the learning rate hyper-parameter set to 0.015 and the number of iterations set to 1000. After the training process is complete, the cost function graph is shown in Fig. 1 and the parameter estimates are given in the last column of Table VI. Fig. 1, the learning curve of the logistic regression model, shows the cross-entropy loss: from the 200th iteration onward there is only a fairly small change, and the curve tends to flatten after the 800th iteration. This curve also illustrates that a learning rate of 0.015 is an appropriate value: in the initial iterations, the curve neither drops very sharply (which occurs when the learning rate is too large) nor decreases very slowly (which occurs when the learning rate is too small).
The logistic regression model obtained with the GD algorithm uses the estimated parameter values in the last column of Table VI. In this logistic regression model, all of the coefficients except the intercept support class 0, and the X3 feature makes the highest contribution to the support for class 0. Although the coefficients generated by the GD algorithm have a similar pattern to the coefficients generated by the Newton Raphson algorithm, the two models perform differently. The confusion matrix and the performance measures of the logistic regression model with the GD algorithm are presented in Table IX and Table VIII, respectively. Table IX shows that only 3 of the 65 instances from class 1 are predicted wrongly and that all instances from class 0 are predicted correctly. The gradient descent method thus produces a logistic regression classification model with high sensitivity: the risk of misclassifying patients with preeclampsia is very low, less than 5% (3 of 65 instances).
The last column of Table VIII shows very clearly that the logistic regression classification model with the gradient descent algorithm performs better than the one with the Newton Raphson algorithm. It has an accuracy of 98.54% and an F1 score of 97.64%.
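The reported accuracy and F1 values can be reproduced from the misclassification counts stated in the text, as the following Python sketch shows. The function name is hypothetical, class 1 (preeclampsia) is treated as the positive class, and the count of 140 class 0 instances is an assumption derived as 205 total instances minus the 65 class 1 instances.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts derived from the text: 65 class 1 instances, 140 class 0 instances
# (205 - 65, an assumption), and no class 0 instance misclassified.
newton = classification_metrics(tp=65 - 28, fn=28, fp=0, tn=140)
gradient = classification_metrics(tp=65 - 3, fn=3, fp=0, tn=140)

for name, m in (("Newton Raphson", newton), ("Gradient descent", gradient)):
    print(f"{name}: accuracy = {m['accuracy']:.2%}, F1 = {m['f1']:.2%}")
# Newton Raphson: accuracy = 86.34%, F1 = 72.55%
# Gradient descent: accuracy = 98.54%, F1 = 97.64%
```

Under these counts both models have a precision of 100%, since no class 0 instance is misclassified, so the gap between the two F1 scores comes entirely from recall: 37/65 for the Newton Raphson model against 62/65 for the gradient descent model.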

V. CONCLUSION
Feature selection using the Chi-square test on factors that influence the incidence of preeclampsia among pregnant women in Malang, Indonesia, yielded 4 significant features, namely age (X1), parity (X2), history of hypertension (X3), and consumption of salty foods (X6). The logistic regression model with the gradient descent algorithm has a lower risk of error in predicting cases of preeclampsia than the logistic regression model generated with the Newton Raphson algorithm. The model with the gradient descent algorithm has an accuracy of 98.54% and an F1 score of 97.64%, while the model with the Newton Raphson algorithm has an accuracy of 86.34% and an F1 score of 72.55%.
The dataset used in this study is relatively simple: it consists of only 7 predictor features, all of which are of binary categorical type. The comparison of the two algorithms would be more interesting on a dataset with a larger number of predictor features that includes both categorical and numeric features. Furthermore, the feature selection method could involve not only the Chi-square test but also the analysis of variance (F test) and the Spearman correlation test.