Adaptive Rectified Linear Unit (AReLU) for Classification Problems to Solve Dying Problem in Deep Learning

Abstract: A convolutional neural network (CNN) is a type of artificial neural network, within machine learning, that is used for a range of applications and data types. Activation functions (AFs) determine whether or not the neurons of such a network are activated. The Rectified Linear Unit (ReLU) is a nonlinear AF that involves only simple mathematical operations and gives good performance. It avoids the vanishing gradient problem inherent in older AFs such as tanh and sigmoid, and it has a lower computational cost. Despite these advantages, it suffers from the so-called dying ReLU problem. Several modifications have been proposed to address this problem, for example Leaky ReLU (LReLU). The main idea of our algorithm is to improve on current LReLU activation functions in mitigating the dying problem in deep learning by using the value of the loss (cost) function, which changes and decreases as the number of epochs increases, as the adjustment. The model was trained on the MNIST dataset for 20 epochs and achieved a misclassification rate of 1.2%. The proposed method compares favorably in terms of simplicity, low computational cost, and the absence of hyperparameters.


I. INTRODUCTION
The concept of Artificial Intelligence (AI) revolves around creating intelligent machines that are able to simulate human thinking, while Machine Learning (ML) is a branch of AI that allows these machines to learn hidden patterns from input data [1]. Neural networks (NNs), a subset of ML, simulate the human brain using a set of algorithms. These networks consist of input, hidden, and output layers. Each layer consists of neurons that mimic the structure of a biological neuron: each neuron processes its inputs to produce outputs, which in turn become inputs to other neurons. When a neural network consists of more than three layers, it can be called a Deep Learning Network (DLN) [2].
AFs play a critical role in DLNs: they extract results from the input values and thus determine whether the underlying neuron is activated or not [3]. Without AFs, a DLN reduces to a linear regression model, so appropriate AFs must be used to build a nonlinear DLN. AFs are classified as binary step, linear, and nonlinear activation functions.
The binary step function is a basic threshold classifier in which a threshold value decides which output neurons are activated or deactivated. The linear activation function is a simple straight-line function whose output is proportional to its input, so it cannot by itself introduce nonlinearity. Nonlinear AFs are what make it easier for a DLN model to adapt to a variety of data and to distinguish between outcomes; examples are ReLU, Leaky ReLU, Sigmoid, Tanh and Softmax [4]. Some of these are suited to hidden layers and others to output layers.
Two terms are used in training the model. The first, feedforward, refers to the transition through the NN from input to output with specified weights. The second, backpropagation, as the name suggests, moves from output to input and readjusts the weights depending on the loss values, after which the forward pass is repeated. This approach allows gradient methods, such as gradient descent or stochastic gradient descent, to train multilayer networks and update the weights to reduce the loss [5] [6]. ReLU, a nonlinear AF, has attracted a lot of research interest due to its simplicity, its low computational cost, and the fact that it avoids the vanishing gradient problem inherent in earlier AFs such as tanh and sigmoid [7]. Despite these advantages, it has a problem called the dying ReLU problem, in which a neuron becomes inactive and outputs only zero for any input. This problem has been attributed to a high learning rate and a large negative bias [8].
ReLU was initially introduced by [9]. The researchers designed an electronic circuit to simulate a hybrid system in which the cortex combines the digital selection of an active cluster of neurons with an analog response. This behavior is achieved by dynamically changing the positive feedback inherent in recurrent cortical connections. According to the researchers, it provides computational capabilities such as stimulus selection and confers the ability to modify and generate spatio-temporal patterns in this cortex.
ReLU was later used in object recognition by [10]. The researchers summarized the three stages used to extract object features, namely a filter bank, a nonlinear transformation and a feature pooling layer, noting that most systems use one or two such stages and hypothesizing that two stages give more accurate results. The study supported this hypothesis by using nonlinear layers and pooling layers on different object data sets, through either supervised optimization or unsupervised pre-training.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 2, 2023, www.ijacsa.thesai.org
ReLU was also popularized by [11] in the context of Restricted Boltzmann Machines (RBMs). The study demonstrated how to create a more powerful type of hidden unit for RBMs in object recognition and face comparison by sharing weights and biases across an infinite set of binary units and then approximating these stepped sigmoid units with noisy rectified linear units.
Leaky ReLU [12] adds a slight slope in the negative range; this modification of ReLU removes dead neurons in the negative region by using a hyperparameter. Many Leaky ReLU variants have since appeared, such as the Parametric Rectified Linear Unit (PReLU) [13], which introduces a learnable parameter as the slope of the negative part, and the Exponential Linear Unit (ELU), which uses an exponential function to transition from positive to small negative values [14].
The value of the loss function is related to the results of the model: a low loss value means that the model gives good results [16]. Loss functions are divided into two types, classification and regression. Classification loss functions are further divided into cross-entropy loss (log loss) and hinge loss; AReLU uses the former [15]. AReLU is applied to the MNIST dataset, which contains 70,000 black-and-white images of handwritten digits, divided into 60,000 images for training and 10,000 images for testing [17].
In this study, the decreasing value of the loss function is exploited as an adaptive parameter to keep the network active. The rest of the paper is organized as follows: Section II introduces ReLU, Section III identifies the dying ReLU problem, Section IV introduces AReLU, Section V presents the results, and Section VI concludes.

II. RECTIFIED LINEAR UNIT (RELU) ACTIVATION FUNCTION
Artificial neurons are mathematical models that mimic biological neurons, and they are the basic building blocks of neural networks, as shown in Fig. 1. Here $x_i$ represents the inputs and $w_i$ the weights that connect the inputs to the perceptron and measure the significance of each input. The bias value ($b$) is added to the weighted sum of the inputs to prevent the activation function from receiving a zero value. The weighted sum is a linear mapping from input to output, so hidden layers built from it alone can only model linear relationships. The role of the activation function is to convert these linear outputs into nonlinear outputs for further computation, as in $y = \alpha\left(\sum_i w_i x_i + b\right)$, where $\alpha$ is the activation function. The literature has introduced many activation functions, such as Sigmoid, binary step, Tanh, ReLU, Leaky ReLU, identity and Softmax.
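The neuron computation described above can be sketched in a few lines of Python. This is only an illustration: the function names `neuron_output` and `relu` and the numeric values are assumptions chosen for the example, not taken from the paper.

```python
def relu(z):
    """Rectified linear unit: passes positive values, clips the rest to zero."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only (hypothetical, not from the paper).
y = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(y)
```

Swapping the `activation` argument for another function is enough to try a different AF on the same weighted sum.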
The ReLU activation function can be described mathematically as in Eq. (1) and graphically in the corresponding figure.
$$\mathrm{ReLU}(x) = \max(0, x) \tag{1}$$
ReLU avoids the vanishing gradient problem that occurs with other activation functions by preserving the gradient [18]. This problem arises when the gradients of deep neurons vanish or become zero, which means that the deep layers of the network may not learn or may learn very slowly [19]. The derivative of the activation function is fundamental to optimizing a neural network. For ReLU it can be expressed as:
$$\mathrm{ReLU}'(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$

III. DYING RELU PROBLEM
The normal situation for ReLU neurons is to stay active, update their weights, and keep learning. Although zeroing negative inputs gives ReLU its strength through network sparsity, it poses a problem when most of the inputs to these ReLU neurons fall in the negative range. The issue becomes worse when the output of most neurons is zero: such neurons become inactive and stop learning, which inevitably prevents gradients from flowing during backpropagation.
This problem has two main causes: a high learning rate and a large negative bias. A high learning rate allows faster learning but risks numerical overflow, whereas a very small one may never converge or may get stuck in a suboptimal solution. Choosing a moderate rate that is neither too large nor too small therefore gives the best approximation of the mapping represented by the training data set. For a particular model on a particular data set, the learning rate is best found by trial and error rather than analytically. The effect of the learning rate can be illustrated by the update step in backpropagation, shown in Eq. (2):
$$w_{new} = w_{old} - LR \cdot \frac{\partial E}{\partial w} \tag{2}$$
where $\frac{\partial E}{\partial w}$ is the derivative of the error with respect to the weight. Eq. (2) shows that a high learning rate (LR) produces a large value for the last term, $LR \cdot \frac{\partial E}{\partial w}$, so subtracting this large number from $w_{old}$ yields a highly negative $w_{new}$. Such negative weights produce negative inputs to ReLU and therefore cause the dying ReLU problem.
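A small numeric sketch may make the effect of the learning rate in Eq. (2) concrete. The weight, gradient and input values below are hypothetical and chosen only so that the large step overshoots into the negative range.

```python
def sgd_update(weight, gradient, learning_rate):
    """One gradient-descent step: w_new = w_old - LR * dE/dw."""
    return weight - learning_rate * gradient


def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


# Hypothetical values for illustration only.
w_old, grad, x = 0.5, 2.0, 1.0

w_small = sgd_update(w_old, grad, learning_rate=0.01)  # stays positive, about 0.48
w_large = sgd_update(w_old, grad, learning_rate=1.0)   # overshoots to -1.5

print(relu(w_small * x))  # positive pre-activation: the neuron stays active
print(relu(w_large * x))  # negative pre-activation: ReLU outputs zero
```

With the larger rate the pre-activation is negative, the ReLU gradient is zero, and later updates can no longer move this weight through that neuron.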
Biases are extra inputs that ensure neurons are activated regardless of the input. Changing the weights of a neuron changes the steepness of the curve but cannot shift it to the right or left; to shift the curve, the bias value is changed. A large negative bias makes the input to the ReLU activation negative. Several techniques have emerged to mitigate the dying ReLU problem, all of which try to keep the network active when the input is negative or zero.
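Leaky ReLU [12] eliminates dead neurons in the negative part by adding a slight slope in the negative range using a hyperparameter ($\alpha = 0.1$ or more), as shown in Eq. (3) and illustrated in Fig. 4.
$$\mathrm{LReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{3}$$
The Exponential Linear Unit (ELU) uses an exponential function to transition from positive to small negative values [14], as shown in Eq. (5).
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \tag{5}$$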
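For comparison, the three activation functions discussed so far can be written as plain Python functions. This is a minimal sketch using the standard definitions; the default `alpha=0.1` for Leaky ReLU follows the value quoted above, while `alpha=1.0` for ELU is an assumption made for the example.

```python
import math


def relu(x):
    """ReLU: zero for non-positive inputs, identity otherwise."""
    return x if x > 0 else 0.0


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: small fixed slope alpha for non-positive inputs."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """ELU: smooth exponential curve for non-positive inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), elu(x))
```

For a negative input, ReLU returns exactly zero while the other two return a small negative value, which is what keeps a gradient flowing.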

IV. ADAPTIVE RECTIFIED LINEAR UNIT (ARELU) ON MNIST DATASET
The study used the MNIST dataset of handwritten greyscale digit images, which are size-normalized and centered in a fixed-size image and are available from NIST [20]; Fig. 5 shows the first 25 images of MNIST. The dataset is composed of approximately 70,000 handwritten monochrome images of the digits 0 to 9, each of which is 784 pixels in size, so the input data has shape (70,000, 784) and the output has shape (70,000, 10), as shown in Fig. 6.
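The input and output shapes mentioned above come from flattening each 28 x 28 image into 784 values and encoding each digit label as a vector of length 10. A minimal sketch follows, using two synthetic blank images in place of real MNIST samples; the helper names `flatten` and `one_hot` are illustrative assumptions.

```python
def flatten(image):
    """Turn a 28x28 nested list into a flat list of 784 pixel values."""
    return [pixel for row in image for pixel in row]


def one_hot(label, num_classes=10):
    """Encode a digit label 0..9 as a vector containing a single 1."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


# Two synthetic blank "images" stand in for real MNIST samples.
images = [[[0.0] * 28 for _ in range(28)] for _ in range(2)]
labels = [3, 7]

X = [flatten(img) for img in images]
Y = [one_hot(lbl) for lbl in labels]

print(len(X), len(X[0]))  # number of samples, pixels per sample
print(len(Y), len(Y[0]))  # number of samples, classes per sample
```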
To form the network, the AReLU activation function was used in the hidden layers and softmax in the output layer. The loss function is categorical_crossentropy and the optimizer is Adamax. The batch size is set to 128 and the number of epochs to 20.
Once the output is generated by the final layer of the network, the loss function (predicted versus expected output) is calculated and backpropagation is performed, adjusting the weights to minimize the loss. Neural networks are trained using gradient descent. This process includes the backward propagation step, which is essentially the chain rule applied to obtain the change in weights that reduces the loss after every epoch. The primary goal of AReLU is to mitigate the dying ReLU problem by improving on the previous methods: it uses the adaptive loss function parameter Ɩ (written $\ell$ in the equations) instead of a hyperparameter. For negative inputs, $\ell$ is multiplied by the input value, as shown in Eq. (6).
$$\mathrm{AReLU}(x) = \begin{cases} x, & x > 0 \\ \ell\,x, & x \le 0 \end{cases} \tag{6}$$
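A minimal Python sketch of the rule described above, in which negative inputs are multiplied by the current loss value, might look as follows. The function name `arelu` and the sample loss value are assumptions for illustration; this is not the authors' implementation.

```python
def arelu(x, loss_value):
    """AReLU sketch: identity for positive x, loss-scaled otherwise.

    loss_value plays the role of the adaptive parameter; how it is
    obtained during training is outside this sketch.
    """
    return x if x > 0 else loss_value * x


print(arelu(2.0, loss_value=0.1))   # positive input passes through unchanged
print(arelu(-2.0, loss_value=0.1))  # negative input is scaled by the loss value
```

As the loss value shrinks during training, the negative branch flattens toward ReLU while never becoming exactly zero for a nonzero loss.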
AReLU was implemented in the Python programming language according to the algorithm shown in Fig. 7 and further illustrated in Fig. 8. The equation shows that positive values are left unchanged and only negative inputs are modified, as can be seen in Fig. 9(a). The structure of the deep neural network, shown in Fig. 9, is composed of four layers: one input layer with an input shape of 784 image pixels, two hidden layers of 512 neurons each, and a final layer of 10 neurons that forms the output layer.
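To give a sense of the size of this 784-512-512-10 network, the number of trainable parameters can be counted directly. The sketch below assumes fully connected (dense) layers with one bias per neuron, which the text does not state explicitly.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer (assumed dense)."""
    return n_in * n_out + n_out


layer_sizes = [784, 512, 512, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # parameters in each of the three weight layers
print(total)      # total trainable parameters under the dense-layer assumption
```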
Cross-entropy loss (log loss) was used as the loss function in the model compilation process, the phase in which the loss function, the optimizer and the metrics are defined. This function is defined in Eq. (7), where N is the number of rows and M the number of classes:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij}) \tag{7}$$
Here $y_{ij}$ is the actual label and $p_{ij}$ are the corrected probabilities. A negative average is used to compensate for the negative values that result from taking the logarithm of the corrected probabilities, because their values range between 0 and 1. It is one of the most common loss functions used in multiclass classification problems. The value of this function decreases as the predicted probability converges to the actual label.
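The behavior of the loss in Eq. (7) can be checked with a short Python sketch. The function name, the `eps` guard against log(0), and the three-class toy data are assumptions added for the example.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative average over N rows of sum_j y_ij * log(p_ij)."""
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            total += y * math.log(max(p, eps))  # eps avoids log(0)
    return -total / n


# Toy one-hot labels and two sets of predicted probabilities.
y_true = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

print(categorical_cross_entropy(y_true, confident))
print(categorical_cross_entropy(y_true, uncertain))
```

The confident predictions give the smaller loss, matching the statement that the loss decreases as the predicted probability approaches the actual label.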
Backpropagation measures how the error changes with respect to the weights; this is a derivative, since it expresses the change in one value with respect to another. The first derivative of AReLU is:
$$\mathrm{AReLU}'(x) = \begin{cases} 1, & x > 0 \\ \ell, & x \le 0 \end{cases}$$
where $\ell$ denotes the adaptive parameter Ɩ. The parameter starts from an arbitrary initial value, say 0.1, and is then adapted automatically according to the value of the loss function. Fig. 9(b) shows the graphical representation of the AReLU derivative. For negative inputs, the values of the derivative are close to zero but never exactly zero.
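The derivative and the adaptation of the negative slope can be sketched as below. The update rule used here, replacing the slope with the most recent loss value after each epoch, is an assumption about how the adaptation works, and the loss values are hypothetical.

```python
def arelu_derivative(x, loss_value):
    """Derivative sketch: 1 for positive x, the loss value otherwise."""
    return 1.0 if x > 0 else loss_value


slope = 0.1  # arbitrary initial value, as suggested in the text
hypothetical_losses = [0.25, 0.10, 0.05]  # made-up per-epoch loss values

for epoch, loss in enumerate(hypothetical_losses, start=1):
    slope = loss  # assumed rule: the slope follows the latest loss value
    print(epoch, arelu_derivative(-1.0, slope), arelu_derivative(1.0, slope))
```

Under this assumption the gradient for negative inputs shrinks as training progresses but stays nonzero as long as the loss is nonzero.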
The most effective activation function for the output layer in multiclass classification problems is Softmax, which converts the raw outputs of a neural network into a vector of probability scores between 0 and 1. It is defined as:
$$\mathrm{softmax}(o_i) = \frac{e^{o_i}}{\sum_{j=1}^{N} e^{o_j}}$$
where $o$ is the input vector, $e^{o_i}$ is the standard exponential function applied to the input element $o_i$, N is the number of classes in the multiclass classifier, $e^{o_j}$ is the standard exponential function applied to each element of the output vector, and $e$ is the exponential constant, approximately 2.718.

V. RESULTS
Fig. 10 illustrates the relationship between training and validation accuracy over 20 epochs. The accuracy rises sharply in the first three epochs, indicating that the network learns fast; thereafter the curve flattens, indicating that more epochs are not needed to train the model further. The model accuracy was 98.8%, meaning that 9,880 of the 10,000 images were predicted correctly and 120 images (1.2%) were wrongly tagged. AReLU gives better results based on the Misclassification Rate (MR), which measures the percentage of observations that were incorrectly predicted by a classification model [21] and is calculated as:
$$MR = \frac{\text{number of incorrect predictions}}{\text{total number of predictions}}$$
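The softmax output and the misclassification rate can both be illustrated with a short Python sketch. The raw scores are hypothetical; the counts 120 and 10,000 are the ones reported in the text. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # stability shift; leaves the result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def misclassification_rate(num_incorrect, num_total):
    """Fraction of predictions that were wrong."""
    return num_incorrect / num_total


probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw network outputs
print(probs, sum(probs))

mr = misclassification_rate(120, 10000)
print(mr)  # 0.012, i.e. 1.2%
```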
The MR for our model was 1.2%, whereas for PReLU it is 1.62% according to [22], as shown in Table I. That study measured the MR for different activation functions, including Sigmoid, tanh, MSAF, MSAF_Symmetrical, ReLU, LReLU, ELU and adaptive tanh. Fig. 11 illustrates the relationship between training and validation loss: the training loss drops rapidly during the first two epochs, while the validation loss remains almost constant for several epochs, which suggests that the model generalizes to unseen data.

VI. CONCLUSION
This article presented an automatic, adaptive activation function that retains the inherent characteristics of ReLU: simplicity, high accuracy, speed and low loss. The expected decrease of the loss function value was exploited in implementing AReLU; this function measures the difference between the current output and the expected output. Cross-entropy was used in developing AReLU because it is one of the most widely used loss functions in machine learning, owing to its role in better generalization and faster model training, and it applies to both binary and multi-class classification. AReLU was implemented in Python on the MNIST dataset of handwritten digits and achieved a 1.2% misclassification rate. The model maintained the gains reported for previous methods, such as simplicity, low computational cost, no fixed coefficients, and an adaptive nature. In future work, AReLU will be applied to different data sets with the aim of reducing the misclassification rate while keeping the characteristics of simplicity, low computational cost, and no hyperparameters.