Face Age Estimation using Shortcut Identity Connection of Convolutional Neural Network

The loss of skin and muscle tone has a considerable impact on the appearance of the face, which changes continually with age. Age estimation algorithms typically require a large number of aging face images. Convolutional neural networks (CNNs) are a popular deep learning technique that recent studies have applied successfully to many computer vision and pattern recognition problems. However, these methods have architectural issues (e.g., in the training process) that degrade their age estimation performance. This research therefore proposes a new approach to address the issue. A convolutional neural network framework with the ResNet50 architecture is used to estimate age from a human face image. The proposed shortcut identity connection strategy improves the performance of the ResNet50 architecture for age estimation from face images. Estimating a person's age from a face image requires knowledge of the characteristics of aging. A classification method based on the ResNet50 structure is therefore used to map face aging levels to probabilities. The 50 layers of the proposed residual network are organized into residual blocks joined by shortcut connections. The ImageNet and FG-NET databases are used for the proposed age estimation process. In the training session, the mean absolute error (MAE) is 2.27 on the ImageNet dataset and 2.38 on the FG-NET dataset. The test accuracy is 81.75% with the ImageNet dataset and 57% with the FG-NET dataset.


I. INTRODUCTION
There are numerous real-world applications for age estimation from facial photographs, including security monitoring, missing-person inquiries, entertainment, crime investigation, forensic art, cosmetology, biometrics, and facial recognition. Because of these applications, automatic age estimation has attracted a large number of researchers. An automatic age estimate can be expressed either as a predefined age range or as an actual age.
Multiple approaches have been investigated for representing age from facial photos. Unfortunately, most of these techniques require a large facial aging dataset and suffer from architectural overfitting. They also take a long time to train and lack robustness [12]. To train deep neural networks, researchers have turned to the ResNet framework [1]. This network can employ shortcut identity connections to avoid overfitting across its numerous layers. When used for image classification, ResNet achieves good prediction performance [1].
The shortcut identity connection of ResNet50, a convolutional neural network, is used to estimate an individual's age, as shown in Fig. 1. With the proposed approach, the age estimation problem can be solved using a small face aging dataset. Architectural overfitting is addressed through shortcut identity connections. In addition, the technique allows the input data size to be matched to the architecture. Images of different sizes from various datasets are used for training and testing to establish the shortcut identity connection in the ResNet50 architecture. As a final step, the algorithm estimates age from the testing dataset and from a live image taken by the webcam. The algorithm is explained in detail in Section III.
When deeper networks start to converge, a degradation problem arises [1]. In addition, the gradients between neurons can vanish [2]. This problem is not caused by overfitting, and adding more layers does not solve it [1]. Activation functions such as the sigmoid and tanh functions [3] contribute to it.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 4, 2022, www.ijacsa.thesai.org
The main contributions of this manuscript are as follows: the proposed shortcut identity connection method demonstrates the high capability of ResNet50 by freezing layers of the structure; the presented method can accurately estimate the ages of global and regional people; and the proposed age estimation process is immediate and uses an image taken directly by the device webcam.

A. Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a feed-forward artificial neural network [34]. Its individual neurons are arranged so that they respond to overlapping regions of the visual field. In 1968, Hubel and Wiesel [4] studied the cat's visual cortex and discovered that it contains a complex arrangement of neurons. To cover as much of the visual field as possible, these neurons are arranged in a grid-like pattern, each responding to a small subregion. These neurons act as filters on the input image, here a natural face photo. Each layer of a CNN uses a differentiable function to transform its input volume into an output volume. The CNN architecture is divided into several distinct layers, which are shown in Fig. 2.
The work of the CNN layers [13] is described below. The convolution layer is the foundation on which a CNN is built and the first step in the learning process. The number of parameters in a convolution layer can be calculated by multiplying each filter's width and height (w and h), the number of filters in the preceding layer (F), and the number of filters in the current layer (n), that is, w × h × F × n, plus one bias term per filter when biases are used.
During the training session, the filters and their parameters are learned. While processing an image for classification, each filter generates an activation map. The pooling layer has no parameters and therefore requires no learning, so its parameter count p is zero. The pooling layer also helps to avoid overfitting during training.
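The parameter counts described above can be illustrated with a short sketch. The function names are chosen here for illustration, and the 3-channel (RGB) input is an assumption; the 7 × 7 kernel with 64 filters matches the first convolution layer of the 50-layer network described in Section I-B.

```python
def conv_params(w, h, prev_filters, n_filters, bias=True):
    """Learnable parameters of a convolution layer.

    w, h         -- filter width and height
    prev_filters -- number of filters (channels) in the preceding layer (F)
    n_filters    -- number of filters in the current layer (n)
    bias         -- whether each filter also has one bias term
    """
    per_filter = w * h * prev_filters + (1 if bias else 0)
    return per_filter * n_filters


def pooling_params():
    """A pooling layer has nothing to learn, so its parameter count is zero."""
    return 0


# Assumed example: 7 x 7 kernels, RGB input (3 channels), 64 filters.
first_conv_with_bias = conv_params(7, 7, 3, 64)
first_conv_no_bias = conv_params(7, 7, 3, 64, bias=False)
```

Without bias terms the count is the plain product w × h × F × n; with biases, one extra parameter is added per filter.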
A fully connected layer is a feed-forward neural network layer. It forms the final part of the network and takes its name from its connectivity. Its input is the flattened output of the final pooling or convolutional layer. A fully connected layer has the most parameters of any layer type, and its parameter count is easy to determine because each neuron is connected to every neuron in the preceding layer.
The generated image shown in the top right corner of Fig. 2 is known as the "feature map" [31]. It is the number of confusing features. As a result, the input image information is reduced. When the number of confusing features is 2, the algorithm reduces the image. If the number of confusing features is 4, the algorithm reduces the image further, which makes it easier to process. The question is whether the feature detector (filter) causes information loss. The higher the values in the feature map, the better the filtering process, which implies that the algorithm is not missing many features. Essentially, researchers create all of the feature maps that are required, as well as all of the filters that are required (e.g., edge detection, blur detection, bump detection).
Using the Softmax function, the method ensures that all of the estimated class probabilities add up to 1. The cross-entropy function and the Softmax function are frequently used together [32]. After applying the Softmax function in the CNN, the next step is to assess the reliability of the model with the cross-entropy function, which is minimized to maximize the performance of the neural network [35]. Using the cross-entropy function has several advantages. One of the most important concerns the case where the output value is much smaller than the actual value at the start of back-propagation: with a squared-error loss, gradient descent would then be very slow. Because cross-entropy uses the logarithm, it helps the network assess even large errors.
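As a minimal sketch of how the Softmax and cross-entropy functions work together, the following example converts raw class scores into probabilities and then scores them against a known class. The score values and the three-class setting are hypothetical and are not taken from the experiments.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[true_index], eps))


# Hypothetical scores for three age classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

loss_if_class_0_is_true = cross_entropy(probs, 0)  # model favours the true class
loss_if_class_2_is_true = cross_entropy(probs, 2)  # model disfavours the true class
```

Because of the logarithm, the loss grows quickly as the probability assigned to the true class approaches zero, so confident wrong predictions are penalized heavily.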

B. Residual Network (ResNet)
The plain baselines of residual networks are mainly inspired by VGG networks [5]. To preserve the time complexity per layer, the number of filters is doubled when the feature map size is halved [1]. Residual connections were introduced in [1], which provided theoretical and practical evidence for the use of signal merging in image recognition. Fig. 3 shows the original residual connection and an improved version that uses 1 × 1 convolutions to decrease the computational cost. The 50-layer residual network architecture [1] is detailed below.
The first convolution layer has a kernel size of 7 × 7, 64 kernels, and a stride of 2, giving one layer. It is followed by a max pooling layer with a stride of 2. The next stage consists of a 1 × 1, 64 kernel layer, a 3 × 3, 64 kernel layer, and a 1 × 1, 256 kernel layer; these three layers are repeated three times, giving nine layers. The following stage consists of 1 × 1, 128; 3 × 3, 128; and 1 × 1, 512 kernel layers repeated four times, giving 12 layers. Next come 1 × 1, 256; 3 × 3, 256; and 1 × 1, 1024 kernel layers repeated six times, giving 18 layers. The final convolutional stage consists of 1 × 1, 512; 3 × 3, 512; and 1 × 1, 2048 kernel layers repeated three times, giving nine layers. Finally, an average pooling layer is followed by a fully connected layer with 1000 nodes and a Softmax function, giving one layer. The activation functions and the max and average pooling layers are not counted. Adding these together gives a convolutional network with 50 layers: 1 + 9 + 12 + 18 + 9 + 1 = 50.
Age estimation can be seen as a non-linear, non-categorical regression problem. Current age estimation frameworks commonly consist of an aging image representation module and an age estimation module. Shape and texture data from facial photos are commonly used in aging feature techniques, which fall under the categories of anthropometric models, AAM, AGES, age distribution, and appearance models. Regression methods or classification into age groups can then be used to estimate an individual's age. In the most recent research, classification and regression methods are combined, resulting in a hybrid structure [29]. A robust multi-sample regression learning algorithm is also used to create age estimates based solely on facial features [30].
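The layer count 1 + 9 + 12 + 18 + 9 + 1 can be checked with a few lines of arithmetic. In this sketch, the stage labels are names chosen for illustration, and the block counts (3, 4, 6, 3, with three convolutions per block) follow the 50-layer configuration of [1]; pooling and activation layers are not counted.

```python
# (stage label, number of repeated blocks, convolution layers per block)
STAGES = [
    ("conv2_x", 3, 3),
    ("conv3_x", 4, 3),
    ("conv4_x", 6, 3),
    ("conv5_x", 3, 3),
]


def count_weighted_layers(stages, stem_layers=1, fc_layers=1):
    """Count layers with weights: first convolution + stage convolutions + final FC."""
    per_stage = {name: blocks * convs for name, blocks, convs in stages}
    total = stem_layers + sum(per_stage.values()) + fc_layers
    return per_stage, total


per_stage, total_layers = count_weighted_layers(STAGES)
```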

Network Architecture                      Publishing Year   References
Multi-purpose CNN                         2017              [14]
Diagnosing DLM                            2017              [15]
Deep cumulatively and comparatively LM    2017              [16]
AAFs of CNN                               2017              [17]
DMTL                                      2017              [18]
VGG-NET-GPR                               2018              [19]
ELM                                       2018              [20]
DAG-CNN                                   2018              [21]
The CNN triplet ranking                   2019              [22]
DeepAge                                   2019              [23]
Multitasks-AlexNet                        2019              [24]
SADAL & VDAL                              2020              [25]
LRN                                       2020              [26]
CR-MT                                     2020              [27]
MA-SFV2                                   2020              [28]

C. Shortcut Identity Connection
Adding more layers to a deep neural network architecture allows it to recognize more levels of features in an image. Because of overfitting, however, additional layers do not always yield more accurate results. The problem can be alleviated by establishing a shortcut identity connection between residual blocks. The layer of the neural network at which the highest accuracy is achieved must be identified [36]. Overfitting reduces accuracy, so no more layers should be added beyond the one with the highest accuracy. The "shortcut identity connection" refers to the identity link between the beginning residual block and the correct stage of a residual block. When the shortcut identity connection is used, all remaining blocks in the architecture are skipped [33]. The residual building block [1] is defined as:

$$b = F(a, \{W_i\}) + a \qquad (1)$$

Here a is the input and b is the output of the layers considered [37]. The function $F(a, \{W_i\})$ represents the residual mapping to be learned. According to Eqn. (1), the dimensions of the input a and of F must be equal. If they do not match, a linear projection $W_s$ can be applied on the shortcut connection:

$$b = F(a, \{W_i\}) + W_s a \qquad (2)$$

The experiments in this paper use images of varying dimensions to estimate age. The shortcut identity connection is effective against the degradation problem and meets the need to match input dimensions. Age estimation from input images of different dimensions can therefore be robust in real-life use.
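The residual building block can be sketched on plain vectors as follows. This is a simplified illustration under stated assumptions, not the ResNet50 implementation: the residual function F is modelled as two small weight matrices with a ReLU between them, the names `w1`, `w2`, and `w_s` are chosen for this sketch, and convolutions, batch normalization, and the activation after the addition are omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def residual_block(a, w1, w2, w_s=None):
    """Return b = F(a) + a, or b = F(a) + w_s a when a projection is supplied."""
    f = matvec(w2, relu(matvec(w1, a)))  # residual function F(a, {W_i})
    shortcut = a if w_s is None else matvec(w_s, a)
    if len(shortcut) != len(f):
        raise ValueError("shortcut and residual dimensions differ; supply w_s")
    return [fi + si for fi, si in zip(f, shortcut)]


a = [1.0, -2.0]
identity = [[1, 0], [0, 1]]
zeros = [[0, 0], [0, 0]]

# With F(a) = 0 the block reduces to the identity mapping: b = a.
b_identity = residual_block(a, identity, zeros)

# F maps 2 -> 3 dimensions, so a 3 x 2 projection is needed on the shortcut.
w2_expand = [[1, 0], [0, 1], [1, 1]]
w_s = [[1, 0], [0, 1], [0, 0]]
b_projected = residual_block(a, identity, w2_expand, w_s)
```

When the residual function contributes nothing, the block simply passes its input through, which is why adding such blocks should not make a deeper network harder to fit than a shallower one.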

III. AGE ESTIMATION METHODOLOGY
The ResNet50 residual network is used in this paper to present a new age estimation approach called "age estimation using the shortcut identity connection of CNN" [33]. The algorithm's age estimation process consists of data preprocessing, model creation or selection of a pre-made model, learning rate estimation, and model training. These parts work together to estimate a person's age.

A. Data Preprocessing
This module takes input data from GitHub (ImageNet) and a data set loaded from the device (the FG-NET dataset by Yanwei Fu). The data are preprocessed as the model expects, using the automatic pixel-value normalization selected for ResNet50. The data preprocessing module involves several steps: acquiring the ImageNet data set, importing all libraries required for the ResNet50 model, identifying missing values, encoding the face aging data, splitting the dataset into training and testing parts, and feature scaling.
Several data acquisition steps are needed before the model can be built and developed. The algorithm needs an accurately constructed data set, and it then uses a shortcut identity connection to match dimensions. The graphs and figures for the experiment were created with the Matplotlib Python library. Data preparation also requires importing the datasets from multiple sources, and the current directory of the data set had to be set before import. For the training and testing phases of the experiment, the data set had to be split in order to calculate the mean absolute error (MAE), validation error, loss, and validation loss, and any row or column with more than 80% of its values missing was deleted. The dataset is divided into training and testing parts in an 80:20 ratio. Finally, the data preparation module applies feature scaling, which brings the different kinds of variables in the dataset into a particular range of values. The number of epochs and the loss, error, and validation functions evaluated at each epoch are very important in any age estimation experiment.
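The preprocessing steps described above (removing columns with more than 80% missing values, feature scaling, and the 80:20 split) can be sketched in plain Python. The function names, the use of min-max scaling as the scaling method, the random shuffle, and the fixed seed are assumptions made for this illustration; the paper does not specify them.

```python
import random


def drop_sparse_columns(rows, threshold=0.8):
    """Remove columns in which more than `threshold` of the values are None."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    keep = [c for c in range(n_cols)
            if sum(r[c] is None for r in rows) / n_rows <= threshold]
    return [[r[c] for c in keep] for r in rows]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1] (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: the second column is missing in 5 of 6 rows.
rows = [[i, None] for i in range(5)] + [[5, 9]]
cleaned = drop_sparse_columns(rows)
scaled = min_max_scale([0, 5, 10])
train, test = train_test_split(range(100))
```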

B. Model Creation
The residual network in this experiment has 50 layers. The residual building blocks are shown in brackets in Table II. The 50-layer residual network requires 3.8 billion FLOPs [1].
The algorithm uses shortcut identity connections for identity mapping so that images of different dimensions can be used, and it freezes selected layers of the model to avoid overfitting. The result is an alternative CNN design based on shortcut identity connections (Fig. 4).

C. Estimate Learning Rate
During these trials, a range of learning rates was explored to estimate the optimal learning rate for the algorithm and data set under consideration. A maximum learning rate of 0.0001 (1e-4) was employed in the one-cycle policy [6]. Training is measured in epochs, which count how many times the algorithm has passed over the training data. A batch size of 512, the ResNet50 architecture, a momentum of 0.9, and a weight decay of 1e-4 were used [6]. Since it has a rate of learning close to 1, it can be put to good use. Fig. 5(a) is a characteristic plot of the training loss, validation accuracy, and validation loss for CIFAR10 and ResNet56. Fig. 5(b) shows the learning rate of the age estimation method on a log scale.
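One common way to explore a range of learning rates is to try exponentially spaced values, record the loss at each, and pick the rate at which the loss falls fastest. The sketch below illustrates that idea only; the selection heuristic, the function names, and the loss values are assumptions, and the paper does not state how its optimal rate was chosen.

```python
def lr_range(min_lr, max_lr, steps):
    """Exponentially spaced learning rates (evenly spaced on a log axis)."""
    ratio = (max_lr / min_lr) ** (1.0 / (steps - 1))
    return [min_lr * ratio ** i for i in range(steps)]


def steepest_descent_lr(lrs, losses):
    """Learning rate at the start of the largest loss drop between neighbours."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    best = max(range(len(drops)), key=drops.__getitem__)
    return lrs[best]


# Hypothetical range test: seven rates from 1e-7 to 1e-1 and made-up losses
# that fall at first and then diverge once the rate becomes too large.
lrs = lr_range(1e-7, 1e-1, 7)
losses = [2.0, 2.0, 1.9, 1.2, 1.0, 1.5, 4.0]
suggested_lr = steepest_descent_lr(lrs, losses)
```

Plotting `losses` against `lrs` on a logarithmic axis gives the kind of curve usually inspected in a learning rate range test.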

D. Training Model
In this experiment, the auto-fit method employed a learning rate schedule with automatic early stopping and reduction of the maximum learning rate on plateau [7]. When supplied with the fit_onecycle parameter, the auto-fit strategy reduces the learning rate over each cycle using cosine annealing. To estimate age, the algorithm uses a Softmax image classifier with the ResNet50 model pre-trained on ImageNet.
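A one-cycle schedule with a cosine decay can be sketched as below. This is an illustrative schedule under stated assumptions, not the implementation used by the training library: the linear warm-up, the `start_div` and `warmup_fraction` parameters, and the function name are hypothetical, while the maximum learning rate of 1e-4 is the value reported in Section III-C.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=1e-4, start_div=25.0, warmup_fraction=0.3):
    """Learning rate at `step`: linear warm-up to max_lr, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    start_lr = max_lr / start_div
    if step < warmup_steps:
        # Warm-up phase: rise linearly from start_lr towards max_lr.
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    # Decay phase: follow half a cosine wave from max_lr down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


schedule = [one_cycle_lr(s, 100) for s in range(101)]
```

The rate peaks at the end of the warm-up phase and then decreases smoothly, which matches the cosine reduction described above.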

IV. EXPERIMENTS AND RESULTS
To obtain these results, the algorithm used both the ImageNet and FG-NET datasets. The experiments use various numbers of epochs to evaluate the outcomes. The number of epochs specifies how many times the algorithm passes over the training data during training. During a single epoch, each sample of the training data set is passed through the neural network exactly once [33]. One epoch is made up of one or more batches. The facial photos of Malaysian and Bangladeshi citizens are used to test the proposed approach. Images from the ImageNet and FG-NET databases are shown in Fig. 6. Another experiment was performed on the FG-NET database using the ResNet50 design. To see how the shortcut identity connection worked, the algorithm used images of different sizes and resolutions.
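The relation between epochs and batches can be made concrete with a small calculation. The sketch uses the 21690 training images and the batch size of 512 reported in this paper; the epoch count of 10 is a hypothetical value chosen only for the example.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to pass every sample through the network once."""
    return math.ceil(num_samples / batch_size)


def total_updates(num_samples, batch_size, epochs):
    """Total number of batch updates over the whole training run."""
    return batches_per_epoch(num_samples, batch_size) * epochs


per_epoch = batches_per_epoch(21690, 512)
last_batch_size = 21690 - (per_epoch - 1) * 512  # the final batch is smaller
updates = total_updates(21690, 512, epochs=10)   # epochs=10 is hypothetical
```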
A total of 21690 photos from the ImageNet collection were utilized to train the algorithm, and a further 2411 images that have been verified were used to conduct the testing. To obtain the most accurate results while using the fewest facial photos possible, we have sought to reduce the number of images used. The ImageNet data set can be used to estimate an age with an average accuracy of 81.75 percent and MAE of 2.27 in the training session (Fig. 7). The evaluation method of Mean Absolute Error (MAE) [8] is shown in Eqn. (3).
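$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \qquad (3)$$

where $\hat{y}_i$ is the recognized (estimated) age of testing sample $i$, $y_i$ is its true age, and $N$ is the total number of testing samples.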
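The mean absolute error is the average absolute difference between the estimated and true ages over the testing samples. A minimal sketch follows; the ages in the example are made up for illustration and are not results from the experiments.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between predicted and true ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical true and estimated ages for four test faces.
true_ages = [25, 40, 63, 18]
predicted_ages = [27, 38, 60, 19]
mae = mean_absolute_error(true_ages, predicted_ages)
```

An MAE of 2.0 in this toy example means that the estimates are off by two years on average.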
Only 352 facial photos from the FG-NET database were used throughout the training period to test how this method worked with a smaller number of facial aging cases. Fig. 8 shows the FG-NET dataset age estimation accuracy [23]. In the training session, MAE for the FG-NET dataset was 2.38. The accuracy rate is 57% across the experiment.
A parameter is added to estimate age automatically from a facial image captured by the camera. At the end of the training and testing session, the system automatically turns on the device's camera to capture a live facial image, and the estimator then estimates the age from the captured photo. The accuracy on camera-captured images is 93.75%. A sample is shown in Table III. So far, only images from the ImageNet database have been used in training to estimate age from live webcam-captured images. Table IV compares the performance of the proposed approach with other methods for estimating age from facial photos and shows that the proposed method outperforms them all. No published ImageNet testing results were found for comparison with the results of the proposed algorithm.

V. CONCLUSION AND FUTURE WORK
The proposed ResNet50-based CNN model for estimating age from a facial image outperforms the existing models it was compared with. The shortcut identity connection method is used to arrive at a new CNN architectural design. The method has been implemented effectively using images of different dimensions and by freezing many layers of the CNN architecture. The image is classified using the Softmax classifier, which normalizes the outputs to positive values. Improved calculation speed and accuracy are two benefits of using shortcut identity connections in CNN systems. The proposed method achieves an MAE of 2.27 on the ImageNet dataset and 2.38 on the FG-NET dataset. Future work will extend the proposed technique to determine gender and ethnicity and improve the accuracy of the age estimation method by adjusting the ResNet architecture. Age estimation thus relies on the capacity to classify face images. The proposed method can be applied to both small and large databases for facial image classification.
ACKNOWLEDGMENT
I express my deepest gratitude to my supervisor, Dr. Hadi Affendy Bin Dahlan. The patience, time, and encouragement he has given me are greatly appreciated. He taught me a great deal about image processing and computer vision, and about conducting and writing up research. I would also like to thank the individuals who provided feedback on my review article, which allowed me to make the necessary corrections.