Improved Deep Learning Performance for Real-Time Traffic Sign Detection and Recognition Applicable to Intelligent Transportation Systems

Abstract: In this paper, we improve Deep Learning (DL) performance by building a robust and efficient Convolutional Neural Network (CNN) model for detecting and recognizing traffic signs in real time. We apply several techniques. The first is pre-processing with Computer Vision (CV) tools: conversion of the RGB color space, followed by histogram equalization and normalization of the image dataset. The second concerns Artificial Intelligence (AI) and requires the right choice of neural layers, such as convolution and dropout layers, together with a powerful optimizer such as Adam and activation functions such as ReLU and SoftMax. We also use dataset augmentation, whose effect depends on the batch size used in each epoch. The results obtained are very satisfactory: the average prediction accuracy is 98%, which encourages integrating this approach into Intelligent Transportation Systems (ITS) in the automotive sector.


I. INTRODUCTION
Traffic sign detection and recognition can be done in different ways, depending on the methodology or strategy followed by the researcher. In general, the methods fall into three classes. The first is based on color segmentation (red, blue, yellow) [1]. The second uses the geometry of objects (triangle, square, rectangle) [2]. The third uses artificial intelligence (AI), specifically DL with CNN architectures [3]. For road safety, we use ITS systems [4]. Such a system is devoted to detecting and recognizing all traffic signs in real time by distinguishing them from the other objects present in the environment (passages, animals, cars, trucks, buildings, and so on) [5]. These systems are used in Advanced Driving Assistance Systems (ADAS) [6] [7] and rely on a digital camera to perceive the road environment.
There are standard techniques for detecting and recognizing traffic signs, for example the scale-invariant feature transform (SIFT) [8] [9], local binary patterns (LBP) [10], and the histogram of oriented gradients (HOG) [11]. There are also more advanced techniques for classifying different objects, in which the feature vectors are extracted from the training dataset, for example the support vector machine (SVM) [12], VGG16 [13], and ImageNet [14]. In recent years, CNN models have been used for complex classification situations [15]. CNN architectures are among the best models because they analyze visual input in a way comparable to human vision [16].
To guarantee a reliable and effective model, most research work in the field of AI tunes the following parameters: optimizer [17], accuracy function [18], loss function [19], dataset [20], and architecture [21].
In this paper, we tune several parameters to obtain a robust and efficient model for traffic sign detection and recognition. First, we examine the effect of normalizing and equalizing the images of the traffic sign dataset on model training. Second, based on the result of the first step, we choose between two fitting functions (simple fit and fit generator) and deploy the better one. Finally, we use data augmentation and discuss the effect of the batch size during model training. All this is to ensure that the proposed ITS system detects and recognizes signs well in advance so that the right decisions can be made as quickly as possible.
In our work, we test our approach, based on Computer Vision (CV) and Artificial Intelligence (AI), for the detection and recognition of different traffic signs in real time. The results can be exploited by an Intelligent Transportation System (ITS) to assist the driver. The paper is organized as follows: Section 1 introduces the techniques most used for the detection and recognition of road signs. Section 2 is dedicated to related work, and Section 3 presents a general view of the proposed approach. Section 4 describes the methodology. Section 5 is devoted to the experimentation and evaluation of the proposed approach. Sections 6 and 7 cover the real-time implementation for recognizing traffic signs. The last section is devoted to the conclusion.

II. RELATED WORK
Most of the developed applications that achieve high accuracy in object detection and recognition are based on RNN and CNN architectures [22]. Nevertheless, depending on the available data or the problem to be solved, one type of neural network may be more suitable than another. Generally, a

B. Deep Learning and Neural Network
Learning methods are among the techniques that use DL [45]. This method has revolutionized the industrial sector, especially embedded systems in the automotive sector [46]. It is more robust in object detection and recognition than the geometric and colorimetric methods, which are classic methods that are sensitive to many factors.
CNN models are built on neural networks: many hidden layers of a neural network are combined to produce a CNN. The neuron layers are grouped into three categories: input layers, hidden layers, and output layers. First, the input layer accepts the feature vectors of the dataset and includes a bias neuron. Second, the hidden layers form the link between input and output and also use a bias neuron. Finally, the output layer does not use a bias neuron. The output of a single neuron is calculated according to equation (1), where:
• The input vector (x) represents the feature vector.
• The vector θ represents the weights.
• ɸ is the transfer/activation function.
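Equation (1) itself did not survive in this copy of the text. A plausible reconstruction from the three symbols defined above, assuming the standard weighted-sum form of a single neuron (the function name f and the index i are assumptions, not taken from the paper), is:

```latex
f(x, \theta) = \phi\left( \sum_{i} \theta_i \, x_i \right) \tag{1}
```

Under this form, a bias neuron can be read as one extra input fixed at 1, whose weight acts as an offset inside the activation function.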

III. THE APPROACH DESCRIPTION
The approach that we propose for integration into an ITS system is essentially based on the creation of a CNN architecture to guarantee road safety for both passengers and drivers of vehicles. Our approach is based on two processes, the detection and the recognition of traffic signs, as shown in Fig. 1. The detection process uses camera-based CV techniques to receive images and ensure that traffic signs are detected. When signs are detected, the recognition process is activated using AI techniques. We use a CNN architecture to extract the characteristics of the road signs. To achieve our objective, deep training is done on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. In the end, each detected sign is assigned to a class together with its prediction probability.
The strategy we will follow to have an efficient CNN architecture is summarized in the following points:
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022, www.ijacsa.thesai.org

IV. METHODOLOGY
A. Transformation Techniques for Dataset
a) Visualization dataset: For our implementation, we use the German Traffic Sign Recognition Benchmark dataset [47], which is composed of three parts: training data, validation data, and testing data. The training set uses 80% of the data and the validation set uses 20%. The GTSRB contains 43 traffic sign classes, with 34799 images for training, 4410 images for validation, and 12630 images for testing, as illustrated in Fig. 2.
b) Normalization of a histogram: Normalizing a histogram consists of transforming the discrete distribution of intensities into a discrete distribution of probabilities [48]. To do this, we divide each value of the histogram by the number of pixels. In our case, the normalization is done by dividing all pixel values in an image by 255.
c) Equalization of a histogram: Histogram equalization is an image processing method that adjusts the contrast of an image by modifying the intensity distribution of its histogram [49]. Equalization is based on the cumulative distribution function, which is the cumulative sum of all the probabilities in its domain and is defined by equation (2).
The idea of this processing is to give the resulting image a linear cumulative distribution function.
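The two pre-processing steps can be illustrated with a minimal pure-Python sketch. It uses the textbook mapping (each intensity is sent through the cumulative distribution and rescaled to the full range) and the division by 255 described above. The function names are hypothetical, and a CV library may compute equalization with slightly different rounding or offsets, so this is an illustration of the idea rather than the implementation used in the paper.

```python
def equalize_histogram(image, levels=256):
    """Equalize a grayscale image given as rows of ints in [0, levels - 1]."""
    pixels = [p for row in image for p in row]
    total = len(pixels)

    # Histogram: number of pixels at each intensity level.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: running sum of the per-level probabilities.
    cdf = []
    running = 0.0
    for count in hist:
        running += count / total
        cdf.append(running)

    # Send each intensity through the CDF, rescaled to the full range.
    mapping = [round(c * (levels - 1)) for c in cdf]
    return [[mapping[p] for p in row] for row in image]


def normalize(image, max_value=255):
    """Scale pixel values to [0, 1] by dividing by max_value."""
    return [[p / max_value for p in row] for row in image]


# Tiny low-contrast example: intensities clustered between 100 and 103.
image = [
    [100, 100, 101, 101],
    [101, 102, 102, 103],
]
equalized = equalize_histogram(image)
normalized = normalize(equalized)
```

In this example the four nearly identical gray levels are spread over most of the 0 to 255 range, which is the contrast adjustment that equalization is meant to provide.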

B. Convolutional Neural Networks (CNNs)
The CV domain has been shaped by AI mainly through CNNs. This type of neural network architecture was introduced with LeNet-5 [50]. The next step is the description of each layer type used in the CNN model.
a) Convolution layers: The first layer of analysis is the convolution layer. It detects the characteristics of each visual element (circles, lines, colors, edges, and so on) by means of internal filters in the layer. If the number of filters is well chosen, more features are extracted, which improves accuracy. The filters are square and sweep over the image from left to right. Another very important parameter is the width and height of the filter, which affects the number of features extracted from the images. The single output matrix of the convolution layer is described in equation (3), where Img is the input matrix and Ker is the kernel matrix.
Each set of kernel matrices represents a local feature extractor that extracts regional features from the input matrices. A training algorithm that optimizes neural network connection weights, such as backpropagation, can be applied here to train the kernel matrices and biases as shared neuron connection weights.
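To make the sliding-filter idea concrete, here is a small pure-Python sketch of one kernel applied to one input matrix. It assumes stride 1, no padding, and the unflipped (cross-correlation) form that deep learning frameworks commonly call convolution. The function name is hypothetical, and the paper's convolution equation may use a different index convention.

```python
def conv2d_valid(img, ker):
    """Slide a square kernel over img (stride 1, no padding)."""
    k = len(ker)
    out_h = len(img) - k + 1
    out_w = len(img[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k x k patch whose top-left corner is (i, j).
            acc = 0
            for m in range(k):
                for n in range(k):
                    acc += img[i + m][j + n] * ker[m][n]
            row.append(acc)
        out.append(row)
    return out


# 4x4 image with a vertical edge between columns 1 and 2.
img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 2x2 kernel that responds to a dark-to-bright transition, left to right.
ker = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(img, ker)
```

The resulting 3x3 feature map is non-zero only in the column where the edge lies, which is how a filter acts as a local feature extractor. A larger kernel would produce a smaller output matrix from the same input.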

b) Max pooling layers and dropout layers:
A max-pooling layer is placed after every convolution layer. It resizes a 2D feature map to a smaller dimension [51]. Most CNN frameworks implement dropout as a separate layer to avoid overfitting in DL. Dropout layers function like regular, densely connected layers; the only difference is that a dropout layer randomly drops some of its neurons during training.
c) Activation function: Current deep neural networks mainly use the following activation functions, each of which has a role to play in the network. For the output of the hidden layers, we use the ReLU (Rectified Linear Unit) function [52], which is calculated as in equation (4).
The ReLU activation function [53][54] was one of the key improvements in CNN applications that made deep learning effective. The ReLU function is not differentiable at the origin; in practice, backpropagation implementations simply assign a fixed derivative (typically 0) at that point. ReLU rectifies the feature map: positive values are kept and negative values are set to zero.
For the output of the classification CNN, we implemented SoftMax, which is calculated as in equation (5).
The SoftMax function is only useful with more than one output neuron. It guarantees that the sum of all output neurons is equal to 1.0. It is therefore very useful for classification, where it indicates the probability that each of the classes is the correct choice.
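Both activation functions are short enough to sketch directly. The version below assumes the standard definitions, ReLU(x) = max(0, x) and SoftMax(z)_i = exp(z_i) / sum_j exp(z_j). Subtracting the maximum before exponentiating is a common numerical-stability habit and is not something the text specifies.

```python
import math


def relu(x):
    """Keep positive values, set negative values to zero."""
    return max(0.0, x)


def softmax(z):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the max does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


hidden = [relu(v) for v in [-2.0, 0.0, 3.5]]
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The index of the largest SoftMax output is taken as the predicted class, and the value at that index is the prediction probability attached to the detected sign.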

d) Optimization function:
The Adam optimizer is very effective [55]. Adam estimates the first moment (mean) and the second moment (variance) of the gradients to determine the weight corrections. Adam begins with an exponentially decaying average of past gradients (m), described in equation (6).
This average accomplishes a goal similar to a classic momentum update; however, its value is calculated automatically from the current gradient (g_t). The update rule then calculates the second moment (v_t) in equation (7).
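Equations (6) and (7) are missing from this copy of the text. Assuming they follow the standard Adam formulation, with the decay rates written as β_1 and β_2, they would read:

```latex
\begin{align}
m_t &= \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t \tag{6} \\
v_t &= \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^{2} \tag{7}
\end{align}
```

Both are exponentially decaying averages: the first of the gradient itself, the second of its square.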
The values m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; β_1 and β_2 are the exponential decay rates. Adam is very tolerant of the initial learning rate (η) and other training parameters. The commonly used default values are β_1 = 0.9, β_2 = 0.999, and 10^-8 for the small constant that prevents division by zero [45].
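A single Adam update can be sketched in a few lines. The moment estimates follow the description above. The bias correction and the final parameter update come from the standard Adam algorithm and are not spelled out in the text, so they are assumptions here. The toy objective, the function name, and the learning rate of 0.01 are illustrative only.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (standard Adam)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimize f(theta) = theta^2, whose gradient is 2 * theta.
first_theta, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, 1, lr=0.01)

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 51):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

On the first step the corrected moments cancel, so the parameter moves by almost exactly the learning rate. This is one way to see why Adam is relatively insensitive to the raw gradient scale.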

C. CNN Architecture
The images in our dataset have dimensions (32,32,3), and we convert them from the RGB color space to gray level, so the input images of our architecture have dimensions (32,32,1). Table III presents the architecture of the CNN in detail: types of layers, output shapes, and activation functions. The layers are shown with their corresponding type, denoting the characteristics used. Training the CNN on a CPU takes considerable time because the image dataset is computationally demanding; for a faster implementation, we propose using a GPU.
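The change of input shape from (32,32,3) to (32,32,1) can be sketched as follows. The luminance weights 0.299, 0.587, and 0.114 are the usual ones for RGB to gray conversion. Whether the paper used exactly these weights is an assumption, and the helper names are hypothetical.

```python
def rgb_to_gray(image):
    """Convert an H x W x 3 image (nested sequences, RGB order) to H x W x 1."""
    return [
        [[0.299 * r + 0.587 * g + 0.114 * b] for (r, g, b) in row]
        for row in image
    ]


def shape(image):
    """Return (height, width, channels) of a nested-sequence image."""
    return (len(image), len(image[0]), len(image[0][0]))


# Dummy 32x32 RGB image: pure red everywhere.
rgb = [[(255, 0, 0) for _ in range(32)] for _ in range(32)]
gray = rgb_to_gray(rgb)
```

Keeping the trailing channel dimension of size 1 matters because convolution layers expect a channel axis even for single-channel input.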

A. Simple Fit Function
Training the proposed CNN model requires two essential elements: the training data and the training labels. For training, we use the fit function of the Keras library. The number of epochs is the number of times the model runs through the data. The more epochs we run, the more the model improves, up to a certain point. We trained our model for 50 epochs with a batch size of 32. We also train on the dataset after histogram equalization and normalization. Training without equalization and normalization is denoted Method 1, and training with equalization and normalization is denoted Method 2.
In the accuracy curve of Fig. 3, we can see a drop during training in both runs, for 50 and for 100 epochs, while the loss curve shows a large increase in the error value. So Method 1 leads to overfitting. From Fig. 4, we can deduce that Method 2 shows neither underfitting nor overfitting in the accuracy curve, and increasing the number of epochs did not disturb learning stability. Likewise, the loss curve shows a marked reduction in error values compared with the curve of Method 1. Equalization and normalization can therefore be applied in almost all cases. Method 2 yields a negligible loss, and the accuracy of our network improves significantly.
Table IV compares the test accuracy and loss of Method 1 and Method 2. We can conclude from it that equalization and normalization are necessary. Equalization adjusts the contrast of the images in the dataset. Normalization makes training faster, and the loss surface becomes more circularly symmetric. The equalization and normalization algorithms thus improve the performance of CNN classification. The next step is to replace the simple fit function with a fit generator and to check whether it gives good predictions and how accuracy and loss evolve.

B. Fit Generator Function
We propose to use the fit generator function to accept the datasets, perform backpropagation, and update the weights of our model. This function has a hyperparameter, the number of steps per epoch, whose value is the size of the training set divided by the batch size. The generator is based on an infinite loop, which must never return empty or exit. Researchers generally calculate the number of steps per epoch as the total number of training images divided by the batch size.
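The contract between an infinite generator and the steps-per-epoch value can be illustrated without any deep learning library. The generator below is a hypothetical stand-in for the data source handed to a fit-generator style training loop, not the Keras implementation: it never terminates, so the consumer decides where an epoch ends by counting batches.

```python
import itertools
import math


def batch_generator(samples, labels, batch_size):
    """Yield (batch_samples, batch_labels) forever, restarting after each pass."""
    n = len(samples)
    while True:  # infinite loop: the generator never returns or runs empty
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield samples[start:end], labels[start:end]


samples = list(range(10))
labels = [s % 2 for s in samples]
batch_size = 4

# One epoch = enough batches to cover every sample once.
steps_per_epoch = math.ceil(len(samples) / batch_size)

gen = batch_generator(samples, labels, batch_size)
one_epoch = list(itertools.islice(gen, steps_per_epoch))
next_batch = next(gen)  # first batch of the second pass
```

Drawing exactly steps_per_epoch batches covers all ten samples, and the next draw starts again from the beginning, which is why the generator must never exit on its own.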
The idea of our experiment is to use Method 2 from the previous section, driven by the fit generator function with a batch size of 32. We compare different optimizers (Adam and SGD) and loss functions (categorical cross-entropy and mean squared error). We fixed the learning rate at 10^-2 and the number of epochs at 50.
In Table V, when the loss function is categorical cross-entropy, we obtain a high prediction score with a low loss score. We obtained the best scores with the Adam optimizer and the categorical cross-entropy function: 97.11% accuracy and 11.32% loss. The model now achieves the lowest loss score; the next objective is to improve the accuracy score.
As we can see in Fig. 5, using the fit generator function to train the model achieves the objective: at 90% we keep the situation under control so that our DL model does not overtrain. The assumptions are therefore correct, since we use the whole dataset at each epoch. We need to choose a batch size and a number of steps per epoch whose product gives the total number of samples. Usually, the limit is set by the available resources: if memory is a problem, we need to reduce the batch size until a batch fits on the GPU. Note that this implementation also allows us to use the multiprocessing argument of the fit generator, where the number of threads specified in workers corresponds to those that generate batches in parallel. A sufficiently high number of workers ensures that the calculations performed on the GPU are managed efficiently; in other words, the bottleneck of the whole training process is then the propagation operations. In our case, we would set the batch size to the desired value; we would change it only if we wanted the model not to use all the data in each epoch, which departs from the definition of the word epoch.
We introduce one more technique to improve the model training process: data augmentation. This technique creates new data for our CNN model to use during training. This is done by taking our existing dataset and transforming or altering the images in useful ways to create new images.

A. Image Data Generator Function
Take a typical sign image, such as a STOP sign image. We can transform this image to create a different image representing the same stop sign. The transformation could be a rotation, a zoom into the image, or a combination of both. These newly created images are referred to as augmented images because they allow us to augment our dataset by adding them. Data augmentation is useful because it allows our model to look at each image in our dataset from a variety of perspectives. This allows it to extract relevant features more accurately and to obtain more feature-related data from each training image. This is especially the case for our traffic sign dataset, because the images are small (32x32) and the number of classes is large. Certain classes have very few training images, approximately only 200, as shown in Fig. 2. Data augmentation can therefore benefit our traffic sign recognition model.
We apply the following five transformations: width shift range, height shift range, zoom range, shear range, and rotation range. These five transformations add sufficient variety to the GTSRB dataset and make the training process much more effective. The first transformation is the width shift. This is a horizontal translation of the image, which moves the sign off-center and helps our CNN model adapt to test images that are not necessarily centered. The range can be defined in two ways; if the range value is a number between 0 and 1, it refers to the fraction of the image that can be shifted. A value of 0.1 implies that the maximum horizontal shift is 10 percent of the image width. Images with only horizontal translation can look similar, so to differentiate the generated images we apply a second technique, vertical translation. The range value is defined in the same way, and we also set the vertical translation to 0.1 (10%).
The zoom transformation can zoom either out of or into the image. The degree of zoom is defined by a float value between 0 and 1: the maximum zoom out is one minus the float value, and the maximum zoom in is one plus the float value. We use a float value of 0.1, which by this definition allows zoom factors from 0.9 to 1.1. Next is the shear transformation. In plane geometry, a shear mapping is a linear map that displaces each point in a fixed direction by an amount proportional to its signed distance from the line that is parallel to that direction; it is possible in both the x-direction and the y-direction. This transformation is defined by the shear intensity, which refers to the magnitude of the shear angle in degrees, as seen in the image above. A small magnitude of shear is enough to be effective, so we use a value of 0.1. The last transformation is rotation, which is more intuitive: it rotates an image by a certain number of degrees. This value is defined as an integer; in our case, we use 10. These transformations are first applied to a stop sign, as shown in Fig. 6, and are then applied to the GTSRB dataset.
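The five settings can be collected in one place and their practical meaning worked out for 32x32 images. The key names below mirror the keyword arguments of Keras' `ImageDataGenerator`. The paper does not show its code, so the mapping of the quoted values onto these keywords is an assumption, and the dictionary itself is only an illustrative container.

```python
# Values quoted in the text; key names follow Keras' ImageDataGenerator.
augmentation = {
    "width_shift_range": 0.1,   # fraction of image width
    "height_shift_range": 0.1,  # fraction of image height
    "zoom_range": 0.1,          # zoom factor drawn from [1 - z, 1 + z]
    "shear_range": 0.1,         # shear intensity
    "rotation_range": 10,       # degrees
}

image_size = 32  # GTSRB images are 32x32 in this work

# Largest horizontal or vertical shift, in pixels.
max_shift_px = augmentation["width_shift_range"] * image_size

# Zoom interval implied by a single float value.
zoom_low = 1 - augmentation["zoom_range"]
zoom_high = 1 + augmentation["zoom_range"]
```

With these values a sign moves by at most about three pixels in either direction, which keeps it inside the frame while still breaking the assumption that it is centered. If Keras is available, the dictionary could be passed as `ImageDataGenerator(**augmentation)`.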

B. Batch Size Function
First, we declare a batch size of 32, which means that our image generator creates a batch of 32 images at a time for our CNN model to use, as illustrated in Fig. 7. The next argument is the steps per epoch, which refers to the number of batches: it must specify the number of batches of samples comprising one epoch. In our case, the original dataset has 34799 images and the batch size is 32. A reasonable value for steps per epoch when fitting a model on the augmented data would therefore be ceil(34799/32), or 1088 batches. We fix the number of steps per epoch at 1000.
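The arithmetic behind these numbers can be checked directly. The sketch below uses only the figures given in the text (34799 training images, the four batch sizes tested, 1000 steps per epoch); the variable names are illustrative.

```python
import math

n_train = 34799
batch_size = 32

# Batches needed to see every training image once (last batch is partial).
full_pass_steps = math.ceil(n_train / batch_size)
# Number of complete batches of 32 images.
complete_batches = n_train // batch_size

# With the step count fixed at 1000, images drawn by the generator per epoch.
chosen_steps = 1000
images_per_epoch = chosen_steps * batch_size

# Full-pass step counts for the batch sizes compared in the experiments.
steps_by_batch_size = {b: math.ceil(n_train / b) for b in (32, 64, 128, 256)}
```

A full pass takes 1088 batches, of which 1087 are complete. Fixing 1000 steps at batch size 32 therefore draws 32000 augmented images per epoch, slightly fewer than one full pass. For the larger batch sizes, 1000 steps correspond to several passes over the original images.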

C. Experimental Results
With the steps per epoch fixed at 1000, we vary the number of epochs between 50 and 150 and observe that accuracy increases while the loss decreases. We fit and evaluate all these models with different batch sizes (32, 64, 128, and 256), using the same procedure as above with the Adam optimizer and the same number of steps per epoch, for different numbers of epochs found through some minor experimentation. The classification accuracy of the models on the test set lies between 96.86% and 98.01%. Note that the results may vary given the stochastic nature of the training algorithm. Table VI shows the effect of the batch size; these tests were very time-consuming. With a batch size of 256, we obtain an accuracy of 98.01% at 50 epochs, which is very interesting, together with a remarkable reduction of the loss to 9.15%. Similarly, at 100 epochs the two values are 97.99% and 9.11%.
In Fig. 8, we can see that the validation curve converges to above 99%. A significant improvement is shown over our previous CNN model, which suggests that our modification was effective. The gap between training and validation is much smaller, for both loss and accuracy. This demonstrates consistency in our training and a better-trained model, and we now finish training with a validation accuracy and a training accuracy of over 98%. This shows that our augmentation technique was effective. To keep the model from learning overly complex patterns and to avoid overfitting, we use more dropout layers in our architecture and check its performance. With dataset augmentation applied after histogram normalization and equalization, the model learned the data better and the accuracy improved. There is just one more test that our model needs to pass: classifying images from the test dataset correctly.
We start by testing the CNN model on images it has not seen before.

D. Performance Analysis of the Trained Model
We define several measures based on the confusion matrix to quantify the performance of a classifier from different points of view: precision by class, average precision, recall by class, average recall, F-score by class, and average F-score.

a) Precision of classification:
The precision of a classifier with respect to a certain class (in other words, a certain modality of the variable to be predicted) is measured as the proportion of individuals, among all those for whom the classifier predicted this class, who actually belong to it. The global average of the recall over all classes i can be evaluated by the macro-average, which first calculates the recall for each class i and then averages the recalls over the n classes. The quantities involved are the recall of each class i, the number of classes n, the average precision over all classes, and the average recall over all classes.
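The per-class and macro-averaged measures can be computed from a confusion matrix as sketched below. The formulas are the usual ones: precision of class i = TP_i / (TP_i + FP_i), recall of class i = TP_i / (TP_i + FN_i), F-score = harmonic mean of the two, and macro-average = plain mean over the n classes. The symbol names, the function names, and the 3-class matrix are illustrative assumptions, not values from the paper.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    precision, recall, f_score = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted_i = sum(cm[r][i] for r in range(n))  # TP + FP (column sum)
        actual_i = sum(cm[i])                          # TP + FN (row sum)
        p = tp / predicted_i if predicted_i else 0.0
        r = tp / actual_i if actual_i else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precision.append(p)
        recall.append(r)
        f_score.append(f)
    return precision, recall, f_score


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
precision, recall, f_score = per_class_metrics(cm)
macro_precision = macro_average(precision)
macro_recall = macro_average(recall)
```

Because the macro-average gives every class the same weight, a rare class with poor recall lowers the average as much as a frequent one. This is relevant for GTSRB, where some classes have far fewer images than others.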

E. Confusion Matrix
The confusion matrix identifies the classes of signs and gives the number of times each class is confused with another, as shown in Fig. 9. Most of the color lies on the diagonal, but a few off-diagonal spots remain. Looking closely at the confusion matrix, we see that class [0] has very low counts compared with the other classes, while confusion is minimal for the other classes. The diagonal entries are the true positives of each class, and the off-diagonal entries are incorrect classifications made by the model.

F. Classifier Metrics
A classification report is used to measure the quality of the predictions of a classification algorithm. As Table VII shows, recall and precision, which are calculated for individual classes, have good scores for all classes of traffic signs. For multiclass classification problems, we use macro, micro, or weighted averages of recall, precision, and F1 score; the highest score of our model is 98%, which is very satisfactory.

A. Test with the Test Dataset
A remarkable performance is illustrated in Fig. 10. We now test the model on the test dataset and observe its reaction to see where it fails. We visualized the class predictions for the test images, and the results are good: all the images were correctly classified. The curve shown next to each image represents the predicted class of the image among the 43 classes (labeled 0 to 42); a blue curve with a single peak means that the image was assigned to the right class without error.

B. Testing the Proposed CNN Model in Real-Time
In this section, we present and evaluate the results of our approach. Traffic signs that appear in the video sequences are often detected. More details on the video sequences are given in Table VIII. In general, for all performance indicators, our proposed approach outperforms other object detection algorithms, achieving up to 100% accuracy. The metric values of our CNN model are often higher than the results of previous work. For the video sequences, our algorithm achieves a high prediction probability for the traffic sign classes. This shows that using a robust appearance-based CNN model achieves better results. It can also be observed that the CNN precision obtained for the video sequences is higher than that obtained by the other approaches, ranging between 97.56% and 100%.
In this paper, we proposed a methodology for constructing robust CNN models. We discussed the problems associated with detecting and recognizing traffic signs in real time. We also demonstrated how using the right tools and techniques helps in developing robust CNN models, which can help guarantee road safety in real time. We also applied pre-processing techniques (histogram equalization and normalization) to further improve the model's accuracy. Adding data augmentation improved the performance of our deep learning CNN model, and we are curious about how much further the accuracy can be improved by adding such simple transformations. We think these results could be used further in the development of automotive systems, such as intelligent transportation systems (ITS). All this serves the goal of safer roads; in the future, we will try to obtain better and more optimized performance. It is also very interesting to note that the proposed CNN model reaches 98% accuracy using an NVIDIA GPU, which makes it feasible for real-time traffic sign recognition.
In future work, we plan to study other neural network architectures that have been shown to be promising for traffic sign detection or classification. In addition, we will attempt to employ these networks in advanced in-vehicle platforms applicable to intelligent transportation systems to reveal valuable information that will help drivers make the right decisions in the real world.