Data Augmentation for Deep Learning Algorithms that Perform Driver Drowsiness Detection

Driver drowsiness is one of the main causes of driver-related motor vehicle collisions, as it impairs a person’s concentration whilst driving. With advances in computer vision and deep learning (DL), driver drowsiness detection systems have been developed in an attempt to improve road safety. These systems experienced performance degradation under real-world testing due to factors such as driver movement and poor lighting. This study proposes to improve the training of DL models for driver drowsiness detection by applying data augmentation (DA) techniques that model these real-world scenarios. This paper studies six DL models for driver drowsiness detection: four configurations of a Convolutional Neural Network (CNN), comprising two custom configurations and the two architectures designed by the Visual Geometry Group (VGG) (i.e. VGG16 and VGG19); a Generative Adversarial Network (GAN); and a Multi-Layer Perceptron (MLP). These DL models were trained using two datasets of eye images, where the state of the eye (open or closed) is used in determining driver drowsiness. The performance of the DL models was measured with respect to accuracy, F1-Score, precision, negative class precision, recall and specificity. When comparing, in aggregate, the performance of DL models trained on datasets with and without DA, it was found that all metrics improved. After removing outliers from the results, it was found that the average improvement in both accuracy and F1-Score due to DA was +4.3%. Furthermore, it is shown that the extent to which the DA techniques improve DL model performance is correlated with the inherent model performance. For DL models with accuracy and F1-Score ≤ 90%, results show that the DA techniques studied should improve performance by at least +5%.


I. INTRODUCTION
Road accidents represent a major socio-economic challenge for individuals, industries, and nations [1]. Commuters involved in road accidents are affected in a variety of ways, such as death, physical injuries, psychological trauma and financial burdens from damage to property [1][2][3][4]. For industries, road accidents adversely affect supply chain performance and logistics, reducing operational efficiency [5][6][7]. The net result adversely impacts the economy of a country. Furthermore, for national authorities, road accidents cause traffic congestion, damage to infrastructure and increased environmental pollution. Road accidents are a greater concern in developing countries, wherein more than 90% of accidents result in fatalities [1]. Of all developing countries, the World Health Organisation reports that South Africa has the poorest road safety record, with approximately 14 000 deaths per annum and an accident fatality rate of 3.2% [2,8,9].
The factors that cause road accidents need to be identified before an effective solution can be developed. Studies, such as those presented by Machetele and Yessoufou [1] and Verster and Fourie [2], highlight that driver-related accidents account for 80% to 90% of fatal road accidents. A key cause of driver-related accidents is drowsiness (which may result from excessive alcohol consumption), as this impairs a person's concentration and focus [2,10]. The detection of driver fatigue or drowsiness is hence essential to improving road safety and reducing the accident rate [11,12].
In light of the fourth industrial revolution, technology is becoming more ubiquitous and there is growing motivation to utilize artificial intelligence and machine learning to solve social problems, such as driver drowsiness detection. To this end, there has been a range of studies that apply deep learning (DL) techniques to solve the problem of driver drowsiness detection [13][14][15][16][17][18][19]. DL is a subset of machine learning that mimics the neural network of the human brain, thus creating an artificial neural network [14]. Artificial neural networks comprise multiple nodes, organized into layers, that model neurons of the human brain [20]. Data is propagated from the input layer to the output layer. These artificial neural networks have the potential to solve regression and classification problems, including image classification problems [20,21]. In the context of image classification, each layer trains upon the output of the previous layer, enabling later layers to identify more intricate elements of the images [21].
At a technical level, the aforementioned studies perform driver drowsiness detection by considering images of a driver's eye, and using DL algorithms to determine the eye state (i.e. whether the eye is open or closed). By applying this technology to frames from a video feed of the driver, it is possible to determine whether the eyes are closed for extended periods of time, which is an indicator of drowsiness. Some of the DL algorithms used in literature include: (i) convolutional neural networks (CNNs) of different configurations [14-16, 18, 22, 23]; (ii) the multi-layer perceptron (MLP) [13,24]; (iii) the respective Visual Geometry Group 16 (VGG16) [25,26] and 19 (VGG19) [17,26] models; as well as (iv) the generative adversarial network (GAN) [27]. The reported accuracies of the models in these studies range between 75% and 96%.
Despite the high accuracies reported in the studies, real-world challenges during implementation were reported that adversely affected the accuracy of the trained models. Among these challenges were: (i) poor lighting, where lighting is either too bright or too dim [13,14,17,19]; (ii) changes to the driver's seat position [22]; (iii) a change in the angle of the driver's face while driving [13,22]; and (iv) the use of spectacles and/or sunglasses by drivers [14,17-19,24].
In this paper, the authors propose to address these real-world challenges by performing data augmentation (DA) on the training image sets that are input into DL models for driver drowsiness detection. DA techniques introduce artificial images that simulate real-world effects [28], such as different lighting environments and changes to face orientation. This study also uses a training dataset containing images of drivers with and without eyewear to address the challenges associated with drivers wearing spectacles or sunglasses. The DA techniques are tested on CNN models, GAN models, MLP models and both the VGG16 and VGG19 models. Hyperparameter tuning is performed on all models to optimize their learning rate and enhance their overall performance. Literature has shown that careful selection of hyperparameters has a significant impact on model performance [28,29]. The effect of the DA is evaluated by comparing the performance of models trained with and without DA with respect to the following metrics: (i) accuracy, (ii) precision, (iii) negative class precision, (iv) recall, (v) specificity, and (vi) F1-score. It is hypothesized that the use of DA will result in improved performance of all models.
It is noted that previous studies in literature [14,25,27] have incorporated the use of DA in improving the performance of their specific driver drowsiness detection models. However, to the best of the authors' knowledge, there are no comprehensive studies that investigate DA techniques for a wide range of DL algorithms in the context of driver drowsiness detection, as is done in this paper.
The research in this paper makes the following contributions:
1) Presenting an overview of DA techniques to model the specific real-world scenarios that cause challenges for driver drowsiness detection systems.
2) Studying the DA techniques on a wide range of DL models that perform driver drowsiness detection and statistically analyzing the effects of the DA techniques.
3) Demonstrating the extent to which the DA techniques studied are able to improve DL models that perform driver drowsiness detection and proposing a design guideline for DL model developers on the conditions under which the DA techniques should be considered.
The rest of this paper is organized as follows. In Section II, a review of existing literature is presented. Section III presents the materials and methods used in this study, including an overview of a real-world drowsiness detection system. In Section IV, the results of the investigations are presented. Finally, conclusions and insights drawn from this study are presented in Section V, which also makes recommendations for future work.

II. RELATED WORK
This section reviews the DL algorithms that have been extensively used in previous studies to implement models and applications for drowsiness detection in motorists.
A study by Jabbar et al. [14] proposed a drowsiness detection system that could be implemented on the driver's dashboard, using an Android phone. The system was able to predict the drowsiness of the driver based on their eye state. This study made use of a CNN to implement a binary classification model that was able to classify drowsiness in facial images. Data augmentation techniques were applied to the images before the model was trained on them. The Dlib C++ library was used to extract the driver's facial landmarks from the images. These facial features were fed into the algorithm for training. The dataset was created using the extracted eye features. This model achieved an accuracy of 83.3%. A similar study by Zhang, Su, Geng and Xiao [18] was conducted to detect the drowsiness of a person, using the eye state. This proposed model was implemented on an infrared video camera. The AdaBoost algorithm was used to extract facial landmarks from the images. The extracted eye landmarks were used to create the image dataset on which the model was trained. The CNN model was used as the binary classifier for drowsiness. An accuracy of 95.8% was achieved by this study.
Sharan, Viji, Pradeep and Sajith [15] proposed a similar drowsiness detection system to Jabbar et al. [14] that could be implemented on the driver dashboard. However, this study proposed that a Raspberry Pi camera module be used to capture the driver's face. The drowsiness prediction was also based on the eye state. The Haar Cascade classifier was used for facial extraction during the implementation of this system and a CNN was implemented as the binary classifier. Contrast Limited Adaptive Histogram Equalization was applied to the images to remove noise and improve picture quality before the CNN model was trained on them. The CNN model was trained on an existing dataset comprising eye images. The study by Seetharman, Sridhar and Mootha [22] made use of a CNN to classify drowsiness in images. The prediction was based on the eye and mouth state of the extracted faces. The Dlib library was utilized to extract the facial regions from the images, similar to the study done by Jabbar et al. [14]. A dataset for the model was then generated using the extracted eye regions. The trained CNN model achieved an accuracy of 92.4%. In addition, this proposed model was intended to be implemented on a dashboard video camera. Chirra, Uyyala and Kolli [16] proposed a similar model for drowsiness detection, in which a CNN was used to predict drowsiness in images. The eye state was the metric for prediction, with the Viola-Jones algorithm used to extract the facial landmarks from the images during the implementation of this system. An existing dataset of eye images was used to train the CNN model. The model produced an accuracy of 96.42%. This model was also proposed to be implemented on a video camera for drowsiness detection, like the study conducted by Seetharman, Sridhar and Mootha [22].
An approach using the VGG19 model to detect driver drowsiness, based on the eye state, was proposed by Hashemi, Mirrashid and Shirazi [17]. This study made use of the Viola-Jones algorithm to extract the facial landmarks from the images. The extracted eye landmarks were then used to create the dataset for this model. The Viola-Jones algorithm has been utilised in previous work [16]. This model obtained an accuracy of 94.96%, with its intended application in driver dashboard monitoring. A study by Ahuja, Saurav, Srivastava and Shekhar [26] proposed an approach to improved drowsiness detection by using a knowledge distillation technique to reduce the size of DL models whilst maintaining high accuracy. A large model has high memory consumption and longer response times, hence the need to reduce the size of the DL model. The Histogram of Oriented Gradients algorithm was used to extract the facial regions from the images during system implementation. VGG19 and Visual Geometry Group 16 (VGG16) were the algorithms used to train their respective models to classify drowsiness in images. These models were trained on an existing dataset consisting of eye images. The predictions were based on the eye state for both models. The VGG19 and VGG16 models obtained accuracies of 92.5% and 95%, respectively.
Bajaj, Ray, Shedge, Jaikar and More [25] proposed a real-time drowsiness prediction system to be implemented as an Android application that monitors the driver's face from the dashboard. This system can predict drowsiness using the driver's eye state. A comparative analysis of three DL algorithms, specifically Inception, ResNet-50 and VGG16, was performed. Data augmentation techniques were applied to the images before the models were trained on them. The models were trained on an existing dataset comprising face images. The accuracies achieved by the Inception, ResNet-50 and VGG16 models were 89%, 56% and 91%, respectively.
A study by Jabbar et al. [13] proposed a system for drowsiness detection that could be implemented as an Android application for dashboard monitoring. The prediction of this system was based on the driver's eye state. The Dlib C++ library was used to extract the person's facial landmarks from the images. This library has been used for facial feature extraction in previous work [14,25]. These facial features were used to create the dataset, which was fed into the MLP algorithm for training. The model was able to classify a driver as either drowsy or non-drowsy. An accuracy rate of 80.92% was achieved by this model. A similar study by Ghourabi, Ghazouani and Barhoumi [24] made use of the MLP algorithm to detect drowsiness in images. The eye and mouth state were used to classify drowsiness. The Histogram of Oriented Gradients algorithm was used to extract the facial regions from the images. These extracted facial regions were used to create the dataset that was fed into the model for training. The model is intended to be implemented for dashboard monitoring. This study obtained an accuracy rate of 74.9%.
Ngxande, Tapamo and Burke [27] proposed a framework to reduce the bias of a model during the training process. A Generative Adversarial Network (GAN) model was trained on an image dataset. This model made predictions using facial landmarks and the eye state in particular. The extracted facial landmarks were used to create the dataset for model training. Data augmentation techniques were applied to the images before they were loaded into the GAN model. This helped to improve the performance of the binary classification model. An accuracy rate of 91.62% was achieved by this model.
Many of the studies have used facial and eye extraction algorithms to create image datasets from real-time data on which to train their models. However, this study used existing datasets that were available online to train the DL models, because it aimed at improving the performance of trained models regardless of the source of data. Therefore, no facial and eye extraction algorithms were used on real-time data in this study.
Literature has shown that many drowsiness detection models faced issues with prediction accuracy due to poor lighting and the use of sunglasses [13,14,17,18,24]. The other challenge that affected accuracy was the positioning of the driver's face [13,22]. Another gap identified is the lack of pre-processing and data augmentation applied to the data before training. Data augmentation was used in [14,25,27] to create more comprehensive models that exhibit improved performance. DA was used to remove bias from the models, thus improving the performance. However, few of the previous studies have comprehensively studied DA to model real-world scenarios and improve model performance on a wide range of DL algorithms that detect driver drowsiness, as is done in this study.
Therefore, this study aimed to develop an improved approach towards drowsiness detection by using data augmentation. Data augmentation techniques were used to create training data that replicate real-life scenarios that correlate with the challenges faced in previous studies.

III. MATERIALS AND METHODS
This section first provides an overview of a real-world driver drowsiness detection system and isolates the role of the DL algorithms that this study focuses on. The data sources and DA techniques utilized in this paper are then discussed. Thereafter, a technical summary of the DL algorithms considered is provided, along with the parameters used in this study. Finally, the authors present the different evaluation metrics that are used to quantify the performance of the DL algorithms.

A. An Overview of a Real-World Drowsiness Detection System
Fig. 1 illustrates the process flow for a real-world driver drowsiness detection system. The process starts with a camera that captures a video of the driver's face, which serves as the input to the system. The camera can either be mounted to the dashboard or steering wheel of the vehicle. The captured video is then stored on cloud-hosted infrastructure, typically in some form of unstructured blob storage.
At the start of the processing stage of the system, the video file is passed on to an artificial intelligence engine, consisting of three sub-units. The first sub-unit extracts individual frames from the video file, which will then be treated as a series of sequential images. The second sub-unit uses image detection techniques to isolate the eye from each image of the driver's face. This produces a series of sequential images of the driver's eyes. Finally, the third sub-unit utilizes a pre-trained DL model to analyze the images and determine the state of the driver's eye (open or closed) in each frame. The eye state determined in each frame is then logged in a database, which is also typically cloud-hosted.
In the final stage of the system, the eye states stored in the database are analyzed and interpreted to detect the drowsiness of the driver. Drowsiness is detected when the driver's eyes are in the 'closed' state for extended periods (multiple consecutive frames from the video feed).
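The final-stage rule described above (eyes closed over multiple consecutive frames) can be illustrated with a minimal sketch. The function names and the `min_closed_frames` threshold below are hypothetical and are not taken from this study; a deployed system would choose the threshold from the camera frame rate and the closure duration considered unsafe.

```python
from typing import Iterable


def longest_closed_run(eye_states: Iterable[str]) -> int:
    """Return the length of the longest run of consecutive 'closed' frames."""
    longest = current = 0
    for state in eye_states:
        # Extend the current run on a closed frame, otherwise reset it.
        current = current + 1 if state == "closed" else 0
        longest = max(longest, current)
    return longest


def is_drowsy(eye_states: Iterable[str], min_closed_frames: int = 15) -> bool:
    """Flag drowsiness when the eyes stay closed for enough consecutive frames.

    min_closed_frames is an assumed, illustrative threshold.
    """
    return longest_closed_run(eye_states) >= min_closed_frames


# Per-frame eye states as they might be read back from the database.
long_closure = ["open"] * 10 + ["closed"] * 20 + ["open"] * 5
normal_blinks = ["open", "closed", "closed", "open"] * 10

print(is_drowsy(long_closure))   # sustained closure
print(is_drowsy(normal_blinks))  # short blinks only
```

Counting consecutive frames, rather than the total number of closed frames, keeps ordinary blinking from being flagged.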

B. Design and Configuration of Study
The research presented in this study focuses on the third sub-unit of the artificial intelligence engine, viz. the DL algorithm that determines the driver's eye state, as described in Section III.A. Hence, for the experiments conducted, the inputs in this study were images of a driver's eye and the outputs were a categorical variable indicating the eye state. A binary categorical output was used, with the positive class label indicating the "open" eye state and the negative class label indicating the "closed" eye state. The experimental configuration used is depicted in Fig. 2.
In performing the experiments, appropriate datasets of eye images were first sourced. In selecting the datasets, the authors ensured that images where the eye was partially obscured by eyewear (spectacles or sunglasses) were included. By doing this, the DL models would learn to distinguish between eye states irrespective of the use of eyewear.
The datasets were then split into training and testing data using an 80:20 ratio. A copy of the training dataset was created, and data augmentation techniques were performed to model the real-world challenges of eye orientation and lighting conditions. Two DL models were trained: one was trained on the original (pre-treatment) training dataset, and the other was trained on the modified (post-treatment) training dataset. Depending on the architecture of the DL algorithm being investigated, any necessary data-shaping modifications were made to the images from the dataset.
The pre-treatment and post-treatment DL models were applied to the testing dataset to evaluate and compare their performance. As was the case with the training datasets, any modifications to the testing dataset required by the DL model architecture were made.
The experiments were done using pre-built Python libraries in the Jupyter Notebook development environment, on a personal computer equipped with 8 gigabytes of random-access memory, an Intel Core i5-7200U processor and a 64-bit Windows 10 operating system.

1) Selection of datasets:
There were two datasets utilised in this study, which were obtained from online repositories [30,31]. Both datasets contained images of human eyes with and without eyewear, and images labelled according to the eye state. The properties of the datasets are presented in Table I.
The balanced distribution of eye states was preserved when splitting each of the datasets into respective training and testing datasets, using an 80:20 ratio. The Scikit-learn Python library was used to implement the data splitting.
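To illustrate what preserving the class distribution in an 80:20 split means, the following is a dependency-free sketch of a stratified split. It is not the code used in this study, which relied on Scikit-learn (whose `train_test_split` function offers a `stratify` argument for the same purpose). The function name, file names and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split samples so that each class keeps the same train/test ratio."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(s, label) for s in items[:n_test]]
        train += [(s, label) for s in items[n_test:]]
    return train, test


# Hypothetical balanced dataset: 50 open-eye and 50 closed-eye images.
samples = [f"img_{i}.png" for i in range(100)]
labels = ["open"] * 50 + ["closed"] * 50

train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

Splitting each class separately is what keeps the open/closed balance identical in the training and testing subsets.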
When exploring the datasets, it was also noted that both sets of data contained images from a diverse range of ethnicities. Different skin tones and complexions were noted, as well as different eye shapes. The authors further observed that among female eyes, the extent to which make-up such as eyeliner and false eyelashes were used differed.
2) Data augmentation and pre-processing: Data augmentation improves model performance by generating variations of training data [14]. This reduces overfitting and improves the model's ability to make generalizations [14,32]. The specific augmentations performed in this study were designed to simulate real-world scenarios and overcome some of the challenges indicated in literature.
The ImageDataGenerator class within the Keras library for Python [33] was used to implement pre-processing and DA in this study. The ImageDataGenerator class supports DA in real time and ensures that the model is trained with different variations of images during each training iteration (epoch) [34,35].
The following pre-processing and data augmentation techniques were applied:
a) Brightness adjustment: Multiple studies in literature have shown that poor lighting conditions had a negative impact on the accuracy of DL models for driver drowsiness detection [13,14,17,19]. While driving, ambient lighting conditions can change due to environmental conditions such as the time of day and the weather. For example, driving at night results in very low brightness conditions and driving in bright sunshine results in very high brightness conditions. While driving, it is also possible for lighting conditions to change rapidly, such as when driving under a bridge/overpass on a sunny day or through the shadow cast by a building or other large structure.
To model scenarios with different lighting environments, this study applied a randomized change to image brightness when augmenting images. This is implemented by adding a constant, $\beta$, to all pixels in the image. The brightness adjustment function is mathematically described as:
$$x_{b} = x + \beta \quad (1)$$
In (1), $x$ is the original pixel value and $x_{b}$ is the brightness-adjusted pixel value.
b) Horizontal flips: The shape of a human eye may differ slightly between the left eye and the right eye. Creating artificial data by flipping the horizontal orientation allows the DL model to be trained to analyze either eye of the driver.
c) Rotation, translation and zoom: Literature showed that changes to the driver's face orientation were a real-world scenario that adversely affected the performance of DL models [13,21]. Therefore, in this study, rotation, translation shifts and zoom transformations were used to model changes to the driver's face orientation. Rotation and translational shifts are useful to simulate movement of a driver's head while travelling. Zoom transformations model a change in depth between the camera and the driver's face, which may result from the driver changing their seat position or posture.
d) Normalization, centering and standardization: Normalization and standardization improve the learning rate and reduce the number of epochs required to train a DL model [36,37]. These processes ensure that no individual input pixel dominates performance [38]. This is done by mathematically adjusting the data so that it has zero mean and unit variance [39].
Normalization involves rescaling the value of pixels to have a unit maximum, which reduces the computational power required to train the DL model. As all pixels have the same maximum value ($x_{\max}$), the normalization function is described by [36]:
$$x_{n} = \frac{x}{x_{\max}} \quad (2)$$
Centering ensures that the data has a mean of zero, while standardization ensures that the data has a unit variance [36]. Setting these statistical properties of the data improves the rate at which a DL algorithm converges when training, as well as increasing model accuracy by eliminating statistical bias.
Centering and standardization can be applied to data with respect to individual images (sample-wise) or with respect to the entire set of images (feature-wise). The functions for sample-wise centering (sc), feature-wise centering (fc), sample-wise standardization (ss) and feature-wise standardization (fs) are [39]:
$$x_{sc} = x - \bar{x}_{I} \quad (3)$$
$$x_{fc} = x - \bar{x}_{D} \quad (4)$$
$$x_{ss} = \frac{x}{\sigma_{I}} \quad (5)$$
$$x_{fs} = \frac{x}{\sigma_{D}} \quad (6)$$
In (2) - (6), $x$ represents the input pixel value, $\bar{x}$ represents the mean pixel value and $\sigma$ represents the standard deviation of pixel values. The subscripts 'I' and 'D' respectively denote statistics calculated over pixels from a single image (I) and statistics calculated over the entire dataset (D).
In this study, each of the above pre-processing operations is performed on input data.
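The pixel-level operations described in this subsection (brightness adjustment, horizontal flipping, normalization, centering and standardization) can be sketched on a small greyscale image held as nested lists. This is an illustrative, dependency-free sketch and not the Keras ImageDataGenerator implementation used in the study. The function names, the 8-bit maximum of 255 and the clipping of out-of-range pixels are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Image = List[List[float]]


def adjust_brightness(img: Image, beta: float, max_value: float = 255.0) -> Image:
    """Add a constant to every pixel; clipping to [0, max_value] is an assumption."""
    return [[min(max(p + beta, 0.0), max_value) for p in row] for row in img]


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def normalize(img: Image, max_value: float = 255.0) -> Image:
    """Rescale pixels so that the maximum possible value becomes 1."""
    return [[p / max_value for p in row] for row in img]


def pixel_stats(images: List[Image]) -> Tuple[float, float]:
    """Mean and standard deviation over all pixels of the given images.

    Pass a single image for sample-wise statistics, or the whole
    dataset for feature-wise statistics.
    """
    pixels = [p for img in images for row in img for p in row]
    return mean(pixels), pstdev(pixels)


def center(img: Image, mu: float) -> Image:
    """Subtract a mean value from every pixel."""
    return [[p - mu for p in row] for row in img]


def standardize(img: Image, sigma: float) -> Image:
    """Divide every pixel by a standard deviation (sigma must be non-zero)."""
    return [[p / sigma for p in row] for row in img]


img = [[0, 64], [128, 255]]

brighter = adjust_brightness(img, beta=40)
flipped = horizontal_flip(img)
scaled = normalize(img)

# Sample-wise centering followed by sample-wise standardization.
mu_i, sigma_i = pixel_stats([img])
z = standardize(center(img, mu_i), sigma_i)
print(pixel_stats([z]))  # mean close to 0, standard deviation close to 1
```

Using one `pixel_stats` helper for both cases makes the only difference between the sample-wise and feature-wise variants explicit: which set of pixels the statistics are computed over.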

3) Deep learning algorithms:
As discussed in Section I, DL is a subset of machine learning and involves mimicking the human brain. DL algorithms follow a common structure, to the extent that they adopt a layered architecture with multiple nodes at each layer. The DL algorithms for this study are designed to perform a binary classification in determining whether the eye state is 'open' or 'closed'. A brief overview of the different DL algorithms implemented in this study for image classification is provided below.

a) Convolutional neural network (CNN):
The CNN is the most popular artificial neural network (at the time of writing). There are typically three classes of layers in a CNN: convolution layers, pooling layers and fully-connected layers [16,40]. Fig. 3, re-produced from [41], illustrates the layout of these layers.
Convolution and pooling layers work together to perform feature extraction from the input image [16,40]. First, input data representing the pixels of an image is multiplied by the kernel filters of a convolution layer to generate feature maps. Thereafter, a pooling layer is used to group features together and reduce the size of the feature maps. Pooling features together improves the computation time of the DL algorithm [16].
Fig. 3. Basic CNN architecture [41]
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 1, 2023, www.ijacsa.thesai.org
The processed feature maps are then fed into one or more fully-connected layers. The final layer is referred to as the output layer, and any fully-connected layers between the pooling layer and the output layer are referred to as hidden layers. Each node in a fully-connected layer performs a mathematical operation on its input data using an activation function. These activation functions are selected to map inputs to suitable outputs and perform classification [42].
Two different CNN model configurations were investigated in this study. For brevity, they are referred to as CNN-C1 and CNN-C2. Their respective architectures are shown in Table II and Table III.
Table II describes the first CNN architecture used in this study. These layers are arranged sequentially in a linear stack [43]. The first two convolution layers in this model have 32 nodes each, which are responsible for learning multiple spatial patterns and features from the input image [44]. The last convolution layer has 64 nodes. A 3×3 kernel filter is used in each convolution layer to generate the feature maps. Each convolution layer applied same padding to the input image, which enabled the image to be completely covered by the kernel filter when generating a feature map [45]. Furthermore, each convolution layer was followed by a pooling layer that applies a maximum filter (max pooling).
Once the convolution was completed, the data was passed to the flatten layer to flatten the multi-dimensional feature map into one dimension [46]. This single-dimensional array was then forwarded into the dense layer of the network. A dense layer of 128 units is then used to perform the image classification, using the output from the convolution layers [47]. The last layer of this network was a two-unit output layer which made use of a softmax activation function that calculated the probabilities of each class [48]. There are only two units used in the output layer because these models are binary classifiers, with predictions made for only two class labels. The output produced by the softmax layer is represented in the form of a vector, which contains the probabilities of each class for every data sample. In addition, a Rectified Linear Unit (ReLU) activation function was added to each convolution layer and dense layer, to ensure no negative values were passed to the subsequent layers [16]. The ReLU activation function is given by:
$$f(x) = \max(0, x) \quad (7)$$
In (7), $x$ refers to the input data to the activation function.
Table III describes the second CNN configuration used in this study, which also consists of sequential layers. This configuration uses fewer convolution layers than CNN-C1, but more fully-connected layers when performing classification. CNN-C2 also applies an averaging filter in the pooling layers (average pooling). As with CNN-C1, a ReLU activation function was added to each convolutional layer and dense layer, to ensure no negative values propagated through the network.
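The feature-extraction steps described for the CNN configurations (a 3×3 kernel with same padding, a ReLU activation and max pooling) can be illustrated with a small pure-Python sketch for a single-channel image and a single filter. It is not the Keras model used in the study. The kernel values, the 2×2 pooling window and the function names are assumptions chosen for the example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(img: Matrix, kernel: Matrix) -> Matrix:
    """Apply an odd-sized square kernel with zero 'same' padding.

    The output has the same height and width as the input. As in most DL
    frameworks, the kernel is not flipped (cross-correlation).
    """
    h, w, k = len(img), len(img[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    # Positions outside the image act as zero padding.
                    if 0 <= y < h and 0 <= x < w:
                        total += img[y][x] * kernel[di][dj]
            out[i][j] = total
    return out


def relu(fm: Matrix) -> Matrix:
    """Replace negative feature-map values with zero."""
    return [[max(0.0, v) for v in row] for row in fm]


def max_pool(fm: Matrix, size: int = 2) -> Matrix:
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = len(fm), len(fm[0])
    return [
        [max(fm[y][x] for y in range(i, i + size) for x in range(j, j + size))
         for j in range(0, w - size + 1, size)]
        for i in range(0, h - size + 1, size)
    ]


img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]  # illustrative values only

feature_map = relu(conv2d_same(img, edge_kernel))
pooled = max_pool(feature_map)
print(len(feature_map), len(feature_map[0]))  # same size as the input
print(len(pooled), len(pooled[0]))            # halved by 2x2 pooling
```

In a trained network the kernel values are learned rather than fixed, and each convolution layer applies many such kernels in parallel.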

b) Visual geometry group (VGG) networks 16 and 19:
The VGG has conducted extensive research into DL algorithms for image classification that improve upon the traditional CNN [49]. The two VGG algorithms chosen were VGG16 [50] and VGG19 [51]. The VGG16 model consists of 13 convolution layers, five max pooling layers, two fully-connected layers and one softmax activation layer at the output [50]. The VGG19 model comprises 16 convolution layers, five max pooling layers, three fully-connected layers and one softmax activation layer at the output [51].
The VGG19 and VGG16 models used in this study were built using the Keras pre-trained VGG library. As with CNN-C1 and CNN-C2, the output layer was configured to have two units with a softmax output representing the probability of an image falling into either class.

c) Generative adversarial network (GAN):
GANs are a class of DL algorithms that have been applied to image classification problems [52]. The structure of a GAN, shown in Fig. 4 [53], comprises two sub-neural networks: a generator network and a discriminator network.
During training, both the generator and the discriminator learn concurrently. The function of a generator network is to produce new, artificial instances of data/images from the input features [52]. This is a form of data augmentation that occurs within the network architecture. The artificial images output from the generator network are evaluated by the discriminator to determine whether they adequately resemble images from the true training dataset. Back-propagation is then used to iteratively train the generator. Generator networks are typically seeded with randomized noise data.
The discriminator network is trained with images from both the actual dataset and the artificial images produced by the generator. When using a GAN, the discriminator is the final trained model that is tested and deployed in a system.
In the design of a GAN, the discriminator is often a CNN model, and the generator is often a de-convolutional neural network.
The GAN models in this study were built with the architectural layers described in Table IV. Three convolutional layers were used in this network, each with 128 nodes. Each convolutional layer was followed by a pooling layer to perform down-sampling. The data was then flattened and passed to a two-unit softmax output layer, where the output prediction was produced. The GAN models deployed a Leaky ReLU activation function, as described by (8), which was added to each down-sampling layer and dense layer. The Leaky ReLU activation function dampens the effect of negative values [54], but does not force them to zero like the standard ReLU function in (7).
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{8}$$
In (8), $x$ refers to the input data to the activation function and $\alpha$ is a small positive slope applied to negative inputs.

d) Multi-layer perceptron (MLP):
The MLP is a more basic DL architecture than those derived from the CNN, as it consists only of fully-connected layers [55,56]. The typical structure of an MLP consists of an input layer, an output layer and at least one hidden layer between the input and output layers. As such, the operation of the MLP is the same as that of the classification stage of a CNN. As a result, MLPs require data to be flattened at the input layer.
The MLP models in this study were built according to the architectural layers described in Table V. The ReLU activation function was implemented in the hidden layer.
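The fully-connected forward pass described for the MLP, together with the difference between the ReLU in (7) and the Leaky ReLU in (8), can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the 2×2 image, the weights, the biases, the layer sizes and the slope `alpha=0.01` are all hypothetical and do not correspond to the layer configuration in Table V.

```python
import math


def relu(x):
    """Standard ReLU, as in (7): negative inputs become zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU, as in (8): negative inputs are scaled by alpha (assumed value)."""
    return x if x > 0 else alpha * x


def flatten(image):
    """Flatten a 2-D image (list of rows) into a 1-D list for the MLP input."""
    return [pixel for row in image for pixel in row]


def dense(inputs, weights, biases, activation=None):
    """Fully-connected layer: one weighted sum plus bias per output unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def softmax(scores):
    """Normalise raw scores into class probabilities."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2x2 grayscale image with pixel values scaled to [0, 1].
image = [[0.0, 0.5], [1.0, 0.25]]
x = flatten(image)

# Hypothetical hidden layer (3 units, ReLU) and two-unit softmax output layer.
hidden_w = [[0.2, -0.4, 0.1, 0.3], [-0.5, 0.1, -0.3, 0.2], [0.3, 0.3, 0.3, 0.3]]
hidden_b = [0.0, 0.0, 0.1]
output_w = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.6]]
output_b = [0.0, 0.0]

hidden = dense(x, hidden_w, hidden_b, activation=relu)
probs = softmax(dense(hidden, output_w, output_b))
```

Swapping `activation=relu` for `activation=leaky_relu` in the hidden layer is the only change needed to compare the two activation functions in this sketch.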

4) Model evaluation:
When analysing model performance, this study considers a range of metrics collectively to provide a holistic evaluation of performance. The following performance metrics were used to evaluate the DL models: accuracy score, precision, negative class precision, recall, specificity and F1-score. These metrics are defined in (9)-(14), in terms of the number of true positive classifications ($TP$), the number of true negative classifications ($TN$), the number of false positive classifications ($FP$) and the number of false negative classifications ($FN$). These output classifications relate the true eye state (based on the known label associated with an image) to the detected eye state (based on the output of the model). The definitions of the different output classifications are visually represented in Fig. 5.
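As a sketch of how the six metrics listed above follow from the four output classifications, the snippet below counts true and false positives and negatives from paired labels and predictions, and then applies the standard definitions that (9)-(14) formalise. The label encoding (1 for open eye as the positive class, 0 for closed eye) and the sample labels are assumptions for illustration, not data from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classifier."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Compute the six evaluation metrics from the four counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "negative_class_precision": safe_div(tn, tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 1 = open eye (positive class), 0 = closed eye.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = evaluate(tp, tn, fp, fn)
```

Reporting specificity and negative class precision alongside recall and precision means that the closed-eye class is evaluated as explicitly as the open-eye class.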
a) Accuracy score: The accuracy score is a measure of how many correct predictions were made by the classifier, out of all the predictions made [57,58]. It is hence the percentage of true output classifications with respect to all output classifications, and is mathematically described as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$
b) Recall and specificity: Recall defines how well the model can correctly classify positive outcomes [58,59]. In the context of this study, recall indicates how many images of open eyes were correctly classified by the model. In addition, for a balanced evaluation of the predictions made for both class labels, the specificity metric was also used. Specificity indicates how well the model can correctly classify negative outcomes [58]. In the context of this study, it indicates how many images of closed eyes were correctly classified by the model. For the problem of driver drowsiness detection, being able to correctly identify when the driver's eyes are closed is of equal importance to identifying when the eye state is open. The mathematical definitions of recall and specificity are given in (10) and (11), respectively.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{11}$$
c) Precision and negative class precision: Precision represents the percentage of correct open eye state classifications from all open eye state classifications. The formula for precision is presented in (12). Similarly, the negative class precision represents the percentage of correct closed eye state classifications from all closed eye state classifications. The formula for negative class precision is presented in (13).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{12}$$
$$\text{Negative class precision} = \frac{TN}{TN + FN} \tag{13}$$
d) F1-Score: The F1-Score is the harmonic mean of precision and recall and is hence considered the most appropriate measure of model performance in some literature [57,61]. Equation (14) presents the mathematical formula to calculate F1-Score [61,62].
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{14}$$

IV. RESULTS AND DISCUSSION
This section presents and analyses the effects of data augmentation on model performance. Pre-treatment and post-treatment results are presented in Table VI and Table VII, and their descriptive statistics are presented in Table VIII. The change in performance metrics due to treatment is presented in Table IX. While results for all performance metrics are presented, the main analysis focuses mostly on accuracy and F1-score, as the latter provides insight into the underlying precision and recall.
In the analysis carried out, the authors first confirmed that the DA techniques adopted in this study improved the performance of the DL models that were investigated. Fig. 6 presents a box-and-whisker diagram of the statistical distribution of all evaluation metrics considered, and compares pre-treatment results with post-treatment results. From the results in Fig. 6, Table VII, Table VIII and Table IX, the following observations and interpretations were made:
1) The post-treatment mean and median values of all evaluation metrics are higher than the pre-treatment values (Table VIII and Table IX). This indicates that the average performance of all DL models studied improved due to the DA techniques applied. The average improvements in the most conclusive metrics, accuracy and F1-score, were +6.1% and +6.8%, respectively.
2) The interquartile ranges (IQRs) and standard deviations of post-treatment results were less than for pre-treatment results. In terms of the most conclusive metrics, accuracy and F1-Score, the IQR of both metrics decreased from 13% to 3%. The standard deviation of accuracy scores decreased from 0.17 to 0.12. Similarly, the standard deviation of F1-Scores decreased from 0.20 to 0.14. This indicates that there is less variability in the expected post-treatment performance of all DL algorithms.
3) Outliers were noted in the results, which are clearly illustrated in Fig. 6. These arose from the VGG16 and VGG19 models which were trained on Dataset 1 and displayed inferior performance to the other models studied. Upon investigation, this has been attributed to the dimensionality mismatch between Dataset 1 images (96×96 pixels) and the input dimensions defined by the VGG16 and VGG19 architectures (224×224). While the application of DA techniques has shown the greatest improvement to these models, the post-treatment performance is still low compared to the other models studied. It is thus concluded that the VGG models are not suitable for Dataset 1, and in practice, should not be used with low-resolution cameras that produce smaller video frames/images.

Having confirmed the hypothesis that the DA techniques that were applied have improved the performance of the DL models studied, the next step was to attempt to quantify the extent of this improvement. The VGG16 and VGG19 models trained on Dataset 1 were excluded from this analysis due to their poor performance, as discussed previously. Table X presents the change in evaluation metrics due to the application of DA with these models removed. The statistical distribution of the data presented in Table X is illustrated in Fig. 7.
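The descriptive statistics used in the observations above (mean, median, IQR and standard deviation) and the identification of outliers before their exclusion can be sketched with the Python standard library. The scores below are hypothetical and are not the values in Table VI or Table VII. The 1.5 × IQR fence is the common box-and-whisker convention and is assumed here; the source does not state which outlier rule was applied.

```python
import statistics


def describe(scores):
    """Summarise a list of metric scores with the statistics discussed above."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "std": statistics.stdev(scores),
    }


def find_outliers(scores, k=1.5):
    """Return scores outside the k * IQR fences (assumed box-plot convention)."""
    summary = describe(scores)
    lower = summary["q1"] - k * summary["iqr"]
    upper = summary["q3"] + k * summary["iqr"]
    return [s for s in scores if s < lower or s > upper]


# Hypothetical accuracy scores for eight model/dataset combinations.
pre_treatment = [0.45, 0.52, 0.88, 0.90, 0.92, 0.93, 0.95, 0.96]
post_treatment = [0.60, 0.66, 0.94, 0.95, 0.95, 0.96, 0.97, 0.98]

pre_summary = describe(pre_treatment)
post_summary = describe(post_treatment)
pre_outliers = find_outliers(pre_treatment)

# Repeat the summary with the outlying results excluded.
pre_without_outliers = [s for s in pre_treatment if s not in pre_outliers]
pre_summary_clean = describe(pre_without_outliers)
```

In this hypothetical data the two lowest scores play the role of the outlying results; removing them raises the mean and narrows the spread, which is why a summary with and without outliers can tell quite different stories.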
When analyzing the results, the following was observed:
1) A few instances were observed where applying DA treatment caused a reduction in individual evaluation metrics (recall, precision, specificity and negative class precision), as indicated by shaded backgrounds within Table X. Despite this, the F1-Score increased for all models, indicating that these performance reductions were compensated for. The average increase in both accuracy and F1-Score was +4.3%, and the median increases were +2.1% (accuracy) and +2.0% (F1-Score).
2) The box-and-whisker diagrams in Fig. 7 indicate that there is significantly more variability for recall, specificity, precision and negative class precision than for accuracy and F1-Score. As such, attempts at quantifying the expected improvement in DL model performance using the methods in this study can only reasonably be performed for accuracy and F1-Score. However, these are the most conclusive metrics to evaluate the DL models studied.
3) By analyzing the distribution of the change in accuracy and F1-Scores, it was observed that the data for these evaluation metrics was positively skewed. This resulted from the high pre-treatment accuracy scores and F1-Scores of some of the DL models studied, where there was not much room for improvement without over-fitting the model to the training dataset.
Prompted by the final observation listed above, the final analysis investigated the relationship between the change in evaluation metric scores and pre-treatment metric scores. The scatterplot presented in Fig. 8 illustrates this relationship, using data from Table VI, Table VII and Table X and excluding the outlier results from the VGG16 and VGG19 models that were trained on Dataset 1. The trend lines show that all evaluation metrics exhibited a strong negative correlation, as indicated by the $R^2$ values of the trend lines ($R^2 > 0.7$ for all evaluation metrics). From this, it is concluded that the DA techniques under study yield only a marginal improvement when applied to DL models that already exhibit strong performance, but are much more powerful in enhancing weaker-performing DL models. From Fig. 8, an improvement of ≥ +5% to an evaluation metric occurs when the pre-treatment value of the metric is ≤ 90%. This indicates the type of DL models for driver drowsiness detection that will benefit most from the DA techniques presented in this study, and is recommended to developers as a design guideline when considering the implementation of these DA techniques.

The results confirm that by modelling real-world scenarios using the data augmentation techniques described in Section II.B.2, it is possible to train more robust deep learning algorithms that perform driver drowsiness detection. With respect to the implementation of driver drowsiness detection systems, the deep learning model development and training would be performed before the model is deployed in the driver drowsiness detection system hardware.
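The trend-line analysis described above can be sketched as an ordinary least-squares fit of the change in a metric against its pre-treatment value, with $R^2$ as the goodness-of-fit measure. The data points below are hypothetical and are not taken from Table VI, Table VII or Table X, and a straight-line trend is assumed because the source does not specify the form of the trend lines in Fig. 8.

```python
def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, with its R^2 value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical pre-treatment accuracy and the change observed after applying DA.
pre_treatment = [0.80, 0.85, 0.88, 0.92, 0.95, 0.98]
change = [0.120, 0.085, 0.060, 0.035, 0.020, 0.005]

slope, intercept, r_squared = linear_fit(pre_treatment, change)

# Expected change for a model with a given pre-treatment score, per the fitted line.
expected_change_at_90 = slope * 0.90 + intercept
```

A negative slope with a high $R^2$ corresponds to the pattern reported above: the higher the pre-treatment score, the smaller the gain that can be expected from DA.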
V. CONCLUSION
Many road accidents are caused by driver drowsiness. Previous studies have considered applying deep learning techniques to detect driver drowsiness and improve road traffic safety. In practically testing their systems, many previous studies have indicated that real-world scenarios such as unfavourable ambient lighting and movement of the driver while driving cause inaccuracies when detecting driver drowsiness.
In this study, the authors focussed on the deep learning algorithms that determine driver drowsiness based on the eye state of the driver. It was hypothesised that by modelling the real-world scenarios and using data augmentation techniques on a standardised image dataset, the performance of the DL models would improve. This study considered two different datasets and six different DL models: two CNN variations (CNN-C1 and CNN-C2), two architectures designed by the VGG (VGG16 and VGG19), a GAN and an MLP.
The performance of the DL models was evaluated primarily using accuracy and F1-Score, although other metrics such as precision, recall, specificity and negative class precision were also considered. In analyzing the results in aggregation, improvements across all metrics were noted. The average improvement in accuracy across all DL models was +6.1% and the average improvement in F1-Score was +6.8%, and the variability in model performance was reduced. However, some challenges were noted when training the VGG models. When trained on low-resolution images, these models exhibited poor performance and distorted the aggregate results. A more realistic indication of the benefits of DA for the DL models studied was obtained by excluding these outliers, yielding an average improvement of +4.3% for both accuracy and F1-Score.
The results further indicated that the extent to which the DA techniques studied improve DL model performance is strongly correlated with the pre-treatment DL model performance. From the analysis conducted, the data augmentation techniques presented are best suited for improving models with accuracy and F1-Scores ≤ 90%, although they are applicable to any DL model for driver drowsiness detection.
It was thus concluded that the use of DA techniques improves the performance of DL models for driver drowsiness detection under the isolated conditions of this study. However, the DL models in this study were tested on images from datasets rather than on data captured by a real-world driver drowsiness detection system, which opens the possibility for future research. Future work should implement the trained DL models proposed in this study in practical driver drowsiness detection systems to validate these results.

FUNDING
All funding in support of this research was provided by the Durban University of Technology.