Deep Speech Recognition System Based on AutoEncoder-GAN for Biometric Access Control

Abstract: Speech recognition-based biometric access control systems are promising solutions that address many issues related to security and convenience. As a biometric modality, speech offers advantages such as user-friendliness and non-intrusiveness. However, developing robust and accurate speaker identification and authentication systems poses challenges due to variations in speech patterns and environmental factors. Integrating deep learning techniques, especially AutoEncoder (AE) and Generative Adversarial Network (GAN) models, has shown promising results in addressing these challenges. This article presents a novel approach that combines these two deep learning models for speech recognition-based biometric access control. In the model architecture, the AutoEncoder takes the MFCC coefficients as input; the encoder maps them to the latent space, and the decoder reconstructs the data. Speech features extracted from the latent space are then fed to the GAN generator to generate additional speech data. The discriminator network has a dual role, serving as both a feature extractor and a classifier: as a feature extractor, it extracts relevant features from the generated samples; as a classifier, it distinguishes between generated samples and authentic samples that come from the AutoEncoder. This strategy outperforms DNN and LSTM models on the VoxCeleb 2, LibriSpeech, and Aishell-1 datasets. The models are trained to minimize the Mean Squared Error (MSE) for both the generator and discriminator, aiming to achieve highly realistic datasets and a robust, interpretable model. The approach addresses challenges in feature extraction, data augmentation, realistic biometric sample generation, handling of data variability, and generalization, thereby providing a comprehensive solution.


I. INTRODUCTION
Speech recognition systems [1] have become increasingly important in various domains, including biometric access control, where the identification and authentication of individuals based on their unique voice characteristics are crucial. Biometric access control systems securely use biological or behavioral characteristics to authenticate and authorize individuals for access to a physical location, a device, or a system. They rely on unique and measurable traits that are specific to an individual, making them difficult to forge or replicate. These traits include physiological characteristics such as fingerprints, facial features, iris patterns, and voiceprints, as well as behavioral characteristics such as typing patterns, gait, and signature dynamics, as shown in Fig. 1. The main function of this paradigm is to collect biometric data, convert it into a digital template, and then compare this template to templates stored in a database. If the comparison results in a match, the individual is given access; otherwise, access is blocked. Fig. 2 presents a system architecture based on speech recognition. Biometric access control systems based on speech recognition offer numerous advantages [2], such as universality, non-intrusiveness, high authentication security, convenience (no memorized passwords or access cards), an audit trail (tracking and accountability), and faster processing than traditional methods. This concept is a powerful and convenient way to improve security and access management in a variety of areas, such as physical facilities, digital systems, and digital transactions. However, to achieve reliable and robust performance, it is essential to develop accurate speaker identification and authentication mechanisms.
Deep learning models are characterized by their ability to learn sophisticated patterns and representations from data, which provides a strong foundation for tackling the complexities of speech recognition [3] and powerful tools for improving the accuracy and effectiveness of its tasks. In this context, the use of AutoEncoder models for feature extraction [4] has shown promising results in identifying and authenticating speakers. The AutoEncoder, a type of neural network architecture, has gained significant attention in recent years due to its capability to learn meaningful and compact representations of input data. AutoEncoders can be leveraged to extract discriminative features from raw speech signals. Trained on a large dataset of labeled speech samples, an AutoEncoder can learn to encode the essential characteristics of a speaker's voice into a lower-dimensional representation, which facilitates efficient and accurate speaker identification and authentication. However, this model faces several challenges, among them:
• Lack of realism in generated samples: Because AutoEncoders concentrate on recreating the input data, they may produce generated samples that are excessively similar to the training data and lack variation.
• Noisy or incomplete reconstructions: AutoEncoders may have trouble accurately reconstructing the input speech signals, mainly when there is noise or fluctuation.
• Limited generalization to novel data: Because AutoEncoders tend to concentrate on recreating well-known patterns from the training set, they may have trouble generalizing to novel or unseen data.
• Inability to discriminate: AutoEncoders are typically unsupervised models that concentrate on feature learning and reconstruction and are not trained to separate classes, whereas the capacity to discriminate is essential for precise authentication in a biometric access control context.
• Limited data augmentation: AutoEncoders can be used for limited data augmentation by reconstructing and producing synthetic samples. The produced samples, however, may not represent the whole range of variability present in the training data.
• Adversarial attacks: AutoEncoders can be vulnerable to adversarial attacks [5], in which tiny, purposeful changes are made to the input data in order to trick the model. In the case of speech recognition systems, this might involve discreetly changing a voice recording to deceive the system into granting access to an unauthorized user.
• Inadequate temporal information: Traditional AutoEncoders struggle with sequential data, which is an issue in speech recognition, where the order of the input (i.e., the sequence of sounds or words) is important. Recurrent or convolutional AutoEncoders, for example, can alleviate this, although they are more sophisticated and computationally intensive.
A Generative Adversarial Network (GAN) is a promising paradigm that consists of two main components: a generator and a discriminator. The generator produces synthetic data, and the discriminator tries to differentiate between real and synthetic data. In the context of biometric access control using speech recognition [6], GANs can be applied to generate synthetic speech data to augment the training dataset [7], which can help address data scarcity, increase the diversity of the training data, and improve the robustness and generalization of the speech recognition system. The key contributions of this work can be outlined as follows:
• Introducing a novel approach rooted in deep learning models, specifically the AutoEncoder and the Generative Adversarial Network, in the context of biometric access control through speech recognition. This integrated model enhances system performance, accuracy, robustness, and efficiency.
• Employing the AutoEncoder (AE) model as an unsupervised method for extracting meaningful and discriminative features, reducing dimensionality, and addressing the storage and computational challenges associated with raw audio data analysis. Additionally, features are extracted from the latent space and used in the Generative Adversarial Network (GAN) to augment the training dataset, enhancing model generalization, mitigating overfitting, and alleviating data scarcity, resulting in high-quality, realistic biometric samples.
• Proposing a GAN model in which the generator network produces synthetic speech data resembling the latent space representation, thereby expanding the training dataset. This approach improves model generalization, reduces overfitting, and addresses data scarcity, leading to the generation of high-quality biometric samples. The discriminator in this proposal serves two roles: feature extraction and classification. The former extracts features from generated samples, capturing more informative and efficient features, while the latter distinguishes between the samples generated by the two models, enhancing overall system performance.
• Applying this approach to diverse datasets, including VoxCeleb 2, Aishell-1, and LibriSpeech, where it yielded favorable results compared with Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) models.
The remainder of this article is organized as follows: Section II reviews related work on speech recognition systems for biometric access control. Section III presents the proposed solution. Section IV presents the results and analysis of the experiments conducted, highlighting the performance gains achieved through the AutoEncoder-GAN-based approach. Section V presents a discussion. Finally, Section VI concludes the article with a summary of the findings and discusses potential directions for future research in this field.

II. RELATED WORK
The literature contains a large body of research on biometric access control based on speech recognition, including speaker identification, authentication, and verification. This section presents an overview of recently published works that achieved significant results.
Najim Dehak et al. [9] proposed two speaker verification systems: the first is based on an SVM using the cosine similarity, and the second uses the cosine similarity directly as the final decision score. The experiments apply three different techniques in the variability space: within-class covariance normalization (WCCN), linear discriminant analysis (LDA), and nuisance attribute projection (NAP). The combination of LDA with WCCN achieved better results than the other configurations. The test was carried out on the NIST 2008 Speaker Recognition Evaluation dataset.
Yen Lei et al. [10] presented a deep learning approach to speaker recognition that uses a phonetically aware DNN. It aims at improving speaker recognition performance by using an i-vector model [11] to represent the speech signal (extracting the main features), with a DNN replacing the UBM-GMM [12] paradigm for training the model. The experiments showed that this approach significantly improves the i-vector speaker recognition system.
The work in [13] proposed the d-vector as an alternative to the i-vector, using the hidden-layer activations of a DNN as features. The d-vector is the average of the activations from the last hidden layer of the DNN. Experiments showed its effectiveness on a small-footprint text-dependent speaker verification task. In general, however, this scheme underperforms the preceding i-vector/DNN approach.
The work in [14] proposed a multitask deep learning scheme based on the j-vector method, which extracts features from a multitask DNN and uses probabilistic linear discriminant analysis (PLDA). This scheme achieved better results than its predecessors (i-vector, d-vector).
The authors in [15] proposed a scheme based on a deep neural network (DNN) that extracts speech features called x-vectors, which are fixed-dimensional embeddings of variable-length utterances. This research also addressed data augmentation by adding noise and reverberation to the existing dataset to improve the model on text-independent speaker tasks. This approach achieved better results than those based on the i-vector and d-vector. The work in [16] proposed an end-to-end neural architecture, based on DNN and LSTM, for text-dependent speaker verification; it maps the utterances directly to a score and jointly optimizes the speaker representation. In the same area, the authors of [17] proposed another approach based on an end-to-end attention model. They use a CNN to extract noise-robust frame-level features, which the attention model then aggregates into utterance-level speaker vectors. This approach proved effective on the Windows 10 "Hey Cortana" task.
In the text-independent context, the work in [18] presented an end-to-end deep learning approach that optimizes a triplet loss function using residual network (ResNet) blocks and measures similarity by the Euclidean distance within trials. The findings show that this approach outperforms conventional i-vector schemes, particularly on short utterances.
In [19], the authors proposed a generalized end-to-end model based on LSTM. Training relies on batches formed from a large number of utterances, which allows the loss function to be optimized efficiently. The experiments show that this scheme achieves good results. N. Le et al. [20] proposed an approach based on a CNN whose main objective is to optimize deep speaker embeddings through an intra-class distance variance regularization loss that promotes compactness. The findings show that this approach accelerates the convergence of training, which enhances the model's performance.
The authors in [21] presented an end-to-end optimized scheme for text-independent tasks, based on a deep convolutional feature extractor combined with self-attention and large-margin loss functions. They use a modular neural network instead of a probabilistic linear discriminant analysis (PLDA) classifier. The experiments, conducted on VoxCeleb and NIST-SRE 2016, showed an improvement over i-vector-based systems. The authors in [22] suggested a novel approach for learning speaker embeddings based on a GAN-like model, in particular its discriminator. This architecture aims to maximize mutual information, improving model performance on the VoxCeleb corpus. Experiments show that this model outperforms both i-vector-based and triplet-loss-based systems.
Many works have been proposed to optimize the performance of speech recognition tasks and provide robust systems using deep learning models. Each focuses on one or more aspects, such as data augmentation, feature extraction, denoising, and de-reverberation. The proposed solution is a new architecture that combines two deep learning models, the AutoEncoder and the Generative Adversarial Network, in a complementary manner to improve model performance by minimizing the loss function. MFCC is used to extract features, the AE model captures a meaningful speech representation, and the GAN generates speech data from the latent space of the AE model.

III. PROPOSED SCHEME
The proposed scheme is based on two models, AE and GAN, as depicted in Fig. 3. First, the speech inputs are collected, and their Mel-Frequency Cepstral Coefficient (MFCC) features are extracted and used for training and tuning the model. The AE model comprises three components: encoder, latent space, and decoder. The encoder captures the main representation of the meaningful speaker speech features extracted from the MFCCs and produces the latent space, which the decoder then uses to reconstruct the input data. In this model, the latent space is also extracted and used as input to the generator of the GAN model to generate more realistic data from it. The generated samples are then used as input to the discriminator, which plays two roles: feature extractor and classifier. The discriminator first extracts features from the generated samples and then classifies between these extracted features and the samples that come from the AutoEncoder, i.e., the decoder, to make a decision. This section presents the model in more detail.

A. Data Preprocessing
Generally, the preprocessing process [23] is crucial for preparing the data. This phase involves capturing the data and splitting it into segments, feature extraction, noise removal, feature normalization, and data loading. Feature extraction is one of the main steps and forms the backbone of the model, particularly in the context of speech-based biometric access control. To this end, the proposed architecture adopts the Mel-Frequency Cepstral Coefficients (MFCC).
1) Mel-Frequency Cepstral Coefficients: MFCC [24] is a technique for extracting features from a signal. In speech processing, it is widely used to capture the spectral features of sound and is well suited to various machine learning and deep learning tasks, including speech recognition and speech analysis. Put simply, the MFCCs are a set of coefficients that represent the shape of the power spectrum of the speech signal. Fig. 4 shows the components of the MFCC computation. As depicted in the figure, computing the coefficients involves several steps. After capturing the speech signal, the first step is to break the signal into frames (windowing) and then apply the Fast Fourier Transform (FFT) to determine the power spectrum of each frame. Mel-scale filter bank processing is then performed on the power spectrum using Eq. 1:

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$

where $\mathrm{mel}(f)$ is the frequency in mels and $f$ is the frequency in Hz. The filter bank outputs are then converted to the log domain, and the Discrete Cosine Transform (DCT) is applied to obtain the MFCC coefficients through Eq. 2:

$$\hat{C}_n = \sum_{k=1}^{K} \left(\log \hat{S}_k\right) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \quad n = 1, 2, \ldots, K \tag{2}$$

where $K$, $\hat{S}_k$, and $\hat{C}_n$ represent, respectively, the number of mel cepstrum coefficients, the filter bank output, and the MFCC coefficients.
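The last two steps of this pipeline can be illustrated with a short sketch. It assumes the commonly used forms of Eq. 1 (2595 log10(1 + f/700)) and Eq. 2 (a cosine transform of the log filter-bank outputs); the function names, the 26 toy filter-bank energies, and the truncation to 13 coefficients (the number used later in Section IV) are illustrative assumptions rather than details confirmed by the text.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Eq. 1 (assumed standard form): frequency in Hz -> mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of Eq. 1, used when placing filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_filterbank(filterbank_energies, num_coeffs=13):
    """Eq. 2 (assumed standard form): cosine transform of log filter-bank outputs.

    filterbank_energies holds the K positive outputs S_k of the mel filter bank
    for one frame; the first num_coeffs coefficients (n = 1..num_coeffs) are kept.
    """
    K = len(filterbank_energies)
    log_s = [math.log(s) for s in filterbank_energies]
    return [
        sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
            for k in range(1, K + 1))
        for n in range(1, num_coeffs + 1)
    ]

# Toy frame: 26 hypothetical filter-bank energies (not real speech data).
energies = [1.0 + 0.1 * k for k in range(26)]
coeffs = mfcc_from_filterbank(energies)
print(len(coeffs), "coefficients for this frame")
```

A flat filter-bank output carries no spectral shape, so all coefficients with n ≥ 1 vanish in that case, which is a quick sanity check on the transform.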

B. AutoEncoder model
The AutoEncoder (AE) model is a type of neural network architecture used for feature learning, dimensionality reduction, and data reconstruction. It is especially effective for extracting relevant representations from biometric data and may be used with a wide range of biometric modalities in applications such as feature learning, data denoising, data compression, anomaly detection, privacy preservation, and biometric template protection. An AutoEncoder's primary principle is to learn a compact and efficient representation of incoming data. It is made up of an encoder and a decoder, as shown in Fig. 5. The encoder takes raw data as input and converts it to a lower-dimensional latent space representation. This latent representation captures the fundamental traits and qualities of the data, and the decoder uses it to attempt to recreate the original input data. The objective is to retain in the latent space the information required for reconstruction. The AE is an unsupervised model that minimizes the loss function between the input data and the reconstructed data, thereby capturing the most relevant representations.
The proposed solution uses an AutoEncoder to capture the relevant representation of the inputs from the MFCC coefficients, optimizing the speech processing system. The role of MFCC is to extract features by converting the input signal into coefficients that are retained as the main relevant features. The AutoEncoder takes these coefficients as input and maps them into the latent space, thereby reducing the dimensionality of the coefficient representation and extracting its most relevant content. The other network, i.e., the decoder, reconstructs the representation from the reduced one (the latent space). The main goal of this proposition is to obtain the most salient and compact representation of the MFCC coefficients in a lower-dimensional space by training the AutoEncoder model.
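The encoder/decoder objective described in this subsection can be written compactly. The notation below is introduced here purely for illustration and is an assumption, not the authors' notation: $x \in \mathbb{R}^d$ is one MFCC vector, $f_\theta$ the encoder, $g_\phi$ the decoder, and $z \in \mathbb{R}^m$ the latent code with $m < d$.

```latex
z = f_\theta(x), \qquad
\hat{x} = g_\phi(z), \qquad
\mathcal{L}_{\mathrm{AE}}(\theta, \phi) = \frac{1}{d} \sum_{j=1}^{d} \left( x_j - \hat{x}_j \right)^2
```

Minimizing $\mathcal{L}_{\mathrm{AE}}$ over a training set forces $z$ to keep the information needed to rebuild $x$, which is why the latent code can serve as a compact speech representation.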

C. GAN model
A Generative Adversarial Network (GAN) is a generative model made up of two distinct networks, each with its own set of attributes, called the Generator and the Discriminator. The first seeks to produce realistic data from a specific class, while the second determines whether the generated data is real or fake, as shown in Fig. 6. GANs are a class of deep learning models used especially to produce synthetic data from raw input data. In the scope of biometric access control, the generator takes random noise as input and attempts to produce biometric data samples that mimic actual biometric data, while the objective of the discriminator, generally a binary classifier, is to distinguish between the generated samples and the real ones. The training procedure is a competition between the generator and the discriminator. As training advances, the generator improves at creating more realistic data, while the discriminator improves at differentiating between real and fake data. This repeated procedure should result in high-quality synthetic data that is difficult to differentiate from genuine data. In the context of biometric access control, GANs may be used for a variety of purposes, including data augmentation, privacy-preserving research, training data generation, data imputation, and adversarial attacks and defense.
To this end, the proposed scheme extracts the latent space from the AutoEncoder model and uses it as input to the Generative Adversarial Network (GAN) model, namely the generator. The generator produces more speech data from those reduced features, yielding data similar to the input. The discriminator in this architecture plays two roles: feature extractor and classifier. It first takes the generated samples as input and extracts relevant features, and then distinguishes them from the samples produced by the trained AE model, specifically the decoder.
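One way to formalize this coupling is with the standard binary cross-entropy GAN losses, since Section IV-B states that BCE is used for the GAN. The exact objective is not spelled out in the text, so the form below (including the non-saturating generator loss) is an assumption; $z$ denotes a latent code taken from the AutoEncoder, $\hat{x}$ the corresponding decoder output treated as the "real" sample, $G$ the generator, and $D$ the discriminator.

```latex
\mathcal{L}_D = -\,\mathbb{E}_{\hat{x}}\left[ \log D(\hat{x}) \right]
                - \mathbb{E}_{z}\left[ \log\left( 1 - D(G(z)) \right) \right],
\qquad
\mathcal{L}_G = -\,\mathbb{E}_{z}\left[ \log D(G(z)) \right]
```

The discriminator lowers $\mathcal{L}_D$ by scoring decoder outputs near 1 and generated samples near 0, while the generator lowers $\mathcal{L}_G$ by producing samples that the discriminator scores near 1.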

IV. EXPERIMENTS AND RESULTS ANALYSIS
This section presents the experiments carried out by the laboratory team. It describes the datasets, the metrics, and the implementation details of the proposed model, and then analyzes the results.

A. Datasets
Three different datasets have been used in this work: VoxCeleb 2, Aishell-1, and LibriSpeech.

VoxCeleb: This is an open-source dataset that is widely used in experiments on speech processing tasks [25]. It contains interview videos uploaded to YouTube. There are two VoxCeleb datasets, VoxCeleb 1 and VoxCeleb 2. The first has over 100,000 utterances from celebrities, whereas VoxCeleb 2 has over a million utterances. In the proposed solution, the experiments were conducted on VoxCeleb 2.
Aishell-1: This dataset is also used in speech processing tasks [26]. It is an open-source, freely accessible speech dataset that contains Mandarin speech captured with a high-fidelity microphone (44.1 kHz, 16-bit). The Aishell-1 dataset was created by downsampling the audio collected by the high-fidelity microphone to 16 kHz. A set of 400 speakers from various accent areas in China took part in the recording.
LibriSpeech: This is an open-source corpus, available in [27]. It contains 1000 hours of speech sampled at 16 kHz and is derived from audiobooks of the LibriVox project. This dataset is used mainly in speech processing, including speech recognition and speaker identification. Table I gives the specifications of the datasets used.

B. Evaluation Metrics
Generally, a metric is a method used to evaluate a system's performance on a specific task. Its main objective is to measure the quality of the classifications or predictions made by a system or model. A loss or error function [28] determines how much the output or predicted value departs from the actual value, and it is the quantity that is optimized (minimized or maximized) when training the model. The Mean Squared Error (MSE) [29] is a loss function that measures the error between the observed and predicted values. The average of the squared errors is calculated by Eq. 3:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{3}$$

where $y_i$ represents the observed value, $\hat{y}_i$ represents the predicted value, and $n$ is the number of observations. In this study, the MSE metric is used to evaluate the performance of the AutoEncoder model, and the Binary Cross Entropy (BCE) metric, which represents the difference between the predicted probability distribution and the real one, is used in the GAN model. On the one hand, BCE is suited to binary classification problems and is therefore used to evaluate the model's performance. On the other hand, it is used to quantify the training loss, which the model minimizes during training.
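Both metrics are simple to compute by hand, which can help when checking reported loss values. The sketch below implements Eq. 3 and the usual definition of binary cross-entropy in plain Python; the function names, the clipping constant `eps`, and the toy inputs are illustrative assumptions.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error (Eq. 3): average of squared differences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def bce(y_true, p_pred, eps=1e-12):
    """Binary Cross Entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Toy values (not taken from the experiments).
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # one error of 2 over 3 samples
print(bce([1, 0], [0.5, 0.5]))                # an undecided classifier
```

An undecided discriminator that outputs 0.5 for every sample has a BCE of ln 2, which is a useful reference point when reading GAN training curves.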

C. Implementation Configuration and Results Analysis
In the proposed scheme, the PyTorch library for the Python programming language has been used for training the deep learning networks. Training was run on a Graphics Processing Unit (GPU) because of its efficiency in neural network processing. After the MFCC coefficients (13 coefficients) are extracted from the speech, they are fed to the AE model, namely the encoder, which converts the inputs to the latent space, thereby reducing the dimensionality and capturing essential speech features. These high-level features serve to reconstruct the speech data from the bottleneck layer, with the aim of generating outputs that closely resemble the original input. The model training aims at minimizing the reconstruction loss between the original and reconstructed speech. For training the AutoEncoder, the chosen specifications are a latent dimension of 8, a batch size of 64, and a learning rate of 0.001. Within the first part of the proposed architecture, the AE model is implemented with input dimensions set to 8. The encoder network consists of 128 units (neurons) in the hidden layer and employs the Rectified Linear Unit (ReLU) as the activation function. ReLU introduces non-linearity, which helps the model learn complex relationships within the data. The learning rate and the network size were selected by trial and error, comparing different settings and keeping the best-performing configuration.
At the beginning of the training process, the weights are initialized at random and then gradually updated. To address overfitting, several methods are used: regularization of the parameters to promote lower weight values, dropout layers within the encoder and the decoder, and regularization of the loss function to promote certain desired behaviors in the latent space. The encoder maps the 128-dimensional hidden representation to the latent space representation. The Adam optimizer is used, and the loss function is the Mean Squared Error (MSE), as mentioned before.
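The AutoEncoder configuration described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a 13-dimensional MFCC input with an 8-dimensional latent space (the text gives both 13 coefficients and an input dimension of 8), a symmetric decoder, a dropout probability of 0.2, and weight decay of 1e-5 as the parameter regularization. The class name `SpeechAutoEncoder` and the random batch are placeholders.

```python
import torch
from torch import nn

class SpeechAutoEncoder(nn.Module):
    """MFCC -> 128 hidden units (ReLU) -> 8-d latent -> 128 -> MFCC."""

    def __init__(self, n_mfcc=13, hidden=128, latent=8, p_drop=0.2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mfcc, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_mfcc),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent code, later fed to the generator
        return self.decoder(z), z    # reconstruction and latent code

torch.manual_seed(0)
model = SpeechAutoEncoder()
# Adam with learning rate 0.001 as stated; weight_decay is an assumed value.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.MSELoss()

# Placeholder batch of 64 MFCC frames (random numbers, not real speech).
batch = torch.randn(64, 13)

# One training step: minimize the reconstruction error.
model.train()
optimizer.zero_grad()
recon, z = model(batch)
loss = criterion(recon, batch)
loss.backward()
optimizer.step()
print(f"reconstruction MSE after one step: {loss.item():.4f}")
```

In a full training loop this step would be repeated over mini-batches of real MFCC frames, with the validation loss monitored to detect overfitting.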
In this proposition, the latent space features, which represent the high-level speech representation, are extracted and fed into the GAN model, namely the generator network. The generator takes these high-level representations (the most relevant speech features) as input to generate more speech data that resembles real speech. The generator is composed of three fully connected linear layers with ReLU activation functions between them. The Tanh activation function is applied in the final layer to ensure that the generated values are bounded within the range [-1, 1]. The other GAN network, i.e., the discriminator, plays two roles in this architecture: feature extractor and classifier. The discriminator first takes the generated samples from the generator and extracts the relevant representations, and then feeds them to the classifier to distinguish them from those that come from the decoder of the AE model. The structure of the discriminator is similar to that of the generator. It consists of three fully connected layers with ReLU activation functions. In the final layer, the sigmoid activation function is applied, which produces values within the range [0, 1], where 1 identifies real data and 0 identifies fake data. Both the generator and the discriminator are trained adversarially, i.e., competing against each other. This process helps refine the generator's ability to generate higher-quality speech data and therefore yields a robust system based on the combination of two promising deep learning models, the AutoEncoder and the Generative Adversarial Network, especially for speech recognition tasks.
The AE-GAN's ability to leverage the latent space features extracted by the AutoEncoder to enhance the generative capabilities of the GAN is a potential advantage. In scenarios with limited labeled data, the AE-GAN's capacity for generating high-quality and realistic speech samples may prove advantageous. Additionally, its ability to address overfitting through regularization techniques and dropout layers may contribute to superior performance in diverse speech recognition tasks. Although the proposed model offers several advantages, it may face challenges where there is insufficient diversity in the training data, potentially leading to biased representations. If the dataset lacks sufficient variation in terms of speakers, accents, or speech characteristics, the model may struggle to generalize to a broader range of real-world scenarios. Augmenting the dataset with more diverse samples could enhance the model's robustness. Additionally, the model's performance may be sensitive to hyperparameter settings, necessitating careful tuning. Implementing automated hyperparameter tuning methods or conducting a thorough sensitivity analysis may help identify robust configurations more efficiently. The computational complexity of the model, especially when training on large-scale datasets, could pose limitations in terms of time and resource requirements. The computational costs of training deep learning models, including the proposed AE-GAN model, are a significant consideration.
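The generator and discriminator described in this subsection can be sketched in PyTorch as below, together with one adversarial update. This is a hedged illustration rather than the authors' code: the hidden width of 128, the 13-dimensional output, the learning rate of 0.001 for both optimizers, and the random stand-ins for latent codes and decoder outputs are all assumptions. The discriminator is split into a `features` part and a `classifier` part to mirror its two roles.

```python
import torch
from torch import nn

LATENT, N_MFCC, HIDDEN = 8, 13, 128  # HIDDEN and N_MFCC are assumed sizes

# Generator: three fully connected layers, ReLU between them, Tanh at the end.
generator = nn.Sequential(
    nn.Linear(LATENT, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, N_MFCC), nn.Tanh(),
)

class Discriminator(nn.Module):
    """Three fully connected layers: feature extractor, then sigmoid classifier."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(N_MFCC, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.classifier = nn.Sequential(nn.Linear(HIDDEN, 1), nn.Sigmoid())

    def forward(self, x):
        return self.classifier(self.features(x))

torch.manual_seed(0)
disc = Discriminator()
bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

# Random stand-ins: latent codes from the encoder and decoder reconstructions.
z = torch.randn(64, LATENT)
real = torch.tanh(torch.randn(64, N_MFCC))
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

# Discriminator step: decoder outputs are labelled 1, generated samples 0.
opt_d.zero_grad()
fake = generator(z)
d_loss = bce(disc(real), ones) + bce(disc(fake.detach()), zeros)
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator output 1 for generated samples.
opt_g.zero_grad()
g_loss = bce(disc(fake), ones)
g_loss.backward()
opt_g.step()
print(f"d_loss={d_loss.item():.4f}  g_loss={g_loss.item():.4f}")
```

Because the generator ends in Tanh, the "real" samples need to be scaled to [-1, 1] as well, which is why the stand-in decoder outputs are squashed with `tanh` here.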

VI. CONCLUSION AND FUTURE WORK
This paper has proposed a new approach based on speech recognition for speaker identification and authentication, which is the main and most crucial task in speech-based biometric access control. The model, based on the combination of AE and GAN models, has proved efficient and robust. It provides an optimized platform that integrates feature learning and tackles data augmentation and generalization issues, especially for speech datasets, as well as data imputation, such as reconstructing or denoising degraded audio, and hyperparameter tuning of the models. This approach has been implemented on three different datasets, VoxCeleb 2, LibriSpeech, and Aishell-1, and has achieved good results in terms of performance compared to AE and GAN models. However, the proposed scheme is time-consuming, especially in the training phase, where two models, AE and GAN, must be trained. In

Fig. 7 represents the AE-GAN model training process on three different datasets, plotting the loss versus training epochs to illustrate how well the model learns. The experiments use utterances from three datasets: VoxCeleb 2, LibriSpeech, and Aishell-1. Each dataset is divided into three parts: 80% for training, 10% for validation, and 10% for testing. The proposed deep AE-GAN model was assessed against state-of-the-art models, namely the Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) models, using the datasets mentioned above.
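The 80/10/10 partition can be reproduced with a small helper such as the one below. The shuffling, the fixed seed, the function name `split_dataset`, and the utterance identifiers are illustrative assumptions; the text does not say how utterances were assigned to each part (for example, whether speakers are kept disjoint across splits).

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle items and split them into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a repeatable split
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])    # remainder goes to the test set

# Hypothetical utterance identifiers standing in for one dataset.
utterances = [f"utt_{i:04d}" for i in range(1000)]
train_set, val_set, test_set = split_dataset(utterances)
print(len(train_set), len(val_set), len(test_set))
```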

Fig. 8 depicts the results of the experiments carried out over the datasets using the DNN and LSTM models.

V. DISCUSSION
Deep Neural Networks (DNNs) and Long Short-Term Memory networks (LSTMs) are reference models in the context of speech recognition-based biometric access control and have been widely used in many studies. DNNs have demonstrated their performance in learning hierarchical representations from raw audio data. Their ability to handle complex features with

TABLE I. DATASETS SPECIFICATION

Table II lists the overall loss of the models during the training and validation phases. As shown in the table, the AE-GAN model achieves low training and validation losses on the three datasets, LibriSpeech, VoxCeleb 2, and Aishell-1: training losses of 0.0574, 0.0876, and 0.0886, and validation losses of 0.0581, 0.0888, and 0.0889, respectively. In comparison, the DNN and LSTM models obtain training losses ranging from 0.07 to 0.168 but much larger validation losses, ranging between 0.30 and 0.48, which generally indicates overfitting. The proposed scheme thus outperforms the DNN and LSTM models on all three datasets.
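The overfitting argument can be checked with a few lines of arithmetic on the figures quoted in this paragraph. The sketch below uses only numbers stated in the text; because the DNN and LSTM losses are given as ranges, only a lower bound on their training/validation gap can be derived. The variable names are illustrative.

```python
# (training loss, validation loss) for AE-GAN, as quoted in the text.
ae_gan_loss = {
    "LibriSpeech": (0.0574, 0.0581),
    "VoxCeleb 2": (0.0876, 0.0888),
    "Aishell-1": (0.0886, 0.0889),
}
ae_gan_gap = {name: round(val - train, 4)
              for name, (train, val) in ae_gan_loss.items()}

# DNN/LSTM: the text only gives ranges, so take the most favourable case
# (largest training loss, smallest validation loss) as a lower bound.
baseline_train_max = 0.168
baseline_val_min = 0.30
baseline_gap_lower_bound = round(baseline_val_min - baseline_train_max, 4)

for name, gap in ae_gan_gap.items():
    print(f"{name}: validation - training = {gap:.4f}")
print(f"DNN/LSTM gap is at least {baseline_gap_lower_bound:.3f}")
```

Even in the most favourable case for the baselines, their gap is roughly two orders of magnitude larger than the largest AE-GAN gap, which is consistent with the overfitting interpretation.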

TABLE II. AVERAGE LOSS PER EPOCH FOR TRAINING AND VALIDATION