Voice Biometrics for Indonesian Language Users using Algorithm of Deep Learning CNN Residual and Hybrid of DWT-MFCC Extraction Features

Abstract: This research develops a voice biometrics model for Indonesian language users using a CNN Residual deep learning algorithm and hybrid DWT-MFCC feature extraction. A voice dataset of Indonesian speakers was created with durations of 5, 10, 15, 20, and 25 minutes. Speaker recognition and speech recognition were tested by comparing the CNN Residual model with CNN Standard. In speaker recognition, the CNN Residual model obtained the best results, with the highest precision of 99.91% and the highest accuracy of 99.47% for 25-minute voice samples, compared with a precision of 96.83% and an accuracy of 99.00% for CNN Standard. In speech recognition, the CNN Residual model reached 100% accuracy over 20 trials, while CNN Standard reached only 95%. The CNN Residual model provides better accuracy and precision, but it is slightly slower than CNN Standard, with a time difference of 0.03 to 1.28 seconds.


I. INTRODUCTION
The crime of fraud and identity theft has become a crucial threat in cybercrime. It can be associated with the excessive use of the Internet for miscellaneous activities, including online transactions, social networking, and the storage of personal information. To minimize these problems, a biometric identification method was developed, especially for high-level security entry and privacy of sensitive data access in banking transactions [1][2][3].
The biometric-based personal identification method is one of the alternatives developed especially for high-level security entry, such as government or military buildings, access to sensitive data or information, and theft prevention. Voice biometrics is a biometric technology that uses the biological characteristics of the human voice to identify and authenticate the unique patterns of each individual [4][5][6]. Voice biometrics includes voice commands, which allow devices such as smartphones, computers, or laptops to receive what the user has spoken and translate it into electronic commands. The communication of voice commands between the user and the device is also known as human-machine interaction [7][8][9]. Implementing voice biometrics technology is a solution for maintaining the privacy and security of individual identity data and for avoiding fraud.
Voice biometrics is perceived as providing a more secure and more reliable identification and authentication process. In principle, authentication can be conducted remotely using common devices such as smartphones and laptops, and the cost of implementing voice biometrics is lower than that of other biometric solutions because it does not require special devices such as fingerprint readers or retina scanners. It is also considered more secure, easy to operate, and accurate in identifying a person [10][11][12][13].
Currently, voice recognition research is being developed using CNN deep learning models. The Convolutional Neural Network (CNN) is a neural network algorithm that can help solve problems involving large amounts of data and high data complexity in the object identification process [14][15][16]. However, most previous research discusses the performance of speaker recognition [17][18][19] and speech recognition [20][21][22] separately. Meanwhile, most papers on voice biometrics still use machine learning methods, which have the disadvantage of being unable to process large amounts of data or to handle the complexity of large data in the voice biometrics identification process.
In this paper, a deep learning voice biometrics model is developed using CNN Residual and hybrid DWT-MFCC feature extraction. CNN Residual is used to simplify the training and validation process and to improve classification accuracy [23,24]. The hybrid feature extraction combines the Discrete Wavelet Transform (DWT) [25,26] with Mel-Frequency Cepstral Coefficients (MFCC) to eliminate noise interference, recognize the shape of the voice pattern from a person's characteristics, and select the required voices [27,28].
The test was carried out on the two security processes that apply to voice biometrics [29,30]: a speaker recognition security system that detects "Whose voice is the person speaking?" [31,32], and a speech recognition security system that detects "What keywords are spoken?" [33,34]. If both security checks succeed, the system returns "Accepted"; otherwise, it returns "Rejected". Testing also measures the processing time required to carry out the voice biometrics process.
www.ijacsa.thesai.org
To test the model, this paper compares the performance of the proposed model with CNN Standard. The comparison is essential for showing how far the CNN Residual model improves the performance of voice biometrics, especially its accuracy, which leads to a better security system. The remainder of the paper is organized as follows. Section II presents the underlying theories, covering voice biometrics, its relevant studies, and the theory of DWT and MFCC. Section III presents the architecture of the deep learning model, Section IV presents the results and analysis, and Section V concludes this paper.

II. UNDERLYING THEORIES
A. Voice Biometrics
As shown in Fig. 1, the voice biometrics system consists of two processes, namely user enrollment and user verification/authentication [35][36][37][38]. User enrollment is the process of identifying the user's voice in order to register the user's voice data in the database. The enrollment process begins with capturing, in which the user's voice is captured by a microphone acting as a voice sensor. The user's voice input contains the speaker's voice and the speech content. Preprocessing converts the analog input voice signal into a digital one. Template creation is the user identification step that registers the identity of the user's voice, which is a unique individual characteristic. It registers the speaker's voice (speaker recognition) and the speech content (speech recognition), both of which are stored in the database [36]. The workflow of the user enrollment and verification process is shown in the first line of Fig. 1.
The second process, user verification or authentication, verifies the user's voice by matching the incoming voice against the voice that has been registered in the database. This process includes a template matching step, which matches the incoming voice data against the voice sample template previously registered and stored in the database. The output is the voice biometrics authentication result, which validates the user's voice data (accepted/rejected).
As shown in Fig. 1, the core of the voice biometrics security system in the context of verification works in two phases [29,30]: speaker recognition to detect "Whose voice is the person speaking?" [31,32], and speech recognition to detect "What keywords are spoken?" [33,34]. If recognition succeeds in both phases, the user is verified; otherwise, the user is rejected. However, most previous studies discussed the speaker recognition and speech recognition phases separately.
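The two-phase decision described above can be sketched in a few lines. This is only an illustration of the accept/reject rule, not the authors' implementation: the function name `verify`, its parameters, and the string comparison of keywords are hypothetical, and the default keyword "open access" is borrowed from the test phrase used later in the paper.

```python
def verify(claimed_user, predicted_speaker, recognized_keyword,
           expected_keyword="open access"):
    """Return "Accepted" only when both recognition phases succeed.

    All names here are illustrative assumptions; in a real system the two
    predictions would come from the trained speaker and speech models.
    """
    # Phase 1: speaker recognition ("Whose voice is the person speaking?")
    speaker_ok = predicted_speaker == claimed_user
    # Phase 2: speech recognition ("What keywords are spoken?")
    speech_ok = recognized_keyword.strip().lower() == expected_keyword
    return "Accepted" if (speaker_ok and speech_ok) else "Rejected"


print(verify("VB3", "VB3", "Open Access"))
print(verify("VB3", "VB4", "Open Access"))
```

Requiring both checks means that a correct keyword spoken by the wrong person, or the right person speaking the wrong keyword, is rejected.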

B. Relevant Studies
The relevant studies on speaker recognition, speech recognition, and their combination to build voice biometrics are shown in Table I. Several previous papers have discussed voice biometrics using machine learning methods, including k-Nearest Neighbors (k-NN) [35], SVM with MFCC feature extraction [36], and GMM with MFCC feature extraction [37]. However, papers that discuss voice biometrics using deep learning algorithms are rare.
Among deep learning methods, ANN and DNN are considered less reliable than RNN and CNN [48,49]. Although RNN is a sequential data modeling unit, it offers less feature compatibility than CNN. RNN also suffers from the vanishing gradient problem. To avoid this problem, RNN is combined with LSTM, but LSTM has the disadvantage of requiring more memory to train [46][50][51][52]. CNN is therefore considered more reliable than ANN and RNN. The weakness of the machine learning methods GMM, GMM-UBM, GMM-HMM, and HMM is that they can process only smaller amounts of data and are less able to handle complex data [53,54]. By contrast, the advantages of CNN are reliable computing capability, high accuracy, the ability to process large training data, the ability to detect important features automatically without human supervision, and the ability to classify complex data in the voice identification process [17][18][19][20][21][22].
From the feature extraction point of view, studies in [55,56] discussed the performance of speaker recognition, while studies in [40][41][42][43] discussed the performance of speech recognition, using the i-vector, PNCC, RASTA-PLP, and MFCC extraction features. They achieved an accuracy of about 76%. Research on speaker recognition [17][18][19][39] and speech recognition [20-22, 44-47, 57] using deep learning models with i-vector and MFCC extraction features obtained an accuracy of around 71-90%.

C. Deep Learning CNN Standard
CNN Standard is a high-performance deep learning algorithm that has been used for database training and testing. It is expected to perform better than the machine learning algorithms used previously.
The architecture of CNN Standard is shown in Fig. 2. In CNN Standard, the input layer processes the data as a set of features. Each convolution layer uses a 3x3 filter (kernel), with 16, 32, 64, and 128 filters in the successive layers. The convolution layers carry most of the network's computational load. They extract spatial features from the data efficiently, which reduces the number of parameters needed for feature extraction and ultimately shortens training time. Each convolution layer is followed by Batch Normalization and ReLU. Batch normalization is a normalization technique applied between the layers of the neural network, which standardizes the input to a layer for each mini-batch rather than for the full dataset. It speeds up the training process and allows a higher learning rate, making learning easier. Batch normalization also addresses the problem known as internal covariate shift.
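To make the normalization step concrete, the following minimal sketch standardizes one feature over a mini-batch and then applies ReLU. It is an illustration under stated assumptions rather than the paper's code: the scale `gamma`, shift `beta`, and stabilizer `eps` are the usual batch-normalization parameters with assumed default values, and the function names are hypothetical.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature across a mini-batch, then scale and shift.

    gamma, beta and eps are assumed defaults, not values from the paper.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def relu(values):
    """Rectified Linear Unit: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]


mini_batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(mini_batch)   # zero mean, roughly unit variance
activated = relu(normalized)          # negative half is set to 0.0
```

Because the statistics are computed per mini-batch, the same input value can be normalized differently depending on which other samples share its batch.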
ReLU (Rectified Linear Unit) is a unit that implements the activation function of a network layer. ReLU helps prevent exponential growth in the computation required to operate neural networks. The adaptive average pooling layer is an average pooling layer that takes the input and output dimensions and calculates the kernel size required to produce an output of the given dimensions from the given input. The flatten layer takes the feature maps generated in the pooling step and converts them into one long one-dimensional feature vector, which is passed to the next layer. This vector is connected to the final classification model, called the fully connected layer. A fully connected layer is a layer in which all inputs from one layer are connected to each activation unit of the next layer. The last few layers are fully connected layers that combine the data extracted by the previous layers to form the final result. The fully connected layer processes data that has passed through Batch Normalization and ReLU.
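The pooling and flattening steps can be sketched in one dimension as follows. This is an illustrative assumption, not the paper's implementation: the bin rule (start at floor(i*n/out), end at ceil((i+1)*n/out)) is one common way to derive the window from the input and output sizes, and the function names are hypothetical.

```python
import math


def adaptive_avg_pool1d(values, output_size):
    """Average-pool a sequence down to exactly output_size values.

    The window for each output is derived from the input and output
    lengths, so no kernel size has to be supplied by the caller.
    """
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size
        end = math.ceil((i + 1) * n / output_size)
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


def flatten(feature_maps):
    """Concatenate several pooled feature maps into one long vector."""
    return [v for fmap in feature_maps for v in fmap]


pooled_a = adaptive_avg_pool1d([1, 2, 3, 4, 5, 6], 3)
pooled_b = adaptive_avg_pool1d([1, 2, 3, 4, 5], 2)
vector = flatten([pooled_a, pooled_b])  # input to a fully connected layer
```

When the input length is not a multiple of the output length, neighbouring windows overlap slightly, which is why the second example still yields exactly two values.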

D. MFCC Method for Extraction
As shown in Fig. 1, feature extraction plays an important role in achieving good accuracy. Mel Frequency Cepstral Coefficients (MFCC) is reported to give the highest speech recognition accuracy and the fastest feature extraction time compared with other voice feature extraction methods [58], which makes it well suited to feature extraction for speech recognition in voice biometrics. MFCC is one of the most frequently used feature extraction methods in voice processing, including voice biometrics, speaker recognition, and speech recognition, because it is considered very good at representing the characteristics of a signal. MFCC is used to recognize the shape of the voice pattern from a person's characteristics and to select only the voices that are needed. Feature extraction with MFCC is based on the discrete Fourier transform. The Fourier transform can determine which frequencies appear in a signal, but not when they appear. The sequence of the MFCC process is shown in the block diagram of Fig. 3 [59].
The following is the sequencing process for the MFCC block diagram:

1) Pre-emphasis:
Pre-emphasis is a filtering process that compensates for the high-frequency portion of the voice signal that is suppressed during the voice production mechanism. The pre-emphasis process follows Equation (1) [28,59]:
$$y(n) = s(n) - \alpha\, s(n-1) \tag{1}$$
where y(n) is the signal resulting from the pre-emphasis process, n is the sample index of the voice signal, s(n) is the voice signal before the pre-emphasis process, and α is the constant of the pre-emphasis filter, with a value of 0.9 ≤ α ≤ 1.0.
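As a minimal sketch of the pre-emphasis filter, the function below applies y(n) = s(n) - α s(n-1) to a list of samples. The value α = 0.97 is an assumed choice inside the stated 0.9 to 1.0 range, and passing the first sample through unchanged is also an assumption, since the text does not say how the boundary is handled.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y(n) = s(n) - alpha * s(n-1).

    alpha = 0.97 is an assumed value within the 0.9 to 1.0 range.
    The first sample has no predecessor and is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


samples = [1.0, 1.0, 1.0, 2.0]
emphasized = pre_emphasis(samples)
```

A constant stretch of the signal is almost cancelled, while the jump at the last sample is preserved, which is the high-frequency boost the filter is meant to provide.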

2) Framing and windowing:
In the framing process, the speech signal is analyzed in the form of frames. The signal is divided into several short pieces to facilitate the calculation and analysis of the voice signal. Each frame spans an interval of 20-40 ms and a new frame starts every 10 ms, so each frame overlaps the previous and the next frame [60]. Windowing is used to avoid discontinuities at the frame boundaries. The most widely used window type is the Hamming window [28,59,61].
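The framing and windowing step can be sketched as below. The 25 ms frame length is an assumed value inside the 20-40 ms range, the 10 ms step and the 16,000 Hz sample rate follow the text, and the function names are hypothetical.

```python
import math


def hamming(length):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Split a signal into overlapping frames and window each frame.

    frame_ms = 25 is an assumption within the 20-40 ms range.
    """
    frame_len = sample_rate * frame_ms // 1000   # 400 samples
    step = sample_rate * step_ms // 1000         # 160 samples
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


one_second = [1.0] * 16000
frames = frame_signal(one_second)
```

With these settings a one-second clip yields 98 frames of 400 samples each, and consecutive frames share 240 samples.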

3) Fast Fourier Transform (FFT):
The Fourier transform converts the digital voice signal from the time domain into the frequency domain. FFT is a fast algorithm for computing the Fourier transform in the discrete domain. The FFT process produces the frequency-domain representation of the signal in discrete form [28,59].
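For illustration, a textbook radix-2 FFT and the resulting power spectrum of one frame can be written as follows. This is a sketch under assumptions: it requires a power-of-two frame length, the 1/N power scaling is one common convention, and a production system would normally call an optimized library routine instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Power of the non-negative frequency bins, scaled by 1/N (assumed)."""
    spectrum = fft(frame)
    half = len(frame) // 2 + 1
    return [abs(c) ** 2 / len(frame) for c in spectrum[:half]]


n = 64
tone = [math.cos(2 * math.pi * 4 * i / n) for i in range(n)]
power = power_spectrum(tone)
peak_bin = max(range(len(power)), key=power.__getitem__)
```

A cosine with four cycles per frame concentrates its energy in bin 4, which shows how the transform reveals which frequencies are present but not when they occur within the frame.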

4) Mel filterbank:
A filterbank is used to determine the energy in a given frequency band of a voice signal. The filterbank consists of overlapping bandpass filters. Mel is a unit of measure based on the frequency perceived by the human ear. The Mel scale is approximately linear below 1 kHz and logarithmic above it. The Mel scaling process follows Equation (2) [27,59,60]:
$$mel = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2}$$
where mel is the output of the Mel filterbank and f is the input of the filterbank. The values 2595 and 700 are fixed constants that have been widely used with the MFCC method in many studies. In the Mel-spaced filterbank shown in Fig. 4, the filter bandwidth below 1 kHz is linear, while above 1 kHz it is logarithmic [59].
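Equation (2) and its inverse can be checked numerically with the short sketch below. The number of filters (26) and the upper edge of 8,000 Hz (half of a 16,000 Hz sample rate) are assumptions for illustration, as are the function names.

```python
import math


def hz_to_mel(f_hz):
    """Equation (2): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of Equation (2)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the Mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# 26 filters up to 8 kHz are assumed values, not taken from the paper.
centers = mel_center_frequencies(num_filters=26, low_hz=0.0, high_hz=8000.0)
```

The centers are evenly spaced in Mel but grow further apart in Hz as frequency increases, which mirrors the coarser frequency resolution of human hearing at high frequencies.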

5) Discrete Cosine Transform (DCT):
DCT is used to calculate the MFCC of a single frame. DCT converts the log Mel spectrum into the Mel cepstrum to improve recognition quality. The DCT process follows Equation (3) [59]:
$$C_m = \sum_{k=1}^{K} \log\left(Y[k]\right) \cos\left[m\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right] \tag{3}$$
where C_m is the m-th coefficient, Y[k] is the output of the filterbank process at index k, m = 1, 2, ..., K is the coefficient index, and K is the expected number of coefficients.
The delta coefficients are calculated following Equation (4):
$$d_t = \frac{\sum_{n=1}^{N} n\left(c_{t+n} - c_{t-n}\right)}{2\sum_{n=1}^{N} n^{2}} \tag{4}$$
where d_t is the delta coefficient of frame t, computed from the static coefficients c_{t-N} to c_{t+N}. In general, the value of N is 2. The number of delta coefficients is the same as the number of MFCCs, namely 13. The MFCCs together with the delta coefficients therefore give a feature dimension of 26 [28,59,61].
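The sketch below shows how 13 cepstral coefficients and 13 delta coefficients could be combined into the 26-dimensional feature vector mentioned above. Several details are assumptions rather than statements from the paper: the cosine sum runs over 26 filterbank outputs, only the first 13 coefficients are kept, N = 2 is used for the deltas, and frames beyond the edges are handled by repeating the first or last frame. The input energies are synthetic.

```python
import math


def dct_cepstrum(filterbank_energies, num_coeffs=13):
    """Cosine transform of the log filterbank energies of one frame."""
    K = len(filterbank_energies)
    log_e = [math.log(e) for e in filterbank_energies]
    # k is 0-based here, so (k + 0.5) plays the role of (k - 1/2).
    return [
        sum(log_e[k] * math.cos(m * (k + 0.5) * math.pi / K) for k in range(K))
        for m in range(1, num_coeffs + 1)
    ]


def delta(frames, N=2):
    """Delta coefficients; edge frames are repeated (an assumed padding)."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for i in range(len(frames[t])):
            num = sum(
                n * (frames[min(t + n, last)][i] - frames[max(t - n, 0)][i])
                for n in range(1, N + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


# Synthetic, strictly positive filterbank energies: 6 frames x 26 filters.
energies = [[1.0 + t + k for k in range(26)] for t in range(6)]
mfcc = [dct_cepstrum(e) for e in energies]
features = [c + d for c, d in zip(mfcc, delta(mfcc))]  # 13 + 13 values
```

The delta part describes how each coefficient changes across neighbouring frames, so the combined vector carries both the spectral shape and its short-term dynamics.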

E. DWT Method for Hybrid Extraction MFCC
The MFCC method has a drawback: as a feature extraction method for voice signals, it is sensitive to noise [61]. Several previous studies indicate that the performance of MFCC still needs to be improved. To improve the performance of MFCC in a voice biometrics identification system, a method that can eliminate noise frequencies is needed, which motivates a hybrid method. Voice biometrics with hybrid DWT-MFCC feature extraction can eliminate noise interference, recognize the shape of the voice pattern from a person's characteristics, and select only the voices that are needed. The hybrid DWT-MFCC feature extraction method is expected to form reliable features and to produce a higher level of accuracy than before [62,63].
Based on previous research, the Discrete Wavelet Transform (DWT) is a good method for eliminating noise (denoising) in signal processing, so that the voice quality in voice biometrics is better. Wavelet signal processing is suitable for nonstationary signals, whose spectral content changes over time. Each wavelet transform measurement taken with a fixed parameter provides information about the temporal range of the signal and about the frequency spectrum of the signal. The wavelet transform provides a multiresolution approach to signal analysis, and this technique has been used to identify voice signal features. The wavelet transform is the integral of the raw signal x(t) multiplied by scaled and shifted versions of the basic wavelet function ψ(t). The continuous wavelet transform (CWT) is calculated in Equation (5) as follows [26,63,64]:
$$CWT(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi\left(\frac{t-b}{a}\right) dt \tag{5}$$
where a is the scaling parameter and b is the time localization parameter. DWT is often more efficient than CWT because it avoids computing the transform at every CWT scale.
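As a minimal illustration of wavelet denoising, the sketch below performs a single-level Haar DWT, shrinks the detail coefficients, and reconstructs the signal. The paper does not state which wavelet family, decomposition level, or threshold rule it uses, so the Haar wavelet, the single level, and soft thresholding are all assumptions, and the function names are hypothetical.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT of an even-length signal."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def soft_threshold(coeffs, threshold):
    """Shrink coefficients toward zero; small ones become zero."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def denoise(signal, threshold):
    """Suppress small detail coefficients, keep the approximation."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, soft_threshold(detail, threshold))


clean = denoise([1.0, 2.0, 3.0, 4.0], threshold=10.0)
same = denoise([1.0, 2.0, 3.0, 4.0], threshold=0.0)
```

With a zero threshold the signal is reconstructed unchanged, while a large threshold removes all detail and leaves only the pairwise averages, which is the smoothing effect that denoising relies on.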
With parameter changes, DWT is defined in Equation (6) as follows [62]:
$$DWT(j,k) = \frac{1}{\sqrt{2^{j}}} \int_{-\infty}^{\infty} x(t)\, \psi\left(\frac{t - 2^{j}k}{2^{j}}\right) dt \tag{6}$$
where the scaling and time localization parameters of Equation (5) are discretized as a = 2^j and b = 2^j k.
Architecturally, referring to Fig. 1, the hybrid extraction process is applied to the preprocessing result (the signal in Fig. 1). The DWT-MFCC feature extraction is then carried out to help eliminate noise interference, recognize the shape of a person's voice pattern, and select the required voices.

A. The Architecture
Fig. 6 shows the architecture of the voice biometrics model developed in this research. In principle, user enrollment/training uses DWT-MFCC for feature extraction and the CNN Residual model for training. Training is a learning process in which the CNN model is trained to identify user voice datasets using large GPU and CPU computing devices. In this process, user identification is carried out to register the user's voice identity, which is stored in the database. After training is complete, the result is a Trained CNN Model, which is subsequently used for the user verification process.
The user verification process classifies and authenticates voice data. It applies new voice data directly to the Trained CNN Model and uses the model to infer the output. When a new user's voice data is entered into the Trained CNN Model, the system verifies it by matching the new voice data against the voice data that has been registered in the database. The system then issues a prediction based on what the Trained CNN Model has learned. The classification of the Trained CNN Model is optimized to maximize prediction performance and achieve high accuracy. The output of the classification is user voice authentication, in the form of data validation (valid/not valid, or accepted/rejected) of the user's voice data.

B. CNN Residual as Deep Learning used in this Research
In this research, the architecture of CNN Residual is shown in Fig. 7. In CNN Residual, the residual shortcut is a branching technique for CNN layers, in which one branch bypasses one or several layers of the other branch. The CNN Residual technique was originally intended to address the accuracy saturation that occurs as the number of layers increases. Difficult iteration problems and a large number of layers tend to degrade classification quality in terms of speed and accuracy. As the amount of data grows, the capacity of the CNN model must also grow in terms of the number of parameters, filters, and layers. With the residual technique, training requires fewer iterations, and accuracy increases along with the number of parameters, filters, and layers. The general equations of the residual shortcut function are given in Equations (7) and (8) [24]:
$$y = F(x, \{W_i\}) + x \tag{7}$$
$$y = F(x, \{W_i\}) + W_s x \tag{8}$$
where F(x, {W_i}) is the residual mapping to be learned, x is the input feature map, and y is the output. W_i denotes the weights of the layers that the shortcut skips, and W_s is a linear projection that matches the dimensions of x and y when the shortcut crosses a downsampling or upsampling step. The identity shortcut adds almost no arithmetic operations or parameters, so the computational load of the addition is negligible. Applying the residual technique shortens the iteration process and improves the classification results [65,66]. To further improve the performance of the voice biometrics system, this research proposes optimizing CNN with a CNN Residual model. This optimization is needed to simplify the training and validation process and to increase classification accuracy.
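Equations (7) and (8) can be illustrated with a toy residual block. To keep the example short it uses small dense weight matrices instead of the 3x3 convolutions of the actual model, so it is a simplification under stated assumptions and not the architecture of Fig. 7; all names and weight values are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_block(x, W1, W2, Ws=None):
    """y = F(x, {W_i}) + x, or y = F(x, {W_i}) + W_s x when Ws is given.

    F is modeled here as two dense layers with a ReLU in between,
    standing in for the convolution layers of the real network.
    """
    fx = matvec(W2, relu(matvec(W1, x)))           # residual mapping F
    shortcut = x if Ws is None else matvec(Ws, x)  # identity or projection
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual dimensions must match")
    return [f + s for f, s in zip(fx, shortcut)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zeros, zeros)            # Equation (7)

W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # 2 -> 3 dimensions
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]                 # projection of x
projected_out = residual_block(x, W1, W2, Ws)             # Equation (8)
```

With all-zero weights the block simply returns its input, which shows why a residual block can fall back to the identity and why adding such blocks does not have to degrade a deeper network.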

C. CNN Standard as a Comparison of Performance
The performance of the CNN Residual model is analyzed by comparing it with CNN Standard. The essential differences between them are the total number of parameters and the parameter size, both of which are greater in the CNN Residual model than in the CNN Standard model. As a result, the CNN Residual model takes longer to process than the CNN Standard model. The parameters of the CNN Standard and CNN Residual models are listed in Table II.

D. Data Set of Indonesian Language
In this research, an original voice dataset of Indonesian language speakers was created for training the CNN model. The creation of the voice dataset begins with the user's voice input via the microphone of a smartphone. The dataset involved 10 users, labeled Voice Biometric0 to Voice Biometric9 (VB0-VB9). Each VB user input contains a unique speaker and speech. Each VB user contributed a voice sample by speaking in Indonesian for 50 minutes.
To make the voice sample files in the dataset uniform, the following steps were applied. First, the stereo voice was converted to mono. Second, the sample rate was changed to 16,000 Hz. Third, silence was truncated to eliminate the pauses in the user's voice, leaving 25 minutes of speech without pauses for every VB user. Fourth, the voice samples of each VB user were segmented into 1,500 files. Fifth, the voice sample files were saved in WAV format. With 10 users, the resulting voice dataset contains a total of 15,000 voice sample files.
The voice dataset is then processed with DWT-MFCC feature extraction so that the shape of the voice pattern can be recognized from a person's characteristics, only the needed voices are selected, and noise disturbances are eliminated. After feature extraction, the voice dataset is ready to be trained with the CNN model.
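The dataset figures above can be cross-checked with simple arithmetic. The sketch assumes that the 1,500 files per user are equal-length segments and that the shorter test durations use segments of the same length; neither detail is stated explicitly in the text, and the helper name `files_for_duration` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000
MINUTES_PER_USER = 25
FILES_PER_USER = 1_500
NUM_USERS = 10

seconds_per_user = MINUTES_PER_USER * 60                    # 1500 s
seconds_per_file = seconds_per_user / FILES_PER_USER        # assumed equal split
samples_per_file = int(seconds_per_file * SAMPLE_RATE_HZ)
total_files = NUM_USERS * FILES_PER_USER


def files_for_duration(minutes):
    """Files per user for a shorter duration (same segment length assumed)."""
    return int(minutes * 60 / seconds_per_file)


for minutes in (5, 10, 15, 20, 25):
    print(minutes, "min ->", files_for_duration(minutes), "files per user")
```

Under these assumptions each file holds one second of mono audio, and the total of 15,000 files matches the figure reported for the full 25-minute dataset.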

E. Testing
In this research, the system's performance was assessed in three tests.
1) The first phase of performance testing is speaker recognition with the CNN Residual model using DWT-MFCC, compared with CNN Standard.
2) The second phase of performance testing is speech recognition with the CNN Residual model using DWT-MFCC, compared with CNN Standard.
3) The third test measures the training process time of voice biometrics with the CNN Residual model using DWT-MFCC, compared with CNN Standard.
Each test was carried out for sample durations of 5 minutes, 10 minutes, 15 minutes, 20 minutes, and 25 minutes.

A. Performance Testing of Speaker Recognition ("Whose Voice is the Person Speaking?")
Performance testing of speaker recognition evaluates the CNN Residual model using DWT-MFCC against CNN Standard. The measurement uses the confusion matrix, a tool for evaluating classification models that compares the classification results produced by the trained CNN model with the actual classes [67,68]. In this research, the input data were grouped into 10 VB users to classify the VB voice datasets. The confusion matrix is therefore the basis for choosing the better model between CNN Residual using DWT-MFCC and CNN Standard.
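This performance measurement uses a confusion matrix, which is divided into four combinations representing the results of the classification process: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The results are presented in Tables III and IV and in Figs. 8 and 9.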
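Accuracy and precision follow from the four confusion-matrix outcomes in the usual way: accuracy = (TP + TN) / (TP + TN + FP + FN) and precision = TP / (TP + FP). The sketch below computes both for one speaker treated as the positive class. The labels and predictions are made-up toy data, and the paper does not state how the per-speaker values are averaged over the 10 VB users, so this is only an illustration of the definitions.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one class treated as 'positive'."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels for illustration only; not data from the paper.
y_true = ["VB0", "VB0", "VB1", "VB1", "VB2"]
y_pred = ["VB0", "VB1", "VB1", "VB1", "VB0"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="VB0")
acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)
```

Precision penalizes only false acceptances of the positive speaker, which is why it can differ noticeably from accuracy when many true negatives are present.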
In the accuracy comparison for speaker recognition with the CNN Residual model using DWT-MFCC, the best result was an accuracy of 99.47% for the 25-minute voice sample. Accordingly, the longer the voice sample duration, or the larger the number of voice sample files processed, the higher the prediction accuracy, as shown in Table III and Fig. 8.
In the precision comparison for speaker recognition with the CNN Residual model using DWT-MFCC, the best result was a precision of 99.91% for the 25-minute voice sample. The data show that the longer the voice sample duration, or the larger the number of voice sample files processed, the higher the prediction precision, as shown in Table IV and Fig. 9.
Comparing the speaker recognition performance of the CNN Residual model and CNN Standard gives the following main results:
1) The accuracy of CNN Residual is higher than that of CNN Standard: CNN Residual reaches about 96.10% to 99.47%, while the latter reaches 95.80% to 99.00%, as can be seen in Table III and Fig. 8.
2) The precision of CNN Residual is higher than that of CNN Standard: CNN Residual reaches about 80.05% to 99.91%, while the latter reaches 78.85% to 96.83%, as can be seen in Table IV and Fig. 9.
3) The best results of CNN Residual are 99.91% precision and 99.47% accuracy for the 25-minute voice sample duration. CNN Standard also achieves its best results at this duration, with 96.83% precision and 99.00% accuracy.
It can be implied that the greater the number of voice sample files and the more voice sample training carried out, the higher the level of precision and accuracy in prediction performance will be. By looking at the comparison results, the highest percentage value shows the best value for the precision and accuracy of Speaker Recognition on CNN Residual. It can be concluded that the speaker recognition performance of the CNN Residual model is better than the CNN Standard.

B. Performance Testing of Speech Recognition ("What Keywords are Spoken?")
Performance testing of speech recognition evaluates the accuracy of the CNN Residual model using DWT-MFCC, compared with CNN Standard, at 5, 10, 15, 20, and 25 minutes of voice sample duration. This is done by matching the spoken keyword, that is, the speech content. The test uses the keyword "Open Access", spoken by the Indonesian users. If the utterance matches (True), it is accepted, while if the speech is wrong or unclear (False), it is rejected. Fig. 10 shows the speech recognition performance test for CNN Standard and CNN Residual. The test consisted of 20 utterances of "Open Access" from the 10 VB users. The results show that CNN Residual achieves the best speech recognition accuracy at 100%, which is higher than the 95% of CNN Standard.
The testing indicates that the CNN Residual model is better than CNN Standard. Optimizing the CNN Residual model can improve the validation performance of voice biometrics training accuracy, speaker recognition accuracy, and speech recognition accuracy. This is because the CNN Residual model simplifies the training and validation process and increases accuracy in voice biometrics classification.
As shown in Fig. 11, the comparison of the voice biometrics training process shows that the CNN Standard training process is faster than the CNN Residual training process. This happens because the total number of parameters and the parameter size of the CNN Residual model are larger than those of the CNN Standard model, so it requires a longer processing time, with a time difference of 0.03 to 1.28 seconds. The results suggest that more training time and more voice sample files lead to a higher level of prediction accuracy. They also indicate that the longer the file duration, the longer the processing time, although the difference is not large.
Analysis of the training process for sample durations of 5, 10, 15, 20, and 25 minutes shows that the accuracy values are consistently above 95%. Accordingly, even with only a 5-minute sample, the voice biometrics system can recognize and identify the speaker with decent performance.

V. CONCLUSION
This paper has developed a voice biometrics model for Indonesian language speakers using the CNN Residual deep learning algorithm with hybrid DWT-MFCC feature extraction. Testing was done by comparing the model with CNN Standard. In this study, a voice dataset was created with 10 users (VB0-VB9). Each VB is a unique speaker who speaks in Indonesian, resulting in a total of 15,000 voice samples, with voice sample durations of 5, 10, 15, 20, and 25 minutes.
The testing was conducted in the speaker recognition and speech recognition phases. In the speaker recognition phase, the CNN Residual model obtained the best results, with 99.91% precision and 99.47% accuracy at a voice sample duration of 25 minutes, compared with 96.83% precision and 99.00% accuracy for CNN Standard. In the speech recognition phase, CNN Residual achieved the best accuracy, 100% over 20 trials, while CNN Standard achieved only 95%.
In the performance testing of the training process for sample durations of 5, 10, 15, 20, and 25 minutes, the accuracy was consistently above 95%. This implies that with only a 5-minute voice dataset, the developed system is able to recognize who the speaker is as well as to identify what keywords are spoken.
Optimizing the CNN Residual model can improve the validation performance of voice biometrics training accuracy, as well as the accuracy and precision of speaker recognition and speech recognition. However, CNN Residual is slightly slower than CNN Standard, with a time difference of 0.03 to 1.28 seconds.
It can be concluded that the CNN Residual model provides better accuracy and precision. This research is expected to assist in developing a new model that applies an accurate and efficient individual voice identification and authentication algorithm for voice biometrics systems, supporting security and privacy when accessing sensitive data in banking transactions.
ACKNOWLEDGMENT
Haris Isyanto is in a PhD program funded by Beasiswa Pendidikan Pascasarjana Dalam Negeri (BPPDN), Ministry of Education and Culture, Republic of Indonesia. Dr. Muhammad Suryanegara is the main supervisor, and Dr. Ajib Setyo Arifin is the co-supervisor as well as the corresponding author. The voice dataset was built with the support of Electrical Engineering, Faculty of Engineering, Universitas Muhammadiyah Jakarta. This publication is supported by a Research Grant from Universitas Indonesia.