A Novel Smart Deepfake Video Detection System



I. INTRODUCTION
The rapid development of artificial intelligence (AI) techniques such as autoencoders, variational autoencoders, and Generative Adversarial Networks (GANs) has facilitated the generation of hyperrealistic fake videos, images, and audio. A deepfake is a synthetic image or video generated by AI, typically by swapping an individual's face with another's. Applications such as ZAO [1] and DeepFaceLab [2] enable individuals to generate forged images and videos quickly and easily. More recently, it has become possible to clone a human's voice using advanced AI techniques. AI-based audio manipulation is a category of deepfake that clones a person's voice and presents that person saying things they never said.
Overdub, iSpeech, and VoiceApp are examples of open-access voice cloning platforms that can generate synthesized deepfake audio closely resembling the target person's speech [3]. The work of [4] is an example of these manipulation methods; it creates highly realistic deepfake videos with precise lip-sync using a group of AI technologies: FaceSwap, FaceSwap GAN, DeepFaceLab, SV2TTS [5], and Wav2Lip [6].
The majority of deepfake videos are created by cloning voices, synchronizing lips, and synthesizing faces frame by frame. Nevertheless, they lack natural emotions, pauses, and breathing behaviour. Additionally, they suffer from discontinuity and flickering of faces between frames. Deepfakes can be misused to impersonate individuals, shape public opinion about a public figure, and spread falsified news. Therefore, a deepfake detection method is needed that keeps pace with the progress in deepfake generation and distinguishes fakes in the video frames, in the audio, and in the whole video comprising both. This paper introduces a smart deepfake detection method that captures the manipulation in a video (multimodal by nature) on three levels: video frames, audio, and the whole video. It decides whether a given video is a deepfake or not. Two proposed feature extraction methods are employed to extract features from the video frames and audio modalities. The first method, applied to the visual video frames modality, is XceptionNet with some newly introduced modifications. The Xception network has achieved effective results in distinguishing manipulated videos [7,8]. The suggested modifications to the Xception network produce useful spatial information from the video frames and improve the performance of the deepfake detection method. The second method, applied to the audio modality, is a modified InceptionResNetV2 model based on the Constant-Q Transform (CQT), which produces deep time-frequency information from the audio segments and improves the performance of the detection method. The CQT is a time-frequency analysis method that provides higher time resolution in high-frequency regions and higher frequency resolution in low-frequency regions [9]. Its efficiency has been proven in music signal processing tasks [10], speaker verification systems [11], acoustic scene and event detection and classification [12], anti-spoofing [13], synthetic speech detection [14,15], and speech emotion recognition [9].
The corresponding features extracted from the two modalities are fused at a mid-layer to create a bimodal information-based feature representation for the whole video. Finally, the GRU-based attention mechanism is applied to these three levels of representation independently. This helps to learn instructive temporal information for each level and to detect deepfake videos. The Gated Recurrent Unit (GRU) performs well in sequence learning tasks and overcomes the vanishing and exploding gradient problems of the standard recurrent neural network [16]. The proficiency of the attention mechanism has been proven in several areas including machine translation, image captioning, question answering, speech recognition [17], and event detection [18]. A comparative study with recent state-of-the-art deepfake detection methods is conducted in terms of accuracy, Area Under the Receiver Operating Characteristic (AUROC) curve, precision, recall, F1-score, sensitivity, and specificity.
www.ijacsa.thesai.org
The rest of this work is organized as follows: Section II presents the literature review for deepfake video detection methods. Section III presents the newly proposed method for deepfake video detection. Section IV is dedicated to the experimental results and analysis. The conclusion and future work are presented in Section V.

II. LITERATURE REVIEW
The progress of AI-based video and voice generation methods has made it easy to create natural and highly realistic deepfakes that are very difficult to distinguish from genuine content. Since deepfakes violate security and pose a real threat to society, several researchers have directed their interest towards creating methods for detecting them. However, these methods concentrate on detecting deepfakes in either the video frames or the audio modality.
Some of the existing deepfake visual video detection methods spot the manipulation by targeting specific spatial and temporal artifacts that are generated during the fake creation process. Other detection methods are data-driven: they do not target any specific artifacts and distinguish the manipulation by classification [3]. The deepfake visual video detection methods can be categorized into Convolutional Neural Network (CNN)-based methods [19,20,21,22], methods that are based on a CNN with a temporal network [23,24,25,26,27], handcrafted feature-based methods [28], and handcrafted feature-based methods with deep networks [29,30]. This is illustrated in Fig. 1.
The work of [19] detected deepfakes by exploiting artifacts left by the generation methods when warping the target image to be consistent with the source video. It used four pre-trained CNN models for detecting fake content: ResNet101, VGG16, ResNet50, and ResNet152. Since deepfake videos suffer from inconsistency between frames, Hu et al. [20] introduced two CNN-based branches to capture local and global inconsistencies and then detect deepfakes. Rana and Sung [21] proposed a deep ensemble learning method for detecting deepfake videos. Their method combined several deep base-learners and then trained a CNN on these learners to build an improved classifier. In [22], a fine-tuned InceptionResNetV2 model followed by the XGBoost model was employed to capture discrepancies in the spatial domain of fake videos and then identify deepfakes. FakeApp creates forged videos that have intra-frame inconsistencies and temporal inconsistencies between frames. Such inconsistencies were detected using InceptionV3 CNN and long short-term memory (LSTM) models [23]. As AI-generated fake videos lack normal eye blinking, Li et al. [24] introduced the VGG16-LSTM to capture the temporal regularities in the eye blinking process and then distinguish the deepfakes. Most deepfake videos are created frame by frame, where each forged face is created independently. This causes incoherence in the temporal domain of the face region, namely discontinuity and flickering. Accordingly, Zheng et al. [25] introduced a fully temporal convolution network that aimed to learn the temporal discrepancies while removing spatial ones. Then, a temporal transformer encoder followed by a multilayer perceptron was employed to learn the long-range inconsistencies along the time dimension and distinguish the deepfakes. In [26], a 2D CNN-based spatio-temporal learning model was introduced to learn and capture spatial and temporal inconsistencies of forged videos.
This temporal inconsistency was captured from both vertical and horizontal directions over adjacent frames and helped in detecting the fakes. The work of [27] introduced a fine-tuned EfficientNet-b5 model followed by a bidirectional LSTM model and a densely connected layer. It aimed to discover the spatio-temporal inconsistencies in deepfake videos and then determine the authenticity of videos. Deepfakes are created by splicing the generated face into the source image. This produces errors in facial landmark locations, which were detected by estimating the 3D head poses for real and deepfake videos. Then, the estimated difference of head poses was fed into a Support Vector Machine (SVM) for deepfake detection [28]. Khalil et al. [29] proposed a model that employed the local binary patterns descriptor to analyze the texture of real and fake videos. Additionally, a CNN-based enhanced high-resolution network was used to automatically capture informative multi-resolution representations of these videos. Then, the output of both was fed into a capsule network to identify deepfakes. Ismail et al. [30] introduced a hybrid method in which two feature extraction methods were employed to learn and extract rich spatial features from the detected face frames of a video. These methods were a CNN based on the Histogram of Oriented Gradients (HOG) method and an improved XceptionNet. Their outputs were merged and fed into a sequence of GRUs to extract the spatio-temporal features and detect the fake videos.
The deepfake audio detection methods can be categorized into handcrafted feature-based methods [31,32], methods that are based on low-level features with a CNN [14,33,34,35], methods that rely on low-level features with a CNN and a temporal network [37,38], and end-to-end deep networks-based methods [39]. This is presented in Fig. 2.
In [32], MFCC, CQCC, and Mel-filter bank slope features were employed to train a GMM to capture the vocal tract information and then distinguish the fake audio. Reimao [14] employed the CQT method to convert the audio signals into visual audio representations. These representations were fed into the Mobile network model to detect synthetic speech. In [33], linear filter bank low-level features were extracted from the audio and fed into the ResNet model to produce deep feature representations and detect audio manipulation. In addition, an online frequency masking augmentation layer and the large margin cosine loss function were employed while training the Residual network to learn more robust key feature embeddings [33]. Wu et al. [34] employed the long-term CQT and log power spectrum to extract an audio representation. This representation was used as an input to the feature genuinization method, which learned a transformer with a CNN based on genuine speech characteristics. It aimed to maximize the difference between the distributions of genuine and synthetic speech. Then, the transformed features were utilized with a light CNN model to detect synthetic speech. In [35], the audio signals were converted into spectrogram images using the Fast Fourier transform method. These images were fed as an input into a CNN to validate audio signal authenticity. The work of [36] utilized Linear Frequency Cepstral Coefficients to convert raw audio into feature vector representations. Then, these representations were fed to a fine-tuned ResNet-18. In addition, a one-class Softmax loss function was proposed to learn an embedding feature space in which the genuine speech had a compact boundary while the fake data was isolated from the genuine one by a certain margin. In [37], the log magnitude spectrograms were extracted from audio files.
Then, a light convolution gated recurrent network was employed on these spectrograms to produce deep features and discriminate the real speech from the spoofed one. The work of [38] employed the short-term zero-crossing rate and energy to select the silent segments from each speech signal. Then, the linear filter bank features were extracted and fed into an attention-enhanced DenseNet Bi-LSTM model to identify audio manipulations. In [39], an end-to-end model which is based on the Residual network was proposed to extract deep features of audio data and then detect the synthetic speech.
Some researchers introduced approaches based on learning from different modalities to detect deepfakes. These approaches, which are often known as deepfake multimodal video detection methods, can be categorized into CNN-based methods [40,41,42,43], and methods that are based on a CNN and a temporal network [44,45]. This is depicted in Fig. 3.
The work of [40] exploited the perceived emotion cues from the speech and face modalities to detect any manipulation in a video. It employed the OpenFace-V1 technique to extract the facial features and the PyAudioAnalysis library to extract the MFCC speech features. Then, a Siamese network-based architecture and the triplet loss were utilized to model the similarity between both modalities within a video and distinguish the fake content. Since any modification of the visual video frames or the audio modality within a video leads to a loss of lip synchronization and to abnormal lip and facial movements, a multimodal video deepfake detection method was introduced in [41]. This method was based on computing the dissimilarity score between visual video and auditory segments. A 3D-Residual network-based architecture was used for extracting visual video features from face segments, and the raw audio segments were converted into MFCC features and then fed into a CNN. The contrastive loss was estimated over the audio and visual video features for each segment, which forced the real representations of both modalities to be closer than the manipulated ones. Additionally, the cross-entropy loss was applied to every single modality to ensure that each one independently learns informative features. The work of [42] presented a multimodal video deepfake detection method based on discovering the defects in manipulated mouth areas by employing genuine audio as a reference. The audio was aligned and clipped into partitions based on phonemes, and Mel-scale spectrograms were extracted and used as audio features. The mouth frames were extracted from videos based on facial landmarks using the dlib Python library. Then, each mouth frame with a particular phoneme interval was matched to a fixed-length audio partition to produce auditory-visual video pairs.
After that a CNN architecture was trained on these pairs to capture the synchronization degree between lip movements and speech by measuring the similarity score of auditory-visual video pairs. Zhou and Lim [43] employed the asynchrony property between fake visual video, especially mouth movements, and speech to detect any modification within a video. The Multi-Task CNN (MTCNN) was utilized for detecting the face from video frames and the Residual (2+1)D-18 network was applied for extracting visual feature representations of these frames. For audio, a simple 1D convolution network was utilized for extracting 1D waveform signal feature representation. In addition, a sync-stream was built by applying central connections to visual video and audio network feature representations between low-level features; spatial and temporal information, to higher-level semantic representations. At each layer, the representations of visual video and auditory modalities were fused with the current layer of sync-stream. The output of this was utilized as an input to the fusion at the next layer. This helped in modelling the synchronization patterns of both modalities and distinguishing the deepfakes. Based on the observation that machines cannot recreate human emotions naturally in manipulated videos, Gino [44] introduced a deepfake detection method depending on exploiting emotion features from visual video and audio modalities. The low-level descriptors (LLDs) were extracted from raw audio segments using the OpenSmile toolkit and passed to the LSTM architecture to extract emotional features of speech. In addition, the face frames were detected from videos using the BlazeFace tool and then passed into 3D-CNN architecture to extract visual emotional features. After that, two approaches were followed in the final deepfake detection phase. In the first one, the visual and auditory emotional features were combined horizontally. 
Then, these features were fed either into the LSTM network or into Lazypredict models. In the second, the average between the prediction scores returned by training the LSTM and Lazypredict models on the visual video and auditory modalities separately was computed. The work of [45] detected the fake content in videos by extracting visual video and auditory emotional features and passing them to a deep network. The OpenFace-V2 toolkit was employed for extracting 31 visual features related to the intensity of facial muscle actions, eye gaze, and head position. The python_speech_features library was used to extract 13 MFCC features and their respective derivatives; delta MFCC, and delta-delta MFCC, as audio features. The visual video and auditory features were normalized and concatenated to be passed into CNN blocks that were followed by two Bi-LSTM networks and dense layers for deepfake detection.
A few deepfake detection methods are concerned with multimodal videos. However, they do not consider whether a video is manipulated only on the visual video frames level, audio level, or bimodal level which combines visual frames and audio. Consequently, this paper introduces a novel smart deepfake video detection system that can check whether the manipulation is just applied to video frames, audio, or both. It then produces the final decision for detecting the deepfakes on these three levels: visual frames, audio, and the whole video.

III. PROPOSED METHOD
The suggested deepfake video detection method consists of three base stages: pre-processing, feature extraction using unimodal and bimodal information, and classification. These stages are shown in Fig. 4 and each one will be described hereafter.

A. Pre-processing
The visual video and audio modalities are pre-processed individually. The face frames are extracted from videos and saved separately. They are rescaled to the size 224 × 224, and their pixel values are normalized into [-1,1]. These pre-processed face frames become the input to the next stage for learning and extracting deep visual video features. The raw audio files are extracted from videos and stored separately in a wave format. Then, the audio files are segmented. The CQT method is applied to every audio segment to transform it from the time domain to the time-frequency domain. In CQT, frequency bins are geometrically spaced and the ratios between centre frequencies and bandwidths, which are called Q-factors, are equal for all bins [47,12]. The CQT of a discrete audio signal x(n) in the time domain is computed by the following formula [10,46,47,48]:

$$X^{CQ}(k,n) = \sum_{j=n-\lfloor N_k/2 \rfloor}^{n+\lfloor N_k/2 \rfloor} x(j)\, a_k^{*}\!\left(j - n + \frac{N_k}{2}\right)$$

$$N_k = \frac{f_s}{f_k}\, Q, \qquad f_k = f_1\, 2^{(k-1)/B}, \qquad Q = \frac{f_k}{f_{k+1} - f_k} = \left(2^{1/B} - 1\right)^{-1}$$

where $k$ represents the index of the frequency bin, $\lfloor \cdot \rfloor$ denotes the floor function, and $x(j)$ represents the $j$-th sample of a speech time-domain frame. The symbol $N_k$ indicates the window lengths, $f_s$ represents the sampling rate frequency, and $f_k$ indicates the centre frequency of the $k$-th bin. The symbol $f_1$ denotes the centre frequency of the lowest bin, and $B$ refers to the number of bins per octave, which in practice determines a trade-off between time and frequency resolution. The factor $Q$ produces a constant frequency-to-resolution ratio for each bin. The term $a_k^{*}(n)$ represents the complex conjugate of the complex-valued time-frequency atoms $a_k(n)$, which are defined as follows:

$$a_k(n) = \frac{1}{C}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left[ i \left( 2\pi n \frac{f_k}{f_s} + \Phi_k \right) \right], \qquad C = \sum_{l=-\lfloor N_k/2 \rfloor}^{\lfloor N_k/2 \rfloor} w\!\left(\frac{l + N_k/2}{N_k}\right)$$

where $w(t)$ denotes a window function, here the Hann window [49], which is sampled at points specified by $t$. It is zero when $t$ does not belong to [0,1]. The sum $C$ represents a scaling factor, and $\Phi_k$ represents a phase offset.
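The constant-Q property described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `cqt_bin_geometry` is hypothetical, bins are indexed from zero, and the example values (84 bins, 12 bins per octave, a lowest centre frequency of about 65 Hz, a 22,050 Hz sampling rate) are the CQT settings reported in this paper.

```python
def cqt_bin_geometry(f_min, bins_per_octave, n_bins, sample_rate):
    """Return (Q, centre frequencies, window lengths) of a constant-Q filter bank.

    Hypothetical helper for illustration; bins are indexed from 0 here.
    """
    # Q-factor shared by all bins: centre frequency divided by bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Geometrically spaced centre frequencies (doubling every octave).
    freqs = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    # Window length of each bin in samples (real-valued, before rounding).
    lengths = [q * sample_rate / f for f in freqs]
    return q, freqs, lengths


q, freqs, lengths = cqt_bin_geometry(65.4, 12, 84, 22050)
print("Q-factor:", round(q, 3))
print("lowest / highest centre frequency (Hz):", freqs[0], round(freqs[-1], 1))
print("longest / shortest window (samples):", round(lengths[0]), round(lengths[-1]))
```

Low-frequency bins get long windows (fine frequency resolution) and high-frequency bins get short windows (fine time resolution), which is the trade-off the text attributes to the CQT.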
The CQT computations are implemented using the librosa Python library. The audio files are resampled to 22,050 Hz. A total of 84 frequency bins with 12 bins per octave, a hop length of 128 samples, and a minimum frequency of approximately 65 Hz are used during the CQT calculations. In addition, the Hann window function is applied. The output of the CQT is then transformed into a log scale (decibels) to cope with the wide range of sound intensity. This produces a decibel-scaled spectrogram with the shape T×84 per audio segment, where T depends on the audio file duration. The duration of the audio files adopted here is three seconds and, accordingly, T is equal to 65. The spectrograms are normalized into the range [-1,1] and then reshaped to (65,84,1) as single-channel images. They become the input to the next stage, which learns and extracts deep auditory features.
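The post-processing of the CQT output can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the paper does not say which decibel reference or which normalization is used, so a peak reference and min-max scaling are assumed here, and all function names are hypothetical. A toy 2×3 magnitude matrix stands in for a real T×84 spectrogram.

```python
import math


def amplitude_to_db(magnitudes, amin=1e-5):
    """Convert magnitudes to decibels relative to the largest value (assumed reference)."""
    ref = max(max(max(row) for row in magnitudes), amin)
    return [[20.0 * math.log10(max(v, amin) / ref) for v in row] for row in magnitudes]


def scale_to_unit_range(matrix):
    """Min-max scale a matrix into [-1, 1] (assumed normalization)."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0  # guard against constant input
    return [[2.0 * (v - lo) / span - 1.0 for v in row] for row in matrix]


def add_channel_axis(matrix):
    """Turn a (T, F) matrix into a (T, F, 1) single-channel image."""
    return [[[v] for v in row] for row in matrix]


toy_magnitudes = [[1.0, 0.1, 0.01], [0.5, 0.001, 1.0]]
db = amplitude_to_db(toy_magnitudes)
scaled = scale_to_unit_range(db)
image = add_channel_axis(scaled)
print(len(image), len(image[0]), len(image[0][0]))  # time steps, bins, channels
```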

B. Feature Extraction using Unimodal and Bimodal Information
In this stage, the problem of deepfake detection is handled by proposing two feature extraction methods for the visual video and audio modalities. An upgraded XceptionNet is suggested to extract instructive deep spatial features from the pre-processed face frames of videos. It outputs a visual feature representation of the first unimodality, the video frames. A modified InceptionResNetV2 is applied to the CQT spectrograms representing the audio files to extract deep time-frequency features. It produces a feature representation of the second unimodality, the audio. Then, the corresponding feature representations extracted from these modalities are fused. This outputs a feature vector representation of the whole video using bimodal information. After that, these resultant representations are independently passed into the GRU-based attention mechanism. This helps to learn the significant temporal information from the sequential feature representation per video on three levels: visual video frames, audio, and the whole video. Finally, a fully connected layer is applied to produce the final prediction about video authenticity. These components are explained in detail in the following subsections.

1) Visual video frames features:
The processed face frames of videos with the shape (h×w×3) are received as an input to the proposed upgraded Xception network where h=224, w=224, and 3 denote the height, width, and RGB channels per frame. The Xception original architecture consists of 36 convolutional layers divided into 14 modules. All modules have shortcut residual connections around them except for the first and last ones. The Xception comprises depth-wise separable convolution layers, which reduce the cost of convolution operation dramatically [50,51]. The proposed upgraded Xception network architecture is depicted in Fig. 5. The original XceptionNet is upgraded by first injecting seven layers before the last rectified linear unit (ReLU) activation layer of the last module. These seven layers are convolution with 1536 filters, batch normalization, ReLU activation, convolution with 1024 filters, batch normalization, ReLU activation, convolution with 1024 filters, and batch normalization. The convolution layers produce more informative and exclusive feature maps that help to differentiate between real and fake visual videos. The batch normalization layers, which standardize the input, have the effect of drastically speeding up the training and improving the model's performance by providing a modest regularization. The ReLU activation layers, which give a value of zero for all negative input feature values, add a nonlinear property to the model allowing it to understand and learn complex structures in data. Then, the dropout layer that randomly drops out units with a rate of 0.2 is injected between the last ReLU activation and the global average pooling layers to prevent overfitting and boost the model's generalization. After that two layers are injected after the global average pooling layer; the fully connected layer with 1024 units and ReLU activation function, and the dropout layer with a rate of 0.5. 
After applying the upgraded XceptionNet to the face frames of videos, the output becomes a vector representation of 1024 features per frame. The suggested modifications to the Xception network attempt to generate an instructive spatial hierarchical representation of frames. This helps to improve the deepfake detection method performance in real-world circumstances.
2) Audio features: The CQT spectrograms of audio files with the shape (65, 84, 1) per segment are received as an input to the proposed modified InceptionResNetV2. The original InceptionResNetV2 architecture is built by joining inception blocks and skip connections. Each InceptionResNet block contains convolutions with different-sized filters that are combined by skip connections. These skip connections prevent the degradation problem that occurs in deep structures and reduce the training time [52].
The proposed modified InceptionResNetV2 architecture is depicted in Fig. 6. The original InceptionResNetV2 is modified first by reducing the number of repetitions of the InceptionResNet blocks A, B, and C from 5, 10, and 5 to 4, 7, and 3, respectively. Then, some layers are injected after the last InceptionResNet block C and before the global average pooling layer. These layers are a convolution with 512 filters and a kernel size of 1×1, batch normalization, ReLU activation, two convolutions with 1024 filters and a kernel size of 1×1, each followed by batch normalization and ReLU activation, and a dropout with a rate of 0.2. After that, a fully connected layer with 1024 units and a ReLU activation function is injected between the global average pooling and dropout layers. In addition, the filter units, kernel size, and stride of some layers are altered as shown in Fig. 6. After applying the modified InceptionResNetV2 to the audio file segments, the output becomes a vector representation of 1024 auditory features per segment.
The proposed modifications to the InceptionResNetV2 aim to generate an informative deep time-frequency representation of audio segments. This helps to enhance the performance of the proposed deepfake detection method.

3) Bimodal information-based video features:
The deep extracted features from visual video frames and audio modalities using the above-mentioned unimodality-based feature extraction methods are mid-fused at a concatenate layer. This produces a feature vector representation for the whole video, which is based on bimodal information.
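The mid-level fusion amounts to concatenating the two 1024-dimensional vectors at each time step. The sketch below is only an illustration: the helper name `fuse_features` is hypothetical, and it assumes the visual and audio sequences have already been aligned to the same number of time steps, which the paper does not detail.

```python
def fuse_features(visual_seq, audio_seq):
    """Concatenate per-step visual and audio vectors into bimodal vectors."""
    # Assumption: both sequences are aligned to the same number of time steps.
    if len(visual_seq) != len(audio_seq):
        raise ValueError("sequences must have the same number of time steps")
    return [list(v) + list(a) for v, a in zip(visual_seq, audio_seq)]


# Dummy feature sequences: 4 time steps of 1024 features per modality.
visual = [[0.1] * 1024 for _ in range(4)]
audio = [[0.2] * 1024 for _ in range(4)]
fused = fuse_features(visual, audio)
print(len(fused), len(fused[0]))  # time steps, fused feature size
```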

4) Temporal information extraction-based attention mechanism:
Most deepfake videos are generated based on synthesizing faces frame-by-frame, cloning voices, and synchronizing lips. They suffer from flickering and discontinuity of the face frames and lack of normal emotions, breathing, pauses, and the pace at which the target subject speaks among audio segments. As a result, the GRU-based attention mechanism is applied to the three levels of the extracted features independently; visual video frames, audio, and the whole video. This aims to capture the instructive temporal information that helps to differentiate real videos from fake ones.
The GRU architecture is composed of two gates, update ($z_t$) and reset ($r_t$), that modulate the information flow from the previous time step to the current step. At each time step $t$, the update gate decides the amount of previous information that should be retained, and the reset gate determines the amount of information that needs to be forgotten [53]. The GRU hidden state $h_t$ at time $t$ is defined by the following formulae [54]:

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1}\right)$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1}\right)$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right)$$
$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t$$

where $x_t$ refers to the input, and $W$ and $U$ represent the weight matrices. The symbol $\sigma(\cdot)$ represents the sigmoid function, $\tanh(\cdot)$ represents the hyperbolic tangent, $\odot$ denotes the Hadamard product, and $\tilde{h}_t$ denotes the candidate hidden state. As can be seen in Fig. 4, a single GRU is applied to the above-mentioned feature representations on the three levels. It produces a matrix $H$ of hidden state vectors, one vector $h_t$ per time step $t$, which represents the learned temporal information per visual video, audio, or the whole video. Over the $T$ time steps of a sequence, it is defined as follows:

$$H = \left[h_1, h_2, \ldots, h_T\right]$$

The attention mechanism uses weights to concentrate on the important features from the input sequence $H$. It is defined by the following equations [17,55]:

$$u_t = \tanh\left(W_a h_t + b\right)$$
$$\alpha_t = \frac{\exp(u_t)}{\sum_{j=1}^{T} \exp(u_j)}$$
$$v = \sum_{t=1}^{T} \alpha_t h_t$$

where $u_t$ is the result of feeding a hidden vector $h_t$ into a single-layer Multi-Layer Perceptron (MLP) with the $\tanh$ activation function. $W_a$ represents the weight matrix, and $b$ refers to the bias term. The symbol $\alpha_t$ represents the normalized attention weights that are produced by applying the softmax layer to $u_t$. $v$ is a video representation that is formed by summing the hidden vectors $h_t$ weighted by the attention weights $\alpha_t$.
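The attention pooling over GRU hidden states can be illustrated with a few lines of plain Python. This is a hedged sketch rather than the paper's implementation: it assumes a tanh scoring layer that maps each hidden vector to a single scalar score (so the weight matrix reduces to a vector), and the function name, toy hidden states, and weights are made up for the example.

```python
import math


def attention_pool(hidden_states, weights, bias):
    """Pool a sequence of hidden vectors into one vector with softmax attention."""
    # Score each time step: u_t = tanh(w . h_t + b)  (scalar score, assumed form).
    scores = [math.tanh(sum(w * h for w, h in zip(weights, h_t)) + bias)
              for h_t in hidden_states]
    # Softmax over time steps gives the normalized attention weights alpha_t.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of hidden vectors: v = sum_t alpha_t * h_t.
    dim = len(hidden_states[0])
    pooled = [sum(a * h_t[d] for a, h_t in zip(alphas, hidden_states))
              for d in range(dim)]
    return alphas, pooled


# Three toy hidden states of size 2 (stand-ins for GRU outputs).
hidden = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
alphas, pooled = attention_pool(hidden, weights=[0.5, -0.5], bias=0.0)
print([round(a, 3) for a in alphas])
print([round(p, 3) for p in pooled])
```

Time steps with higher scores contribute more to the pooled vector, which then feeds the classification layer.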

C. Classification
After the instructive temporal features are produced by the GRU-based attention mechanism, a fully connected layer is used as an output layer with two classes. The Softmax function is used to distinguish deepfake videos from real ones. The Softmax formula is defined as follows:

$$\text{Softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$$

where $z$ denotes the values resulting from the output layer neurons.
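A two-class Softmax decision can be sketched as follows. The logit values and the class order in `CLASSES` are assumptions made for illustration; subtracting the maximum logit is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of output-layer values."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ("genuine", "deepfake")  # class order is an assumption
probs = softmax([1.2, -0.3])       # made-up output-layer values
label = CLASSES[probs.index(max(probs))]
print(label, [round(p, 3) for p in probs])
```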

D. Dataset
The proposed method has been evaluated on the FakeAVCeleb multimodal videos dataset. This dataset consisted of 490 celebrity genuine videos that were selected from the VoxCeleb2 dataset based on various ethnic groups, gender, and age. Its genuine videos are face-centered and cropped. The fake videos of the FakeAVCeleb dataset were generated using DeepFaceLab, Faceswap, and FSGAN, while fake audios were generated using a real-time voice cloning tool (SV2TTS). Additionally, the Wav2Lip was applied to the deepfake videos to re-enact these videos based on the cloned audios. Thus, the FakeAVCeleb dataset had more realistic deepfakes. The FakeAVCeleb was divided into four groups; genuine visual videos with genuine audios, genuine visual videos with deepfake audios, deepfake visual videos with genuine audios, and deepfake visual videos with deepfake audios [4].
To evaluate the proposed method, 1215 genuine and deepfake videos of the FakeAVCeleb dataset are employed. These videos are divided into three subsets: training, validation, and testing.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed deepfake video detection method is evaluated on the FakeAVCeleb dataset. Its performance is assessed using the following evaluation metrics [56]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where $TP$ denotes the number of deepfake samples that are correctly predicted, $FP$ represents the number of genuine samples that are incorrectly predicted, $FN$ denotes the number of deepfake samples that are incorrectly predicted, and $TN$ refers to the number of genuine samples that are correctly predicted. The predicted deepfake samples comprise $TP + FP$, and the predicted genuine samples comprise $TN + FN$. The higher the AUROC metric, the better the detection method's performance at separating deepfake videos from genuine ones.
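As a worked example of these metrics, the sketch below computes them from confusion-matrix counts, treating deepfake as the positive class as in the definitions above. The counts are invented purely for illustration and are not results from this paper; the function name is hypothetical.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics with deepfake as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Invented counts for illustration only (not results from the paper).
metrics = detection_metrics(tp=45, fp=5, fn=3, tn=47)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```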
The following three experiments are applied to the FakeAVCeleb dataset:
Experiment 1: This experiment applies the proposed method to the FakeAVCeleb videos dataset for two levels: visual video frames and audio. The visual video frames and audio modalities are trained end-to-end separately. Thus, a single GRU-based attention mechanism with 1024 units is independently applied to the visual video features that are extracted using the proposed upgraded XceptionNet and to the audio features that are extracted using the proposed CQT-based modified InceptionResNetV2. This learns the instructive temporal features for each unimodality. The visual video modality is trained for 32 epochs using the stochastic gradient descent (SGD) optimizer [57] with a decaying learning rate and a momentum of 0.9. The audio modality is trained for 27 epochs using the adaptive moment (Adam) optimizer [58]. The batch size is 32. Then, the predictions are produced per modality. The performance of the visual video and audio modalities on the FakeAVCeleb dataset is shown in Table I and Table II, respectively. The proposed upgraded XceptionNet with a single GRU-based attention mechanism for the visual video modality has achieved 98.51% accuracy and 98.45% AUROC, outperforming recent state-of-the-art methods by a large margin. Additionally, the proposed CQT-based modified InceptionResNetV2 with a single GRU-based attention mechanism for the audio modality has achieved 97.52% accuracy and 97.62% AUROC, outperforming other state-of-the-art methods by a large margin.
Experiment 2: In this experiment, the prediction results of the visual video frames and audio modalities from Experiment 1 are employed to produce the prediction for the whole video. Thus, the whole multimodal video is predicted to be genuine if both modalities are predicted as genuine; otherwise, it is predicted to be a deepfake. The performance of Experiment 2 for multimodal video deepfake detection is recorded in Table III.
Experiment 2 yielded 96.04% accuracy and 95.49% AUROC.
Experiment 3: The proposed method is applied to the FakeAVCeleb video dataset on the third level: the whole multimodal video. The FakeAVCeleb dataset is distributed into four groups: genuine visual videos with genuine audios, genuine visual videos with fake audios, fake visual videos with genuine audios, and fake visual videos with fake audios. Accordingly, the whole-video label is considered genuine if the labels of both the visual video and audio modalities are genuine; otherwise, it is fake. A single GRU-based attention mechanism with 3572 units is applied to the bimodal information-based video features. This helps to learn the informative temporal features of the whole multimodal video. The details of the GRU-based attention mechanism layers that are applied on top of the bimodal information-based video features are described in Table IV. The proposed method is trained for 24 epochs using the SGD optimizer with a decaying learning rate and a momentum of 0.9. The optimizer updates the weight parameters to minimize the difference between the actual and predicted labels. The batch size is set to 64. The cross-entropy loss function [59] is used to measure the efficiency of the suggested deepfake video detection method on the three levels: video frames, audio, and the whole video. It is computed over the number of visual videos, audios, or whole videos from the actual label and the predicted probability of each video. It can be seen in Table III that the proposed method, which represents Experiment 3, achieved 97.52% accuracy and 97.21% AUROC for whole multimodal video deepfake detection. Its performance exceeds that of Experiment 2 because Experiment 2 cannot learn intercorrelations between the modalities.
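As an illustration of the loss described above, the following sketch computes the binary cross-entropy, L = -(1/N) * sum(y_i * log(p_i) + (1 - y_i) * log(1 - p_i)), averaged over N samples. It assumes the binary form of the loss, with label 1 for deepfake and 0 for genuine and p_i as the predicted deepfake probability; the symbol names, the function name, and the clipping constant `eps` are choices made for this example rather than details given in the paper.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy over N samples.

    labels: actual labels (assumed 1 = deepfake, 0 = genuine)
    probs:  predicted probabilities of the deepfake class
    eps:    clipping constant to avoid log(0); a choice made for this sketch
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical predictions, for illustration only.
confident_correct = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_correct, confident_wrong)
```

Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily, which is why minimizing this quantity pushes the predicted probabilities toward the actual labels.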
Additionally, the proposed method outperforms recent state-of-the-art methods by an average improvement of 34.4% in accuracy and 34.2% in AUROC, as can be seen in Table III. The experiments are carried out on an HP OMEN laptop with an Intel(R) Core(TM) i7-9750H CPU, 16 gigabytes of RAM, a 6-gigabyte RTX 2060 GPU, and Windows 11. The proposed method is implemented in the Python programming language, using libraries such as Keras, OpenCV, Random, TensorFlow, NumPy, OS, and Librosa.
The accuracy and loss curves of the proposed method on the training and validation subsets of the FakeAVCeleb dataset for the three levels (visual video frames, audio, and whole multimodal videos) are shown in Fig. 7. Additionally, the confusion matrices of the proposed method for deepfake video detection on the three levels are depicted in Fig. 8. Furthermore, Fig. 9 shows the receiver operating characteristic (ROC) curve and the corresponding AUROC of the proposed method. As shown in Fig. 9, the ROC curve lies extremely close to the top-left corner, which confirms the high performance of the proposed method. Fig. 10 compares the proposed method with contemporary state-of-the-art methods using the evaluation metrics. As shown in Fig. 10, the proposed method yields better performance than the other methods on the three levels. It has a precision of 96.91%, recall of 100%, F1-score of 98.43%, and specificity of 97.22% for detecting visual videos. Additionally, it has a precision of 100%, recall of 95.10%, F1-score of 97.49%, and specificity of 100% for detecting audios. Further, it has a precision of 98.43%, recall of 97.66%, F1-score of 98.04%, and specificity of 97.30% for detecting whole multimodal videos.
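Since the F1-score is the harmonic mean of precision and recall, the reported values can be cross-checked with a short calculation. The sketch below uses only the precision, recall, and F1-score figures quoted above; the function and variable names are hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the text for each level.
reported = {
    "visual video": (96.91, 100.00, 98.43),
    "audio": (100.00, 95.10, 97.49),
    "whole multimodal video": (98.43, 97.66, 98.04),
}

computed = {}
for level, (precision, recall, reported_f1) in reported.items():
    computed[level] = f1_from_precision_recall(precision, recall)
    print(f"{level}: computed F1 = {computed[level]:.2f}%, reported F1 = {reported_f1:.2f}%")
```

For all three levels, the recomputed F1-score agrees with the reported value to two decimal places, so the reported precision, recall, and F1-score figures are mutually consistent.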
It can be concluded that the proposed upgraded XceptionNet generated a useful spatial hierarchical representation of faces, which contributed to distinguishing between genuine and fake videos. Likewise, the proposed CQT-based modified InceptionResNetV2 produced a valuable deep time-frequency representation of audio. This helped to reveal deepfake videos and improved the detection method's effectiveness. Moreover, a concatenation layer applied to the features extracted from the visual video and audio modalities produced an informative bimodal representation of videos. In addition, the GRU-based attention mechanism, which is applied to the visual video, audio, and bimodal features, helped to capture the most important temporal information of videos. This in turn helped to detect the deepfakes. Furthermore, it can be inferred that correlating features from different modalities can improve the chances of achieving accurate deepfake video detection.
V. CONCLUSION AND FUTURE WORK
A new smart system for detecting video deepfakes has been presented. Two methods were proposed to extract features from the visual video frames and audio modalities, respectively. These methods produced useful spatial information for visual video and valuable time-frequency information for audio, which improved the performance of the deepfake detection method. In addition, the feature representations of both modalities were passed into a mid-layer to produce an informative bimodal representation per video. This showed that using bimodal information boosts learning during training compared to a method that ignores the intercorrelation between modalities. The GRU-based attention mechanism was then applied to the different feature representations to extract the most significant temporal information and detect the deepfakes. The proposed method has been evaluated on the FakeAVCeleb multimodal video dataset.
It achieved 98.51% accuracy, 98.45% AUROC, 96.91% precision, 100% recall, 98.43% F1-score, and 100% sensitivity on the first level (visual videos). Additionally, it yielded 97.52% accuracy, 97.62% AUROC, 100% precision, 95.10% recall, 97.49% F1-score, and 95.10% sensitivity on the second level (audios). Moreover, it attained 97.52% accuracy, 97.21% AUROC, 98.43% precision, 97.66% recall, 98.04% F1-score, 97.66% sensitivity, and 97.30% specificity on the third level (whole multimodal videos). Consequently, the proposed method outperformed the current state-of-the-art methods by a large margin.
In the future, several optimization algorithms can be employed to enhance the performance of the proposed deepfake video detection method. Furthermore, a larger multimodal video dataset may be used to improve the detection method's performance.