Unsupervised Domain Adaptation using Maximum Mean Covariance Discrepancy and Variational Autoencoder

Face Recognition has progressed tremendously from its initial use of holistic learning models to hand-crafted, shallow, and deep learning models. DeepFace, a nine-layer Deep Convolutional Neural Network (DCNN), reached near-human performance on unconstrained face recognition for the Labeled Faces in the Wild (LFW) dataset. Such models perform very well on benchmark datasets, but their performance sometimes deteriorates in real-world applications. The problem arises when there is a domain shift because the training and testing data follow different distributions. A few researchers have looked at Unsupervised Domain Adaptation (UDA) to find domain-invariant feature spaces. They tried to minimize the domain discrepancy using a statistic-based loss, the maximum mean discrepancy (MMD). From MMD, researchers moved on to the higher-order statistics of maximum covariance discrepancy (MCD). MMD and MCD were combined into the maximum mean and covariance discrepancy (MMCD), which captures more information than MMD alone. We use a Variational Autoencoder (VAE) with joint mean and covariance discrepancy to offer a solution for domain adaptation. The proposed MMCD-VAE model uses the VAE to measure the discrepancy in the spread of variance around the mean value and uses MMCD to measure the directional discrepancy in the variance. Analysis was done using the TinyFace benchmark dataset and the Bollywood Celebrities dataset. Three objective image quality parameters, namely SSIM, PieAPP, and SIFT feature matching, demonstrate the superiority of MMCD-VAE over the conventional KL-VAE model. MMCD-VAE shows an 18% improvement in SSIM and a remarkable improvement in the perceptual quality of the image over the conventional KL-VAE model.


I. INTRODUCTION
In the past decade, Face Recognition (FR) research has achieved high accuracy using Deep Learning (DL) approaches. Its accuracy has matched that of humans and even surpassed it. Advances in DL have facilitated the growth of the large training datasets required to implement DL algorithms effectively. Presently we have datasets that use large amounts of labeled data from the internet, consisting of face images in an unconstrained environment, with a marked diversity of ethnicity, gender, and age.
At times, in real-world applications, one notices a discrepancy: the target face image dataset is acquired in different settings from the source, so a model learned on the source dataset performs differently on the target dataset. Also, in some applications, it is not possible to obtain a large dataset from a particular domain to train a deep learning model. Can one then borrow pre-trained models from similar domains? Doing so can help the learning process. The caveat is that performance is boosted only when the training and testing datasets have identical data distributions.
It is interesting, in this context, to compare how deep networks and humans learn. The two learn differently. Humans learn from a limited set of labeled data, and they can generalize what they have learned and apply it to new conditions or situations.
The authors in [1] have shown the theoretical limitations on the performance by studying the error bounds for different source and target data distributions. The term "data shift", as first used in 2009, in [2], is the change of distribution of features [3]. The change in the distributions is referred to as covariate shift in [4]. Even a Deep CNN can experience domain shift [5]. Domain Adaptation (DA) algorithms attempt to understand these different shifts in statistical distributions for adaptation in domains.
The paper is organized as follows. Section II presents a review of domain adaptation techniques. Section III describes the metrics for measuring distribution discrepancy. Section IV focuses on the deep domain adaptation for face recognition. Section V presents the proposed MMCD-VAE latent feature extraction model. Section VI elaborates the experimental results, and finally Section VII provides the conclusion and future work of this study.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 6, 2022

A. Domain Adaptation and Transfer Learning
The authors in their landmark paper [6] gave an overview of the Transfer Learning (TL) process, where they situated the DA task in the context of TL. Tasks were classified as Inductive Transfer Learning, Transductive Transfer Learning, or Unsupervised Transfer Learning based on label availability for the source and target domains. A summary is shown in Fig. 1 as in [6], which shows how DA is a subset of TL. The authors in [7] define TL in terms of the domains and the given tasks. They classify TL as homogeneous when the feature space is the same and heterogeneous when the feature spaces are different, as shown in Fig. 2. They also clarify that the domain adaptation process seeks to change a source domain to match the target domain more closely. The terms supervised and unsupervised refer to the availability of labeled data in the source domain; for the target domain, the corresponding terms are informed and uninformed. A word of caution is also given on negative transfer, which occurs when the learned information detrimentally affects the target domain.
The authors in [8] elaborate on the transfer learning categories and present about forty representative approaches to transfer learning along with experimental verification. The broad categories are shown in Fig. 3.
The notations given in [6] and [7] are used to explain the concepts of DA. Let the labeled source domain data be given by $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{M}$, with $i$th sample $x_i^s$ and label $y_i^s$. The number of source images is given by $M$.
Let the unlabeled target domain data be given by $D_t = \{x_j^t\}_{j=1}^{N}$. The number of target images is given by $N$. The difference in data distributions, as shown in Fig. 4, is given by $P(X_s, Y_s) \neq P(X_t, Y_t)$. Many researchers have surveyed TL [6], [7], [8], [9], [10] and DL [11], [12], [13], [14], [15]. Moving from machine learning to deep learning, these authors have methodically explained the nuanced terminology and clarified inconsistencies in the terms used to explain the concepts of TL and DL.
Among the most commonly used metrics is MMD. It measures the distance between the means of the two distributions after they are mapped into a reproducing kernel Hilbert space (RKHS). Maximum Mean Discrepancy (MMD) [17], [21], [22] is thus a distribution distance metric. The MMD [23] between two distributions $s$ and $t$ is given by

$$\mathrm{MMD}^2(s, t) = \sup_{\|\phi\|_H \le 1} \left\| E_{x^s \sim s}[\phi(x^s)] - E_{x^t \sim t}[\phi(x^t)] \right\|_H^2$$

where sup ("supremum") is the least upper bound (a generalization of "max"), $E$ is the expectation with respect to the distribution, and $\phi$ maps the original data to the RKHS. The detailed proofs are given in [24]. The notations for MCD are also taken from [24], where one can find the detailed proofs.
In these notations, $H$ is the RKHS over $X$, $\{e_i \mid i \in I\}$ is an orthonormal basis of $H$, and $\|a\| = \left( \sum_{i,j \in I} a_{ij}^2 \right)^{1/2}$; the covariance cov is defined as in [24].

C. Maximum Mean and Covariance Discrepancy (MMCD)
The authors in [24] have shown that MMCD-based domain adaptation achieves better results for image classification. MMCD carries both the first- and second-order statistical information in the RKHS. The notations for MMCD are taken from [24], where one can find detailed proofs.
Here $\beta$ is a non-negative parameter used to balance the MCD term, and $C$ is a centered covariance operator. The authors show that, with a polynomial kernel of degree $d = 1$, the MMD and MCD terms of MMCD measure the difference between the means and the covariances of the distributions, respectively.
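For orientation, the three discrepancies can be restated in their commonly used RKHS forms. This is a hedged restatement built from the symbols $\mu[p]$, $C$, and $\beta$ used in this paper; the square-root combination in the last line and the tensor-product form of $C[p]$ are assumptions, and the exact equations and normalizations should be taken from [24].

```latex
\begin{align*}
\mathrm{MMD}(p, q)  &= \left\| \mu[p] - \mu[q] \right\|_{H},
  & \mu[p] &= E_{x \sim p}[\phi(x)], \\
\mathrm{MCD}(p, q)  &= \left\| C[p] - C[q] \right\|_{\mathrm{HS}},
  & C[p] &= E_{x \sim p}\!\left[ (\phi(x) - \mu[p]) \otimes (\phi(x) - \mu[p]) \right], \\
\mathrm{MMCD}(p, q) &= \left( \mathrm{MMD}(p, q)^2 + \beta\, \mathrm{MCD}(p, q)^2 \right)^{1/2},
  & \beta &\ge 0 .
\end{align*}
```

With $\beta = 0$ the expression reduces to MMD, which is one way to see why MMCD carries at least as much information as MMD alone.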

IV. DEEP DOMAIN ADAPTATION FOR FACE RECOGNITION
The authors in [25], [26], [23] discuss the approaches and challenges of deep domain adaptation in the context of face recognition, which is a challenging task. In real-life face recognition applications, there are domain shifts due to changing conditions such as background, location, pose, occlusion, illumination, and other factors.
In [25], the authors have used the TaoMM dataset created using face images of Chinese fashion models. They combined the CASIA-WebFace [27] and VGGFace-Good [28] datasets and used about 1.3 million images to train their model. They also trained the model on their TaoMM dataset. These trained models were then tested on the LFW dataset [29] which has a different distribution than the TaoMM dataset. The learned weights of labeled data are transferred to initialize the training model. They also refine all weights using face verification loss in an end-to-end framework.
Their system architecture consisted of a modified inception-v2 [30] model trained using Stochastic Gradient Descent. They used an NVIDIA GTX TITAN X GPU and pre-trained for 25 epochs, which took 89.4 hours, with a learning rate of 0.2 that was halved every five epochs. A further stage used a learning rate of 0.04, halved every ten epochs, and ran on two similar GPUs for 20 epochs, which took 18.6 hours. The two GPUs were needed as the model was complex and the mini-batch size was 360. Their results are comparable to state-of-the-art single models like DeepFace [31], DeepID [32] and BaiduFace [33].
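The step-decay schedules reported for [25] can be illustrated with a few lines of Python. This is only a sketch: the function name `halving_schedule` is hypothetical, epochs are assumed to be 0-indexed, and the halving is assumed to take effect exactly at each multiple of the interval, which [25] may implement differently.

```python
def halving_schedule(base_lr: float, halve_every: int, epoch: int) -> float:
    """Learning rate at a 0-indexed epoch when the rate is halved every `halve_every` epochs.

    Assumption: the rate drops exactly at epochs halve_every, 2 * halve_every, ...
    """
    return base_lr * 0.5 ** (epoch // halve_every)


# First stage described above: 25 epochs, start at 0.2, halve every 5 epochs.
pretrain_lrs = [halving_schedule(0.2, 5, e) for e in range(25)]

# Second stage described above: 20 epochs, start at 0.04, halve every 10 epochs.
second_stage_lrs = [halving_schedule(0.04, 10, e) for e in range(20)]

if __name__ == "__main__":
    print(pretrain_lrs[0], pretrain_lrs[5], pretrain_lrs[24])
    print(second_stage_lrs[0], second_stage_lrs[10])
```

Under these assumptions the first stage ends at one sixteenth of its initial rate and the second stage at one half.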
The authors in [23] use clustering-based domain adaptation (CDA). They explain why unsupervised domain adaptation methods for object classification are not applicable to face recognition tasks: a larger discriminating power is required for the classification of faces, and the classes in the two domains are non-overlapping. CDA generates pseudo-labels and uses cosine similarity to form clusters. They also use the deep domain confusion network (DDC) [34] and deep adaptation networks (DAN) [35], in which an MMD estimator is integrated into the CNN loss to minimize domain divergence. The final classification is thus based on features that are invariant to domain changes.
They trained the CNN with labeled source data and fine-tuned it with clustered, pseudo-labeled target data, which helps determine the discriminative representation of the target data. They evaluated their method on the GBU [36], IJB-A/B/C [37], [38], [39] and RFW [40] datasets. The architectures that they used were VGGNet [41] and ResNet-34 [42]. Both architectures are trained on CASIA-WebFace, the former tuned using Softmax loss and the latter with Arcface loss [43]. They preprocessed the images of the datasets by resizing, aligning and augmenting them. A Gaussian kernel is used in the MMD.
Their results outperform LRPCA-face [36], Fusion [44], VGG [44], Arcface [43], DDC [34] and DAN [35] on the GBU dataset. They remark that a uniform face-alignment algorithm can achieve good FR performance and that incorporating MMD helps minimize domain discrepancy. Similarly, better performance is obtained on the IJB-A/B/C and RFW datasets. They also showed visual representations of the learned features using t-distributed stochastic neighbor embedding (t-SNE) [45].
it uses the selected features for latent representation shared between encoder-decoder pairs.
The latent representation is the distribution of collected traits that serves as the communication protocol between the encoder-decoder pair. In practice, the encoding and decoding distributions are parametric models. Joint optimization that leads to reliable reconstruction ensures that the latent features contain the most salient statistical features and capture variations over the main features.
The Face Recognition task falls under categorical marginal distribution. Assuming that $\phi$ and $\theta$ are the parameter sets of the encoder and decoder, optimized for minimum reconstruction loss, the VAE objective function can be written as:

$$\mathcal{L}_{\mathrm{VAE}} = E_{p(x)} E_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \gamma\, D\left(q_\phi(z) \,\|\, p_\theta(z)\right)$$

where $D$ is any strict divergence, $\gamma > 0$ is a scaling coefficient, $E$ is the expectation operator, and $q_\phi$ and $p_\theta$ are the distribution functions of the encoder and decoder, respectively. The selection of the divergence can play a crucial role. Traditionally, the evidence lower bound (ELBO) criterion is used in VAEs. The goal of the encoder is to obtain a simplified approximate distribution $q$ and to optimize the variational parameter $\phi$ such that $q_\phi$ is as similar as possible to the true distribution of the inputs. One approach is to minimize the Kullback-Leibler (KL) divergence, defined as:

$$\mathrm{KL}\left(q_\phi(w) \,\|\, p(w \mid D)\right) = \int q_\phi(w) \log \frac{q_\phi(w)}{p(w \mid D)}\, dw \tag{5}$$

where $p(w \mid D)$ is the actual distribution of the input samples $w$. The intractability due to the integration term in Equation (5) is resolved by substituting an approximation for $p$ in terms of $q_\phi$. This substitution results in the popular Bayes by Backprop [46], a tractable objective function. ELBO suffers from an uninformative latent code and variance overestimation in the feature space. Also, ELBO-VAE tends to over-fit the data, and as a result of the over-fitting, it learns a $q_\phi(z)$ whose variance tends to infinity.
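As a concrete instance of the KL regularizer, the sketch below evaluates the well-known closed form of the KL divergence between a diagonal Gaussian and a standard normal prior. It rests on assumptions that the text does not state: that the KL-VAE baseline uses a diagonal Gaussian encoder parameterized by a mean and a log-variance, and that the prior is standard normal. The function name is hypothetical.

```python
import math
from typing import Sequence


def kl_diag_gaussian_to_standard_normal(mu: Sequence[float],
                                        log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2)).
    Assumption: diagonal Gaussian posterior and standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# A posterior equal to the prior has zero divergence.
kl_same = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 with unit variance costs 0.5 nats.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0], [0.0])

# Inflating the variance is penalized only logarithmically less than linearly.
kl_wide = kl_diag_gaussian_to_standard_normal([0.0], [math.log(4.0)])
```

In a VAE trained with this term, the value would be averaged over a mini-batch and weighted by the scaling coefficient before being combined with the reconstruction term.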

3) Proposed MMCD-VAE Model for Domain Adaptation:
The proposed MMCD-VAE model for domain adaptation is shown in Fig. 6. The encoder generates the same distribution for all possible variations in a sample's inputs, which works for learning good features. Regularization is possible because the input is encoded to a distribution with some variance instead of a point. Regularization aims at continuity and completeness in the generative process. The distributions are forced to be as close as possible to a standard normal distribution.
MMD evaluates two distributions as identical if and only if all their moments are the same. Therefore, MMD divergence is a metric of the differences between the moments of the $p(z)$ and $q(z)$ distributions and is computed using the kernel embedding trick [47]. MMD prefers to maximize the mutual information between an input $x$ and the latent representation $z$. Training ELBO on a dataset with complementary samples will still try to obtain the encoder $q_\phi$ and decoder $p_\theta$ as Gaussian distributions with non-zero variance. For ELBO, the regularization term $\gamma D(q_\phi(z) \,\|\, p_\theta(z))$ is not strong enough against the loss function term $E_{p(x)} E_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$. Complementary samples will have class means far apart, and accordingly, MMD optimization will end up with two modes of $q_\phi$ pushed to stay far from each other. This will reduce ambiguity in reconstruction. In practice, this matters for datasets with fewer samples. The loss function (objective function), which combines the log-likelihood with the MMCD distance, indicates the degree to which the test image has been reconstructed. In it, $\mu[p] = E_x[\phi(x)]$ and $\beta$ is a non-negative parameter.
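The kernel embedding trick mentioned above can be sketched with NumPy as a biased (V-statistic) estimate of the squared MMD under a Gaussian kernel. The function names, the fixed bandwidth of 1.0, and the synthetic samples are all illustrative assumptions; the kernel settings actually used for MMCD-VAE are not given in the text.

```python
import numpy as np


def gaussian_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_biased(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of MMD^2: mean k(x,x) + mean k(y,y) - 2 mean k(x,y)."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


# Synthetic latent codes (illustrative only): two draws from the same
# distribution and one draw from a shifted distribution.
rng = np.random.default_rng(0)
z_a = rng.normal(0.0, 1.0, size=(200, 2))
z_b = rng.normal(0.0, 1.0, size=(200, 2))
z_shifted = rng.normal(3.0, 1.0, size=(200, 2))

mmd_same = mmd2_biased(z_a, z_b)
mmd_shifted = mmd2_biased(z_a, z_shifted)
```

Samples from the same distribution give a value near zero, while the shifted sample gives a clearly larger one, which is the behavior a divergence term in the VAE objective relies on.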
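The final steps of the training algorithm define the objective function (L) using the log-likelihood and the MMCD distance, and return the trained encoder $q_\phi$ and the trained decoder $p_\theta$. Here $x \sim p$ and $y \sim q$. Given finite samples $X$ and $Y$ drawn from $p$ and $q$ respectively, $\mu_p = \frac{1}{n} X \mathbf{1}_n$ is the mean vector and $\Sigma_p = \frac{1}{n} X H_n X^\top$ is the covariance matrix of $X$, where $H_n = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$ is the centering matrix.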
Equation (8) is then substituted into Equation (6). The authors in [24] have experimented with different kernel-based and non-kernel-based cases. The kernels used were linear, polynomial, Gaussian, and exponential. When a linear kernel is adopted, MMD measures the difference between the means of the distributions, MCD measures the difference between their covariances, and MMCD measures both.
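The linear-kernel case can be sketched directly from sample means and covariances. The following NumPy snippet is an illustration under stated assumptions, not the implementation used in this work: the function name `linear_mmcd` is hypothetical, samples are stored as rows, covariances are normalized by the sample count, and the terms are combined as the square root of MMD squared plus $\beta$ times MCD squared, which should be checked against [24].

```python
import numpy as np


def linear_mmcd(x: np.ndarray, y: np.ndarray, beta: float = 1.0):
    """Linear-kernel MMD, MCD and MMCD between sample sets.

    x: (n, d) and y: (m, d), one sample per row.
    Assumption: MMCD = sqrt(MMD^2 + beta * MCD^2).
    """
    mean_gap = x.mean(axis=0) - y.mean(axis=0)
    # bias=True normalizes by the number of samples (1/n), not 1/(n-1).
    cov_x = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False, bias=True))
    mmd = float(np.linalg.norm(mean_gap))                   # gap between means
    mcd = float(np.linalg.norm(cov_x - cov_y, ord="fro"))   # gap between covariances
    mmcd = float(np.sqrt(mmd ** 2 + beta * mcd ** 2))
    return mmd, mcd, mmcd


# Toy data: four points with mean (1, 1) and identity covariance.
source = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
translated = source + 3.0   # mean changes, covariance does not
scaled = 2.0 * source       # both mean and covariance change

mmd_t, mcd_t, mmcd_t = linear_mmcd(source, translated)
mmd_s, mcd_s, mmcd_s = linear_mmcd(source, scaled)
```

For the translated set only the mean term is non-zero, so MMD alone would already detect the shift; for the scaled set most of the discrepancy sits in the covariance term, which is the information MMCD adds.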

1) Bollywood Celebrities Dataset:
The Bollywood Celebrities dataset [48] contains the localized faces of 100 Bollywood celebrities. Each class has 80 to 150 samples of size 64 × 64 pixels. The images are captured in wild conditions with different orientations, illuminations, and age transitions. Sample images are shown in Fig. 7. Experimentation is carried out on 64 × 64 RGB images.

C. Objective Image Quality Comparison Metrics
The quality of the images needs to be evaluated using either a subjective or objective method. The former is based on human judgment, and the latter is by explicit numerical statistical parameters.

1) SSIM:
Traditionally, the most popular metric for image quality assessment was the Peak Signal to Noise Ratio (PSNR). A standard metric now is the Structural Similarity Index (SSIM), which measures the similarity between two images. It was developed by Wang [49] and looks at changes in the structural information of the images. SSIM considers three factors: loss of correlation, luminance distortion, and contrast distortion [50]. For the SSIM index, a value of 0 means no correlation between the images, and 1 means the two images are the same.
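The three factors combine into a single expression that can be sketched with NumPy. The version below is a simplification and an assumption on our part: it computes one global SSIM value over the whole image, whereas the standard index of [49] averages the same expression over local windows, so its numbers will not match a windowed implementation. The constants 0.01 and 0.03 are the customary stabilizers, and the function name is hypothetical.

```python
import numpy as np


def global_ssim(img_a, img_b, data_range: float = 255.0) -> float:
    """Single-window (global) SSIM between two equally sized grayscale images.

    Simplification: statistics are taken over the full image, not local windows.
    """
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2.0 * mu_a * mu_b + c1) * (2.0 * cov_ab + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


# Illustrative 64 x 64 test image and two distorted versions of it.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
noisy = np.clip(reference + rng.normal(0.0, 20.0, size=reference.shape), 0, 255)
inverted = 255.0 - reference

ssim_identical = global_ssim(reference, reference)
ssim_noisy = global_ssim(reference, noisy)
ssim_inverted = global_ssim(reference, inverted)
```

An image compared with itself scores 1, added noise lowers the score, and an inverted image scores lowest because its pixel values are anti-correlated with the reference.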
2) PieAPP:
PieAPP [51] is a perceptual image-error metric that robustly predicts visual differences as humans perceive them. It uses pairwise preference as a robust way to create large image quality assessment (IQA) datasets and uses a new pairwise-learning framework to train an error-estimation function. A reference image and a distorted image are given as input, and a PieAPP value is produced as output. The lower the value of the PieAPP error metric, the better the perceptual quality of the image.

3) SIFT Features:
Face recognition is challenging compared to many other object recognition tasks, as the face features in the two domains are often non-overlapping. Global alignment of the source and target samples is not feasible for unconstrained face images. The goal of the proposed unsupervised domain adaptation model is to discover novel domain-invariant representations, using the scale-invariant feature transform (SIFT) [52] as a parametric evaluation entity for the domain adaptation. Some authors [53], [54] have used SIFT for face recognition but have not used a VAE. The challenge is to maximize the scale-invariant features and thus obtain the corresponding matches.
Many domain adaptation algorithms match the distribution without understanding the goodness in preserving key spatial features. This work analyses domain adaptation by optimizing encoder and decoder parameters. We use training samples and utilize unlabeled testing samples.

A. Experimental Setup
The domain adaptation experiments were conducted on an NVIDIA GeForce RTX 2070 SUPER GPU. The PC has a 3.80 GHz Intel Core i7-10700K processor with 8 cores and 16 hardware threads, 32 GB of memory, and a 1 TB SSD. The software used was Python version 3.8.5 (64-bit) with the NumPy, Matplotlib, TensorFlow, and Keras libraries.
The Bollywood Celebrities dataset was used for training. As the images for this dataset are 64 × 64, the target images were resized to 64 × 64. 300 epochs were used to train the model.

1) Training and Testing on Bollywood Celebrities Dataset:
In [24], the authors used MMCD and compared the classification performance on two benchmark datasets, PIE and Office-Caltech. Their performance was better than nearest neighbor, principal component analysis, correlation alignment, transfer component analysis, geodesic flow kernel, and joint domain adaptation. We have combined MMCD with a VAE; the training and testing details are given below.
The MMCD-VAE model was first trained on the Bollywood Celebrities dataset for 300 epochs and then tested on different images from that dataset. The images generated by the KL-VAE and MMCD-VAE models when training and testing on the Bollywood Celebrities dataset are shown in Fig. 9. The SSIM and PieAPP error metric comparison is shown in Table I. MMCD-VAE performs better than KL-VAE: it shows an average 20% improvement in SSIM and a remarkable improvement in the perceptual quality of the image, as seen from the PieAPP error metric, over the conventional KL-VAE model. Fig. 10 demonstrates the SIFT features for the generated Bollywood Celebrities images. The proposed MMCD-VAE method is also applied to face images of the same class but varying domains, and the generated face images are tested for inter-class similarity, as shown in Fig. 11. It can be seen that the MMCD-VAE generated images have comparatively more SIFT key points than the conventional KL-VAE generated images. More scale-invariant features indicate that the proposed MMCD-VAE can capture more information.
The reconstruction loss measures how well the test image has been reconstructed and is shown in Fig. 12. We observe that MMCD-VAE model training is stable, like that of the conventional KL-VAE, which demonstrates that the MMCD-VAE reconstruction loss is a meaningful metric of progress.
2) Training on Bollywood Celebrities Dataset and Testing on TinyFace Dataset:
In VAE networks, the latent representations correspond to different levels of abstraction mapped to multifarious face attributes. The better the hidden representations, the greater the adaptation quality. The MMCD-VAE model trained on the Bollywood Celebrities dataset for 300 epochs was tested on TinyFace data. The full dataset was not tested; only a sample was used to check the results. The MMCD-VAE model performs better than the KL-VAE model, as seen from the subjective quality of the generated face images given in Fig. 13. The TinyFace dataset images are low-resolution images. Even when the original image is blurry, the generated image has clearer features of the eyes, nose, and mouth. As seen in Fig. 14, there are more SIFT key points in the MMCD-VAE generated images than in the KL-VAE generated images. The SSIM and PieAPP error metric comparison is shown in Table II. MMCD-VAE performs better than KL-VAE: it shows an average 18% improvement in SSIM and an improvement in the perceptual quality of the image over the conventional KL-VAE model. In this case, the PieAPP error metric difference between KL-VAE and MMCD-VAE is smaller than the one observed with the Bollywood Celebrities dataset images, as the TinyFace images are low resolution.

VII. CONCLUSION AND FUTURE WORK
This study reviewed the literature on domain adaptation, especially in Face Recognition. It began by looking into the challenging problem of how models trained on benchmark datasets, at times, fail in real-world scenarios. One example is test images collected from the web. The benchmark dataset on which a model is trained is often high resolution, and the model performs poorly on low-resolution target images. This happens because the source and target domains experience shifts due to changing conditions. Hence the need for domain adaptation and for the various metrics that determine the distribution discrepancy. In the experimental part, we compared the performance of the proposed MMCD-VAE model with that of the conventional KL-VAE model. Results are compared for sample images taken from the Bollywood Celebrities dataset and the TinyFace dataset. TinyFace is a challenging dataset because it is low resolution, and recognition performance drops as resolution decreases. Quantitative comparisons are shown for matching SIFT key points and SSIM. The MMCD-VAE domain adaptation method rendered images with better objective image quality, as seen in the SSIM, PieAPP, and SIFT key-point metrics.
The future scope is to carry out detailed testing on the RFW datasets to better understand how to improve face recognition across diverse races. The low-resolution surveillance face images of the QMUL-SurvFace dataset are another area for further research. An emerging area of research is adversarial discriminative domain adaptation, which reduces the difference between the source and target domain distributions using adversarial learning methods.