Joint Deep Clustering: Classification and Review

Abstract—Clustering is a fundamental problem in machine learning, and a large number of algorithms have been developed to address it. Some of these algorithms, such as K-means, operate on the original data directly; others, such as spectral clustering, apply a linear transformation to the data; still others, such as kernel-based algorithms, use a nonlinear transformation. Since clustering performance depends strongly on the quality of the data representation, representation learning approaches have been extensively researched. With the recent advances in deep learning, deep neural networks are increasingly utilized to learn clustering-friendly representations. We provide here a review of existing algorithms that jointly optimize deep neural networks and clustering methods.


I. INTRODUCTION
Clustering is a challenging problem in machine learning, as its purpose is to categorize objects into groups according to similarity measures. To this end, many clustering algorithms have been published in the literature [1]. These algorithms can be classified into two groups: hierarchical and partitional approaches. In hierarchical clustering, the data are organized into nested clusters that are merged into larger ones or divided into smaller ones. This yields a hierarchy of clusters called a dendrogram. In contrast, partitional clustering is based on the optimization of a specific cost function that enforces separation between clusters. The performance of these clustering algorithms depends on how accurately the data are represented. Hence, data representation learning is a critical step in the clustering process.
Over the past several decades, many traditional representation learning techniques have been proposed. Some of these techniques are designed to learn low-dimensional data representations with linear projections, such as unsupervised principal component analysis (PCA) [2], supervised linear discriminant analysis (LDA) [3], kernel-based PCA [4], and generalized discriminant analysis (GDA) [5]. To discover the intrinsic structure of high-dimensional data, manifold learning algorithms based on locality were introduced, such as isometric feature mapping (Isomap) [6] and locally linear embedding (LLE) [7]. In 2006, Hinton et al. [8,9] introduced the concept of deep learning by utilizing artificial neural networks (ANNs) for dimensionality reduction. Specifically, they introduced a greedy layer-wise pretraining process and a finetuning framework for deep neural network (DNN) learning. The resulting performance was better than that of state-of-the-art algorithms on MNIST [9] handwritten digit recognition and document retrieval tasks. Following this groundbreaking work, a considerable number of deep representation learning algorithms were developed.
Recently, frameworks that perform deep representation learning and clustering procedures have attracted much attention. These frameworks are referred to as deep clustering algorithms, and they can be divided into (1) separated deep clustering and (2) combined deep clustering methods. In separated deep clustering, the deep representation is learned first, and then fed into a clustering algorithm. However, because these two tasks are optimized separately, the learned representation may not be suitable or sufficient for the clustering. In combined deep clustering, the deep representation learning and clustering are jointly optimized. This implies that the clustering assignments and network parameters are reciprocally affected in every learning iteration. Such an approach yields a representation that is more suitable for clustering. Two approaches to achieve combined optimization exist: the pretraining and finetuning approach, and the joint training approach. In the pretraining and finetuning approach, the DNN is pre-trained with a nonclustering loss (network loss) to initialize the network parameters and learn an initial representation. Then, the clustering loss is used to train (finetune) the initialized network and output clusters. In contrast, in the joint training approach, the network is trained with a joint loss function that integrates the clustering loss with a nonclustering loss (network loss). In this review, we survey joint deep clustering algorithms by examining different network structures and analyzing the building blocks of these algorithms.
In Section II, we introduce deep representation learning techniques. In Section III, we describe the clustering algorithms that are utilized in joint deep clustering. In Section IV, we provide a survey of the joint deep clustering approaches, and in Section V, we present the conclusions of this survey.

II. DEEP REPRESENTATION LEARNING
Deep representation learning techniques generate multiple levels or a hierarchy of representations. In this hierarchy, the high-level representations are constructed from multiple low-level ones. These techniques are based on deep ANNs. A typical (single-layer) neural network consists of input, hidden, and output layers. The input layer receives the raw input data, whereas the output layer produces the task results, such as object classification or clustering. The hidden layer applies a nonlinear transformation to extract more abstract and composite representations from the input data. DNNs contain multiple hidden layers between the input and output layers. DNNs apply a supervised learning process, where a set of input-output pairs is provided for training. This learning process is composed of two passes: a forward pass (forward propagation) and a backward pass (backpropagation). The forward pass first randomly initializes the network parameters, that is, the connections, weights, and biases. Then, the input data are passed through the network layers, in the forward direction, to calculate the predicted output. Next, the predicted output is compared with the actual output through a task-specific loss function. An optimization technique, typically stochastic gradient descent (SGD), is then applied to minimize the loss function. The backward pass then propagates the error of each output neuron back through the network and updates the network weights so that the predicted output moves closer to the actual output.
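The forward pass and SGD weight update described above can be illustrated with a minimal sketch (ours, not from any surveyed work): a single logistic neuron trained with SGD to learn the OR function. The cross-entropy gradient, learning rate, and epoch count are illustrative choices.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(data, epochs=2000, lr=0.5, seed=0):
    """SGD on a single logistic neuron: a forward pass computes the
    prediction, then a gradient step moves it toward the target."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    b = 0.0
    for _ in range(epochs):
        for x, t in data:
            y = sigmoid(w[0] * x[0] + w[1] * x[1] + b)  # forward pass
            g = y - t       # gradient of cross-entropy wrt pre-activation
            w[0] -= lr * g * x[0]   # backward pass: weight updates
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR truth table
w, b = train(data)
preds = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in data]
print(preds)  # [0, 1, 1, 1]
```

A real DNN repeats the same two-pass cycle layer by layer, with backpropagation distributing the output error over all hidden weights.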
In the following subsections, we discuss three DNN types that have been used as representation learning techniques for clustering tasks. The first is the feedforward neural network (FNN), which falls into two categories: fully connected networks (FCNs) [10] and convolutional neural networks (CNNs) [11]; the second is the deep belief network (DBN), which is composed of stochastic probabilistic components called restricted Boltzmann machines (RBMs); and the third is the autoencoder (AE), which comes in two types, the stacked AE (SAE) and the convolutional AE (CAE), together with its variational (VAE) and adversarial (AAE) variants.

A. Feedforward Neural Networks
The FNN [12] is the simplest type of neural network, in which the connections between neurons do not form a cycle. The information in this type of network moves forward (in one direction) from the input neurons to the output neurons. In this case, there is no feedback from the output toward the input neurons. FNNs are arranged in the form of layers, as are all neural networks. Depending on the number of layers, an FNN can be a single- or a multilayer network. As mentioned above, FNNs fall into two types: FCNs and CNNs.
An FCN, also known as a multilayer perceptron (MLP) [13], consists of multiple fully connected (FC) layers, where each neuron in one layer is connected to every neuron in the previous layer. In addition, every one of these connections has its own weight. FCNs are composed of an input layer, an output layer, and an arbitrary number of hidden layers. This type of feedforward network is tailored for supervised learning.
Inspired by biological processes, the neuron connectivity pattern in CNNs mimics the organization of the animal visual cortex. The first and core building block of a CNN is the convolutional layer, where each neuron is connected to only a few neurons in the previous layer. The same set of weights is used for every neuron. The second layer is the rectified linear unit (ReLU) layer, which applies an elementwise nonlinear activation function that retains the positive parts of the inputs and replaces the negative parts with zero. The reason for applying ReLU layers in a CNN is to increase the nonlinearity of the inputs. A pooling layer is frequently inserted between two consecutive convolutional layers. The pooling layer applies a function that reduces the spatial size of the representation by combining the outputs of a set of neurons in one layer into a single neuron in the next layer. As a consequence, the number of parameters and computations throughout the network is reduced and overfitting is controlled. The final layer of a CNN is an FC layer that classifies the input. Similar to FCNs, CNNs are designed for supervised learning, and specifically to classify image datasets.
Deep clustering algorithms that employ feedforward networks for unsupervised representation learning use the clustering loss alone to train the network. Hence, these algorithms aim to optimize the objective function

L = L_c (1)

where L is the algorithm loss function and L_c is the clustering loss function. In the absence of other measures, and depending completely on the clustering loss, such deep clustering algorithms may produce a distorted representation space in which all data points are collapsed into tight clusters. Such a trivial solution yields a small but meaningless clustering loss. To alleviate this problem, in addition to careful design of the clustering loss function, suitable network parameter initialization is required to enhance the performance.

B. Deep Belief Network
DBNs [14] are a branch of DNNs composed of a stack of RBMs [15] followed by a softmax layer that applies a softmax activation function to its input. An RBM is a two-layer neural network, where the first layer is the visible (input) layer and the second is the hidden layer. A DBN is trained by greedy layer-wise unsupervised learning, with RBMs as the building blocks for each layer. Then, the parameters of the DBN are finetuned according to a task-specific loss function. DBN-based deep clustering algorithms finetune the network parameters using the clustering loss function only, and thus optimize an objective function similar to the feedforward network loss function in equation (1). Hence, careful clustering loss selection and good network parameter initialization affect the performance of the deep clustering algorithm.

C. Autoencoder
An AE [16] is a special type of neural network designed for unsupervised representation learning. It consists of three building blocks: an encoder, a bottleneck layer, and a decoder. The encoder maps the input x to its hidden representation h through a nonlinear function f_1(·), as in equation (2), and the decoder reconstructs the input from its hidden representation by using a transformation function f_2(·), as in equation (3):

h = f_1(W_1 x) (2)
x̂ = f_2(W_2 h) (3)

Here, W_1 represents the encoding weight, and W_2 the decoding weight. The encoder and decoder can comprise an FC network to construct an SAE [17], or a CNN to form a CAE [18]. The bottleneck layer controls the amount of information that traverses the network by learning a compressed representation of the input data. The learning problem can be formulated as a supervised one that aims to output the reconstruction x̂ of the input x. The entire network can be trained by minimizing the reconstruction loss L_r, which measures the difference between the original input x and the reconstruction x̂:

L_r = (1/n) Σ_i ‖x_i − x̂_i‖² (4)

AE-based deep clustering algorithms seek to optimize an objective function that combines clustering and reconstruction losses:

L = L_r + γL_c (5)

where γ is a coefficient that controls the distortion of the representation embedding space. The presence of the reconstruction loss forces the algorithm to avoid trivial solutions and learn more feasible representations.
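As an illustrative sketch (not from the surveyed papers), the encoder/decoder mappings and the reconstruction loss can be written out for a linear AE with identity activations; the toy weights and data below are chosen so that a 1-D bottleneck can reconstruct the 2-D inputs exactly.

```python
def encode(x, W1):
    """h = f_1(W_1 x); identity activation for this linear sketch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]

def decode(h, W2):
    """x_hat = f_2(W_2 h); identity activation again."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def reconstruction_loss(xs, W1, W2):
    """L_r = (1/n) * sum_i ||x_i - x_hat_i||^2 over the dataset."""
    total = 0.0
    for x in xs:
        xr = decode(encode(x, W1), W2)
        total += sum((a - b) ** 2 for a, b in zip(x, xr))
    return total / len(xs)

# 2-D points lying on the line y = 2x: a 1-D bottleneck suffices
xs = [[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0]]
W1 = [[0.2, 0.4]]        # encoder weights: 2 -> 1 (the bottleneck)
W2 = [[1.0], [2.0]]      # decoder weights: 1 -> 2
print(reconstruction_loss(xs, W1, W2))  # 0.0 — perfect reconstruction
```

For data that does not lie in a low-dimensional subspace, the bottleneck forces a lossy compression, and minimizing L_r trades off which structure is preserved.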

D. Variational Autoencoder
The VAE [19] is a generative variant of the AE that enforces the latent code to follow a predefined distribution. This goal is achieved by encoding the input data into two vectors instead of one: a mean vector and a standard deviation vector. Unlike the output of a standard AE, which points directly to the encoded value in the latent space, the VAE output points to the area where the encoded value can lie. More specifically, the VAE defines a probability distribution in which the mean controls the location of the encoding center, and the standard deviation defines the area in which the encoding can vary around the mean. As a consequence, the VAE allows interpolation and the generation of new samples. Mathematically, the VAE minimizes the Kullback-Leibler (KL) divergence [20] between the variational posterior distribution and a prior distribution. The objective function can be formulated as follows:

L_VAE = L_r_VAE + KL(q(z|x) ‖ p(z)) (6)
L_r_VAE = −E_q(z|x)[log p(x|z)] (7)

where L_r_VAE represents the reconstruction loss of the VAE, p(z) is the prior over the latent variables, q(z|x) is the variational posterior that approximates the true posterior p(z|x), and p(x|z) is the likelihood function. A Gaussian distribution is the common choice of prior; however, VAE-based clustering algorithms should choose a distribution that can describe the structure of the clusters.
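For a diagonal Gaussian posterior and a standard normal prior, the KL term in equation (6) has a well-known closed form, which can be computed directly; this snippet is an illustration of that standard formula, not code from any surveyed system.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal
    Gaussian parameterized by its mean and log-variance vectors:
    0.5 * sum_d (sigma_d^2 + mu_d^2 - 1 - log sigma_d^2)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# an encoding that already matches the prior incurs zero KL penalty
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# shifting the mean away from the prior increases the penalty
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

It is this penalty that keeps the latent codes close to the prior, which is why the choice of prior matters for cluster structure.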

E. Adversarial Autoencoder
Similar to the VAE, the AAE [21] utilizes a prior distribution to control the encoding of the input data. Hence, the decoder learns only the mapping from the prior distribution to the data distribution. The output of the AAE encoder, i.e., the encoded value, is fed as input both to the decoder and to a generative adversarial network (GAN) [19]. In the AAE, the encoder and decoder together form the generator model G, while the GAN component acts as the discriminator D. Through the learning process, the AAE establishes a min-max adversarial game between its generator and the discriminator. While the generator tries to map a sample from the prior distribution to the data space, the discriminator computes the probability that its input is a real sample from the prior distribution rather than a fake sample produced by the encoder. The training process of the AAE is handled in two phases: (1) a reconstruction phase and (2) a regulation phase. During the reconstruction phase, the generator is trained to minimize the reconstruction loss and produce a reconstructed image of the input. In the regulation phase, the discriminator parameters are updated to distinguish the real samples drawn from the prior from the fake samples generated by the encoder. The discriminator network is updated with the following discriminative loss L_d:

L_d = −(1/n) Σ_i [log D(ẑ_i) + log(1 − D(E(x_i)))] (8)

where ẑ_i and x_i are a sample from the prior distribution and an input sample, respectively, and E(·) is the encoder. Then, the discriminator is fixed, and the encoder is updated to confuse the discriminator by increasing the classification error of D on the input latent representations, using the generation loss L_g:

L_g = −(1/n) Σ_i log D(E(x_i)) (9)

AAE-based deep clustering algorithms optimize a loss function that combines reconstruction loss, generation loss, and clustering loss:

L = L_r + αL_g + βL_c (10)

where L_r, L_g, and L_c represent the reconstruction loss defined in equation (4), the generation loss in equation (9), and a clustering loss, respectively, and α and β are hyperparameters that balance the importance of the generation loss and the clustering loss, respectively.
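The two adversarial losses in equations (8) and (9) can be sketched numerically. This is an illustration only: instead of modeling the networks, we feed in hypothetical discriminator outputs D(ẑ) for prior samples and D(E(x)) for encoder codes, each a probability in (0, 1).

```python
import math

def discriminator_loss(d_real, d_fake):
    """L_d = -mean(log D(z_hat)) - mean(log(1 - D(E(x)))).
    d_real: discriminator scores on prior samples (should be near 1);
    d_fake: scores on encoder codes (the discriminator wants them near 0)."""
    n = len(d_real)
    return (-sum(math.log(p) for p in d_real) / n
            - sum(math.log(1.0 - p) for p in d_fake) / n)

def generation_loss(d_fake):
    """L_g = -mean(log D(E(x))): the encoder is rewarded when the
    discriminator scores its codes as if they came from the prior."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# a confident discriminator: prior samples near 1, encoder codes near 0
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))  # small: D is winning
print(generation_loss([0.1, 0.2]))                 # large: E must adapt
```

The alternation between the two phases lowers L_d while the encoder update lowers L_g, driving the aggregated code distribution toward the prior.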

III. CLUSTERING TECHNIQUES
As stated previously, clustering techniques can be divided into two types: hierarchical and partitional clustering. Hierarchical clustering methods iteratively merge smaller clusters into larger ones, or split large clusters into smaller ones. Hierarchical algorithms differ in the similarity measures used to determine which clusters should be merged or split. The results of hierarchical clustering are organized in a tree called a dendrogram, which shows the relationships between clusters. Conversely, partitional clustering seeks to decompose the data into a set of disjoint groups. This decomposition is achieved by minimizing a specific objective loss function. Centroid-based algorithms, such as K-means [22,23] and KL-divergence [20] clustering, distribution-based algorithms such as Gaussian mixture clustering [24], graph-based clustering algorithms such as spectral clustering [25] and RCC [26], and density-based algorithms such as DBSCAN [27] are all subtypes of partitional clustering algorithms. As existing joint deep clustering utilizes only centroid- and graph-based clustering, these two techniques are explained in the following subsections. Finally, we introduce some auxiliary clustering losses that are used in conjunction with other losses to guide deep representation learning.

A. Centroid-Based Clustering
Given a dataset X = {x_1, …, x_n} of n points together with its extracted representations Z = {z_1, …, z_n}, centroid-based clustering partitions the data points into K clusters with central representatives called centroids. These cluster centroids, denoted by ℳ = {μ_1, …, μ_K}, where K is a predefined number of clusters, do not necessarily belong to the dataset. In joint deep clustering algorithms, two centroid-based algorithms are utilized: K-means and KL-divergence clustering.
1) K-means Clustering: K-means clustering first randomly selects K centroids from the input data representations, each of which represents a cluster. The K-means algorithm minimizes the total squared error between the input data representations and the cluster centroids according to the loss function:

L_KM = Σ_i min_j ‖z_i − μ_j‖² (11)

An additional variation of the K-means loss function is the weighted least-squares error, referred to as weighted K-means. It optimizes the cost function:

L_WKM = Σ_i Σ_j w_ij ‖z_i − μ_j‖² (12)

where w_ij is a similarity weight that encodes the closeness of a data point to a cluster centroid; i.e., w_ij is larger if the data point z_i is close to the centroid μ_j. In the K-means learning process, the following two steps are repeated until convergence is reached: • Point assignment update, which is accomplished by (i) calculating the distance from each data point to every cluster centroid, and (ii) assigning each point to the cluster whose centroid is nearest.
• Centroid update, which is computed according to the following equation, where n_j is the number of points in the j-th cluster C_j:

μ_j = (1/n_j) Σ_{z_i ∈ C_j} z_i (13)

K-means performs well when the points are distributed in roughly circular (spherical) groups. Otherwise, K-means will still attempt to group the points into circular clusters, which degrades the clustering result. To remedy this issue, K-means can be replaced by a distribution-based model instead of a distance-based model.
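The two alternating K-means steps above can be sketched directly; this toy example (ours, for illustration) assigns four 2-D points to two clusters, assuming no cluster ever becomes empty.

```python
def assign(points, centroids):
    """Assignment step: give each point the index of its nearest
    centroid under squared Euclidean distance."""
    def d2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Centroid step: recompute each centroid as the mean of its
    assigned points (assumes every cluster is non-empty)."""
    dim = len(points[0])
    cents = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        cents.append([sum(p[i] for p in members) / len(members)
                      for i in range(dim)])
    return cents

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
cents = [[0.0, 0.0], [10.0, 10.0]]   # initial centroids
for _ in range(5):                   # alternate the two steps
    labels = assign(points, cents)
    cents = update(points, labels, 2)
print(labels)  # [0, 0, 1, 1]
print(cents)   # [[0.0, 0.5], [10.0, 10.5]]
```

Each iteration cannot increase the loss in equation (11), so the alternation converges to a local minimum that depends on the initial centroids.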
The Gaussian mixture model (GMM) [24] is a probabilistic soft clustering technique that tends to group points with the same distribution together. The clustering process starts by initializing the means and covariances of the K Gaussian distributions. Then, the expected assignments of all points to all clusters are calculated. Next, the distribution parameters are re-estimated, and the log-likelihood function is computed. This process continues until a predefined convergence criterion is reached.
2) KL-divergence Clustering: KL-divergence clustering is a soft-assignment clustering technique, in which each data point is assigned to all clusters with varying probabilities. The algorithm is initialized using K-means to obtain initial centroids. Next, the learning process optimizes the following Kullback-Leibler (KL) divergence loss function:

L_KLD = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij) (14)

where P is an auxiliary target distribution and Q represents the data point soft assignments. The KL-divergence clustering algorithm refines the point assignments by learning from higher-confidence points utilizing the auxiliary target distribution P. Specifically, the algorithm matches the soft assignments with the target distribution by minimizing the KL divergence. The clustering algorithm iteratively performs the following steps until convergence is obtained or the maximum number of iterations is reached: 1) Calculation of q_ij, the probability that data point i belongs to cluster j. Two means of calculating q_ij exist: (1) the Student's t-distribution [28], as in equation (15), and (2) a multinomial regression [28] function, as in equation (16):

q_ij = (1 + ‖z_i − μ_j‖²)^(−1) / Σ_j' (1 + ‖z_i − μ_j'‖²)^(−1) (15)
q_ij = exp(w_j^T z_i) / Σ_j' exp(w_j'^T z_i) (16)
2) Computation of p_ij, a higher-confidence distribution obtained by sharpening the soft assignments and normalizing by the soft cluster frequencies f_j = Σ_i q_ij, according to the formula:

p_ij = (q_ij² / f_j) / Σ_j' (q_ij'² / f_j') (17)

3) Updating the cluster centroids by gradient descent on the clustering loss:

μ_j ← μ_j − λ ∂L_KLD/∂μ_j (18)

where λ is the learning rate.
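The Student's t soft assignment of equation (15), the target distribution of equation (17), and the KL loss of equation (14) can be computed for a toy 1-D dataset; this sketch is ours, for illustration, with two well-separated pairs of points and two centroids.

```python
import math

def soft_assign(z, centroids):
    """q_ij: Student's t-kernel similarity between point z_i and
    centroid mu_j, normalized over clusters (equation (15))."""
    q = [[1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, mu)))
          for mu in centroids] for zi in z]
    return [[v / sum(row) for v in row] for row in q]

def target_distribution(q):
    """p_ij = (q_ij^2 / f_j) / sum_j' (q_ij'^2 / f_j'), with soft
    cluster frequencies f_j = sum_i q_ij (equation (17))."""
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = [[row[j] ** 2 / f[j] for j in range(len(row))] for row in q]
    return [[v / sum(row) for v in row] for row in p]

def kl_loss(p, q):
    """L_KLD = KL(P || Q) = sum_i sum_j p_ij log(p_ij / q_ij)."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

z = [[0.0], [0.1], [3.0], [3.1]]   # two tight groups on the line
mu = [[0.0], [3.0]]                # one centroid per group
q = soft_assign(z, mu)
p = target_distribution(q)
print(kl_loss(p, q))  # small but positive: p sharpens q
```

Minimizing this loss pulls Q toward the sharpened P, which is how the algorithm learns from its own high-confidence assignments.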

B. Graph-Based Clustering
Given a dataset X = {x_1, …, x_n} of n points together with their corresponding representations Z = {z_1, …, z_n}, graph clustering techniques first construct an undirected similarity graph G = (V, E), where V = {v_1, …, v_n} denotes the set of vertices representing the input data, and E is the set of edges between vertices. Several approaches for building a similarity graph [1] exist, two of which are specifically used in joint deep clustering. These approaches are the following: • K-nearest neighbor (KNN) graph: this graph connects vertex v_i with vertex v_j if v_j is among the K nearest neighbors of v_i. One problem common to KNN graphs is that this neighborhood relation is asymmetric: if v_j is among the KNNs of v_i, then v_i is not necessarily among the KNNs of v_j. Hence, the constructed graph is a directed one. To alleviate this problem, there are two solutions: the first is to insert an undirected edge between two vertices v_i and v_j if one of them is among the KNNs of the other; the second is to restrict the edges so that two vertices v_i and v_j are connected by an undirected edge only if each is among the KNNs of the other. The graph resulting from the latter solution is called a mutual KNN graph.
• Completely connected graph: this graph simply connects all vertices with each other by weighted edges. The weight of an edge between two vertices v_i and v_j represents the similarity between them. Because the graph should express local neighborhood relationships, a Gaussian similarity function is usually utilized.
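The mutual KNN construction described above can be sketched in a few lines; this toy example (ours, for illustration) uses four 1-D points and K = 1, and breaks distance ties by index order.

```python
def knn(points, k):
    """Index set of the k nearest neighbors (squared Euclidean) of
    each point; ties are broken by the smaller index."""
    out = []
    for i, p in enumerate(points):
        d = sorted((sum((a - b) ** 2 for a, b in zip(p, q)), j)
                   for j, q in enumerate(points) if j != i)
        out.append({j for _, j in d[:k]})
    return out

def mutual_knn_edges(points, k):
    """Undirected edge (i, j) only if i and j are among each
    other's k nearest neighbors (the mutual KNN graph)."""
    nbrs = knn(points, k)
    return {(i, j) for i in range(len(points)) for j in nbrs[i]
            if i < j and i in nbrs[j]}

points = [[0.0], [1.0], [2.0], [10.0]]
print(mutual_knn_edges(points, 1))  # {(0, 1)}: the only mutual pair
```

Note how the outlier at 10.0 is left isolated: its nearest neighbor does not reciprocate, which is exactly the robustness the mutual variant provides.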
The graph is represented by an adjacency matrix, in which the similarity between every two vertices is included. Two graph-based clustering algorithms are utilized in joint deep clustering techniques: spectral clustering [25] and robust continuous clustering (RCC) [26]. We briefly explain these two approaches.
1) Spectral clustering: After the construction of the similarity graph and the extraction of the adjacency matrix W, the spectral algorithm transforms the data into a low-dimensional space. To achieve this, another graph representation matrix is computed: the Laplacian matrix. The graph Laplacian matrix ℒ is computed as:

ℒ = D − W (19)

where D is the diagonal degree matrix whose entry d_i is the degree of vertex v_i, which can be computed as:

d_i = Σ_j w_ij (20)

Then, the Laplacian matrix is utilized to find the eigenvalues λ and eigenvectors v such that:

ℒv = λv (21)
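The unnormalized Laplacian of equation (19) can be computed for a toy adjacency matrix; this sketch (ours, for illustration) uses a 3-vertex path graph.

```python
def laplacian(W):
    """Unnormalized graph Laplacian: L = D - W, where D is the
    diagonal degree matrix with d_i = sum_j w_ij."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[(deg[i] if i == j else 0) - W[i][j] for j in range(n)]
            for i in range(n)]

# adjacency matrix of the path graph 0 - 1 - 2 (unit edge weights)
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = laplacian(W)
print(L)  # [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```

Every row of ℒ sums to zero, so the constant vector is always an eigenvector with eigenvalue 0; the remaining small eigenvalues and their eigenvectors carry the cluster structure that the subsequent K-means step exploits.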
Once the eigenvectors have been obtained, the low-dimensional data transformation is complete. Finally, the K-means clustering algorithm, explained in Section III-A, is applied to the transformed data (eigenvectors) to create the clusters.
2) Robust continuous clustering (RCC): RCC optimizes a set of representatives U = {u_1, …, u_n}, one per data point, by minimizing the loss:

L_RCC = L_data + λL_pairwise (22)

where λ is a coefficient that balances the two objective terms. The first term, the data loss, constrains the representatives to remain near the corresponding data points. The data loss can be computed as:

L_data = Σ_i ‖x_i − u_i‖² (23)

The second term, the pairwise loss, is designed to encourage the representatives to merge, and pulls them together according to:

L_pairwise = Σ_{(p,q)∈E} w_pq ρ(‖u_p − u_q‖₂) (24)
where {w_pq} represents appropriately defined weights, μ is a scale parameter, and ρ is a redescending M-estimator that can be calculated according to the scaled Geman-McClure function [29]:

ρ(y) = μy² / (μ + y²) (25)

The first stage in the RCC learning procedure is initialization, which includes the following steps: 1) Construction of the similarity graph G_1 = (V, E) using the mutual KNN approach.
2) Initialization of the representatives with U = X.
3) Initialization of the line process L = {ℓ_pq}, where ℓ_pq is an auxiliary variable between two connected vertices v_p and v_q, with ℓ_pq = 1.

4) Initialization of the scale parameter μ.
The optimization aims to reveal the cluster structure latent in the data; thus, the number of clusters does not need to be known in advance. The following optimization steps are repeated until a maximum number of iterations is reached, or until the difference between the clustering loss in two consecutive iterations falls below a predetermined threshold.

1) Update of the line process variables according to the following formula:

ℓ_pq = (μ / (μ + ‖u_p − u_q‖²))² (26)
2) Update of the representations U = {u_1, …, u_n} by solving a linear least-squares system in which I is the identity matrix and e_p is an indicator vector with the p-th element set to 1; the system balances the data term against the line-process-weighted pairwise term. 3) Update of the scale parameter μ, which is progressively decreased toward a lower bound determined by δ,
where δ is a threshold set to the mean of the lengths of the shortest 1% of the edges in E. Then, RCC constructs a new graph G_2 = (V, ℇ) in which an edge (p, q) is retained only if ‖u_p* − u_q*‖₂ ≤ δ. Finally, the algorithm outputs the clusters given by the connected components of G_2.
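The Geman-McClure estimator of equation (25) and the line-process update of equation (26) can be evaluated numerically; this sketch (ours, for illustration) shows why distant representative pairs stop influencing the objective.

```python
def geman_mcclure(y, mu):
    """Scaled Geman-McClure estimator rho(y) = mu*y^2 / (mu + y^2):
    it grows like y^2 for small residuals but saturates below mu,
    so far-apart pairs contribute a bounded penalty."""
    return mu * y * y / (mu + y * y)

def line_process(dist2, mu):
    """RCC auxiliary variable for an edge, as a function of the
    squared distance ||u_p - u_q||^2: l_pq = (mu / (mu + dist2))^2.
    Near 1 for close representatives, near 0 for distant ones."""
    return (mu / (mu + dist2)) ** 2

mu = 1.0
print(line_process(0.01, mu))   # close pair: weight near 1, edge active
print(line_process(100.0, mu))  # distant pair: weight near 0, edge cut
```

As μ is decreased during optimization, the transition between "active" and "cut" edges sharpens, which is what gradually reveals the connected components that become the final clusters.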

C. Auxiliary Clustering Losses
Some clustering loss functions are designed to guide deep representation learning techniques toward feasible clustering-oriented representations; they cannot, however, output clusters on their own. These functions are known as auxiliary clustering losses. Considering a dataset X = {x_1, …, x_n} of n points together with its extracted representations Z = {z_1, …, z_n}, we present the auxiliary clustering losses that have been used in joint deep clustering algorithms.

1) Balanced assignment loss:
Balanced assignment loss is used in conjunction with another clustering loss to enforce balanced cluster assignments. The difference between two distributions, F and U, is measured with the KL divergence as follows:

L_BL = KL(F ‖ U) (32)

where P is the target distribution proposed in equation (17), U is the uniform distribution, and F is the empirical cluster-probability distribution, which can be calculated as:

f_j = (1/n) Σ_i p_ij (33)
2) Locality-preserving loss: Locality-preserving loss preserves the local structure of the original data by pulling nearby points together:

L_LP = Σ_i Σ_{j∈N_k(i)} s_ij ‖z_i − z_j‖² (34)

where N_k(i) is the set of k nearest neighbors of the data point x_i and s_ij is a similarity measure between x_i and x_j.
3) Group sparsity loss: Group sparsity loss was inspired by spectral clustering, where a block-diagonal similarity matrix is utilized for representation learning. Specifically, the hidden units are divided into G groups, where G is the assumed number of clusters. For each data point x_i, after its representation has been extracted, a group decomposition {z_i^(g)}_{g=1}^G is obtained. The group sparsity loss is then computed as:

L_GS = Σ_i Σ_g λ_g ‖z_i^(g)‖ (35)

where z_i = f(x_i) with f(·) the representation encoding function, λ_g = λ√n_g are the group weights, λ is a constant, and n_g is the group size.

4) Self-expressiveness loss: Self-expressiveness is the property whereby a point in a union of subspaces can be expressed as a linear combination of other points in the same subspace. Let Z be a column matrix of all data points; self-expressiveness can then be represented as Z = ZC, where C is the self-representation coefficient matrix. By minimizing a certain norm of C, and under the assumption that the subspaces are independent, C is guaranteed to have a block-diagonal structure (up to permutation). This ensures that c_ij ≠ 0 only if the data points x_i and x_j lie in the same subspace. The matrix C can then be leveraged by spectral clustering to construct the affinity matrix. Accordingly, each data representation z_i in a latent subspace is approximated by a weighted linear combination of the other points {z_j}_{j=1}^n with weights c_ij. To encode self-expressiveness, the following auxiliary clustering loss function is introduced:

L_SE = λ_1 ‖C‖ + (λ_2/2) ‖Z − ZC‖_F² (36)

where λ_1 and λ_2 are two regularization parameters to account for data corruption, and ‖·‖ represents an arbitrary matrix norm.
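A loss of the form of equation (36) can be evaluated for tiny matrices; this sketch is ours, for illustration, and fixes the arbitrary norm in the first term to the squared Frobenius norm. The toy Z has both columns in the same 1-D subspace, so a zero-diagonal C can express each point through the other exactly.

```python
def matmul(A, B):
    """Plain dense matrix product for small examples."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def self_expressive_loss(Z, C, lam1, lam2):
    """lam1 * ||C||_F^2 + (lam2 / 2) * ||Z - Z C||_F^2, with the
    columns of Z as data points (Frobenius norm chosen for this
    sketch; other matrix norms are possible)."""
    ZC = matmul(Z, C)
    rec = sum((Z[i][j] - ZC[i][j]) ** 2
              for i in range(len(Z)) for j in range(len(Z[0])))
    reg = sum(c * c for row in C for c in row)
    return lam1 * reg + lam2 / 2.0 * rec

Z = [[1.0, 2.0],
     [2.0, 4.0]]   # columns are points; column 2 = 2 * column 1
C = [[0.0, 2.0],
     [0.5, 0.0]]   # zero diagonal: no point expresses itself
print(self_expressive_loss(Z, C, 0.1, 1.0))  # only the norm penalty remains
```

Here Z = ZC holds exactly, so the residual term vanishes and only the regularizer on C contributes; with noisy data the two terms trade reconstruction accuracy against a sparse, block-diagonal C.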

IV. JOINT DEEP CLUSTERING
Given a dataset X = {x_1, …, x_n} of n points, the goal of joint deep clustering techniques is to simultaneously learn a low-dimensional representation Z = {z_1, …, z_n} for the data and cluster it into K groups. This can be accomplished by optimizing a joint loss function that combines two losses: the representation learning loss and the clustering loss. Then, the low-dimensional representations, network parameters (weights and biases), and clustering parameters and assignments are updated jointly. In this section, we survey these algorithms and provide a taxonomy from the perspective of the clustering algorithm. Table I summarizes existing joint deep clustering algorithms.

A. Deep Kullback-Leibler Divergence Clustering
Guo et al. [28] proposed improved deep embedded clustering (IDEC), an algorithm that simultaneously learns low-dimensional representations and cluster assignments. The IDEC algorithm consists of two phases: (1) parameter initialization and (2) parameter optimization and clustering. In the initialization phase, IDEC trains a denoising SAE [17], which reconstructs a data point x from a corrupted (noisy) version x̃, forcing the encoder and decoder to implicitly capture the structure of the data-generating distribution. The SAE is trained with the reconstruction loss to obtain initial values for the network weights and biases. The cluster centroids are initialized by applying K-means to the representations extracted by the encoder. When initialization is complete, IDEC removes the noise from the data so that clustering is applied to representations learned from the clean data. With the noise removed, the denoising SAE degenerates into a traditional SAE, which constrains the dimension of the hidden representation h to be less than the dimension of the input data x. Then, the optimization and clustering phase is executed by finetuning with the KL-divergence clustering loss and the SAE reconstruction loss. This results in the joint loss function

L = L_r + γL_KLD (37)

where L_r is the reconstruction loss in equation (4), L_KLD is the KL-divergence clustering loss in equation (14), and γ is a regularization parameter that balances the two terms. Clustering is achieved by alternating between computing the soft assignments based on the Student's t-distribution formula in equation (15) and computing the auxiliary target distribution in equation (17). IDEC jointly optimizes the cluster centers and the network parameters using an SGD algorithm [30]. The gradient of the clustering loss is calculated with respect to each cluster centroid μ_j and point representation z_i, and is then utilized in backpropagation. Experimental results have demonstrated the importance of locality preservation. Guo et al.
[31] developed a deep clustering method with CAEs (DCEC) for image clustering; the DCEC framework is very similar to the IDEC model, but instead of an SAE, DCEC employs a CAE to better capture the relationships between image pixels. The advantage of the CAE over the SAE has also been demonstrated on image datasets. Similar to IDEC, Zhou et al. [21] introduced deep embedded clustering with adversarial distribution adaptation (ADEC). Instead of an SAE, ADEC utilizes an AAE to learn the mapping from data space to feature space. With a backpropagation algorithm, ADEC iteratively optimizes the following objective function:

L ADEC = L r + αL g + βL KLD (38)

where L_r, L_g, and L_KLD are the reconstruction loss defined in equation (4), the generation loss defined in equation (9), and the KL-divergence clustering loss defined in equation (14), respectively, and α and β are hyperparameters that balance the importance of the generation loss and the clustering loss, respectively. In deep learning, optimizing a neural network loss function whose secondary component competes strongly with the primary one may lead to feature drift: the global learning process suffers because the features learned by the primary loss can easily be drifted by updates driven by the secondary loss. Conversely, discarding the primary or secondary loss leads to substituting a significant portion of the true supervision signal with a random one, a phenomenon known as feature randomness. Mrabah et al. [32] enhanced the IDEC approach to reach a better trade-off between feature drift and feature randomness using an AAE complemented with data augmentation.
Dizaji et al. [33] proposed the deep embedded regularized clustering (DEPICT) model to learn data representation and perform the clustering task. DEPICT has a complicated network architecture composed of a softmax layer on top of a multilayer CAE. More specifically, DEPICT consists of four components: two encoders, one decoder, and one softmax layer. The encoder and decoder elements of the DEPICT network are referred to as paths.
Thus, there are three paths in the DEPICT architecture. The first path, called the noisy encoder, is the encoder part of the denoising CAE; it accepts noisy input data to infer noisy hidden representations. The second path, called the noisy decoder (or just the decoder), is the decoder element of the denoising CAE; it reconstructs the input from the learned noisy representations. The decoder element consists of a strided CNN, which is similar to a traditional CNN except that the convolutional kernel stride is greater than 1. The third path, called the clean encoder, is a CNN that accepts clean input data to infer clean hidden representations. The clean and noisy encoder paths share the same network parameters, i.e., weights and biases. The softmax layer (the fourth component of the network) is stacked on top of the noisy encoder's top layer and the clean encoder's top layer to obtain the clustering assignments. The first phase of the algorithm is initialization, where the network parameters, cluster centroids, and target distribution are initialized. Instead of initializing the network parameters randomly, DEPICT draws the weights from a Gaussian distribution in which the input and output variances are the same for each layer; this initialization approach is known as Xavier (or normalized) initialization [34]. Next, DEPICT is trained with the reconstruction loss only (without the clustering loss) to obtain initial embedded representations of the input data. Then, K-means clustering is applied to obtain the initial cluster centroids and the initial target distribution. When the initialization phase is complete, the optimization and clustering phase starts.
In the softmax layer, DEPICT iteratively minimizes the following three-term joint loss function:
L_DEPICT = L_r_DEPICT + L_KLD + L_BL (39)
where L_KLD and L_BL are the KL-divergence and balanced assignment losses that were introduced in equations (14) and (32), respectively. The first term, L_r_DEPICT, is a data-dependent regularization term: a reconstruction loss introduced in DEPICT to enhance the representation learning process and avoid overfitting. The reconstruction loss between the noisy decoder and clean encoder representations is computed as
L_r_DEPICT = (1/N) Σ_{i=1}^{N} Σ_{l=0}^{L−1} (1/|z^l|) ||z_i^l − ẑ_i^l||² (40)
where N is the size of the input data, L is the number of noisy decoder and clean encoder layers, l is the layer number, |z^l| is the l-th layer output size, z_i^l is the l-th layer of clean representations (from the clean encoder), and ẑ_i^l is the l-th layer of noisy representations (from the noisy decoder). The second term of the DEPICT joint loss function is the KL-divergence clustering loss. A multinomial logistic regression function is employed to perform the soft clustering assignment. Note that DEPICT computes the soft assignment predictions from the noisy representations extracted by the noisy encoder, whereas the target distribution is computed from the clean representations extracted by the clean encoder path. The third term is a regularization term that encourages balanced cluster assignments and prevents clusters from being allocated to outlier samples. The effectiveness of DEPICT has been demonstrated empirically, especially in terms of running time.
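A minimal NumPy sketch of the layer-wise reconstruction loss in equation (40); the function name is hypothetical, and `clean_layers` and `noisy_layers` stand in for lists of per-layer activations, one array of shape (N, |z^l|) per layer.

```python
import numpy as np

def depict_reconstruction_loss(clean_layers, noisy_layers):
    """Layer-wise reconstruction loss of equation (40): squared
    difference between clean-encoder and noisy-decoder representations,
    normalized by each layer's output size and averaged over N samples."""
    n = clean_layers[0].shape[0]  # number of samples N
    loss = 0.0
    for z, z_hat in zip(clean_layers, noisy_layers):
        loss += ((z - z_hat) ** 2).sum() / z.shape[1]
    return loss / n
```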

B. Deep K-Means Clustering
Huang et al. [36] introduced a deep embedding network, referred to as DEN, to learn clustering-oriented representations using a three-layer SAE. As with most deep clustering algorithms, the DEN learning procedure is composed of two phases: initialization (pretraining) and optimization. In the pretraining phase, a three-layer DBN [14] is trained on the contrastive divergence loss only, to initialize the SAE parameters. Then, the learned representation from the DBN is fed into the three-layer SAE to begin the joint training optimization process. In this phase, DEN minimizes the joint loss function
L_DEN = L_r + αL_LP + βL_GS (41)
where L_r is the reconstruction loss in equation (4), L_LP is the locality-preserving auxiliary clustering loss defined in equation (34), L_GS is the group sparsity auxiliary clustering loss expressed in equation (35), and α and β are tuning parameters. Through these two auxiliary clustering losses, DEN imposes two constraints on the learned representations: a locality-preserving constraint, to preserve the local structure of the original data, and a group sparsity constraint. These constraints facilitate the clustering process and ensure that the learned representation incorporates cluster information, making it more suitable for clustering. After the optimization phase, the traditional K-means clustering algorithm is employed to perform the clustering.
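The two auxiliary penalties used by DEN can be illustrated with the following NumPy sketch; the group structure, neighbor lists, and similarity weights are illustrative placeholders, not DEN's actual formulation details.

```python
import numpy as np

def group_sparsity_loss(z, groups, weights=None):
    """Group-sparsity penalty in the spirit of equation (35): sum of L2
    norms of the representation restricted to each group of hidden units,
    encouraging whole groups of units to switch off together."""
    weights = np.ones(len(groups)) if weights is None else weights
    return float(sum(w * np.linalg.norm(z[:, g], axis=1).sum()
                     for w, g in zip(weights, groups)))

def locality_preserving_loss(z, neighbors, sim):
    """Locality-preserving penalty in the spirit of equation (34):
    nearby inputs should stay nearby in the embedding; sim[i][j]
    weights each neighbor pair."""
    loss = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            loss += sim[i][j] * ((z[i] - z[j]) ** 2).sum()
    return loss
```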
Yang et al. [37] proposed a joint dimensionality reduction and K-means clustering framework named the deep clustering network (DCN). A DNN, specifically an SAE, is utilized by the DCN for dimensionality reduction and representation learning. The DCN algorithm begins with a pretraining stage based on the reconstruction loss to initialize the SAE weights and biases. To initialize the cluster centroids, K-means clustering is applied to the representations obtained from pretraining. Then, the joint training phase is executed by iteratively optimizing the joint loss function
L_DCN = L_r + λL_KM (42)
where L_r is the reconstruction loss as defined in equation (4), L_KM is the K-means clustering loss function described in equation (11), and λ is a regularization parameter that balances the reconstruction error against finding K-means-friendly hidden representations. Instead of applying traditional SGD for the optimization, the DCN introduces an alternating SGD algorithm to update its parameters. There are three sets of parameters to be updated in a DCN: the cluster centroids, the data points' cluster assignments, and the network parameters. The proposed alternating SGD treats each set of parameters as a subproblem; thus, the DCN optimizes with respect to one of the cluster centroids, data point assignments, and network parameters while keeping the other two sets fixed. For instance, to update the network parameters, both the cluster centroids and the data point assignments are fixed, and the corresponding gradient is calculated by backpropagation.
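The alternating update over assignments and centroids (with the network update by backpropagation omitted) can be sketched as follows; the per-cluster step size inversely proportional to the cluster's update count follows the adaptive centroid update described for DCN, while the function name and array layout are illustrative.

```python
import numpy as np

def dcn_alternating_step(z, centroids, counts):
    """One alternating round for DCN's K-means block: (1) fix the
    centroids and update the hard assignments; (2) fix the assignments
    and update the centroids with a per-cluster adaptive step 1/count_k.
    The third subproblem (network parameters via SGD) is omitted here."""
    # (1) assignment step: nearest centroid per embedded point
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    # (2) centroid step: gradient step with learning rate 1/count_k
    for i, k in enumerate(assign):
        counts[k] += 1
        centroids[k] -= (1.0 / counts[k]) * (centroids[k] - z[i])
    return assign, centroids, counts
```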
Fard et al. [38] proposed a deep K-means clustering algorithm named deep K-means (DKM), which is very similar to the DCN [37]. DKM differs from the DCN in the clustering loss only: weighted K-means is employed instead of K-means. Equation (43) shows the DKM joint loss function:
L_DKM = L_r + λL_WKM (43)
where L_r is the reconstruction loss as defined in equation (4), L_WKM is the weighted K-means clustering loss function described in equation (12), and λ regulates the trade-off between seeking a good representation and good clustering results. The similarity weight of the K-means loss function is computed according to the softmax function
w_k(z_i) = exp(−α||z_i − c_k||²) / Σ_{k'=1}^{K} exp(−α||z_i − c_{k'}||²) (44)
where z_i is the learned representation of data point x_i, K is the number of clusters, c_k is the representation of cluster centroid k, and α is a coefficient: when its value is 0, all of the data points in the embedding space are treated as equally close to every centroid (uniform weights), and when its value is relatively high, the points are treated as sparse in the space and the weighting approaches a hard assignment. The network architecture and learning process of DKM are similar to those of DCN, except that instead of alternating between continuous gradient updates and discrete cluster assignment steps, DKM relies on gradient updates only to learn both the representation and the clustering parameters.
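The softmax weighting of equation (44) can be computed stably in log space, as in this sketch; the shift by the row maximum is a standard numerical-stability trick, not part of the original formulation.

```python
import numpy as np

def dkm_softmax_weights(z, centroids, alpha):
    """Softmax similarity weights of equation (44). alpha = 0 yields
    uniform weights over clusters; large alpha approaches a hard
    (one-hot) assignment, recovering ordinary K-means."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```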
Chen et al. [39] proposed a deep manifold clustering algorithm called deep manifold clustering (DMC) to address multimanifold clustering (MMC) [40]. DMC's architecture is similar to that of DEN [36]: an SAE [17] is employed for representation learning and a DBN [14] is utilized to initialize the SAE parameters. In DMC, a locality-preserving auxiliary clustering loss is introduced based on the observation that the locality of a manifold implies that similar inputs should have similar representations; therefore, a data point can be recovered using the representation of a nearby point. Based on this observation, the DMC [39] locality-preserving loss function is defined as
L_LP = Σ_{i=1}^{N} Σ_{j∈N(i)} ||x_j − x̂_i||² (45)
where x̂_i is the reconstructed image of data point x_i and N(i) is the set of indices of the nearest neighbors of x_i. After the SAE weights and cluster centroids have been initialized, the joint training procedure proceeds by iteratively optimizing the joint loss function:
L_DMC = L_r + λL_LP + γL_WKM (46)
where L_r is the reconstruction loss defined in equation (4), L_LP is the locality-preserving loss function defined in equation (45), L_WKM is the weighted K-means clustering loss function presented in equation (12), λ balances the importance between the reconstruction of a point itself and of its local neighborhood, and γ is a parameter that balances the contribution of the first two terms against L_WKM. DMC uses a data-dependent Gaussian kernel as the similarity weight of the weighted K-means loss function:
w_{i,k} = exp(−||z_i − c_k||² / (2σ²)) (47)
Here, σ is the kernel bandwidth. The key step in DMC is finding the manifold centers: cluster centers are most probably surrounded by nearby points with lower local density, and they lie at a relatively large distance from any point with a higher local density. Therefore, DMC characterizes each new representation by computing two metrics: the local density of a point, and its distance to points of higher density. The local density of representation z_i is defined as
ρ_i = Σ_{j≠i} χ(Δ_ij − Δ_c), with χ(x) = 1 if x < 0 and χ(x) = 0 otherwise (48)
where Δ_ij is the distance between representations z_i and z_j and Δ_c is a cut-off distance. Then, the points in the new embedding space are sorted by density in descending order, denoted {q_i}_{i=1}^{N} with ρ_{q_1} ≥ ρ_{q_2} ≥ ⋯ ≥ ρ_{q_N}. The distance metric is then calculated as
δ_{q_i} = min_{j<i} Δ_{q_i q_j} for i ≥ 2, and δ_{q_1} = max_{j≥2} δ_{q_j} (49)
Next, a third metric is defined as
γ_i = ρ_i δ_i (50)
Similarly, the points in the new embedding space are sorted by γ, as computed in equation (50), in descending order, denoted {q_i}_{i=1}^{N} with γ_{q_1} ≥ γ_{q_2} ≥ ⋯ ≥ γ_{q_N}. Assuming that the number of clusters K is known in advance, the cluster centers are determined by taking the K points with the largest γ. The experiments reported in [39] showed that DMC outperformed state-of-the-art multimanifold clustering methods.
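The density-peak center selection described above can be sketched as follows; the helper name and the brute-force pairwise-distance computation are illustrative, not DMC's actual implementation.

```python
import numpy as np

def density_peak_centers(z, cutoff, n_clusters):
    """Select cluster centers as in DMC's density-peak step: points that
    combine a high local density rho (neighbors within the cut-off
    distance) with a large distance delta to any denser point; centers
    are the points maximizing gamma = rho * delta."""
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    rho = (d < cutoff).sum(axis=1) - 1  # exclude the point itself
    delta = np.empty(len(z))
    for i in range(len(z)):
        denser = np.where(rho > rho[i])[0]
        # densest points fall back to the maximum distance
        delta[i] = d[i, denser].min() if len(denser) else d[i].max()
    gamma = rho * delta
    return np.argsort(-gamma)[:n_clusters]
```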

C. Deep Spectral Clustering
Ji et al. [41] introduced deep subspace clustering networks (DSC-Nets) based on a CAE [18] to learn a nonlinear mapping. The network architecture of DSC-Nets includes three parts: a CNN encoder, a middle layer called the self-expressive layer, and a CNN decoder. In the self-expressive layer, the neurons are fully connected using linear weights, without biases or nonlinear activations. The purpose of this FC layer is to encode the self-expressiveness property, as explained in section 3.3. Each node in the self-expressive layer corresponds to a representation z_i, and the weights correspond to the matrix C in equation (36), which is further used to construct affinities between all data points. Essentially, therefore, the self-expressive layer enables the network to learn the affinity matrix directly. First, DSC-Nets pretrains the CAE without the self-expressive layer to initialize the encoder and decoder parameters. Then, in the finetuning process, the whole DSC-Nets deep network is trained by iteratively optimizing the following joint loss function:
L_DSC-Nets = L_r + λL_SE (51)
where L_r is the reconstruction loss defined in equation (4) and L_SE is the self-expressiveness loss expressed in equation (36). When training is complete, the parameters of the self-expressive layer are used to build an affinity matrix for spectral clustering, as explained in section 3.2. The experiments reported in [41] showed that DSC-Nets yield superior results on small datasets. However, this method cannot be applied to large datasets because of the memory complexity of the algorithm [19].
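Once the self-expressive layer is trained, an affinity matrix for spectral clustering is typically built from the magnitudes of its weights; the following is a common post-processing sketch (not DSC-Nets' exact procedure), with the function name and optional threshold as illustrative choices.

```python
import numpy as np

def affinity_from_self_expression(C, threshold=0.0):
    """Build a symmetric affinity matrix for spectral clustering from
    the self-expressive layer's weight matrix C: zero the diagonal
    (a point should not explain itself), take magnitudes, optionally
    prune small entries, and symmetrize."""
    A = np.abs(np.asarray(C, dtype=float).copy())
    np.fill_diagonal(A, 0.0)
    A[A < threshold] = 0.0
    return 0.5 * (A + A.T)
```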
Similar to DSC-Nets, Zhou et al. [42] proposed the deep adversarial subspace clustering (DASC) model, which learns subspace-clustering-friendly representations using an AAE and a self-expressiveness constraint. Accordingly, DASC optimizes the following objective function:
L_DASC = L_r + αL_g + βL_SE (52)
where L_r, L_g, and L_SE represent the reconstruction loss defined in equation (4), the generation loss in equation (9), and the self-expressiveness loss defined in equation (36), respectively, and α and β are hyperparameters that balance the importance of the generation loss and the clustering loss, respectively. Upon completion of the training process, spectral clustering is applied to the resulting affinity matrix.
Yang et al. [43] presented a deep spectral clustering (DSC) approach based on an AAE. In the proposed approach, the generator is a dual-AE network (one encoder and two decoders) that enforces reconstruction constraints on both the latent representations and their noisy versions; as a consequence, the resulting latent representation is more robust to noise. Hence, the reconstruction loss is updated to the following form:
L_r_DSC = L_r(x, x̂) + λL_r(x, x̂′) (53)
where L_r is the reconstruction loss in equation (4), x̂ is the reconstructed image of input x, x̂′ is the reconstructed image of the noisy version of input x, and λ balances the strength of the two losses. Then, mutual information estimation is employed to provide the discriminator with more information from the inputs. To achieve this, the feature map of the middle convolutional layer of the encoder is extracted and combined with the latent representation to obtain a new feature map. The generation loss therefore becomes
L_g_DSC = L_g + (α/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} log D(f_{(i,j)}, z) + βL_KLD (54)
where L_g is the generation loss in equation (9), D is the discriminator, f_{(i,j)} represents the feature vector of the middle feature map at coordinates (i, j), z is the latent representation of input x, L_KLD is the KL-divergence loss in equation (14), h and w represent the height and width of the feature map, and α and β are balancing parameters. Finally, the latent representations are embedded into the eigenspace and clustered using a spectral clustering technique.

D. More Deep Clustering Algorithms
Shah et al. [44] presented deep continuous clustering (DCC), a framework for joint nonlinear embedding learning and clustering. The DCC framework integrates the RCC algorithm [44] with an SAE [17] as the deep representation learning technique. DCC consists of two stages: initialization and optimization. During the initialization stage, the denoising SAE is pretrained on the reconstruction loss only to initialize the network parameters, i.e., weights and biases. Then, the SAE is finetuned, again using the reconstruction loss only, to complete the initialization. At the end of this stage, the learned representations Y are obtained from the bottleneck layer, and the representatives are initialized as Z = Y. Then, the optimization is conducted by minimizing the joint loss function
L_DCC = (1/D)L_r + (1/d)(L_data + λL_pair) (55)
where L_r is the AE reconstruction loss in equation (4), D is the dimensionality of the original input data, d is the dimensionality of the lower-dimensional representations Y, and λ balances the data and pairwise terms. DCC modifies the data loss introduced in RCC [44] as
L_data = Σ_{i=1}^{N} ρ1(||y_i − z_i||; μ1) (56)
where ρ1 and ρ2 are instances of the scaled Geman-McClure function defined in equation (25). The pairwise loss is also modified by DCC as
L_pair = Σ_{(i,j)∈E} w_{ij} ρ2(||z_i − z_j||; μ2) (57)
where E is the edge set of the nearest-neighbor graph constructed over the data.
The parameters μ1 and μ2 control the radii of the convex basins of the estimators. The weights w_{ij} are computed based on
w_{ij} = n̄ / √(n_i n_j) (58)
where n_i is the degree of node i in the graph and n̄ is the average degree. To balance the different terms, DCC sets the corresponding balancing parameters according to equations (29) and (30), respectively. The network parameters, the representatives Z, and the lower-dimensional representations Y are updated by an SGD optimization algorithm [45] through backpropagation. Other DCC parameters, such as μ1 and μ2, are iteratively updated during the optimization, as in the RCC algorithm [44].
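The scaled Geman-McClure estimator at the heart of DCC's data and pairwise losses can be sketched as follows, assuming the common form μr²/(μ + r²) for residual r and scale μ.

```python
import numpy as np

def geman_mcclure(r, mu):
    """Scaled Geman-McClure robust estimator: behaves like a quadratic
    for small residuals r and saturates toward mu for large ones,
    limiting the influence of outlying pairs on the objective."""
    r2 = np.asarray(r, dtype=float) ** 2
    return mu * r2 / (mu + r2)
```

Shrinking μ during optimization tightens the convex basin, which is how graduated nonconvexity is typically realized in this family of objectives.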
Jiang et al. [46] proposed Variational Deep Embedding (VaDE), a probabilistic generative clustering technique within a VAE framework. In VaDE, a mixture of Gaussians is assumed as the prior for probabilistic clustering. To model the data generative procedure, VaDE utilizes a GMM to pick a cluster, from which a latent embedding is generated; the VAE then decodes the latent embedding into an observable sample. The VAE is trained to maximize the evidence lower bound (ELBO) [19] according to the VAE loss in equation (7). After maximizing the ELBO, the cluster assignment can be inferred from the learned GMM. GMVA [47] is another probabilistic clustering algorithm based on a VAE with a Gaussian mixture prior distribution. The main contribution of this algorithm is the introduction of the minimum information constraint [48] into the VAE to overcome cluster degeneracy, a problem caused by over-regularization of the VAE. The GMVA approach is more complex than VaDE and has shown worse results in practice [19]. Moreover, both VaDE and GMVA suffer from high computational complexity [19].
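VaDE's generative story (pick a cluster from the mixture prior, sample a latent code from that cluster's Gaussian, then decode) can be sketched as follows; `decoder` is a stand-in callable and the diagonal-Gaussian parameterization is an illustrative simplification, not the paper's trained model.

```python
import numpy as np

def vade_generate(pi, means, stds, decoder, n, seed=0):
    """Sketch of VaDE's generative process: c ~ Categorical(pi),
    z ~ N(means[c], diag(stds[c]^2)), x = decoder(z)."""
    rng = np.random.default_rng(seed)
    cs = rng.choice(len(pi), size=n, p=pi)        # cluster choices
    zs = rng.normal(means[cs], stds[cs])          # latent embeddings
    xs = np.array([decoder(z) for z in zs])       # decoded observables
    return cs, xs
```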
Mukherjee et al. [49] addressed the problem of clustering in the latent space of a GAN [19] by introducing the ClusterGAN framework. To establish a non-smooth geometry of the latent space, a mixture of discrete and continuous latent variables is utilized. To accommodate this mixture of variables, a new backpropagation algorithm is introduced to recover the latent variable for a given data input. The experimental results showed that the GAN is able to preserve latent space interpolation across different categories. In Table I, we compare the studied joint deep clustering algorithms based on their clustering technique, loss functions, and main contributions. From the presented review, deep clustering algorithms with autoencoders are the most common technique, for two reasons: (1) autoencoders can be combined with most clustering algorithms, and (2) the autoencoder reconstruction loss is capable of learning feasible representations and avoiding trivial solutions. It is important to note that the computational cost of autoencoder-based deep clustering algorithms is highly affected by the clustering loss. Moreover, for computational feasibility, such algorithms have limited network depth due to the symmetric architecture of the autoencoder. On the other hand, deep clustering algorithms with VAE, AAE, and GAN maximize a variational lower bound on the marginal likelihood of the data, which gives them a theoretical grounding. Unfortunately, these clustering techniques suffer from high computational complexity. Comparing VAE-based deep clustering algorithms with AAE- and GAN-based ones, the AAE and GAN algorithms are more flexible and diverse than the VAE algorithms; nonetheless, AAE- and GAN-based clustering algorithms have a slow convergence rate.

V. CONCLUSION
Recently, clustering algorithms have benefited from the active research field of deep learning; indeed, many current studies focus on integrating deep representation learning with clustering tasks. Beyond joint deep clustering algorithms, more recent algorithms have been proposed, some of which are classified as separated deep clustering approaches and others as combined, but not joint, deep clustering techniques. DeepCluster, clustering by unmasking, rank-constrained spectral clustering, SDEC, parameter-free clustering, and learning deep graph representations are all examples of non-joint deep clustering algorithms.
In this article, we reviewed existing joint deep clustering algorithms by describing their network structures and analyzing their objective functions. Based on the survey of algorithms discussed here, a theoretical analysis of how and why jointly optimizing reconstruction and clustering losses significantly improves clustering performance would itself be a significant contribution. Also worthwhile is studying whether deep supervised learning techniques, such as data augmentation and regularization, are applicable and useful for deep unsupervised clustering. Exploring the feasibility of applying the proposed joint deep clustering algorithms to sequential data is highly encouraged. Moreover, exploring the viability of combining deep clustering techniques with other unsupervised learning tasks, such as transfer learning, is strongly recommended.