An Improved Depth Estimation using Stereo Matching and Disparity Refinement Based on Deep Learning

Stereo matching is a vital subject in computer vision. It focuses on finding accurate disparity maps, which are used in several applications such as 3D scene reconstruction, robot navigation, and augmented reality. Stereo matching finds corresponding points in a pair of stereo images to obtain a disparity map. With additional details, this disparity map can be converted into the depth of a scene. Obtaining an accurate disparity map in textureless, occluded, and discontinuous areas is difficult. A matching cost using an improved Census transform and an optimization framework is proposed to produce an initial disparity map. The classic Census transform relies on the value of the center pixel. If this pixel is corrupted by noise, the census encoding may change, which leads to mismatches. To overcome this issue, an improved census transform based on the weighted sum of the neighborhood pixels is proposed, which suppresses noise during stereo matching. Additionally, a deep learning based disparity refinement technique using a generative adversarial network is proposed to handle textureless, occluded, and discontinuous areas. The suggested method offers cutting-edge performance in terms of both qualitative and quantitative outcomes.


I. INTRODUCTION
Stereo matching has attracted attention recently because of its applications in fields such as visual entertainment, 3D reconstruction, autonomous driving, object detection [1], outdoor mapping, navigation, and 3DTV [2], [3]. It is a research area that tries to imitate the human visual system by using two or more 2D views of the same scene to recover three-dimensional depth details of the scene. It aims to find the correspondence between matching pixels. A stereo matching algorithm takes rectified stereo images as input [4], [5]. The horizontal displacement between matching pixels is called disparity. With additional details, a disparity map can be transformed into the depth of the scene. Disparity map accuracy is crucial, as small inaccuracies may affect the result. Obtaining an accurate disparity map is challenging because of noise, occlusions, low texture, ill-posed regions, and varying lighting conditions. Hence, it is important to create a good disparity map.
Stereo matching techniques are classified as conventional algorithms and deep learning methods. Conventional algorithms are grouped into local and global algorithms. In local approaches, disparity is computed by comparing small areas [6], [7]. The disparity calculation relies on the intensity within a defined support area. In real-time settings, the captured stereo images may be affected by noise and lighting distortions, which reduce the accuracy of these algorithms. To overcome these drawbacks, a census transform for stereo matching is proposed in [8], which decreases the effect of amplitude distortion. It maps the pixels to a binary string and then calculates the similarity between pixels by means of the Hamming distance. However, this method relies mainly on the central pixel, leading to false matches in noisy environments. To reduce this shortcoming, a three-state census is proposed in [9], which tolerates noise and enhances the robustness of stereo matching. An algorithm is implemented in [10] to perform a census transform that reduces noise interference and amplitude distortions in the images. A star-census transform (SCT) is introduced in [11] that samples the neighborhood pixels in a symmetrical order and excludes the central pixel of the matching window. An improved AD-Census stereo matching method using gradient fusion (ADSG) is introduced in [12]. The absolute difference is used along with the census transform for cost calculation, and the result is then combined with a gradient cost. These methods use only local information and hence have low complexity and short execution times. However, the results generated by these local methods in occluded, textureless, and discontinuous areas are not satisfactory.
The semi-global algorithm was proposed in [13]. The accuracy and computational efficiency of semi-global algorithms lie between those of local and global algorithms. A global method treats disparity computation as a global energy minimization over all disparity values. The energy function has two terms: a data term, which penalizes pixels with inconsistent values, and a smoothness term, which enforces a smoothing constraint by considering the neighboring pixels. Commonly used global algorithms include the graph cuts algorithm [14] and the belief propagation technique [15]. A disparity estimation method based on a tree structure, named Pyramid-tree, is introduced in [16]. It performs cross-regional smoothing that can handle low-texture regions. Global methods can generate good-quality disparity maps, but they are computationally expensive and time-consuming.
Deep stereo methods are popular these days. Zbontar et al. [17] used a network to extract patch-wise details to compute the matching cost. The network is trained to measure the similarity between a pair of images. Its output is then processed using classic post-processing. GC-Net [18] is a high-performance network. It applied 3D convolution kernels to the correspondence space and proposed disparity refinement, which improved on the previous approach. A pyramid stereo matching network is proposed in [19]. It improved feature extraction by means of a multi-scale feature extraction network [20]. A cascaded residual learning network [21] was introduced that uses a DispNet. It is made up of two sub-networks, DispFullNet and DispResNet. The first network computes the raw disparity map. The second network optimizes the raw disparity map by computing multi-scale residual information. Williem et al. [22] introduced a method known as self-guided cost aggregation that uses a convolutional network for local stereo matching. The network is made up of a dynamic weight network and a descending filtering network. In LEAStereo [23], a neural architecture search is performed to streamline the matching pipeline. Duggal et al. [24] developed a trainable network, DeepPruner. Many recent papers introduced refinement steps to improve the disparity map quality. MSMD-Net [25] introduced a multi-scale technique in which the stereo images are processed using a multi-resolution pyramid network. RAFT-Stereo [26] consists of a network for stereo estimation along with a refinement stage. Deep learning-based methods can produce a depth map from a given stereo image pair, but they still find it difficult to find correct correspondences in textureless and occluded regions.
Though several techniques have been proposed to improve matching accuracy, the low accuracy in occluded and textureless regions has not been handled well. This paper proposes a depth estimation technique that uses an improved census transform and deep learning-based disparity refinement to enhance the results in occluded and textureless regions. In the improved census transform, the weighted sum of the center pixel and its four neighbors is used to calculate the center pixel value in order to reduce noise in the initial disparity map. The occluded and textureless regions of the initial disparity map are refined using a generative adversarial network (GAN) deep learning framework. Extensive experiments performed on the Middlebury datasets show the efficacy of our method, which improves the accuracy of the disparity map by a considerable amount. The suggested method is explained in Section II. The outcomes of the suggested method are shown in Section III. In Section IV, the paper's conclusions are discussed.

II. METHODOLOGY
The proposed method applies the improved census transform to the stereo images, and a matching cost is obtained using the Hamming distance. Then, cost aggregation is carried out using the semi-global method to compute an initial disparity map. Finally, a disparity refinement network using a GAN is proposed to increase the accuracy of the disparity map, from which depth is estimated. An overview of the whole methodology is shown in Fig. 1.

A. Improved Census Transform
Methods for stereo matching based on intensity differences contain many errors, especially for outdoor images. To overcome these drawbacks, the Census transform (CT) is used for computing the matching cost. It is a local method that relies on the relative ordering of pixels within a fixed window rather than on their intensities. Hence it can efficiently handle radiometric variations such as lighting changes and illumination differences, as well as discontinuities. The traditional CT is shown in Fig. 2. The census transform compares the center pixel value with all the remaining pixels in the window and assigns 1 if the center pixel value is less than the compared pixel; otherwise 0 is assigned. The result is then represented as a binary bit string:
\[ CT(p) = \bigotimes_{q \in N(p)} \xi\big(I(p), I(q)\big) \]
\[ \xi\big(I(p), I(q)\big) = \begin{cases} 1, & I(p) < I(q) \\ 0, & \text{otherwise} \end{cases} \]
Here, $\otimes$ is a bitwise concatenation operation, $N(p)$ represents the window with $p$ as centre pixel, $q$ is any point in $N(p)$, and $I(p)$ and $I(q)$ are the pixel values of points $p$ and $q$ respectively.
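The encoding rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `census_code`, the row-major bit ordering, and the 3x3 patch values are hypothetical and are not taken from Fig. 2.

```python
def census_code(patch):
    """Return the census bit string of a square patch given as a list of rows.

    A bit is '1' when the centre value is less than the neighbour, else '0'.
    Neighbours are visited in row-major order and the centre is skipped
    (the ordering is an assumption; any fixed ordering works for matching).
    """
    size = len(patch)
    c = size // 2
    centre = patch[c][c]
    bits = []
    for y in range(size):
        for x in range(size):
            if (y, x) == (c, c):
                continue
            bits.append("1" if centre < patch[y][x] else "0")
    return "".join(bits)


# Hypothetical 3x3 patch with centre value 15.
patch = [[10, 32, 33],
         [20, 15, 30],
         [12, 11, 25]]
print(census_code(patch))  # prints 01111001
```

Because only the sign of each comparison is stored, adding a constant brightness offset to every pixel in the patch leaves the code unchanged.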
The traditional CT can reduce the impact of amplitude distortions, but it depends heavily on the center pixel. It is prone to noise because it measures the relative difference of the neighboring pixels with respect to the center pixel. When the center pixel is affected by noise, the census encoding may vary drastically, which may lead to mismatched pixels. Fig. 3 depicts the traditional CT of an image patch when noise changes the center pixel value from 15 to 35. Since the traditional CT depends on the pixel at the center, the noisy center pixel value 35 is compared with the remaining pixels in the image patch. The census code obtained is 00000000, which differs in 5 bits from the initial code 01111001. To overcome this drawback, an improved census method is proposed in this paper in which the weighted sum of the center pixel and its four neighboring pixels is used to update the center pixel.

Let $I(x, y)$ be the center pixel. The weights of the center pixel and the four neighboring pixels are:
\[ w_0 = 0.4, \qquad w_1 = w_2 = w_3 = w_4 = 0.15 \]
The weights are assigned such that each weight lies between 0 and 1 and the weights of the center pixel and the four neighboring pixels sum to 1.
The weighted sum of the pixel centered at $(x, y)$ is computed using the following equation:
\[ I_w(x, y) = w_0\, I(x, y) + \sum_{k=1}^{4} w_k\, I_k(x, y) \qquad (8) \]
where $I_k(x, y)$, $k = 1, \dots, 4$, are the four neighboring pixels. The following equation is used to update the value of the center pixel:
\[ I'(x, y) = \begin{cases} I_w(x, y), & |I_w(x, y) - I(x, y)| > T \\ I(x, y), & \text{otherwise} \end{cases} \]
If the difference between the weighted sum and the original center pixel exceeds the threshold $T$, the center pixel value is replaced by the weighted sum; otherwise the original value is used.
The improved census transform proposed in the paper is depicted in Fig. 4. If the center pixel value changes from 15 to 35 due to noise, the proposed method updates the center pixel value to 27.95 using the weighted sum, as shown in Fig. 4 (i.e., 35*0.4 + 11*0.15 + 32*0.15 + 30*0.15 + 20*0.15). This value is compared with the remaining pixels in the patch to obtain the census code 01101000. This code differs by only 2 bits from the original code 01111001. This demonstrates that the approach is noise-resistant and improves matching performance.
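The arithmetic of this worked example can be checked with a short sketch. Several details are assumptions rather than facts from the paper: the four neighbors are taken to be the pixels above, below, left, and right of the center; the diagonal patch values are hypothetical and chosen only so that the codes match those quoted in the text; and the threshold of 5.0 is a placeholder, since no value is stated.

```python
W_CENTRE, W_NEIGHBOUR = 0.4, 0.15  # weights read off the worked example
THRESHOLD = 5.0                    # hypothetical; the text gives no value


def census_bits(patch, centre):
    """Census bits of a 3x3 patch against `centre` (row-major, centre skipped)."""
    return "".join(
        "1" if centre < patch[y][x] else "0"
        for y in range(3) for x in range(3) if (y, x) != (1, 1)
    )


def weighted_centre(patch):
    """Weighted sum of the centre and its four direct neighbours."""
    four = patch[0][1] + patch[1][0] + patch[1][2] + patch[2][1]
    return W_CENTRE * patch[1][1] + W_NEIGHBOUR * four


def improved_centre(patch, threshold=THRESHOLD):
    """Use the weighted sum only when it deviates enough from the centre."""
    centre = patch[1][1]
    weighted = weighted_centre(patch)
    return weighted if abs(weighted - centre) > threshold else centre


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Four direct neighbours (32, 20, 30, 11) come from the text; diagonals are made up.
noisy = [[10, 32, 33],
         [20, 35, 30],
         [12, 11, 25]]
reference = "01111001"  # code of the noise-free patch (centre 15)

traditional = census_bits(noisy, noisy[1][1])
improved = census_bits(noisy, improved_centre(noisy))
print(traditional, hamming(traditional, reference))  # prints 00000000 5
print(improved, hamming(improved, reference))        # prints 01101000 2
```

With these placeholder settings the noise-free center (15) is left untouched, because its weighted sum (19.95) deviates by less than the threshold.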

B. Matching Cost Computation
To determine whether two pixels correspond to the same scene point, a matching cost is computed. After the census transform, the correspondence of pixels can be determined using the Hamming distance [8]. Let $CT_L(p)$ be the binary bit string of pixel $p$ in the left stereo image and $CT_R(q)$ be the binary bit string of pixel $q$ in the right stereo image for disparity $d$. The census matching cost between $p$ and $q$ is computed using the Hamming distance:
\[ C(p, d) = \mathrm{Hamming}\big[\, CT_L(p),\; CT_R(q) \,\big] \]
The cost computation for the center pixel of an image patch is depicted in Fig. 5.
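As a rough illustration of this cost, the sketch below stores census codes as integers and counts differing bits with an XOR. It assumes the usual rectified-stereo convention that left pixel x is compared with right pixel x - d; the function names, the `invalid_cost` placeholder for out-of-image candidates, and the sample codes are all hypothetical.

```python
def hamming_distance(a, b):
    """Number of differing bits between two census codes stored as integers."""
    return bin(a ^ b).count("1")


def census_cost_row(left_codes, right_codes, max_disp, invalid_cost=8):
    """Cost table for one scanline: costs[x][d] compares left x with right x - d.

    `invalid_cost` is a placeholder used when x - d falls outside the image.
    """
    costs = []
    for x, code in enumerate(left_codes):
        row = []
        for d in range(max_disp + 1):
            if x - d >= 0:
                row.append(hamming_distance(code, right_codes[x - d]))
            else:
                row.append(invalid_cost)
        costs.append(row)
    return costs


# Codes quoted in the text: clean patch versus the two noisy encodings.
print(hamming_distance(0b01111001, 0b00000000))  # prints 5
print(hamming_distance(0b01111001, 0b01101000))  # prints 2

# Toy scanline in which the right codes are the left codes shifted by one pixel.
left = [0b1010, 0b1100, 0b0110]
right = [0b1100, 0b0110, 0b0011]
print(census_cost_row(left, right, max_disp=1))  # prints [[2, 8], [2, 0], [2, 0]]
```

In the toy scanline the zero costs at d = 1 mark the correct shift for every pixel that has a valid candidate.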

C. Initial Disparity Estimation
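Due to noise, the pixel-wise cost may produce ambiguous results. Hence an additional constraint is included to obtain a smooth disparity by penalizing disparity changes between neighboring pixels [27]. The smoothness constraint and pixel-wise cost are represented by the energy function $E(D)$:
\[ E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1\, T\big[|D_p - D_q| = 1\big] + \sum_{q \in N_p} P_2\, T\big[|D_p - D_q| > 1\big] \Big) \]
$\sum_p C(p, D_p)$ is the sum of the costs of all pixels for disparity map $D$, and $T[\cdot]$ equals 1 when its argument is true and 0 otherwise.
$P_1$ is the penalty applied to pixels in $N_p$ with a low disparity difference. $P_2$ is the penalty for pixels in $N_p$ with a high disparity difference. In the path-wise cost aggregated along a direction $r$, $C(p, d)$ is the cost for pixel $p$, $L_r(p - r, d)$ is the pixel cost along direction $r$ with disparity $d$, $L_r(p - r, d - 1)$ is the cost along direction $r$ at disparity $d - 1$, $L_r(p - r, d + 1)$ denotes the cost for disparity $d + 1$ and direction $r$, and $\min_i L_r(p - r, i)$ is the minimum pixel cost along direction $r$.
The following equation is then used to determine the initial disparity:
\[ D(p) = \arg\min_{d} \sum_{r} L_r(p, d) \]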
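The effect of the smoothness penalties can be seen in a minimal sketch of semi-global style aggregation along a single left-to-right path, followed by a winner-takes-all choice of disparity. It is a simplification: the method described in the text aggregates over 8 directions and sums them, whereas this sketch uses one. The names `aggregate_path`, `winner_takes_all`, `p1`, and `p2`, and all cost and penalty values, are hypothetical.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one left-to-right path.

    costs[x][d] is the matching cost of pixel x at disparity d.
    p1 penalises a disparity change of one level, p2 any larger change.
    """
    n_disp = len(costs[0])
    aggregated = [list(costs[0])]
    for x in range(1, len(costs)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            candidates = [prev[d], prev_min + p2]
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d < n_disp - 1:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(costs[x][d] + min(candidates) - prev_min)
        aggregated.append(row)
    return aggregated


def winner_takes_all(cost_rows):
    """Pick, for every pixel, the disparity with the lowest cost."""
    return [min(range(len(row)), key=row.__getitem__) for row in cost_rows]


# Toy scanline: the middle pixel is ambiguous and slightly prefers disparity 1.
costs = [[0, 6, 6],
         [5, 4, 5],
         [0, 6, 6]]
print(winner_takes_all(costs))                               # prints [0, 1, 0]
print(winner_takes_all(aggregate_path(costs, p1=2, p2=4)))   # prints [0, 0, 0]
```

The raw costs give the middle pixel an isolated disparity of 1, while the aggregated costs pull it back to the disparity of its neighbors.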

D. Disparity Refinement
The initial disparity map may contain wrongly matched disparities at object boundaries, occluded areas, and textureless regions. Finding the correct disparities in these areas is difficult. Hence, an appropriate disparity refinement method is needed. Disparity refinement is performed with a deep learning technique based on the generative adversarial network (GAN) to handle textureless, occluded, and discontinuous areas. The proposed method uses the GAN introduced by Goodfellow et al. [28] for disparity refinement. A GAN consists of two neural networks, a generator and a discriminator. The generator takes the initial disparity map as its input and generates a refined disparity map. The discriminator is fed the ground truth disparity along with the disparity map produced by the generator, and aims to distinguish between the two. The feedback from the discriminator is given to the generator to fine-tune the generated image. This procedure is repeated until the resulting disparity resembles the ground truth disparity. The disparity refinement network proposed in the paper is depicted in Fig. 6.
A Pix2Pix GAN [29] is used to refine the disparity map. Pix2Pix is a conditional adversarial network known for its capacity to produce high-quality images. The initial disparity is given as input to the generator. The generator generates the disparity map, which is then fed to the discriminator. The available generator networks are UNet 128, ResNet 6, and ResNet 9. The proposed disparity refinement network uses UNet 128, as it can learn from few training images. The architecture of the UNet 128 generator used in the proposed approach is depicted in Fig. 7.
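Here $d_r$ is the generated disparity map, $d_g$ is the ground truth disparity map, and $N$ is the total number of pixels. The discriminator is based on the PatchGAN model, which captures high-frequency information. The GAN's primary objective is described as
\[ L_{GAN}(G, D) = \mathbb{E}_{d_i, d_g}\big[\log D(d_i, d_g)\big] + \mathbb{E}_{d_i}\big[\log\big(1 - D(d_i, G(d_i))\big)\big] \]
Here, $d_g$ represents the ground truth disparity, $d_r = G(d_i)$ denotes the generated disparity, and $d_i$ denotes the initial disparity map. The generator $G$ attempts to decrease the objective in response to the discriminator $D$, which attempts to increase it. The result is as follows:
\[ G^{*} = \arg\min_{G} \max_{D} L_{GAN}(G, D) \qquad (17) \]
The aim of $G$ is to decrease the objective, and the generator updates itself using the $L_1$ loss. It is computed as
\[ L_{L1}(G) = \mathbb{E}_{d_i, d_g}\big[\, \lVert d_g - G(d_i) \rVert_1 \,\big] \]
The objective is updated as
\[ G^{*} = \arg\min_{G} \max_{D} L_{GAN}(G, D) + \lambda\, L_{L1}(G) \]
where $\lambda$ weights the $L_1$ term. The performance of the training model is depicted in Fig. 8. The training loss decreases gradually as the number of epochs increases. The lower the loss, the more effective the model.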
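To make the roles of the adversarial term and the L1 term concrete, the following plain-Python sketch evaluates both on small lists of numbers. It is not the training code used in the paper: the function names, the flattened lists standing in for PatchGAN outputs and disparity maps, and the weighting factor `lam` are all assumptions, and the discriminator outputs are assumed to be probabilities strictly between 0 and 1.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def gan_objective(d_real, d_fake):
    """Value the discriminator maximises: mean log D(real) + mean log(1 - D(fake))."""
    return mean(math.log(p) for p in d_real) + mean(math.log(1.0 - p) for p in d_fake)


def l1_loss(generated, ground_truth):
    """Mean absolute difference between generated and ground truth disparities."""
    return mean(abs(g - t) for g, t in zip(generated, ground_truth))


def generator_objective(d_fake, generated, ground_truth, lam=100.0):
    """Quantity the generator minimises: adversarial term plus weighted L1 term.

    `lam` is a hypothetical weighting factor; the text does not state its value.
    """
    adversarial = mean(math.log(1.0 - p) for p in d_fake)
    return adversarial + lam * l1_loss(generated, ground_truth)


# An undecided discriminator outputs 0.5 for every patch.
print(round(gan_objective([0.5, 0.5], [0.5, 0.5]), 4))  # prints -1.3863
print(l1_loss([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))        # prints 1.0
```

The L1 term rewards outputs that stay close to the ground truth everywhere, while the adversarial term rewards outputs whose local patches the discriminator cannot tell apart from real disparity maps.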

E. Depth
Once the disparity map is generated for the stereo images, the depth is estimated using the following formula:
\[ Z = \frac{f \cdot B}{d} \]
Here $Z$ is the depth, $f$ is the focal length, $B$ is the stereo camera baseline, and $d$ is the disparity. The values of $f$ and $B$ are obtained from the stereo calibration. Once the depth is estimated, the coordinates of each pixel in the scene can be computed. These coordinates are used to construct point clouds. The point cloud generated for the Motorcycle image is shown in Fig. 9. The coordinates are stored in a polygon (PLY) format file, and the output PLY file is plotted using PLY file plotters. The point cloud in Fig. 9 was constructed using Open3D.
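A small sketch of this conversion is given below. It assumes a standard pinhole camera model with principal point (cx, cy) for recovering the X and Y coordinates, which the text does not spell out, and it writes a minimal ASCII PLY string rather than calling Open3D. All function names and calibration values are hypothetical.

```python
def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth from disparity; returns None for invalid (non-positive) disparity."""
    if disparity <= 0:
        return None
    return focal_px * baseline_m / disparity


def back_project(u, v, disparity, focal_px, baseline_m, cx, cy):
    """3D point for pixel (u, v) under a pinhole model (assumed, not from the text)."""
    z = disparity_to_depth(disparity, focal_px, baseline_m)
    if z is None:
        return None
    return ((u - cx) * z / focal_px, (v - cy) * z / focal_px, z)


def to_ascii_ply(points):
    """Serialise a list of (x, y, z) tuples as a minimal ASCII PLY document."""
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    body = [f"{x:.4f} {y:.4f} {z:.4f}" for x, y, z in points]
    return "\n".join(header + body) + "\n"


# Hypothetical calibration: focal length 1000 px, baseline 0.2 m.
point = back_project(u=600, v=300, disparity=50.0,
                     focal_px=1000.0, baseline_m=0.2, cx=500.0, cy=400.0)
print(to_ascii_ply([point]))
```

Depth is inversely proportional to disparity, so halving the disparity doubles the estimated depth, and pixels with zero disparity have to be skipped.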

III. RESULTS AND DISCUSSION
The experiments were performed using Middlebury dataset [30], [31] images to analyze the performance. The refinement network is trained using the PyTorch framework on a personal machine. The hardware environment is a dual Intel Xeon E5-2609V4 8C (1.7 GHz, 20M cache, 6.4 GT/s) with 128 GB of memory. A dual NVIDIA Tesla P100 server GPU with 3584 cores and a maximum of 18.7 TeraFLOPS is used. The images are downscaled to 256 x 256 pixels for computational purposes. The Adam optimizer is used to optimize the discriminator. The learning rate is 0.0002. GAN training does not converge readily, hence a balance has to be maintained between the generator and the discriminator. The number of epochs is 100.

A. Middlebury Dataset
The Middlebury dataset includes rectified stereo images of indoor and outdoor scenes captured with a stereo vision setup. The images are complex and vary in characteristics such as resolution and the presence of low-texture areas. Hence, the dataset provides challenging images for framework evaluation. Our stereo matching technique is robust to occluded and non-textured regions. The details of the test images Cones, Teddy, and Venus from Middlebury 2001 and 2003 are given in Table II. The details of the higher-resolution images from Middlebury 2014 are given in Table III.
To evaluate the efficiency of the proposed improved census transform under noisy conditions, salt-and-pepper noise of 2% and 5% is applied to the Cones, Teddy, and Venus images. The qualitative results for the Teddy image with 2% salt-and-pepper noise are shown in Fig. 10, and Table IV shows the percentage of bad matching pixels (PBMP) of the initial disparity map for the proposed improved census transform and the traditional CT. The results show that the proposed method performs considerably better than the traditional CT under noisy conditions.
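One plausible way to produce such test inputs is sketched below. The paper does not say how the noise was generated, so the interpretation of the percentage as the fraction of corrupted pixels, the even chance of salt (255) and pepper (0), and the function name `add_salt_and_pepper` are all assumptions.

```python
import random


def add_salt_and_pepper(image, density, seed=0):
    """Return a copy of a grayscale image with a fraction of pixels set to 0 or 255.

    `image` is a list of rows of integers; `density` is the fraction of pixels
    to corrupt (0.02 for 2%). The seed makes the corruption reproducible.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]
    n_corrupt = round(density * height * width)
    for index in rng.sample(range(height * width), n_corrupt):
        y, x = divmod(index, width)
        noisy[y][x] = rng.choice((0, 255))
    return noisy


# Flat gray 10x10 test image with 5% of its pixels corrupted.
gray = [[128] * 10 for _ in range(10)]
noisy = add_salt_and_pepper(gray, density=0.05)
print(sum(value != 128 for row in noisy for value in row))  # prints 5
```

Applying the same seed to the left and right images would corrupt the same pixel positions in both, so different seeds per view are the more realistic choice.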

C. Qualitative Results
The initial and refined disparity maps estimated by the suggested method are shown in Fig. 11. The disparity maps generated for the Jade Plant, Adirondack, Motorcycle, and Recycle images are presented in the first, second, third, and fourth rows of Fig. 12, respectively. The first row of Fig. 12 shows the Jade Plant image from the Middlebury dataset. This image is very challenging to match because of brightness differences, but the proposed method recovers the disparities correctly. The second and third rows of Fig. 12 show the Adirondack and Motorcycle images. The textureless surfaces in the initial disparity map of the Adirondack image are highlighted by the yellow circular region. These regions are well reconstructed by the proposed approach. The fourth row of Fig. 12 shows the Recycle image. The occluded areas in the initial disparity map are highlighted by the red rectangular region. The possibility of wrong matches in these regions is very high. These occluded areas are filled accurately in the estimated disparity map. We find that the proposed method produces accurate results in occluded and textureless regions.

D. Evaluation Metrics
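The quantitative analysis is performed using two evaluation metrics, root mean square error (RMSE) and PBMP. Lower PBMP and RMSE values indicate better accuracy. Let $N$ be the number of pixels, and let $d_t$ and $d_g$ be the estimated and ground truth disparity maps respectively. RMSE is calculated as:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(x, y)} \big( d_t(x, y) - d_g(x, y) \big)^2} \]
PBMP is calculated as follows:
\[ \mathrm{PBMP} = \frac{1}{N} \sum_{(x, y)} \big( |d_t(x, y) - d_g(x, y)| > \delta \big) \times 100\% \]
where $\delta$ is the disparity error threshold.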
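Both metrics are straightforward to compute, as the following sketch shows on flattened disparity lists. The function names, the default error threshold of 1.0 pixel, and the sample values are hypothetical; the paper's evaluation may additionally mask out pixels without ground truth.

```python
import math


def rmse(estimated, ground_truth):
    """Root mean square error between two equally long disparity lists."""
    n = len(estimated)
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(estimated, ground_truth)) / n)


def pbmp(estimated, ground_truth, delta=1.0):
    """Percentage of pixels whose absolute disparity error exceeds `delta`."""
    n = len(estimated)
    bad = sum(abs(e - g) > delta for e, g in zip(estimated, ground_truth))
    return 100.0 * bad / n


estimated = [10.0, 12.0, 15.0, 20.0]
ground_truth = [10.0, 11.0, 18.0, 20.0]
print(round(rmse(estimated, ground_truth), 4))  # prints 1.5811
print(pbmp(estimated, ground_truth))            # prints 25.0
```

RMSE is sensitive to the size of each error, whereas PBMP only counts how many pixels are wrong by more than the threshold, so the two can rank methods differently.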

E. Comparison with Existing Methods
The proposed method is compared with ADSG [12] and DeepPruner [24]. The results for the compared methods are obtained from the Middlebury evaluation leaderboard. An improved AD-Census method using gradient fusion is used in

Fig. 4. Census transform of the proposed method in noisy condition.
The stereo matching problem aims to minimize the energy function $E(D)$. Finding the minimum of the energy function is computationally expensive. The energy function is therefore approximated by aggregating the matching cost from all directions $r$. The total number of directions is 8. The cost along direction $r$ is represented by
\[ L_r(p, d) = C(p, d) + \min\big[\, L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \,\big] - \min_i L_r(p - r, i) \]

Fig. 7. Architecture of generator.
The generator uses a network consisting of several convolutional, batch normalization, dropout, and activation layers. It is trained using the adversarial loss and then revised by means of the L1 loss. This loss drives the generator to produce an image close to the ground truth disparity. The generator is then updated using the sum of the L1 loss and the adversarial loss. A comparative study of UNet 128 with other networks such as ResNet 6 and ResNet 9 is given in Table I. ResNet 6 and ResNet 9 are deep residual networks with six and nine residual blocks respectively. Information is passed via a shortcut connection. A traditional residual block is made up of convolutional, batch-normalization, and rectified linear unit (ReLU) layers. For evaluation, squared relative difference (SRD) and absolute relative distance (ARD) are used. Lower values of these metrics indicate better performance. The efficiency of the generator architectures is shown in Table I. The results obtained for UNet 128 are better than those of ResNet 6 and ResNet 9.
∑ (14)

Fig. 10. Visual results for initial disparity map on noisy Teddy image (a) Left reference image (b) Right image (c) Ground Truth Disparity (d) Initial Disparity Map using traditional CT (e) Initial Disparity Map using proposed method.
In Fig. 11, (a) is the left reference image and (b) is the right image. The ground truth disparity is given in Fig. 11(c). The fourth column, Fig. 11(d), represents the initial disparity map. The refined disparity map is shown in Fig. 11(e). The red rectangular regions marked in Fig. 11(d) represent the occluded areas, which are filled in the refined disparity map obtained by the suggested approach. The yellow circular region marked in the initial disparity of the Venus image shows the textureless region, which is filled in the refined disparity map. The suggested method thus creates high-quality disparity maps in noisy, textureless, and occluded regions.

Fig. 11. Visual results on Cones, Teddy, and Venus images (a) Left image (b) Right image (c) Ground Truth Disparity (d) Initial Disparity Map (e) Refined disparity map.

Fig. 12. Visual results on Jade Plant, Adirondack, Motorcycle and Recycle images (a) Left reference image (b) Right image (c) Ground Truth image (d) Initial Disparity Map (e) Refined disparity map.

TABLE IV. PBMP OF INITIAL DISPARITY MAP