How Image Defects in Street Scenes Affect the Performance of Semantic Segmentation Algorithms

Semantic segmentation methods are used in autonomous vehicle development to label each pixel of a road image with a class (e.g., street, building, pedestrian, car). DeepLabv3+ and PSPNet are two of the best-performing semantic segmentation methods on the Cityscapes benchmark. Although these methods achieve very high performance on clear road images, they have not been tested under severe imaging conditions. In this work, we provide new Cityscapes datasets with severe imaging conditions: foggy, rainy, blurred, and noisy datasets. We evaluate the performance of DeepLabv3+ and PSPNet on these datasets. Our results show that, although these models perform well on clear images, their performance degrades substantially under each of these imaging challenges. We conclude that road semantic segmentation methods must be evaluated under different kinds of severe imaging conditions to ensure their robustness in autonomous driving.
Keywords: semantic segmentation; deep learning; Cityscapes; DeepLabv3+; PSPNet


I. INTRODUCTION
Autonomous vehicles are vehicles that can move with little or no human interaction. They collect information about the surrounding environment in order to imitate safe human driving. Autonomous vehicles rely on sensors, actuators, driving algorithms, machine learning technologies, and powerful micro-controllers with GPUs to execute the self-driving software.
Self-driving software uses semantic segmentation algorithms that take road scene images as input and assign a label to each pixel of the input images. These labels describe the object class that the pixels represent (road, traffic light, vehicle, human, etc.). Fig. 1 shows an example of input and ground truth images used in semantic segmentation algorithms. Semantic segmentation is very powerful because it helps self-driving software understand scene images at the pixel level.
Semantic segmentation networks are designed and tested to work efficiently with clear images, and all the images in the large-scale datasets [8][9][10][11] are clear images. As a result, semantic segmentation methods do not take into account the different types of defects in images coming from video cameras.
Defects in images can result from bad weather or electronic noise. These defects decrease the accuracy of semantic segmentation methods and can thus lead the vehicle's self-driving system to take a wrong driving decision.
Overall, state-of-the-art methods are evaluated only on clear images because they are limited by the existing datasets, and their performance on degraded images is ignored. Semantic segmentation methods should take these challenges into account and handle severe imaging conditions. Although several works have studied object detection under challenges such as foggy [12], rainy [13,14], blurred [15], and noisy [16][17][18] images, only a few works [12,19] have studied these challenges for semantic segmentation. Here, we study road semantic segmentation methods under different challenges.
In this work, we address four kinds of severe imaging conditions: fog, rain, blurring, and noise. We study the performance of semantic segmentation with these four imaging defects. As collecting real datasets with these conditions is very hard, we decided to use the Cityscapes dataset [11] and introduce fog, rain, blurring, and noise into the clear images of the dataset.
Although the authors of [12] addressed the performance of semantic segmentation methods [1,2] with fog, these two methods have relatively low performance on the Cityscapes benchmark: their mIoU is 73.6% and 67.1%, respectively, on the Cityscapes test set. In this work, we not only generate new evaluation datasets but also study the performance of two powerful semantic segmentation methods against imaging defects. We study DeepLabv3+ and PSPNet [5,4], which are rated as two of the top methods in semantic segmentation. DeepLabv3+ and PSPNet score an mIoU of 82.1% and 81.2%, respectively, on the Cityscapes test set.
This work is an extension of our previous work [20], which studied the performance of semantic segmentation methods with fog and blur challenges. In this paper, we add rain and noise to the challenges used in the performance evaluation of semantic segmentation methods.
In summary, the contributions of this work are: (i) Addressing the performance degradation of semantic segmentation methods under severe imaging conditions. (ii) Creating rainy, foggy, blurred, and noisy datasets for evaluation purposes. We made use of an algorithm provided by [12] to add fog to the Cityscapes dataset. (iii) Using our newly created datasets to evaluate the performance of two top semantic segmentation methods (DeepLabv3+ and PSPNet).
This paper is organized as follows: Section II briefly reviews the semantic segmentation methods used in the performance evaluation. Section III describes the challenging evaluation datasets. Section IV presents the experiments and the performance evaluation results. Finally, Section V gives a brief conclusion.

II. METHODS
In this section, we briefly describe the semantic segmentation methods used in our performance study. DeepLabv3+ and PSPNet are two of the best-performing methods on the Cityscapes benchmark. These are two state-of-the-art road semantic segmentation methods used to label pixels of road images (e.g., street, building, pedestrian, car, and so on).

A. DeepLabv3+
DeepLabv3+, the extension of DeepLabv3, is a very powerful semantic segmentation model developed by Google. DeepLabv3+ is mainly composed of two phases: (i) Encoder: In this phase, the model extracts the main features from the input image. It detects the presence of the objects and their location. DeepLabv3+ uses Atrous Spatial Pyramid Pooling (ASPP), which probes convolutional features by applying atrous convolution at multiple scales. (ii) Decoder: In this phase, the model refines the segmentation results along the object boundaries. It applies 1 x 1 convolutions to the low-level features and concatenates them with the upsampled encoded features. It then applies 3 x 3 convolutions and upsamples the features to output a prediction image with the same size as the input image.
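To make the idea of applying atrous convolution at multiple scales more concrete, the following is a minimal NumPy sketch, not the DeepLabv3+ implementation. The function names `atrous_conv2d` and `aspp_like` are hypothetical, the input is a single-channel map, and the rates (1, 6, 12, 18) are illustrative values chosen for this example; a real ASPP module operates on multi-channel features with learned kernels.

```python
import numpy as np

def atrous_conv2d(image, kernel, rate):
    """Same-size 2D atrous (dilated) cross-correlation with zero padding.

    Assumes an odd-sized kernel. The kernel taps are spaced `rate` pixels
    apart, so the receptive field grows without adding parameters.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    pad_h = rate * (kh - 1) // 2
    pad_w = rate * (kw - 1) // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros((h, w), dtype=np.float64)
    for a in range(kh):
        for b in range(kw):
            # Each kernel tap reads a copy of the image shifted by a multiple of `rate`.
            out += kernel[a, b] * padded[a * rate:a * rate + h, b * rate:b * rate + w]
    return out

def aspp_like(image, kernel, rates=(1, 6, 12, 18)):
    """Apply the same kernel at several rates and stack the responses.

    The rates are an illustrative assumption, not taken from the text.
    """
    return np.stack([atrous_conv2d(image, kernel, r) for r in rates])
```

With a 3 x 3 kernel and rate 2, a single bright pixel spreads to nine positions spaced two pixels apart, which shows how a larger rate covers a wider context with the same number of weights.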
DeepLabv3+ scored a performance of 89.0% using the test set of PASCAL VOC 2012 benchmark [10] and 82.1% using the test set of Cityscapes benchmark. Fig. 2 shows the network structure of DeepLabv3+.

B. PSPNet
Pyramid Scene Parsing Network (PSPNet) is a semantic segmentation model developed to enhance learning the full context representation of the input scene. PSPNet is mainly composed of four phases: (i) Creating the feature map of the input image using a CNN. (ii) Applying a pyramid pooling mechanism. This pooling mechanism contains four pooling levels arranged in a pyramid hierarchy, each followed by a 1 x 1 convolutional layer. Each pyramid level is responsible for analyzing different parts of the input image at different locations. (iii) Upsampling and concatenating the outputs of the pyramid levels to give initial feature maps that contain the local and global information of the input image. (iv) Applying a convolutional layer to the feature maps to generate the prediction image.
PSPNet scored a performance of 85.4% using the test set of PASCAL VOC 2012 benchmark and 81.2% using the test set of Cityscapes benchmark. Fig. 3 shows the network structure of PSPNet.
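The pyramid pooling phases listed above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the PSPNet implementation: the function names are hypothetical, the bin sizes (1, 2, 3, 6) are assumed since the text only says there are four levels, the 1 x 1 convolution is written as a plain matrix product with caller-supplied weights, and nearest-neighbor upsampling stands in for whatever interpolation the real network uses.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map down to (C, bins, bins)."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins))
    for i in range(bins):
        r0, r1 = (i * h) // bins, -(-(i + 1) * h // bins)  # floor, ceil
        for j in range(bins):
            c0, c1 = (j * w) // bins, -(-(j + 1) * w // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, h, w):
    """Nearest-neighbor upsampling of a (C, h', w') map to (C, h, w)."""
    rows = (np.arange(h) * feat.shape[1]) // h
    cols = (np.arange(w) * feat.shape[2]) // w
    return feat[:, rows][:, :, cols]

def pyramid_pooling(feat, weights, bin_sizes=(1, 2, 3, 6)):
    """Pool at each level, apply a 1 x 1 convolution, upsample, and concatenate.

    weights[k] has shape (C_out, C) and plays the role of the 1 x 1
    convolution of level k (hypothetical, normally learned).
    """
    h, w = feat.shape[1:]
    branches = [feat]  # the original features carry the local information
    for bins, wgt in zip(bin_sizes, weights):
        pooled = adaptive_avg_pool(feat, bins)
        reduced = np.einsum("oc,chw->ohw", wgt, pooled)  # 1 x 1 convolution
        branches.append(upsample_nearest(reduced, h, w))
    return np.concatenate(branches, axis=0)
```

The coarsest level (one bin) yields a value that is constant over the whole image, which is how global context reaches every pixel after concatenation.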

III. EVALUATION DATASET
In order to evaluate semantic segmentation methods, we chose to introduce fog, rain, blur, and noise into the Cityscapes evaluation set, which consists of clear images only. In this section, we describe our proposed challenging datasets in detail. Examples from the datasets are shown in Fig. 4.
Due to the difficulty of collecting and annotating images in rainy weather, we chose to add synthetic rain to the clear weather images of the Cityscapes dataset. In this work, we consider a rain image as a composition of a rain-free image and a rain layer. We formulate the rain image O(i,j) at pixel (i,j) as follows:
O(i,j) = I(i,j) + R(i,j)
where I(i,j) denotes the rain-free image and R(i,j) denotes the rain layer. The rain layer is created by the following process: (i) Creating a black layer B(i,j) with the size of the rain-free image. (ii) Adding Gaussian noise to the black layer. We used a 1D Gaussian distribution. Its standard deviation α determines the rain density. (iii) Applying a motion blur filter to the black layer with the Gaussian noise to create the rain layer. We chose the rain motion to be diagonal. We convolved a 2D filter (50 x 50) across the image. As the direction of the 1's across the filter grid gives the direction of the desired motion, we used an identity matrix as the motion blur filter.
Algorithm 1 Adding rain to clear weather images
1: function ADDRAIN(I(i, j), α) (I(i, j): clear image, α: rain density)
(create a black layer with the same size as the clear weather image I(i, j))
2: height, width ← I(i, j).shape
...
7: return R(i, j) (R(i, j): rainy image)
8: end function
The images of our rainy Cityscapes datasets are characterized by the parameter α used to create the rain layer: rain density increases with α. We created four rainy datasets with α of 15, 20, 25, and 30. Algorithm 1 describes the procedure for adding rain to an input clear image.
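The rain generation steps described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the exact implementation: the function name `add_rain` and the `seed` parameter are hypothetical, and the text does not specify how negative noise values are handled, whether the identity kernel is normalized, or how the kernel is anchored, so the clipping, the division by the kernel size, and the top-left anchoring below are all assumptions.

```python
import numpy as np

def add_rain(image, alpha, kernel_size=50, seed=None):
    """Add a synthetic rain layer to a clear uint8 image of shape (H, W) or (H, W, 3)."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]

    # (i) Black layer with the size of the rain-free image.
    layer = np.zeros((height, width), dtype=np.float64)

    # (ii) Gaussian noise whose standard deviation alpha sets the rain density.
    layer += rng.normal(0.0, alpha, size=(height, width))
    layer = np.clip(layer, 0, 255)  # assumption: negative values are clipped to black

    # (iii) Diagonal motion blur. Filtering with an identity matrix sums the
    # layer along the main diagonal; dividing by kernel_size (assumption)
    # keeps the result in the original intensity range.
    padded = np.pad(layer, ((0, kernel_size - 1), (0, kernel_size - 1)))
    rain = np.zeros_like(layer)
    for k in range(kernel_size):
        rain += padded[k:k + height, k:k + width]
    rain /= kernel_size

    # Compose the rainy image as clear image plus rain layer.
    if image.ndim == 3:
        rain = rain[..., None]
    return np.clip(image.astype(np.float64) + rain, 0, 255).astype(np.uint8)
```

Calling `add_rain(img, alpha)` with α of 15, 20, 25, and 30 would then correspond to the four density levels used for the rainy datasets.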
The authors of [12] developed an algorithm to add synthetic fog to the clear weather images of the Cityscapes dataset. We chose to use this algorithm to create our foggy evaluation dataset. In this dataset, fog density is defined by the visibility range of the image. We created four foggy datasets with visibility ranges of 600, 300, 150, and 75 meters.
In order to evaluate the performance of semantic segmentation methods on blurred images, we blurred the clear Cityscapes dataset. We convolved the clear images with a 2D Gaussian kernel that has a standard deviation γ. The standard deviation γ of the Gaussian kernel represents the density of blurring: as γ increases, the blurring density increases. We created four blurred datasets with γ of 1, 3, 5, and 7.
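The text does not restate the fog model of [12]. As a rough illustration of how a visibility range can translate into fog density, the sketch below assumes the standard optical model in which the scene radiance is attenuated by a transmittance t = exp(-β d) and blended with a constant atmospheric light, with β = 2.996 / visibility so that t falls to 0.05 at the visibility distance. The function name `add_fog`, the per-pixel depth map `depth_m`, and the constant `atmospheric_light` are assumptions for this example; the actual pipeline in [12] may differ in how depth and atmospheric light are obtained.

```python
import numpy as np

def add_fog(image, depth_m, visibility_m, atmospheric_light=255.0):
    """Blend a clear uint8 image with atmospheric light according to scene depth.

    image:        (H, W) or (H, W, 3) clear image
    depth_m:      (H, W) distance of each pixel from the camera in meters (assumed input)
    visibility_m: visibility range in meters; smaller means denser fog
    """
    beta = 2.996 / visibility_m              # attenuation coefficient, -ln(0.05) / visibility
    transmittance = np.exp(-beta * depth_m)  # fraction of scene light that reaches the camera
    if image.ndim == 3:
        transmittance = transmittance[..., None]
    foggy = image.astype(np.float64) * transmittance + atmospheric_light * (1.0 - transmittance)
    return np.clip(np.rint(foggy), 0, 255).astype(np.uint8)
```

Under this assumed model, halving the visibility range doubles β, so distant objects fade into the atmospheric light more quickly.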
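The blurring step can be sketched as a separable Gaussian filter in NumPy. This is an illustrative sketch, not the exact code used to build the datasets: the function names are hypothetical, and the kernel truncation at four standard deviations and the reflective border handling are assumptions, since the text only specifies the standard deviation γ.

```python
import numpy as np

def gaussian_kernel1d(gamma, truncate=4.0):
    """Normalized 1D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = int(truncate * gamma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / gamma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, gamma):
    """Blur a uint8 image of shape (H, W) or (H, W, 3) with a 2D Gaussian kernel.

    A 2D Gaussian is separable, so filtering the rows and then the columns
    with the 1D kernel is equivalent to one 2D convolution.
    """
    kernel = gaussian_kernel1d(gamma)
    radius = len(kernel) // 2
    img = image.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * img.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(img, pad, mode="reflect")  # assumption: reflective borders
        n = img.shape[axis]
        img = sum(
            weight * np.take(padded, np.arange(i, i + n), axis=axis)
            for i, weight in enumerate(kernel)
        )
    return np.clip(np.rint(img), 0, 255).astype(np.uint8)
```

Applying `gaussian_blur(img, gamma)` with γ of 1, 3, 5, and 7 would correspond to the four blur levels described above.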
Noise is defined as aberrant pixels. This means that the pixels are not representing the color or the exposure of the scene correctly. Noise in images can make it impossible to determine the objects in the scene. To determine the performance of the semantic segmentation models with noisy images, we chose to add noise to the clear images from Cityscapes.
One kind of noise that occurs in all recorded images to a certain extent is Gaussian noise. This noise can be modeled with an independent, additive model, where the noise has a zero-mean Gaussian distribution and is described by its standard deviation σ. We used the standard deviation σ of the Gaussian model to represent the noise density: as σ increases, the noise density increases. We created four noisy datasets with σ of 5, 10, 15, and 20.
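The additive zero-mean Gaussian noise model described above can be sketched in a few lines of NumPy. The function name `add_gaussian_noise` and the `seed` parameter are hypothetical, and rounding and clipping back to the 8-bit range is an assumption about how the noisy images are stored.

```python
import numpy as np

def add_gaussian_noise(image, sigma, seed=None):
    """Add independent zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=image.shape)  # one independent sample per pixel and channel
    noisy = image.astype(np.float64) + noise
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)
```

Calling `add_gaussian_noise(img, sigma)` with σ of 5, 10, 15, and 20 would correspond to the four noise levels described above.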

IV. EXPERIMENTS
In this section, we evaluate the performance of the DeepLabv3+ and PSPNet methods using the foggy, rainy, blurred, and noisy datasets. We used the intersection-over-union (IoU) metric to measure the methods' performance:
IoU = TP / (TP + FP + FN)
where TP is the number of true positive labeled pixels, FP is the number of false positive labeled pixels, and FN is the number of false negative labeled pixels. mIoU is the mean intersection-over-union over the whole evaluation set.
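As an illustration of how the metric can be computed from label maps, the sketch below accumulates a confusion matrix and derives the per-class IoU as TP / (TP + FP + FN). It is a sketch under assumptions rather than the evaluation code used here: the function names are hypothetical, mIoU is taken as the average of the per-class IoU values, and ground-truth pixels equal to `ignore_label` (255 in this example) are assumed to be excluded.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_label=255):
    """Count pixels by (ground-truth class, predicted class)."""
    mask = gt != ignore_label
    index = gt[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def iou_per_class(conf):
    """IoU = TP / (TP + FP + FN) for each class; NaN for classes that never occur."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as the class but labeled otherwise
    fn = conf.sum(axis=1) - tp  # labeled as the class but predicted otherwise
    denom = tp + fp + fn
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

def mean_iou(conf):
    """Average the IoU over the classes that occur."""
    return float(np.nanmean(iou_per_class(conf)))
```

To score a whole evaluation set under these assumptions, the confusion matrices of the individual images would be summed before calling `mean_iou`.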
DeepLabv3+ and PSPNet score an mIoU of 78.73% and 76.99%, respectively, on the clear Cityscapes evaluation set. Our experiment evaluates the performance of these models across different density levels of fog, rain, blur, and noise.
By comparing the performance of these two methods, we found that DeepLabv3+ outperforms PSPNet. Even though the two methods have approximately the same performance on the clear Cityscapes dataset, DeepLabv3+ has a higher performance than PSPNet on the foggy, rainy, blurred, and noisy Cityscapes datasets. The two methods showed stable performance with light fog and rain, while their performance degraded sharply with heavy fog and rain. In contrast, the performance of the two models dropped quickly even at low densities of blur or noise.
Although DeepLabv3+ shows a higher performance than PSPNet across the different semantic segmentation challenges, our experiments show clearly that these two semantic segmentation methods are not robust to foggy, rainy, blurred, and noisy images. Our challenging datasets severely degraded the performance of both methods. Fig. 5 shows the mIoU of the two methods across the different density levels of fog, rain, blur, and noise.
In order to have safe autonomous vehicles, the systems on these vehicles should work efficiently in all weather conditions. Also, semantic segmentation methods in autonomous vehicle systems should be robust against different types of noise in road images. Fig. 6, Fig. 7, Fig. 8, and Fig. 9 show some qualitative results of DeepLabv3+ and PSPNet on our challenging datasets.

V. CONCLUSION
In this paper, we studied the performance of state-of-the-art semantic segmentation methods under different severe imaging conditions and challenges. We used the Cityscapes dataset, which consists of clear images only, to create new challenging datasets. We created foggy, rainy, blurred, and noisy Cityscapes datasets. We evaluated the performance of the DeepLabv3+ and PSPNet methods using our new challenging datasets. We showed that although DeepLabv3+ and PSPNet perform well on clear images, these two methods do not perform reliably on the different challenging datasets.
Our datasets can also be used to improve the performance of semantic segmentation models. This could be done by fine-tuning these models during training using images from our datasets.
In this work, we show that semantic segmentation methods must be evaluated under different kinds of severe imaging conditions to ensure the robustness of the methods and thus the safety of autonomous vehicles.