Severely Degraded Underwater Image Enhancement with a Wavelet-based Network

Abstract: Underwater optical images are important in marine science and ocean engineering because cameras provide color information at low cost and in a compact form. However, captured underwater images are often degraded, and correcting their wavelength-selective signal attenuation, which depends on complex underwater physical processes, is essential in practical applications. While recently developed deep learning is a promising choice, constructing a dataset large enough to cover the full range of real images is challenging, a difficulty peculiar to underwater image processing. To supplement relatively small datasets, previous studies construct artificial underwater image datasets based on a physical model or a Generative Adversarial Network. Incorporating traditional signal processing methods into the network architecture has also shown promising success, though enhancement of severely degraded underwater images remains a major issue. In this paper, we tackle underwater image enhancement with an encoder-decoder deep learning model incorporating the discrete wavelet transform and the whitening and coloring transform. We also construct a dataset of severely degraded real underwater images. The presented model shows strong results on the artificial image dataset both qualitatively and quantitatively, and visually convincing results on the real image dataset. The constructed dataset is available at https://github.com/tkswalk/2022-IJACSA.


I. INTRODUCTION
Underwater optical images are essential for sensing the vast ocean environment. Optical cameras capture high-resolution color information and are relatively low-cost and compact compared to acoustic devices. Underwater optical images are especially valuable in tasks requiring color information, such as ocean monitoring, maintenance of port facilities, and resource development, but serious image degradation is an obstacle to their efficient use. Specifically, wavelength-selective color distortion, which produces blueish, greenish, or yellowish appearances, and decreased contrast caused by complex underwater physical processes worsen the visibility of an underwater image [1], [2], as shown in the upper part of Fig. 1.
To overcome the low visibility of underwater images and expand the scope of application, underwater image enhancement methods based on deep learning have rapidly improved through refinements of model architectures and training datasets. In underwater image enhancement, deep learning models are mainly trained to map degraded images to the corresponding clear images. However, collecting pairs of clear and degraded real underwater images is costly or inherently difficult, especially in turbid coastal water. As an alternative, artificial underwater image datasets constructed with a simplified physical model or a Generative Adversarial Network (GAN) are employed for training, yet their effectiveness is limited: real underwater images depend on complex physical processes and many physical parameters, such as water body and ambient light, and may differ from artificial images [3], [4]. More recently, an artificial underwater image dataset based on the revised underwater image formation model [4] has been proposed, which better reflects the real underwater environment [5].
Under the constraint of a limited amount of data, incorporating traditional signal processing methods into the network architecture is also effective in underwater image enhancement. While a shallow CNN-based model incorporating white balance, histogram equalization, and gamma correction has shown measurable success, enhancement of severely degraded underwater images remains a challenging issue [6].
In this paper, we tackle severely degraded underwater image enhancement with an encoder-decoder network combining the discrete wavelet transform and the whitening and coloring transform (WCT). The high-frequency components of an input image are structurally extracted with the discrete wavelet transform in the encoder and preserved by passing them to the decoder, thereby yielding a sharp output. In addition, because underwater images are diverse and display various color tones and degrees of blurriness, input image features are whitened and mapped to style image features with WCT to stabilize training. The presented model is trained with the recently proposed physically revised artificial underwater image dataset [5] and an elaborated loss function. We also present a dataset of seriously degraded real underwater images taken in Okinawa, Japan. The constructed dataset includes blueish and greenish images of divers, an underwater construction machine, and rubble mounds of port structures. Our underwater image enhancement model is evaluated on the artificial image dataset and the constructed real image dataset, showing good results both qualitatively and quantitatively. Our main contributions are summarized as follows:
• We present an underwater image enhancement model combining the discrete wavelet transform and the whitening and coloring transform.
• We construct a real underwater image dataset including severely degraded blueish or greenish underwater images.
• The presented model successfully removes the overall blueish tones of seriously degraded underwater images, outperforming state-of-the-art underwater image enhancement methods in most cases on both real and artificial datasets.

A. Previous Underwater Image Enhancement Methods
Supervised underwater image enhancement models based on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and the more recent Vision Transformer (ViT) have rapidly improved. As these models mainly learn pixel transformation tasks, skip connections are often employed so that the output does not deviate far from the original input image. An encoder-decoder process is also adopted to mitigate input noise. Specifically, UWCNN is a densely connected CNN model in which the input is injected into different layers, with no pooling layers or batch normalization steps [7]. FUnIE-GAN is a fully convolutional conditional GAN model whose generator has five encoder-decoder pairs with several skip connections to enable real-time inference [8]. Both of these models are trained with an artificial underwater image dataset.
The recently proposed ViT-based model is also equipped with several skip connections to stabilize training. To cope with the wavelength-selective and spatially variant signal attenuation of underwater images, channel-wise attention and spatial-wise attention are incorporated into the architecture [9]. Because covering the full range of real underwater images is difficult, incorporating traditional signal processing methods into the network is effective in underwater image processing. For example, Water-Net is a simple CNN-based network that fuses the results of white balance, gamma correction, and histogram equalization [6]. First, the three signal processing results and the original input are fed to the network to predict three fusion coefficient maps. Each signal processing result is also passed through an independent feature transformation unit to reduce the artifacts introduced by the signal processing methods, and the predicted coefficient maps are multiplied by these refined results. The final output is obtained by fusing the three products. The discrete wavelet transform is also employed to preserve fine image structure [10], [11]. Apart from learning-based methods, many unsupervised underwater image enhancement methods assume a physical model and correct color distortion by imposing white balance, which often requires estimating the ambient light or the average color [12], [13].
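The fusion step described for Water-Net can be illustrated with a minimal sketch. It assumes that "fusing" means a per-pixel sum of each refined result weighted by its coefficient map; the function name `fuse` and all numeric values are hypothetical and chosen only for illustration, not taken from [6].

```python
def fuse(coefficient_maps, refined_results):
    """Per-pixel fusion: sum over i of coefficient_maps[i][p] * refined_results[i][p]."""
    n_pixels = len(refined_results[0])
    return [
        sum(c[p] * r[p] for c, r in zip(coefficient_maps, refined_results))
        for p in range(n_pixels)
    ]


# Hypothetical refined results (white balance, gamma correction,
# histogram equalization) for a 2-pixel grayscale toy image.
refined = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
# Hypothetical predicted coefficient maps, one per refined result.
coefficients = [[0.5, 0.0], [0.25, 1.0], [0.25, 0.0]]

fused = fuse(coefficients, refined)
```

In this toy case the second pixel is taken entirely from the gamma-corrected result, which shows how the coefficient maps let the network select a different signal processing output at each location.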

B. Previous Underwater Image Datasets
As obtaining sufficient real underwater image pairs is challenging, construction of the dataset itself is important in underwater image processing. Based on a simplified underwater image formation model, [7] constructed an artificially deteriorated underwater image dataset that visually matches real underwater images by setting the attenuation coefficient to a constant and neglecting other related physical parameters. More recently, an artificial underwater image dataset based on the revised underwater image formation model [4] has been proposed, which explicitly takes into account the dependency on water types, lighting conditions, and camera sensors. This dataset is suggested to be more realistic than the previous one [5]. GAN-based approaches generate artificial underwater images by converting initially clear underwater images into degraded ones that fool the classifier. The model is trained with an unpaired dataset by minimizing a cycle-consistency loss [8], [14]. Apart from artificial underwater images, clearly enhanced real underwater images have been collected by manually ranking the results of many conventional enhancement methods. This approach is expected to reflect human perception, yet it is laborious and the sample size is limited to at most a few thousand [6], [9].
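To make the idea of a simplified image formation model concrete, the sketch below applies the commonly used form I = J·t + B·(1 − t) with transmission t = exp(−β·d) to a single RGB pixel. This is an illustration under stated assumptions rather than the exact model of [7]: the function name `degrade` and the values of the per-channel attenuation coefficients `beta` and the ambient light `ambient` are hypothetical.

```python
import math


def degrade(clear_rgb, depth_m, beta, ambient):
    """Simplified formation model per channel: I = J * t + B * (1 - t), t = exp(-beta * d)."""
    transmission = [math.exp(-b * depth_m) for b in beta]
    return [
        j * t + a * (1.0 - t)
        for j, t, a in zip(clear_rgb, transmission, ambient)
    ]


# Hypothetical constants: red light attenuates fastest, ambient light is blue-green.
beta = (0.60, 0.10, 0.05)
ambient = (0.05, 0.35, 0.45)
pixel = (0.8, 0.6, 0.4)

near = degrade(pixel, 0.0, beta, ambient)   # zero distance: pixel is unchanged
far = degrade(pixel, 10.0, beta, ambient)   # 10 m: red is almost fully lost
```

With a constant `beta` per channel, the degradation depends only on scene depth, which is the simplification that the revised model [4] relaxes by accounting for water type, lighting, and camera sensor.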

III. METHODOLOGY
The presented underwater image enhancement model is based on a simple encoder-decoder network architecture with several skip connections, similar to the well-known U-Net for image segmentation [15], as shown in Fig. 2. Pooling and upsampling layers are replaced with the discrete wavelet transform and the inverse discrete wavelet transform, respectively, to maintain structural information. The Whitening and Coloring Transform (WCT), mainly employed in style transfer, is also incorporated into the model to mitigate covariate shift between the training and test data distributions. A brief introduction to the discrete wavelet transform and WCT is given below, followed by the details of the model architecture.

A. Signal Reconstruction with Wavelet Transform
The power of the discrete wavelet transform (DWT), especially with the Haar wavelet, has been shown in style transfer and inverse problems, where it generalizes conventional pooling operations such as average pooling or max pooling, which simply subsample and summarize neighboring pixel information [16], [11], [17]. The Haar wavelet operation consists of four kernels, $\{LL^\top, LH^\top, HL^\top, HH^\top\}$, where the low-pass filter $L$ and the high-pass filter $H$ are respectively defined as

$$L^\top := \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad H^\top := \frac{1}{\sqrt{2}} \begin{bmatrix} -1 & 1 \end{bmatrix}.$$

Frequency information is efficiently retained and extracted with $L$ and $H$: the low-frequency signal is captured by $L$, while the high-frequency signal is captured by $H$. The inverse discrete wavelet transform (IDWT) is the mirror operation of the discrete wavelet transform and is employed for structural reconstruction in the decoder with minimal noise amplification.
www.ijacsa.thesai.org
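The following pure-Python sketch illustrates how the four Haar kernels act as a stride-2 pooling replacement and why the IDWT can restore the input exactly. It is a minimal illustration, not the network layer itself: the function names are hypothetical, the image is a small list of lists, and the assignment of the names LH and HL to the two mixed kernels follows the outer-product convention assumed here.

```python
def haar_dwt2(img):
    """Split an even-sized 2D list into LL, LH, HL, HH sub-bands (stride 2)."""
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            # Each kernel is an outer product of L and H, hence the factor 1/2.
            ll[i // 2][j // 2] = (a + b + c + d) / 2
            lh[i // 2][j // 2] = (-a + b - c + d) / 2
            hl[i // 2][j // 2] = (-a - b + c + d) / 2
            hh[i // 2][j // 2] = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Mirror operation: rebuild the full-resolution image from the four sub-bands."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s - x - y + z) / 2
            img[2 * i][2 * j + 1] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s + x + y + z) / 2
    return img


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
```

Unlike average or max pooling, which keep only one summary value per 2 × 2 block, the three high-frequency sub-bands retain what pooling discards, so passing them to the decoder allows a lossless reconstruction.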

B. Whitening and Coloring Transform (WCT)
The aim of WCT in style transfer is to obtain a stylized image that preserves content features [18]. After feature extraction with a pre-trained network, the high-dimensional feature maps $f_c$ of the content image are first transformed so that their covariance matrix becomes an identity matrix (whitening), which relies on singular value decomposition. The whitened content feature $\hat{f}_c$ is then projected onto the eigenspace of the style feature $f_s$ (coloring), following the procedure of [18]. WCT is incorporated in our model to mitigate the covariate shift between training data and testing data.
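For reference, the whitening and coloring steps are usually written as below. This is a sketch of the standard formulation associated with [18], expressed in the notation of this section; it assumes mean-centered features and the decompositions $f_c f_c^\top = E_c D_c E_c^\top$ and $f_s f_s^\top = E_s D_s E_s^\top$, where the symbols $E$, $D$, and $\hat{f}_{cs}$ are introduced here for illustration.

```latex
% Whitening: the covariance of \hat{f}_c becomes the identity matrix.
\hat{f}_c = E_c \, D_c^{-1/2} \, E_c^{\top} f_c,
\qquad \hat{f}_c \hat{f}_c^{\top} = I

% Coloring: the covariance of \hat{f}_{cs} matches that of the style feature.
\hat{f}_{cs} = E_s \, D_s^{1/2} \, E_s^{\top} \hat{f}_c,
\qquad \hat{f}_{cs} \hat{f}_{cs}^{\top} = f_s f_s^{\top}
```

After coloring, the style feature mean is typically added back to $\hat{f}_{cs}$, so that both first-order and second-order statistics of the transformed features match those of the style features.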

C. Network Architecture
The network architecture shown in Fig. 2 is a simple encoder-decoder model with several skip connections and no pooling layers. To preserve detailed image signals, the high-frequency components extracted with the discrete wavelet transform in the encoder, $\{LH^\top, HL^\top, HH^\top\}$, are passed to the inverse discrete wavelet transform in the decoder. WCT is incorporated in the color correction module to normalize feature maps.
Input images are first passed through a convolutional layer followed by several convolution, padding, and ReLU activation layers in the color correction module. The encoded features then go through the discrete wavelet transform layer, and the low-frequency component, $LL^\top$, is processed with WCT and the subsequent deeper layers. The remaining high-frequency components, $LH^\top$, $HL^\top$, and $HH^\top$, are skipped to the decoder to preserve detailed signals. The encoded features and the passed high-frequency components are up-sampled with the inverse discrete wavelet transform, followed by several convolution, padding, and ReLU activation layers. The subsequent refinement module is similar to the color correction module, but WCT is removed, and several convolution, padding, and ReLU activation layers together with final padding, convolution, and hyperbolic tangent activation layers are added to mitigate input noise. This repeated structure is designed to extract local image structure. The kernel size and stride of all convolutional layers are set to 3 and 1, respectively.
We use a model pre-trained for the photo-realistic style transfer method [11], denoted as the baseline, to normalize the complex input distribution caused by the complicated real underwater environment. Our model is similar to [11], but adds one pair of DWT and IDWT layers, a few convolution, padding, and ReLU activation layers, and final padding, convolution, and hyperbolic tangent activation layers. Fig. 3 shows that our model (right) recovers the severely degraded artificial underwater image better than the baseline (left). As quantitative metrics, we compute the Peak Signal to Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM), full-reference image quality metrics that reflect human perception. The PSNR and SSIM of our method and the baseline for each of the 10 water types classified by [19] are shown in Table I.
PSNR and SSIM scores improve by approximately 1 dB and 0.3, respectively, compared to the baseline in all water types.
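As a reminder of what the PSNR scores measure, the sketch below computes PSNR from the mean squared error between a reference and an estimate. It assumes pixel values scaled to [0, 1] and flattened into plain lists; the function name `psnr` and the toy values are illustrative only and unrelated to the reported tables.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak Signal to Noise Ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]
score = psnr(reference, estimate)  # MSE = 0.005, roughly 23 dB
```

Because PSNR is logarithmic in the error, a gain of about 1 dB corresponds to a reduction of the mean squared error by roughly 20%.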

D. Loss Function
We combine three loss terms between a correct image $I_c$ and an estimated image $I_e$ to train our network: the reconstruction loss $L_{rec}$, the Laplacian pyramid Lap1 loss $L_{lap}$, and the luminance loss $L_{lum}$. The total loss in Eq. (1) is their weighted combination, where $\alpha$, $\beta$, and $\lambda$ are hyperparameters.
The reconstruction loss $L_{rec}$ is the pixel-wise $\ell_1$ distance between $I_c$ and $I_e$. The Laplacian pyramid Lap1 loss $L_{lap}$ measures differences between $I_c$ and $I_e$ in the Laplacian pyramid representation, in order to take various frequency components into account and obtain a well-structured image [20], [21]; here, $L_i(I)$ denotes the $i$-th level of the Laplacian pyramid representation of an image $I$ [22]. We also propose the luminance loss $L_{lum}$ in Eq. (4), which measures the pixel-wise difference between the luminance components $Y$ of $I_c$ and $I_e$ after transformation to the YCbCr color space, where $Y$ is computed from the red, green, and blue channels $R$, $G$, and $B$ of the original image. The luminance loss is proposed to facilitate training, as luminance components are less susceptible to the color tones of an underwater image.
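A plausible form of the loss terms, consistent with the description above and with the references to Eq. (1) and Eq. (4) elsewhere in the paper, is sketched below. Several details are assumptions rather than confirmed definitions: the pairing of $\alpha$, $\beta$, and $\lambda$ with the three terms follows the order in which they are listed, the per-level weight $2^{2i}$ is the form commonly used for the Lap1 loss, the luminance difference is taken in the $\ell_1$ norm, and the luminance coefficients are the standard ITU-R BT.601 values.

```latex
L_{total} = \alpha L_{rec} + \beta L_{lap} + \lambda L_{lum}                      % (1)
L_{rec}   = \lVert I_c - I_e \rVert_1                                             % (2)
L_{lap}   = \sum_i 2^{2i} \, \lVert L_i(I_c) - L_i(I_e) \rVert_1                   % (3)
L_{lum}   = \lVert Y(I_c) - Y(I_e) \rVert_1                                       % (4)
Y         = 0.299\,R + 0.587\,G + 0.114\,B                                        % (5)
```

Under these assumptions, $L_{rec}$ penalizes errors in every color channel, $L_{lap}$ emphasizes structure across scales, and $L_{lum}$ compares only brightness, which is why it is less affected by the color cast of the input.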

A. Construction of Real Underwater Image Dataset
We collected real underwater images around rubble mounds of port structures in Okinawa, Japan. The constructed dataset contains significantly degraded underwater images of an underwater construction machine, a diver, and rubble mounds, taken with a GoPro HERO4. As shown in Fig. 4, the images were taken either directly by a diver or by a camera mounted on the upper part of the construction machine, and show blueish or greenish appearances. The constructed dataset is available at https://github.com/tkswalk/2022-IJACSA.

B. Experimental Setting
We train the model with a recently proposed artificial underwater image dataset [5] based on the revised underwater image formation model, which better reflects the real underwater environment. The formation model explicitly considers the dependencies on related physical parameters, such as water types, lighting conditions, and camera sensors [4]. In the dataset, wavelength data of the 10 water types classified by [19], two camera sensors, and three light spectra are employed, so that 60 artificial images are generated per original image. Clear indoor RGB-D images from NYU Depth Dataset V2 [23], which contain depth information, are transformed based on the underwater physical model, resulting in 86940 image pairs in total [4], [5]. Among the 1449 original images from NYU Depth Dataset V2, the first 1000 images are used for training, the next 300 images for validation, and the last 149 images for testing.
A degraded input image is first resized to a resolution of 256 × 256 and mapped to an enhanced image. The coefficients of the loss function, $\alpha$, $\beta$, and $\lambda$, are set to 1, 10, and 1, respectively. The Adam optimizer [24] is adopted with a learning rate of 0.0001. The model is trained for 80 epochs and is implemented in PyTorch on a GeForce RTX 2080 Ti GPU.
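The dataset sizes quoted above can be checked with a few lines of arithmetic. The per-split pair counts below assume that the split is made on the original images before the 60 variants are generated, which the text implies but does not state explicitly; the variable names are illustrative.

```python
water_types = 10
camera_sensors = 2
light_spectra = 3
variants_per_image = water_types * camera_sensors * light_spectra

# Number of original NYU Depth Dataset V2 images in each split.
split = {"train": 1000, "val": 300, "test": 149}
source_images = sum(split.values())
total_pairs = source_images * variants_per_image

# Assumed: every original image contributes all of its variants to its own split.
pairs_per_split = {name: n * variants_per_image for name, n in split.items()}
```

This reproduces the 60 variants per image and the 86940 image pairs in total, and gives 60000 training pairs under the stated assumption.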
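The way the three coefficients weight the loss terms can be sketched as follows. This is an illustration under assumptions, not the training code: the coefficients are paired with the reconstruction, Lap1, and luminance terms in that order, each term is averaged over its elements, the luminance uses the standard BT.601 weights, and the Lap1 term is passed in as a precomputed number (`lap_term`, a hypothetical argument) to keep the example short.

```python
def l1(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def luminance(rgb_pixels):
    """Y component of YCbCr (BT.601 weights, assumed) for each (R, G, B) pixel."""
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in rgb_pixels]


def total_loss(correct, estimate, lap_term, alpha=1.0, beta=10.0, lam=1.0):
    """Weighted sum alpha * L_rec + beta * L_lap + lam * L_lum (pairing assumed)."""
    flat_c = [v for px in correct for v in px]
    flat_e = [v for px in estimate for v in px]
    rec = l1(flat_c, flat_e)
    lum = l1(luminance(correct), luminance(estimate))
    return alpha * rec + beta * lap_term + lam * lum


correct = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]
estimate = [(0.5, 0.5, 0.5), (0.0, 0.0, 0.0)]
loss = total_loss(correct, estimate, lap_term=0.1)
```

With the reported setting of 1, 10, and 1, the Lap1 term receives ten times the weight of the other two, so structural differences across pyramid levels dominate the objective in this sketch.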

C. Results and Discussions of Artificial Underwater Images
We qualitatively and quantitatively compare the restoration results with available state-of-the-art underwater image enhancement methods. As shown in Fig. 5, FUnIE-GAN [8] (4th row), UWCNN [7] (7th row), Water-Net [6] (8th row), and U-Transformer [9] (9th row) are evaluated as deep learning based approaches, while a retinex-based method (5th row, denoted as Retinex) [13] and the underwater dark channel prior (6th row, denoted as UDCP) [12] are compared as unsupervised methods. The first row of Fig. 5 shows the original indoor images, and the 2nd row shows the transformed input images.

TABLE II. RESULTS OF PSNR PER 10 WATER TYPES (columns: I, IA, IB, II, III, 1C, 3C, 5C, 7C)

The artificial underwater image dataset contains various colors and degrees of degradation, reflecting water types and lighting conditions [5]. In the qualitative evaluation in Fig. 5, many restoration results are not sufficiently recovered because of the severe degradation of the input. While our model restores blueish, greenish, and yellowish artificial underwater images relatively well (3rd row), previous methods either hardly improve the visibility (4th [8], 6th [12], and 7th row [7]) or produce insufficiently restored, whitish images (5th [13], 8th [6], and 9th row [9]). PSNR and SSIM for each of the 10 water types classified by [19] are shown in Table II and Table III, respectively. Our model achieves better performance than the other methods in 9 out of 10 water types. However, its output images are sometimes decolored, as shown in the 4th column of the 3rd row.

D. Results and Discussions of Real Underwater Images
Next, we proceed to the restoration results for the real underwater images shown in the 1st row of Fig. 6. The real underwater images in the 1st to 3rd columns come from the constructed dataset collected in Okinawa, Japan, and the remaining ones come from [6], which contains severely degraded underwater images. Our model (2nd row) restores significantly degraded blueish (1st, 2nd, 3rd, and 6th columns), greenish (4th column), and yellowish (5th column) underwater images. The output images contain fewer overall blueish tones than the results of other methods. Among the previous methods, Water-Net [6] (7th row), which combines white balance, gamma correction, and histogram equalization, performs better on the yellowish and greenish inputs, yet fails to restore the severely degraded input (1st column). The performance of Water-Net is mainly dominated by the signal processing results, as Water-Net fuses their outputs. FUnIE-GAN [8] (3rd row), a GAN-based model, hardly improves the visibility and adds grid artifacts to severely degraded inputs. UWCNN [7] (6th row), a CNN-based model, introduces color bias, as shown in the 4th and 5th columns. The vision transformer based U-Transformer [9] (8th row) fails to recover the greenish and yellowish inputs shown in the 4th and 5th columns, respectively. Among the non-learning-based methods, Retinex [13] (4th row) corrects a greenish image (4th column), yet adds a reddish color bias to other images (2nd, 3rd, and 6th columns). UDCP [12] (5th row), a statistical method, hardly improves the overall visibility. As no ground truth is available for real underwater images, PSNR and SSIM are not computed. Because real underwater images are tremendously diverse, many supervised models fail to enhance severely degraded underwater images. Among the previous methods, the best results are obtained with Water-Net [6]. Water-Net is trained with a dataset of fewer than 1000 real underwater images, whereas our training dataset is about 100 times larger. It also includes a large number of severely degraded underwater images [5], so perceptually better results tend to be obtained with our model, as shown in the 1st column of Fig. 6.

Fig. 5. Original Indoor Images (1st Row), Transformed Input Images (2nd Row), Results of Proposed Model (3rd Row), FUnIE-GAN [8] (4th Row).

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 8, 2022

E. Ablation Study of Loss Function
This section presents an ablation study of the loss function in Eq. (1). PSNR and SSIM scores for each of the 10 water types are reported in Table IV. Results obtained with only the reconstruction loss $L_{rec}$ are denoted as L1, those with the luminance loss $L_{lum}$ added are denoted as L1 + lum, and those with all loss terms are denoted as ALL. Each loss term contributes to the scores in almost all water types. Although the proposed luminance loss $L_{lum}$ in Eq. (4) yields a smaller improvement, we observe that it stabilizes training, as it is less dependent on the input color.

V. CONCLUSION
This study tackles the enhancement of significantly degraded underwater images with a deep learning model incorporating the discrete wavelet transform and the whitening and coloring transform. The presented model is trained with the elaborated loss function and the recently proposed physically revised artificial underwater image dataset. We also construct a real underwater image dataset taken near the rubble mounds of port structures, which characteristically includes severely degraded blueish or greenish underwater images. In the evaluation on the artificial underwater image dataset, the presented model outperforms previous state-of-the-art underwater image enhancement models in 9 out of 10 water types. Our model also successfully removes blueish tints from real underwater images, showing strong qualitative results.