Fast Side Information Generation for High-Resolution Videos in Distributed Video Coding Applications

Distributed video coding (DVC) is an attractive and promising scheme for constrained video applications such as wireless sensor networks and wireless surveillance systems. In DVC, fast and consistent estimation of side information (Տ į) is critical for instant and real-time decoding, and the problem becomes more serious for high-resolution videos. To minimise the computational complexity of side information estimation, this work proposes a low-complexity DVC codec that uses a simple phase interpolation (Phase-I) algorithm. It performs faster for videos of all resolutions, and significant results are achieved for high-resolution videos with a large group of pictures (GOP). For the proposed technique, the computation time rapidly decreases with an increase in resolution. It performs 221% to 280% faster than the conventional frame interpolation method for high-resolution videos with a large GOP, at the cost of a small degradation in the visual quality of the estimated side information.
Keywords: Fast side information algorithm; phase-based interpolation (Phase-I); DVC; DVC decoder for high-resolution videos; real-time DVC decoding; real-time side information


I. INTRODUCTION
Wireless video sensor networks (WVSNs) capture video at distributed video sensor nodes. Conventional video codecs are ill-suited for these nodes, and compression of the captured video has received significant interest in the literature. The availability of high-resolution CMOS image sensors at low cost makes WVSNs increasingly popular [1], especially for real-time surveillance and environment monitoring [2,3] and medical applications [4]. New applications of WVSNs are emerging rapidly and demand efficient pre-processing and transmission [5]. Because the nodes are battery powered, WVSNs need to use storage resources efficiently and keep energy consumption low [6]. Such video sensors therefore demand low-complexity encoding techniques that compress the video to a lower bit rate before storing or transmitting it [7] and that reduce the transmission delay [8]. One supportive coding approach is distributed video coding (DVC), which redistributes the coding complexity so that encoding requires very little computation [9] while decoding can be more complex [10].
In DVC codecs, the frames are organised in a group of pictures (GOP) of size 2, 4 or more. The key-frames are Intra-encoded first and then transmitted, and the intermediate frames are WZ encoded. At the decoder, these WZ frames are estimated from the Intra-decoded key-frames. These DVC coding schemes achieve high compression while maintaining low encoding complexity by using Intra-encoding at the encoder and Inter-decoding at the decoder (because the Տ į estimation depends on key-frames) [11]. The DVC encoder is deliberately kept computationally very simple, but the decoder is computationally very complex since it needs to accurately estimate the replica of the WZ frames, known as side information (Տ į). The traditional Տ į generation algorithms are computationally expensive because of the complex nature of the prediction process, and they take a long time even for a low-resolution video.

A. Motivation and Contribution
The Տ į is estimated either by interpolation or extrapolation, and its quality determines the overall coding efficiency of the codec [12]. Both prediction processes are considered time-consuming activities of the decoding process [13]. They carry a huge computational complexity [14] and take considerable time even for low-resolution videos, which slows down decoding [13]. Earlier efforts to improve the RD performance were carried out mostly on low-resolution videos (1k-pixel and 176x144 pixels per frame) [15]. Researchers are working to design frameworks that achieve low-complexity, real-time decoding for such low-resolution video [15] while achieving a consistent and high RD performance comparable with conventional codecs.
Since low-cost standard definition (SD) and high definition (HD) mini video sensors [16] are widely available, there is a need for a real-time DVC decoding framework for such high-quality videos. However, no DVC framework for real-time or fast Տ į generation for high-quality videos is found in the literature. Therefore, herein, an attempt is made to design a suitable DVC framework with a computationally light Տ į generation algorithm for high-quality videos. In this work, phase interpolation (Phase-I) is incorporated for Տ į generation in the DVC decoder.
The rest of the paper is organised as follows: Section II discusses the background, and Section III describes the proposed DVC model for fast Տ į generation for high-resolution videos. In Section IV, results are presented, and the performance of the proposed model is discussed in terms of computational complexity and of quality measured by the peak signal-to-noise ratio (PSNR). In Section V, future directions are proposed. Finally, the conclusion is in Section VI.

II. BACKGROUND
The DVC [17] structure follows the Slepian-Wolf [18] and Wyner-Ziv [19] theories proposed in the 1970s. These theories state that correlated sources can be independently encoded and jointly decoded. This way, they can still achieve the same rate as if they were jointly encoded and decoded, as long as the correlated Տ į is available and used at the decoder. In a WVSN, the video sensor nodes acquire frames at some rate, and consecutive frames are highly correlated. These correlated frames are independently encoded because no prediction loop is involved in the DVC encoding process, and that is how DVC provides codec-independent scalability. DVC compression efficiency depends heavily on Տ į quality [20]. High-quality Տ į leads to better rate-distortion (RD) performance [21,22], which plays a significant role in achieving a low bit rate and less error correction [23], the main factors for low-latency optimal transmission [24]. However, high-quality Տ į estimation is a difficult task, even for low-resolution videos. WZ encoding and decoding are carried out either in the transform domain (TD) or the pixel domain (PD). Aaron et al. at Stanford first proposed a TD-based DVC framework [25]. In this framework, only the intra-frame statistical reliance is explored. It outperforms other codecs due to superior coding efficiency. Afterwards, the codec known as PRISM (Power-efficient, Robust, hIgh compression Syndrome based Multimedia coding) for TD was proposed by Puri et al. [26,27]. Most established DVC codecs are founded on the Stanford TD-based framework. DISCOVER [28] also complies with the Stanford framework [25].
Despite all the developments made in DVC codecs [6,10,21,29-31], consistent RD performance [32] is still an issue, and DVC does not match the superior performance of conventional codecs for videos with acute and non-uniform motion. This commonly happens because of the substandard quality of the WZ frame replica (known as Տ į), which is estimated by interpolation or extrapolation [33] at the decoder. Superior coding efficiency and a low bit rate are achieved by making use of a highly correlated estimated replica of the WZ frame [34]. However, the Տ į generation process consumes a lot of time because of the computationally complex prediction (motion estimation and compensation) activities [35]. These prediction activities are highlighted as a major source of high computational complexity at the decoder and cause latency in the decoding process [13], even for low-resolution videos. Moreover, the feedback channel for error resilience also imposes delay and increases the decoder complexity [36] because of the iterative requests for the additional bits required for error correction. This, in turn, increases computational complexity and decreases the life of the video sensor because transmission requires more resources than other operations [37].

III. PROPOSED DVC MODEL WITH SIDE INFORMATION GENERATION SCHEME
In video processing, a vast amount of computation is involved in estimating the in-between frames because of prediction and estimation. Specifically, Տ į estimation is the most challenging task in DVC decoding. Conventional interpolation methods require extensive computational resources and time; this complexity prolongs decoding, especially for high-resolution videos. Fig. 1 shows the proposed DVC model for Տ į generation with Phase-I and residual frame (Ɽ) calculation to reduce the transmission rate. The coding efficiency is tied to the transmission rate; therefore, to reduce the transmission rate, Ɽ is calculated and encoded with WZ coding. Consecutive frames are similar, so taking their difference extracts only the motion part, which can be encoded in fewer bits than the actual WZ frame, and WZ coding provides further lossy compression. Ɽ can be calculated by (1).
Ɽ = Ⱳ - K    (1)
In (1), Ⱳ is the current actual WZ frame, and K is the previous key-frame.
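As an illustration only, the residual calculation can be sketched in a few lines of Python. The sketch assumes that the residual is the pixel-wise difference between the current WZ frame and the previous key-frame, that frames are greyscale and stored as nested lists, and that the decoder adds the key-frame back; the function names `residual_frame` and `restore_frame` are hypothetical and not part of the proposed codec.

```python
def residual_frame(wz_frame, key_frame):
    """Pixel-wise difference between a WZ frame and the previous key-frame.

    Assumption: both frames are greyscale, given as lists of rows.
    """
    if len(wz_frame) != len(key_frame) or any(
        len(w_row) != len(k_row) for w_row, k_row in zip(wz_frame, key_frame)
    ):
        raise ValueError("frames must have identical dimensions")
    return [
        [w - k for w, k in zip(w_row, k_row)]
        for w_row, k_row in zip(wz_frame, key_frame)
    ]


def restore_frame(residual, key_frame):
    """Inverse operation: add the key-frame back to the residual."""
    return [
        [r + k for r, k in zip(r_row, k_row)]
        for r_row, k_row in zip(residual, key_frame)
    ]


key = [[10, 10, 10], [10, 10, 10]]
wz = [[10, 12, 10], [10, 10, 9]]
res = residual_frame(wz, key)  # mostly zeros: only the moving pixels remain
```

In this toy case only two of the six residual values are non-zero, which is the property that lets the residual be encoded in fewer bits than the frame itself.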
Phase-I computes the pixel-wise phase modification without any extensive global optimisation and estimates the motion of each pixel by shifting its phase. In addition, the phase shift correction feature combines the phase information across all the levels of a multi-scale pyramid [38] in a very short time.
Avoiding the expensive global optimisation, which is a typical part of optical flow techniques, allows in-between frames to be interpolated in a fraction of the time of the traditional interpolation methods. Therefore, deploying it in the DVC decoder decreases the Տ į generation complexity and the overall decoding time and complexity, even for high-resolution videos.
278 | P a g e www.ijacsa.thesai.org
The Phase-I algorithm listing summarises the execution steps of Phase-I. The inputs of Phase-I are two images and the interpolation parameter α, and the process starts with the steerable pyramid decompositions of both images and the calculation of their amplitudes. The output of this algorithm is the interpolated image.

Phase-I Algorithm
Inputs: two input images (1 and 2); interpolation parameter α
Initialisation: steerable pyramid decompositions of images 1 and 2; amplitude calculation for images 1 and 2
Output: interpolated image
Step 7: Step 10: Step 12: ← Reconstruct ( )

In Fig. 2, the flowchart defines the basic steps for Տ į interpolation with Phase-I. All mathematical notations are available in [38]. The flowchart presents the step-by-step process for estimating the Տ į. Phase-I first decomposes the input images into steerable pyramids, a linear multi-scale, multi-orientation image decomposition, and calculates the amplitudes as well [38]. The Phase-I approach has a few intuitive parameters. The main parameters control the number of orientations and the number of levels corresponding to the different scales of the steerable pyramid. Better motion separation is achieved with a higher number of levels and orientations. The parameter settings used for the generation of Տ į are as follows: the number of orientations is 8, and the number of levels L is determined such that the coarsest level has a minimum width of 10 pixels. For the limitation factor, we use τ = 0.2. The size of the coarsest level together with this choice of τ leads to a theoretical limit of motion that can be modelled reliably of 2% of the image width.
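To make the parameter rule concrete, the following sketch counts how many pyramid levels fit before the coarsest level would fall below 10 pixels in width, and converts the 2% motion limit into pixels. The per-level scale factor of 0.5 is an assumption made purely for illustration (the scale factor is not stated here), and the names `pyramid_levels` and `motion_limit_pixels` are hypothetical.

```python
def pyramid_levels(image_width, min_width=10, scale=0.5):
    """Number of levels such that the coarsest level is still >= min_width wide.

    Assumption: each level is `scale` times as wide as the previous one.
    """
    if image_width < min_width:
        raise ValueError("image is narrower than the minimum level width")
    levels = 1
    width = float(image_width)
    while width * scale >= min_width:
        width *= scale
        levels += 1
    return levels


def motion_limit_pixels(image_width, fraction=0.02):
    """Theoretical motion limit, taken as a fraction of the image width."""
    return fraction * image_width


for width in (352, 640, 720, 1280):
    print(width, pyramid_levels(width), motion_limit_pixels(width))
```

Under these assumptions a 352-pixel-wide frame yields 6 levels and a 1280-pixel-wide frame yields 8, so higher resolutions add only a few pyramid levels.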
Depending on the size of the GOP, the two input images are either both intra-decoded key-frames or one intra-decoded key-frame and one previously estimated WZ frame, herein called Տ į. In the next step, the phases are extracted from the decomposed steerable pyramids. After the phase difference is calculated, it is adjusted by the shift correction process. Then, to interpolate the next frame, the new phase is estimated from the interpolation parameter, the phase of the previous frame and the newly calculated phase difference. For the new Տ į, a new amplitude is calculated by blending the extracted amplitudes of the input frames according to the interpolation parameter. This new amplitude is combined with the newly calculated phase to reconstruct the interpolated Տ į. The focus of this work is only on fast Տ į generation for high-resolution videos; therefore, the full performance of the codec will be presented in future work.
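The phase and amplitude steps described above can be sketched for a single complex pyramid coefficient. This is a simplified illustration, not the full algorithm of [38]: it omits the pyramid decomposition, the multi-level shift correction and the reconstruction, and it assumes linear blending of the amplitudes. The names `wrap_phase` and `interpolate_coefficient` are hypothetical.

```python
import cmath
import math


def wrap_phase(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def interpolate_coefficient(c1, c2, alpha):
    """Interpolate one complex coefficient between two frames.

    c1, c2: coefficients of the same pyramid position in frames 1 and 2.
    alpha:  interpolation parameter in [0, 1].
    """
    amp1, phase1 = abs(c1), cmath.phase(c1)
    amp2, phase2 = abs(c2), cmath.phase(c2)
    phase_diff = wrap_phase(phase2 - phase1)       # phase difference
    new_phase = phase1 + alpha * phase_diff        # shifted phase
    new_amp = (1.0 - alpha) * amp1 + alpha * amp2  # blended amplitude (assumed linear)
    return cmath.rect(new_amp, new_phase)          # recombine amplitude and phase


c_mid = interpolate_coefficient(cmath.rect(1.0, 0.2), cmath.rect(3.0, 0.6), 0.5)
```

For the two example coefficients, the midpoint coefficient has an amplitude of about 2.0 and a phase of about 0.4 rad, halfway between the inputs in both quantities.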

IV. RESULT EVALUATION AND DISCUSSION
The test clips WashDc in .cif (288x352) and .sif (480x640) formats, Mobile_Calendar in NTSC (486x720) and Old_Town Cross in HD (720x1280) are used for the experiments [39]. Clips with different resolutions were chosen to evaluate the performance of the traditional and the proposed Phase-I Տ į generation approaches across resolutions. The Տ į quality and the computational performance of its generation algorithm are evaluated with the quality metric PSNR and the computational complexity metric execution time, respectively. To evaluate the computational or coding complexity fairly, the data cache size, memory access bandwidth, instruction cache size, storage complexity, execution time, parallelism and pipelining should all be measured [40]. In practice, however, it is difficult to measure all these dimensions [41]. Therefore, the coding time on a computing platform is usually used to measure computational complexity, as it is relatively easy to measure. It not only shows the computational complexity but also partially indicates the effects of other dimensions, such as memory access, on the coding process [41]. Consequently, it is a widely used complexity metric, and for convenience it is also used in this paper.
The number of calculations performed during a specific task defines the computational complexity. The total processor time is directly affected by the number of calculations performed for a task; therefore, the computational complexity is commonly assessed and presented as processing time [42]. The performance of the Տ į generation algorithms is measured and compared for the same video at different high resolutions.
The performance is measured for two GOP sizes, 2 and 4. When GOP=2, the frame sequence is I1 W1 I2 W2 I3 W3 I4, where I1, I2, I3, I4 are intra-encoded and decoded key-frames, and W1, W2, W3 are the Wyner-Ziv frames whose estimated replicas are called Տ į. When GOP=4, the frame sequence is I1 W1 W2 W3 I2, where I1 and I2 are the key-frames, while W1, W2, W3 are the Wyner-Ziv frames. In GOP=4, Տ į2 is estimated first from I1 and I2, then Տ į1 is estimated from I1 and Տ į2, and Տ į3 is estimated from Տ į2 and I2.
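A minimal sketch of how execution time can be collected as a complexity measure is given below. It assumes wall-clock timing with Python's `time.perf_counter` averaged over a few repetitions; the helper names `average_runtime` and `time_ratio` are hypothetical, and the workload in the usage line is a stand-in rather than an actual Տ į generator.

```python
import time


def average_runtime(func, *args, repeats=5):
    """Average wall-clock time in seconds of func(*args) over several runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats


def time_ratio(baseline_seconds, proposed_seconds):
    """How many times longer the baseline takes than the proposed method."""
    return baseline_seconds / proposed_seconds


# Stand-in workload; a real measurement would time the SI generation routine.
elapsed = average_runtime(sum, range(100_000))
```

Averaging over repetitions reduces the influence of scheduling noise, which matters when the timed routines take only a fraction of a second.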
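The estimation order described above can be written as a small recursive schedule. In this sketch, positions are offsets within one GOP (0 is the first key-frame and `gop_size` is the next key-frame); the function name `si_schedule` is hypothetical, and extending the same bisection rule to GOP sizes larger than 4 is an assumption rather than something evaluated here.

```python
def si_schedule(gop_size):
    """Return (target, left_ref, right_ref) triples in estimation order.

    Each target frame is estimated from the two reference frames that
    bracket it, starting with the middle of the GOP.
    """
    if gop_size < 2 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two and at least 2")
    schedule = []

    def split(left, right):
        if right - left < 2:
            return
        mid = (left + right) // 2
        schedule.append((mid, left, right))
        split(left, mid)
        split(mid, right)

    split(0, gop_size)
    return schedule


print(si_schedule(2))  # [(1, 0, 2)]
print(si_schedule(4))  # [(2, 0, 4), (1, 0, 2), (3, 2, 4)]
```

For a GOP of 4, the three triples correspond to estimating the middle frame from the two key-frames, then the first frame from the first key-frame and the middle estimate, and finally the third frame from the middle estimate and the second key-frame.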

A. Computational Complexity Measure
The simulations are carried out on a Core(TM) i7-7820HQ CPU at 2.90 GHz with a 64-bit OS and 32 GB of RAM. The computational complexity of the two Տ į algorithms, the conventional interpolation called motion-compensated temporal interpolation (MCTI) and Phase-I, is presented in Fig. 3 and Fig. 4 for test sequences in the 288x352 and 480x640 formats, respectively. The computation time is measured for a single GOP of size 2 and 4 in different time slots of a test sequence, after every 30 frames. Computationally, Phase-I is much faster than MCTI for videos of all resolutions and performs well for different GOP sizes. This reduces the overall decoding complexity and hence leads to faster decoding of high-resolution videos.

B. PSNR Performance and Discussion
PSNR is one of the standard image quality metrics. It reflects the quality of the estimated frame relative to the actual frame. Fig. 5 presents the PSNR performance as a function of frame number in a given sequence of frames. The comparison is made between the MCTI method and Phase-I for a GOP size of 2 (only the PSNR of the estimated Տ į), together with the PSNR of the relevant intra-decoded frames, for both .cif and .sif formats.
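For reference, PSNR between an actual frame and its estimate can be computed as in the following sketch. It uses the standard definition, 10·log10(peak² / MSE), and assumes 8-bit greyscale frames stored as nested lists with a peak value of 255; the function name `psnr` is chosen for illustration.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    if not squared_errors:
        raise ValueError("frames must not be empty")
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


actual = [[100, 100], [100, 100]]
estimated = [[101, 99], [101, 99]]  # every pixel is off by one grey level
quality_db = psnr(actual, estimated)
```

With every pixel off by exactly one grey level the mean squared error is 1, which gives roughly 48.13 dB.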
The simulation results for the .cif format show an average PSNR of 35.5 dB and 31.2 dB for MCTI and Phase-I, respectively. The results for the .sif format show an average PSNR of 35 dB and 31 dB for MCTI and Phase-I, respectively. The performance of the proposed approach is approximately 4-4.3 dB poorer than that of MCTI for both formats. Although Phase-I lags behind MCTI by about 4.2 dB, it delivers consistent performance throughout the sequence. The quality of the proposed method is about 11.26% lower than that of MCTI. However, a PSNR of 30 dB is considered the minimum acceptable quality for human vision [43].
Fig. 6 presents the PSNR performance of MCTI and Phase-I for a GOP size of 4 (only the PSNR of the three estimated Տ į of each GOP), together with the PSNR of the relevant intra-decoded frames, for both .cif and .sif formats. The simulation results for the .cif format show an average PSNR of 33.24 dB and 27.95 dB for MCTI and Phase-I, respectively. The results for the .sif format show an average PSNR of 32.34 dB and 30.8 dB for MCTI and Phase-I, respectively. MCTI achieved an average PSNR 4.29 dB better than Phase-I, but its performance was inconsistent across GOPs. MCTI performance degraded for the .sif format, whereas Phase-I performance improved. Therefore, Phase-I has great potential to deliver better results for high-resolution videos with a large GOP size.
The computational complexity and quality performance of both Տ į algorithms are presented in Table I. They are reported as the average time and the average peak signal-to-noise ratio (PSNR) of the Տ į of the GOP, respectively.
Simulation tests were also carried out on several other videos of different formats and motion characteristics, and a few of them are also listed in Table I. The simulation conditions were varied frequently to analyse the visual quality and computational complexity of the proposed approach. The visual quality of the Phase-I based Տ į depends on the quality of the intra-decoded frames. With intra-decoded frames of high visual quality, the Phase-I algorithm generates better Տ į, and vice versa. With one high-quality and one low-quality intra-decoded frame, the Տ į quality degrades towards that of the low-quality frame.
In the current scenario of GOP 4, the reference frames change according to the methodology described, and the visual quality of every estimated Տ į frame varies accordingly. However, an attempt was also made to generate more than one Տ į frame while keeping the same reference frames; this generated Տ į frames of slightly lower but almost consistent visual quality in minimal time. In that approach, the computation time was 2-2.5 times less than with the currently implemented strategy, at the cost of a small degradation in the visual quality of all Տ į frames. This opens the door to low-delay or real-time Տ į generation with a large GOP size. However, a study is required to analyse its effect on the transmission rate in the channel decoding step, because further correction of the Տ į would be needed to achieve consistently high visual quality in the video.
The visual performance depends slightly on some parameters of the Phase-I algorithm, such as the decomposition level of the steerable pyramids and the phase extraction, phase shift correction and amplitude calculation steps, when only one Տ į is computed. However, the computation time rarely changed when these parameters were changed.
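The alternative of reusing the same two reference frames for every intermediate Տ į amounts to calling the interpolator with several interpolation parameters. The sketch below lists those parameters under the assumption that the intermediate frames are evenly spaced in time between the two key-frames; this spacing and the function name `interpolation_positions` are illustrative assumptions.

```python
def interpolation_positions(gop_size):
    """Interpolation parameters for all intermediate frames of one GOP.

    Assumption: frames are evenly spaced, so the i-th intermediate frame
    sits at i / gop_size between the two key-frames.
    """
    if gop_size < 2:
        raise ValueError("gop_size must be at least 2")
    return [i / gop_size for i in range(1, gop_size)]


print(interpolation_positions(4))  # [0.25, 0.5, 0.75]
```

Because the two key-frames are the only inputs, their pyramid decompositions could in principle be computed once and shared across all parameters, which is consistent with the shorter computation time reported for this strategy.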

V. FUTURE DIRECTION
In the current Phase-I algorithm, the visual quality changes with the quality of the reference frames (the frames that are used for Տ į estimation). In particular, the performance of the algorithm degrades when one of the reference frames is of low quality. Therefore, future work can focus on designing the algorithm so that it relies on the high-quality reference frame and produces Տ į of consistent visual quality in either condition. Obtaining a consistently high-quality Տ į will also help to reduce the number of bits (bit rate) required for Տ į correction in the channel decoding step. This reduces the transmission rate for both low and high-resolution video.
With a large GOP size, the visual quality remains almost the same for all estimated Տ į when the same reference frames are used to estimate all the intermediate Տ į frames. This method of generating the intermediate Տ į frames makes Phase-I computationally very effective, but it lags behind in visual performance: the visual quality remains a little lower than with the other adopted approach. Therefore, the algorithm should be designed so that it estimates more than one high-quality intermediate frame at once in a very short time. If this low-complexity method can estimate consistently high-quality Տ į, it will be effective in achieving low-delay or real-time DVC decoding for both low and high-resolution videos. Moreover, it would also reduce the transmission rate for both low and high-resolution videos with a large GOP size.
The visual performance depends slightly on some parameters of the Phase-I algorithm, such as the decomposition level of the steerable pyramids and the phase extraction, phase shift correction and amplitude calculation steps, when only one Տ į is computed. These parameters have a stronger effect when the two reference frames are far apart, while the computation time is rarely affected by changing them. Designing an adaptive Phase-I that selects these parameters automatically would be a productive step towards generating a consistently high-quality Տ į in a short computation time, and it would allow the transmission rate of video applications to be controlled. Along with high-quality Տ į estimation in a short time, the proposed residual frame calculation at the encoder further reduces the transmission rate and improves the coding efficiency of the codec.

VI. CONCLUSION
The DVC decoder faces high computational complexity when estimating the replica of the WZ frame, known as side information (Տ į), because of the prediction process involved. The traditional Տ į generation algorithms introduce high computational complexity into the decoding process because of the complex and composite prediction process, and they take a long time even for low-resolution video. However, the emergence of high-resolution video sensors demands a high-speed DVC decoder with faster Տ į generation algorithms. This research work proposed a DVC model with the phase interpolation (Phase-I) algorithm for Տ į estimation. It computes the pixel-wise phase modification without any explicit correspondence estimation and estimates the motion of each pixel by shifting its phase. In addition, the phase shift correction feature combines the phase information across the levels of a multi-scale pyramid in very little time. Therefore, deploying it in the DVC decoder decreases the Տ į generation complexity and the overall decoding time and complexity, even for high-resolution videos. It works efficiently, and even better for high-resolution videos with a large GOP. It exhibits low computational complexity for both low and high-resolution videos. Moreover, it delivers significant computational efficiency for different GOP sizes at the cost of some degradation in the quality of the estimated Տ į. For high-resolution video with a GOP size of 4, the results were notable because the performance of the traditional algorithm drops rapidly, whereas Phase-I remains stable.