Design and Implementation of Deep Depth Decision Algorithm for Complexity Reduction in High Efficiency Video Coding (HEVC)

High Efficiency Video Coding (HEVC) made its mark as a codec that compresses video at a lower bit rate than its predecessor, H.264, but the factor that keeps HEVC out of many applications is its complex encoding procedure. The rate-distortion optimisation (RDO) cost calculation in HEVC consumes most of these complex calculations. In this paper, we propose a method to address this issue by replacing the traditional inter-prediction procedure of brute-force search for RDO with a deep convolutional neural network that predicts and performs this process. In the first step, the deep depth decision algorithm is modelled with optimum specifications using a convolutional neural network (CNN). In the next step, the model is designed, trained on the dataset and validated. The trained model is then tested by pipelining it with the original HEVC encoder to check its performance. We also evaluate the efficiency of the model by comparing the average encoding time for video inputs of various resolutions. Testing is done with inputs that are mutually independent of the training data to maintain the accuracy of the evaluation. The system shows a substantial saving in encoding time, which demonstrates the complexity reduction in HEVC.

Keywords—CNN; HEVC; deep learning; RDO; encoding time; complexity reduction


I. INTRODUCTION
Video compression is an area to explore, considering the flourishing of video acquisition devices, social media, live video transfer, etc. The High Efficiency Video Coding (HEVC) system achieves better compression than its predecessor, Advanced Video Coding (AVC). However, the computational complexity of HEVC is a matter of discussion because of its rate-distortion optimisation (RDO) [1] cost calculation for the coding tree unit (CTU) [2]. This computational complexity in HEVC is therefore a matter of research interest, and the focus here is to reduce the computational complexity [3] while retaining efficiency. Before going into the details, let us review the evolution of HEVC, its drawbacks and its strengths compared to its ancestral systems.

The ITU-T introduced the H.261 video compression standard in November 1988. H.261 was the first member of the H.26x family of video coding standards developed in the domain of the VCEG (ITU-T Video Coding Experts Group), then known as the "Specialists Group on Coding for Visual Telephony" [4]. "H.261 was originally designed to transmit data over ISDN lines with data rates as multiples of 64 Kbit/s. The coding algorithm was designed to work at video bit rates from 40 Kbit/s to 2 Mbit/s" [5]. MPEG-2 uses "three different kinds of coded frames: I-frames (intra-coded frames), P-frames (predictive-coded frames), and B-frames (bidirectionally-predictive-coded frames)" [3]. The I-frame is a separately compressed version of a single raw frame. "I-frame coding takes advantage of spatial redundancy and of the persistence of vision of the human eye, i.e. the inability of the human eye to detect certain changes in the image. I-frames do not depend on data in the previous or the next frames" [6], unlike P-frames and B-frames, and because of that their coding resembles still-image compression. The raw frame is split into 8×8-pixel blocks. The data in each block is transformed using the discrete cosine transform (DCT) [6], and the result is an 8×8 matrix of real-valued coefficients. "The DCT converts the spatial domain into the frequency domain, but it does not change the information in the block; if the DCT is calculated with perfect precision, the original block can be recovered exactly by applying the inverse discrete cosine transform" [5]. "H.263 [7] is a popular video compression standard for low-bit-rate compressed formats focusing on videoconferencing. It was standardized by the ITU-T Video Coding Experts Group (VCEG) in 1995/1996" [4]. H.263 is also a member of the ITU-T H.26x family of video coding standards and, like the other H.26x standards, it is based on DCT video compression. It was later extended with additional enhanced features in 1998 and 2000. "H.264 is one of the most widely used codecs on the planet, with a significant presence in optical disc, broadcast and video streaming markets. The applications are noted in Table I. Still, many uses of H.264 are subject to royalties, something that should be taken into consideration alongside Google's WebM, as well as the general availability of decoding capabilities on target platforms and devices" [8]. H.264, mostly called AVC (Advanced Video Coding), is block-segmentation based and motion compensated, with a DCT technique. The aim behind AVC was to transfer video at low bit rates with better efficiency, and it was later adapted for UHD video. "High Efficiency Video Coding, also known as HEVC or H.265, is the next step in this evolution. It builds on many of the techniques used in AVC/H.264 to make video compression even more efficient. When AVC looks at multiple frames for changes, the macroblock chunks can be a few different shapes and sizes, up to a maximum of 16 pixels by 16 pixels. With HEVC, those chunks can be up to 64×64 in size, much larger than 16×16, which means the algorithm has to remember fewer chunks, thus decreasing the size of the overall video" [9]. HEVC's quad-tree [10] partitioning uses a brute-force search for the RDO (rate-distortion optimisation) cost calculation. The complexity of this procedure is high when implemented with conventional signal processing steps, and this is what makes HEVC [11] complex. Fig. 1(a) shows the procedure of rate-distortion optimisation as a flowchart.
The procedure is divided into a check step and a comparison step. It initially checks the rate-distortion cost of the parent CTU [12] and the total cost of splitting it down to the final level. Once this is done, the comparison is performed: the RD [11] cost of the parent is compared with the cost after splitting. If the RD cost after the split is higher, the system will not split further, and if the RD cost of the parent is higher, it proceeds with the split. This calculation procedure in HEVC is tedious and makes the system complex. The issue has been addressed by many algorithms; some provide enhancements to the existing HEVC system, while others propose entirely new algorithms [3] with new architectures [13] to perform the compression procedure. Deep-learning-based algorithms [13][14] have started to address this in recent years. Therefore, a depth decision algorithm with a deep CNN [15][16] is modelled here to solve this issue. Fig. 1(b) shows the level and depth of the CTU. Understanding this depth concept [6] helps in designing a deep CNN [13] algorithm to predict the depth and thus make intra prediction less complex.
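The check-and-compare procedure of Fig. 1(a) can be summarised as a recursive split decision. The following is a minimal Python sketch of that decision; rd_cost(block, depth) is a hypothetical helper standing in for the encoder's RDO cost evaluation, so this is an illustration of the flowchart rather than the HM implementation.

```python
# Minimal sketch of the recursive RDO split decision of Fig. 1(a).
# rd_cost(block, depth) is a hypothetical helper returning the RDO cost
# of encoding `block` without further splitting at the given depth.

MAX_DEPTH = 3  # CTU depths 0..3 (64x64 down to 8x8)

def split_into_four(block):
    """Split a square block into its four equally sized sub-blocks."""
    h, w = block.shape
    return [block[:h // 2, :w // 2], block[:h // 2, w // 2:],
            block[h // 2:, :w // 2], block[h // 2:, w // 2:]]

def decide_split(block, depth, rd_cost):
    """Return (best_cost, chosen_depth) for one CTU/CU."""
    parent_cost = rd_cost(block, depth)      # check step: cost without split
    if depth == MAX_DEPTH:
        return parent_cost, depth
    # check step: total cost when the block is split into four children
    children = split_into_four(block)
    split_cost = sum(decide_split(c, depth + 1, rd_cost)[0] for c in children)
    # comparison step: keep the cheaper of the two alternatives
    if split_cost >= parent_cost:
        return parent_cost, depth            # do not split further
    return split_cost, depth + 1             # proceed with the split
```

The exhaustive character of this recursion, evaluated for every CTU, is the source of the complexity that the proposed depth decision network is meant to remove.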
The paper aims at complexity reduction in video compression (HEVC) by reducing the encoding time. This is achieved by designing a deep-learning-based system that predicts the depth of the CTU, making the intra-prediction procedure less complex. The design is evaluated by pipelining it with the original HEVC encoder and measuring the complexity of the resulting system. The overall design idea is shown in Fig. 2. The paper is divided mainly into two halves: 1) design of the deep depth decision algorithm, where the algorithm is designed, trained and validated on the dataset, and 2) evaluation and experimental results of the model pipelined with the original HEVC, where the model is pipelined with the original HEVC encoder and its performance is evaluated for videos of various resolutions. The paper concludes with the results showing the encoding-time reduction and with the future scope.

II. DESIGN OF DEEP CNN DEPTH DECISION ALGORITHM FOR INTER PREDICTION
The inter prediction and its computational complexity was the issue taken up for analysis to model a new network. The design of this network should have lower computational complexity than the existing system and should be compatible with the existing codec. The chosen design should also be suitable for fast transmission of frames while coding, so scalability and compatibility are kept in focus during the design. Considering all this, a convolutional neural network (CNN) is chosen for this purpose, so that all features are extracted correctly from the frames to produce a better prediction, as shown in Fig. 3. The CNN [17] used here has multiple layers; the initial layer is the input layer, and the input used here is video frames. The video frames can have various properties; YUV is the format chosen for this evaluation, and other formats are also compatible with this model. The next layer in the CNN model is the convolutional layer, which extracts the features of the frame based on the kernel used. The kernel size can be chosen according to the features that need to be extracted: a large kernel collects global features or information from the frame, whereas a small kernel [12] extracts local features. In the design, 5×5 and 3×3 kernels are used along with 16×16 and 4×4 kernels [18], so the model extracts both global and local features from the frame. To cover all the inputs, zero padding is used in this model. The stride used is the same as the width of the kernel in each case. After feature extraction, the output is max pooled to reduce its size and converge multiple values into a single value or fewer values. The activation function helps decide whether a neuron fires or not, so activation functions are placed between the layers and at the end of the neural network. Here the activation function [19] used is the ReLU (rectified linear unit). ReLU keeps the output in the range (0, ∞) by suppressing negative values: it is a simple function that returns zero if the input is negative and returns the input unchanged otherwise. Both forward and backward propagation exist in the CNN [20] network. In this model, training uses backward propagation while validation uses forward propagation.
The designed model takes as input a YUV CTU of 64×64; the first convolution layer uses its kernel to convert it into 32×32 coding units. The two 32×32 CUs are concatenated to extract more features, and the result is pooled to a 16×16 patch. The next stage of convolution extracts fine features with a 3×3 kernel and global features with a 4×4 kernel. After feature extraction in each stage, the data are pooled by 2×2. In the final stage, the fully connected layers flatten the information and compress it, using SoftMax, from 256 to 64 to a 16-length vector holding the depth information of the CTU. The model is trained with inputs of various resolutions varying from 240p to 4K. After training, the model has a training loss of 3.1049; the loss function used in this model is the cross-entropy. The cross-entropy loss can be evaluated for each image independently and then added together to obtain the final cross-entropy, since the patches are mutually independent. The trained model obtains an accuracy of 66.12%. A sketch of this kind of architecture and its loss computation is given below.
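To make the layer description concrete, the following is a minimal PyTorch sketch of a depth-decision CNN of this kind. The channel counts, the use of parallel 3×3 and 5×5 branches, and the pooling arrangement are illustrative assumptions; only the 256 → 64 → 16 fully connected sizes, the 4 depth classes and the cross-entropy loss follow the description in the text, so this is not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDecisionCNN(nn.Module):
    """Sketch of a CNN mapping a 64x64 luma CTU to a 16-length depth vector.

    Each of the 16 outputs corresponds to one 16x16 sub-block of the CTU and
    is a 4-way classification over depths 0/1/2/3.
    """

    def __init__(self):
        super().__init__()
        # local features with a small kernel, global features with a larger one
        self.conv_local = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv_global = nn.Conv2d(1, 16, kernel_size=5, padding=2)
        self.conv_mix = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)          # 2x2 pooling after each stage
        self.fc1 = nn.Linear(32 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 16 * 4)     # 16 sub-blocks x 4 depth classes

    def forward(self, x):                    # x: (N, 1, 64, 64)
        local = F.relu(self.conv_local(x))
        glob = F.relu(self.conv_global(x))
        x = torch.cat([local, glob], dim=1)  # concatenate feature maps
        x = self.pool(x)                     # 64 -> 32
        x = self.pool(F.relu(self.conv_mix(x)))  # 32 -> 16
        x = self.pool(x)                     # 16 -> 8
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x).view(-1, 4, 16)   # logits: (N, 4 classes, 16 positions)

def depth_loss(logits, target):
    """Cross-entropy over the 16 positions, each treated as an independent
    4-way classification (softmax is folded into F.cross_entropy)."""
    # target: (N, 16) integer depths in {0, 1, 2, 3}
    return F.cross_entropy(logits, target)
```

With this output shape, a per-sub-block depth prediction is read off as `logits.argmax(dim=1)`, which yields an (N, 16) array of depths in {0, 1, 2, 3}.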

A. Dataset
The dataset is the collection of sample video frames used for testing, training and validation of the proposed design. A large and varied dataset helps improve the accuracy of the model. Here the dataset contains coding unit image files extracted from YUV video files as the set of inputs, and their corresponding depths for HEVC intra-prediction as the outputs, to train the proposed system. The dataset chosen here has multiple resolutions and does not consist of videos of the same pattern, in order to maintain the quality and efficiency of the model.
In HEVC intra-prediction, each I-frame is divided into 64×64 CTUs. For each 64×64 CTU, there is a depth prediction represented by a 16×16 matrix. The elements in the matrix are 0, 1, 2 or 3, indicating depth 0/1/2/3 for a 4×4 block in the CTU. The dataset contains images and corresponding labels, organised into three folders: train, validation and test. Each image file may have a different size depending on the resolution of the video, and it is one frame extracted from a video. While using it in the system, the image is split into several 64×64 images, or 32×32 and so on; a sketch of this splitting is given below. In the final stage, for evaluation and comparison, the CPIH dataset is also used to measure the performance of the proposed system against existing models. The CPIH dataset is not used in any of the model testing or training, to keep the quality check unbiased.
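The following is a minimal numpy sketch of how a single frame and its frame-level depth map could be cut into CTU-level training samples. It assumes the label map stores one depth value per 4×4 luma block, so that a 64×64 CTU corresponds to a 16×16 label block; the array layout is an assumption for illustration, not the published dataset format.

```python
import numpy as np

CTU = 64          # CTU width/height in luma samples
LABEL_SCALE = 4   # one depth label per 4x4 block, so a CTU maps to 16x16 labels

def ctu_samples(luma, depth_map):
    """Yield (64x64 luma patch, 16x16 depth label block) pairs for one frame.

    luma:      (H, W) array of Y samples
    depth_map: (H // 4, W // 4) array with values in {0, 1, 2, 3}
    Frames whose size is not a multiple of 64 are simply cropped here.
    """
    h, w = luma.shape
    for y in range(0, h - CTU + 1, CTU):
        for x in range(0, w - CTU + 1, CTU):
            patch = luma[y:y + CTU, x:x + CTU]
            ly, lx = y // LABEL_SCALE, x // LABEL_SCALE
            label = depth_map[ly:ly + CTU // LABEL_SCALE,
                              lx:lx + CTU // LABEL_SCALE]
            yield patch, label
```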

B. Input and Pre-Processing Layer
The input used here is the set of YUV image patches derived from the video frames. Each of these is saved in a folder, with the corresponding labels held in a Python dictionary. The raw inputs are pre-processed by down-sampling and splitting them into 64×64, 32×32 and so on; a sketch of reading the raw frames is given below.
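As an example of this pre-processing step, the following sketch reads the luma plane of each frame from a raw 8-bit YUV 4:2:0 file; the file name and resolution in the usage comment are placeholders, and only the Y plane is kept since the depth labels describe the luma partitioning.

```python
import numpy as np

def read_y_frames(path, width, height):
    """Yield the Y (luma) plane of each frame in an 8-bit YUV 4:2:0 file."""
    y_size = width * height
    frame_size = y_size * 3 // 2          # Y + U/4 + V/4 for 4:2:0 sampling
    with open(path, "rb") as f:
        while True:
            raw = f.read(frame_size)
            if len(raw) < frame_size:
                break
            yield np.frombuffer(raw[:y_size], dtype=np.uint8).reshape(height, width)

# example with placeholder file name and resolution
# for y in read_y_frames("sequence_1920x1080.yuv", 1920, 1080):
#     patches = [y[r:r + 64, c:c + 64]
#                for r in range(0, 1080 - 63, 64)
#                for c in range(0, 1920 - 63, 64)]
```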

C. Convolution Layer
This layer performs the convolution operation between the input and the kernel. If i is the pre-processed input and k is a kernel whose size varies over 5×5, 4×4, 3×3, etc., the convolution block output o can be written as in Equation (1), where * represents the convolution operation:

o = i * k        (1)

The size of the kernel decides the nature of the feature extracted. In the design both global and fine features are extracted with different kernels, as illustrated in the sketch below.
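The following numpy sketch spells out Equation (1) as a direct (valid-region, stride-1) 2D convolution, and illustrates how a small kernel aggregates only a local neighbourhood while a larger kernel covers a wider, more global region. It is written for clarity rather than speed, and the example kernels are arbitrary.

```python
import numpy as np

def conv2d(i, k):
    """Direct 2D convolution o = i * k (valid region, stride 1)."""
    kh, kw = k.shape
    k_flipped = k[::-1, ::-1]             # convolution flips the kernel
    out_h = i.shape[0] - kh + 1
    out_w = i.shape[1] - kw + 1
    o = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            o[m, n] = np.sum(i[m:m + kh, n:n + kw] * k_flipped)
    return o

# a 3x3 kernel reacts to local detail; a 5x5 kernel pools a wider context
patch = np.random.rand(64, 64)
local_features = conv2d(patch, np.ones((3, 3)) / 9.0)
global_features = conv2d(patch, np.ones((5, 5)) / 25.0)
```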

D. Fully Connected Layer
The fully connected layer first flattens the output of the convolution layers into a large one-dimensional vector. The SoftMax operation helps compress it further to the required size without losing the information in it. FC1, FC2 and FC3, along with averaging, shrink it to a 16-length vector, going from 256 to 64 to 16.

E. Other Layers
The system tends to lose features as the stages progress, so the ReLU [21][22] (rectified linear unit) activation function is used, shown in Equation (2), where z is the input and R(z) is the output of the ReLU. It should be noted that the final output is activated by the sigmoid function, represented by S(z) in Equation (3).

R(z) = max(0, z)        (2)

S(z) = 1 / (1 + e^(-z))        (3)
In the original HEVC, the prediction process is complex and time consuming since it must compute the RDO cost, so here the CNN network [23], [24] with depth decision [25][26] predicts the depth of each 64×64 patch as a 16-length vector, whereas the original HEVC needs a matrix of size 64×64 to store it. The model converts the input patch into a 16-length vector that captures the characteristics of that CTU with depth information 0/1/2/3. The model is designed to take a 64×64 input, but during processing it is split into 32×32 blocks; predicting directly for the whole 64×64 patch is not meaningful, so the actual input is 32×32. The depth is 0 when the patch is not split and is encoded as it is. The 64×64 patch is represented by the 16-length vector.
So, for representing a 32×32 block, a 4×4 vector is required, and thus four 4×4 blocks are available at the output for a CTU of 64×64. Each value in the vector indicates the depth of the CU: if the first value is 0, it says that it is a 64×64 patch, and if the depth is 1 it means the 64×64 CTU is split once into four 32×32 CUs, and so on. A sketch of how such a depth vector is interpreted is given below.
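To make this interpretation concrete, the following sketch decodes a 16-length depth vector, read as a 4×4 grid over the 64×64 CTU, into the CU sizes the encoder would use. The exact mapping between vector positions and sub-blocks is an assumption for illustration, not the encoder's own syntax.

```python
import numpy as np

CU_SIZE = {0: 64, 1: 32, 2: 16, 3: 8}   # depth -> CU width/height in samples

def describe_ctu(depth_vector):
    """Print the CU size implied by each entry of a 16-length depth vector.

    The vector is read as a 4x4 grid, one entry per 16x16 region of the CTU.
    """
    grid = np.asarray(depth_vector, dtype=int).reshape(4, 4)
    for r in range(4):
        for c in range(4):
            d = grid[r, c]
            print(f"region ({r},{c}): depth {d} -> {CU_SIZE[d]}x{CU_SIZE[d]} CU")

# depth 0 everywhere means the CTU is encoded as one 64x64 CU;
# depth 1 in a quadrant means that quadrant belongs to a 32x32 CU, and so on.
describe_ctu([0] * 16)
```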

III. EVALUATION AND EXPERIMENTAL RESULT ANALYSIS OF DEEP CNN DEPTH DECISION ALGORITHM FOR INTER PREDICTION
The designed model is made to work with the HEVC codec as shown in Fig. 4. To simulate the original HEVC, the HM reference software is used. The evaluation compares the original HEVC with the proposed model for intra-prediction, pipelined to HEVC, using the CPIH dataset. Integrating neural network models into the HEVC encoder allows testing the complexity reduction achieved by the deep-learning-based method for HEVC intra-prediction. Using neural networks, the system can directly predict the coding unit (CU) depths for each frame. The intention is to speed up the encoding process of the HEVC encoder. Thus, once the model is trained, the remaining task is to integrate the deep-learning prediction process into the HEVC encoder. The encoding times with and without pipelining the deep CNN network are shown in Table I for some sample inputs. The inputs chosen for the test are mutually independent of the training set to maintain the accuracy of the evaluation, and a wide range of resolutions is considered to check the performance of the system for video frames of different resolutions. The results clearly show a reduction in encoding time and bit rate in each case, so the method supports the encoder with better performance. The computational complexity of the original HEVC is high due to the RDO cost calculation; the experimental results show that the encoding time drops drastically for the proposed method. The PSNR curve is slightly lower here compared to the original model, but the system performance is not affected by this. The whole process is run in a Python environment; when it finishes, it prints all the information on the command line, such as the encoding time, YUV-PSNR and so on. A sample output is shown in Fig. 5 and the comparison graphs are shown in Fig. 6. The reduction in complexity is evidenced by the change in encoding time, ∆T, shown in Fig. 7, which is computed as sketched below.
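One way to express the encoding-time saving ∆T is as the relative difference between the anchor HM encoding time and the time of the pipelined encoder; the source does not spell out the formula, so this definition is an assumption. The timing wrapper below uses a hypothetical encode() callable standing in for one encoder run.

```python
import time

def timed_encode(encode, *args, **kwargs):
    """Run a hypothetical encode() callable and return its wall-clock time in seconds."""
    start = time.perf_counter()
    encode(*args, **kwargs)
    return time.perf_counter() - start

def delta_t(t_original, t_proposed):
    """Relative encoding-time saving in percent; positive means the proposed encoder is faster."""
    return (t_original - t_proposed) / t_original * 100.0

# example with placeholder timings (seconds)
print(f"Delta T = {delta_t(182.4, 97.6):.1f}%")
```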

IV. CONCLUSION
In this paper, a deep-learning-based prediction is proposed to avoid the computational complexity issue in HEVC; it predicts the depth of the CTU as a 16-length vector rather than calculating the RDO cost by the traditional signal processing method. The modelling adopts a CNN with deep layers to predict the depth. The dataset used for training was YUV based, and the system is tested on the CPIH dataset to maintain the accuracy of the evaluation and to avoid transfer or copied learning. The trained model is converted into a system and pipelined to the original HEVC to check the performance. The system evaluated the encoding time with and without pipelining and calculated ∆T. The results and simulation clearly show that the design suits HEVC, letting it work with a lower encoding time and thereby reducing the complexity of HEVC. Future enhancement can focus on extending this to inter prediction, which would improve HEVC further.