Wyner-Ziv Video Coding Using Hadamard Transform and Deep Learning

Abstract: Predictive schemes are the current standard in video coding. Unfortunately, they are poorly suited to lightweight devices such as mobile phones: the high encoding complexity is the bottleneck for the Quality of Experience (QoE) of a video conversation between mobile phones. A considerable amount of research has been conducted to tackle that bottleneck. Most schemes use the so-called Wyner-Ziv video coding paradigm, with results still not comparable to those of predictive coding. This paper presents a novel approach to Wyner-Ziv video compression, based on reinforcement learning and the Hadamard transform. Our scheme shows very promising results.


I. INTRODUCTION
Video compression schemes such as MPEG4 and H.264 [1] are the current state of the art; they exploit the correlation between frames at the encoder side. Such schemes usually achieve high compression with a fairly low-complexity decoder, at the expense of a high-complexity encoder. Compression schemes like MPEG4 or H.264 are therefore suitable for scenarios where the encoder has enough computational power, such as video-on-demand servers.
Mobile phones are today the de facto device for communication. People want to do more and more with their mobile phones, including real-time video communication with an experience comparable to that of computers. Unfortunately, current video compression technologies [1] barely permit it: encoder complexity is the bottleneck. Either a comparable frame rate cannot be achieved, or the conversation cannot last long because battery power is scarce. In either case the quality of experience drops.
The video community is aware of this issue, and a great deal of research has been conducted since the emergence of camera-based mobile devices. The common approach to tackling the issue is so-called Wyner-Ziv Video Coding (WZVC), also known as distributed video coding.
WZVC is a consequence of information-theoretic bounds established in the 1970s, first by Slepian and Wolf for distributed lossless coding [2], and then by Wyner and Ziv for lossy coding with decoder side information [3].
Let $\{(X_k, Y_k)\}_{k=1}^{\infty}$ be a sequence of independent drawings of a pair of dependent variables (X, Y) taking values in the finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. The decoder has access to the side information Y. An illustration is shown in figure 1. Wyner and Ziv suggest that, whether or not the side information Y is available at the encoder, X can be compressed to Z and decoded to $\hat{X}$ at a rate $R_{X|Y}(D)$, where $D = E[d(X, \hat{X})]$ is an acceptable distortion.
In WZVC, unlike predictive coding paradigms (e.g., H.264), individual frames are encoded separately but decoded conditionally. According to [3], the compression effectiveness of WZVC schemes should be comparable to that of predictive coding. A typical WZVC setup is shown in figure 2, where both terminals are lightweight modern mobile phones capable of decoding, for example, MPEG4 frames. The corresponding Wyner-Ziv decoder is assumed to be a powerful computer capable of exploiting the statistics between frames, using much more complex algorithms, and of outputting MPEG4 streams in real time.
Most studies in the area of WZVC have used binary codes. Major contributions come from Stanford University [4] and UC Berkeley [5]. Both methods followed a common pattern: they were first developed to operate in the pixel domain and later in the transform domain (namely, the Discrete Cosine Transform, DCT). Those methods suffered from three major drawbacks:
• The overhead of working in the binary domain: pixels or, alternatively, DCT coefficients have to be converted back and forth to and from bit planes during the decoding process.
• Rate control: all pixel values or, alternatively, transform coefficients need to be converted to binary with the same number of bits, making rate control difficult.
• The decoding algorithms used, whether generative or discriminative, were somewhat too simplistic and did not work well in practice.
We propose a new practical compression scheme based on the Hadamard transform and reinforcement learning. In contrast to previous work, our method deals with non-binary codes. The encoding is of relatively low complexity, with inherent rate control. We also show that our algorithm outperforms the state of the art [4], [5] and is comparable to predictive coding schemes.

II. LOW COMPLEXITY VIDEO ENCODING
The encoding challenge is to implement an encoder with lower complexity than that of predictive video coding methods [1] while still achieving comparable codec effectiveness.
H(X|Y) is the entropy, that is, the amount of information (in bits) needed to represent the frame X conditioned on Y. Usually, Y and X are highly correlated after motion compensation [6], so H(X|Y) < H(X). In practice, the challenge is to encode X at an even lower rate R < H(X|Y), which leads to lossy compression, and still achieve reconstruction with satisfactory fidelity, as suggested by Wyner and Ziv [3].

B. The encoder
Let $Z^*$ be the compressed frame, or Wyner-Ziv frame, obtained from X. In our case, compression with compression ratio n : m is achieved simply by projecting the row version of frame $X \in \mathbb{R}^{1 \times n}$ onto the first m dimensions of the orthogonal Hadamard vector basis $G \in \mathbb{R}^{n \times m}$. The quantity $\sigma^2$ is the variance between the current frame X and the previous frame $X_{-1}$; it is given by (1).
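The projection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `hadamard` and `encode_block` are hypothetical, the Hadamard matrix is built with the standard Sylvester recursion and scaled to be orthonormal, and the variance is taken to be the mean-squared difference between the current and previous blocks (an assumption).

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        # Sylvester recursion: H_2k = [[H_k, H_k], [H_k, -H_k]]
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # scaling makes the columns orthonormal

def encode_block(x: np.ndarray, x_prev: np.ndarray, m: int):
    """Project a flattened block x onto the first m Hadamard basis vectors.

    Returns the m coefficients z and the variance information sigma2,
    assumed here to be the mean-squared difference to the previous block.
    """
    x = np.asarray(x, dtype=float)
    x_prev = np.asarray(x_prev, dtype=float)
    g = hadamard(x.size)[:, :m]          # G in R^{n x m}
    z = x @ g                            # Z in R^{1 x m}
    sigma2 = float(np.mean((x - x_prev) ** 2))
    return z, sigma2
```

The encoder cost is one matrix-vector product per block, with no motion search, which is the property the scheme relies on for low encoding complexity.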

C. The Hadamard Transform
The Hadamard transform is an orthogonal transform that has been used in numerous image coding applications [7], [8]. The transform matrix of dimension $2^k$, for $k \in \mathbb{N}$, is defined recursively.
The aim of the decoder is to reconstruct X from the encoded frame Z. At time t, the decoder has knowledge of the incoming Wyner-Ziv frame Z and the previously decoded frame Y.
To be able to reconstruct X, Z has somehow to contain enough information about X, possibly together with Y. That is the case in predictive coding schemes, where Z carries information about Y, usually motion vectors (the pixel configuration, for that matter) and Huffman- or arithmetic-coded residuals of the motion-compensated frames [1]. Since we do not know the real pixel configuration X, we are left to "guess" it out of the pixel space Λ. Mathematically, the problem can be formulated as follows: at time t, the decoder, observing Z and Y, aims to find the "best" pixel configuration $\hat{X}$. The term "best" reflects the fact that X is unobserved. This formulation leads to a maximum likelihood problem.

B. Maximum Likelihood
Decoding X by simply computing $\hat{X} = Z \times G^{-1}$ is obviously not optimal, but it is worth explaining why at this point. As X grows in size, the correlation among its pixels decreases. Consequently, the coefficients in Z do not explain X well. Estimating X thus becomes equivalent to solving an under-determined system of equations, with fewer equations than unknowns.
To find the "best" estimate of X we have to design an optimal Maximum Likelihood Estimator (MLE), that is, an estimator that "captures" as much decoding information in Y and Z as possible and is capable of estimating the joint likelihood of these quantities under a generative model Θ. The Hidden Markov Model (HMM) [9] and reinforcement learning models (such as Q-learning) [10] are two good candidates for such a problem. However, modelling such an MLE problem with either HMM or Q-learning could be rather complex due to the dimensionality of the tuples. Fortunately, Q-learning has a variant, using function approximators and experience replay [11], that has been shown to deal well with high dimensions.
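The under-determination can be seen in a small numerical sketch. It assumes an orthonormal Sylvester-type Hadamard basis and interprets $G^{-1}$ as the pseudo-inverse of the non-square G, which for orthonormal columns is simply its transpose; the names `hadamard` and `naive_decode` are hypothetical.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (n must be a power of 2)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def naive_decode(z: np.ndarray, n: int) -> np.ndarray:
    """Minimum-norm reconstruction from the first len(z) coefficients."""
    g = hadamard(n)[:, : z.size]
    return z @ g.T  # pseudo-inverse of G (assumption: G has orthonormal columns)

n, m = 16, 3
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=n).astype(float)  # stand-in pixel block
g = hadamard(n)[:, :m]
z = x @ g                   # m equations ...
x_hat = naive_decode(z, n)  # ... for n unknowns
```

Here `x_hat` reproduces the transmitted coefficients exactly (`x_hat @ g` equals `z`), yet it differs from `x`: many blocks share the same m coefficients, so the decoder needs the side information Y to choose among them.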

C. Q-learning
Q-learning is a reinforcement learning technique. Generally speaking, the learning model tries to learn the optimal so-called action-selection policy. We are given an agent, a set of states S and a set of actions per state A. At time t, the agent receives a reward $r_t$ for executing an action $a_t$ while in state $s_t$. The goal of the agent is to maximize its total reward by learning the optimal action for each state, that is, the action with the highest cumulative discounted long-term reward Q(a, s), starting from the current state. During the learning process, the value Q(a, s) is updated according to equation (4).
Due to the high dimensionality, equation (4) cannot be applied directly to our ML problem. Instead we use a variant called Q-learning with experience replay [11].
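For reference, the textbook tabular Q-learning update can be sketched as below. This is the standard form of the rule and is presumably what the update referred to in the text looks like; the learning rate `alpha`, the discount factor `gamma` and the function name `q_update` are assumptions introduced for illustration.

```python
import numpy as np

def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, applied in place.

    q has shape (n_states, n_actions). The update is
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    """
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
    return q
```

A table of this kind needs one entry per state-action pair, which is why it does not scale to pixel configurations and a function approximator is used instead.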
In Q-learning with experience replay, the agent's state-action pairs are stored in a data set and re-sampled, according to some significance criterion, in later episodes, together with other, usually randomly selected, state-action pairs from the data set. Our Q-learning scheme follows the same intuition and motivation, albeit with a somewhat different design. The following setup is adopted:
• The output of our Q-learning scheme, the Q-values, is two-dimensional, as opposed to other schemes [11], whose output is one-dimensional.
• An episode ends after n iterations or at p dB of PSNR, whichever comes first; for example, n = 20 or p = 45.
• The reward is delayed, i.e., it is given only when the episode ends.
Each column of the Q-values corresponds to a probability distribution, assumed to be Gaussian, over the co-located pixel candidates of X; at every position X[i] in X, we consider 2σ pixel candidates. Our Q-learning design is illustrated in figure 3.
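The episode termination rule can be sketched as follows. The PSNR definition is the usual one for 8-bit images; the function names and the peak value of 255 are assumptions, and the check presumes the true block x is available, as it is during training.

```python
import numpy as np

def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB between a block and its estimate."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def episode_done(iteration, x, x_hat, n=20, p=45.0):
    """True once n iterations are reached or the PSNR is at least p dB."""
    return iteration >= n or psnr(x, x_hat) >= p
```

With the example values n = 20 and p = 45, an estimate that is off by one intensity level everywhere (about 48.13 dB) ends the episode immediately.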

D. Maximum Likelihood through Experience Replay
Recall that the idea behind using Q-learning is to fit a likelihood function. We aim to find out how to capture information from the previously reconstructed frame and the encoded stream (that is, the syndrome and the mean-squared error) so that decoding is as effective as possible. A maximum likelihood function permits us to estimate the (degree of) truthfulness of a pixel combination, possibly with some confidence interval.
Even though the likelihood could be measured for every pixel configuration of X in Λ, estimating the best configuration remains intractable because X is of high dimension. We use Expectation-Maximization (EM) [9] to estimate the maximum likelihood. In our setup, the E-step corresponds to the estimation of the Q-values, while the M-step chooses the pixel combination that maximizes their probabilities:
1) E-step: The Q-values derived at each iteration are used to perform a probability update of the side information, where i indexes the ith side information and j the jth distribution of the ith side information.
2) M-step: The best pixel combination with respect to the most recent probability distribution update is selected.
The learning process is depicted in figure 4.
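One possible reading of the two steps is sketched below. The sketch makes several assumptions that are not taken from the text: the 2σ candidates at position i are the side-information value y[i] plus integer offsets from -σ to σ-1, the probability update is a column-wise softmax of the Q-values (a stand-in for the Gaussian model), and the names `candidates`, `e_step` and `m_step` are hypothetical.

```python
import numpy as np

def candidates(y, sigma):
    """Candidate pixel values, shape (2*sigma, len(y)); offsets are assumed."""
    offsets = np.arange(-sigma, sigma)
    return np.clip(np.asarray(y)[None, :] + offsets[:, None], 0, 255)

def e_step(q_values):
    """Turn each column of Q-values into a probability distribution.

    A softmax is used here as an assumption, not as the authors' update.
    """
    shifted = q_values - q_values.max(axis=0, keepdims=True)
    w = np.exp(shifted)
    return w / w.sum(axis=0, keepdims=True)

def m_step(probs, cand):
    """Pick, at every position, the candidate with the highest probability."""
    best = probs.argmax(axis=0)
    return cand[best, np.arange(cand.shape[1])]
```

Because the maximization is done independently per column, the M-step costs one argmax per pixel position rather than a search over all configurations in Λ.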

IV. EXPERIMENT RESULTS AND DISCUSSION
As mentioned in section III-C, the size of the Function Approximator's output is related to the variance information. A side information pixel can thus in theory be up to n = 256 intensity levels away from its counterpart. This means that the size of our Q-values is M × N × 256, where M is the frame height and N is the frame width. That is, for grayscale QCIF video sequences with M = 144 and N = 176, as used in our experiments, the size of our Q-values will be 144 × 176 × 256 = 6 488 064. That means the Q-values alone require more than 50 gigabytes of RAM. As we were running the experiment on a personal computer with 8 GB (4 × 2 GB) of RAM, we found that only around 1024 × 20 Q-values could fit at a time. We therefore performed a "cherry-picking" procedure in order to assess the effectiveness of our novel algorithm.
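The arithmetic behind these figures can be checked with a few lines. The count of Q-values follows directly from the text; the conversion to memory assumes 8-byte floating-point values and one set of Q-values per sample of a 1000-sample minibatch, which is one reading under which the quoted figure of more than 50 gigabytes is reproduced. Both assumptions, and the helper names, are illustrative.

```python
M, N, LEVELS = 144, 176, 256   # QCIF height, width, intensity levels
BYTES_PER_VALUE = 8            # assumption: double-precision values
BATCH = 1000                   # assumption: one Q-value set per minibatch sample

def q_value_count(height, width, levels):
    """Number of Q-values for a full frame."""
    return height * width * levels

def storage_gb(count, batch=BATCH, bytes_per_value=BYTES_PER_VALUE):
    """Memory in gigabytes (10^9 bytes) for `batch` sets of `count` values."""
    return count * batch * bytes_per_value / 1e9

full_frame = q_value_count(M, N, LEVELS)   # 6 488 064 values
one_block = 1024 * 20                      # values for a single block
full_frame_gb = storage_gb(full_frame)     # about 51.9 GB
one_block_gb = storage_gb(one_block)       # about 0.16 GB
```

Under the same assumptions a single 1024 × 20 block needs well under one gigabyte, which is consistent with restricting the experiments to individual blocks.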
Recall that we aim to decode a frame X given its side information Y. To fit the likelihood function, a set of QCIF video sequences was used as training samples. X and Y were divided into blocks $x_1, x_2, x_3, \ldots, x_K$ and $y_1, y_2, y_3, \ldots, y_K$, respectively, and paired up. The pair $(x_i, y_i)$ was selected as a training sample if its variance was less than 100. That means the pixels in $x_i$ differ from the co-located pixels in $y_i$ by fewer than 10 intensity levels in the root-mean-square sense. The block length was chosen to be 1024, as the width of the Hadamard matrix has to be a power of 2. Recall that the encoder sends $z_i$, and the decoder only has access to $y_i$ and $z_i$ and tries to estimate $x_i$. Thus $[x_i \; z_i]$ will be the input to our Function Approximator.
During the training phase, we used minibatches of size 1000 and an ε-greedy policy with a constant ε = 0.1. The input was scaled to between -1 and 1 before entering the Function Approximator. We used 5 hidden layers of 200 nodes each, with tanh activation functions. Figures 5 and 6 show the learning ability of the Function Approximator, in terms of rate distortion, through the iterations/episodes. We notice the increase of the PSNR at each episode.
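The network shape and the exploration rule described above can be sketched as follows. Only the number of hidden layers, their width, the tanh activation, the input range and ε = 0.1 come from the text; the weight initialization, the linear output layer, and the function names are assumptions, and no training step is shown.

```python
import numpy as np

def scale_input(v, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (np.asarray(v, dtype=float) - lo) / (hi - lo) - 1.0

def init_mlp(in_dim, out_dim, hidden=200, layers=5, seed=0):
    """Random weights for `layers` hidden layers of `hidden` nodes each."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return [(rng.normal(0.0, 1.0 / np.sqrt(a), (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, v):
    """tanh hidden layers followed by a linear output (assumed)."""
    h = np.asarray(v, dtype=float)
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b

def epsilon_greedy(q_col, eps, rng):
    """Random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return int(rng.integers(q_col.size))
    return int(np.argmax(q_col))
```

For a 1024-pixel block with 20 candidates per position, the output vector of length 1024 × 20 would be reshaped to a 20 × 1024 array so that each column holds the Q-values of one pixel position.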
The same "cherry-picking" procedure was used for testing, since we were computationally limited. We tested the algorithm on QCIF video frames from the sequences Salesman and Hall Monitor. For each frame block, the encoder generates 3 syndrome coefficients (3 integers) and 1 variance value (1 double), at 25 blocks per full frame and 12 fps. Arguably, compression ratios 1024:3, 1024:10 and 1024:20 could thus be comparable to 33, 100 and 200 kbps, respectively. This insight shows the high potential of our scheme for low complexity, low bitrate and low distortion video coding.
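The bitrate estimate can be reproduced approximately with the sketch below. The block count and frame rate come from the text; the bit widths (32 bits per coefficient and 16 bits for the variance value) are assumptions chosen for illustration, and the result scales linearly with other choices.

```python
def bitrate_kbps(m, bits_per_coeff=32, bits_variance=16,
                 blocks_per_frame=25, fps=12):
    """Approximate bitrate for m coefficients per block.

    bits_per_coeff and bits_variance are assumed widths, not values
    taken from the text.
    """
    bits_per_block = m * bits_per_coeff + bits_variance
    return bits_per_block * blocks_per_frame * fps / 1000.0

rates = {m: bitrate_kbps(m) for m in (3, 10, 20)}
```

Under these assumptions the three compression ratios give 33.6, 100.8 and 196.8 kbps, close to the quoted 33, 100 and 200 kbps.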

V. CONCLUSION
We have presented a new and practical video compression scheme based on the Wyner-Ziv framework [3]. The novelty of our scheme lies mostly in the integration of Q-learning in the decoding process. The Wyner-Ziv coding problem has been the subject of a great deal of research for at least the past 15 years. The mainstream of Wyner-Ziv video coding has been based on binary codes [4], [12], [13]. Our algorithm is the first to truly deal with non-binary codes. A second advantage is its inherent scalability. Previous schemes, such as punctured codes [12], have used different methods for rate control. Non-binary codes have the advantage of reducing the computational complexity at both the encoder and the decoder, since calculations do not have to be performed at the bit level as in [14], for example.
We also showed that the Wyner-Ziv problem, at least in our case, can be solved using a Q-learning algorithm as likelihood estimator, with an inherent embedding of the EM framework. However, due to our computational limitations, we assessed the algorithm in a "cherry-picking" manner. The results shown are very good. We arguably showed that our video coding scheme is one of low complexity, low bitrate and low distortion. Low complexity, low bitrate and low distortion are especially meaningful for lightweight devices such as surveillance cameras, mobile phones, or probably Google™ Watches or Google™ Glasses in the near future, when provided with cameras.
Our encoding scheme is of very low complexity compared to that of motion-estimation-based video encoders. The bulk of the computation is shifted from the encoder to the decoder, which is assumed to be a powerful server. This is a great advantage, especially on lightweight devices such as mobile phones.

Fig. 1. Source coding with side information available at the decoder

Fig. 4. Learning process of our video codec

TABLE I. TABLE SHOWING THE RATE DISTORTION PERFORMANCE FOR SALESMAN VIDEO SEQUENCE

TABLE II. TABLE SHOWING THE RATE DISTORTION PERFORMANCE FOR HALL VIDEO SEQUENCE

Frame blocks with variance information ranging from around 10 to around 200 were selected. The rate distortion performances are given in tables I and II for the Salesman and Hall video sequences. It is important to notice that even though the Function Approximator was trained on variance information less than 100, we tested our algorithm on variance greater than 100. The reconstruction quality is still good to very good for compression ratios 1024:10 and 1024:20, respectively. Compression ratio 1024:3 was also tested to assess the compression limit. The idea was to check whether we could still achieve reasonable distortion while minimizing the number of bits to send.