A Study of FR Video Quality Assessment of Real Time Video Stream

To assess the quality of video transmitted in real time, this paper presents an approach that uses a full-reference (FR) video quality assessment (VQA) model to satisfy both objective and subjective measurement requirements. Obtaining the reference video at the measuring terminal and performing the assessment raises two problems: how to determine the reference video frame, and how to bring the objective score close to the subjective assessment. We present a novel method of computing the order number of the video frame at the test point. To establish the relationship between the objective distortion and the subjective score, we use the “best-fit” regressed curve model and a BP neural network to describe the prediction formula. The main aim of this work is to obtain assessment results that agree closely with human subjective perception, so we select a large number of video sources for testing and training. The experimental results show that the proposed approach is suitable for assessing video quality with an FR model and that the converted subjective score is usable.


I. INTRODUCTION
Video quality assessment (VQA) plays an important role in various video communication applications, as the demand for video applications in everyday life is growing rapidly. In these applications, the digital video is processed through many stages before it is displayed at the receiving terminal. The displayed video is inevitably degraded at the end device; the important question is how much it has been degraded. VQA provides the procedure for measuring the quality of the video. The quality can be described by the distortion relative to the original video or by the perception of the human visual system (HVS). Accordingly, there are two main methods: subjective scoring and objective estimation. Subjective methods are consistent with actual viewing conditions, but they take too much time and are not suited to real-time video applications. Objective video quality assessment methods have therefore been widely used, such as the peak signal-to-noise ratio (PSNR), which needs the original video as the full reference (FR). In some real-time video applications, however, the measurement system cannot obtain the reference video or cannot synchronize the reference video frames. Some existing studies have proposed No-Reference (NR) objective methods to address the absence of the reference video. However, it is difficult to design an objective NR VQA method because of the limited understanding of the HVS.
The published papers on VQA can be classified into (1) objective FR methods, (2) objective NR methods, and (3) methods that predict subjective scores from objective ones. These methods can be explained as follows. Ref. [1] proposed an FR structural distortion measurement method for images, which differs from traditional error-sensitivity approaches. The proposed method uses the multiscale structural similarity index (MS-SSIM) algorithm, which correlates significantly with human visual perception. Ref. [2] provides the visual information fidelity index (VIF) and Ref. [3] explains the video quality metric (VQM), both of which also use the full reference video to measure video quality. Refs. [4]-[6] give some new algorithms based on SSIM that produce different "structure" terms to estimate video quality.
Ref. [7] provides an estimate of the mean square error distortion at the macroblock level without the original content; it works at the receiver side for the H.264 codec and considers the effect of error concealment techniques. Ref. [8] introduced an NR video quality assessment method based on coding quantization, packet loss and error propagation, and the temporal effects of the HVS.
Ref. [8] gives a general description of objective and subjective VQA methods, with emphasis on a subjective video database made for wireless video applications that includes 160 distorted videos. Ref. [9] gives a standard test video database with subjective scores and the definition of the objective algorithms for calculating video distortion; it was produced by VQEG, which was formed in 1997 from a need to bring together experts in subjective video quality assessment and objective quality measurement.
From the above introduction, it is clear that objective measurement is suitable for real-time video quality measurement and that FR methods can estimate the distortion more accurately by using the original video content. Although NR methods have made progress, objective FR methods are still more reliable in practical industrial applications. However, because the original video cannot easily be obtained and synchronized frame by frame at the receiving terminal, as shown in Fig. 1, FR methods have generally been used only in research on encoding efficiency and distortion. How to synchronize the frames at the decoding end with the original video is therefore an important question for promoting objective FR methods in real-time video quality measurement. VQEG [9] reported the PSNR, one of the objective FR measures, together with the subjective scores of the test video database, and also recommended the "best-fit" regressed curve to obtain the best relationship between objective measures and subjective scores. In practical real-time video applications such as video surveillance and video conferencing, however, the video contents are similar or belong to the same type, unlike the test video database, which includes a wide variety of video contents. For real-time video applications in a specific scene, is there a better algorithm for estimating the subjective scores from objective measures that can be calculated automatically by computer? The main contributions of this paper are a novel approach to synchronizing the frame order between the sending and receiving ends by adding a special macroblock mark, and a study of using a BP neural network to learn the mapping between objective performance and subjective scores in order to improve VQA estimation in real-time video applications.

Figure 1. Which frame at the sender is the reference for the receiver? (Blocks: Encoder & Sender; Receiver & Decoder; Objective VQA)

The rest of the paper is organized as follows. Section II describes the novel approach of adding the special mark to the video before the encoder and computing the frame order number in order to obtain the FR objective performance. Section III discusses the principle of the BP neural network and the training and evaluation procedure for quality estimation, which establishes the correlation with human visual results. In Section IV, the experimental results are compared and discussed. Finally, Section V concludes the paper.

II. MARKING METHOD
In this section, we propose a novel method that adds a special mark to some macroblocks without destroying the video content. Using this mark, the frame's order number can be calculated at the receiver. First, the marking method is introduced. Second, the judgement criterion of the parsing procedure is described. Finally, the processing results and the feasibility are discussed.

A. Adding special mark
In the past, the number of a frame in the video sequence could be marked by the semantic methods defined in the different video compression standards. But this approach only provides the frame index, which clarifies the order for finding the prediction relationship. In a real-time video communication system, we still cannot confirm which frame fed into the encoder corresponds to the one being displayed at the receiving end, and so we cannot determine the reference frame for calculating the objective performance. In this paper, the proposed method adds some additional information to the video content, which can be easily parsed to obtain the order number of the reference frame.
The novel method of marking the order contains two procedures, as Fig. 2 shows: adding the special mark to the original video before encoding, and parsing the video sequence order number after decoding. The details of the first procedure are described as follows.
• Express the order number of each frame of the original video sequence (format YUV 4:2:0) in binary.
• Select the first 4 macroblocks (16x16) of the luminance component of the original video (more macroblocks may be selected, depending on the number of frames).
• Group the binary frame number into 2-bit pairs and replace each pair with one of 4 possible values according to the rule: (00) to 0, (01) to 85, (10) to 170, (11) to 255.
• Fill the luminance component of the selected macroblocks with these values to produce the marked video sequence.
For example, suppose one video sequence contains 256 frames. Then we can select the first 4 macroblocks and fill each with one of the 4 possible values (4^4 = 256) to represent the 256 frame numbers. For frame number 60, the number is written as (00111100) in binary, so according to the rule the first 4 macroblocks are filled with the values (0, 255, 255, 0).
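The marking rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the representation of the luminance plane as a list of rows, and the placement of the marked macroblocks from left to right along the top row are all assumptions made for the example.

```python
# Luminance level assigned to each 2-bit pair, as given by the marking rule.
LEVELS = {0b00: 0, 0b01: 85, 0b10: 170, 0b11: 255}
MB_SIZE = 16  # macroblock width and height in pixels


def frame_number_to_levels(frame_number, num_blocks=4):
    """Split the frame number into 2-bit pairs, most significant pair first."""
    if not 0 <= frame_number < 4 ** num_blocks:
        raise ValueError("frame number does not fit in the selected macroblocks")
    pairs = [(frame_number >> (2 * k)) & 0b11 for k in reversed(range(num_blocks))]
    return [LEVELS[pair] for pair in pairs]


def add_mark(luma, frame_number, num_blocks=4):
    """Fill the first macroblocks of the top row of a luma plane, in place.

    `luma` is a list of rows of integer pixel values. Placing the blocks left
    to right starting at column 0 is an assumption of this sketch.
    """
    for k, level in enumerate(frame_number_to_levels(frame_number, num_blocks)):
        for row in range(MB_SIZE):
            for col in range(k * MB_SIZE, (k + 1) * MB_SIZE):
                luma[row][col] = level
    return luma
```

For frame number 60, `frame_number_to_levels(60)` gives the four fill values of the worked example above.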

B. Parse the frame number
After passing through the real-time video communication system, the received video is decoded from the payload video stream. If the decoded video format is YUV 4:2:0, we can examine the luminance component of the decoded video. By parsing the first 4 macroblocks, we calculate the average pixel value of each macroblock and obtain the number that the decoded frame had at the encoding end. The analysis criterion and the processing are as follows.
• Calculate the average pixel value of each macroblock.
We define A to represent the average of the 256 luminance pixel values p(i, j) in one 16x16 macroblock:
$$A = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} p(i, j) \tag{1}$$
Each average is then mapped to the 2-bit value whose judgement range contains it. Finally, (A_1, A_2, A_3, A_4) is converted to the 8-bit binary number that is the order number of the frame in the original sequence, which is easily converted to decimal.
Note that the pixel values of a macroblock may change after coding, transmission, and decoding. However, the features of the macroblock do not change much, and the average value will not exceed the boundary of its range.
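The parsing step can be sketched as follows. The exact judgement ranges are not reproduced in this text, so the sketch assumes a nearest-level decision: each macroblock average is assigned to whichever of the four nominal values (0, 85, 170, 255) is closest. The function names and the left-to-right block layout along the top row are likewise hypothetical.

```python
LEVELS = (0, 85, 170, 255)  # nominal luminance for the pairs 00, 01, 10, 11
MB_SIZE = 16  # macroblock width and height in pixels


def block_average(luma, block_index):
    """Average luminance A of one 16x16 macroblock in the top row."""
    start = block_index * MB_SIZE
    total = sum(luma[r][c]
                for r in range(MB_SIZE)
                for c in range(start, start + MB_SIZE))
    return total / (MB_SIZE * MB_SIZE)


def parse_frame_number(luma, num_blocks=4):
    """Recover the frame number from the marked macroblocks of a decoded frame."""
    number = 0
    for k in range(num_blocks):
        avg = block_average(luma, k)
        # Nearest-level decision: an assumed stand-in for the judgement ranges.
        pair = min(range(4), key=lambda i: abs(LEVELS[i] - avg))
        number = (number << 2) | pair
    return number
```

With this decision rule, a macroblock average can drift by up to about 42 grey levels from its nominal value before the pair is misread, which is one way to picture the wide margin between neighbouring values.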

C. Discussion
To demonstrate the validity of the novel method, we feed the marked video to the encoder with extreme QP and transmission parameter settings and check that the frame order can still be recovered correctly.
For this purpose, we use the test video sequence BUS, which is in CIF format (352x288, YUV 4:2:0, 30 fps, 150 frames) and was downloaded from the JVT test videos. Because this sequence has 150 frames, which does not exceed 255, we select only the first 4 macroblocks to mark the frame number. They are at the top-right of the picture. According to the order number of the frame, the corresponding values are filled into the luminance pixels of each macroblock. Fig. 3 shows frame number 39 with the mark macroblocks added. The luminance values of the marked macroblocks are (0, 170, 85, 255), which means (00100111) in binary. To test the worst case, we set the QP of the H.264 encoder to 48. In this situation, the bit rate is very low and the picture quality is very poor. Fig. 4 shows the encoded video after adding the mark to the first 4 macroblocks. The test result shows that even high compression distortion does not change the features of the added mark; that is, the coding error does not affect the judgement of the mark. The wide margin of the judgement ranges guarantees the correctness of the method. The right frame number is obtained at the receiving end, which makes it possible to compute the objective distortion between the decoded video and the reference video.
Another influence is transmission error, which may cause the first 4 macroblocks to be lost or change the features of the mark. If this happens, the channel cannot transmit the data correctly. The first 4 macroblocks immediately follow the picture header information; if they are incorrect, the other data are very likely incorrect too. The video quality then becomes very low and the system is not suitable for video communication. So if the added mark falls outside the range and we cannot find the right frame number, this indicates that the video system is abnormal. A misjudged frame number leads to a poor objective performance value, which correctly reflects the poor video quality.

III. ESTABLISH THE SUBJECTIVE AND OBJECTIVE ESTIMATION
To study the relationship between the subjective score and the objective distortion, the BP neural network training method and the "best-fit" regressed curve model are introduced in this section. Using these methods, it is easy to convert the objective results of the full-reference model into a subjective score that matches human visual evaluation. The subjective score is regarded as the final VQA result.

A. BP neural network
Back-propagation (BP) is a common method of training artificial neural networks (NN) so as to minimize the objective function. A BPNN is a good choice when a large amount of input/output data is available and the relationship between them is complex, variable, and not known in advance. Past research provides a large amount of training video with the subjective scores as output data. Subjective human vision is not a deterministic problem, since it depends on human experience and mood. Recently, some neural-network-based approaches to video quality evaluation have been proposed [10]-[12].
A BP neural network can learn and store a large number of input-output mappings without prior mathematical equations that describe the mapping relationship. The learning rule uses the steepest descent method: errors are back-propagated to adjust the network weights and thresholds so that the squared error of the network is minimized. The topology of the BP neural network model includes an input layer, a hidden layer, and an output layer; it is a multilayer feed-forward neural network. The neuron transfer function is an S-shaped (sigmoid) function whose output is a continuous quantity between 0 and 1, so the network can realize any nonlinear mapping from input to output, y = f(x). The mapping function y = f(x) is estimated during a training phase, in which the network learns to associate input vectors with output vectors. The main advantage of the NN method is its ability to handle the nonlinear problems of VQA [12].
Our design of the BPNN has two parts: one is the structure of the BPNN, and the other is the training rule governing how the weights of the network connections change. Fig. 5 shows a basic BP neural model with R inputs. Each input is connected to the next layer through an appropriate weight, and the network output can be expressed as:

$$a = f\left(LW \cdot f\left(IW \cdot p + b^{1}\right) + b^{2}\right) \tag{2}$$

In (2), f is the transfer function that describes the relationship between inputs and outputs, p is the input vector, and a is the network output. IW is the matrix of input-layer-to-hidden-layer weights. LW is the matrix of hidden-layer-to-output-layer weights. b^1 is the adjustment parameter (bias) of the input layer to hidden layer, and b^2 is the adjustment parameter (bias) of the hidden layer to output layer.
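As a rough illustration of how the output in (2) is evaluated, the sketch below computes one forward pass through a network with a tanh hidden layer and a single linear output neuron, the combination used for the two-level network described in this paper (tansig is mathematically the same function as tanh). The function name and the list-based weight layout are assumptions made for the example.

```python
import math


def forward(p, IW, b1, LW, b2):
    """Evaluate a = LW * tanh(IW * p + b1) + b2 for one input vector p.

    IW: hidden-by-input weight matrix given as a list of rows,
    b1: one bias per hidden neuron,
    LW: one output weight per hidden neuron,
    b2: scalar output bias.
    """
    hidden = [math.tanh(sum(w * x for w, x in zip(row, p)) + b)
              for row, b in zip(IW, b1)]
    return sum(w * h for w, h in zip(LW, hidden)) + b2
```

Choosing a linear output neuron keeps the predicted score unbounded, so the hidden layer alone carries the nonlinearity of the mapping.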

C. The "best-fit" regressed curve model
In statistics, nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.The data are fitted by a method of successive approximations.Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints.
The "best-fit" regressed curve model is recommended by VQEG [13] and has been widely used in evaluating the performance of algorithms. We used a 4-parameter logistic function, constrained to be monotonic, to transform the objective score into the subjective one, as in (3):

$$Q'_j = \beta_2 + \frac{\beta_1 - \beta_2}{1 + e^{-(Q_j - \beta_3)/|\beta_4|}} \tag{3}$$

In the following experiments, we use the function nlinfit() in Matlab to find the best values of the parameters β. In (3), the VQA value of the objective estimation method is represented by Q_j, j = 1, 2, 3, ..., 150, and Q'_j is the corresponding fitted score.
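A small sketch of the mapping may help. It assumes the common 4-parameter monotonic logistic form Q' = b2 + (b1 - b2) / (1 + exp(-(Q - b3) / |b4|)); the function names are hypothetical, and the second function is only the least-squares objective that a nonlinear fitting routine such as nlinfit() would minimize, not the fit itself.

```python
import math


def logistic4(q, b1, b2, b3, b4):
    """Map an objective score q to a predicted subjective score.

    b1 and b2 are the two asymptotes, b3 is the midpoint, and |b4| controls
    the slope; taking the absolute value keeps the curve monotonic.
    """
    return b2 + (b1 - b2) / (1.0 + math.exp(-(q - b3) / abs(b4)))


def sum_squared_error(params, scores, dmos):
    """Least-squares objective over all (objective score, DMOS) pairs."""
    return sum((logistic4(q, *params) - d) ** 2 for q, d in zip(scores, dmos))
```

At q = b3 the curve sits exactly halfway between the two asymptotes, which gives a convenient way to choose starting values before fitting.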

IV. EXPERIMENTAL RESULTS
For the experiment, a total of 150 video sequences are generated from 10 reference video sequences (named "Blue Sky", "River Bed", "Pedestrian area", "Tractor", "Sunflower", "Rush hour", "Station", "Park run", "Shields", and "Mobile and Calendar") downloaded from the VQEG [13]-[14] and LIVE [15] databases. The uncompressed RAW sequences are in YUV 4:2:0 format with a resolution of 768x432. Each original video is subjected to H.264 video coding at different bit rates (from 200 kbps to 5 Mbps), different packet loss rates (3%, 5%, 10%, 20%) to simulate network transmission loss, different bit error rates (0.5%-10%) to simulate a stream in a wireless channel, and MPEG-2 video coding at different bit rates (from 700 kbps to 4 Mbps). Each RAW sequence is thus turned into 15 video sequences through different processing, covering the main coding and transmission distortions, to improve the stability of the measurement experiment. The 10 RAW video sequences are described in Table I. We then calculate the objective performance using MS-SSIM and MOVIE. Paper [16] compared the evaluation performance and computational complexity of various objective algorithms. PSNR is used only as an indicator of quality. MS-SSIM seems to perform the best among the algorithms; it is easy to implement and suitable for real-time estimation. The multiscale SSIM index [17] corrects the viewing-distance dependence of single-scale SSIM (SS-SSIM) and accounts for the multiscale nature of both natural images and the human visual system. The MS-SSIM index performs better (relative to human opinion) than the SS-SSIM index on images. The computation formula of MS-SSIM is given in (4):

$$\text{MS-SSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j} [s_j(x, y)]^{\gamma_j} \tag{4}$$

In (4), l_M is the luminance comparison at the coarsest scale M, and c_j and s_j are the contrast and structure comparisons at scale j, as defined in [17].
Motion-based Video Integrity Evaluation (MOVIE) [18] integrates both spatial and temporal aspects of distortion assessment. MOVIE explicitly uses motion information from the reference video and evaluates the quality of the test video along the motion trajectories of the reference video. MOVIE contains five main processing steps. The first step, Linear Decomposition, decomposes the reference and test videos into multiple spatio-temporal bandpass channels using a family of Gabor filters. The second step, Spatial MOVIE, measures the spatial distortions between the reference and test videos. The third step is Motion Estimation, in which motion is computed from the reference video sequence in the form of optical flow fields. The fourth step is Temporal MOVIE, which uses the spatio-temporal Gabor decompositions of the reference and test video sequences, together with the optical flow field computed from the reference video using the outputs of the Gabor filters, to estimate the temporal video quality. The last step is to compute the MOVIE index, as in (5).
$$\text{MOVIE} = \text{Spatial MOVIE} \times \text{Temporal MOVIE} \tag{5}$$

In (5), following [18], the MOVIE index is the product of the Spatial MOVIE and Temporal MOVIE indices.

The experimental results of the "best-fit" regressed curve model are shown in Fig. 7, which gives the scatter plot of MS-SSIM versus DMOS together with the curve fitted with the best parameters.

We use 120 video sequences as the training inputs for the BP neural network and the remaining 30 sequences as the test data. As mentioned earlier, training fixes the network parameters, and testing uses the trained network to estimate the subjective scores. As suggested by VQEG, we used three metrics to evaluate the performance of the proposed method: the root-mean-squared error (RMSE), the Pearson correlation coefficient (PCC), and the Spearman rank-order correlation coefficient (SROCC), computed between the fitted objective data and the corresponding subjective data. The performance comparison of the algorithms is shown in Table II. From the results, we conclude that the performance of the BPNN is close to that of the best-fit curve, and that the MOVIE algorithm is slightly better than MS-SSIM.

V. CONCLUSIONS
This paper has proposed a novel method of calculating the order number of the reference video frame at the receiving end by adding a special mark in macroblocks. The method has been used in VQA for a video conference system and for web streams to determine which frame can be used as the reference in the FR model. The method is robust to high coding error from the video encoder, and it obtains the right frame number at the receiving terminal after the video is decoded. This paper also designs a two-level BP neural network structure to fix the relationship between the objective performance and the subjective DMOS score. We use a large number of video sequences to train the BPNN and to fix the parameters of each level. The experimental results show that its performance is close to that of the "best-fit" regressed curve. In future work, it would be interesting to study the use of a BPNN for NR VQA. We will also pursue research on wireless video quality assessment applications.

Figure 2. The procedures of adding the mark and parsing the number

Figure 3. The marked video; the frame number is 39.

Figure 4. The encoded video with QP 48 after adding the mark

Figure 5. The BP neural model

B. Training and prediction method
Building the evaluation model is, in fact, training the neural network. Before training, the network architecture needs to be constructed. This requires four inputs: the R x 2 matrix formed by the minimum and maximum values of the R-dimensional input samples, the number of neurons in each layer, the transfer function of each layer of neurons, and the training function. We build a two-level neural network. The first level contains 15 neurons n_1 with the transfer function tansig(). The second level is a single neuron n_2 with a linear transfer function, and traingd() is used as the training function. Fig. 6 shows the two-level structure of the BP neural network; the figure draws only the network structure for one-dimensional input samples.

Figure 6. The structure of the two-level BP neural network for training

In training, another important element is the input training data. For training, we use MS-SSIM and MOVIE, the objective quality performance parameters, as the input vector p.
the subjective estimation method from human.
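The training procedure can be sketched with a small pure-Python example. It is only an illustration under stated assumptions: a hidden layer of 15 tanh neurons and one linear output neuron, trained by batch gradient descent on the mean squared error, which is the behaviour the text associates with tansig() and traingd(). The function name, learning rate, number of epochs, weight initialization, and the suggestion to scale the subjective scores to the range 0 to 1 before training are all choices made for this sketch and are not taken from the paper.

```python
import math
import random


def train_bp(samples, targets, hidden=15, lr=0.01, epochs=200, seed=0):
    """Train a tanh-hidden, linear-output network by batch gradient descent.

    Returns the learned parameters and the mean squared error of each epoch.
    """
    rng = random.Random(seed)
    n, n_in = len(samples), len(samples[0])
    IW = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    LW = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        g_IW = [[0.0] * n_in for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_LW = [0.0] * hidden
        g_b2 = 0.0
        mse = 0.0
        for p, t in zip(samples, targets):
            # Forward pass: hidden activations, then the linear output.
            h = [math.tanh(sum(w * x for w, x in zip(IW[j], p)) + b1[j])
                 for j in range(hidden)]
            err = sum(LW[j] * h[j] for j in range(hidden)) + b2 - t
            mse += err * err / n
            # Backward pass: accumulate the gradient of the mean squared error.
            g_b2 += 2 * err / n
            for j in range(hidden):
                g_LW[j] += 2 * err * h[j] / n
                delta = 2 * err * LW[j] * (1 - h[j] ** 2) / n
                g_b1[j] += delta
                for i in range(n_in):
                    g_IW[j][i] += delta * p[i]
        history.append(mse)
        # Steepest-descent update of all weights and biases.
        b2 -= lr * g_b2
        for j in range(hidden):
            LW[j] -= lr * g_LW[j]
            b1[j] -= lr * g_b1[j]
            for i in range(n_in):
                IW[j][i] -= lr * g_IW[j][i]
    return (IW, b1, LW, b2), history
```

In use, each sample would be a pair of objective scores for one test sequence and each target the corresponding scaled subjective score; the returned error history shows whether the descent is converging.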

Figure 8. BP neural network experimental results (MS-SSIM vs. DMOS)

Fig. 8 gives the experimental results for MS-SSIM versus DMOS. The two polylines are the subjective values from the real test and the predicted values computed by the BPNN.
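The three evaluation metrics used in Section IV (RMSE, PCC, and SROCC) can be computed as in the following sketch. The function names are hypothetical, and the rank correlation is implemented here as the Pearson correlation of average ranks, which is the usual way of handling ties; the paper does not state how ties were treated.

```python
import math


def rmse(pred, ref):
    """Root-mean-squared error between predicted and subjective scores."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(ranks(x), ranks(y))
```

PCC and RMSE measure prediction accuracy after the mapping to subjective scores, while SROCC depends only on the ordering and therefore measures monotonicity.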