Real-Time H.264/AVC Entropy Encoder Hardware Architecture in Baseline Profile

In this paper, we present a new hardware architecture of an entropy encoder for an H.264/AVC video encoder. The proposed design employs a parallel module at the pre-encoding stage to shorten the critical path. Additionally, the arithmetic table elimination method is used to eliminate the memory cost. The reduction in the size of the VLC tables also saves area. The architecture is synthesized on a Virtex IV FPGA. The simulation results show that the design can operate at up to 234 MHz, which allows processing the 4CIF video format in real time.

Keywords: H.264/AVC; CAVLC; Exp-Golomb


II. ENTROPY CODING ALGORITHM IN H.264
In the baseline profile, H.264 uses two tools for entropy coding: CAVLC and Exp-Golomb coding, as presented in Fig. 1. The residual information (quantized coefficients) is coded using CAVLC, while the other data are coded using Exp-Golomb.

A. CAVLC algorithm
The CAVLC is the entropy coding method used to encode the residual information in 4x4 or 2x2 blocks, which are generated by the quantization step [1]. Each block is first scanned in zigzag order to produce five main syntax elements, defined in [1] as follows:
- The coeff-token represents two values: the total number of non-zero coefficients (total-coeff) and the number of trailing ones (TT1s) in the block. The trailing ones (T1s) are the non-zero coefficients with value +/-1 at the end of the zigzag sequence. Each block has at most three T1s.
- The signs of T1s occupy from zero to three bits. They represent the signs of the T1 coefficients in reverse order.
- The levels are the values of the non-zero coefficients in the block other than the T1s. They are taken in reverse order.
- The total-zeros is the total number of zero coefficients before the last non-zero coefficient in the zigzag sequence.
- The run-before represents the run of zeros before each non-zero coefficient, in reverse order.
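As a minimal software sketch of the five syntax elements listed above, the following Python function derives them from a block that has already been scanned into 16 zigzag-ordered integers. The function name and the dictionary keys are illustrative assumptions, not names taken from the proposed design.

```python
def cavlc_syntax_elements(zigzag):
    """Derive the CAVLC syntax elements from zigzag-ordered coefficients."""
    # Positions of the non-zero coefficients, highest frequency first
    # (this is the "reverse order" used by CAVLC).
    nz = [i for i in range(len(zigzag) - 1, -1, -1) if zigzag[i] != 0]
    total_coeff = len(nz)

    # Trailing ones: up to three consecutive +/-1 values at the end of the scan.
    t1s = 0
    for i in nz:
        if abs(zigzag[i]) == 1 and t1s < 3:
            t1s += 1
        else:
            break

    t1_signs = [0 if zigzag[i] > 0 else 1 for i in nz[:t1s]]  # 0 = "+", 1 = "-"
    levels = [zigzag[i] for i in nz[t1s:]]                    # remaining non-zeros

    # Zeros located before the last non-zero coefficient of the scan.
    total_zeros = (nz[0] + 1 - total_coeff) if nz else 0

    # Zeros immediately preceding each non-zero coefficient, in reverse order.
    # The entry for the lowest-frequency coefficient is kept here for
    # completeness even though an encoder does not need to transmit it.
    run_before = []
    for k, i in enumerate(nz):
        prev = nz[k + 1] if k + 1 < len(nz) else -1
        run_before.append(i - prev - 1)

    return {
        "total_coeff": total_coeff,
        "t1s": t1s,
        "t1_signs": t1_signs,
        "levels": levels,
        "total_zeros": total_zeros,
        "run_before": run_before,
    }


example = cavlc_syntax_elements([0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8)
```

For this example block, the sketch reports five non-zero coefficients, three trailing ones, two remaining levels, and three zeros before the last non-zero coefficient.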
After that, these syntax elements are encoded in five sequential coding steps. The coeff-token, run-before, and total-zeros are encoded through different VLC LUTs. The CAVLC encoder steps are depicted in Fig. 3.
- In step 1, the coeff-token is encoded using four VLC LUTs, selected according to the number of non-zero coefficients in the left block (nA) and the upper block (nB) of the current block (the context-adaptive notion), as shown in Fig. 2.
- In step 2, each T1 is encoded with its corresponding sign bit in reverse order. The positive sign is represented by "0", and the negative sign is represented by "1".
- In step 3, the level values of the 4x4 block are encoded in reverse order using seven VLC LUTs selected by the total-coeff and TT1s. The choice of the VLC LUT used to encode each level depends on the magnitude of the last encoded level (the context-adaptive notion).
- In step 4, 15 VLC LUTs, indexed by the total-coeff value, are used to encode the total-zeros.
- In step 5, the run-before is coded with codewords taken from seven VLC LUTs selected by the zero-left value, which is the number of remaining zero coefficients.

B. Exp-Golomb algorithm
The Exp-Golomb coding is performed in two stages, as shown in Fig. 3.

Firstly, each syntax element k to be coded with Exp-Golomb is mapped to a non-negative integer named "codeNum". Depending on its statistical characteristics, each syntax element is mapped to a codeNum in one of four ways (ue, se, me or te) [1]:
- If a syntax element is always greater than or equal to zero and its most frequently occurring values are the lower ones, the applied process is called "unsigned Exp-Golomb (ue) coding". The corresponding codeNum is equal to the value of the unsigned element.
- If a syntax element is signed and its expected value is zero, the applied process is called "signed Exp-Golomb (se) coding". The corresponding codeNum is mapped to the syntax element value k as follows: codeNum = 2|k| for k ≤ 0, and codeNum = 2|k| - 1 for k > 0.
- If an unsigned element has statistical characteristics different from those of the ue case, its codeNum is mapped to its value in a special way, as indicated in the ITU-T recommendations [1]. The applied process is called "mapped Exp-Golomb (me) coding".
- If an unsigned element has 1 as its largest possible value, "truncated Exp-Golomb (te) coding" is applied; i.e., the bit representing the syntax element is the inverted value of the element.
Secondly, the codeNum parameter is mapped to a coded bit string, which has the following generic form:
{M zeros, 1, M-bit INFO} (1)
where M and INFO are given by equations 2 and 3:
M = floor(log2(codeNum + 1)) (2)
INFO = codeNum + 1 - 2^M (3)

III. PROPOSED CAVLC ARCHITECTURE
The suggested design processes each 4x4 block through two sequential stages. The pre-encoding stage produces the Syntax Elements (SEs) to be encoded from the residual input frames, and the encoding stage translates each SE into the related codeword length and codeword value. Both stages are described in the following subsections.

A. Pre-encoding CAVLC stage architecture
The pre-encoding architecture is depicted in the accompanying figure. The zigzag module is responsible for arranging the residual information coming from the quantization process in inverse zigzag order. After that, the reordered coefficients are stored in a first memory called "inverse-zigzag reordered-coefficient RAM". This module is not part of the CAVLC modules, but it is required for their correct operation.
The syntax element generator module takes the reordered coefficients as its input. It generates the first syntax elements to be produced, which are the TT1s, the total-coeff, and the total-zeros. Once the values of these syntax elements are calculated, the next two modules, shown in red squares, start processing. The two modules are independent; consequently, they operate in parallel.
The parallel module on the top is responsible for storing the T1s and the level values into a "non-zero coefficient RAM" memory. Together, the levels and the T1s make up all the non-zero coefficients. Each non-zero coefficient is saved in a new format that holds the absolute value of the non-zero coefficient in 11 bits and the sign in the 11th bit, as illustrated in Fig. 5. This format simplifies the level encoding process. The second parallel module is formed by combinatorial circuits and two RAMs that store the run-before and zero-left syntax elements, respectively. First, this module calculates the run-before values, and each calculated value is written into the "Run-before RAM" memory. When all the run-before values have been detected and stored, the controller enables the next module, which calculates the set of zero-left values and stores them in a "Zero-left RAM" memory. The zero-left value is initially equal to the total-zeros, and it is then decremented by the accumulated run-before values:
zero-left(0) = total-zeros, zero-left(i) = total-zeros - (run-before(0) + ... + run-before(i-1))
It is worth noting that the size of every memory used is 16 elements, which is the maximum number of non-zero, run-before and zero-left coefficients per 4x4 block. Besides, the inverse-zigzag reordered-coefficient, Run-before and Zero-left RAM memories are required for bitstream correctness.
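The relationship between zero-left and run-before described above can be illustrated with a short Python sketch. It only models the arithmetic; the function name is an assumption, and the RAM writes of the hardware are represented by appending to a list.

```python
def zero_left_values(total_zeros, run_before):
    """Return the zero-left value seen before coding each run-before.

    The first entry equals total-zeros; each following entry is reduced by
    the run-before values already consumed (reverse scan order).
    """
    zero_left = []
    remaining = total_zeros
    for run in run_before:
        zero_left.append(remaining)  # value that would be stored in the RAM
        remaining -= run             # decrement by the run just processed
    return zero_left


values = zero_left_values(3, [1, 0, 0, 1, 1])
```

With total-zeros equal to 3 and run-before values 1, 0, 0, 1, 1, the stored zero-left sequence is 3, 2, 2, 2, 1.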
The nC is also generated at this stage by a combinatorial circuit shown in Fig. 6. It selects the appropriate VLC LUT for coeff-token coding.
The controller at the pre-encoding stage is in charge of driving the control signals of the different RAMs and synchronizing the various modules. When the end-pre-encoding signal is set active, all the syntax elements are ready to be encoded.
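A software view of the nC computation may help here. The sketch below follows the averaging rule and table thresholds of the H.264 standard rather than the circuit of Fig. 6, which is not reproduced in this text; the function names and the use of `None` for an unavailable neighbour are assumptions made for illustration, and the chroma DC case is left out.

```python
def compute_nc(n_a=None, n_b=None):
    """Context value nC from the left (nA) and upper (nB) neighbour counts.

    None marks a neighbour that is not available (for example at a picture
    border).
    """
    if n_a is not None and n_b is not None:
        return (n_a + n_b + 1) >> 1  # rounded average of both neighbours
    if n_a is not None:
        return n_a
    if n_b is not None:
        return n_b
    return 0


def coeff_token_table(nc):
    """Index of the coeff-token table selected by nC (3 = fixed-length code)."""
    if nc < 2:
        return 0
    if nc < 4:
        return 1
    if nc < 8:
        return 2
    return 3


table = coeff_token_table(compute_nc(n_a=1, n_b=2))
```

In hardware, the same selection reduces to one adder, one shift, and a few comparators, which is consistent with a purely combinatorial implementation.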

B. Encoding CAVLC stage architecture
The encoding CAVLC architecture is illustrated in Fig. 7. The CAVLC hardware design takes the outputs of the CAVLC pre-encoder design as inputs. It is composed of seven main modules: five modules in charge of encoding the different syntax elements, one module for the main controller, and another one for the output packer. These modules and the optimization techniques used at this stage are detailed in the following subsections.

1) Optimized VLC for coeff-token and total-zero encoders:
The coeff-token and total-zeros are conventionally coded with different VLC LUTs in the ITU-T Recommendations [1]. However, a large memory is required to store the complete codewords' values and lengths as presented in these traditional VLC LUTs. Therefore, we suggest a new, compact representation of the codeword length and codeword value. For instance, the length of the original codewords in the conventional VLC coeff-token LUTs is in the range of 1 to 16, and their values are in the range of 0 to 63. Therefore, 5 bits are enough to represent the length information in the "coeff-token codeword length ROM" memory, and 6 bits are enough to represent the value information in the "coeff-token codeword value ROM" memory. An example of the new representation of codewords is given in Table I. This method is applied to all the VLC LUTs needed by the coeff-token and total-zeros sub-module encoders, which optimizes the VLC LUTs for both syntax elements. An example of an optimized VLC LUT is depicted in Fig. 8.
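The compact representation can be sketched in Python as a pair (length, value), where the leading zeros of a codeword are implied by its length and never stored. The dictionary below is a hypothetical stand-in for the two ROMs and holds only a few coeff-token entries for 0 ≤ nC < 2 taken from the standard; it is not the content of Table I.

```python
# (total_coeff, t1s) -> (codeword length, codeword value), for 0 <= nC < 2.
COEFF_TOKEN_ROM = {
    (0, 0): (1, 0b1),       # codeword "1"
    (1, 0): (6, 0b000101),  # codeword "000101"
    (1, 1): (2, 0b01),      # codeword "01"
    (2, 2): (3, 0b001),     # codeword "001"
    (5, 3): (7, 0b100),     # codeword "0000100"
}


def codeword_bits(length, value):
    """Rebuild the full codeword: the value right-aligned in `length` bits."""
    return format(value, "0{}b".format(length))


def fits_compact_format(length, value):
    """Check that an entry fits in a 5-bit length and a 6-bit value field."""
    return 0 < length < 2 ** 5 and 0 <= value < 2 ** 6


bits = codeword_bits(*COEFF_TOKEN_ROM[(1, 0)])
```

Storing 11 bits per entry instead of a full 16-bit codeword plus its length is where the area saving of this representation comes from.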

2) Arithmetic table elimination technique for level encoder:
Levels are encoded using the arithmetic table elimination technique, which replaces the seven level VLC LUTs given in the ITU-T recommendations [1]. This technique, reported in [6], reduces the memory area cost. Table II reports the pseudo-code describing the elimination procedure, which has the advantage of a very simple implementation circuitry.
The format of the level code is arranged as follows. The maximum codeword length is 28 bits.
Code = 0…0 1 x…x s (5)
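To illustrate how level codewords of the form in (5) can be computed arithmetically instead of being read from tables, the following Python sketch follows the level derivation of the H.264 standard. It is not the pseudo-code of Table II, which is not reproduced in this text; the function name is an assumption, and escape codes for very large levels are deliberately left out.

```python
def encode_levels(levels, total_coeff, t1s):
    """Encode levels (reverse scan order, T1s excluded) without lookup tables."""
    suffix_length = 1 if (total_coeff > 10 and t1s < 3) else 0
    codes = []
    for idx, level in enumerate(levels):
        # Fold magnitude and sign into one code: the LSB carries the sign.
        level_code = 2 * abs(level) - 2 + (1 if level < 0 else 0)
        if idx == 0 and t1s < 3:
            level_code -= 2  # the first level cannot be +/-1 in this case

        prefix = level_code >> suffix_length
        if (suffix_length == 0 and prefix >= 14) or prefix >= 15:
            raise NotImplementedError("escape codes are outside this sketch")

        suffix = ""
        if suffix_length > 0:
            low_bits = level_code & ((1 << suffix_length) - 1)
            suffix = format(low_bits, "0{}b".format(suffix_length))
        codes.append("0" * prefix + "1" + suffix)  # 0...0 1 x...x s

        # Context adaptation: grow the suffix after large magnitudes.
        if suffix_length == 0:
            suffix_length = 1
        if abs(level) > (3 << (suffix_length - 1)) and suffix_length < 6:
            suffix_length += 1
    return codes


codes = encode_levels([-3, 3, 4, -2], total_coeff=5, t1s=1)
```

Only shifts, additions, and comparisons are involved, which is why this style of encoder needs no level LUT memory.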

4) Main CAVLC controller:
The proposed CAVLC controller is presented in Fig. 10. The "idle state" represents the initial state. When the pre-encoding stage is finished (indicated by the signal "end pre-encoding CAVLC"), the finite state machine goes to the "coeff-token state". When the coeff-token encoder process is finished (indicated by the signal "end coeff-token encoding"), the finite state machine assigns the appropriate value to the signal "mux-selector" to select the outputs of the coeff-token encoder as the final outputs. Afterwards, the finite state machine goes to the "T1s state". When the T1s encoder process is completed, the finite state machine produces the appropriate value of the signal "mux-selector" to select the outputs of the T1s encoder as the final ones. The same process is repeated in the level, total-zeros and run-before states. At the end of the run-before encoding process, the signal "end run-before encoding" is set high, indicating that the CAVLC has completely encoded the 4x4 block and that a new block can be encoded.
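The state sequence described above can be modelled with a small Python sketch. The enumeration values, the function name, and the idea of reusing the state value as the mux-selector are assumptions made for illustration; the actual encoding of the signals in Fig. 10 is not given in the text.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    COEFF_TOKEN = 1
    T1S = 2
    LEVEL = 3
    TOTAL_ZEROS = 4
    RUN_BEFORE = 5


ENCODING_ORDER = [State.COEFF_TOKEN, State.T1S, State.LEVEL,
                  State.TOTAL_ZEROS, State.RUN_BEFORE]


def next_state(state, end_pre_encoding=False, end_current_encoder=False):
    """One controller transition per call (one clock edge in hardware)."""
    if state is State.IDLE:
        return State.COEFF_TOKEN if end_pre_encoding else State.IDLE
    if not end_current_encoder:
        return state  # wait until the active encoder signals completion
    idx = ENCODING_ORDER.index(state)
    if idx + 1 < len(ENCODING_ORDER):
        return ENCODING_ORDER[idx + 1]
    return State.IDLE  # block fully encoded, ready for a new one


def mux_selector(state):
    """Hypothetical selector value: the index of the active encoder."""
    return state.value


s = next_state(State.IDLE, end_pre_encoding=True)
```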

5) Output packer:
The output packer receives as inputs the signal "mux-selector" from the main controller and all the outputs of the encoder modules (codeword values and codeword lengths). This module is composed of two word multiplexers: one selects the appropriate codeword value and the other selects the appropriate codeword length. The selected codeword value and codeword length serve as the final outputs of the CAVLC coder.

IV. PROPOSED EXP-GOLOMB ARCHITECTURE
The proposed Exp-Golomb design is presented in the form of modules in Fig. 11. Each module implements one stage of the Exp-Golomb algorithm explained in Section II. Firstly, for every entry k, the "codeNum generator" module generates the corresponding codeNum value according to the mapping type (ue, te, se or me). When the mapping type is ue, te or se, the codeNum value is generated by the "codeNum generator1" module. This block produces the codeNum through the mathematical operations described in Section II, which only involve shifting, complementation, and incrementing by 1. Otherwise (mapping type = me), the codeNum is generated by a second generator module, called "codeNum generator2", based on four ROMs selected according to the mode type (intra or inter) and the prediction mode. The detailed architectures of these two codeNum generators are shown in Fig. 12 and Fig. 13, respectively.
The logarithm operation is required to produce the value of M, which is used to calculate the codeword length (equal to 2M+1). However, its direct implementation requires an expensive circuit, which constitutes the main hardware challenge of implementing an Exp-Golomb encoder. This problem can be solved in the following way: log2(N) corresponds to the number of times M that N can be divided by 2 before the result reaches zero, as in equation 3. Thus, the value of M can be obtained by counting shift operations.
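As a rough software counterpart of this datapath, the Python sketch below maps ue and se elements to codeNum, obtains M by counting right shifts, and assembles the codeword. It assumes that N stands for codeNum + 1, as in the standard Exp-Golomb definition; the function names are illustrative, and the me and te mappings are omitted.

```python
def code_num(k, mapping):
    """Map a syntax element to codeNum for the ue and se cases."""
    if mapping == "ue":
        return k
    if mapping == "se":
        return 2 * k - 1 if k > 0 else -2 * k
    raise ValueError("only 'ue' and 'se' are covered by this sketch")


def shift_count_log2(n):
    """floor(log2(n)) for n >= 1, obtained by counting right shifts."""
    m = 0
    n >>= 1              # first shift is loaded before counting starts
    while n != 0:        # count while the shifted value is still non-zero
        m += 1
        n >>= 1
    return m


def exp_golomb_codeword(code):
    """Build {M zeros, 1, M-bit INFO} for a given codeNum."""
    n = code + 1                    # assumption: N = codeNum + 1
    m = shift_count_log2(n)
    info = n - (1 << m)             # one shift and one subtraction
    info_bits = format(info, "0{}b".format(m)) if m > 0 else ""
    return "0" * m + "1" + info_bits


word = exp_golomb_codeword(code_num(-2, "se"))
```

The loop mirrors the shifter, register and counter arrangement: the counter stops as soon as the shifted value becomes zero, and its content is M.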

The suggested architecture of the logarithm operation is given in Fig. 14. The output of the barrel shifter is loaded into the register FF. The output Q of this register is connected to the inputs of the multiplexer and to the combinatorial circuit of OR gates. This circuit checks whether the output Q has reached the value 0 and produces a one-bit output, noted C.
Initially, the counter is set to 0. If the value of Q is different from zero, the value of C is equal to 1. Consequently, the AND gate enables up-counting, and the counter increments by one. In this case, the multiplexer assigns the value Q to the input K of the barrel shifter. When Q reaches the value 0, the output C is set to 0. Therefore, after M shifts, the logic 0 generated at the output of the AND gate stops the counter. In this case, the output of the counter corresponds to the value M such that M = log2(N).
The obtained results show that the CAVLC coder achieves an operating frequency of 234.14 MHz and occupies an area of 847 LUTs. The maximum frequency of the Exp-Golomb architecture is 234.14 MHz, and the memory cost is 847 in terms of LUTs.
It is worth mentioning that no external or embedded memory is used, in order to give a platform-independent estimate of the memory cost reduction that is suitable for ASICs and FPGAs of different generations and families. The simulation results provided in Fig. 16 show that the processing time per block varies widely. We take an average of 131 cycles per block. The performance of our proposed architecture is calculated as follows: the number of clock cycles needed per second for 4CIF (704 x 576) video at 30 fps = the number of clock cycles per block x the number of blocks per macroblock x the number of macroblocks per frame x the number of frames per second.
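The real-time condition can be checked with a few lines of Python. The 131 cycles per block, the 704 x 576 resolution, and the 30 fps rate come from the text; the figure of 24 blocks per macroblock (16 luma and 8 chroma 4x4 blocks, ignoring the separate DC blocks) is an assumption made only for this sketch, so the resulting number is indicative rather than the authors' own.

```python
def required_cycles_per_second(cycles_per_block, blocks_per_macroblock,
                               width, height, fps):
    """Clock cycles per second needed to entropy-code the video in real time."""
    macroblocks_per_frame = (width // 16) * (height // 16)
    return (cycles_per_block * blocks_per_macroblock
            * macroblocks_per_frame * fps)


ASSUMED_BLOCKS_PER_MACROBLOCK = 24  # assumption: 16 luma + 8 chroma 4x4 blocks

needed = required_cycles_per_second(131, ASSUMED_BLOCKS_PER_MACROBLOCK,
                                    704, 576, 30)
real_time = needed <= 234.14e6  # compare with the reported maximum frequency
```

Under this assumption, the required rate is roughly 149.4 million cycles per second, which stays below the reported 234.14 MHz.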

Fig. 14. Architecture of binary logarithm
After the logarithm operation, the INFO value is computed according to formula (3), which involves shift and subtraction operations. The last module (Exp-Golomb bit-generator) is in charge of producing the output codeword from the values of M and INFO. It is implemented as a finite state machine with two states, as shown in Fig. 15. The counter i is initialized in state 1. The second state corresponds to the generation of the output codeword bit by bit, following the structure presented in formula (1). Each bit is generated in one clock cycle.
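The behaviour of the two-state bit generator can be approximated in Python with a generator that yields one bit per iteration, each iteration standing for one clock cycle. The function name and the use of a Python generator are illustrative assumptions; the sketch only reproduces the bit order of formula (1).

```python
def exp_golomb_bit_generator(m, info):
    """Yield the 2M+1 codeword bits one at a time, most significant first."""
    total_bits = 2 * m + 1
    i = 0                            # state 1: initialise the bit counter
    while i < total_bits:            # state 2: emit one bit per clock cycle
        if i < m:
            yield 0                  # M leading zeros
        elif i == m:
            yield 1                  # separator bit
        else:
            yield (info >> (total_bits - 1 - i)) & 1  # INFO, MSB first
        i += 1


bits = list(exp_golomb_bit_generator(2, 0b01))
```

For M = 2 and INFO = 1, the generator produces the five bits 0, 0, 1, 0, 1 over five iterations.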


TABLE IV. PHYSICAL RESOURCES UTILIZATION OF CAVLC MODULES ON VIRTEX VI