Critical Path Reduction of Distributed Arithmetic Based FIR Filter

Operating speed, the reciprocal of the critical path computation time, is one of the prominent design metrics of finite impulse response (FIR) filters. It is largely affected by both the system architecture and the technique used to design the arithmetic modules. The large computation time of multipliers in conventional designs limits the speed of the system architecture. Distributed arithmetic (DA) is one of the techniques used to provide multiplier-free multiplication in the implementation of FIR filters. However, it suffers from a severe limitation: the look-up table (LUT) grows exponentially with the order of the filter. An improved distributed arithmetic technique is addressed here for the system architecture of an FIR filter. In the proposed technique, the single large LUT of conventional DA is replaced by a number of smaller indexed LUT pages to restrict exponential growth and to reduce system access time. It also eliminates the use of adders. A selection module selects the desired value from the desired page, which reduces the computation time of the critical path. The trade-off between the access times of the LUT pages and the selection module helps to achieve the minimum critical path and so maximize the operating speed. Implementations target Xilinx Virtex IV devices using Xilinx ISE. Results are presented for FIR filters with an 8-bit input sample width. It is observed that the proposed design performs significantly faster than the conventional DA and existing DA-based designs.

Keywords—Critical Path; Multiplierless FIR filter; Distributed Arithmetic; LUT Design; Indexed LUT


I. INTRODUCTION
Digital signal processing (DSP) systems are generally implemented using sequential circuits, where the arithmetic modules in the longest path between any two storage elements form the critical path. The critical path computation time (CPCT) determines the minimum feasible clock period and hence the maximum allowable operating frequency of a DSP system. The finite impulse response (FIR) digital filter, one of the most widely used linear time-invariant (LTI) systems, has gained popularity in digital signal processing due to its stability, linearity and ease of implementation. However, particular attention is needed when designing a high-speed FIR filter, as the CPCT is affected by both the system architecture and the techniques used to design the arithmetic modules. For such a critical architectural design, the fixed structure offered by a digital signal processor is not suitable. At the same time, high nonrecurring engineering (NRE) costs and long development times for application-specific integrated circuits (ASICs) are making field-programmable gate arrays (FPGAs) more attractive for application-specific DSP solutions. FPGAs also offer more design flexibility for arithmetic modules than ASICs.
For an Nth-order FIR filter, each output sample is the inner product of the impulse response and the input vector of the latest N samples [1], as given in (1).
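Written in the standard form, the inner product in (1) reads as below. The symbols are assumptions made here for illustration: $h(k)$ denotes the $k$th impulse-response coefficient, $X(k)$ the $k$th of the latest $N$ input samples, and $y$ the output sample.

```latex
y = \sum_{k=0}^{N-1} h(k)\, X(k) \tag{1}
```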

 
For critical path minimization, direct implementation of (1) is not a cost-effective solution for two reasons. First, the critical path increases with the order of the filter; second, the multiplier is an expensive arithmetic module with respect to area and computation time. For more than two decades, many researchers [2][3][4][5][6][7][8][9][10] have worked on various multiplierless techniques for FIR filter design. For constant-coefficient multiplication, look-up-table (LUT) multipliers [11][12][13] and distributed arithmetic (DA) [14][15][16][17][18][19][20][21][22][23][24] are two memory-based approaches found in FIR filter design. An improved distributed arithmetic technique is addressed here for the system architecture of an FIR filter, as its operating speed is almost independent of the order of the filter.
In recent years, distributed arithmetic has gained substantial popularity due to its regular structure and high throughput capability, which result in a cost-effective and efficient computing structure. The technique was first introduced by Croisier [14] and further developed by Peled [15] for efficient implementation of digital filters in its serial form. Despite its several advantages, the DA-based structure faces a serious limitation: memory grows exponentially with the order of the filter. Many researchers [16][17][18][19][20][21][22][23][24][25][26][27] have addressed this problem. Partially or fully parallel structures processing two or more bits at a time [16,25] have been exploited to overcome the speed limitation inherent to the bit-serial DA structure. Attempts have also been made to reduce the memory requirement by recasting the input data in offset binary coding (OBC) [16], modified OBC and LUTless DA-OBC [19], instead of normal binary coding. Yoo and Anderson [22] extended this work and proposed a hardware-efficient LUTless architecture, which gradually replaces the LUT requirement with multiplexer/adder pairs. However, the area reduction is achieved at the cost of an increased critical path over the conventional design. LUT decomposition, or slicing of the LUT, proposed in [23], is one way to restrict the exponential growth of memory. Although this technique addresses the problem of exponential memory growth, latency and access time depend on the level of decomposition.
As the operating speed of a filter is governed by the worst-case critical path, an improved technique is suggested in this paper to increase the speed of operation by reducing the critical path. In the proposed technique, the single large LUT of conventional DA is replaced by a number of smaller indexed LUT pages to restrict exponential growth and to reduce system access time. Indexing the LUT pages eliminates the adders used in existing techniques [16,17,19,22,23,24]. A selection module selects the desired value from the desired page and feeds it forward for further computation. The trade-off between the access times of the LUTs and the selection module helps to achieve the minimum critical path and so maximize the operating speed.
The paper is organized as follows. Section II elaborates the look-up table concept of the conventional DA and proposed DA structures. Critical path computation time (CPCT) analysis of the previous and proposed techniques is given in Section III. Section IV presents the realization of the proposed architecture. Section V first presents a component-level access time analysis of the proposed design, followed by a comparison of the operating frequency of the proposed and previous techniques. Section VI concludes the paper.

II. CONVENTIONAL DISTRIBUTED ARITHMETIC ALGORITHM FOR FIR IMPLEMENTATION
Distributed arithmetic is one of the preferred methods of FIR filter implementation, as it eliminates the need for a multiplier, particularly when the multiplication is with constant coefficients. With this technique, the sum-of-products terms in (1) can easily be transformed into additions. Let $B$ be the word length of the input samples; then, in unsigned binary form, $X(n)$ can be represented as in (2), where $x_{n,i}$ is the $i$th bit of $X(n)$. Substituting the value of $X(n)$ from (2) into (1) expresses the inner product as (3). Interchanging the order of summation in (3) results in (4), whose compressed form is (5). Thus (5) creates $2^N$ possible values of $\gamma$. All these values can therefore be precomputed and stored in the form of a look-up table, shown in Table I. The filtering operation is performed by successively accumulating and shifting these precomputed values, based on the bit address formed by the input samples $X(n)$.

A method is proposed to choose the LUT size that gives the minimum critical path computation time of the LUT unit. Let $N = n + m$, where $n$ and $m$ are arbitrary positive integers. The single large LUT of size $2^N$ in the conventional design is converted into $2^m$ LUT pages, each with $2^n$ memory locations. Applying this concept to (5), the terms in $\gamma$ can be divided into two groups, $n$ LSB terms and $m$ MSB terms, as represented in (6). The $n$ LSB bits define the size of each LUT page, whereas the $m$ MSB bits define the number of LUT pages. Instead of the coefficient sums held in the conventional look-up table, the LUT of the proposed design holds indexed sums of filter coefficients.
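The steps numbered (2) to (6) follow the standard DA derivation and can be sketched as below. The notation is an assumption made here: $h(k)$ denotes the $k$th coefficient, and the sample index is written $k$ rather than $n$ to avoid a clash with the page-size parameter $n$ in $N = n + m$. Equation numbers follow the references in the text.

```latex
\begin{align}
X(k) &= \sum_{i=0}^{B-1} x_{k,i}\, 2^{i} \tag{2} \\
y &= \sum_{k=0}^{N-1} h(k) \sum_{i=0}^{B-1} x_{k,i}\, 2^{i} \tag{3} \\
y &= \sum_{i=0}^{B-1} 2^{i} \sum_{k=0}^{N-1} h(k)\, x_{k,i} \tag{4} \\
y &= \sum_{i=0}^{B-1} 2^{i}\, \gamma_i,
  \qquad \gamma_i = \sum_{k=0}^{N-1} h(k)\, x_{k,i} \tag{5} \\
\gamma_i &= \sum_{k=0}^{n-1} h(k)\, x_{k,i} + \sum_{k=n}^{n+m-1} h(k)\, x_{k,i} \tag{6}
\end{align}
```

In this form, the first sum in (6) depends only on the $n$ LSB address bits and so selects a location within a page, while the second sum is constant for a given combination of the $m$ MSB bits, which appears to correspond to the index term stored with each page.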
A page selector module selects the desired output from one of the LUT pages, addressed by $m$ bits. A suitable combination of $n$ and $m$ minimizes the combined execution time of the LUT page and the page selector module, to attain the maximum operating frequency. The LUT page structure of a 6th-order filter for $n=4$ and $m=2$, and the index term of each page, are elaborated in Table II and Table III respectively. Each LUT page contains the summation of filter coefficients and the index term $I$.
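The relationship between the single conventional LUT and the indexed pages can be illustrated with a small behavioural model. This is a minimal sketch, not the authors' hardware description: the function names, the integer coefficients and the sample values are all hypothetical, and the index term is assumed to be the sum of the MSB-group coefficients selected by the page number.

```python
def coeff_sum(coeffs, address):
    """Sum of the coefficients whose address bit is 1 (bit k selects coeffs[k])."""
    return sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)

def build_conventional_lut(h):
    """Single LUT with 2**N entries, one per N-bit address."""
    return [coeff_sum(h, a) for a in range(1 << len(h))]

def build_indexed_pages(h, n):
    """2**m pages of 2**n entries; every entry already includes its page's index term."""
    m = len(h) - n
    lsb, msb = h[:n], h[n:]
    pages = []
    for p in range(1 << m):
        index_term = coeff_sum(msb, p)  # constant for the whole page (assumed meaning of I)
        pages.append([coeff_sum(lsb, a) + index_term for a in range(1 << n)])
    return pages

def bit_address(samples, i):
    """N-bit address formed from bit i of every sample (sample k drives address bit k)."""
    return sum(((x >> i) & 1) << k for k, x in enumerate(samples))

def da_conventional(h, samples, width):
    """Bit-serial DA: look up one value per bit position, then shift and accumulate."""
    lut = build_conventional_lut(h)
    return sum(lut[bit_address(samples, i)] << i for i in range(width))

def da_indexed(h, samples, width, n):
    """Same computation, but the m MSB address bits pick a page and the n LSBs a location."""
    pages = build_indexed_pages(h, n)
    acc = 0
    for i in range(width):
        addr = bit_address(samples, i)
        page, offset = addr >> n, addr & ((1 << n) - 1)
        acc += pages[page][offset] << i  # page choice models the 2**m : 1 multiplexer
    return acc

h = [3, -1, 4, 1, -5, 9]        # hypothetical integer coefficients, N = 6
x = [17, 200, 3, 255, 64, 128]  # hypothetical unsigned 8-bit samples
direct = sum(c * s for c, s in zip(h, x))
```

With $n=4$ and $m=2$ this produces four pages of 16 entries, and both `da_conventional(h, x, 8)` and `da_indexed(h, x, 8, 4)` agree with the direct inner product. No adder is needed after the page lookup because the index term is folded into each stored entry.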

III. CRITICAL PATH COMPUTATION TIME ANALYSIS OF PROPOSED ARCHITECTURE
In this section, the CPCT analysis [13] of conventional DA [14][15][16], LUTless DA [19,22], sliced DA [16,17,23,24] and the proposed DA-based FIR filter techniques is elaborated. These designs are considered because they are the most comparable with the proposed technique.
The conventional form of the distributed arithmetic FIR filter, given in fig. 1, consists of a bank of input registers, a LUT unit and an accumulator/shifter unit. Apart from these hardware units, it needs a control unit, which defines the sequence of filter operations.

A. LUTless DA based FIR filter
Exponential growth of the LUT is the key issue in designing a DA-based FIR filter. Elimination of the LUT is an attempt found in [13,24] to overcome this growth. In such a LUTless structure, shown in fig. 3, the LUT is replaced by multiplexer-adder pairs. The on-line data generated by the multiplexers are accumulated to create the filter output.
The DFG of the LUTless DA-based FIR filter, shown in fig. 4, consists of a multiplexer node M, adder nodes $T_a$ and the accumulator/shifter node. Assuming the adders in the adder tree are arranged in 4:2 form, the access time of $\log_2 N$ adders is taken into consideration when calculating the adder-tree contribution $C_a$ to the CPCT of the structure. It is expressed as:

$C_a = \log_2 N \times T_a$ (8)

Thus $C_a$ is highly dependent on the filter order, as indicated in (8). The CPCT of the structure becomes:

$\mathrm{CPCT}_{\mathrm{LUTless}} = C_M + C_a + C_{as}$ (9)

where $C_M$ is the access time of the multiplexer, $C_a$ is the access time of the adder tree and $C_{as}$ is the access time of the accumulator/shifter unit.
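The LUTless data path and the logarithmic adder-tree depth behind (8) can be illustrated with a behavioural model. This is a sketch under stated assumptions: a plain pairwise (binary) adder tree is used here rather than the 4:2 arrangement mentioned in the text, and the function names and test values are hypothetical.

```python
def adder_tree(values):
    """Pairwise adder tree; returns (sum, number of adder levels traversed)."""
    levels = 0
    while len(values) > 1:
        paired = [values[j] + values[j + 1] for j in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes through to the next level
        values = paired
        levels += 1
    return values[0], levels

def lutless_da(h, samples, width):
    """Bit-serial DA without a LUT: one 2:1 multiplexer per tap feeds an adder tree."""
    acc, depth = 0, 0
    for i in range(width):
        # Each multiplexer passes its coefficient when bit i of its sample is 1, else 0.
        selected = [c if (x >> i) & 1 else 0 for c, x in zip(h, samples)]
        partial, depth = adder_tree(selected)
        acc += partial << i  # accumulator/shifter stage
    return acc, depth

h = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical coefficients, N = 8
x = [8, 7, 6, 5, 4, 3, 2, 1]   # hypothetical input samples
result, depth = lutless_da(h, x, 8)
```

For $N = 8$ the tree has three levels, matching $\log_2 N$, and every level sits on the critical path.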

B. Sliced LUT DA based FIR filter
Another well-known attempt, found in [21,22,27], to restrict the exponential growth of the LUT is the use of multiple memory banks.
More recently, Longa and Miri [23] highlighted that the FIR filter structure becomes area-efficient when a single large LUT is replaced by a number of smaller 4-input LUTs. However, this arrangement imposes the burden of an adder tree, which is required to add the partial terms generated by the smaller LUTs. Such a LUT arrangement is generally referred to as partitioning or slicing of the LUT. Architectural details of the sliced DA-based FIR filter are shown in fig. 5. The access time of the LUT is reduced from $C_L$ to $C_{SL}$ by the slicing technique; however, slicing adds the overhead of the adder tree access time $C_a$ to $\mathrm{CPCT}_{\mathrm{slice}}$.
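The slicing idea can be illustrated with a behavioural model in which each group of four taps has its own 16-entry LUT and the partial terms are summed afterwards. This is a sketch only: the function names and values are hypothetical, and the adder tree is modelled by a plain sum.

```python
def coeff_sum(coeffs, address):
    """Sum of the coefficients whose address bit is 1."""
    return sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)

def build_slices(h, inputs_per_slice=4):
    """One small LUT per group of taps instead of a single 2**N-entry LUT."""
    groups = [h[j:j + inputs_per_slice] for j in range(0, len(h), inputs_per_slice)]
    return [[coeff_sum(g, a) for a in range(1 << len(g))] for g in groups]

def sliced_da(h, samples, width, inputs_per_slice=4):
    slices = build_slices(h, inputs_per_slice)
    acc = 0
    for i in range(width):
        partials = []
        for s, lut in enumerate(slices):
            group = samples[s * inputs_per_slice:(s + 1) * inputs_per_slice]
            addr = sum(((x >> i) & 1) << k for k, x in enumerate(group))
            partials.append(lut[addr])
        # Adding the partial terms is the extra adder-tree step: len(slices) - 1 adders.
        acc += sum(partials) << i
    return acc

h = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical coefficients, N = 8
x = [8, 7, 6, 5, 4, 3, 2, 1]   # hypothetical input samples
slices = build_slices(h)
total_entries = sum(len(lut) for lut in slices)
```

For $N = 8$ the two slices hold 32 entries in total instead of 256, at the cost of one adder between the LUTs and the accumulator.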

C. Indexed LUT DA based FIR filter
The LUTless and sliced-LUT techniques restrict the exponential growth [22,23]; however, they increase the burden of the adder tree access time.
An attempt is therefore made to eliminate the adder tree by designing an indexed LUT based FIR filter technique. In the proposed indexed LUT (ILUT) DA structure, node L of fig. 2 is replaced by smaller, suitably indexed LUTs $L_i$ and a multiplexer M. The resulting structure, based on (6), is shown in fig. 7. The CPCT of this structure, contributed by the $L_i$-M-A nodes, is now:

$\mathrm{CPCT}_{\mathrm{Index}} = C_i + C_m + C_{as}$ (11)

where $C_i$ is the access time of an indexed LUT, $C_m$ is the access time of the multiplexer and $C_{as}$ is the access time of the accumulator/shifter. The access times $C_i$ and $C_m$ are interdependent. Trading off an exponentially varying LUT size against a linearly varying multiplexer size helps to choose the optimum CPCT of the structure and hence improves the overall operating frequency of the filter. The structure also eliminates the need for an adder tree, which further helps to improve the operating frequency.
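The CPCT expressions of the four structures can be placed side by side as simple functions. This is a sketch for comparison only: the conventional form $C_L + C_{as}$ is taken from the discussion in Section V, the sliced form is assumed to be $C_{SL} + C_a + C_{as}$ from the description of the slicing overhead, and the function and parameter names are hypothetical. Delays may be given in any consistent time unit.

```python
import math

def cpct_conventional(c_l, c_as):
    """Single large LUT followed by the accumulator/shifter."""
    return c_l + c_as

def cpct_lutless(c_mux, t_a, c_as, order):
    """Multiplexers, then log2(N) adder levels as in (8), then accumulator/shifter (9)."""
    return c_mux + math.log2(order) * t_a + c_as

def cpct_sliced(c_sl, c_a, c_as):
    """Smaller LUT plus adder tree (assumed form of the sliced CPCT)."""
    return c_sl + c_a + c_as

def cpct_indexed(c_i, c_m, c_as):
    """Indexed LUT page, page-select multiplexer, accumulator/shifter (11)."""
    return c_i + c_m + c_as
```

Only the LUTless and sliced forms carry an adder-tree term; the indexed form replaces it with the multiplexer delay $C_m$.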

IV. REALIZATION OF PROPOSED ARCHITECTURE
The proposed structure of the indexed LUT DA-based FIR filter is elaborated in the following sections. It is built from four major components: a bank of input registers, a look-up table unit, an accumulator/shifter unit and a control unit.

A. Input register bank
The register bank, shown in fig. 8, is built from $N$ serial-in parallel-out shift registers and accepts the input samples $X(n)$, $n = 0, 1, \ldots, N-1$. On every clock pulse, the register contents shift right by one position; over $B$ clock pulses the bank generates $B$ terms of length $N$.
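The way the register bank turns $N$ samples into $B$ address terms of length $N$ can be modelled behaviourally. This is a minimal sketch that assumes the samples are already loaded and that the least significant bit leaves each register first; the function name and values are hypothetical.

```python
def register_bank_terms(samples, width):
    """Yield one N-bit term per clock: the current LSB of every register."""
    regs = list(samples)
    for _ in range(width):
        # Register k contributes bit k of the term.
        yield sum((r & 1) << k for k, r in enumerate(regs))
        regs = [r >> 1 for r in regs]  # right shift on each clock pulse

terms = list(register_bank_terms([0b01, 0b11, 0b10], width=2))
```

Each yielded term is the address presented to the LUT unit on that clock.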

B. Proposed LUT unit
The indexed LUT DA-based FIR filter comprises indexed LUT pages, each of size $2^n$, and an $m$-bit multiplexer unit as the page selection module, which selects the desired value from the desired page. Structural details of the example considered in Section II, a 6th-order FIR filter with $n=4$ and $m=2$, are shown in fig. 9. Four LUT pages, each with 16 locations, are connected in parallel by a set of 4 address lines. A multiplexer unit of size 4:1 selects the appropriate output for the next stage.

C. Accumulator and Shifter Unit
The accumulator and shifter are two separate combinational units; jointly, they are responsible for calculating the dot product term of the filter output. Their hardware complexity is greatly influenced by the way the LUT is addressed, and a corresponding shift is applied in the accumulator/shifter unit to generate the partial products.

D. Control Unit
The control unit is a finite state machine, shown in fig. 10, that defines the sequence of operations and has overall control of the filtering operation. On application of reset, the filter remains in the idle state. Operation starts with the enable signal E and takes a number of iterations equal to the input precision, one per clock cycle. At the end of the count, the unit delivers the filter output and operation begins again with the next fetch cycle.
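The sequencing described above can be sketched as a small state machine. The state names (`IDLE`, `FETCH`, `COMPUTE`, `OUTPUT`) and the exact transitions are assumptions made for illustration, since the state diagram of fig. 10 is not reproduced here; only the reset, enable and count behaviour comes from the description.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FETCH = auto()
    COMPUTE = auto()
    OUTPUT = auto()

def next_state(state, reset, enable, count, width):
    """Return (next state, next bit count) for one clock cycle."""
    if reset:
        return State.IDLE, 0              # reset holds the filter idle
    if state is State.IDLE:
        return (State.FETCH, 0) if enable else (State.IDLE, 0)
    if state is State.FETCH:
        return State.COMPUTE, 0           # samples loaded, start iterating
    if state is State.COMPUTE:
        count += 1                        # one input bit processed per clock
        return (State.OUTPUT, count) if count == width else (State.COMPUTE, count)
    return State.FETCH, 0                 # OUTPUT: deliver result, fetch next samples

def run(width, cycles):
    """Trace the states visited with reset low and enable high."""
    state, count, trace = State.IDLE, 0, []
    for _ in range(cycles):
        state, count = next_state(state, False, True, count, width)
        trace.append(state)
    return trace

trace = run(width=8, cycles=11)
```

With an 8-bit input precision the machine spends eight cycles in `COMPUTE` before producing an output and returning to `FETCH`.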

V. PERFORMANCE ANALYSIS
Performance is evaluated based on operating frequency. The design is implemented on a Virtex IV FPGA for a particular filter order $N$ and for all possible combinations of $n$ and $m$, as shown in Table IV. Each node of the proposed structure is analyzed for its contribution to the CPCT, for filter orders from 4 to 8. Table IV gives the filter operating frequency as the access times of the LUT page, $C_i$, and the multiplexer unit, $C_m$, vary.
A graphical representation for the 8th-order FIR filter is shown in fig. 11. It indicates that the access time of the LUT page $C_i$ increases exponentially with $n$, while the access time of the multiplexer $C_m$ decreases linearly.
If $f_{max}$ is the maximum operating frequency and $T_{sample}$ is the minimum time required to process each output sample, then

$T_{sample} \geq \mathrm{CPCT} \geq C_i + C_m + C_{as}$ (12)

As $f_{max} = 1/T_{sample}$,

$f_{max} \leq 1/(C_i + C_m + C_{as})$ (13)

The CPCT minimum of the filter is obtained at the point of intersection of the LUT access time $C_i$ and the MUX access time $C_m$, which leads to the maximum operating frequency. The filter design corresponding to these values of $m$ and $n$ is therefore treated as the optimized design. This technique can be extended to any desired filter order; filter performance up to order 256 is shown in fig. 12.

Results obtained by the proposed technique are compared with the conventional DA, LUTless DA [22] and sliced LUT DA [23] techniques, which were originally implemented on an Altera Stratix FPGA chip. To surmount the platform differences, these techniques are faithfully re-implemented on the same platform as the proposed technique. The desired filter coefficients are obtained from FDATool, a MATLAB filter design tool, and are truncated and scaled to 8-bit precision. Xilinx Integrated Software Environment (ISE) is used for synthesis and implementation of the designs.
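The choice of $n$ and $m$ implied by (12) and (13) can be sketched as an exhaustive search over the partitions of $N$. The delay models below are purely hypothetical placeholders (an exponential LUT page delay and a linear multiplexer delay, in picoseconds) and are not the measured values of Table IV; in practice the synthesized access times would be substituted.

```python
def lut_page_delay_ps(n):
    """Hypothetical LUT page access time, growing exponentially with n."""
    return 100 * 2 ** n

def mux_delay_ps(m):
    """Hypothetical page-select multiplexer delay, growing linearly with m."""
    return 500 * m

ACC_SHIFT_DELAY_PS = 2000  # hypothetical accumulator/shifter delay

def best_partition(order):
    """Return (CPCT in ps, n, m) with the smallest CPCT over all n + m = order."""
    candidates = []
    for n in range(1, order):
        m = order - n
        cpct = lut_page_delay_ps(n) + mux_delay_ps(m) + ACC_SHIFT_DELAY_PS
        candidates.append((cpct, n, m))
    return min(candidates)

def f_max_mhz(cpct_ps):
    """Upper bound on operating frequency from (13), in MHz."""
    return 1e6 / cpct_ps

cpct, n, m = best_partition(8)
```

The search minimizes the sum $C_i + C_m + C_{as}$ directly, so it does not rely on the two delay curves crossing at an integer value of $n$.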
To validate correct functionality, each implementation is simulated with random input using the simulation tool provided by Xilinx.
A comparative study of the maximum operating speed of the conventional DA, LUTless DA, sliced DA and proposed DA-based filter techniques is presented in Table V, with a graphical representation in fig. 13. As Table V indicates, operating frequency decreases with the order of the filter. It is also observed that the operating frequency of the proposed technique is higher than that of the conventional DA and existing DA [22,23] techniques. Little gain in frequency is obtained at 4th order, where the techniques are closely correlated with the technology platform; however, the gain increases with the order of the filter.
Structural complexities of the Nth-order filter are analyzed, and performances are compared for random input samples $x(n)$. The word length of the input samples and filter coefficients is assumed to be $B$ bits, which makes the size of the input register bank the same for all designs under consideration. Latency and throughput are found to be the same in all DA-based structures; however, the operating speed of each technique makes the absolute values differ.
Implementation of an Nth-order conventional DA-based FIR filter requires a memory array of $2^N \times B$ bits and a decoder of size $N{:}2^N$. The CPCT of the structure, $(C_L + C_{as})$, increases exponentially due to the exponential rise in $C_L$; $C_{as}$, however, is independent of the filter order and is thus almost constant in all structures. The structural complexities of conventional DA-based FIR filters are taken as benchmarks for the performance comparison.
Slicing a single large memory reduces the memory requirement of the design from $2^N \times B$ for conventional DA to $(a \times 2^l) \times B$, where $a$ and $l$ are factors of $N$. The decoder accordingly changes from a single $N{:}2^N$ decoder to $a$ decoders of size $l{:}2^l$. As multiple terms are generated by this technique, at least $a-1$ adders are needed to generate the coefficient sum as a partial term. Replacing the single large LUT by smaller LUTs reduces the LUT access time from $C_L$ to $C_{SL}$; however, it adds the adder access time $C_a$, which tends to increase the CPCT of the structure.
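The memory and adder counts quoted above can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names; it assumes $N = a \times l$, with $a$ slices of $l$ inputs each.

```python
def conventional_memory_bits(order, width):
    """Single LUT: 2**N words of B bits."""
    return (1 << order) * width

def sliced_memory_bits(slices, inputs_per_slice, width):
    """a LUTs of 2**l words of B bits each."""
    return slices * (1 << inputs_per_slice) * width

def sliced_adders(slices):
    """At least a - 1 adders combine the partial terms."""
    return slices - 1

conventional = conventional_memory_bits(order=8, width=8)
sliced = sliced_memory_bits(slices=2, inputs_per_slice=4, width=8)
```

For an 8th-order filter with 8-bit words, slicing into two 4-input LUTs reduces the memory from 2048 bits to 256 bits and introduces one adder.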
The LUTless technique selects filter coefficients on-line by multiplexers, eliminating the need for memory and the corresponding decoder at the cost of $N-1$ adders. As the LUT is replaced by multiplexers and adders, $C_M$ and $C_a$ are the contributors to the CPCT, and both are highly dependent on the filter order.
In the proposed technique, indexing of the LUT pages reduces the LUT access time from $C_L$ to $C_i$ and eliminates $C_a$, a prime contributor to the CPCT of the LUTless and sliced LUT DA-based techniques. It adds the small burden of the LUT page selection module, $C_m$, to the CPCT of the structure. Nevertheless, the overall CPCT is reduced, leading to an increase in operating frequency. This rise in frequency is significant at higher filter orders, as indicated in Table V.

VI. CONCLUSION
For high-speed FIR filter implementation in distributed arithmetic, the exponential rise of memory access time with the number of filter coefficients has always been considered a fundamental drawback. The LUTless DA and sliced LUT DA-based techniques restrict the exponential growth but need adders to generate the partial term. In the LUTless technique, the number and depth of adders are governed by the order of the filter, whereas in the sliced LUT-based technique the number of slices defines the number of adders. Even for a particular filter order, the number of adders increases with the number of slices, tending to increase the CPCT of the structure. An innovative technique to reduce the CPCT of an FIR filter is designed and implemented successfully, leading to an increase in operating frequency. Indexing of the LUT restricts exponential growth and also completely eliminates the need for adders, which results in a significant reduction in CPCT and maximizes the operating frequency.