HIGH PERFORMANCE HARDWARE ARCHITECTURE FOR CONSTRAINED ONE-BIT TRANSFORM BASED MOTION ESTIMATION

Anıl Çelebi, Öğuzhan Urhan

Electronics & Telecom. Eng. Dept., University of Kocaeli
Kocaeli, Turkey

phone: + (90) 262 303 33 54, fax: + (90) 262 303 33 49, email: anilcelebi@kocaeli.edu.tr
phone: + (90) 262 303 33 52, fax: + (90) 262 303 33 49, email: urhan0@kocaeli.edu.tr
web: http://ehm.kocaeli.edu.tr

ABSTRACT
Motion estimation (ME) is considered the most computationally intensive part of conventional video compression standards. Low bit-depth representation based ME approaches provide an important alternative for reducing this computational load by using lightweight and hardware-efficient matching criteria. Constrained one-bit transform (C-1BT) based ME employs only two bit-planes and stands out for its superior performance compared to other low bit-depth based ME approaches. Recently, an adaptive search range determination algorithm was proposed to further speed up C-1BT based ME. This paper presents a novel hardware architecture for this adaptive search range based ME method. The proposed architecture implements the spiral search method. No on-chip memory is needed for either the reference or the current macroblocks. Thus, there is no need to design a complex memory hierarchy and control logic to implement the spiral search method. A data reuse scheme among adjacent search windows is realised thanks to a two-axis rotatable two-dimensional shift register architecture. Thus, a very low off-chip memory bandwidth can be achieved.

1. INTRODUCTION
Many devices such as camcorders and mobile phones have video capturing capabilities. Because it is not feasible to store captured image frames in raw form, efficient compression methods and standards are used to reduce the amount of data to be stored and/or transmitted. Temporal redundancy between image frames is exploited using motion estimation (ME) approaches. However, block based ME consumes up to 80% of the total computational load in a typical video encoder [1]. Thus, many researchers focus on reducing the computational complexity of ME without significantly degrading ME accuracy.

In the block based ME (BME) approach, non-overlapping image blocks are searched in the reference frame within a predefined search range. When all possible search centres are used in this matching process, it is called full search based ME (FSBME). There are many approaches in the literature for reducing the computational complexity of FSBME, and they can be divided into several categories. The first group employs a reduced number of search centres. Three step search (3SS) based ME, presented in [2], makes use of a total of 27 search centres. Thus, the total computational load of FSBME is significantly reduced at the cost of limited ME accuracy. The approach in [3] uses a hexagonal search pattern to check only a limited number of search centres, at a reduced computational cost compared to 3SS based ME. Another group of methods takes only a certain number of pixels into account when computing the matching criterion, using a sub-sampling pattern. Thus, the number of computations needed to obtain a motion vector is reduced. For example, an N-Queen lattice structure is used in [4] to reduce the number of pixels used in the matching process. Early termination based methods aim to finalise the search process at a very early step to reduce the computational load. In [5], search centre prediction in combination with an early termination approach is presented. This method claims up to a 160-fold speed-up compared to FSBME for sequences containing very low motion activity. Another group adaptively changes the predefined search range for each block, as in [6].

The last group proposes the use of lightweight and hardware-efficient matching criteria [7-12]. In [7], video frames are first converted to binary images using multi band-pass filtered video frames as an adaptive threshold. Next, a Boolean EX-OR operation based matching criterion is used. This method is called one-bit transform (1BT) based ME. In [8], an additional bit-plane is derived using local image features. Then, the two bit-planes of each frame are employed together to compute the matching criterion. This method is known as two-bit transform (2BT) based ME and has higher accuracy than 1BT based ME. However, the computational load of 2BT is higher than that of 1BT. In [9], the multiplication operations at the multi band-pass filtering stage are omitted. This so-called multiplication-free one-bit transform (MF-1BT) based approach provides ME accuracy similar to 1BT at a lower transform cost. In [10], a constraint mask (CM) is created to discriminate reliable pixels in 1BT based ME. The CM and 1BT bit planes are employed together to compute the matching criterion. The additional cost of this method (C-1BT) is quite low compared to 2BT. Furthermore, it provides better performance than the 1BT and 2BT based ME methods. In [11], the first hardware architecture for the C-1BT method in the literature is presented. Recently, truncated versions of Gray-coded bit-planes were employed in the matching process in [12]. This approach makes use of three bit-planes and provides better performance than previously proposed low bit-depth methods. Several novel architectures that clearly show the effectiveness of low bit-depth pixel representation based ME methods are presented in [12-16].

Recently, there have been attempts to further reduce the computational load of low bit-depth based ME approaches. In [17], diamond search and 1BT based ME are combined to further speed up 1BT based ME. However, motion vector accuracy is considerably degraded in this case. In [18], a predictive hexagonal search approach and a partial distortion search method are combined with C-1BT based ME, and it is shown that a significant reduction in computational load is possible with a small performance loss. In [19, 20], special early termination approaches are used for 1BT and 2BT based ME. The combination of an adaptive search range with low bit-depth methods was recently investigated in [21]. This method first determines the search range for each block using a simple computation, and then motion estimation is performed. Experiments have shown that this approach can provide up to a 90% reduction in computational load. Note that although the method presented in [18] provides a significant computational gain, its irregular data access may hinder efficient hardware implementation. The method in [21], however, has regular data access, which enables efficient hardware implementation.

In this paper, we present an efficient hardware architecture for the method proposed in [21]. The proposed architecture does not need dedicated on-chip memory, since a D-type flip-flop (DFF) based two-axis rotatable two-dimensional register array is used to implement both a shifting engine and a memory module at the same time. Thus, the control logic of the whole architecture is reduced, and complex data read schemes are eliminated thanks to sequential row-column read operations.

2. C-1BT BASED ME USING ADAPTIVE SEARCH RANGE

Constrained one-bit transform (C-1BT) based ME makes use of two bit planes. The first bit plane is created by filtering the video frames with the diamond-shaped kernel proposed in [9] and then comparing the original video frames against the filtered frames. This operation is formulated as follows:

\[ B(i,j) = \begin{cases} 1, & \text{if } I(i,j) \geq I_{\gamma}(i,j) \\ 0, & \text{otherwise} \end{cases} \]

(1)

where \( I \) and \( I_{\gamma} \) denote the original and filtered image frames, respectively. The second bit plane, which corresponds to the constraint mask (CM), is evaluated by a subtraction between the respective pixels of the original and filtered video frames. Pixels with a larger difference are considered reliable. The computation of the CM is given in (2).

\[ CM(i,j) = \begin{cases} 1, & \text{if } |I(i,j) - I_{\gamma}(i,j)| \geq D \\ 0, & \text{otherwise} \end{cases} \]

(2)
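The two thresholding rules in (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `c1bt_planes` is hypothetical, the filtered frame is assumed to be supplied by the caller (the kernel of [9] is not reproduced here), and the threshold default follows the value of 10 used in [10].

```python
def c1bt_planes(frame, filtered, d=10):
    """Return the (B, CM) bit planes of one frame.

    frame, filtered: equally sized 2D lists of integer pixel values.
    d: reliability threshold D (10 in [10]).
    """
    b_plane, cm_plane = [], []
    for row, frow in zip(frame, filtered):
        # Equation (1): 1 where the pixel is at or above its filtered value.
        b_plane.append([1 if p >= f else 0 for p, f in zip(row, frow)])
        # Equation (2): 1 where the pixel differs enough to be called reliable.
        cm_plane.append([1 if abs(p - f) >= d else 0 for p, f in zip(row, frow)])
    return b_plane, cm_plane


# Tiny example with a made-up (hypothetical) filter output.
frame = [[120, 80], [95, 101]]
filtered = [[100, 100], [100, 100]]
b, cm = c1bt_planes(frame, filtered)
```

In this example only the two pixels that differ from the filtered value by 20 are marked reliable in the constraint mask.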

\( D \) is fixed to 10 in [10]. The matching criterion used in C-1BT based ME is called the constrained number of non-matching points (CNNMP) and is given as follows:

\[ \text{CNNMP}(m_x,m_y) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left\{ \left[ CM^{t}(i,j) \,\|\, CM^{t-1}(i+m_x,j+m_y) \right] \bullet \left[ B^{t}(i,j) \oplus B^{t-1}(i+m_x,j+m_y) \right] \right\} \]

(3)

where \( \| \), \( \bullet \) and \( \oplus \) denote the binary OR, AND and EX-OR operations, respectively, and the superscripts \( t \) and \( t-1 \) denote the current and reference frames. The location \( (m_x,m_y) \) giving the lowest matching error is taken as the motion vector of the current block. In [21], the one-bit images of C-1BT based ME are employed to decide the search range for each block. This approach is based on a block activity measurement. If a block has low motion activity, a small search range will probably provide sufficient motion vector accuracy. On the other hand, if the motion activity is high, a larger search area should be employed. Based on this discussion, the search range (SR) for each block in C-1BT is decided as follows in [21]:

\[ SR = \frac{\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left[ B^{t}(i,j) \oplus B^{t-1}(i,j) \right]}{\alpha} + \beta \]

(4)

where \( B^{t} \) and \( B^{t-1} \) are the one-bit planes of the current and reference frames, and \( \alpha \) and \( \beta \) are fixed to 12 and 2, respectively. In this paper, a modified version of (4) is proposed and used to simplify the computation:

\[ SR_{\text{mod}} = \frac{\gamma \times \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left[ B^{t}(i,j) \oplus B^{t-1}(i,j) \right]}{\alpha'} + \beta' \]

(5)

where \( \gamma \), \( \alpha' \) and \( \beta' \) are fixed to 3, 32 and 1, respectively. Thus, only a 1-bit left shift and a 10-bit addition (multiplication by 3) followed by a 5-bit right shift (division by 32) are needed; neither an arithmetic multiplier nor a divider is required. Note that \( \beta' \) is not needed in the hardware implementation, since the default search range is set to \([-1, 1]\) in the proposed architecture. All of these operations can be performed using only integer arithmetic. The maximum search range is limited to \([-16, 16]\) by a simple magnitude comparison in hardware.
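The difference between (4) and (5) can be made concrete with a short Python sketch. The function names are hypothetical, `nnmp` stands for the double sum in both equations (the number of non-matching one-bit pixels at zero displacement, between 0 and 256 for a 16×16 block), and the use of floor division in (4) is an assumption of this sketch rather than something stated in [21].

```python
def search_range_ref(nnmp, alpha=12, beta=2):
    """Equation (4); floor division is an assumption of this sketch."""
    return nnmp // alpha + beta


def search_range_mod(nnmp, max_sr=16):
    """Equation (5) with gamma = 3, alpha' = 32 and beta' = 1."""
    times3 = (nnmp << 1) + nnmp   # 1-bit left shift plus one addition = 3 * nnmp
    sr = (times3 >> 5) + 1        # 5-bit right shift = division by 32, then beta'
    return min(sr, max_sr)        # magnitude comparison caps the range at 16


# Compare both rules for a few activity levels.
for nnmp in (0, 64, 120, 256):
    print(nnmp, search_range_ref(nnmp), search_range_mod(nnmp))
```

For mid-range activity the two rules give the same or nearby values, while the modified rule needs only shifts, one addition and one comparison.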

According to the experimental results given in Table 1, the modification of (4) causes only a slight difference in ME accuracy (a 0.01 dB drop on average). On the other hand, the proposed modified search range determination approach provides a 1% computational gain compared to [21]. Note that the average search range (Av. of SR) is also provided in this table.
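To tie the pieces of this section together, the following Python sketch evaluates the CNNMP criterion of (3) over every displacement inside a given search range and returns the best one. It is a software illustration only, not the hardware data path: the function names are hypothetical, the planes are assumed to be 2D lists indexed as `[row][column]` with \( i \) and \( m_x \) taken as horizontal, and the caller is assumed to keep the search window inside the reference frame.

```python
def cnnmp(b_cur, cm_cur, b_ref, cm_ref, x0, y0, mx, my, n=16):
    """Constrained number of non-matching points for displacement (mx, my).

    b_cur, cm_cur: n x n planes of the current block.
    b_ref, cm_ref: planes of the reference frame; (x0, y0) is the block origin.
    """
    total = 0
    for j in range(n):            # vertical index inside the block
        for i in range(n):        # horizontal index inside the block
            r, c = y0 + j + my, x0 + i + mx
            reliable = cm_cur[j][i] | cm_ref[r][c]   # binary OR of the masks
            mismatch = b_cur[j][i] ^ b_ref[r][c]     # binary EX-OR of the bits
            total += reliable & mismatch             # binary AND, then accumulate
    return total


def best_vector(b_cur, cm_cur, b_ref, cm_ref, x0, y0, sr, n=16):
    """Exhaustively test all displacements in [-sr, sr] and keep the lowest CNNMP."""
    best = None
    for my in range(-sr, sr + 1):
        for mx in range(-sr, sr + 1):
            cost = cnnmp(b_cur, cm_cur, b_ref, cm_ref, x0, y0, mx, my, n)
            if best is None or cost < best[0]:
                best = (cost, mx, my)
    return best[1], best[2]


# Toy example: a 4x4 block whose single set bit sits two pixels to the right
# in the reference frame; all pixels are marked reliable.
N, SIZE = 4, 12
b_ref = [[0] * SIZE for _ in range(SIZE)]
b_ref[5][7] = 1
cm_ref = [[1] * SIZE for _ in range(SIZE)]
b_cur = [[0] * N for _ in range(N)]
b_cur[1][1] = 1
cm_cur = [[1] * N for _ in range(N)]
mv = best_vector(b_cur, cm_cur, b_ref, cm_ref, x0=4, y0=4, sr=2, n=N)
```

In this toy case the only displacement with zero cost is two pixels to the right, so `mv` is `(2, 0)`.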

3. PROPOSED HARDWARE ARCHITECTURE FOR C-1BT WITH ADAPTIVE SEARCH RANGE

In this paper, a novel hardware architecture for C-1BT based ME with the adaptive search range method is proposed. Additionally, a spiral search scheme, which is not mentioned in [21], is used. The block diagram of the data path of the hardware architecture is shown in Figure 1.

Figure 1 – a) Proposed Hardware Architecture, b) Basic datapath for parallel counter, c) Processing element (PE)

Table 1. ME Performance Comparison.

<table>
<thead>
<tr>
<th>Sequence</th>
<th>The method presented in [21]</th>
<th>Proposed Method</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Gain (%)</td>
<td>Av. of SR</td>
</tr>
<tr>
<td>Football</td>
<td>78.91</td>
<td>7.35</td>
</tr>
<tr>
<td>Foreman</td>
<td>79.26</td>
<td>7.29</td>
</tr>
<tr>
<td>Tennis</td>
<td>78.75</td>
<td>7.38</td>
</tr>
<tr>
<td>Garden</td>
<td>69.51</td>
<td>8.83</td>
</tr>
<tr>
<td>Mobile</td>
<td>90.03</td>
<td>5.05</td>
</tr>
<tr>
<td>Coastguard</td>
<td>80.80</td>
<td>7.01</td>
</tr>
<tr>
<td>Average</td>
<td>79.54</td>
<td>7.15</td>
</tr>
</tbody>
</table>

In Figure 1a, there are two 16×16-bit shift registers for the current block and two 48×48-bit shift registers for the search window; in each pair, the second shift register holds the constraint mask. The current block shift registers are capable of shifting to the right or down, according to the direction of movement, when loading the next macroblock. The shift register array for the search window, on the other hand, can move the data in four directions (right, left, up and down) with a rotate capability that enables the proposed spiral search mechanism. The architecture of the search window shift register is shown in Figure 2.

The spiral search scheme is implemented thanks to the search window shift registers. Each of these modules contains 48×48 DFFs but has only 16×16 outputs, which are located at the centre of the register array. To move to the next search location on the right, the register array is configured to shift the data to the left, which generates the relative move to the right. Note also that the bits leaving one boundary of the array are fed back to the inputs of the registers at the opposite end. At the end of the search process, none of the pixel information in the PE array is lost, which allows the data to be reused for the next macroblock whose motion vector is to be computed. Figure 3 illustrates the spiral search scheme for a 2×2 block and a search range of [-2, 2]. The numbers in the squares represent the next position of the dark grey pixel located at the bottom right corner of the block for each search step.
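The relative-move idea can be checked with a small software model. The Python sketch below rotates a 2D array one step at a time in the direction opposite to each search move and confirms that the fixed centre outputs always show the expected candidate block. All names are hypothetical, and the visiting order (right, down, left, up with growing run lengths) is one common spiral chosen for illustration; it is not necessarily the order drawn in Figure 3.

```python
def rotate(grid, dx, dy):
    """Move the data dx columns right and dy rows down, wrapping at the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r - dy) % h][(c - dx) % w] for c in range(w)] for r in range(h)]


def centre(grid, n):
    """Read the fixed n x n outputs located at the centre of the array."""
    off = (len(grid) - n) // 2
    return [row[off:off + n] for row in grid[off:off + n]]


def spiral_moves(sr):
    """Unit moves that visit every offset in [-sr, sr] x [-sr, sr] exactly once."""
    dirs = [(1, 0), (0, 1), (-1, 0), (0, -1)]    # right, down, left, up
    moves, run, d = [], 1, 0
    while run <= 2 * sr:
        for _ in range(2):                        # two runs for each run length
            moves += [dirs[d % 4]] * run
            d += 1
        run += 1
    moves += [dirs[d % 4]] * (2 * sr)             # closing run of the outer ring
    return moves


def check_spiral(n, sr):
    """Walk the spiral with rotations only; return the number of visited offsets."""
    w = n + 2 * sr                                # 48 for n = 16 and sr = 16
    original = [[r * w + c for c in range(w)] for r in range(w)]
    grid, px, py = original, 0, 0
    visited = {(0, 0)}
    for dx, dy in spiral_moves(sr):
        grid = rotate(grid, -dx, -dy)             # data shifts opposite to the move
        px, py = px + dx, py + dy
        visited.add((px, py))
        expected = [row[sr + px:sr + px + n]
                    for row in original[sr + py:sr + py + n]]
        assert centre(grid, n) == expected        # centre outputs track the candidate
    return len(visited)


small = check_spiral(2, 2)     # the 2x2 block, [-2, 2] case used for Figure 3
full = check_spiral(16, 16)    # the 48x48 array described above
```

Because every step is a single rotate shift and nothing leaves the array, the model visits 25 offsets for the small case and 1089 for the full-size case without reloading any data.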

The 2D PE array and a single PE are shown in Figures 1a and 1c, respectively. The PE architecture contains an additional 2×1 multiplexer to select between search range computation and CNNMP computation. 16×16 PEs are used to implement equation (5).

The parallel counter block, which counts the non-matching pixels of a candidate search position, and the sub parallel counter block, which counts the non-matching points for 7 pixels, are shown in Figures 1a and 1b, respectively. The parallel counter consists of seven stages of sub parallel counters of size 3×2, 7×3, 15×4, 31×5, 63×6, 127×7 and 255×8, respectively. There are 256 pixels in each macroblock, but the parallel counter has 255 inputs. Experimental results have shown that the absence of one pixel in the CNNMP computation does not affect the ME performance; therefore, the addition stage for that single pixel is omitted to reduce complexity.
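The stage sizes listed above follow the pattern of counters with \( 2^k - 1 \) inputs and \( k \) output bits. The Python sketch below models one plausible way such a counter can be nested: each counter is built from two counters of the next smaller size plus one leftover input bit. This decomposition and the function name `parallel_count` are assumptions made for illustration; the exact wiring is defined by Figure 1b, which is not reproduced here.

```python
def parallel_count(bits):
    """Count the ones among 2**k - 1 input bits with nested sub-counters."""
    n = len(bits)
    assert n >= 1 and (n + 1) & n == 0, "input width must be 2**k - 1"
    if n == 1:
        return bits[0]
    half = (n - 1) // 2
    # Two sub-counters of the next smaller size plus one leftover input bit.
    return (parallel_count(bits[:half])
            + parallel_count(bits[half:2 * half])
            + bits[-1])


# Input width and output width of each stage, from 3x2 up to 255x8.
stages = [(2 ** k - 1, k) for k in range(2, 9)]

# A 16x16 block gives 256 mismatch bits; only 255 of them are counted.
mismatch_bits = [1 if (i * 7) % 3 == 0 else 0 for i in range(256)]
count = parallel_count(mismatch_bits[:255])
```

Dropping the 256th bit keeps the result within 8 bits, since 255 is the largest value an 8-bit output can hold.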

Figure 2 – Architecture of search window shift register.

The combinational path delay of the parallel counter block was measured as about 11.57 ns on a 45 nm technology FPGA device. This delay limits the maximum achievable clock frequency to approximately 90 MHz, which is why 2 pipeline stages are added: one between the sub parallel counter stages 15×4 and 31×5, and one between stages 63×6 and 127×7. The output of the parallel counter is also registered to separate its combinational path from the next stage, where the comparison operation is performed.

In [7], an 8-input LUT based architecture is proposed to count the number of non-matching pixels. Since then, several architectures have been proposed to overcome the bottleneck caused by the logarithmic relation between input width and LUT depth. In [14], a 4-input LUT based non-match counter architecture is proposed to reduce the hardware area. Recently, in [16], a parallel counter method was proposed to count the number of non-matching pixels, and to the best of our knowledge it is the most hardware-efficient method reported so far.

4. IMPLEMENTATION RESULTS

The proposed hardware architecture is implemented on a state-of-the-art FPGA fabricated in a 45 nm process. According to the synthesis results, the proposed hardware architecture occupies 5691 LUTs and 5309 DFFs, which correspond to 6% and 2% of the total available resources of the FPGA device used, respectively. According to the post place and route timing analysis results, the proposed hardware architecture can operate at a clock frequency of up to 197 MHz.

The number of clock cycles that the hardware architecture needs to compute the best candidate position of a macroblock is not fixed. In the worst case, where the search range is [-16,16], the total number of candidate positions is 1089. In addition, the architecture needs 4 clock cycles to compute the search range of a macroblock because of the pipelined structure of the parallel counter shown in Figures 1a and 1b. A further 5 clock cycles are needed at the end of the computation for control purposes. At the end of the search process for the range [-16,16], an additional 16 clock cycles are needed to return the search window to its initial state, because the next search window is going to be concatenated to the current search window. Figure 3 shows that, thanks to the spiral search scheme used, one period of the search around the square ends exactly one step below the previous location, where the first starting point is position (0,0). Thus, rotate shift operations in the upward direction are enough to return the search window to its initial state. Consequently, 1114 clock cycles are needed in the worst case. However, the total workload with the dynamic search range is not the same for each macroblock. According to the results given in Table 1, the computational load of C-1BT based ME with adaptive search range can be up to 90% lower than that of full search, which means that the total number of clock cycles needed for a macroblock varies between approximately 120 and 1114 for the proposed hardware architecture. A performance comparison of several hardware architectures proposed in the literature is given in Table 2.
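The cycle budget described above can be written as a small Python function. The function name is hypothetical, and taking the recovery cost as one cycle per unit of search range is an assumption generalised from the 16 cycles quoted for the [-16,16] case.

```python
def cycles_per_macroblock(sr, sr_cycles=4, control_cycles=5):
    """Clock cycles needed for one macroblock at search range [-sr, sr]."""
    candidates = (2 * sr + 1) ** 2   # one candidate position per clock cycle
    recovery = sr                    # assumed: one rotate shift per unit of range
    return candidates + sr_cycles + control_cycles + recovery


worst_case = cycles_per_macroblock(16)   # 1089 + 4 + 5 + 16
typical = cycles_per_macroblock(7)       # range close to the Table 1 average
```

The worst case reproduces the 1114 cycles stated in the text, and a search range of 7 gives 241 cycles.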

5. CONCLUSIONS

A novel hardware architecture for one of the most recent low-complexity ME methods in the literature is proposed. Thanks to the adopted ME method, the architecture has the potential to process very high resolution video sequences at very high frame rates. According to Table 1, the average search range for a video sequence is around 7. Since 9 additional control cycles are needed for each macroblock, an average of 7+9+(2*7+1)^2 = 241 clock cycles is needed per macroblock. At the 197 MHz clock frequency obtained from post place and route timing analysis, a throughput of about 817k macroblocks/second can be achieved. Thus, a performance of approximately 101 frames/second at 1080p resolution is achieved, which is superior to the architecture proposed for single reference frame operation in the recently published [15].
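The throughput estimate can be reproduced with a few lines of Python arithmetic. The variable names are hypothetical, and counting a 1080p frame as 1920×1080 pixels divided into 16×16 macroblocks (8100 per frame) is an assumption of this sketch.

```python
CLOCK_HZ = 197_000_000                     # post place and route clock frequency
AVG_SR = 7                                 # average search range from Table 1

cycles = AVG_SR + 9 + (2 * AVG_SR + 1) ** 2        # 241 cycles per macroblock
mb_per_second = CLOCK_HZ / cycles                  # roughly 817 thousand
mb_per_frame_1080p = (1920 * 1080) // (16 * 16)    # 8100 under the assumption above
frames_per_second = mb_per_second / mb_per_frame_1080p

print(f"{mb_per_second:.0f} macroblocks/s, {frames_per_second:.1f} frames/s")
```

Dividing 197 MHz by 241 cycles gives roughly 817 thousand macroblocks per second, which is consistent with the figure of approximately 101 frames/second at 1080p.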

<table>
<thead>
<tr>
<th>Table 2. ME Performance Comparison.</th>
</tr>
</thead>
<tbody>
<tr>
<td>On Chip Memory</td>
</tr>
<tr>
<td>Area</td>
</tr>
<tr>
<td>Maximum Frequency</td>
</tr>
<tr>
<td>Technology</td>
</tr>
<tr>
<td>Bit Depth</td>
</tr>
<tr>
<td>Search Range</td>
</tr>
<tr>
<td>Search Method</td>
</tr>
<tr>
<td>1080p Performance</td>
</tr>
<tr>
<td>720p Performance</td>
</tr>
<tr>
<td># of Reference Frames</td>
</tr>
</tbody>
</table>

6. ACKNOWLEDGMENT

The hardware and software platforms used in this work were donated by Xilinx within the scope of the Xilinx University Program. This work was supported by the Turkish State Planning Organization project DPT 2011K120330.

REFERENCES