# An improved fast mode decision algorithm for VLSI architecture implementation 

J. CHARLES RAJESH KUMAR ${ }^{\text {a, },}$, T. VANCHINATHAN ${ }^{\mathrm{b}}$ P. SUDHARSAN ${ }^{\mathrm{c}}$<br>${ }^{a}$ Lecturer, Department of Electrical \& Computer Engineering, Effat University, Kingdom of Saudi Arabia<br>${ }^{b}$ Director - AXIIP Semiconductor $(P)$ Ltd.<br>${ }^{c}$ Director - AXIIP Semiconductor ( $P$ ) Ltd.


#### Abstract

For better video quality in the H.264/AVC video coding technology, motion estimation has massive growth due to improvements in searching algorithms and improved significantly in compression efficiency and complexity, specifically in area, power and throughput. In this paper, an efficient sum of absolute difference (SAD) tree and its hardware architecture have proposed in Residue Number System (RNS) based moduli and implements the full search variable block size motion estimation (FSVBSME). The main advantage is that for performing carry free addition operation, residue number system is being considered as a non weighted number system to binary number system, RNS is mostly suitable for image compression techniques and loss of image quality is very less. In hardware implementation, it occupies less area and takes less execution time for output result. This proposed architecture is capable of achieving the less hardware cost and logical elements, high throughput required to perform real time motion estimation. Experimental results show that synthesized with TSMC 180nm CMOS, the proposed design occupies 12.9 k logic gates at 352 MHZ and consumes 19 mW power to encode 1920X1088 HDTV video frames at 30 frames per second.


(Received April 9, 2015; accepted May 7, 2015)
Keywords: SAD RNS adder, SAD comparator, Mode decision

## 1. Introduction

Motion Estimation (ME) is a vital part of most motion-compensated video coding standards [1]. It is a process for estimating motion vectors (MV) that transform from reference frame to the current frame in a video sequence coding. FSVBSME is a temporal redundancy elimination technique between two or more consecutive frames for video compression. H.264/AVC is the standard video coding developed by the ITU-T. ME [2-3] is mostly based on a block-matching [4-8] technique is playing a major role in H.264/AVC by using the temporal redundancy between consecutive successive frames. In H.264, a video frame split by using macro blocks (MB) of 16x16 size in a FSVBSME approach. So, FSVBSME architecture for the H.264/AVC have been proposed [910]. In arithmetic systems, the speed is limited by making the logic decisions and the extent to which the low order numeric significance decisions can affect higher significance results. This issue is described by the addition operation, by which a low-order carry can have a rippling effect on a sum. RNS have been applied to achieve highspeed and low-power VLSI implementations, typically utilized in signal and image processing. To convert representation of the numbers from the residues to a positional, The conventional magnitude comparison systems in RNS [11] utilize the Chinese Remainder Theorem (CRT) and the Mixed Radix Conversion (MRC). However, both these methods are inefficient, the main reason is that the CRT requires modulo M (number system range) operations. MRC is a slow sequential technique.

Recently, a New Chinese Reminder Theorem [12] was proposed to analyze the magnitude of the number in RNS. In this paper, the proposed algorithm takes advantage of the characteristic of the conjugate moduli set $\left(2^{n}-2^{k}-\right.$ 1) offers better performance of delay, area and speed. The new modulo adder could be isolated into four units, such as, 1. Preprocessing unit, 2. Prefix computation unit, 3. Carries correction unit, and 4 . Sum computation unit. In the proposed scheme, To obtain the final carries required in the sum computation module, the carry information of $\mathrm{A}+\mathrm{B}+\mathrm{T}$ could be calculated by prefix computation unit. So that the proposed modulo $\left(2^{n}-2^{k}-1\right)$ adder can get the best delay performance. The proposed algorithm has two main reasons. It is the best algorithm leading to VLSI architecture with the real time applications for better performance in weighted number systems. And it is described by implementing an efficient RNS for computing the minimum Sum of Absolute Differences (SAD), with more time consuming video motion estimation application.

The paper is classified as follows. Section II discussed about SAD adder tree. Section III discussed about Residue Number System and its modular addition procedure and RNS modulo ( $2^{n}-2^{k}-1$ ) adder. Section IV describes the motion estimation using RNS. Finally, section V concludes this paper.

## 2. SAD adder tree

In the 16 SAD architecture, each one is in charge of the $\operatorname{SAD}$ computation of one primitive $4 \times 4$ sub-block in
parallel. There are 16 absolute differences computed and then the 16 absolute values are fed into the adder unit to complete a $4 x 4$ SAD. Adding the 16 absolute differences to obtain one $4 \times 4 \mathrm{SAD}$ is implemented by employing multilevel 3-2 compressors, as shown in Fig. 1 (a). Fig. 1 (b) shows the structure of the 3-2 compressor, where the k binary inputs of $a, b$ and $c$ bit values $a_{0}-a_{k}, b 0-b_{k}$, and $c_{0}-c_{k}$ respectively. And the depth of input value is 8 bit, so k equals to 8 . The output $\operatorname{sum}_{0}$-sum ${ }_{k}$ stand for each of the summer bit of the input three binary bits, and carry ${ }_{0}$-carry $_{\mathrm{k}}$ stands for each of the carry bit.


Fig. 1. (a) Structure of 3-2 compressor.


Fig. 1. (b) Structure of adder unit.

One $16 \times 16 \mathrm{MB}$ is partitioned into $164 \times 4$ sub-block, denoted as $\mathrm{C}_{0}-\mathrm{C}_{15}$,as shown in Fig. 2 (a). During the processing procedure, eight $8 \times 4$ SADs and $4 \times 8$ SADs can be first obtained simultaneously, and then four $8 \times 8$ SADs can be produced at the same time, then two $16 \times 8$ SADs and $8 \times 16$ SADs be synchronously achieved subsequently, and finally the $16 \times 16$ SAD can be obtained, shown in Fig. 2 (b). All of the 41 SADs should be stored in the registers for the reuse of the following unit. The generation of the whole 41 SADs of one MB can be implemented within 6 cycles.

| C0 | C1 | C 2 | C3 |
| :--- | :--- | :--- | :--- |
| C4 | C5 | C6 | C 7 |
| C8 | C 9 | C10 | C11 |
| C12 | C13 | C14 | C15 |

Fig. 2. (a) Block Pattern of H. 264.


Fig. 2. (b) Structure of Adder Array.

Mostly, the processing systems concerned about the speed of arithmetic. In arithmetic systems, the speed is limited by the way of building block that makes logic decisions and the extent to which decisions of least numeric significance can affect results of most significance. This problem is best designed by the addition operation, in which a lower-order carry can have a rippling effect on a sum. The adder tree array is a carry effective. In the adder tree array, the addition operation effect the carries on digits of most significance. For every addition operation, it facilitates the realization of less speed and more power. The most significance sum value waits for the carry values for execution result. Hence, the adder tree consumes more power and less speed.

## 3. Residue number system and its modular addition procedure

The advantage of the RNS adder tree is that the absence of carry propagation between its arithmetic units, and for every addition operation, it is no need wait for carry values. Hence, It facilitates the realization of more speed, less power arithmetic. The Residue Number System is defined as co-prime modular radix groups $\left\{m_{1}, m_{2}, \ldots\right.$, $\left.m_{N}\right\}$, where N is greater than $1, \operatorname{GCD}\left(m_{i}, m_{j}\right)=1, i \neq$ $j, i, j=1,2, \ldots, N$, and $\operatorname{GCD}\left(m_{i}, m_{j}\right)$ is the greatest common divisor of $m_{i}$ and $m_{j}$. By residues respect to the modulus $m_{i}$ of the integer X in $[0, \mathrm{M})$ can be represented as $\left(x_{1}, x_{2}, \ldots, x_{N}\right)$, where $x_{i}=\langle X\rangle_{m_{i}}, M=\prod_{i=1}^{N} m_{i}, i=$ $1,2, \ldots, N$. In the range of $[0, \mathrm{M})$, the integers $\mathrm{A}, \mathrm{B}$, and C can be represented RNS numbers as $\left(a_{1}, a_{2}, \ldots\right.$, $\left.a_{N}\right),\left(b_{1}, b_{2}, \ldots, b_{N}\right)$ and $\left(c_{1}, c_{2}, \ldots, c_{N}\right)$ respectively. According to Guassian modular algorithms, $C_{i}=$ $\left(a_{i} \Delta b_{i}\right)_{m_{i}}$, where $\Delta$ represents addition operation, subtraction operation and multiplication operation.

For the range of $[0, \mathrm{M})$ integers A and B , modulo $m$ addition is defined as

$$
C=\langle A+B\rangle_{m}= \begin{cases}A+B & A+B<m  \tag{1}\\ A+B-m & A+B \geq m\end{cases}
$$

If $C=\langle A+B\rangle_{m}$ and the modular adder bit width is n -bit, where $n=\left\lceil\log _{2} m\right\rceil$. So that,

$$
C= \begin{cases}A+B & A+B+T<2^{n}  \tag{2}\\ \langle A+B+T\rangle_{2^{n}} & A+B+T \geq 2^{n}\end{cases}
$$

Here the correction $T=2^{n}-m$. The basic rule in most modular adder design is that if $\mathrm{A}+\mathrm{B}+\mathrm{T}$ carry bit is 1 , the result of modular addition is n LSBs of $\mathrm{A}+\mathrm{B}+\mathrm{T}$, otherwise, the result is $\mathrm{A}+\mathrm{B}$. Then, Parallel prefix addition operation is extensively accepted in binary adder design. Present sum and carry bits can be calculated with the previous carries and inputs. The prefix computation is calculated as

$$
\begin{equation*}
\left(g_{i}, p_{i}\right)=\left(a_{i} b_{i}, a_{i} \oplus b_{i}\right) \tag{3}
\end{equation*}
$$

Where carry generation bit $\mathrm{g}_{\mathrm{i}}$ and carry propagation bit $\mathrm{p}_{\mathrm{i}}$. In the sum computational unit, the carries $c_{i}$ from the prefix computation unit and the partial sum of the preprocessing unit are utilized to calculate the final sum bits $\mathrm{s}_{\mathrm{i}}$,

$$
\begin{equation*}
s_{i}=p_{i} \oplus c_{i} i=0,1, \ldots, n-1 \tag{4}
\end{equation*}
$$

## RNS MODULO $2^{n}-\mathbf{2}^{\boldsymbol{k}}-1$ ADDER

The modulo ( $2^{n}-2^{k}-1$ )adder is divided into four different modules, is shown in Fig. 3.


Fig. 3. The proposed modulo $2^{n}-2^{k}-1$ adder structure.

1. Pre-processing Unit: The pre-processing unit is used to generate the carry generation $g_{i}$ and carry propagation $p_{i}$ bits of $\mathrm{A}+\mathrm{B}+\mathrm{T}$.

$$
\begin{equation*}
T=2^{\left[\log _{2} 2^{n}-2^{k}-1\right]}-m=2^{k}+1 \tag{5}
\end{equation*}
$$

Actually, A1 is utilized for lower k-bits and A2 is utilized for higher $\mathrm{n}-\mathrm{k}$ bits addition, and it can be performed in the computation of $\mathrm{A}+\mathrm{B}+\mathrm{T}$. Here, the binary representation of $\mathrm{A}, \mathrm{B}$ and T are $\left(a_{n-1} \ldots a_{k} a_{k-1} \ldots a_{1} a_{0}\right),\left(b_{n-1} \ldots b_{k} b_{k-1} \ldots b_{1} b_{0}\right)$ and " $\underbrace{000 \ldots 001}_{(n-k) \text { bit }} \underbrace{00 \ldots 001}_{k \text { bit }}$ " respectively. The operation of the adder A1 and A2 can be represented as

$$
\left\{\begin{array}{l}
S_{A_{1}}=a_{k-1} \ldots a_{0}+b_{k-1} \ldots b_{0}+T_{A_{1}}  \tag{6}\\
S_{A_{2}}=a_{k-1} \ldots a_{0}+b_{k-1} \ldots b_{0}+T_{A_{2}}+C_{A_{1}}
\end{array}\right.
$$

Where $T_{A_{1}}=\underbrace{00 \ldots 001}_{k \text { bit }}, T_{A_{2}}=\underbrace{000 \ldots 001}_{(n-k) \text { bit }}$ and $C_{A_{1}}$ is the carry out bit of A1 adder.

In Fig. 1, A1 can be declared as $k$-bit adder in the lowest carry-in bit for zero e every bit value except the LSB value. the carry generation and carry propagation bits are

$$
\begin{cases}\left(g_{0}, p_{0}\right)=\left(a_{0}+b_{0}, \overline{a_{0} \oplus b_{0}}\right) & i=0  \tag{7}\\ \left(g_{i}, p_{i}\right)=\left(a_{i} b_{i}, a_{i} \oplus b_{i}\right) & i=1,2, \ldots, k-1\end{cases}
$$

The three inputs of adder $\mathrm{A}_{2}$ are $a_{n-1} \ldots a_{k}, b_{n-1} \ldots b_{k}$ and $T_{A_{2}}$ in binary. The adder A2 depends on both the constant $T_{A_{2}}$ and the $\mathrm{A}_{1}$ carry out bit of $C_{A_{1}}$. For adder $\mathrm{A}_{2}$, Simple carry save adder (SCSA) reduces the number of inputs form three to two. When $i=k, k+1, \ldots, n-1$, for $\mathrm{a}_{\mathrm{i}}$ and $\mathrm{b}_{\mathrm{i}}$; the second stage of $\mathrm{g}_{\mathrm{i}}$ and $p_{i}$ in SCSA and the final outputs of pre-processing unit is defined as

$$
\begin{equation*}
\left(g_{i}{ }^{\prime}, p_{i}{ }^{\prime}\right)=\left(a_{i} b_{i}, a_{i} \oplus b_{i}\right) \tag{8}
\end{equation*}
$$

$$
\left\{\begin{array}{lc}
\left(g_{k}, p_{k}\right)=\left(p_{k}^{\prime}, \overline{p_{k}^{\prime}}\right) & i=k  \tag{9}\\
\left(g_{i}, p_{i}\right)=\left(p_{i}^{\prime} g_{i-1}^{\prime}, p_{i}^{\prime} \oplus g_{i-1}^{\prime}\right) & i=k+1, \ldots, n-1
\end{array}\right.
$$

From Eq. 6 and 8, the computational prefix is obtained. And carry out of SCSA is defined from the $\mathrm{A}+\mathrm{B}+\mathrm{T}$ carry-out bit.

$$
\begin{equation*}
C_{S C S A}=a_{n-1} b_{n-1}=g_{n-1}^{\prime} \tag{10}
\end{equation*}
$$

## 4. Carry generation unit

The pre-processing unit is with the carry-generation and carry-propagation bits, the carries $c_{i}^{T}(i=1,2, \ldots, n)$ of $\mathrm{A}+\mathrm{B}+\mathrm{T}$ can be obtained. To determine the carry out bit of $\mathrm{A}+\mathrm{B}+\mathrm{T}, C_{S C S A}$ is combined with the prefix tree carryout bit.

$$
\begin{align*}
C_{o u t} & =C_{S C S A}+c_{n}^{T}=C_{S C S A}+G_{n-1: 0} \\
& =C_{S C S A}+G_{n-1: l}+P_{n-1: l} G_{l-1: 0}  \tag{11}\\
& =C_{S C S A}+G_{n-1: l}+P_{n-1: l} c_{l}^{T}
\end{align*}
$$

Where $0<l \leq n-1$.

## 5. Carry correction unit

The real carries $c_{i}^{\text {real }}$ of the carry correction unit is used for each bit in the final sum computation. For modulo
$2^{n}-2^{k}-1$ adder, T is $2^{k}+1$ represented as $\underbrace{000 \ldots 001} \underbrace{00 \ldots 001}_{k}$ in binary. The $\mathrm{A}+\mathrm{B}+\mathrm{T}$ computation, $S_{A}=A+B+\underbrace{000 \ldots 001}_{(n) \text { bit }} \quad$ and $\quad S_{B}=S_{A}+$ $\underbrace{000 \ldots 001}_{(n-k) \text { bit }} \underbrace{00 \ldots 001}_{k \text { bit }}$.

## Carry correction of $\mathbf{A}_{1}$

The binary representation of T is $\underbrace{000 \ldots 001}_{(n-k) b i t} \underbrace{00 \ldots 001}_{k \text { bit }}, c_{i}^{T}$ can be regarded as the carry bits of $(A+B+T-1)+c_{\text {in }}$ and $c_{\text {in }}$ is 1 . The first correction output $c_{i+1}^{c_{1}}$ result is

$$
\begin{array}{r}
c_{i+1}^{c_{1}}=\overline{c_{\text {out }}} c_{i+1}^{T-1}+c_{\text {out }} c_{i+1}^{T}= \\
\overline{c_{\text {out }}} \overline{P_{i: 0}} c_{i+1}^{T}+c_{\text {out }} c_{i+1}^{T}=c_{i+1}^{T}\left(c_{\text {out }}+\overline{P_{i: 0}}\right) \tag{12}
\end{array}
$$

## Carry correction of $\mathbf{A}_{\mathbf{2}}$

If $c_{\text {out }}$ is zero, $c_{i}^{\text {real }}$ is equal to the carry of $\mathrm{A}+\mathrm{B}+$ $\mathrm{T}-1-2^{\mathrm{k}}$, else the $c_{i}^{\text {real }}$ is equal to $\mathrm{A}+\mathrm{B}+\mathrm{T}$ carry. i.e, $c_{i}^{\text {real }}=c_{i}^{c_{1}}$. The inputs of adder $\mathrm{A}_{2}$ are $p_{n-1}^{\prime} \ldots p_{k+1}^{\prime} p_{k}^{\prime}$ and $g_{n-2}^{\prime} \ldots g_{k+1}^{\prime} g_{k}^{\prime} 1$. the carry-in bit $c_{k}^{c_{1}}$ is the $\mathrm{A}_{1}$ carry-out bit. The two inputs additions of $\mathrm{A}_{2}$ are $p_{n-1}^{\prime} \ldots p_{k+1}^{\prime} p_{k}^{\prime}$ and $g_{n-2}^{\prime} \ldots g_{k+1}^{\prime} g_{k}^{\prime} c_{k}^{c_{1}}$ with the carryin bit value 1 .

$$
\left\{\begin{array}{l}
p_{n-1}^{\prime} \ldots p_{k+1}^{\prime} p_{k}^{\prime}+g_{n-2}^{\prime} \ldots g_{k+1}^{\prime} g_{k}^{\prime} 1+\underbrace{00 \ldots c_{k}^{c_{1}}}_{(n-k) b i t} \\
p_{n-1}^{\prime} \ldots p_{k+1}^{\prime} p_{k}^{\prime}+g_{n-2}^{\prime} \ldots g_{k+1}^{\prime} g_{k}^{\prime} c_{k}^{c_{1}}+\underbrace{00 \ldots 1}_{(n-k) b i t}
\end{array}\right.
$$

## 6. Sum computation unit:

The partial sum bits of $A+B$ and $A+B+T$ for final sum computation are defined by the correction carry. If $c_{\text {out }}$ is zero, the correction carry $c_{i}^{\text {real }}$ is the $\mathrm{A}+\mathrm{B}$ carry bit. Else, A + B + T carry bit.

|  |  | - $------\gg$ White Circle | $\square \rightarrow$ Pentagon Box |
| :---: | :---: | :---: | :---: |
| $\left\{\begin{array}{l} p_{0}^{0}=\overline{p_{0}}, p_{0}^{1}=p_{0}  \tag{14}\\ p_{k}^{0}=\overline{p_{k}}, p_{k}^{1}=p_{k} \\ p_{i}^{0}=p_{i}^{1}=p_{i} \end{array}\right.$ | $i=0$ | -----> Black Circle | (6) $---\gg$ Cross Circle |
|  | $i=k$ | - $---\mathrm{-}-\mathrm{-}$ - Gray Circle | ( ) -------> Triangle Black Circle |
|  |  | $\nabla---->->$ Triangle | -------->> Gray Square Box |
| Where $p_{i}^{0}$ and $p_{i}^{1}$ are the partial sum bits of $\mathrm{A}+\mathrm{B}$ and $\mathrm{A}+$ $B+T$ respectively, $(i=0,1, \ldots, n-1)$. Hence |  | - --- ${ }^{\text {-- }}$ Square Cross Box | (3) ------> Cross Nibble Circle |

Fig. 4. Describing the shapes of modulo $2^{n}-2^{k}-1$.

$$
\text { When } i=1, \ldots, k-1, k+1, \ldots, n-1
$$

$$
\begin{equation*}
s_{i}=c_{i}^{\text {real }} \oplus p_{i} \tag{17}
\end{equation*}
$$

The final sum bits are

$$
\begin{array}{lr}
s_{i} \\
=\left\{\begin{array}{cc}
c_{\text {out }} \oplus \overline{p_{k}} & \mathrm{i}=0 \\
c_{k}^{\text {real }} \oplus c_{\text {out }} \oplus \overline{p_{k}} & i=k \\
c_{i}^{\text {real }} \oplus p_{i} & i=1, \ldots, k-1, k+1, \ldots, n-1
\end{array}\right. \tag{18}
\end{array}
$$

Here $c_{i}^{\text {real }}, c_{\text {out }} \oplus \overline{p_{k}}$ can be obtained at the same time. Hence, no extra delay can be occurred.

## RNS architecture

The VLSI implementation of modulo $2^{n}-2^{k}-1$ adder architecture is shown in Fig. 5. And describing the shapes of the modulo $2^{n}-2^{k}-1$ adder in Fig. 4.

The "white circle" pattern in Fig. 4 is the preprocessing unit and it is utilized to generate carry generation and carry propagation bits for prefix calculation. And the fixed input " 1 " at the $1^{\text {st }}$ and the $4^{\text {th }}$ positions, the " triangle " and " white square box" patterns are utilized for this special condition. These patterns computations considered by eq (7), (8) and (9).

The prefix computation unit is defined in the "black circle" pattern. This pattern consists of one OR gate and one AND gate. It is used for carry generation path. However, To compute the propagation bits, the "gray circle" pattern of prefix tree final stage is not required.

The "triangle black circle" pattern is computing the $c_{\text {out }}$ in Fig. 4. From eq (10), $c_{\text {out }}$ can get after an OR gate.The "pentagon box", "gray square box" and "square cross box"patterns used for the correction of carry unit.

The "cross circle" and "cross nibble circle" patterns are performing the final sum computation. The cross circle pattern performs the XOR operation and "cross nibble circle" pattern performs the XOR operation with one of it's input is inverted. Because, in eq (17) $c_{\text {out }} \oplus \overline{p_{k}}$ computation can be performed.

$s_{k}=c_{k}^{\text {real }} \oplus\left(\overline{c_{\text {out }}} p_{k}^{0}+c_{\text {out }} p_{k}^{1}\right)=c_{k}^{\text {real }} \oplus$
$\left(\overline{c_{\text {out }} p_{k}}+c_{\text {out }} p_{k}\right)=c_{k}^{\text {real }} \oplus c_{\text {out }} \oplus \overline{p_{k}}$


Fig. 4. Implementation architecture of Modulo $2^{n}-2^{k}-1$ adder ( $n=8, k=4$ ).

## 7. Motion esimation using RNS

FSVBSME (BME) searches the best matching block between the current frame and a reference frame. The most frequently utilized technique to find the distance is SAD. The search algorithm can be varied from optimal FS to sub-optimal fast search algorithms.

In H.264/AVC, frame of a video is split into macro blocks of $16 \times 16$ size. Each macro block is segmented into different sub-blocks, shown in Fig. 5. Motion estimation is conceded 7 different modules, mode 1 has a $16 \times 16$ macro block, mode 2 has two $16 x 8$ sub-blocks, mode 3 has two $8 \times 16$ sub-blocks, and mode 4 has four $8 x 8$ sub-blocks. Then, each block of 8 x 8 size is split into sub-blocks. Mode 5 has two $8 \times 4$ sub-blocks, mode 6 has two $4 \times 8$ subblocks, and mode 7 has four $4 \times 4$ sub-blocks. The total possible partitions are 41. In FSVBSME, to calculate the minimum SAD (SADmin) of $4 \times 4$ block and achieve all the SADmin from all these 41 modes slicing.


Fig. 5. The different sub-block of partitions and its positions.

In H.264/AVC video coding, motion estimation is mainly concentrating the temporal redundancy between successive frames. For the H. 264 FSVBSME implementation, the proposed architecture has the advantages of low latency and high throughput. Fig. 6 shows the block diagram of the proposed architecture, which contains the external memory, memory controller, current and reference frames, RNS adder tree, RNS SAD array, SAD comparator and mode decision.


Fig. 6. The proposed architecture of motion estimation using RNS.

From Fig. 6, first of all, the memory controller reads the current pixel and reference pixel from the external memory by a system bus, and send to the Current pixel and reference pixel values to be stored; Secondly, the current pixel and reference pixel values are responsible to send the data to RNS adder arrays to calculate SAD and
array of RNS adder, shown in Fig. 7. When all the searching points are looped over, the cost value (SAD) of the final 41 sub-blocks can be achieved. Then, send the costs of these 41 parallel sub-blocks to Comparator SAD Tree Array to calculate the module information, such as the optimal position cost and the motion vector. Finally, the mode decision decides the best motion vector, by finding the position of the present cost values and previous neighboring motion vector position.

## RNS SAD Array

To measure the similarity between the two image window blocks, the sum of absolute differences (SAD) algorithm is being evaluated for image comparison and object recognition in digital image processing. It works by taking the absolute difference between each pixel in the current frame and corresponding reference frame in the image window block. It is widely utilized for stereo vision, the generation of disparity maps for stereo images, optical flow, motion estimation for video compression. In Fig. 7, the SAD is computed the 16 differences from the current and reference pixels in the $4 \times 4$ window size image.


Fig. 7. Sum of Absolute Difference (SAD) Unit.

Each macro block is split into seven sub-blocks. First, the block will be calculated the $4 \times 4$ window size block. In the $4 \times 4$ window. In Fig.8, the RNS adder tree is done the carry free addition operation in the parallel process. So, the execution of the RNS adder tree process speed is increased. One $16 \times 16 \mathrm{MB}$ is partitioned into $164 \times 4$ subblock, denoted as $\mathrm{C}_{0}-\mathrm{C}_{15}$, as shown in Fig. 9. During the processing procedure, eight $8 \times 4$ SADs and $4 \times 8$ SADs can be first obtained simultaneously, and then four 8x8 SADs can be produced at the same time, then two $16 \times 8$ SADs and $8 \times 16$ SADs be synchronously achieved subsequently, and finally the $16 \times 16$ SAD can be obtained. All of the 41 SADs should be stored in the registers for the reuse of the following unit. These 41 SADs of one MB can be implemented in 4 cycles.


Fig. 8. RNS Adder Tree.


Fig. 9. RNS SAD Tree.

## SAD comparator

The SAD comparator compares each $16 x 16$ SAD of the selected object window, and then compares the finest SAD value. In Fig. 9, The SAD comparator first accumulates the output of SAD $16 \times 16$ present and previous SAD 16x16 stored values, and then it compares to conclude the minimum SAD value by using the comparator.


Fig. 9. SAD comparator.

## Mode decision

In mode decision, the motion vector (MV) declares that the position of the minimum SAD in the x and y directions. Here, x and y coordinates are denoted as row and column directions respectively.


Fig. 10. Motion Vector Difference

The $\mathrm{MV}_{\mathrm{U}}, \mathrm{MV}_{\mathrm{L}}, \mathrm{MV}_{\mathrm{UR}}, \mathrm{MV}_{\mathrm{UL}}$ are the Upper motion vector, Lower motion vector, Upper right motion vector and Lower left of the motion vectors respectively. These four MVs are located at the corresponding motion vector's position. From both $\mathrm{MV}_{\mathrm{UR}}, \mathrm{MV}_{\mathrm{UL}}$, the position of motion vector can be selected Either $\mathrm{MV}_{\mathrm{UR}}$ or $\mathrm{MV}_{\mathrm{UL}}$. In Fig. 10 (a), A Single motion vector predictor (SMVP) can be predicted by the neighbor motion vector value. In Fig. 10 (b), the motion vector difference computes the difference between current MV position and SMVP positions. In Fig. 11(a) (b), the Best mode can be predicted from Both the corresponding SAD16x16 output and the motion vector difference values;. Here, $\lambda_{\text {motion }}$ is common to both the corresponding SADs for sub-blocks of other sizes and each sub-block. It can be computed in the $164 \times 4$ input SAD trees.


Fig. 11. Mode Decision Module.

## 8. Discussion and result

This architecture is implemented with TSMC CMOS 180 nm technology. The characteristics and the performance of proposed algorithm based on the FSVBSME. Table 1 describes the comparison between the proposed algorithm with the existing algorithm techniques. In Ref. [5], After logic synthesis using SMIC 130nm standard cell library, the integer ME architecture allows 36 k logic gates with the processing of $1280 \times 720(720 \mathrm{HD})$ at 38 fps under a clock frequency of 300 MHz with full search block matching algorithm (FS-BMA) in a search range [-32, +32]. In Ref. [7], The Integer Motion Estimation processor chip was designed in the UMC 180nm technology, the result in a circuit with 32.3k logic gates. And a clock frequency of 300 MHz can be estimated
with a processing capacity for HDTV (1920x1088 @ 30 fps ) and a search range of $32 \times 32$. In Ref. [10], a novel VLSI design was designed with TSMC CMOS $0.18 \mu \mathrm{~m}$ technology, the result in the architecture occupies 15.8 k gates at the frequency of 200 MHz , which can constantly reduce about $66 \%$ of the RD related computation with a negligible quality loss. It is expected to be utilized in the
hardware module in a real-time HDTV (1920×1088p) H. 264 encoder. Proposed Method results show that synthesized with TSMC 180nm CMOS, the proposed design occupies 12.9 k logic gates at 352 MHZ and consumes 79 mW power to encode 1920X1080 HDTV video frames at 30 frames per second.

Table 1. The Differences Between The Proposed and Existing Algorithm Techniques.

| Reference | Technology | Search <br> Range | Logic <br> Gate <br> Count | Frequency | Throughput | Power |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ref. [5] | SMIC <br> 130nm | $65 \times 65$ | 36 k | 300 MHZ | $1280 \times 720 @ 38 \mathrm{fps}$ | - |
| Ref. [7] | UMC <br> 180 nm | $32 \times 32$ | 32.3 k | 300 MHZ | $1920 \times 1088 @ 30 \mathrm{fps}$ | 115 mW |
| Ref. [10] | TSMC <br> 180nm <br> CMOS | $32 \times 32$ | 15.8 k | 200 MHZ | $1920 \times 1088 \mathrm{p} @ 30 \mathrm{fps}$ | - |
| PROPOSED | TSMC <br> 180nm <br> CMOS | $\mathbf{3 2 \times 3 2}$ | $\mathbf{1 2 . 9 k}$ | $\mathbf{3 5 2 M H Z}$ | $\mathbf{1 9 2 0 \times 1 0 8 8 p @ 3 0 f p s}$ | $\mathbf{7 9 m W}$ |

## 10. Conclusion

In this paper, the proposed algorithm implemented the fast mode decision and its architecture in RNS module for real time applications of motion estimation for video compression. By finding the ways of representing numbers, the best way is that reduce the sequential effect of carries on digits of most significance, which is a carry free arithmetic. The advantage of the RNS adder tree is that the absence of carry propagation between its arithmetic units, and for every addition operation. This carry-free arithmetic represents that the way of approaching on the speed at which addition can be performed. And it is no need to wait for carry values. Hence, It facilitates the realization of high-speed, lowpower arithmetic. This proposed architecture is achieved that the less logical elements, high throughput required to perform real time motion estimation. In the results the proposed method is synthesized with TSMC 180 nm CMOS, and occupies 12.9 k logic gates at 352 MHZ and consumes 79 mW power to encode 1920X1088 HDTV video frames at 30 frames per second in search range of $32 \times 32$.

## "Compliance with Ethical Standards"

1. Disclosure of potential conflicts of interest - No conflict of Interest.
2. Research involving Human Participants and/or Animals -NA
3. Informed consent - NA

## References

[1] Rec. H.264/ISO/IEC 11496-10, "Advanced Video Coding ", Final Committee Draft, Document JVTE022, 2002.
[2] Yeu-Shen-Jehng, Liang-Gee-Chen, Tzi-Dar Chiueh, IEEE transaction on signal processing, 41(2), (1993).
[3] Swee Yeow Yap, John V. McCanny, IEEE computer society, 2003.
[4] Swee Yeow Yap, John V. McCanny, IEEE Transactions On Circuits And Systems, 51(7), 384 (2004).
[5] Meihua Gu, Ningmei Yu, Lei Zhu, Wenhua Jia, Journal of Computational Information Systems, 7(4), 1310 (2011).
[6] Jun Sung Park, Hyo Jung Song, World Academy of Science, Engineering and Technology, 13, 637 (2008).
[7] G. A. Ruiz, J. A.Michell, Elsevier Signal Processing: Image Communication, 26, 289 (2011).
[8] Haibing Yin, InTech chapter. 8, Advanced Video Coding for Next-Generation Multimedia Services, 157 (2012).
[9] Chuan-Yu Cho, Shiang-Yang Huang, Jenq-Neng Hwang, Jia-Shung Wang, IEEE International Conference on Image Processing, 3, 1016 (2005).
[10] Shen Li, Xianghui Wei, Takeshi Ikenaga, Satoshi Goto, GLSVLSI, 20-24, (2007).
[11] N. S. Szabo, R. I. Tanaka, McGraw-Hill, 1967.
[12] G. Dimauro, S. Impedovo, G. Pirlo, IEEE Transaction on Computers, 42(5), 608 (1993).

[^0]
[^0]:    *Corresponding author: research4charles@ gmail.com

