# An Intra Prediction Pipeline Architecture Design for AVS Encoder

Xiangkui Zhu, Haibin Yin, Wen Gao, *Fellow, IEEE*, Honggang Qi, Don Xie

National Engineering Laboratory for Video Technology, Peking University, China

Abstract--This paper proposes an efficient pipelining method that reduces the data dependence of intra prediction in an AVS high-definition real-time encoder. Because the data dependences differ with the location and prediction mode of each sub-block within a macroblock (MB), the method applies a new processing order to the sub-blocks and their prediction modes. The proposed method was implemented in Verilog and synthesized for a Xilinx LX330. The simulation result shows that the design is capable of encoding 720p high-definition video sequences in real time at 30 frames per second.

## I. INTRODUCTION

AVS (Audio Video Coding Standard) is the first audio and video coding standard developed by China [1]. It targets the compression of moving pictures for digital TV broadcasting, digital storage media, Internet streaming media, and multimedia communication. Our study is based on the FPGA design of a 720p high-definition real-time AVS encoder with high coding performance and moderate complexity. Fig.1 shows the structure of the whole encoder.



Fig.1. Structure of the AVS encoder

The encoder architecture is a four-stage pipeline. The first stage is integer-pixel motion estimation (IME). The second stage is fractional-pixel motion estimation (FME). The third stage is mode decision based on rate-distortion optimization (RDO), which includes intra prediction, DCT, quantization, and related operations. The last stage is entropy coding and deblocking.

Intra prediction in AVS supports only the 8x8 block size. There are 5 modes for a luma block and 4 modes for a chroma block, fewer than in H.264/AVC [2]. Fig.2 illustrates all possible intra prediction modes.





Intra prediction predicts a block by referring to its reconstructed neighboring pixels. When the best prediction mode of the left or upper adjacent block is not yet available, the neighboring reference pixels that the current block needs have not been reconstructed, and the intra prediction pipeline stalls. In this paper, this data dependence problem is solved through careful pipeline design.
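To make the dependence on neighboring pixels concrete, the following minimal sketch generates three textbook predictions for an 8x8 block. It is an illustration only: the function names and sample values are hypothetical, and the predictors are the plain unfiltered forms rather than the normative AVS process. Each predictor reads only the row above or the column to the left of the block, which is why those pixels must be reconstructed first.

```python
# Simplified 8x8 intra predictors (illustrative, not the normative AVS process).
N = 8  # block size supported by AVS intra prediction


def predict_vertical(top):
    """Copy the reconstructed row above the block into every row."""
    return [list(top[:N]) for _ in range(N)]


def predict_horizontal(left):
    """Copy the reconstructed column left of the block into every column."""
    return [[left[y]] * N for y in range(N)]


def predict_dc(top, left):
    """Fill the block with the rounded mean of the 16 neighboring pixels."""
    dc = (sum(top[:N]) + sum(left[:N]) + N) // (2 * N)
    return [[dc] * N for _ in range(N)]


# Hypothetical neighbor values, chosen only for the example.
top = [10, 20, 30, 40, 50, 60, 70, 80]  # row above the block
left = [100] * N                        # column left of the block

vertical = predict_vertical(top)
horizontal = predict_horizontal(left)
dc = predict_dc(top, left)
```

If `top` or `left` belongs to a block that is still in the pipeline, none of these functions can be evaluated, which is the stall condition discussed in the text.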

## II. THE PROPOSED SOLUTION FOR THE DATA DEPENDENCE

To obtain better coding performance and ensure the quality of the encoded picture, RDO [3] is employed as our mode decision algorithm. RDO is so complex to implement in hardware that mode decision must be parallelized and pipelined. To remain consistent with mode decision, intra prediction must also be pipelined within a MB.



Fig.3. 8x8 Block numbers in a MB



Intra prediction generates the prediction pixels of each block from reconstructed neighboring pixels. If prediction follows the order in Fig.4, prediction of the current block can start only after all of its left and upper blocks have been reconstructed. For example, when block 1 is predicted in the horizontal direction, it needs the reconstructed pixels of the right-most column of block 0. A block cannot be reconstructed until all of its 5 (or 4, for chroma) prediction modes have passed through the pipeline and their results are available. At that moment, 4 prediction modes have still not been processed, so the pipeline has to stall and wait for the reconstruction of block 0. A similar situation arises when blocks 2 and 3 are predicted. The pipeline is therefore blocked repeatedly, which seriously degrades its throughput and wastes considerable resources and time.



Careful analysis shows that there is no data dependence for block U and block V, so their position in the processing order within a MB is flexible. Based on this, the pipeline order can be changed as shown in Fig.5, which reduces the data dependence efficiently. No data dependence remains except when predicting block 2.
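The effect of moving the chroma blocks can be illustrated with a toy scheduling model. Everything in this sketch is an assumption made for illustration: one prediction mode is issued per time slot, a block becomes available a fixed number of slots after its last mode is issued (`RECON_LATENCY`, an arbitrary value), dependences are tracked per block rather than per mode, and the interleaved order `0, U, 1, 2, V, 3` is a hypothetical ordering, not a transcription of Fig.5. The resulting counts are therefore not the stall figures of the actual design.

```python
LUMA_MODES, CHROMA_MODES = 5, 4  # mode counts stated in the text

MODES = {"0": LUMA_MODES, "1": LUMA_MODES, "2": LUMA_MODES,
         "3": LUMA_MODES, "U": CHROMA_MODES, "V": CHROMA_MODES}

# Assumed block-level dependences inside one MB: a luma block needs the
# luma blocks to its left and above; chroma blocks need nothing inside the MB.
DEPS = {"0": [], "1": ["0"], "2": ["0", "1"], "3": ["0", "1", "2"],
        "U": [], "V": []}


def stall_slots(order, latency):
    """Count idle issue slots when blocks are processed in `order`."""
    t = 0        # next free issue slot
    ready = {}   # slot at which each block's reconstruction is available
    stalls = 0
    for blk in order:
        earliest = max((ready[d] for d in DEPS[blk]), default=0)
        if earliest > t:          # reference pixels not reconstructed yet
            stalls += earliest - t
            t = earliest
        t += MODES[blk]           # issue one mode per slot
        ready[blk] = t + latency  # reconstruction finishes after the latency
    return stalls


RECON_LATENCY = 3  # hypothetical value, in slots

raster = stall_slots(["0", "1", "2", "3", "U", "V"], RECON_LATENCY)
interleaved = stall_slots(["0", "U", "1", "2", "V", "3"], RECON_LATENCY)
print(raster, interleaved)
```

Under these assumptions the raster order stalls before each of blocks 1, 2, and 3, while the interleaved order stalls only before block 2, because the chroma blocks fill the slots in which a luma block would otherwise wait.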

In fact, the reconstructed pixels could be replaced with original pixels when predicting block 2 to avoid the data dependence, but the coding performance then drops considerably. A performance test of intra prediction based on original pixels shows that the PSNR drop can reach 0.2 dB, and sometimes more.

We find that the prediction of block 2 can begin before all modes of block 1 have been processed; it only needs to wait 4T before starting. Based on this observation, a new pipeline order is proposed, as shown in Fig.6. Although this method stalls the pipeline for 4T, it still achieves high coding performance with an acceptable number of stall cycles.



Fig.6. Proposed pipeline order

## III. INTRA PREDICTION STRUCTURE

We have designed an intra prediction structure [4] that implements our method of solving the data dependence. Fig.7 shows this structure.



Fig.7. Structure of Intra Prediction

In the intra prediction structure, the Neighbour unit judges whether the left and upper adjacent pixels have been reconstructed. The Reference Pixel unit accesses and updates these pixels. It uses 3 register files (17x8 bits, 9x8 bits, 9x8 bits) to store the left pixels and 3 RAMs (1080x8 bits, 540x8 bits, 540x8 bits) to store the upper pixels. If the current frame is an I frame, mode decision chooses the best prediction mode for the block after intra prediction has generated the results of all modes. Otherwise, the best intra prediction mode is chosen by SAD before mode decision.

Among the prediction modes, plane is especially complicated. It first calculates several parameters and a base value, and then generates each pixel by adding some of the parameters to the base value a certain number of times. To save time, improve resource utilization, and achieve better coding performance, we pre-calculate these parameters in a dedicated unit. In addition, intra prediction makes extensive use of parallel and pipelined processing; for example, the Neighbour and Vertical units operate at the same time. Moreover, the prediction circuits of a given prediction mode are shared between I frames and P or B frames, and between luma blocks and chroma blocks.
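The incremental generation of plane pixels can be sketched as follows. This is a minimal illustration under stated assumptions: the plane is taken to have the generic form base + b·x + c·y, the names `a`, `b`, and `c` are hypothetical, and the derivation of these parameters from the reference pixels, as well as the final rounding and clipping, are omitted because the text does not give them. The sketch only shows that repeated addition yields the same values as direct evaluation, so no multiplier is needed once the parameters are pre-calculated.

```python
def plane_direct(a, b, c, n=8):
    """Evaluate base + b*x + c*y for every pixel with multiplications."""
    return [[a + b * x + c * y for x in range(n)] for y in range(n)]


def plane_incremental(a, b, c, n=8):
    """Generate the same values using additions only."""
    rows = []
    row_start = a                # value at x = 0 of the current row
    for _ in range(n):
        value = row_start
        row = []
        for _ in range(n):
            row.append(value)
            value += b           # step one pixel to the right
        rows.append(row)
        row_start += c           # step one row down
    return rows


# Hypothetical parameter values, chosen only for the example.
same = plane_incremental(100, 3, -2) == plane_direct(100, 3, -2)
print(same)
```

In hardware terms, the inner loop corresponds to an accumulator per row, which is one plausible reason for pre-calculating the parameters in a separate unit.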

## IV. SIMULATION RESULT

The proposed method is implemented in synthesizable Verilog RTL targeting a Xilinx LX330. Fig.8 shows the synthesis results of our design. It can encode high-definition video in real time (720p@30fps). In the future, we will optimize the design for higher frequency and lower resource consumption so as to support 1080p high-definition video coding.

Worst slack in design: 0.509

| Starting Clock | Requested Frequency | Estimated Frequency | Requested Period (ns) | Estimated Period (ns) | Slack (ns) | Clock Type | Clock Group         |
|----------------|---------------------|---------------------|-----------------------|-----------------------|------------|------------|---------------------|
| IntraPred clk  | 200.0 MHz           | 222.7 MHz           | 5.000                 | 4.491                 | 0.509      | inferred   | Inferred_clkgroup_( |

Total LUTs: 18537 (8%)

Fig.8. Synthesized results of our design
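The timing figures in Fig.8 can be related to the 720p@30fps target with some simple arithmetic. The sketch below assumes a 1280x720 frame, 16x16 macroblocks, and the 200 MHz requested clock; the variable names are illustrative. It gives only the average cycle budget per MB at that clock. The number of cycles the design actually spends per MB is not stated in the text.

```python
WIDTH, HEIGHT, FPS = 1280, 720, 30   # 720p at 30 frames per second
MB_SIZE = 16                         # macroblock edge in luma pixels
CLOCK_HZ = 200_000_000               # requested clock from Fig.8

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FPS
cycles_per_mb = CLOCK_HZ // mbs_per_second  # average budget, rounded down

# Cross-check of the timing report: periods are in nanoseconds.
slack_ns = round(5.000 - 4.491, 3)
estimated_mhz = round(1000 / 4.491, 1)

print(mbs_per_frame, mbs_per_second, cycles_per_mb, slack_ns, estimated_mhz)
```

The slack and estimated frequency recomputed here agree with the values reported in Fig.8.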

## V. CONCLUSION

This paper presents an efficient pipelining method that breaks the data dependence of intra prediction by rearranging the processing order of blocks within a MB. The design is capable of encoding 720p high-definition video in real time at 30 frames per second. Simulation results show that it reduces the data dependence between adjacent blocks while providing high coding performance.

## REFERENCES

- [1] AVS Video Expert Group, "Information Technology - Advanced Audio Video Coding Standard Part 2: Video," in Audio Video Coding Standard Group of China (AVS), Doc. AVS-N1063, Dec. 2003.
- [2] Feng Pan, Xiao Lin, Susanto Rahardja, Keng Pang Lim, Z.G. Li, Dajun Wu, and Si Wu, "Fast Mode Decision Algorithm for Intra Prediction in H.264/AVC Video Coding," IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 7, pp. 813-822, July 2005.
- [3] G.J. Sullivan and T. Wiegand, "Rate-Distortion Optimization for Video Compression," IEEE Signal Processing Magazine, vol. 15, pp. 74-90, Nov. 1998.
- [4] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, "Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder," IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 3, Mar. 2005.