# EdkDSP platform

# Technical Report FIT-VG20102015006-2011-06

# David Barina, Pavel Zemcik

Faculty of Information Technology, Brno University of Technology

Date: 2011-12-01


# Contents

1. Introduction
2. Platform description
3. Example application
4. References

#### Abstract

The EdkDSP platform takes the form of a System-on-Chip bitstream that fits into Xilinx FPGAs. The platform can accelerate simple floating-point operations applied to vectors, and it can typically be used to accelerate common image processing and computer vision tasks. This report introduces the platform together with an example application (the wavelet transform).

### 1 Introduction

The platform consists of a MicroBlaze (MB) central processing unit and several acceleration units (BCE elements [2]) controlled by corresponding PicoBlaze (PB) processors. The MicroBlaze is a full 32-bit soft processor designed for Xilinx FPGAs. It can even run the PetaLinux operating system, in which case the kernel provides a file system and Ethernet connectivity.



Figure 1: Spartan-6 SP605 FPGA kit.

### 2 Platform description

The BCE (basic computing element) units accelerate simple floating-point dataflow operations applied to vectors (arrays) held in local memory. The supported operations include addition, multiplication, assignment, and dot product, among others. The embedded dataflow unit can read two input operands and write one result in every cycle of the BCE clock. The PicoBlaze processor performs a sequence of such simple vector operations according to the uploaded firmware.



Figure 2: MicroBlaze with one connected BCE. The BCE consists of a PicoBlaze controller and a dataflow unit (DFU). Taken from the platform documentation.

The platform toolchain consists mainly of the UTIA PicoBlaze compiler, the PetaLinux MicroBlaze compiler, and the UTIA EdkDSP APIs. The principle of the acceleration is to replace simple for-loops with several function calls that start the computation in a BCE. Copying memory areas to and from the BCE memory is also necessary. The UTIA EdkDSP platform currently operates only on single-precision floating-point numbers. The maximum length of such a for-loop is limited by the memory bank size of 256 words, so computation in long loops has to be cut into shorter ones, which involves some overhead.
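The loop splitting described above can be sketched in plain C as follows. This is a minimal illustration, not EdkDSP code: `bank_vadd` is a hypothetical stand-in that emulates one accelerated vector operation in software, and `BCE_BANK_WORDS` and `chunked_vadd` are names introduced here. Only the 256-word limit comes from the platform description.

```c
#include <stddef.h>

#define BCE_BANK_WORDS 256  /* memory bank size stated in the text */

/* Hypothetical stand-in for one accelerated vector addition.
 * It is emulated in software here; n must not exceed BCE_BANK_WORDS. */
static void bank_vadd(float *z, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = a[i] + b[i];
}

/* Adds two vectors of arbitrary length by cutting the loop into
 * pieces of at most BCE_BANK_WORDS elements.
 * Returns the number of pieces that were processed. */
size_t chunked_vadd(float *z, const float *a, const float *b, size_t n)
{
    size_t chunks = 0;

    for (size_t off = 0; off < n; off += BCE_BANK_WORDS) {
        size_t len = n - off;
        if (len > BCE_BANK_WORDS)
            len = BCE_BANK_WORDS;
        bank_vadd(z + off, a + off, b + off, len);
        chunks++;
    }
    return chunks;
}
```

For a 600-element vector this results in three calls (256 + 256 + 88 elements). On the real platform each call would additionally carry the cost of copying data to and from the BCE memory, which is the overhead mentioned above.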

The EdkDSP SoC fits into the Spartan-6 SP605 FPGA kit. The procedure for uploading the EdkDSP bitstream into the FPGA kit is described in the corresponding manual. The PetaLinux system can be booted from a PC running a TFTP server, using the U-Boot bootloader. Any precompiled application can then be uploaded to the booted system over FTP and controlled from a Telnet terminal.

From the programmer's point of view, the EdkDSP compilation toolchain provides two APIs. The first, the WAL (Worker Abstraction Layer) API, is intended for controlling the BCE elements from MicroBlaze code; its functions begin with the wal prefix. The second, the PB2 API, is used in PicoBlaze code (i.e. firmware) to control the acceleration unit (DFU) and to communicate with the MicroBlaze program; its functions begin with the pb2 prefix. Two compilers are needed to compile an application that uses the acceleration units (workers). The PicoBlaze compiler (pbcc) compiles PB code (firmware) containing PB2 API calls into a header file, which is later included into the main MB code (the actual application). The MB code is then compiled with the PetaLinux MicroBlaze compiler into the final binary, which is executable under the PetaLinux system.

### 3 Example application

This application demonstrates an EdkDSP implementation of discrete wavelet transform (DWT) image decomposition using the CDF (Cohen-Daubechies-Feauveau) 9/7 wavelet [3], which is used in the JPEG 2000 image coding standard [5], the Dirac video compression format [6], and the FBI fingerprint image compression standard [7]. The lifting scheme [4] is used to perform one level of the one-dimensional discrete wavelet transform. The two-dimensional discrete wavelet transform is performed using Mallat's decomposition [1] (separable wavelet transform).

In each level of the two-dimensional decomposition, the one-dimensional transform can be executed on every row (and subsequently on every column) in parallel. As mentioned above, the one-dimensional transform is computed using the lifting scheme, which consists of several for-loops. These loops are known as predict and update steps, and each of them can be performed for all coefficients in parallel as well. The predict step is performed over the odd coefficients using the values of the even ones. Similarly, the update step is performed over the even coefficients using the values of the odd ones.
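The row-then-column structure of one decomposition level can be sketched as below. This is an illustrative sketch only: `transform2d_level` and the callback type `transform1d_fn` are hypothetical names, the image is assumed to be stored row-major, and `haar_lift1d` is a deliberately simple Haar-like predict/update pair used as a placeholder for the one-dimensional transform, not the CDF 9/7 transform of the report. A full Mallat decomposition would repeat this on the low-pass quadrant for further levels.

```c
#include <stdlib.h>

/* Signature of an in-place one-dimensional transform (hypothetical). */
typedef void (*transform1d_fn)(float *x, int n);

/* Placeholder 1-D transform: one predict step on odd samples followed
 * by one update step on even samples (a Haar-like lifting pair). */
void haar_lift1d(float *x, int n)
{
    for (int i = 0; i + 1 < n; i += 2) {
        x[i + 1] -= x[i];            /* predict: odd from even */
        x[i]     += 0.5f * x[i + 1]; /* update: even from odd  */
    }
}

/* Applies f to every row and then to every column of a w-by-h
 * row-major image. Rows (and columns) are independent of each other,
 * so each of the two loops could run in parallel.
 * Returns 0 on success, -1 if the column buffer cannot be allocated. */
int transform2d_level(float *img, int w, int h, transform1d_fn f)
{
    if (w <= 0 || h <= 0)
        return 0;

    float *col = malloc((size_t)h * sizeof *col);
    if (col == NULL)
        return -1;

    for (int y = 0; y < h; y++)          /* rows are contiguous */
        f(img + (size_t)y * w, w);

    for (int x = 0; x < w; x++) {        /* columns via a scratch buffer */
        for (int y = 0; y < h; y++)
            col[y] = img[(size_t)y * w + x];
        f(col, h);
        for (int y = 0; y < h; y++)
            img[(size_t)y * w + x] = col[y];
    }

    free(col);
    return 0;
}
```

Copying each column into a contiguous buffer keeps the one-dimensional routine free of stride handling, which also matches the way a vector has to be copied into a local memory before it can be processed.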



Figure 3: Lifting scheme (predict and update steps). Taken from [1].

The computation of the one-dimensional discrete wavelet transform [1] with the CDF 9/7 wavelet consists of a sequence of four lifting steps followed by coefficient scaling. This critical code section was accelerated in PicoBlaze firmware as 15 pb2dfu\_restart\_op calls (VADD\_AZ2B, VADD\_BZ2A, VMULT and VZ2A). These operations are depicted in Figure 4, where $\alpha$, $\beta$, $\gamma$, $\delta$ are the lifting coefficients and $\zeta$ is the scaling constant.
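For reference, the four lifting steps and the scaling can be written as a plain C routine such as the one below. This is a software sketch under stated assumptions, not the PicoBlaze firmware: the function names are hypothetical, the samples are assumed to be interleaved (even indices become low-pass, odd indices high-pass), the borders are handled by symmetric extension, and the scaling convention (low-pass multiplied by $\zeta$, high-pass by $1/\zeta$) is one common choice among several. The constants are the CDF 9/7 lifting coefficients from the factorization in [4].

```c
/* CDF 9/7 lifting constants (Daubechies and Sweldens [4]). */
#define ALPHA (-1.586134342f)
#define BETA  (-0.05298011854f)
#define GAMMA (0.8829110762f)
#define DELTA (0.4435068522f)
#define ZETA  (1.149604398f)

/* One lifting step: x[i] += c * (x[i-1] + x[i+1]) for i = first,
 * first+2, ... Neighbours outside the array are mirrored
 * (symmetric extension, an assumption of this sketch). Requires n >= 2. */
static void lift_step(float *x, int n, int first, float c)
{
    for (int i = first; i < n; i += 2) {
        float left  = x[i > 0     ? i - 1 : i + 1];
        float right = x[i + 1 < n ? i + 1 : i - 1];
        x[i] += c * (left + right);
    }
}

/* Forward transform, in place: four lifting steps, then scaling. */
void cdf97_forward(float *x, int n)
{
    if (n < 2)
        return;
    lift_step(x, n, 1, ALPHA);  /* predict 1 (odd samples)  */
    lift_step(x, n, 0, BETA);   /* update 1  (even samples) */
    lift_step(x, n, 1, GAMMA);  /* predict 2 (odd samples)  */
    lift_step(x, n, 0, DELTA);  /* update 2  (even samples) */
    for (int i = 0; i < n; i++)
        x[i] *= (i % 2 == 0) ? ZETA : 1.0f / ZETA;
}

/* Inverse transform: undo the scaling, then run the lifting steps
 * in reverse order with negated coefficients. */
void cdf97_inverse(float *x, int n)
{
    if (n < 2)
        return;
    for (int i = 0; i < n; i++)
        x[i] *= (i % 2 == 0) ? 1.0f / ZETA : ZETA;
    lift_step(x, n, 0, -DELTA);
    lift_step(x, n, 1, -GAMMA);
    lift_step(x, n, 0, -BETA);
    lift_step(x, n, 1, -ALPHA);
}
```

Each lifting step modifies only samples of one parity using samples of the other parity, so every iteration of its loop is independent. This is what allows a step to be expressed as a handful of whole-vector additions and multiplications of the kind the BCE provides.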

| Computation on              | Time [s]    |
|-----------------------------|-------------|
| MicroBlaze                  | 0.485678    |
| MicroBlaze + BCE            | 0.469039    |
| empty critical code section | 0.140792    |

Table 1: Performance measurement of the forward transform. The code was accelerated using one BCE worker. Both the MicroBlaze and the BCE run at 62.5 MHz.



Figure 4: PicoBlaze operations for two pairs of lifting steps followed by coefficient scaling.

On the MicroBlaze, the following sequence is performed.

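```
/* Copy the input samples from MicroBlaze memory into BCE data memory A. */
wal_mb2dmem(worker, 0, WAL_BCE_JK_DMEM_A, 0, arr, 2*steps+4);
/* Copy the coefficient array into BCE data memory B. */
wal_mb2dmem(worker, 0, WAL_BCE_JK_DMEM_B, 0, coeffs, 11);
/* Pass the number of steps to the PicoBlaze firmware. */
wal_mb2pb(worker, steps);
/* Read the reply from the firmware; the value itself is discarded. */
wal_pb2mb(worker, NULL);
/* Copy the results from BCE data memory A back to MicroBlaze memory. */
wal_dmem2mb(worker, 0, WAL_BCE_JK_DMEM_A, 0, arr, 2*steps+4);
```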

The code was accelerated in this way, and the results are shown in Table 1. The single-precision floating-point format was used. The computation was measured on a grayscale image ($512 \times 512$ pixels).
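As a rough reading of Table 1, the speedup can be estimated in two ways. The second estimate assumes that the "empty critical code section" row measures the same program with the lifting computation removed, so that its time can be subtracted as fixed overhead; the report does not state this explicitly. The symbols $t_{\mathrm{MB}}$, $t_{\mathrm{MB+BCE}}$ and $t_{\mathrm{empty}}$ are introduced here for the three rows of the table.

$$
\frac{t_{\mathrm{MB}}}{t_{\mathrm{MB+BCE}}} = \frac{0.485678}{0.469039} \approx 1.035,
\qquad
\frac{t_{\mathrm{MB}} - t_{\mathrm{empty}}}{t_{\mathrm{MB+BCE}} - t_{\mathrm{empty}}} = \frac{0.344886}{0.328247} \approx 1.051
$$

Under this assumption, one BCE worker shortens the whole run by roughly 3.5 % and the critical section by roughly 5 %.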

### 4 References

- [1] S. Mallat: A Wavelet Tour of Signal Processing: The Sparse Way. With contributions from Gabriel Peyré. Academic Press, 3rd edition, 2009, ISBN 9780123743701.
- [2] M. Danek, J. Kadlec, R. Bartosinski and L. Kohout: Increasing the Level of Abstraction in FPGA-based Designs. In International Conference on Field Programmable Logic and Applications. Heidelberg: Kirchhoff Institute for Physics, 2008. pp. 5-10. ISBN 978-1-4244-1961-6.
- [3] A. Cohen, I. Daubechies and J.-C. Feauveau: Biorthogonal bases of compactly supported wavelets. Comm. Pure Appl. Math., 1992, 45, pp. 485-500.
- [4] I. Daubechies and W. Sweldens: Factoring Wavelet Transforms into Lifting Steps. Journal of Fourier Analysis and Applications, vol. 4, issue 3, 1998, pp. 247-269, ISSN 1069-5869.
- [5] ISO/IEC 15444-1:2000, JPEG 2000 image coding system Part 1: Core coding system.
- [6] Dirac Specification, Version 2.2.3, September 23, 2008.
- [7] J. Bradley, C. Brislawn and T. Hopper, WSQ Gray-Scale Fingerprint Image Compression Specification, IAFIS-IC-0110(V3.1), Federal Bureau of Investigation, October 4, 2010.