Page 31 - DCAP608_REAL TIME SYSTEMS
For the SSAR benchmark, the VSIPL++ API provides the right level of abstraction. VSIPL++
is an open standard, high-level API for parallel high-performance signal and image
processing. It is defined by the High Performance Embedded Computing Software Initiative
(HPEC-SI – www.hpec-si.org), a consortium of industrial, academic, and governmental
partners, with sponsorship from the Air Force Research Laboratory. Its goal is to
simultaneously deliver productivity, portability, and performance. VSIPL++ defines a
pure C++ interface for operations – including FFTs, filters, linear system solvers, and
other mathematical functions – that allow SIP applications to be written at the problem
domain level.
For the SSAR benchmark challenge, CodeSourcery used Sourcery VSIPL++, a library that
provides an optimized implementation of the VSIPL++ API on x86, Power Architecture,
and Cell/B.E. processors with useful extensions to the base VSIPL++ specification.
The next step: Implementing SSAR in VSIPL++
Using the VSIPL++ library, CodeSourcery implemented SSAR in C++ in just four days. To
illustrate the relative advantage of the VSIPL++ API for developer productivity, consider
three different implementations of the SSAR algorithm’s fast time filter: (1) MATLAB, (2)
simple, unoptimized C, and (3) VSIPL++. Mathematically, they all perform the same
computation, but the VSIPL++ version, like the MATLAB version, is easy to understand
because it is expressed in SIP primitives such as FFTs and matrix multiplication.
Ignoring the setup of the filter coefficients, the VSIPL++ version of the fast time filter
requires a single line, performing two data-parallel operations sequentially. By contrast,
the C reference implementation is more verbose and thus more error-prone. In addition,
because the C code is written as explicit element-by-element loops, it is more difficult to
divide among multiple processors or to optimize for different architectures.
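The contrast between the two styles can be sketched in a few lines. The following is a minimal illustration in Python, not the CodeSourcery code and not the VSIPL++ API. It assumes that each row of the raw data holds one pulse, that the multiplication is an element-wise scaling of each transformed row by the filter coefficients, and it uses a naive DFT as a stand-in for a library FFT. All function and variable names are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of one row.
    Stand-in for an optimized library FFT (assumption for illustration)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fast_time_filter_loops(raw, coeffs):
    """Explicit-loop style: every index is spelled out, as in plain C."""
    rows, cols = len(raw), len(raw[0])
    out = [[0j] * cols for _ in range(rows)]
    for r in range(rows):
        for j in range(cols):
            acc = 0j
            for k in range(cols):
                acc += raw[r][k] * cmath.exp(-2j * cmath.pi * j * k / cols)
            out[r][j] = acc * coeffs[j]  # scale each frequency bin by its coefficient
    return out

def fast_time_filter_primitives(raw, coeffs):
    """Primitive style: 'transform each row, then scale by the filter'
    written as a single expression over whole rows."""
    return [[c * v for c, v in zip(coeffs, dft(row))] for row in raw]
```

Both functions compute the same result. The second states the intent (transform, then multiply) directly, which is what leaves a library free to choose how the work is split or vectorized.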
The C and VSIPL++ implementations were benchmarked on both a conventional Xeon
processor running at 3.6 GHz and a Cell/B.E. processor running at 3.2 GHz. The entire
front-end processing chain was run 10 times over the data so that the timing measurements
could be averaged. On the Xeon platform, the VSIPL++ library used the Intel Performance
Primitives (IPP) library v5 and Intel Math Kernel Library (MKL) v7.21 as well as FFTW
v3.1.2. On the Cell/B.E. platform, the VSIPL++ library used the Cell Math Library v1.0 and
FFTW v3.2-alpha3. The VSIPL++ code needed no changes to run on both architectures; the
VSIPL++ library utilized these underlying math libraries without explicit direction from
the developer.
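The measurement procedure described above (repeat the stage and average the timings) can be sketched as follows. This is an illustrative harness only, assuming a hypothetical callable `stage` that stands for the front-end processing chain; it is not the harness used in the benchmark.

```python
import time

def average_runtime(stage, data, repeats=10):
    """Run stage(data) `repeats` times and return the mean wall-clock
    time in seconds. `stage` is a hypothetical processing function."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        stage(data)
        total += time.perf_counter() - start
    return total / repeats
```

Averaging over several runs reduces the influence of one-off effects such as a cold cache on the first pass.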
The extra mile: Unlocking hardware’s potential with strategic optimization
Finally, opportunities for optimization were investigated in the VSIPL++ implementation
for the Xeon and Cell/B.E. processors. For example, in the fast time filter computation, the
VSIPL++ code performs an FFT followed by a matrix multiplication. Given large enough
data sets, memory accesses in the second step cause cache misses on a Xeon processor,
leading to expensive reads from main memory. Rewriting the computation so that each row
is processed one at a time (that is, taking the FFT of a row and then immediately performing
the vector multiplication on it) results in a 1.6x speed improvement for this portion of the code. The
change is trivial to implement, requiring only a net increase of eight lines of code, yet it
yields a 20% improvement in the execution time of the entire front-end stage.
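The restructuring described above can be sketched as two equivalent orderings of the same work. This is a minimal Python illustration with hypothetical names, using a naive DFT in place of a library FFT and assuming element-wise multiplication by the filter coefficients. It shows only the change in loop structure; the cache benefit itself would be observable only in compiled code working on large data sets.

```python
import cmath

def dft(x):
    """Naive DFT of one row (stand-in for a library FFT; an assumption)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def filter_two_pass(raw, coeffs):
    """Transform every row first, then sweep the whole matrix a second
    time to multiply. On large data the second sweep finds the early
    rows already evicted from cache."""
    spectra = [dft(row) for row in raw]          # pass 1 over the whole matrix
    return [[c * v for c, v in zip(coeffs, row)]  # pass 2 over the whole matrix
            for row in spectra]

def filter_row_at_a_time(raw, coeffs):
    """Finish each row (transform, then multiply) before moving on,
    so the multiplication uses data that was just produced."""
    out = []
    for row in raw:
        spectrum = dft(row)
        out.append([c * v for c, v in zip(coeffs, spectrum)])
    return out
```

As a rough consistency check on the figures quoted: if a fraction f of the front-end time is spent in this step, a 1.6x speedup of that step shortens the total to 1 - f + f/1.6. Reading "20% improvement" as a 20% reduction in execution time would then imply f of roughly 0.53. That reading is an assumption, since the text does not state the fraction.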
On the Cell/B.E. processor, profiling reveals that interpolation takes almost 40 times
longer than matched filtering. A large amount of time is spent in a loop over data in the
“range” direction (perpendicular to the flight path), performing a polar-to-rectangular
coordinate conversion. A contribution from several inputs for each side-lobe of the sinc
26 LOVELY PROFESSIONAL UNIVERSITY