Page 31 - DCAP608_REAL TIME SYSTEMS




                                     For the SSAR benchmark, the VSIPL++ API provides the right level of abstraction. VSIPL++
                                     is an open standard, high-level API for parallel high-performance signal and image
                                     processing (SIP). It is defined by the High Performance Embedded Computing Software
                                     Initiative (HPEC-SI – www.hpec-si.org), a consortium of industrial, academic, and
                                     governmental partners, with sponsorship from the Air Force Research Laboratory. Its goal
                                     is to simultaneously deliver productivity, portability, and performance. VSIPL++ defines
                                     a pure C++ interface for operations – including FFTs, filters, linear system solvers, and
                                     other mathematical functions – that allow SIP applications to be written at the problem
                                     domain level.
                                     For the SSAR benchmark challenge, CodeSourcery used Sourcery VSIPL++, a library that
                                     provides an optimized implementation of the VSIPL++ API on x86, Power Architecture,
                                     and Cell/B.E. processors with useful extensions to the base VSIPL++ specification.
                                     The next step: Implementing SSAR in VSIPL++
                                     Using the VSIPL++ library, CodeSourcery implemented SSAR in C++ in just four days. To
                                     illustrate the relative advantage of the VSIPL++ API for developer productivity, consider
                                     three different implementations of the SSAR algorithm’s fast time filter: (1) MATLAB, (2)
                                     simple, unoptimized C, and (3) VSIPL++. Mathematically, they all perform the same
                                     computation, but the VSIPL++ version, like the MATLAB version, is easy to understand
                                     because it is expressed in SIP primitives such as FFTs and matrix multiplication.
                                     Ignoring the setup of the filter coefficients, the VSIPL++ version of the fast time filter
                                     requires a single line, performing two data-parallel operations sequentially. By contrast,
                                     the C reference implementation is more verbose and thus more error-prone. In addition,
                                     because the C code is iterative, it is more difficult to divide among multiple processors or
                                     to optimize for different architectures.
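To make the contrast concrete, here is a minimal sketch of the iterative form of the fast time filter in plain standard C++. It is not the CodeSourcery code and it does not use the VSIPL++ API. The row-major layout, the names `dft` and `fast_time_filter_two_pass`, and the naive O(n^2) transform (a stand-in for a library FFT) are all assumptions made for illustration.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Row = std::vector<cplx>;
using Matrix = std::vector<Row>;  // assumed layout: one row per pulse

// Naive O(n^2) discrete Fourier transform of one row.
// Illustrative stand-in for a library FFT call.
Row dft(const Row& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    Row out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += in[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Step 1: transform every row. Step 2: multiply every row element-wise
// by the filter coefficients. Both steps are explicit loops.
Matrix fast_time_filter_two_pass(const Matrix& raw, const Row& coeffs) {
    Matrix spectrum;
    spectrum.reserve(raw.size());
    for (const Row& r : raw) {
        spectrum.push_back(dft(r));
    }
    for (Row& r : spectrum) {
        for (std::size_t k = 0; k < r.size(); ++k) {
            r[k] *= coeffs[k];
        }
    }
    return spectrum;
}
```

Even in this small form, the index arithmetic and loop nesting are written out by hand, which is where the verbosity and the opportunity for errors come from. A data-parallel library call expresses the same two steps without exposing the loops.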
                                     The C and VSIPL++ implementations were benchmarked on both a conventional Xeon
                                     processor running at 3.6 GHz and a Cell/B.E. processor running at 3.2 GHz. The entire
                                     front-end processing chain was run 10 times over the data, and the measurements were
                                     averaged. On the Xeon platform, the VSIPL++ library used the Intel Performance
                                     Primitives (IPP) library v5 and Intel Math Kernel Library (MKL) v7.21 as well as FFTW
                                     v3.1.2. On the Cell/B.E. platform, the VSIPL++ library used the Cell Math Library v1.0 and
                                     FFTW v3.2-alpha3. The VSIPL++ code needed no changes to run on both architectures; the
                                     VSIPL++ library utilized these underlying math libraries without explicit direction from
                                     the developer.
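The measurement approach described above (repeat the run and average) can be sketched in standard C++ as follows. The helper name `mean_seconds` and its signature are hypothetical; the source does not say how the timings were actually collected.

```cpp
#include <chrono>
#include <functional>

// Runs `work` `iterations` times and returns the mean wall-clock time
// per run, in seconds. Hypothetical helper for illustration only.
double mean_seconds(const std::function<void()>& work, int iterations = 10) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

Passing the processing chain as a callable keeps the timing code independent of the implementation being measured, so the same harness could time a C version and a library-based version.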

                                     The extra mile: Unlocking hardware’s potential with strategic optimization
                                     Finally, opportunities for optimization were investigated in the VSIPL++ implementation
                                     for the Xeon and Cell/B.E. processors. For example, in the fast time filter computation, the
                                     VSIPL++ code performs an FFT followed by a matrix multiplication. Given large enough
                                     data sets, memory accesses in the second step cause cache misses on a Xeon processor,
                                     leading to expensive reads from main memory. Rewriting this computation so that rows are
                                     processed one at a time (that is, taking the FFT of a row and then immediately performing
                                     the vector multiplication) results in a 1.6x speed improvement for this portion of the code.
                                     The change is trivial to implement, requiring only a net increase of eight lines of code, yet
                                     it yields a 20% improvement in the execution time of the entire front-end stage.
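The reordering described above can be sketched in plain standard C++. This is an illustration of the loop structure only, not the VSIPL++ code from the benchmark: the flat row-major buffer, the function names, and the naive transform standing in for a library FFT are assumptions.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) DFT of n contiguous samples, written back in place.
// Illustrative stand-in for a library FFT call.
void naive_dft_in_place(cplx* row, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += row[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) {
        row[k] = out[k];
    }
}

// Whole-matrix ordering: all transforms first, then all multiplications.
// For large inputs, early rows may have left the cache before pass two.
void filter_two_pass(std::vector<cplx>& data, std::size_t rows,
                     std::size_t cols, const std::vector<cplx>& coeffs) {
    for (std::size_t r = 0; r < rows; ++r) {
        naive_dft_in_place(data.data() + r * cols, cols);
    }
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t k = 0; k < cols; ++k) {
            data[r * cols + k] *= coeffs[k];
        }
    }
}

// Row-at-a-time ordering: transform one row, then scale it immediately,
// while its samples are still likely to be resident in cache.
void filter_row_at_a_time(std::vector<cplx>& data, std::size_t rows,
                          std::size_t cols, const std::vector<cplx>& coeffs) {
    for (std::size_t r = 0; r < rows; ++r) {
        cplx* row = data.data() + r * cols;
        naive_dft_in_place(row, cols);
        for (std::size_t k = 0; k < cols; ++k) {
            row[k] *= coeffs[k];
        }
    }
}
```

Both functions compute the same result; only the order in which memory is touched differs. Whether the fused ordering is faster on a given machine depends on the data size relative to the cache, so any speedup should be measured rather than assumed.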
                                     On the Cell/B.E. processor, profiling reveals that interpolation takes almost 40 times
                                     longer than matched filtering. A large amount of time is spent in a loop over data in the
                                     “range” direction (perpendicular to the flight path), performing a polar-to-rectangular
                                     coordinate conversion. A contribution from several inputs for each side-lobe of the sinc



          26                                LOVELY PROFESSIONAL UNIVERSITY