
Windows Programming




                                   5.1 Local vs Global Memory


                                   5.1.1 Global Memory

                                   Global memory resides in device memory, and device memory is accessed through 32-, 64-, or
                                   128-byte memory transactions. These memory transactions must be naturally aligned: only the
                                   32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first
                                   address is a multiple of their size) can be read or written by memory transactions. When a warp
                                   executes an instruction that accesses global memory, it coalesces the memory accesses of the
                                   threads within the warp into one or more of these memory transactions, depending on the size
                                   of the word accessed by each thread and the distribution of the memory addresses across the
                                   threads. In general, the more transactions are necessary, the more unused words are transferred
                                   in addition to the words accessed by the threads, reducing the instruction throughput accordingly.


                                          Example: If a 32-byte memory transaction is generated for each thread's 4-byte access,
                                   only 4 of every 32 bytes transferred are used, so throughput is divided by 8. How many
                                   transactions are necessary, and how throughput is ultimately affected, varies with the compute
                                   capability of the device.
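
The divide-by-8 figure can be checked with a small host-side model. The sketch below is only an illustration under simplifying assumptions: it counts the distinct aligned segments touched by one warp and ignores caching and the per-compute-capability coalescing rules. The helper names `count_segments` and `transfer_efficiency` are hypothetical and not part of any CUDA API.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: counts the distinct aligned segments of
// `segment_bytes` touched when each of `warp_size` threads accesses one
// word of `word_bytes`, with thread i starting at base + i * stride_bytes.
std::size_t count_segments(std::uint64_t base, std::size_t stride_bytes,
                           std::size_t word_bytes, std::size_t segment_bytes,
                           std::size_t warp_size = 32) {
    std::set<std::uint64_t> segments;
    for (std::size_t i = 0; i < warp_size; ++i) {
        std::uint64_t first = base + i * stride_bytes;
        std::uint64_t last = first + word_bytes - 1;
        // A word may straddle a segment boundary, so record every segment it touches.
        for (std::uint64_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Hypothetical helper: fraction of the transferred bytes that the threads
// actually asked for (1.0 means no wasted bandwidth in this simple model).
double transfer_efficiency(std::uint64_t base, std::size_t stride_bytes,
                           std::size_t word_bytes, std::size_t segment_bytes,
                           std::size_t warp_size = 32) {
    std::size_t n = count_segments(base, stride_bytes, word_bytes,
                                   segment_bytes, warp_size);
    return static_cast<double>(warp_size * word_bytes) /
           static_cast<double>(n * segment_bytes);
}
```

In this model, 32 threads reading consecutive 4-byte words from an aligned base fit in a single 128-byte segment, while a 32-byte stride forces one 32-byte segment per thread, giving an efficiency of 4/32 = 1/8.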
                                   For devices of compute capability 1.0 and 1.1, the requirements on the distribution of the
                                   addresses across the threads to obtain any coalescing at all are very strict. They are much more
                                   relaxed for devices of higher compute capabilities. For devices of compute capability 2.0, the
                                   memory transactions are cached, so data locality is exploited to reduce the impact on throughput.
                                   To maximize global memory throughput, it is therefore important to maximize coalescing by:
                                      Following the optimal access patterns.
                                      Using data types that meet the size and alignment requirement described below.
                                      Padding data in some cases, for instance, when accessing a two-dimensional array.
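
The padding item in the list above can be sketched on the host as follows. This is a minimal illustration, assuming a caller-chosen alignment (128 bytes is used in the note below only as an example); on a real device the CUDA runtime function `cudaMallocPitch` selects the pitch. The helper names `padded_pitch` and `element_offset` are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: rounds the width of one row, in bytes, up to the
// next multiple of `alignment`, so that every row starts on an aligned address
// (provided the base address of the array is itself aligned).
std::size_t padded_pitch(std::size_t width_elems, std::size_t elem_bytes,
                         std::size_t alignment) {
    std::size_t row_bytes = width_elems * elem_bytes;
    return (row_bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: byte offset of element (row, col) in a padded
// two-dimensional array whose rows are `pitch` bytes apart.
std::size_t element_offset(std::size_t row, std::size_t col,
                           std::size_t pitch, std::size_t elem_bytes) {
    return row * pitch + col * elem_bytes;
}
```

For a row of 100 floats (400 bytes) and a 128-byte alignment, the padded pitch is 512 bytes, so the first element of every row sits at a multiple of 128 bytes from the base of the array.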

                                   Size and Alignment Requirement

                                   Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16
                                   bytes. Any access (through a variable or a pointer) to data residing in global memory compiles to
                                   a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16
                                   bytes and the data is naturally aligned (i.e. its address is a multiple of that size). If this size
                                   and alignment requirement is not fulfilled, the access compiles to multiple instructions with
                                   interleaved access patterns that prevent these instructions from fully coalescing. It is therefore
                                   recommended to use types that meet this requirement for data that resides in global memory.
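
The size and alignment requirement stated above can be written as a simple predicate. The sketch below only restates the rule from the text for a given address and type size; it does not query the compiler or the device, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// True if `size` is one of the word sizes a global memory instruction supports.
bool is_supported_word_size(std::size_t size) {
    return size == 1 || size == 2 || size == 4 || size == 8 || size == 16;
}

// True if `address` is a multiple of `size` (natural alignment).
bool is_naturally_aligned(std::uint64_t address, std::size_t size) {
    return size != 0 && address % size == 0;
}

// True if an access of `size` bytes at `address` meets both conditions
// that the text gives for compiling to a single global memory instruction.
bool meets_requirement(std::uint64_t address, std::size_t size) {
    return is_supported_word_size(size) && is_naturally_aligned(address, size);
}
```

For instance, an 8-byte access at address 16 meets the requirement, an 8-byte access at address 4 does not (misaligned), and a 12-byte type never does (unsupported size).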



                                     Did you know?  The alignment requirement is automatically fulfilled for built-in types.
                                   For structures, the size and alignment requirements can be enforced by the compiler using the
                                   alignment specifiers
                                   __attribute__((aligned(8)))
                                   or
                                   __attribute__((aligned(16)))
                                   such as
                                   struct { float a; float b; } __attribute__((aligned(8)));
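
The effect of the alignment specifiers can be observed on the host with `sizeof` and `alignof`. The sketch below assumes a GCC- or Clang-compatible compiler and a 4-byte `float`; the struct names are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>

// Three floats with default alignment: 12 bytes, aligned to 4.
// This size is not one of 1, 2, 4, 8, or 16 bytes.
struct PlainTriple {
    float x;
    float y;
    float z;
};

// Two floats forced to 8-byte alignment: 8 bytes, aligned to 8.
struct PairAligned8 {
    float a;
    float b;
} __attribute__((aligned(8)));

// Three floats forced to 16-byte alignment: the compiler pads the
// size from 12 up to 16 bytes so that array elements stay aligned.
struct TripleAligned16 {
    float x;
    float y;
    float z;
} __attribute__((aligned(16)));

// Compile-time checks, valid under the assumptions stated above.
static_assert(sizeof(PlainTriple) == 12, "expects 4-byte float");
static_assert(sizeof(PairAligned8) == 8 && alignof(PairAligned8) == 8,
              "pair should be 8 bytes, 8-byte aligned");
static_assert(sizeof(TripleAligned16) == 16 && alignof(TripleAligned16) == 16,
              "triple should be padded to 16 bytes, 16-byte aligned");
```

With the specifier, both the size and the alignment of the structure land on one of the supported values, which is what the requirement in the text asks for.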






          80                                LOVELY PROFESSIONAL UNIVERSITY