Page 87 - DCAP408_WEB_PROGRAMMING

Unit 5: Memory Management (I)




          or
          struct { float a; float b; float c; } __attribute__((aligned(16)));
          Any address of a variable residing in global memory, or returned by one of the memory
          allocation routines of the driver or runtime API, is always aligned to at least 256 bytes.
          Reading 8-byte or 16-byte words that are not naturally aligned produces incorrect results
          (off by a few words).
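As a rough illustration of the alignment requirement, the host-side C++ sketch below compares a plain three-float structure with one forced to 16-byte alignment. It uses the standard `alignas` specifier as a stand-in for the compiler-specific attribute shown above. The names `PlainFloat3`, `AlignedFloat3` and `is_aligned` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Three floats with default alignment (typically 4-byte aligned).
struct PlainFloat3 { float a; float b; float c; };

// Same members, but every object must start on a 16-byte boundary.
// The compiler pads the structure so that its size is a multiple of 16.
struct alignas(16) AlignedFloat3 { float a; float b; float c; };

static_assert(alignof(AlignedFloat3) == 16, "expected 16-byte alignment");
static_assert(sizeof(AlignedFloat3) % 16 == 0, "size must be a multiple of the alignment");

// Returns true if pointer p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Because the size of `AlignedFloat3` is a multiple of 16, every element of an array of this type starts on a 16-byte boundary, provided the array itself does.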


             Caution  Special care must be taken to preserve alignment of the beginning address of any
             value or array of values of these types.
          A typical case where this might be easily overlooked is when using a custom global memory
          allocation scheme, whereby the allocation of multiple arrays (with multiple calls to
          cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of
          memory partitioned into multiple arrays, in which case the starting address of each array
          is offset from the block’s starting address.
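One way to keep every sub-array aligned in such a scheme is to round each array's byte offset up to an alignment boundary before placing it in the block. The sketch below shows only the offset arithmetic in plain host-side C++. The helper names `align_up` and `partition_block` are hypothetical, and the 256-byte figure used in the usage note is taken from the alignment guarantee stated earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round `offset` up to the next multiple of `alignment`.
// Assumption: `alignment` is a non-zero power of two.
inline std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Given the byte sizes of several arrays, compute the byte offset of each
// array inside one large block so that every array starts on an aligned
// boundary. `total_bytes` receives the size the single block must have.
inline std::vector<std::size_t> partition_block(
        const std::vector<std::size_t>& sizes_in_bytes,
        std::size_t alignment,
        std::size_t& total_bytes) {
    std::vector<std::size_t> offsets;
    offsets.reserve(sizes_in_bytes.size());
    std::size_t cursor = 0;
    for (std::size_t size : sizes_in_bytes) {
        cursor = align_up(cursor, alignment);  // skip padding bytes if needed
        offsets.push_back(cursor);
        cursor += size;
    }
    total_bytes = cursor;
    return offsets;
}
```

For example, arrays of 100, 300 and 8 bytes with a 256-byte alignment are placed at offsets 0, 256 and 768, and the block must be at least 776 bytes. If the block's base address is itself 256-byte aligned, as is the case for an address returned by cudaMalloc(), then base plus each offset is aligned as well.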

          Two-Dimensional Arrays

          A common global memory access pattern is when each thread of index (tx,ty) uses the
          following address to access one element of a 2D array of width width, located at address
          BaseAddress of type type*:
          BaseAddress + width * ty + tx
          For these accesses to be fully coalesced, both the width of the thread block and the
          width of the array must be a multiple of the warp size (or only half the warp size for
          devices of compute capability 1.x). In particular, this means that an array whose width is
          not a multiple of this size will be accessed much more efficiently if it is actually
          allocated with a width rounded up to the closest multiple of this size and its rows padded
          accordingly.
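The rounding and indexing described above can be sketched as the following host-side C++ arithmetic. The function names `padded_width` and `element_index` are hypothetical. The granularity is passed in as a parameter because it depends on the device (the warp size, or half of it for compute capability 1.x).

```cpp
#include <cassert>
#include <cstddef>

// Round an array width (in elements) up to the nearest multiple of
// `granularity`, for example the warp size.
inline std::size_t padded_width(std::size_t width, std::size_t granularity) {
    assert(granularity != 0);
    return ((width + granularity - 1) / granularity) * granularity;
}

// Index of element (tx, ty) in a row-major 2D array whose rows are stored
// `pitch` elements apart; this mirrors BaseAddress + width * ty + tx,
// with the padded width used in place of the logical width.
inline std::size_t element_index(std::size_t tx, std::size_t ty,
                                 std::size_t pitch) {
    return pitch * ty + tx;
}
```

With a granularity of 32, for instance, a 100-element-wide array would be stored with rows of 128 elements, and element (3, 2) would sit at index 128 * 2 + 3 = 259. The CUDA runtime also offers cudaMallocPitch(), which chooses a suitable row pitch when allocating 2D arrays.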

          5.1.2 Local Memory

          Local memory accesses only occur for some automatic variables. Automatic variables that the
          compiler is likely to place in local memory are:

              Arrays for which it cannot determine that they are indexed with constant quantities,
              Large structures or arrays that would consume too much register space,
              Any variable if the kernel uses more registers than available (this is also known as
               register spilling).





             Notes  Note that some mathematical functions have implementation paths that might
             access local memory. The local memory space resides in device memory, so local memory
             accesses have the same high latency and low bandwidth as global memory accesses and are
             subject to the same requirements for memory coalescing.




                                           LOVELY PROFESSIONAL UNIVERSITY                                   81