Page 87 - DCAP408_WEB_PROGRAMMING
Unit 5: Memory Management (I)
struct { float a; float b; float c; } __attribute__((aligned(16)));
Any address of a variable residing in global memory, or returned by one of the memory allocation
routines from the driver or runtime API, is always aligned to at least 256 bytes. Reading
non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words).
Caution Special care must be taken to maintain the alignment of the starting address of any
value or array of values of these types.
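As an illustration, the effect of the alignment specifier can be checked on the host. The following is a minimal sketch in C, assuming a GCC-compatible compiler; the names Float3Aligned and is_aligned are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as the structure shown earlier: 12 bytes of data,
   forced to 16-byte alignment (the struct name is hypothetical). */
struct Float3Aligned {
    float a;
    float b;
    float c;
} __attribute__((aligned(16)));

/* Returns 1 if ptr is a multiple of `alignment` bytes, otherwise 0.
   `alignment` must be a nonzero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

With this specifier the compiler pads the structure from 12 to 16 bytes, so every element of an array of such structures starts on a 16-byte boundary, provided the array itself starts on one.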
A typical case where this alignment requirement might easily be overlooked is when using a custom
global memory allocation scheme, whereby the allocation of multiple arrays (with multiple calls to
cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory
partitioned into multiple arrays, in which case the starting address of each array is offset from
the block's starting address.
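One way to keep every sub-array aligned in such a scheme is to round each offset up to the alignment boundary before placing the next array. The sketch below shows only the offset arithmetic in plain C. The helper names (align_up, plan_partition) and the constant BLOCK_ALIGNMENT are assumptions made for illustration, not part of any CUDA API.

```c
#include <assert.h>
#include <stddef.h>

/* Alignment assumed for the start of the large block (256 bytes, per the text). */
#define BLOCK_ALIGNMENT 256

/* Round `offset` up to the next multiple of `alignment` (a nonzero power of two). */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Given the byte sizes of `count` sub-arrays, store an aligned byte offset
   for each one in `offsets` and return the total number of bytes to allocate. */
static size_t plan_partition(const size_t *sizes, size_t *offsets, size_t count)
{
    size_t cursor = 0;
    for (size_t i = 0; i < count; ++i) {
        cursor = align_up(cursor, BLOCK_ALIGNMENT); /* skip padding bytes */
        offsets[i] = cursor;
        cursor += sizes[i];
    }
    return cursor;
}
```

If the block's base address is itself 256-byte aligned, then base plus offsets[i] is 256-byte aligned for every sub-array, at the cost of a few unused padding bytes between arrays.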
Two-Dimensional Arrays
A common global memory access pattern is when each thread of index (tx,ty) uses the
following address to access one element of a 2D array of width width, located at address
BaseAddress of type type*:
BaseAddress + width * ty + tx
For these accesses to be fully coalesced, both the width of the thread block and the
width of the array must be a multiple of the warp size (or only half the warp size for devices
of compute capability 1.x). In particular, this means that an array whose width is not a multiple
of this size will be accessed much more efficiently if it is actually allocated with a width
rounded up to the closest multiple of this size and its rows padded accordingly.
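The rounding described above can be sketched as follows. This is a host-side illustration in plain C of the arithmetic only; the function names padded_width and element_index are hypothetical, and the padded width is measured in elements, not bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Round `width` (in elements) up to the closest multiple of `multiple`,
   for example the warp size. */
static size_t padded_width(size_t width, size_t multiple)
{
    assert(multiple != 0);
    return ((width + multiple - 1) / multiple) * multiple;
}

/* Linear index of element (tx, ty) in a row-padded array:
   the padded width replaces `width` in BaseAddress + width * ty + tx. */
static size_t element_index(size_t pitch, size_t tx, size_t ty)
{
    assert(tx < pitch);
    return pitch * ty + tx;
}
```

For example, with a warp size of 32, an array that is 100 elements wide would be allocated with rows of 128 elements, leaving the last 28 elements of each row unused. In CUDA code, the runtime function cudaMallocPitch() can be used to allocate row-padded 2D arrays of this kind.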
5.1.2 Local Memory
Local memory accesses occur only for some automatic variables. Automatic variables that the
compiler is likely to place in local memory are:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than are available (this is also known as
register spilling).
Notes Note that some mathematical functions have implementation paths that might access
local memory. The local memory space resides in device memory, so local memory accesses
have the same high latency and low bandwidth as global memory accesses and are subject
to the same requirements for memory coalescing.
LOVELY PROFESSIONAL UNIVERSITY 81