Page 86 - DCAP408_WEB_PROGRAMMING
Windows Programming
5.1 Local vs Global Memory
5.1.1 Global Memory
Global memory resides in device memory, and device memory is accessed through 32-, 64-, or
128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-,
64-, or 128-byte segments of device memory that are aligned to their size (i.e. whose first address
is a multiple of their size) can be read or written by memory transactions. When a warp
executes an instruction that accesses global memory, it coalesces the memory accesses of the
threads within the warp into one or more of these memory transactions, depending on the size of the
word accessed by each thread and the distribution of the memory addresses across the threads.
In general, the more transactions are necessary, the more unused words are transferred in addition
to the words accessed by the threads, reducing the instruction throughput accordingly.
Example: If a 32-byte memory transaction is generated for each thread's 4-byte access,
throughput is divided by 8, because only 4 of every 32 bytes transferred are used. How many
transactions are necessary, and how much throughput is ultimately affected, varies with the
compute capability of the device.
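The coalescing arithmetic above can be sketched on the host with a simplified model. The helper names `count_transactions` and `bytes_transferred` are hypothetical and not part of any CUDA API. The model assumes one fixed segment size and a regular stride between threads, and it ignores the device-specific rules that depend on compute capability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not a CUDA API): counts the distinct aligned segments
// touched when `threads` threads each access one word of `word_bytes` at
// address base + i * stride_bytes. Simplified model of coalescing.
std::size_t count_transactions(std::uint64_t base, std::size_t stride_bytes,
                               std::size_t word_bytes, std::size_t segment_bytes,
                               std::size_t threads = 32) {
    assert(word_bytes > 0 && segment_bytes > 0);
    std::set<std::uint64_t> segments;  // indices of the aligned segments touched
    for (std::size_t i = 0; i < threads; ++i) {
        const std::uint64_t first = base + i * stride_bytes;
        const std::uint64_t last = first + word_bytes - 1;
        // A word that straddles a segment boundary touches more than one segment.
        for (std::uint64_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Total bytes moved for the same access pattern (used and unused words alike).
std::size_t bytes_transferred(std::uint64_t base, std::size_t stride_bytes,
                              std::size_t word_bytes, std::size_t segment_bytes,
                              std::size_t threads = 32) {
    return count_transactions(base, stride_bytes, word_bytes, segment_bytes, threads)
           * segment_bytes;
}
```

Under these assumptions, 32 threads reading consecutive 4-byte words from an aligned base touch a single 128-byte segment. With a 32-byte stride and 32-byte segments, the same warp needs 32 transactions and moves 1024 bytes to deliver 128 useful bytes, which is the factor of 8 in the example.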
For devices of compute capability 1.0 and 1.1, the requirements on the distribution of the addresses
across the threads to obtain any coalescing at all are very strict. They are much more relaxed
for devices of higher compute capability. For devices of compute capability 2.0, the memory
transactions are cached, so data locality is exploited to reduce the impact on throughput. To
maximize global memory throughput, it is therefore important to maximize coalescing by:
Following the optimal access patterns.
Using data types that meet the size and alignment requirement, as illustrated below.
Padding data in some cases, for instance, when accessing a two-dimensional array, as illustrated below.
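As a rough illustration of the padding point, the sketch below rounds each row of a two-dimensional array up to a multiple of an alignment boundary so that every row starts on an aligned address. The names `padded_pitch` and `element_offset` are hypothetical, and the 128-byte boundary in the usage note is an assumption chosen for illustration. On the device, the CUDA runtime function `cudaMallocPitch` serves a similar purpose.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row size in bytes, rounded up to a multiple of
// align_bytes, so that each row of the array starts on an aligned address.
std::size_t padded_pitch(std::size_t width_elems, std::size_t elem_bytes,
                         std::size_t align_bytes) {
    assert(align_bytes > 0);
    const std::size_t row_bytes = width_elems * elem_bytes;
    return (row_bytes + align_bytes - 1) / align_bytes * align_bytes;
}

// Byte offset of element (row, col) in an array whose rows are pitch_bytes apart.
std::size_t element_offset(std::size_t row, std::size_t col,
                           std::size_t pitch_bytes, std::size_t elem_bytes) {
    return row * pitch_bytes + col * elem_bytes;
}
```

For a row of 100 four-byte elements (400 bytes) and a 128-byte boundary, `padded_pitch` returns 512, so each row carries 112 bytes of padding. A row of 64 such elements is already a multiple of 128 bytes and needs none.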
Size and Alignment Requirement
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes.
Any access (through a variable or a pointer) to data residing in global memory compiles to a
single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes
and the data is naturally aligned (i.e. its address is a multiple of that size). If this size and
alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved
access patterns that prevent these instructions from fully coalescing. It is therefore recommended
to use types that meet this requirement for data that resides in global memory.
Did u know? The alignment requirement is automatically fulfilled for built-in types.
For structures, the size and alignment requirements can be enforced by the compiler using the
alignment specifiers
__attribute__ ((aligned(8)))
or
__attribute__ ((aligned(16))), such as
struct { float a; float b; } __attribute__((aligned(8)));
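The effect of the alignment specifier can be checked at compile time on the host. This is a minimal sketch that assumes a GCC- or Clang-compatible compiler and a platform where `float` is 4 bytes with 4-byte alignment. The struct names and the helper `meets_requirement` are hypothetical and introduced only for illustration.

```cpp
#include <cstddef>

// Without a specifier: size 8, but the alignment is only that of float.
struct Pair { float a; float b; };
// With the specifier: size 8 and alignment 8.
struct AlignedPair { float a; float b; } __attribute__((aligned(8)));
// Three floats: size 12, which is not one of 1, 2, 4, 8, or 16.
struct Triple { float x; float y; float z; };
// aligned(16) also pads the size from 12 up to 16.
struct AlignedTriple { float x; float y; float z; } __attribute__((aligned(16)));

// Hypothetical check: true if the size of T is 1, 2, 4, 8, or 16 bytes and every
// object of T is guaranteed to sit at an address that is a multiple of that size.
template <typename T>
constexpr bool meets_requirement() {
    return (sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 ||
            sizeof(T) == 8 || sizeof(T) == 16) &&
           alignof(T) % sizeof(T) == 0;
}

static_assert(meets_requirement<float>(), "built-in type");
static_assert(!meets_requirement<Pair>(), "8 bytes, but only 4-byte aligned");
static_assert(meets_requirement<AlignedPair>(), "8 bytes, 8-byte aligned");
static_assert(!meets_requirement<Triple>(), "12 bytes is not an allowed size");
static_assert(meets_requirement<AlignedTriple>(), "padded to 16 bytes");
```

The 12-byte case shows why the specifier covers size as well as alignment: the compiler rounds the structure size up to the requested alignment, so `AlignedTriple` occupies 16 bytes.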
80 LOVELY PROFESSIONAL UNIVERSITY