We have learned about shared memory in CUDA in my previous articles. We also saw
shared memory in context of synchronization.
In this article I’ll let you know detail description about Shared memory in CUDA. Since shared memory is limited in terms of size so, some limitation came out.
In this article I’ll let you know detail description about Shared memory in CUDA. Since shared memory is limited in terms of size so, some limitation came out.
This article specifically dedicated to shared memory
bank conflict, if you don’t know what shared memory is and how to use in CUDA,
please follow this article.
Let’s start our discussion on bank conflict in shared memory in CUDA.
Let’s start our discussion on bank conflict in shared memory in CUDA.
CUDA is the parallel programming model to write general purpose
parallel programs that will be executed on the GPU. Bank conflicts in GPUs are specific to shared memory and it is
one of the many reasons to slow down the GPU kernel. Bank conflicts arise because of some specific access pattern of data in
shared memory. It also depends on the hardware. For example, a bank
conflict on a GPU device with compute capability 1.x may not be a bank conflict
on a device with compute capability 2.x.
For better understanding about the bank
conflict in shared memory, one should very clear about concept of Wrap. If you feel
that you have little doubt over wrap, I refer you to go through this article.
Shared
memory Banks
Background
Since fast shared memory access is restricted
to threads in a block. The shared memory is divided into multiple banks
(similar to banks in DRAM modules). Each bank can service only one request at a
time. The shared memory is therefore interleaved to increase the throughput. If
the shared memory is interleaved by 32 bits, then the bandwidth of each bank is
32 bits or one float data type. The total number of banks is fixed. It is 16 on older GPUs (with compute capability 1.x ) and
32 on modern GPUs (with compute capability 2.x).
Because it is on-chip, shared memory is much
faster than local and global memory. Shared
memory latency is roughly 100x lower than global memory latency.
Memory bank
To achieve high memory bandwidth for concurrent accesses, shared
memory is divided into equally sized
memory modules, called banks that
can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans n distinct memory banks can be serviced
simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single bank.
What’s wrong? Bank
Conflict
If multiple addresses of a memory request map to the same memory bank, the accesses are serialized. The hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests.
Organization of Shared
memory Banks in CUDA
Shared memory banks are organized such that successive 32-bit
words are assigned to successive banks and each bank has a bandwidth of 32 bits
per clock cycle. The bandwidth of shared
memory is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number
of banks is 16.
How the request to
shared memory works in a Wrap
A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.
Shared memory enables cooperation between threads in a block. When
multiple threads in a block use the same data from global memory, shared memory
can be used to access the data from global memory only once. Shared memory can
also be used to avoid un-coalesced
memory (discussed below) accesses by loading and storing data in a coalesced pattern (discussed below) from
global memory and then reordering it in shared memory. Aside from memory bank conflicts, there is no penalty for
non-sequential or unaligned accesses by a half warp in shared memory.
For better understanding we should
know about what is Coalesced memory in CUDA, that I have discussed below.
Coalesced
memory in CUDA
A coalesced memory transaction is one in which all of the threads
in a half-warp access global memory at the same time. This is over simple, but
the correct way to do it is just have consecutive threads access consecutive
memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8,
and 0xc, it should be a coalesced read
Example
Let’s say we have a matrix which isreside linearly in memory. You
can do this however you want, and your memory access should reflect how your
matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
|
Could be done row after row, like this, so that (r,c) maps to
memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
|
Suppose you need to
access element once, and say you have four threads. Which threads will be used
for which element? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
|
Which is better? Which will result in coalesced
reads, and which will not?
Either way, each thread makes three accesses. Let's look at the
first access and see if the threads access memory consecutively. In the first
option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The
second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
The best way is probably to write your kernel and then profile it
to see if you have non-coalesced global loads and stores.
Bonus: The number of coalesced and un-coalesced memory transactions in
GPU
I’ll
describe in more detail in my future article.
Now back to
the picture.
So we have learned that a bank conflict arises if any of the
threads in a half warp access different words in the same bank. When a bank
conflict happens the access to the data is serialized.
So, when
bank conflict happens or when not, how we recognize that?
Let’s go though some example to understand.
Let’s go though some example to understand.
No Bank Conflict
Liner addressing
Let say, I have a shared memory of 32 floats inside the kernel which look like this.
__shared__ float
shared[32];
|
We have declared an array of shared memory of float type. Now we
access data though shared memory such as,
float data = shared[
BaseIndex + S*tid];
where “S” is
known as Stribe
|
Memory accessing pattern is show in below figure, with S = 1 and S
= 3
Fig 1
Clearly in memory access pattern, it is clearer that each thread
in a warp access a different bank like thread
0 access bank 0 or thread 15 access bank 15 and so on, on the other fig. thread
0 access bank 0, thread 1 access bank 3 and so on.
So, clearly each request made concurrently (In parallels). SO, No Bank Conflict.
So, clearly each request made concurrently (In parallels). SO, No Bank Conflict.
But why there is no bank conflict?
Think!!!
This is only bank-conflict-free if S shares no common factors with the number of ad banks.
This is only bank-conflict-free if S shares no common factors with the number of ad banks.
Example: 16 banks
on G80, so S must be odd
Example
Scenario
Let’s
say we have an array of size 256 of integer type in global memory and we have
256 threads in a single Block, and we want to copy the array to shared memory.
Therefore every thread copies one element.
shared_a[threadIdx.x] = global_a[threadIdx.x];
|
So, what u think, does it trap into bank conflict? (Before reading
answer, think first)
Ok Ok!!
First let’s assume your arrays are say for example of the type int (a 32-bit word). Your code saves these ints into shared memory, across any half warp the Kth thread is saving to the Kth memory bank. So for example thread 0 of the first half warp will save to shared_a[0] which is in the first memory bank, thread 1 will save to shared_a[1], each half warp has 16 threads these map to the 16 4byte banks. In the next half warp, the first thread will now save its value into shared_a[16] which is in the first memory bank again. So if you use a 4byte word such int, float etc, then this example will not result in a bank conflict.
First let’s assume your arrays are say for example of the type int (a 32-bit word). Your code saves these ints into shared memory, across any half warp the Kth thread is saving to the Kth memory bank. So for example thread 0 of the first half warp will save to shared_a[0] which is in the first memory bank, thread 1 will save to shared_a[1], each half warp has 16 threads these map to the 16 4byte banks. In the next half warp, the first thread will now save its value into shared_a[16] which is in the first memory bank again. So if you use a 4byte word such int, float etc, then this example will not result in a bank conflict.
Be Cautions
If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict known as 4-way bank conflict (we’ll discussed later) .
If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict known as 4-way bank conflict (we’ll discussed later) .
Question
time!!! So, tell me is there any bank conflict in the following statement?
foo
= shared[baseIndex + threadIdx.x]
|
A>
If data is type of 32-bits; 4 Byte?
B>
If data is type of 8-bits; 1 Byte?
You can
answer these questions as comment to this post. J
Random addressing
In
Random addressing itself there is no bank conflict. Shown in below figure,
Fig
2
Bank Conflicts
2 way bank conflict
Let say, I have a shared memory of 32 doubles/shorts inside the kernel which look like this.
__shared__double
shared[32];
Or
__shared__short
shared[32];
|
We have declared an array of shared memory of double/short type.
Now we access data though shared memory such as,
double data = shared[
BaseIndex + S*tid];
or
short data = shared[
BaseIndex + S*tid];
where “S” is
known as Stribe.
|
Memory accessing pattern is show in below figure, with S = 1
Fig 3
Clearly in memory access pattern, it is clearer that two threads
in a warp access a same bank memory location like thread 0 access bank 0 or thread 8 access bank 0 and so on. So,
clearly these requests become serialize, Known as 2-Way bank conflict.
For example, refer last example the
only changes is instead of int’s make them double/short.
4 way bank conflict
Let say, I have a shared memory of 32 char inside the kernel which look like this.
__shared__double
shared[32];
Or
__shared__short
shared[32];
|
We have declared an array of shared memory of char type. Now we
access data though shared memory such as,
double data = shared[
BaseIndex + S*tid];
or
short data = shared[
BaseIndex + S*tid];
where “S” is
known as Stribe.
|
Memory accessing pattern is show in below figure, with S = 1
Fig 4
Clearly in memory access pattern, it is clearer that four threads
in a warp access a same bank memory location like thread 0,1,2,3 access bank 0 and so on. So, clearly these requests
become serialize, Known as 4-Way bank conflict.
For example, refer last example the
only changes is instead of int’s make them char.
Structure
and Bank Conflict | Bank Conflict in Structure in CUDA
Now let us
discuss bank conflict in structures. In structures let say of floats or chars
or ints or doubles or even combination of these, we need to take special care
about it. So, let us first understand how the bank conflict arises in structure
itself.
I’ll
explain it by examples !!!
Let us say
we have;
struct vector { float x, y, z; };
__shared__ struct vector vectors[64];
|
struct myType2 {float f; int c; };
__shared__ struct myType2 myTypes2[64];
|
struct myType8 {float x,y,z,w; };
__shared__ struct myType8 myTypes8[64];
|
So,
according to you which have bank
conflict and which don’t have?
Ok
Ok!!
Example 1: This
has no bank conflicts for vector; struct
size is 3 words
3 accesses per thread, contiguous banks (no common
factor with 16)
Thread
struct vector v = vectors [ baseIndex + threadIdx.x];
Example 2: This
has 2 way bank conflicts for myType2;
(2 accesses per thread) (Have common factor with
16; 2 )
struct myType2 m
= myTypes4[baseIndex + threadIdx.x];
Example 3: This
has 8 way bank conflicts for myType8;
(8 accesses per thread) (Have common factor with
16; 8 )
struct myType8 m
= myTypes8[baseIndex + threadIdx.x];
Fig
5
Are
you still having confusion? if Yes, you may not catch the Eagle, let me do for
you, the all movie is end up with this statement;
This is only bank-conflict-free if S shares no common factors with
the number of ad banks.
Example: 16 banks
on G80, so S must be odd, if S
is even then there is a bank conflict.
How to avoid bank conflict?
•The old fashion
method: (don’t use it)
__shared__
intshared_lo[32];
__shared__
intshared_hi[32];
double
dataIn;
shared_lo[BaseIndex+tid]=
__double2loint(dataIn);
shared_hi[BaseIndex+tid]=
__double2hiint(dataIn);
double
dataOut =__hiloint2double( shared_hi[BaseIndex+tid],
shared_lo[BaseIndex+tid] );
|
•For array of structures, bank conflict can be reduced by changing
it to structure of array. “Memory padding”(we’ll discuss later)
Now time for
exercise and some real application examples.
Common Array Bank Conflict Patterns 1D
Start with this, each
thread loads 2 elements into shared memory:
int tid = threadIdx.x;
shared[2*tid] =
global[2*tid];
shared[2*tid + 1 ] = global[2*tid + 1];
|
Does
it have bank conflict?
Did you say yes and 2-Way bank conflict? OH!!, yes it has 2-way bank conflict. Since, 2-way-interleaved loads result in 2-way bank conflicts. Great Job!!!
Did you say yes and 2-Way bank conflict? OH!!, yes it has 2-way bank conflict. Since, 2-way-interleaved loads result in 2-way bank conflicts. Great Job!!!
Note:
This makes sense for traditional CPU
thread, locality in cache line usage and reduce sharing traffic but not in shared memory usage where there is no cache
line effects but banking effects. ;)
So,
how you’ll solve this problem ? do you have any solution for this?
I have!!
I have!!
A
Better Array Access Pattern:
Each thread loads one element in every
consecutive group of blockDim elements, like this;
shared[tid] =
global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];
|
Fig 6
Common Bank Conflict
Patterns (2D)
Suppose you are operating on 2D array of floats in shared memory
like when we do image processing. Let say our block size is 16x16 (don’t bother
about grid size). So our scenario is that each thread processes a row, so
threads in a block access the element in each column simultaneously(example:
row 1 with in purple color fig. 7)
Fig 7
So, we suffer here with 16-way bank conflict, since all row start
with 0. L
Solution
Fortunately,
we have solutions.
1>
Memory padding; Add one float to end
of each row.
2>
Transpose the matrix before
processing,
3>
Change the address pattern.
Unfortunately,
second one may have bank conflict, so for this article I’ll concern about only
memory padding and in the future article I’ll let you know the solution with
second option. For time being concentrate on first solution.
Memory
padding like as shown in fig.
Fig 8
Solutions
is very simple with Memory padding. Instead of creating shared memory as
__shared__ int shared[TILE_WIDHT][TILE_HEIGHT] ;
|
Create it
as
__shared__ int shared[TILE_WIDHT][TILE_HEIGHT + 1 ]
;
|
And fill the unused space with 0’s. Since C/C++ is row major, the leading dimension is the row-width
(number of elements in a row). J
I’ll describe you memory padding with pitch that is the correct
way to pad memory, in future article. [since it is itself a long article J]
I hope you must like this article and
have learned Shared memory bank
conflict and how to avoid them in CUDA.
Got
Questions?
Feel
free to ask me any question because I'd be happy to walk you through step by
step!