I have
talked about texture memory in CUDA in this
article previously. In the previous
article, we were talking about Texture Reference. Now in this article I’m
introducing new feature of Kepler architecture called Texture object.
For this, we
recap some of the important concept of texture memory that we have discussed in
details in this
article previously. Feel free to brows this
article for more theory and example.
TEXTURE Memory
Texture memory
is read from kernels using the device functions described in Texture
Functions. The process of reading a texture calling one of these functions
is called a texture fetch.
Each texture fetch specifies a parameter called a texture object for
the texture object API or a texture reference for
the texture reference API.
the
TEXTURE
Texture which is the
piece of texture memory that is fetched. Texture
objects are created at runtime and the texture is specified when creating
the texture object as described in Texture Object API.
Texture references are created at compile time and the texture is specified at
runtime by bounding the texture reference to the texture through runtime
functions as described in Texture
Reference API; several distinct texture references might
be bound to the same texture or to textures that overlap in memory. A texture
can be any region of linear memory or a CUDA array (described in CUDA
Arrays).
TEXTURE
dimensionality
Its dimensionality
that
specifies whether the texture is addressed as a one dimensional array using one
texture coordinate, a two-dimensional array using two texture coordinates, or a
three-dimensional array using three texture coordinates. Elements of the array
are called texels, short
for texture
elements.
The texture
width,
height, and depth refer to
the size of the array in each dimension. Table Wiki lists the maximum texture
width, height, and depth depending on the compute capability of the device.
TEXTURE
type (Texel)
The type of a
texel, which is restricted to the basic integer and single-precision floating-point
types and any of the 1-, 2-, and 4-component vector types defined in char,
short, int, long, long long, float, double.
TEXTURE
Read Mode
The read mode, which is
equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is
cudaReadModeNormalizedFloat and the type
of the texel is a 16-bit or 8-bit
integer type, the value returned by the texture fetch is actually returned
as floating-point type and the full range of the integer type is mapped to
[0.0, 1.0] for unsigned integer type and [-1.0, 1.0] for signed integer type; for
example, an unsigned 8-bit texture element with the value 0xff reads as 1. If
it is
cudaReadModeElementType, no
conversion is performed.
TEXTURE
Normalized coordinates
Whether texture
coordinates are normalized or not. By default, textures are referenced (by the
functions of Texture Functions) using floating-point
coordinates in the range [0, N-1] where N is the size of the texture in the
dimension corresponding to the coordinate. For example, a texture that is 64x32
in size will be referenced with coordinates in the range [0, 63] and [0, 31]
for the x and y dimensions, respectively. Normalized texture coordinates cause
the coordinates to be specified in the range [0.0, 1.0-1/N] instead of [0,
N-1], so the same 64x32 texture would be addressed by normalized coordinates in
the range [0, 1-1/N] in both the x and y dimensions. Normalized texture
coordinates are a natural fit to some applications' requirements, if it is
preferable for the texture coordinates to be independent of the texture size.
TEXTURE
addressing mode
It is valid to call the
device functions with coordinates that are out of range. The addressing mode
defines what happens in that case. The
default addressing mode is to clamp the coordinates to the valid range: [0,
N) for non-normalized coordinates and [0.0, 1.0) for normalized coordinates. If
the border mode is specified instead, texture fetches with out-of-range texture
coordinates return zero. For normalized coordinates, the warp mode
and the mirror mode are also available. When using the wrap mode, each coordinate x is converted to frac(x)= x floor(x) where floor(x) is the largest integer not greater than x. When using
the mirror mode, each coordinate x is
converted to frac(x)
if
floor(x)
is
even and 1-frac(x)
if
floor(x)
is
odd. The addressing mode is specified as an array of size three whose first,
second, and third elements specify the addressing mode for the first, second,
and third texture coordinates, respectively; the addressing mode are cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap
and
cudaAddressModeMirror; cudaAddressModeWrap and
cudaAddressModeMirror are only
supported for normalized texture coordinates.
TEXTURE
filtering mode
The filtering mode which specifies how the value returned
when fetching the texture is computed based on the input texture coordinates.
Linear texture filtering may be done only for textures that are configured to
return floating-point data. It performs
Low-precision
interpolation between neighboring texels. When enabled, the texels surrounding
a texture fetch location are read and the return value of the texture fetch is
interpolated based on where the texture coordinates fell between the texels.
Simple linear interpolation is performed for one-dimensional textures, bilinear
interpolation for two-dimensional textures, and trilinear interpolation for
three-dimensional textures. Texture Fetching gives
more details on texture fetching. The filtering mode is equal to cudaFilterModePoint or
cudaFilterModeLinear.
If it is cudaFilterModePoint,
the returned value is the texel whose texture coordinates are the closest to
the input texture coordinates. If it is cudaFilterModeLinear,
the returned value is the linear interpolation of the two (for a
one-dimensional texture), four (for a two dimensional texture), or eight (for a
three dimensional texture) texels whose texture coordinates are the closest to
the input texture coordinates. cudaFilterModeLinear is
only valid for returned values of floating-point type.
TEXTURE
object api
A texture object is created
using cudaCreateTextureObject() from a
resource description of type struct cudaResourceDesc, which specifies the texture, and
from a texture description defined as such:
struct
cudaTextureDesc
{
enum cudaTextureAddressMode
addressMode[3];
enum cudaTextureFilterMode
filterMode;
enum cudaTextureReadMode
readMode;
int sRGB;
int normalizedCoords;
unsigned int maxAnisotropy;
enum cudaTextureFilterMode
mipmapFilterMode;
float mipmapLevelBias;
float minMipmapLevelClamp;
float maxMipmapLevelClamp;
};
|
·
addressMode specifies the addressing
mode;
·
filterMode specifies the filter
mode;
·
readMode specifies the read
mode;
·
normalizedCoords specifies
whether texture coordinates are normalized or not;
TEXTURE
REFERENCES
Textures
are likely a familiar concept to anyone who’s done much CUDA programming. A
feature from the graphics world, textures are images that are stretched,
rotated and pasted on polygons to form the 3D graphics we are familiar with.
Using textures for GPU computing has always been a pro tip for the CUDA
programmer; they enable fast random access to arrays and use a cache to provide
bandwidth aggregation. On the flip side, the legacy texture reference API is
cumbersome to use because it requires manual binding and unbinding of texture
references to memory addresses, as the following code demonstrates. Also,
texture references can only be declared as static global variables and cannot
be passed as function arguments.
#define N 1024
texture<float, 1, cudaReadModeElementType> tex;
// texture reference name must be known at compile time
__global__ void kernel() {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float x = tex1D(tex, i);
// do some work using x...
}
void call_kernel(float *buffer) {
// bind texture to buffer
cudaBindTexture(0, tex, buffer, N*sizeof(float));
kernel<<<grid, block>>>();
// unbind texture from buffer
cudaUnbindTexture(tex);
}
int main() {
// declare and allocate memory
float *buffer;
cudaMalloc(&buffer, N*sizeof(float));
call_kernel(buffer);
cudaFree(buffer);
}
TEXTURE
OBJECTS
Kepler GPUs and CUDA 5.0 introduce a new feature called texture
objects (sometimes called bindless textures,
since they don’t require manual binding/unbinding) that greatly improves the
usability and programmability of textures. Texture objects use the new
cudaTextureObject_t class API, whereby textures become first-class C++ objects
and can be passed as arguments just as if they were pointers. There is no need
to know at compile time which textures will be used at run time, which enables
much more dynamic execution and flexible programming, as shown in the following
code.
#define N 1024
// texture object is a kernel argument
__global__ void kernel(cudaTextureObject_t tex) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
float x = tex1Dfetch(tex, i);
// do some work using x ...
}
void call_kernel(cudaTextureObject_t tex) {
kernel <<<grid, block>>>(tex);
}
int main() {
// declare and allocate memory
float *buffer;
cudaMalloc(&buffer, N*sizeof(float));
// create texture object
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = buffer;
resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
resDesc.desc.x = 32; // bits per channel
resDesc.res.linear.sizeInBytes = N*sizeof(float);
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
// create texture object: we only have to do this once!
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
call_kernel(tex); // pass texture as argument
// destroy texture object
cudaDestroyTextureObject(tex);
cudaFree(buffer);
}
Moreover, texture objects only need to be instantiated once, and
are not subject to the hardware limit of 128 texture references, so there is no
need to continuously bind and unbind them. (The Kepler hardware limit is over
one million texture objects, a limit that will likely never be reached by any
practical CUDA application in the lifetime of the architecture!) Using texture
objects, the overhead of binding (up to 1 μs) and unbinding (up to 0.5 μs)
textures is eliminated. What is not commonly known is that each outstanding
texture reference that is bound when a kernel is launched incurs added launch
latency—up to 0.5 μs per texture reference. This launch overhead persists even
if the outstanding bound textures are not even referenced by the kernel. Again,
using texture objects instead of texture references completely removes this
overhead.
STRONG
SCALING IN HPC
While these overheads may sound minor for any application that
uses kernels that run for milliseconds or greater in duration, they are
important for latency-sensitive, short-running kernels. High Performance
Computing (HPC) applications that need strong scaling typify this use case.
Strong scaling is how the solution time varies with the number of processors
for a fixed global problem size. The goal is to solve a fixed problem in as
short a time as possible. There is usually a limit to strong scaling, as at
some point an application will become bound by the inter-processor
communication bandwidth or latencies that arise from doing less work on each processor.
Beyond this limit the total performance will actually decrease.
Figure
1: Strong-scaling performance of the QUDA solver using both texture references
and texture objects.
There are many large-scale HPC applications that will be running
on the Titan supercomputer at Oak Ridge National Laboratory in Tennessee, for
which achieving excellent strong scaling performance will be vital. Lattice
Quantum Chromodynamics (LQCD) is one such application. LQCD uses a combination
of sparse linear solvers, molecular dynamics algorithms and Monte Carlo methods
to probe the structure of the nucleus. Using the highly optimized QUDA library,
legacy LQCD applications MILC and Chroma can run on Titan making full use of
the attached Tesla K20X accelerators. The QUDA library makes extensive use of
textures, and using texture references the strong scaling tops out at 2
TFLOP/s, as you can see in Figure 1. Profiling using the CUDA visual profiler
revealed that the overhead of texture binding and unbinding contributed
significantly to time-critical communication routines running on the GPU.
Moreover, many of the kernels run for O(10μs) using up to ten textures
simultaneously. Thus the launch latency overhead is a significant bottleneck.
Once the library was rewritten to use texture objects, strong scaling
performance increased significantly and throughput increased to nearly 3
TFLOP/s at 27 GPUs (see Figure 1).
Texture objects are yet another powerful feature of the Kepler
architecture that make it easier to achieve high performance and scalability
with CUDA.
Another example
//
Simple transformation kernel
__global__
void transformKernel(float*
output, cudaTextureObject_t texObj, int width,
int height, float theta)
{
//
Calculate normalized texture coordinates
unsigned
int x = blockIdx.x * blockDim.x +
threadIdx.x;
unsigned
int y = blockIdx.y * blockDim.y +
threadIdx.y;
float
u = x / (float)width;
float
v = y / (float)height;
//
Transform coordinates
u -=
0.5f;
v -=
0.5f;
float
tu = u * cosf(theta) – v * sinf(theta) + 0.5f;
float
tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
//
Read from texture and write to global memory
output[y
* width + x] = tex2D<float>(texObj, tu, tv);
}
//
Host code
int main()
{
//
Allocate CUDA array in device memory
cudaChannelFormatDesc
channelDesc =cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray*
cuArray;
cudaMallocArray(&cuArray,
&channelDesc, width, height);
//
Copy to device memory some data located at address h_data
//
in host memory
cudaMemcpyToArray(cuArray,
0, 0, h_data, size,cudaMemcpyHostToDevice);
//
Specify texture
struct
cudaResourceDesc resDesc;
memset(&resDesc,
0, sizeof(resDesc));
resDesc.resType
= cudaResourceTypeArray;
resDesc.res.array.array
= cuArray;
//
Specify texture object parameters
struct
cudaTextureDesc texDesc;
memset(&texDesc,
0, sizeof(texDesc));
texDesc.addressMode[0]
= cudaAddressModeWrap;
texDesc.addressMode[1]
= cudaAddressModeWrap;
texDesc.filterMode
= cudaFilterModeLinear;
texDesc.readMode
= cudaReadModeElementType;
texDesc.normalizedCoords
= 1;
//
Create texture object
cudaTextureObject_t
texObj = 0;
cudaCreateTextureObject(&texObj,
&resDesc, &texDesc, NULL);
//
Allocate result of transformation in device memory
float*
output;
cudaMalloc(&output,
width * height * sizeof(float));
//
Invoke kernel
dim3
dimGrid((width + dimBlock.x – 1) / dimBlock.x, (height +
dimBlock.y – 1) / dimBlock.y);
transformKernel<<<dimGrid,
dimBlock>>>(output, texObj, width, height, angle);
//
Destroy texture object
cudaDestroyTextureObject(texObject);
//
Free device memory
cudaFreeArray(cuArray);
cudaFree(output);
return
0;
}
|
Reference
CUDA C programming Guide