Classes | Defines | Enumerations | Functions

MNCudaUtil.h File Reference

Provides useful macros and CUDA functions. More...

#include "MNUtilities.h"
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <cutil_math.h>


Classes

class  ReduceOperatorTraits< T, oper >
 Reduction operator traits class. More...

Defines

#define MNCUDA_CHECKERROR   mncudaSafeCallNoSync(mncudaCheckError(false))
 Performs a CUDA synchronization and checks for returned errors (optional in release mode).
#define mncudaSafeCallNoSync(err)   __mncudaSafeCallNoSync(err, __FILE__, __LINE__)
 CUDA error check macro. Comparable to cudaSafeCallNoSync, however using MNFatal().
#define mncudaCheckErrorCUtil(err)   __mncudaCheckErrorCUtil(err, __FILE__, __LINE__)
 CUDA CUtil error check macro. Uses MNFatal().
#define MNCUDA_DIVUP(count, chunkSize)   ((count) / (chunkSize) + (((count) % (chunkSize))?1:0))
 Divides count by chunkSize and adds 1 if there is some remainder.
#define MNCUDA_MAKEGRID2D(numBlocks, maxBlocks)   dim3(min((numBlocks), (maxBlocks)), 1 + (numBlocks) / (maxBlocks), 1)
 Avoids the maximum CUDA grid size by using two grid dimensions for a one-dimensional grid. I added this due to problems with exceeding 1D grid sizes.
#define MNCUDA_GRID2DINDEX   (blockIdx.x + (blockIdx.y*gridDim.x))
 Calculates the one-dimensional block index for the given 2D grid that was created by MNCUDA_MAKEGRID2D().
#define MNCUDA_ALIGN_BYTES(size, alignment)
 Computes the aligned byte size for a given alignment.
#define MNCUDA_ALIGN(count)   MNCUDA_ALIGN_EX(count, 16)
 Computes the aligned element count for an alignment of 16 elements.
#define MNCUDA_ALIGN_NZERO(count)   MNCUDA_ALIGN_EX((((count) == 0) ? 1 : (count)), 16)
 Computes the aligned element count with special treatment for zero counts.
#define MNCUDA_ALIGN_EX(count, alignment)
 Computes the aligned element count, extended version.

Enumerations

enum  MNCudaOP { MNCuda_ADD, MNCuda_SUB, MNCuda_MUL, MNCuda_DIV, MNCuda_MIN, MNCuda_MAX }
 

Operator types used in utility algorithms. These operators are usually used as binary operators, e.g. when working component-wise on two arrays.

More...

Functions

void __mncudaSafeCallNoSync (cudaError err, const char *file, const int line)
 Checks for CUDA errors without synchronization. Uses MNFatal() to report the error.
void __mncudaCheckErrorCUtil (CUTBoolean err, const char *file, const int line)
 Checks for CUDA CUtil errors without synchronization. Uses MNFatal() to report the error.
cudaError_t mncudaCheckError (bool bForce=true)
 Checks for pending errors and returns them.
void mncudaEnableErrorChecks (bool bEnable=true)
 Enables error checks using mncudaCheckError().
bool mncudaIsErrorChecking ()
 Gets whether error checking is enabled.
uint mncudaGetMaxBlockSize (uint reqSharedMemPerThread, uint maxRegPerThread, bool useMultipleWarpSize=true)
 Computes best thread block size for given shared memory requirement.
void mncudaInitIdentity (uint *d_buffer, uint count)
 Initializes the given buffer with the identity relation, that is buffer[i] = i.
template<class T >
void mncudaInitConstant (T *d_buffer, uint count, T constant)
 Initializes the given buffer with a constant value.
void mncudaAddIdentity (uint *d_buffer, uint count)
 Adds the index to all elements of the given buffer.
template<MNCudaOP op, class T >
void mncudaConstantOp (T *d_array, uint count, T constant)
 Performs constant operation on all array elements:
template<class V , class S >
void mncudaScaleVectorArray (V *d_vecArray, uint count, S scalar)
 Scales given vector array by given scalar (component wise).
template<class V , class S >
void mncudaAverageArray (V *d_array, uint count, S *d_counts)
 Averages given vector array (type of V) by dividing each element by the corresponding count.
void mncudaInverseBinary (uint *d_buffer, uint count)
 Inverts the given "binary" 0-1 buffer.
template<MNCudaOP op, class T >
void mncudaArrayOp (T *d_target, T *d_other, uint count)
 Performs an operation on the given two arrays.
template<class T >
uint mncudaResizeMNCudaMem (T **d_buffer, uint numOld, uint numRequest, uint slices=1)
 Resizes the given buffer from numOld to at least numRequest elements.
template<class T >
cudaError_t mncudaReduce (T &result, T *d_data, uint count, MNCudaOP op, T identity)
 Performs reduction on d_data.
template<class T >
cudaError_t mncudaSegmentedReduce (T *d_data, uint *d_owner, uint count, MNCudaOP op, T identity, T *d_result, uint numSegments)
 Performs segmented reduction on d_data.
template<class T >
void mncudaSetAtAddress (T *d_array, uint *d_address, T *d_vals, uint countVals)
 Moves data from device memory d_vals to device memory d_array using target addresses specified in d_address.
template<class T >
void mncudaSetConstAtAddress (T *d_array, uint *d_address, T constant, uint countVals)
 Moves a given constant to device memory d_array at the addresses specified in d_address.
template<class T >
void mncudaSetFromAddress (T *d_array, uint *d_srcAddr, T *d_vals, uint countTarget)
 Moves data from device memory d_vals to device memory d_array using source addresses specified in d_srcAddr.
void mncudaAlignCounts (uint *d_outAligned, uint *d_counts, uint count)
 Aligns the given count array by aligning all element counts.
template<class T >
cudaError_t mncudaPrintArray (T *d_array, uint count, bool isFloat, const char *strArray=NULL)
 Prints the given array using printf.
template<class T >
void mncudaCompactInplace (T *d_data, uint *d_srcAddr, uint countOld, uint countNew)
 Compacts the data array "inplace" using a temporary buffer, the given source addresses and count.
uint mncudaGenCompactAddresses (uint *d_isValid, uint countOld, uint *d_outSrcAddr)
 Generates compact addresses using CUDPP's compact.
void mncudaNormalize (float4 *d_vectors, uint numVectors)
 Normalizes each element of a given vector array.

Detailed Description

Provides useful macros and CUDA functions.

Author:
Mathias Neumann
Date:
16.02.2010

Define Documentation

#define MNCUDA_ALIGN (   count )    MNCUDA_ALIGN_EX(count, 16)

Computes the aligned element count for an alignment of 16 elements.

This is required to gain profit from coalesced access.

Author:
Mathias Neumann
Date:
14.03.2010
Parameters:
count - Number of elements to align.
#define MNCUDA_ALIGN_BYTES (   size,
  alignment 
)
Value:
( ( ((size) % (alignment)) == 0 ) ? \
                    (size) : \
                    ((size) + (alignment) - ((size) % (alignment))) )

Computes the aligned byte size for a given alignment.

This is required to gain profit from coalesced access or assign linear memory to textures without using offsets.

Author:
Mathias Neumann
Date:
21.03.2010
Parameters:
size - The size in bytes to align.
alignment - The alignment in bytes.
#define MNCUDA_ALIGN_EX (   count,
  alignment 
)
Value:
( ( ((count) % (alignment)) == 0 ) ? \
                    (count) : \
                    ((count) + (alignment) - ((count) % (alignment))) )

Computes the aligned element count, extended version.

We allow here to pass an alignment (number of elements). This can be useful when aligning counts for linear texture memory which requires a special device-dependent alignment.

Author:
Mathias Neumann
Date:
26.03.2010
Parameters:
count - Number of elements to align.
alignment - The alignment, as a number of elements.
#define MNCUDA_ALIGN_NZERO (   count )    MNCUDA_ALIGN_EX((((count) == 0) ? 1 : (count)), 16)

Computes the aligned element count with special treatment for zero counts.

This version avoids zero counts by aligning them to a non-zero value.

Author:
Mathias Neumann
Date:
16.08.2010
Parameters:
count - Number of elements to align.
#define MNCUDA_DIVUP (   count,
  chunkSize 
)    ((count) / (chunkSize) + (((count) % (chunkSize))?1:0))

Divides count by chunkSize and adds 1 if there is some remainder.

Author:
Mathias Neumann
Date:
25.04.2010
Parameters:
count - Count to divide in chunks, e.g. number of elements.
chunkSize - Size of each chunk.
#define MNCUDA_GRID2DINDEX   (blockIdx.x + (blockIdx.y*gridDim.x))

Calculates the one-dimensional block index for the given 2D grid that was created by MNCUDA_MAKEGRID2D().

Author:
Mathias Neumann
Date:
21.03.2010
#define MNCUDA_MAKEGRID2D (   numBlocks,
  maxBlocks 
)    dim3(min((numBlocks), (maxBlocks)), 1 + (numBlocks) / (maxBlocks), 1)

Avoids the maximum CUDA grid size by using two grid dimensions for a one-dimensional grid. I added this due to problems with exceeding 1D grid sizes.

Todo:
Evaluate impact of spawning many more threads, e.g. when we got only maxBlocks+1 threads. In this case, the second slice of blocks would also get maxBlocks blocks.
Author:
Mathias Neumann
Date:
21.03.2010
Parameters:
numBlocks - Number of thread blocks.
maxBlocks - Maximum size of the grid (number of thread blocks) in the first dimension.

Enumeration Type Documentation

enum MNCudaOP

Operator types used in utility algorithms. These operators are usually used as binary operators, e.g. when working component-wise on two arrays.

Author:
Mathias Neumann
Date:
30.06.2010
Enumerator:
MNCuda_ADD 

Addition.

MNCuda_SUB 

Subtraction.

MNCuda_MUL 

Multiplication.

MNCuda_DIV 

Division.

MNCuda_MIN 

Minimum.

MNCuda_MAX 

Maximum.


Function Documentation

void __mncudaCheckErrorCUtil ( CUTBoolean  err,
const char *  file,
const int  line 
) [inline]

Checks for CUDA CUtil errors without synchronization. Uses MNFatal() to report the error.

Author:
Mathias Neumann
Date:
05.10.2010
Parameters:
err - The error code.
file - The file where the error occurred.
line - The line where the error occurred.
void __mncudaSafeCallNoSync ( cudaError  err,
const char *  file,
const int  line 
) [inline]

Checks for CUDA errors without synchronization. Uses MNFatal() to report the error.

Author:
Mathias Neumann
Date:
05.10.2010
Parameters:
err - The error code.
file - The file where the error occurred.
line - The line where the error occurred.
void mncudaAddIdentity ( uint *  d_buffer,
uint  count 
)

Adds the index to all elements of the given buffer.

This corresponds to adding the identity relation to the elements. Here each thread works on one buffer component.

Author:
Mathias Neumann
Date:
22.03.2010
Parameters:
[in,out] d_buffer - The buffer to update.
count - Number of elements.
void mncudaAlignCounts ( uint *  d_outAligned,
uint *  d_counts,
uint  count 
)

Aligns the given count array by aligning all element counts.

 d_outAligned[i] = MNCUDA_ALIGN_NZERO(d_counts[i]) 

This should be useful to get coalesced access when accessing offsets calculated by scanning the count array. These offsets are aligned, too. Note that the MNCUDA_ALIGN_NZERO macro is used to provide special handling of zero counts. This ensures that even zero counts get aligned to a non-zero count and helps avoid problems with the corresponding offsets. Were zero counts left unaligned, two adjacent elements would get the same offset, and parallel access at these offsets could create a race condition.

Author:
Mathias Neumann
Date:
14.03.2010
Parameters:
[out] d_outAligned - The aligned counts array. Has to be preallocated.
[in] d_counts - The counts array.
count - Number of elements (counts).
template<MNCudaOP op, class T >
void mncudaArrayOp ( T *  d_target,
T *  d_other,
uint  count 
)

Performs an operation on the given two arrays.

 d_target[i] = d_target[i] op d_other[i] 

Here each thread handles one component i.

Author:
Mathias Neumann
Date:
27.02.2010
Template Parameters:
op - Operator type.
T - Element type.
Parameters:
[in,out] d_target - Target array.
[in] d_other - Other array.
count - Number of elements in both arrays.
template<class V , class S >
void mncudaAverageArray ( V *  d_array,
uint  count,
S *  d_counts 
)

Averages given vector array (type of V) by dividing each element by the corresponding count.

 d_array[i] = d_array[i] / d_counts[i] 

If d_counts[i] is zero, d_array[i] stays unchanged. Each spawned thread works on one component of the buffers.

Author:
Mathias Neumann
Date:
12.07.2010
Template Parameters:
V - Vector type.
S - Scalar type.
Parameters:
[in,out] d_array - The array of vectors to average.
count - Number of elements in both d_array and d_counts.
[in] d_counts - Counts used for averaging.
cudaError_t mncudaCheckError ( bool  bForce )

Checks for pending errors and returns them.

A synchronization is required to get all such errors. Hence this might be too costly for release mode. Therefore I added a way to make these checks optional in release mode; more precisely, they are controllable using mncudaEnableErrorChecks().

Author:
Mathias Neumann
Date:
03.11.2010
Parameters:
bForce - true to force the error check. Used in debug mode (see MNCUDA_CHECKERROR) to enforce error checks even if disabled. If false is passed, the error check is performed only if checks are enabled.
Returns:
Return value from cudaThreadSynchronize(), if check is performed. Else cudaSuccess is returned.
template<class T >
void mncudaCompactInplace ( T *  d_data,
uint *  d_srcAddr,
uint  countOld,
uint  countNew 
)

Compacts the data array "inplace" using a temporary buffer, the given source addresses and count.

Enables to compact a structure of arrays using only one real compact and multiple set from addresses.

Author:
Mathias Neumann
Date:
17.06.2010
See also:
mncudaSetFromAddress(), mncudaGenCompactAddresses()
Template Parameters:
T - Element type.
Parameters:
[in,out] d_data - The data array to compact. Operation is performed "inplace".
[in] d_srcAddr - The source addresses.
countOld - The original count (elements in d_data).
countNew - The new count (elements in d_srcAddr). Note that this should normally be available, as it is retrieved from the source address generation.
template<MNCudaOP op, class T >
void mncudaConstantOp ( T *  d_array,
uint  count,
T  constant 
)

Performs constant operation on all array elements:

 d_array[i] = d_array[i] op constant 

Each spawned thread works on one array component.

Author:
Mathias Neumann
Date:
22.04.2010
Template Parameters:
op - Operator type.
T - Element type.
Parameters:
[in,out] d_array - The array to manipulate.
count - Number of elements.
constant - The constant.
void mncudaEnableErrorChecks ( bool  bEnable )

Enables error checks using mncudaCheckError().

This function is useful in release mode only, as in debug mode, error checks are forced.

Author:
Mathias Neumann
Date:
03.11.2010
Parameters:
bEnable - true to enable, false to disable.
uint mncudaGenCompactAddresses ( uint *  d_isValid,
uint  countOld,
uint *  d_outSrcAddr 
)

Generates compact addresses using CUDPP's compact.

These can be used to compact data corresponding to d_isValid without using CUDPP's compact, but using mncudaCompactInplace(). To compact a structure of arrays, you'd have to call this once and mncudaCompactInplace() for each array. In my tests I observed that this is much more efficient than multiple cudppCompact calls.

Note:
This corresponds to compacting an identity array.
Author:
Mathias Neumann
Date:
01.07.2010
See also:
mncudaCompactInplace(), mncudaSetFromAddress()
Parameters:
[in] d_isValid - Contains 1 if entry is valid, 0 if it should be dropped.
countOld - Old count before compacting.
[out] d_outSrcAddr - The source address array. Device memory provided by caller.
Returns:
Number of compacted elements, i.e. the number of source addresses.
uint mncudaGetMaxBlockSize ( uint  reqSharedMemPerThread,
uint  maxRegPerThread,
bool  useMultipleWarpSize 
)

Computes best thread block size for given shared memory requirement.

Thread block size is limited due to maximum shared memory per block. This maximum depends on the actual CUDA GPU.

Warning:
This assumes 16 bytes (for blockIdx, ...) + 256 bytes (for parameters) of shared memory are reserved. That might not be correct for all GPUs. Values are taken from the CUDA FAQ, see http://forums.nvidia.com/index.php?showtopic=84440.
Author:
Mathias Neumann
Date:
29.10.2010
Parameters:
reqSharedMemPerThread - Required shared memory per thread in bytes.
maxRegPerThread - Maximum number of registers per thread.
useMultipleWarpSize - Whether to round the maximum to the next multiple of the device's warp size.
Returns:
Maximum thread block size for current CUDA device. Note that the returned value is not rounded to a valid power of two.
template<class T >
void mncudaInitConstant ( T *  d_buffer,
uint  count,
T  constant 
)

Initializes the given buffer with a constant value.

Each spawned thread works on one array component.

Author:
Mathias Neumann
Date:
17.02.2010
Template Parameters:
T - Element type of the buffer.
Parameters:
[in,out] d_buffer - Device buffer to initialize.
count - Number of elements.
constant - The constant.
void mncudaInitIdentity ( uint *  d_buffer,
uint  count 
)

Initializes the given buffer with the identity relation, that is buffer[i] = i.

Each component is handled by its own CUDA thread.

Author:
Mathias Neumann
Date:
16.02.2010
Parameters:
[in,out] d_buffer - Device buffer to initialize.
count - The number of elements in d_buffer.
void mncudaInverseBinary ( uint *  d_buffer,
uint  count 
)

Inverts the given "binary" 0-1 buffer.

This is done by setting all one (1) components to zero (0) and all zero components to one, respectively. Each thread handles one component of the buffer.

Author:
Mathias Neumann
Date:
19.02.2010
Parameters:
[in,out] d_buffer - The binary buffer to invert.
count - Number of elements in d_buffer.
bool mncudaIsErrorChecking (  )

Gets whether error checking is enabled.

Author:
Mathias Neumann
Date:
03.11.2010
Returns:
true if enabled, else false.
void mncudaNormalize ( float4 *  d_vectors,
uint  numVectors 
)

Normalizes each element of a given vector array.

Author:
Mathias Neumann
Date:
23.10.2010
Parameters:
[in,out] d_vectors - The vector array to normalize. Array is given as float4 array of three-dimensional vectors to ensure alignment, where the actual vector is in the xyz components.
numVectors - Number of vectors.
template<class T >
cudaError_t mncudaPrintArray ( T *  d_array,
uint  count,
bool  isFloat,
const char *  strArray 
)

Prints the given array using printf.

For debugging purposes only. Too slow for release code.

Todo:
Find way to print out data of different types.
Author:
Mathias Neumann
Date:
15.03.2010
Template Parameters:
T - Element type.
Parameters:
[in] d_array - The device array to print out.
count - Number of elements of d_array to print.
isFloat - Whether T is a floating point type.
strArray - The name of the array. Used for printing purposes. Might be NULL.
Returns:
cudaSuccess if OK, else some error.
template<class T >
cudaError_t mncudaReduce ( T &  result,
T *  d_data,
uint  count,
MNCudaOP  op,
T  identity 
)

Performs reduction on d_data.

d_data remains unchanged. The reduction depends on the passed operator. The reduction algorithm is implemented with the help of the CUDA SDK sample.

Author:
Mathias Neumann
Date:
16.03.2010
Template Parameters:
T - Element type of input and output arrays.
Parameters:
[out] result - Reduction result. Single element of type T.
[in] d_data - Data to reduce, remains unchanged.
count - Number of elements in d_data.
op - The reduction operator. One of MNCuda_ADD, MNCuda_MIN, MNCuda_MAX.
identity - Identity value associated with op.
Returns:
cudaSuccess if successful, else some error value.
template<class T >
uint mncudaResizeMNCudaMem ( T **  d_buffer,
uint  numOld,
uint  numRequest,
uint  slices 
)

Resizes the given buffer from numOld to at least numRequest elements.

Copies the contents of the old buffer to the beginning of the new buffer. Works only for MNCudaMemPool buffers.

Additionally, the buffer can be organized into slices of contiguously placed elements. When slices > 1, it is assumed that d_buffer has numOld times slices elements before the call. The buffer is resized so that each slice holds at least numRequest elements.

Author:
Mathias Neumann
Date:
16.02.2010
Template Parameters:
T - Element type of buffer.
Parameters:
[in,out] d_buffer - The old buffer. After execution it is freed and replaced by the new buffer. Has to be valid MNCudaMemPool memory.
numOld - Old number of elements in a single slice.
numRequest - Requested new number of elements in a single slice.
slices - Number of slices. Usually 1; however, you might have a buffer consisting of multiple slices of numOld elements and want to resize it to multiple slices of at least numRequest elements.
Returns:
New element count within a single slice. Might be slightly greater than numRequest in case numRequest wasn't correctly aligned.
template<class V , class S >
void mncudaScaleVectorArray ( V *  d_vecArray,
uint  count,
S  scalar 
)

Scales given vector array by given scalar (component wise).

Each spawned thread works on one array component.

Author:
Mathias Neumann
Date:
27.06.2010
Template Parameters:
V - Vector type.
S - Scalar type.
Parameters:
[in,out] d_vecArray - The array of vectors to scale.
count - Number of vectors.
scalar - The scalar.
template<class T >
cudaError_t mncudaSegmentedReduce ( T *  d_data,
uint *  d_owner,
uint  count,
MNCudaOP  op,
T  identity,
T *  d_result,
uint  numSegments 
)

Performs segmented reduction on d_data.

Segments are defined by d_owner, where d_owner[i] contains the segment of d_data[i]. The result is put into d_result. This array has to be preallocated and should have space for all segment results.

Algorithmic idea: http://www.nvidia.com/object/nvidia_research_pub_013.html.

Author:
Mathias Neumann
Date:
17.02.2010
Template Parameters:
T - Element type of d_data.
Parameters:
[in] d_data - The segmented data to perform reduction on. Count A. It is assumed that the data of the same segment is stored contiguously.
[in] d_owner - The data-segment association list. Count A.
count - Defines A, that is the count in d_data and d_owner.
op - The reduction operator. One of MNCuda_ADD, MNCuda_MIN, MNCuda_MAX.
identity - Identity value associated with op.
[out] d_result - Takes the result, that is the reduction result of each segment. Size of B <= A.
numSegments - Defines B, the number of segments.
Returns:
cudaSuccess if OK, else some error.
template<class T >
void mncudaSetAtAddress ( T *  d_array,
uint *  d_address,
T *  d_vals,
uint  countVals 
)

Moves data from device memory d_vals to device memory d_array using target addresses specified in d_address.

 d_array[d_address[i]] = d_vals[i] 
Warning:
Heavy uncoalesced access possible. Depends on addresses.
Author:
Mathias Neumann
Date:
22.03.2010
See also:
mncudaSetConstAtAddress()
Template Parameters:
T - Element type of array and values.
Parameters:
[in,out] d_array - The data array to manipulate.
[in] d_address - Addresses of the values to manipulate. It is assumed that all addresses d_address[i] are valid with respect to d_array.
[in] d_vals - The values to fill in.
countVals - Number of values in d_address and d_vals.
template<class T >
void mncudaSetConstAtAddress ( T *  d_array,
uint *  d_address,
T  constant,
uint  countVals 
)

Moves a given constant to device memory d_array at the addresses specified in d_address.

 d_array[d_address[i]] = constant 
Warning:
Heavy uncoalesced access possible. Depends on addresses.
Author:
Mathias Neumann
Date:
18.02.2010
See also:
mncudaSetAtAddress()
Template Parameters:
T - Element type of array and constant.
Parameters:
[in,out] d_array - The data array to manipulate.
[in] d_address - Addresses of the values to manipulate.
constant - The constant to move to d_array.
countVals - Number of values in d_address.
template<class T >
void mncudaSetFromAddress ( T *  d_array,
uint *  d_srcAddr,
T *  d_vals,
uint  countTarget 
)

Moves data from device memory d_vals to device memory d_array using source addresses specified in d_srcAddr.

 d_array[i] = d_vals[d_srcAddr[i]] 

When the source address is 0xffffffff, the corresponding target entry is zeroed. This can be helpful for some algorithms.

Warning:
Heavy uncoalesced access possible. Depends on addresses.
Author:
Mathias Neumann
Date:
24.03.2010
Template Parameters:
T - Element type of array and values.
Parameters:
[in,out] d_array - The data array to manipulate.
[in] d_srcAddr - Addresses of the source values in d_vals. Use 0xffffffff to zero the corresponding entry in d_array.
[in] d_vals - The values to fill in.
countTarget - Number of values in d_array and d_srcAddr.
MNRT Source Code Documentation (Version 1.0) - Copyright © Mathias Neumann 2010