QUESTION 5 (20 Marks)
Consider the CUDA kernel routine reduceOp() below. This routine performs a parallel reduction, i.e., a parallel summation of an array of n integers. The routine does not use shared memory on each SM (Streaming Multiprocessor), and the reduction is done in place, meaning that the values in global memory are replaced by partial sums at each step.
__global__ void reduceOp(int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int index = 2 * stride * tid;
        if (index < blockDim.x) {
            idata[index] += idata[index + stride];
        }
        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}
Analyze the routine and answer the following questions.
1) Describe how the parallel algorithm adopted by this routine works.
2) Discuss whether this algorithm has any other shortcomings besides not using shared memory on each SM.
3) Based on your discussion, modify the routine so that it performs the in-place reduction in global memory more efficiently, still without using shared memory on each SM.
4) Justify your modifications, i.e., discuss why they can enhance the performance of the original routine.