Data-parallel vs task-parallel:
The most general approach: consider the dependency graph between computed quantities. It is a DAG (directed acyclic graph). Notes:
Processing of an array of data can often be split into independent blocks.
The easiest case is when each input produces one output — the map pattern. See the example for computing the sum of two vectors: vector_sum_split_work.cpp.
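A simple way of computing the first index of each thread's block: beginIdx = (threadIdx * nrElements) div nrThreads, where div is integer division. The block of a thread ends just before the beginIdx of the next thread.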
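The boundary formula can be sketched as follows. This is a minimal illustration, not the contents of vector_sum_split_work.cpp; the function name vectorSumBlocks and the use of std::thread are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: adds a and b element-wise using nrThreads workers.
// Each worker handles one contiguous block of the output.
std::vector<int> vectorSumBlocks(std::vector<int> const& a,
                                 std::vector<int> const& b,
                                 std::size_t nrThreads) {
    assert(a.size() == b.size() && nrThreads > 0);
    std::size_t const n = a.size();
    std::vector<int> c(n);
    std::vector<std::thread> workers;
    workers.reserve(nrThreads);
    for (std::size_t t = 0; t < nrThreads; ++t) {
        // Integer division; worker t covers the half-open range [beginIdx, endIdx).
        std::size_t const beginIdx = (t * n) / nrThreads;
        std::size_t const endIdx = ((t + 1) * n) / nrThreads;
        workers.emplace_back([&a, &b, &c, beginIdx, endIdx]() {
            for (std::size_t i = beginIdx; i < endIdx; ++i) {
                c[i] = a[i] + b[i];
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return c;
}
```

Because the end of one block is computed with the same formula as the beginning of the next, the blocks cover every index exactly once, and their sizes differ by at most one element.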
However, beware of cache effects! Processing consecutive elements is significantly faster than processing every k-th element. Compare the previous program with the one at vector_sum_split_work_bad.cpp.
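For contrast, here is a sketch of the cache-unfriendly split, where worker t handles elements t, t + nrThreads, t + 2*nrThreads, and so on. The function name vectorSumStrided is hypothetical, and whether vector_sum_split_work_bad.cpp is written exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical counter-example: each worker processes every nrThreads-th element.
// The result is correct, but neighbouring elements written by different
// workers share cache lines, so the workers keep invalidating each other's caches.
std::vector<int> vectorSumStrided(std::vector<int> const& a,
                                  std::vector<int> const& b,
                                  std::size_t nrThreads) {
    assert(a.size() == b.size() && nrThreads > 0);
    std::size_t const n = a.size();
    std::vector<int> c(n);
    std::vector<std::thread> workers;
    workers.reserve(nrThreads);
    for (std::size_t t = 0; t < nrThreads; ++t) {
        workers.emplace_back([&a, &b, &c, n, t, nrThreads]() {
            for (std::size_t i = t; i < n; i += nrThreads) {
                c[i] = a[i] + b[i];
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return c;
}
```

Timing this against a version that gives each worker one contiguous block, on a large vector, is a simple way to observe the effect; the actual ratio depends on the machine.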
A more complex case arises when each output depends on a group of inputs, around the input at the same position — the stencil pattern. See vector_average_stencil.cpp.
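A minimal sketch of a stencil computation, assuming a moving average over a window of radius elements on each side, truncated at the ends of the vector. The name movingAverage and the window definition are assumptions for illustration; vector_average_stencil.cpp may define the average differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: out[i] is the average of in[i-radius .. i+radius],
// with the window truncated at the ends of the vector.
// The work is split by output index; the inputs are only read, so the
// overlapping windows of neighbouring workers need no synchronization.
std::vector<double> movingAverage(std::vector<double> const& in,
                                  std::size_t radius,
                                  std::size_t nrThreads) {
    assert(nrThreads > 0);
    std::size_t const n = in.size();
    std::vector<double> out(n);
    std::vector<std::thread> workers;
    workers.reserve(nrThreads);
    for (std::size_t t = 0; t < nrThreads; ++t) {
        std::size_t const beginIdx = (t * n) / nrThreads;
        std::size_t const endIdx = ((t + 1) * n) / nrThreads;
        workers.emplace_back([&in, &out, n, radius, beginIdx, endIdx]() {
            for (std::size_t i = beginIdx; i < endIdx; ++i) {
                std::size_t const lo = (i >= radius) ? i - radius : 0;
                std::size_t const hi = std::min(n, i + radius + 1);  // one past the window
                double sum = 0.0;
                for (std::size_t j = lo; j < hi; ++j) {
                    sum += in[j];
                }
                out[i] = sum / static_cast<double>(hi - lo);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return out;
}
```

For the input 1, 2, 3, 4, 5 and radius 1, the two end elements are averaged over two values and the inner ones over three.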
It is preferable to split the work by outputs rather than by inputs, so that each output is computed by exactly one worker (thread, task) and no mutexes are necessary.
Recursive decomposition: the initial worker splits the data into two or more fragments, gives the fragments as inputs to subordinate workers, and finally combines their results.
Example 1: Compute the sum of a vector. Create a binary tree of adders. The depth is O(log n). Source code: recursive_decomposition_sum.cpp.
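The binary tree of adders can be sketched with std::async. This is an illustrative version, not the contents of recursive_decomposition_sum.cpp; the function name recursiveSum and the depth parameter, which limits how many levels spawn new tasks, are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: sums v[begin..end) by recursive decomposition.
// The top `depth` levels of the tree run their left half in a new task;
// below that, the range is summed sequentially.
long long recursiveSum(std::vector<int> const& v,
                       std::size_t begin, std::size_t end,
                       unsigned depth) {
    assert(begin <= end && end <= v.size());
    if (depth == 0 || end - begin < 2) {
        long long sum = 0;
        for (std::size_t i = begin; i < end; ++i) {
            sum += v[i];
        }
        return sum;
    }
    std::size_t const mid = begin + (end - begin) / 2;
    // The left half is given to a subordinate worker...
    std::future<long long> left = std::async(std::launch::async,
        [&v, begin, mid, depth]() {
            return recursiveSum(v, begin, mid, depth - 1);
        });
    // ...while the current worker handles the right half itself.
    long long const right = recursiveSum(v, mid, end, depth - 1);
    // Combine the results of the two fragments.
    return left.get() + right;
}
```

A call such as recursiveSum(v, 0, v.size(), 3) uses at most 8 leaf ranges. Limiting the depth keeps the number of threads bounded; spawning a task per element would cost far more than the additions themselves.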
Example 2: Merge sort. The basic (non-parallel) algorithm is to divide the input vector into two parts, merge-sort each part, then merge the resulting two sorted vectors into one. When parallelizing, merge-sorting the two parts can easily be done in parallel. However, the final merge is a bit harder. It can be done as follows:
See the C++ implementations:
Example 3: Compute the sequence of sums of prefixes. Given a[0], a[1], ..., a[n-1], compute b[0] = a[0], b[1] = a[0]+a[1], b[2] = a[0]+a[1]+a[2], ..., b[n-1] = a[0]+a[1]+a[2]+...+a[n-1].
Solution: start with a binary tree computing the sum of all numbers in the sequence. Then, compute each prefix sum from the largest parts already computed.
  // First, compute the sums of 2^j consecutive numbers:
  // b[i*2^j - 1] = a[(i-1)*2^j] + ... + a[i*2^j - 1]
  b = a;
  size_t k = 1; // declared outside the loop, because it is needed afterwards
  for( ; k<n ; k = k*2) {
      for(size_t i=2*k-1 ; i<n ; i+=2*k) { // in parallel
          b[i] += b[i-k];
      }
  }
  // Here, k is the smallest power of 2 that is >= n.
  // Then, compute each partial sum as a sum of 2^j groups:
  for(k = k/4 ; k>0 ; k = k/2) {
      for(size_t i=3*k-1 ; i<n ; i+=2*k) { // in parallel
          b[i] += b[i-k];
      }
  }
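The two phases can be wrapped in a function and checked against the sequential std::partial_sum. This is a sequential sketch: the inner loops, marked "in parallel" above, are run in order here, and the names prefixSums and checkPrefixSums are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical wrapper around the two-phase algorithm above.
// Within each value of k, the iterations of the inner loop touch
// disjoint elements, so they could be distributed among workers.
std::vector<long long> prefixSums(std::vector<long long> const& a) {
    std::size_t const n = a.size();
    std::vector<long long> b = a;
    std::size_t k = 1;
    // Phase 1: sums of groups of 2, 4, 8, ... consecutive elements.
    for (; k < n; k *= 2) {
        for (std::size_t i = 2 * k - 1; i < n; i += 2 * k) {
            b[i] += b[i - k];
        }
    }
    // Phase 2: complete the remaining prefixes from the group sums.
    for (k /= 4; k > 0; k /= 2) {
        for (std::size_t i = 3 * k - 1; i < n; i += 2 * k) {
            b[i] += b[i - k];
        }
    }
    return b;
}

// Compares the result with std::partial_sum for all lengths 0..maxN.
void checkPrefixSums(std::size_t maxN) {
    for (std::size_t n = 0; n <= maxN; ++n) {
        std::vector<long long> a(n);
        std::iota(a.begin(), a.end(), 1LL);  // a = 1, 2, ..., n
        std::vector<long long> expected(n);
        std::partial_sum(a.begin(), a.end(), expected.begin());
        assert(prefixSums(a) == expected);
    }
}
```

For the input 3, 1, 4, 1, 5, the first phase produces 3, 4, 4, 9, 5 and the second phase completes it to 3, 4, 8, 9, 14. Checking lengths that are not powers of 2 matters, because the loop bounds i<n are what handle the incomplete groups at the end.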
Examples:
 
