Parallel reduction on CUDA blocks



Hardware serializes divergent thread execution within a warp, and all threads within a warp must complete execution before that warp can end. With this in mind, we use a parallel reduction technique to find the maximum of the model vector $\textbf{m}_k$ and the descent vector $\textbf{d}_k$, as well as to perform the summations for the inner products in the numerator and the denominator of $\alpha_k$. A sequential addressing scheme is used because it is free of shared memory bank conflicts (Harris et al., 2007). As shown in Figure 3, the parallel reduction approach builds a summation tree to compute stepwise partial sums. At each level, half of the threads from the previous level remain active, reading their operands and writing partial results to shared memory, so the required number of threads is halved from one level to the next. This reduces the serial computational complexity from $O(N)$ to $O(\log_2(N))$ steps: in each step many threads perform computation simultaneously, leading to low arithmetic intensity. In this way, we expect a significant improvement in computational efficiency.
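
The summation tree described above can be sketched as a CUDA kernel. This is a minimal illustration in the style of the sequential addressing reduction, not the code used in this work: the names `dot_partial`, `dot_gpu` and `BLOCK_SIZE` are hypothetical, the block size is assumed to be a power of two, and error checking of the CUDA runtime calls is omitted for brevity. Each block reduces its slice of the products $x_i y_i$ to one partial sum, and the host adds the per-block results.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // hypothetical choice; must be a power of two for the halving loop

// Each block reduces BLOCK_SIZE products x[i]*y[i] to a single partial sum.
__global__ void dot_partial(const float *x, const float *y, float *partial, int n)
{
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // One product per thread is loaded from global into shared memory;
    // threads past the end of the vector contribute zero.
    sdata[tid] = (i < n) ? x[i] * y[i] : 0.0f;
    __syncthreads();

    // Sequential addressing: the active threads form the contiguous range [0, s),
    // and thread tid adds the element s slots away. s halves at every level.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();  // all partial sums of this level must be written first
    }

    // Thread 0 holds the block's result.
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: launches the kernel and sums the per-block results.
float dot_gpu(const std::vector<float> &x, const std::vector<float> &y)
{
    assert(x.size() == y.size());
    int n = (int)x.size();
    if (n == 0) return 0.0f;
    int nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *dx, *dy, *dp;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMalloc((void **)&dp, nblocks * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dot_partial<<<nblocks, BLOCK_SIZE>>>(dx, dy, dp, n);

    std::vector<float> partial(nblocks);
    cudaMemcpy(partial.data(), dp, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dp);

    // Final accumulation of the per-block partial sums on the host.
    double sum = 0.0;
    for (float p : partial) sum += p;
    return (float)sum;
}
```

With `BLOCK_SIZE` equal to 256, the loop runs $\log_2(256) = 8$ levels per block. A maximum can be obtained from the same tree by replacing `+=` with a call to `fmaxf` and padding the out-of-range threads with a value that cannot win the comparison instead of zero.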


Figure 3. Parallel reduction on a GPU block. It reduces the serial computational complexity from $O(N)$ to $O(\log_2(N))$ steps: in each step many threads perform computation simultaneously, leading to low arithmetic intensity.


2021-08-31