A graphics processing unit implementation of time-domain full-waveform inversion
Recognizing that the hardware serializes divergent thread execution within a warp, and that all threads in a warp must complete execution before the warp can end, we use a parallel reduction technique to find the maxima of the model vector and the descent vector, as well as to carry out the summations for the inner products in the numerator and the denominator of the corresponding quotient. A sequential addressing scheme is used because it is free of shared-memory bank conflicts (Harris et al., 2007). As shown in Figure 3, the parallel reduction approach builds a summation tree to compute partial sums step by step. At each level, half of the threads read from global memory and write to shared memory, and the number of threads required is half that of the previous level. This reduces the serial computational complexity from $O(N)$ to $O(\log_2 N)$ steps: in each step many threads perform computation simultaneously, leading to low arithmetic intensity. In this way, we expect a significant improvement in computational efficiency.
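The summation tree described above can be sketched as a CUDA kernel with sequential addressing. This is a minimal illustration rather than the implementation used in this work: the names `block_sum`, `device_sum`, and `BLOCK` are hypothetical, the block size is assumed to be a power of two, and error checking is omitted for brevity. A maximum can be obtained in the same way by replacing the addition with `fmaxf`.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed power of two)

// Each block reduces up to 2*BLOCK inputs to a single partial sum.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * (2 * BLOCK) + tid;

    // Read from global memory, doing the first addition during the load,
    // then write the result to shared memory.
    float v = 0.0f;
    if (i < n) v = in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    s[tid] = v;
    __syncthreads();

    // Sequential addressing: the stride halves at every level, so the
    // active threads are always the contiguous range 0..stride-1 and
    // neighbouring threads touch neighbouring shared-memory banks.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();  // reached by every thread, outside the branch
    }

    if (tid == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: launches the kernel and adds the per-block
// partial sums on the CPU. Return codes are not checked in this sketch.
float device_sum(const std::vector<float> &h)
{
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    int blocks = (n + 2 * BLOCK - 1) / (2 * BLOCK);

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK>>>(d_in, d_partial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With `BLOCK = 256` the loop runs for 8 levels, which is the $\log_2$ depth of the tree for one block. Calling `device_sum` on a vector of 1000 ones is a simple sanity check, since every partial sum is then exactly representable in single precision.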
Figure 3. Parallel reduction on a GPU block. It reduces the serial computational complexity from $O(N)$ to $O(\log_2 N)$ steps: in each step many threads perform computation simultaneously, leading to low arithmetic intensity.