
Experimental results

Our experiments were performed on the Texas Advanced Computing Center (TACC) machine Lonestar, which consists of 1,888 compute nodes connected with QDR InfiniBand in a fat-tree topology, each node equipped with two hex-core 3.33 GHz processors and 24 GB of memory. Our tests launched eight MPI processes per node in order to provide each MPI process with 3 GB of memory.

Our experiments made use of five different 3D velocity models: four analytical models (barrier, wedge, two-layer, and waveguide) and the SEG/EAGE Overthrust model (Fig. 6).

In all of the following experiments, the shortest wavelength of each model is resolved with roughly ten grid points, and the performance of the preconditioner is tested using four forcing functions: three localized shots centered at $ \mathbf{x}_0$ , $ \mathbf{x}_1$ , and $ \mathbf{x}_2$ , and a fourth, directional source oriented along $ \mathbf{d}$ , where $ \mathbf{x}_0=(0.5,0.5,0.1)$ , $ \mathbf{x}_1=(0.25,0.25,0.1)$ , $ \mathbf{x}_2=(0.75,0.75,0.5)$ , and $ \mathbf{d}=(1,1,-1)/\sqrt{3}$ . Note that, in the case of the Overthrust model, these source locations should be interpreted proportionally (e.g., an $ x_3$ value of $ 0.1$ corresponds to a depth which is $ 10\%$ of the total depth of the model). Due to the oscillatory nature of the solution, we simply choose the zero vector as our initial guess in all experiments.
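
To make the proportional convention concrete, the short Python sketch below (ours, not part of the PSP code) maps the proportional location $ \mathbf{x}_0=(0.5,0.5,0.1)$ onto the $ 801 \times 801 \times 187$ Overthrust grid described in Fig. 6; a proportional depth of $ 0.1$ works out to $ 0.465$ km, close to the 456 m shot depth used for Fig. 7.

    # Map proportional source coordinates onto the Overthrust grid
    # (801 x 801 x 187 samples over 20 km x 20 km x 4.65 km at 25 m spacing;
    # see Fig. 6).  Both the helper name and the rounding rule are ours.

    def proportional_to_index(p, shape=(801, 801, 187)):
        """Convert proportional coordinates in [0,1]^3 to nearest grid indices."""
        return tuple(round(p_i * (n_i - 1)) for p_i, n_i in zip(p, shape))

    extent_km = (20.0, 20.0, 4.65)   # physical extent of the model
    x0 = (0.5, 0.5, 0.1)             # first source location, proportional units

    print(proportional_to_index(x0))   # (400, 400, 19): center of the x1-x2 plane
    print(x0[2] * extent_km[2])        # 0.465 km, i.e. 10% of the total depth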

The first experiment was designed to test the convergence rate of the sweeping preconditioner over domains spanning 50 wavelengths in each direction (resolved with ten points per wavelength), and each test made use of 256 nodes of Lonestar. During the course of these tests, we noticed that significant care must be taken when setting the amplitude of the derivative of the PML takeoff function (i.e., the "C" variable in Eq. (2.1) from (15)); for brevity, we hereafter refer to this variable as the PML amplitude. A modest search was performed in order to find a near-optimal value of the PML amplitude for each problem. These values are reported in Table 1, along with the number of iterations required for the relative residuals of all four forcing functions to drop below $ 10^{-5}$ .
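
Conceptually, the modest search amounts to the loop sketched below. This is our own illustration: the callable passed in stands for a full preconditioned GMRES solve with a given PML amplitude, and the candidate amplitudes are placeholders rather than the values actually tried.

    # A minimal sketch of the search for a near-optimal PML amplitude.
    # run_solve(amplitude) is assumed to perform the preconditioned solve and
    # return the number of iterations needed to reduce the relative residual
    # ||b - A x|| / ||b|| below 1e-5 for all four forcing functions, or None
    # if the iteration fails to reach that tolerance.

    def search_pml_amplitude(run_solve, candidates=(1.0, 2.0, 3.0, 4.0, 5.0)):
        best = None   # (amplitude, iterations) of the best candidate so far
        for amplitude in candidates:
            iterations = run_solve(amplitude)
            if iterations is None:
                continue   # this amplitude did not converge; try the next one
            if best is None or iterations < best[1]:
                best = (amplitude, iterations)
        return best

For the barrier model, for example, such a search settled on an amplitude of 3.0 and 28 iterations (Table 1).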


Table 1: The number of iterations required for convergence for four model problems (with four forcing functions per model). The grid sizes were $ 500^3$ and roughly 50 wavelengths were spanned in each direction. Five grid points were used for all PML discretizations, four planes were processed per panel, and the damping factors were all set to $ 7$ .
    velocity model     barrier   wedge   two-layer   waveguide
    frequency (Hz)        50       75        50         37.5
    PML amplitude         3.0      4.0       4.0        2.0
    iterations            28       49        48         52
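
As a quick sanity check on these problem sizes (the arithmetic is ours), fifty wavelengths per direction at roughly ten grid points per wavelength accounts for the 500 grid points in each dimension of the $ 500^3$ grids; the five-point PML regions are small by comparison.

    # Roughly 50 wavelengths per direction, each resolved with about ten grid
    # points, accounts for the 500 points per dimension quoted in Table 1.
    wavelengths_per_direction = 50
    points_per_wavelength = 10
    points_per_direction = wavelengths_per_direction * points_per_wavelength
    print(points_per_direction, points_per_direction ** 3)   # 500 and 125,000,000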


Figure 5.
A single $ x_2 x_3$ plane from each of the four analytical velocity models over a $ 500^3$ grid with the smallest wavelength resolved with ten grid points. (top-left) the three-shot solution for the barrier model near $ x_1=0.7$ , (bottom-left) the three-shot solution for the two-layer model near $ x_1=0.7$ , (top-right) the single-shot solution for the wedge model near $ x_1=0.7$ , and (bottom-right) the single-shot solution for the waveguide model near $ x_1=0.55$ .

It was also observed that, at least for the waveguide problem, the convergence rate improves significantly if six grid points of PML are used instead of five. In particular, the 52 iterations reported in Table 1 drop to 27 if the PML size is increased by one grid point. Interestingly, the same number of iterations is required for convergence of the waveguide problem at half the frequency (and half the resolution) with five grid points of PML; thus, it appears that the optimal PML size is a slowly growing function of the frequency. We also note that, though it was not intentional, each of these first four velocity models is invariant in one or more directions, and so it would be straightforward to sweep in a direction such that only $ O(1)$ panel factorizations need to be performed, effectively reducing the complexity of the setup phase to $ O(\gamma^2 N)$ .
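
The remark about invariant directions can be made concrete with the small NumPy sketch below (ours; `velocity` stands for any gridded model): it reports the axes along which a model does not vary, which are exactly the directions in which every panel of the sweep would see the same local problem, so that a single panel factorization could be reused.

    import numpy as np

    def invariant_axes(velocity, rtol=1e-12):
        """Return the axes along which a gridded velocity model is constant."""
        axes = []
        for axis in range(velocity.ndim):
            first_slice = np.take(velocity, 0, axis=axis)
            if np.allclose(velocity, np.expand_dims(first_slice, axis), rtol=rtol):
                axes.append(axis)
        return axes

    # Example: a two-layer model varies only with depth (axis 2), so sweeping
    # along axis 0 or axis 1 would require only O(1) panel factorizations.
    two_layer = np.ones((64, 64, 64))
    two_layer[:, :, 32:] = 2.0
    print(invariant_axes(two_layer))   # [0, 1]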

The last experiment was meant to simultaneously test the convergence rates and the scalability of the new sweeping preconditioner using the Overthrust velocity model (see Fig. 6) at various frequencies; the results are reported in Table 2. It is important to notice that this is not a typical weak scaling test, as it is the number of grid points per process that was fixed, not the local memory usage or the computational load, both of which grow superlinearly with respect to the total number of degrees of freedom. Nevertheless, including the setup phase, it took less than three minutes to solve the 3.175 Hz problem with four right-hand sides using 128 cores, and just under seven and a half minutes to solve the corresponding 8 Hz problem using 2048 cores. Also, while it is by no means the main message of this paper, the timings obtained without selective inversion are reported in parentheses, as that technique is not yet widely implemented.
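
To see why we describe this as fixing the number of grid points per process, the following check (ours, using the grid sizes and process counts from Table 2) shows that the degrees of freedom per process are essentially constant across the five runs, even though the memory and work per process are not.

    # Degrees of freedom per process for each column of Table 2.
    grids = [(319, 319, 75), (401, 401, 94), (505, 505, 118),
             (635, 635, 145), (801, 801, 187)]
    processes = [128, 256, 512, 1024, 2048]

    for (nx, ny, nz), p in zip(grids, processes):
        print(f"{p:5d} processes: {nx * ny * nz / p:8.0f} grid points per process")

    # Each run works out to roughly 57,000-60,000 grid points per process.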

Figure 6.
Three cross-sections of the SEG/EAGE Overthrust velocity model, which represents an artificial $ 20\, \mathrm{km} \times 20\, \mathrm{km} \times 4.65\, \mathrm{km}$ domain containing an overthrust fault, sampled every $ 25\, \mathrm{m}$ . The result is an $ 801 \times 801 \times 187$ grid of wave speeds varying discontinuously between $ 2.179\, \mathrm{km/sec}$ (blue) and $ 6.000\, \mathrm{km/sec}$ (red).


Table 2: Convergence rates and timings on TACC's Lonestar for the SEG/EAGE Overthrust model, where timings in parentheses do not make use of selective inversion. All cases used a double-precision second-order stencil with five grid spacings for all PML (with an amplitude of 7.5), and a damping parameter of $ 2.25 \pi $ . The preconditioner was configured with four planes per panel and eight processes per node, and the `apply' timings are with respect to a single application of the preconditioner to four right-hand sides.
    number of processes       128                    256                    512                     1024                    2048
    frequency (Hz)            3.175                  4                      5.04                    6.35                    8
    grid                      $ 319^2 \times 75$     $ 401^2 \times 94$     $ 505^2 \times 118$     $ 635^2 \times 145$     $ 801^2 \times 187$
    setup (sec)               48.40 (46.23)          66.33 (63.41)          92.95 (86.90)           130.4 (120.6)           193.0 (175.2)
    apply (sec/rhs)           0.468 (1.07)           0.550 (1.28)           0.645 (2.40)            0.700 (3.33)            0.880 (6.13)
    iterations to 3 digits    42                     44                     42                      39                      40
    iterations to 4 digits    54                     57                     57                      58                      58
    iterations to 5 digits    63                     69                     70                      68                      72
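
As a rough consistency check on the solve times quoted above (the estimate is ours and ignores the Krylov orthogonalization and the application of the discrete operator, so it is a mild underestimate), combining the setup times, the per-right-hand-side apply times, and the iteration counts for five digits of residual reduction yields totals consistent with the quoted figures:

    # Rough lower bound on the total time to solve four right-hand sides:
    # setup + iterations * (apply time per rhs) * (number of rhs).
    cases = {
        "3.175 Hz on 128 cores": (48.40, 0.468, 63),
        "8 Hz on 2048 cores":    (193.0, 0.880, 72),
    }
    num_rhs = 4

    for name, (setup, apply_per_rhs, iterations) in cases.items():
        total = setup + iterations * apply_per_rhs * num_rhs
        print(f"{name}: about {total / 60:.1f} minutes")
    # 3.175 Hz: about 2.8 minutes ("less than three minutes")
    # 8 Hz:     about 7.4 minutes ("just under seven and a half minutes")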


Figure 7.
Three planes from a solution for the Overthrust model with a single localized shot at the center of the $ x_1 x_2$ plane at a depth of 456 m: (top) an $ x_2 x_3$ plane near $ x_1=14$ km, (middle) an $ x_1 x_3$ plane near $ x_2=14$ km, and (bottom) an $ x_1 x_2$ plane near $ x_3=0.9$ km.

