
Experimental results

Our experiments were performed on the Texas Advanced Computing Center (TACC) machine Lonestar, which consists of 1,888 compute nodes connected with QDR InfiniBand in a fat-tree topology, each node equipped with two hex-core 3.33 GHz processors and 24 GB of memory. Our tests launched eight MPI processes per node in order to provide each MPI process with 3 GB of memory.

Our experiments made use of five different 3D velocity models: four analytical models (barrier, wedge, two-layer, and waveguide) and the SEG/EAGE Overthrust model (Fig. 6).

In all of the following experiments, the shortest wavelength of each model is resolved with roughly ten grid points, and the performance of the preconditioner is tested using four forcing functions: three localized shots centered at $ \mathbf{x}_0$ , $ \mathbf{x}_1$ , and $ \mathbf{x}_2$ , and a fourth, directional source oriented along $ \mathbf{d}$ , where $ \mathbf{x}_0=(0.5,0.5,0.1)$ , $ \mathbf{x}_1=(0.25,0.25,0.1)$ , $ \mathbf{x}_2=(0.75,0.75,0.5)$ , and $ \mathbf{d}=(1,1,-1)/\sqrt{3}$ . Note that, in the case of the Overthrust model, these source locations should be interpreted proportionally (e.g., an $ x_3$ value of $ 0.1$ corresponds to a depth which is $ 10\%$ of the total depth of the model). Due to the oscillatory nature of the solution, we simply choose the zero vector as our initial guess in all experiments.
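
To make the proportional convention concrete, the short Python sketch below (ours, not part of the PSP code) maps the proportional location $ \mathbf{x}_0=(0.5,0.5,0.1)$ onto the $ 801 \times 801 \times 187$ Overthrust grid described in Fig. 6; a proportional depth of $ 0.1$ works out to $ 0.465$ km, close to the 456 m shot depth used for Fig. 7.

    # Map proportional source coordinates onto the Overthrust grid
    # (801 x 801 x 187 samples over 20 km x 20 km x 4.65 km at 25 m spacing;
    # see Fig. 6).  Both the helper name and the rounding rule are ours.

    def proportional_to_index(p, shape=(801, 801, 187)):
        """Convert proportional coordinates in [0,1]^3 to nearest grid indices."""
        return tuple(round(p_i * (n_i - 1)) for p_i, n_i in zip(p, shape))

    extent_km = (20.0, 20.0, 4.65)   # physical extent of the model
    x0 = (0.5, 0.5, 0.1)             # first source location, proportional units

    print(proportional_to_index(x0))   # (400, 400, 19): center of the x1-x2 plane
    print(x0[2] * extent_km[2])        # 0.465 km, i.e. 10% of the total depth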

The first experiment was designed to test the convergence rate of the sweeping preconditioner over domains spanning 50 wavelengths in each direction (resolved with ten points per wavelength), and each test made use of 256 nodes of Lonestar. During the course of these tests, we noticed that significant care must be taken when setting the amplitude of the derivative of the PML takeoff function (i.e., the "C" variable in Eq. (2.1) from (15)); for brevity, we hereafter refer to this variable as the PML amplitude. A modest search was performed in order to find a near-optimal value of the PML amplitude for each problem. These values are reported in Table 1, along with the number of iterations required for the relative residuals of all four forcing functions to drop below $ 10^{-5}$ .
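
Conceptually, the modest search amounts to the loop sketched below. This is our own illustration: the callable passed in stands for a full preconditioned GMRES solve with a given PML amplitude, and the candidate amplitudes are placeholders rather than the values actually tried.

    # A minimal sketch of the search for a near-optimal PML amplitude.
    # run_solve(amplitude) is assumed to perform the preconditioned solve and
    # return the number of iterations needed to reduce the relative residual
    # ||b - A x|| / ||b|| below 1e-5 for all four forcing functions, or None
    # if the iteration fails to reach that tolerance.

    def search_pml_amplitude(run_solve, candidates=(1.0, 2.0, 3.0, 4.0, 5.0)):
        best = None   # (amplitude, iterations) of the best candidate so far
        for amplitude in candidates:
            iterations = run_solve(amplitude)
            if iterations is None:
                continue   # this amplitude did not converge; try the next one
            if best is None or iterations < best[1]:
                best = (amplitude, iterations)
        return best

For the barrier model, for example, such a search settled on an amplitude of 3.0 and 28 iterations (Table 1).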


Table 1: The number of iterations required for convergence for four model problems (with four forcing functions per model). The grid sizes were $ 500^3$ and roughly 50 wavelengths were spanned in each direction. Five grid points were used for all PML discretizations, four planes were processed per panel, and the damping factors were all set to $ 7$ .
    velocity model     barrier   wedge   two-layer   waveguide
    frequency (Hz)        50       75        50         37.5
    PML amplitude         3.0      4.0       4.0        2.0
    iterations            28       49        48         52
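
As a quick sanity check on these problem sizes (the arithmetic is ours), fifty wavelengths per direction at roughly ten grid points per wavelength accounts for the 500 grid points in each dimension of the $ 500^3$ grids; the five-point PML regions are small by comparison.

    # Roughly 50 wavelengths per direction, each resolved with about ten grid
    # points, accounts for the 500 points per dimension quoted in Table 1.
    wavelengths_per_direction = 50
    points_per_wavelength = 10
    points_per_direction = wavelengths_per_direction * points_per_wavelength
    print(points_per_direction, points_per_direction ** 3)   # 500 and 125,000,000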


Figure 5.
A single $ x_2 x_3$ plane from each of the four analytical velocity models over a $ 500^3$ grid with the smallest wavelength resolved with ten grid points. (top-left) the three-shot solution for the barrier model near $ x_1=0.7$ , (bottom-left) the three-shot solution for the two-layer model near $ x_1=0.7$ , (top-right) the single-shot solution for the wedge model near $ x_1=0.7$ , and (bottom-right) the single-shot solution for the waveguide model near $ x_1=0.55$ .

It was also observed that, at least for the waveguide problem, the convergence rate improves significantly if six grid points of PML are used instead of five. In particular, the 52 iterations reported in Table 1 drop to 27 if the PML size is increased by one grid point. Interestingly, the same number of iterations is required for convergence of the waveguide problem at half the frequency (and half the resolution) with five grid points of PML; thus, it appears that the optimal PML size is a slowly growing function of the frequency. We also note that, though it was not intentional, each of these first four velocity models is invariant in one or more directions, and so it would be straightforward to sweep in a direction such that only $ O(1)$ panel factorizations need to be performed, effectively reducing the complexity of the setup phase to $ O(\gamma^2 N)$ .
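
The remark about invariant directions can be made concrete with the small NumPy sketch below (ours; `velocity` stands for any gridded model): it reports the axes along which a model does not vary, which are exactly the directions in which every panel of the sweep would see the same local problem, so that a single panel factorization could be reused.

    import numpy as np

    def invariant_axes(velocity, rtol=1e-12):
        """Return the axes along which a gridded velocity model is constant."""
        axes = []
        for axis in range(velocity.ndim):
            first_slice = np.take(velocity, 0, axis=axis)
            if np.allclose(velocity, np.expand_dims(first_slice, axis), rtol=rtol):
                axes.append(axis)
        return axes

    # Example: a two-layer model varies only with depth (axis 2), so sweeping
    # along axis 0 or axis 1 would require only O(1) panel factorizations.
    two_layer = np.ones((64, 64, 64))
    two_layer[:, :, 32:] = 2.0
    print(invariant_axes(two_layer))   # [0, 1]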

The last experiment was meant to simultaneously test the convergence rates and the scalability of the new sweeping preconditioner using the Overthrust velocity model (see Fig. 6) at various frequencies; the results are reported in Table 2. It is important to notice that this is not a typical weak scaling test, as it is the number of grid points per process that was fixed, not the local memory usage or the computational load, both of which grow superlinearly with respect to the total number of degrees of freedom. Nevertheless, including the setup phase, it took less than three minutes to solve the 3.175 Hz problem with four right-hand sides using 128 cores, and just under seven and a half minutes to solve the corresponding 8 Hz problem using 2048 cores. Also, while it is by no means the main message of this paper, the timings obtained without selective inversion are reported in parentheses, as that technique is not yet widely implemented.
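
To see why we describe this as fixing the number of grid points per process, the following check (ours, using the grid sizes and process counts from Table 2) shows that the degrees of freedom per process are essentially constant across the five runs, even though the memory and work per process are not.

    # Degrees of freedom per process for each column of Table 2.
    grids = [(319, 319, 75), (401, 401, 94), (505, 505, 118),
             (635, 635, 145), (801, 801, 187)]
    processes = [128, 256, 512, 1024, 2048]

    for (nx, ny, nz), p in zip(grids, processes):
        print(f"{p:5d} processes: {nx * ny * nz / p:8.0f} grid points per process")

    # Each run works out to roughly 57,000-60,000 grid points per process.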

Figure 6.
Three cross-sections of the SEG/EAGE Overthrust velocity model, which represents an artificial $ 20\, \mathrm{km} \times 20\, \mathrm{km} \times 4.65\, \mathrm{km}$ domain containing an overthrust fault, sampled every $ 25\, \mathrm{m}$ . The result is an $ 801 \times 801 \times 187$ grid of wave speeds varying discontinuously between $ 2.179\, \mathrm{km/sec}$ (blue) and $ 6.000\, \mathrm{km/sec}$ (red).


Table 2: Convergence rates and timings on TACC's Lonestar for the SEG/EAGE Overthrust model, where timings in parentheses do not make use of selective inversion. All cases used a double-precision second-order stencil with five grid spacings for all PML (with an amplitude of 7.5), and a damping parameter of $ 2.25 \pi $ . The preconditioner was configured with four planes per panel and eight processes per node, and the `apply' timings are with respect to a single application of the preconditioner to four right-hand sides.
    number of processes       128                    256                    512                     1024                    2048
    frequency (Hz)            3.175                  4                      5.04                    6.35                    8
    grid                      $ 319^2 \times 75$     $ 401^2 \times 94$     $ 505^2 \times 118$     $ 635^2 \times 145$     $ 801^2 \times 187$
    setup (sec)               48.40 (46.23)          66.33 (63.41)          92.95 (86.90)           130.4 (120.6)           193.0 (175.2)
    apply (sec/rhs)           0.468 (1.07)           0.550 (1.28)           0.645 (2.40)            0.700 (3.33)            0.880 (6.13)
    iterations to 3 digits    42                     44                     42                      39                      40
    iterations to 4 digits    54                     57                     57                      58                      58
    iterations to 5 digits    63                     69                     70                      68                      72
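
As a rough consistency check on the solve times quoted above (the estimate is ours and ignores the Krylov orthogonalization and the application of the discrete operator, so it is a mild underestimate), combining the setup times, the per-right-hand-side apply times, and the iteration counts for five digits of residual reduction yields totals consistent with the quoted figures:

    # Rough lower bound on the total time to solve four right-hand sides:
    # setup + iterations * (apply time per rhs) * (number of rhs).
    cases = {
        "3.175 Hz on 128 cores": (48.40, 0.468, 63),
        "8 Hz on 2048 cores":    (193.0, 0.880, 72),
    }
    num_rhs = 4

    for name, (setup, apply_per_rhs, iterations) in cases.items():
        total = setup + iterations * apply_per_rhs * num_rhs
        print(f"{name}: about {total / 60:.1f} minutes")
    # 3.175 Hz: about 2.8 minutes ("less than three minutes")
    # 8 Hz:     about 7.4 minutes ("just under seven and a half minutes")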


Figure 7.
Three planes from a solution for the Overthrust model with a single localized shot at the center of the $ x_1 x_2$ plane at a depth of 456 m: (top) an $ x_2 x_3$ plane near $ x_1=14$ km, (middle) an $ x_1 x_3$ plane near $ x_2=14$ km, and (bottom) an $ x_1 x_2$ plane near $ x_3=0.9$ km.

