
Multiple GPU Implementation

When solving the 3D elastic WE on large-scale model domains (i.e., $N_x \times N_y \times N_z > 300^3$), the limited global memory of an individual GPU precludes storage of the entire grid on a single device. To parallelize our 3D FDTD algorithm across multiple GPUs, we adopt a domain-decomposition scheme, illustrated in Figure 3, that divides the computational grid along the slowest-varying $y$-axis and assigns the sub-domains to separate GPU devices. Each GPU individually executes Steps 1-9 on its assigned sub-domain, whilst CPU-based control threads coordinate the operations of the multiple devices, enable inter-device communication, and combine results to produce the output data.
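To make the decomposition concrete, the following C/CUDA sketch splits the grid along the $y$-axis and allocates one sub-domain, padded with ghost planes for the stencil halo, on each device. The structure and variable names (SubDomain, RADIUS, ngpu, and so on) are illustrative assumptions, not the identifiers used in our released codes.

#include <cuda_runtime.h>

#define RADIUS 4                 /* stencil half-width along y (Figure 1) */

typedef struct {
    int y0, ny_local;            /* first global y index and local y extent */
    float *d_field;              /* one of the wavefield arrays on the device */
} SubDomain;

/* Split ny grid planes evenly across ngpu devices (assumes ngpu divides ny)
   and allocate each sub-domain with RADIUS ghost planes on each y face. */
void decompose_y(SubDomain *sub, int ngpu, int nx, int ny, int nz)
{
    int chunk = ny / ngpu;
    for (int g = 0; g < ngpu; g++) {
        cudaSetDevice(g);                          /* select device g */
        sub[g].y0       = g * chunk;
        sub[g].ny_local = chunk;
        size_t n = (size_t)nx * nz * (chunk + 2 * RADIUS);
        cudaMalloc((void **)&sub[g].d_field, n * sizeof(float));
    }
}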

Figure 3.
Schematic of the domain decomposition of the computational grid along the $y$-axis across four GPU devices distributed equally across two compute nodes. Data in the interior boundary regions of each sub-domain must be shared with its logical neighbor using either P2P or MPI in the gray area, and necessarily with MPI across the black hatched region.

Because the FD stencil in Figure 1 requires data from points extending four units in the forward and backward $y$-axis directions, the GPU threads that apply the stencil to points along the sub-domain edges may require data from a logically adjacent device. Therefore, boundary data from adjacent sub-domains must be exchanged between GPUs at every time step. This communication can be expensive due to the limited bandwidth of the PCIe bus within a node and/or of the network connections between nodes in a distributed system.
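As an illustration of this exchange in the distributed case, the sketch below stages the four boundary $y$-planes through host buffers and swaps them with the logical neighbor using MPI_Sendrecv. All names (h_send, h_recv, rank_up, rank_dn, and the offsets) are assumed for the example; a boundary sub-domain would pass MPI_PROC_NULL for its missing neighbor.

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange the RADIUS uppermost interior y-planes with the neighbor above
   while receiving the lower neighbor's planes into this sub-domain's
   lower ghost region; the opposite direction is handled symmetrically. */
void exchange_up(float *d_field, float *h_send, float *h_recv,
                 int nelems, size_t send_off, size_t recv_off,
                 int rank_up, int rank_dn, MPI_Comm comm)
{
    size_t bytes = (size_t)nelems * sizeof(float);

    /* stage the outgoing boundary planes in a host buffer */
    cudaMemcpy(h_send, d_field + send_off, bytes, cudaMemcpyDeviceToHost);

    /* send up, receive from below in a single deadlock-free call */
    MPI_Sendrecv(h_send, nelems, MPI_FLOAT, rank_up, 0,
                 h_recv, nelems, MPI_FLOAT, rank_dn, 0,
                 comm, MPI_STATUS_IGNORE);

    /* copy the received planes into the device ghost zone */
    cudaMemcpy(d_field + recv_off, h_recv, bytes, cudaMemcpyHostToDevice);
}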

In a consolidated multi-GPU computing environment where all devices share a common PCIe bus, the P2P protocol introduced in NVIDIA's CUDA v4.0 (for GPUs with compute capability $\geq 2.0$) can be used to exchange data directly between devices. In a distributed GPU environment, direct P2P communication is not possible because the devices neither share a common PCIe bus nor have direct access to the network. Therefore, as shown in Figure 4, the CPU-based control threads must use the MPI communication interface (or equivalent) to enable communication over the network.
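A minimal sketch of the intra-node P2P path, assuming the neighboring sub-domains reside on devices attached to the same PCIe bus, might look as follows. cudaDeviceEnablePeerAccess and cudaMemcpyPeerAsync are the standard CUDA runtime calls; the surrounding bookkeeping and names are illustrative, and error checking is omitted for brevity.

#include <cuda_runtime.h>

/* Enable peer access between each device and its logical y-neighbors. */
void enable_p2p(int ngpu)
{
    for (int g = 0; g < ngpu; g++) {
        cudaSetDevice(g);
        if (g > 0)        cudaDeviceEnablePeerAccess(g - 1, 0);
        if (g < ngpu - 1) cudaDeviceEnablePeerAccess(g + 1, 0);
    }
}

/* Copy 'bytes' of boundary data from device 'src' directly into the ghost
   planes on device 'dst' without staging through system memory. */
void p2p_halo_copy(float *d_dst, int dst, const float *d_src, int src,
                   size_t bytes, cudaStream_t stream)
{
    cudaMemcpyPeerAsync(d_dst, dst, d_src, src, bytes, stream);
}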

Figure 4.
Schematic showing a distributed computing environment where MPI is used to communicate between nodes, and either MPI or (hybrid) P2P communication is used within a node.

By eliminating system-memory allocation and copy overhead, direct P2P memory transfer reduces total compute time by $10-15\%$ compared with MPI-based send/receive commands within a consolidated compute node. In a distributed environment of multi-GPU compute nodes, a hybrid communication scheme can be adopted in which each node uses a single CPU control thread to manage its local GPUs (using P2P transfers between devices) while communicating with remote GPUs via the MPI sub-system when necessary. However, for the sake of simplifying our released codes and reproducible examples, we refrain from discussing this situation in more detail herein.
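Although our released codes do not implement this hybrid variant, the choice it involves can be sketched as follows: a transfer uses P2P when the logical neighbor resides on the same node, and falls back to a host-staged MPI send otherwise. The flags and buffers (same_node, h_stage, peer_dev, peer_rank) are hypothetical.

#include <mpi.h>
#include <cuda_runtime.h>

/* Send one halo buffer to the logical neighbor, choosing the transport
   based on whether the neighbor's GPU is attached to the same node. */
void send_halo_hybrid(const float *d_halo, float *d_peer_ghost,
                      float *h_stage, size_t bytes,
                      int same_node, int my_dev, int peer_dev,
                      int peer_rank, MPI_Comm comm)
{
    if (same_node) {
        /* intra-node: direct device-to-device transfer over the PCIe bus */
        cudaMemcpyPeer(d_peer_ghost, peer_dev, d_halo, my_dev, bytes);
    } else {
        /* inter-node: stage through the host, then send over the network;
           a matching MPI_Recv is posted by the remote control thread */
        cudaMemcpy(h_stage, d_halo, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(h_stage, (int)bytes, MPI_BYTE, peer_rank, 0, comm);
    }
}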

