Memory manipulation

The CUDA programming model assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Before we introduce the details about the architecture of the cu$Q$-RTM code package, we need to clarify the variable definition and figure out which variables need to be transferred between host memory and device memory. Table 1 presents some important variables allocated on host and device, which fall into three memory types: pageable host memory, page-locked host memory, and global device memory. The philosophy of choosing host variable types is that variables to be frequently copied between host and device, such as $seismogram\_rms$ and $image\_cor$, are allocated in page-locked host memory, whereas the rest of the host variables are allocated as regular pageable host memory. Because copies between page-locked host memory and device memory can be performed concurrently with kernel execution, data transfer can be overlapped during kernel execution leading to a more efficient streaming execution on cluster nodes with multiple GPUs.

struct MultiGPU contains page-locked host variables and global device variables (with $d$ as a prefix) on every stream. From this struct variable, we can estimate the total device memory usage before execution and ensure that the memory usage will not exceed the memory limit. CUDA threads (kernel functions) execute on a physically separate device (GPUs), whereas the rest of the C program executes on the host (CPUs). Therefore, a program manages the global memory accessible to kernels through calls to the CUDA runtime such as device memory allocation cuda_Device_malloc($\ldots$), deallocation cuda_Device_free($\ldots$), and initialization cuda_Host_initialization($\ldots$) as well as data transfer between host and device memory.

Fig1
Fig1
Figure 1.
The architecture of the cuQ-RTM code package.
[pdf] [png] [scons]


Table 1: Some important variables allocated on host and device.
  Memory type
 Allocation & Free  
Variables
<#4997#> pageable
 malloc()  
 free()  
 ricker, vp, Qp, Gamma, t_cp  
 kfilter, kstabilization  
 Final_image_cor, Final_image_cor  
  page-locked
 cudaMallocHost()  
 cudaFreeHost()  
Device global
 cudaMalloc()  
 cudaFree()  
 d_ricker, d_vp, d_Gamma, d_t_cp, d_u_cp  
 d_u0, d_u1,d_u2, d_seimogram_rms  
 d_image_cor, d_image_nor  
 d_uk, d_Lap_uk, d_amp_uk, d_pha_uk  
 d_borders_up, d_u2_final0, d_u2_final1  


2020-04-03