
Memory Pinning and Transfer Data between Host (CPU) and Device (GPU)

2024/02/09

The official PyTorch documentation explains this concept only briefly, so we go into more detail here.

What is memory pinning and why we use it

First, let’s go back to our OS class and recall what “paged memory” means. A process always wants contiguous memory. The OS uses memory paging to provide logically contiguous memory that need not be physically contiguous. When a process requests memory, the OS allocates page frames to it. These page frames look contiguous to the process, but are not actually contiguous in physical memory. The OS then maps the process’s logical pages to the physical page frames.

This Nvidia blog on data transfer explains what this has to do with the GPU: the GPU cannot access data directly from pageable host memory (the ordinary, logically contiguous memory described above), so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array (locked into physical RAM so the GPU can reach it via DMA), copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.

[Figure from the Nvidia blog: data transfer paths for pageable vs. pinned host memory]

Therefore, we can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory.
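In PyTorch terms, this is roughly what allocating directly in pinned memory looks like. A minimal sketch (the shapes are arbitrary, and it assumes a CUDA-enabled build, since pinning is only available there):

import torch

# allocate a host tensor directly in pinned (page-locked) memory
x = torch.empty(1024, 1024, pin_memory=True)
print(x.is_pinned())  # True

# an ordinary tensor lives in pageable memory ...
y = torch.randn(1024, 1024)
print(y.is_pinned())  # False

# ... and pin_memory() returns a copy of it in pinned memory
print(y.pin_memory().is_pinned())  # True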

To my understanding, “directly allocating in pinned memory” corresponds to what’s described in DataLoader’s documentation as:

loader = DataLoader(dataset, pin_memory=True)
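To see what this flag does, here is a toy sketch (the TensorDataset and batch size are made up; it again assumes a CUDA-enabled machine) checking that the batches the loader yields already live in pinned memory:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

xb, yb = next(iter(loader))
print(xb.is_pinned(), yb.is_pinned())  # True True: the collated batches are pinned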

pin_memory() and non_blocking=True

On the other hand, while reading nanoGPT’s code, I saw the following code:

if device_type == 'cuda':
    # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
    x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)

pin_memory is familiar to us by now, while non_blocking is something new. It tells PyTorch not to wait for the copy to finish: the call returns immediately, so the CPU can keep doing other work while the data is being transferred from host to device (instead of blocking until the transfer is done). This asynchronous copy usually speeds things up. This Stack Overflow answer gives a detailed example of the asynchronous part.
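As a concrete illustration of that overlap, here is a small sketch (the tensor size and the “other CPU work” are made up; it requires a CUDA device):

import torch

assert torch.cuda.is_available()

x = torch.randn(4096, 4096).pin_memory()  # pinned source tensor on the host

# start the host-to-device copy; with pinned memory + non_blocking=True
# this call returns immediately instead of waiting for the copy to finish
x_gpu = x.to("cuda", non_blocking=True)

# the CPU is free to do unrelated work while the copy is in flight
other_work = sum(i * i for i in range(1_000_000))

# wait for all queued GPU work (including the copy) before relying on x_gpu
torch.cuda.synchronize()
print(x_gpu.device, other_work)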

Here in the code, we are explicitly calling pin_memory() on something already initialized, which really confused me. According to the Nvidia blog quoted above, “when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a pinned host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory.” That is to say: even without an explicit pin_memory() call, CUDA will do the pinning for us.

I found this exchange on PyTorch’s forum and also asked the question myself, but didn’t receive a super clear answer. Inferring from what @ptrblck said, though, I think it is correct to say that the following two commands are equal in speed (the first pins memory implicitly, the second does it explicitly):

  1. t.to("cuda", non_blocking=False)
  2. t.pin_memory().to("cuda", non_blocking=False)

and that an explicit pin_memory() call is only useful when combined with to(device, non_blocking=True).
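To sanity-check this, a rough timing sketch could look like the following (the tensor size and iteration count are arbitrary, and this is by no means a rigorous benchmark):

import time
import torch

assert torch.cuda.is_available()

def avg_seconds(fn, iters=20):
    # average wall-clock time per call, synchronizing the GPU around the loop
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t = torch.randn(64, 1024, 1024)  # ~256 MB of float32 in pageable host memory

# 1. implicit pinning: the CUDA driver stages the copy through its own pinned buffer
print(avg_seconds(lambda: t.to("cuda", non_blocking=False)))
# 2. explicit pinning, still a blocking copy: expected to take roughly as long
print(avg_seconds(lambda: t.pin_memory().to("cuda", non_blocking=False)))

The real benefit of explicit pinning should only show up when the copy is issued with non_blocking=True and overlapped with other work, as in the nanoGPT snippet above.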

Someone in this Zhihu discussion also argues that pageable memory can be swapped out to disk when physical memory runs low. Explicitly pinning memory avoids this problem and saves the time of fetching those pages back from disk on every access (pinning keeps them all resident in physical memory). The poster did not give a reference, though.
