<ahref="https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers">PyTorchofficial documentation</a> explains this concept very briefly and we gointo more detail here.
<h2 id="what-is-memory-pinning-and-why-we-use-it">What is memory pinningand why we use it</h2><p>First, let’s go back to our OS class and remind what “paged memory”means. Process always wants contiguous memory. The OS uses memory pagingto enable logically contiguous memory that is not physically contiguous.When a process requests memory, OS allocates page frames to the process.These page frames look contiguous to the process, but are actually notso in physical memory. The OS then maps the process’s logical pages tothe physical page frames.</p><p>This <ahref=”https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned-host-memory”>Nvidiablog on data transfer</a> explains what this has to do with GPU: The GPUcannot access data directly from pageable host memory (logicallycontiguous), so when a data transfer from pageable host memory to devicememory is invoked, the CUDA driver must first allocate a temporarypage-locked, or “pinned”, physically contiguous host array, copy thehost data to the pinned array, and then transfer the data from thepinned array to device memory.</p>DataLoader
’sdocumentation</a> as:</p>
</pre></td><td class="code"><pre>loader = DataLoader(dataset, pin_memory=True)
<h2 id="pin_memory-and-non_blocking-true">pin_memory() and non_blocking=True</h2>
<p>On the other hand, while reading <a href="https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436728a420e85796c57dba7e/train.py#L126-L128">nanoGPT's code</a>, I saw the following code:</p>
<pre><code>if device_type == 'cuda':
    # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
    x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
</code></pre>
<p><code>pin_memory</code> is familiar to us, while <code>non_blocking</code> is something new. It tells the program not to wait for the host-to-device transfer to finish before moving on to other operations (so it doesn't block on the copy). This asynchronous copy usually speeds things up. This <a href="https://stackoverflow.com/a/55564072">Stack Overflow answer</a> gives a detailed example of the async part.</p>
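<p>To make the "don't block" part concrete, here is a rough sketch of the overlap (assuming a CUDA device is available; <code>cpu_work</code> is just a hypothetical stand-in for whatever else the program does):</p>
<pre><code>import torch

def cpu_work():
    # hypothetical stand-in for other CPU-side work the program could be doing
    return sum(range(1_000_000))

x = torch.randn(32, 1024, 1024).pin_memory()   # page-locked host tensor

x_gpu = x.to("cuda", non_blocking=True)  # returns right away; the copy is queued on the current stream
busy = cpu_work()                        # runs while the host-to-device transfer is still in flight

y = x_gpu.sum()        # kernels on the same stream are ordered after the copy, so this is safe
print(busy, y.item())  # .item() waits for the GPU, so the value read here is correct
</code></pre>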
<p>Here in the code, we are explicitly calling <code>pin_memory()</code> on something already initialized, which really confused me, since according to the Nvidia blog quoted above, "when a data transfer from pageable host memory to cuda device memory is invoked, the CUDA driver must first allocate a pinned host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory." That is to say, even without such an explicit <code>pin_memory()</code> call, CUDA will do it for us.</p>
<p>I found <a href="https://discuss.pytorch.org/t/when-is-pinning-memory-useful-for-tensors-beyond-dataloaders/103710">this exchange on PyTorch's forum</a> and <a href="https://discuss.pytorch.org/t/how-is-explicit-pin-memory-different-from-just-calling-to-and-let-cuda-handle-it/197422">also asked this question myself</a>, but didn't receive a very clear answer. Inferring from what <a href="https://discuss.pytorch.org/u/ptrblck/summary"><span class="citation" data-cites="ptrblck">@ptrblck</span></a> said, though, I think it is correct to say that the following two commands are equal in speed (the first pins memory implicitly, the second does it explicitly):</p>
<ol type="1">
<li><code>t.to("cuda", non_blocking=False)</code></li>
<li><code>t.pin_memory().to("cuda", non_blocking=False)</code></li>
</ol>
<p>An explicit memory-pinning call is only useful when it is used together with <code>to(device, non_blocking=True)</code>.</p>
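<p>A rough way to see this (a sketch, assuming a CUDA device is available; sizes and the warm-up are arbitrary choices of mine) is to measure how long the CPU is blocked while issuing the copy: with pageable memory, <code>non_blocking=True</code> still stalls the host, while from a pinned tensor the call returns almost immediately and the copy proceeds in the background.</p>
<pre><code>import time
import torch

def host_time_ms(t):
    # time spent on the CPU inside the .to() call, not the full transfer time
    torch.cuda.synchronize()
    tic = time.perf_counter()
    t.to("cuda", non_blocking=True)
    toc = time.perf_counter()
    torch.cuda.synchronize()   # let the transfer finish before the next measurement
    return (toc - tic) * 1e3

torch.ones(1, device="cuda")   # warm up the CUDA context so it doesn't skew the first timing

pageable = torch.randn(64, 1024, 1024)               # ordinary pageable host memory
pinned = torch.randn(64, 1024, 1024).pin_memory()    # page-locked host memory

print(f"pageable + non_blocking=True: host blocked for {host_time_ms(pageable):.2f} ms")
print(f"pinned   + non_blocking=True: host blocked for {host_time_ms(pinned):.2f} ms")
</code></pre>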
<p>Someone in this Zhihu discussion also argues that pageable memory can be swapped out to disk when physical memory runs short. Explicitly pinning memory avoids this problem and saves the time of fetching those pages back from disk on every access (pinning keeps them resident in physical memory). The poster did not give a reference, though.</p>