I think CUDA even lets you allocate pinned host memory directly now - cudaMallocHost, or cuMemHostAlloc in the driver API - so you can skip the double-copy through a staging buffer
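A minimal sketch of that (error handling omitted; sizes are made up):

    // Pinned (page-locked) host memory lets the DMA engine read the buffer
    // directly, skipping the driver's internal staging copy.
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 20;
        float *h_pinned, *d_buf;
        cudaMallocHost(&h_pinned, n * sizeof(float)); // pinned host allocation
        cudaMalloc(&d_buf, n * sizeof(float));

        // With a pinned source this copy is truly asynchronous w.r.t. the host.
        cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                        cudaMemcpyHostToDevice, 0);
        cudaDeviceSynchronize();

        cudaFree(d_buf);
        cudaFreeHost(h_pinned); // pinned memory is freed with cudaFreeHost, not free()
        return 0;
    }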


CUDA provides a tier significantly above that: unified memory.

See: https://on-demand.gputechconf.com/gtc/2017/presentation/s728...

And: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

However, the Windows driver infrastructure's unified memory support lags well behind, offering only the pre-Pascal feature set.

For those features you'll have to use Linux. Note that WSL2 counts as Windows here, since the limitation is in the Windows driver infrastructure.


I've switched to using cudaMallocManaged() exclusively. From what I can tell there isn't much of a performance difference, and a few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems that do show up. I also really love that you can just break in gdb and look around in that memory.
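A minimal sketch of that pattern (device id, sizes, and the kernel are placeholders):

    #include <cuda_runtime.h>

    __global__ void scale(float *x, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const size_t n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float)); // one pointer, valid on host and device

        for (size_t i = 0; i < n; ++i) x[i] = 1.0f; // touch on the host, no explicit copies

        // Migrate the pages to the device up front instead of faulting them in one by one.
        cudaMemPrefetchAsync(x, n * sizeof(float), /*device=*/0, 0);
        scale<<<(n + 255) / 256, 256>>>(x, n);
        cudaDeviceSynchronize();

        // At this point the host (or gdb) can read x directly.
        cudaFree(x);
        return 0;
    }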


Unified memory is just _different_, not above or below. It offers on-demand paging, but that can come at a cost in memory I/O speed.


It's a feature tier above, with much more emphasis on ease of use from the programmer's perspective.

It also allows for significantly more approachable programming models. For example: https://developer.nvidia.com/blog/accelerating-standard-c-wi...
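A sketch of the style that post describes, assuming it's built with nvc++ -stdpar=gpu (a saxpy over std::vector, no CUDA API calls at all):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
        const float a = 3.0f;

        // With -stdpar=gpu, nvc++ offloads this to the device; unified memory
        // makes the vectors' storage reachable there without explicit copies.
        std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                       y.begin(), [a](float xi, float yi) { return a * xi + yi; });
        return 0;
    }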


Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure" - that you should instead have a unified sense of "allocate", and that it could sometimes be host, sometimes device, sometimes migrate.

The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).

So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".


Sure, but pinned memory is often a limited resource, and having the GPU access it in place means every access is a PCIe transaction. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
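A minimal sketch of that overlap pattern, with two streams, a pinned staging buffer, and a placeholder kernel:

    #include <cuda_runtime.h>

    __global__ void process(float *buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;
    }

    int main() {
        const size_t chunk = 1 << 20, nChunks = 8;
        float *h_src, *d_buf[2];
        cudaMallocHost(&h_src, chunk * nChunks * sizeof(float)); // pinned, so copies can overlap
        cudaMalloc(&d_buf[0], chunk * sizeof(float));
        cudaMalloc(&d_buf[1], chunk * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // The copy for chunk i+1 (on the other stream) overlaps the kernel
        // for chunk i; within a stream, each kernel waits for its own copy.
        for (size_t i = 0; i < nChunks; ++i) {
            int b = i % 2;
            cudaMemcpyAsync(d_buf[b], h_src + i * chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        cudaFree(d_buf[0]); cudaFree(d_buf[1]);
        cudaFreeHost(h_src);
        return 0;
    }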


In CUDA, some transfers involving pageable host memory are completely synchronous from the perspective of the host, even if you use `cudaMemcpyAsync`:

https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behav...

Pinned memory is typically used to get around this synchronous behavior.
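A small sketch of the difference (buffer names made up; behavior per the sync-rules doc above):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;
        float *d, *pinned;
        float *pageable = (float *)malloc(bytes); // ordinary pageable memory
        cudaMalloc(&d, bytes);
        cudaMallocHost(&pinned, bytes);           // page-locked memory

        // Pageable source: the runtime stages through an internal pinned
        // buffer, and the call can block the host until the data is staged.
        cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice, 0);

        // Pinned source: the call just enqueues a DMA and returns, leaving
        // the host free to do other work until it synchronizes.
        cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice, 0);

        cudaDeviceSynchronize();
        cudaFree(d);
        cudaFreeHost(pinned);
        free(pageable);
        return 0;
    }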



