I've switched to using cudaMallocManaged() exclusively. From what I can tell, there isn't much of a performance difference, and a few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems that do show up. I really love that you can just break in gdb and look around in that memory as well.
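In case it helps anyone, here is roughly the pattern I mean, as a minimal sketch (the kernel, sizes, and launch config are made up for illustration): allocate once with cudaMallocManaged(), prefetch to the GPU before the launch, and prefetch back before the CPU reads the results.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Toy kernel, just to have something touch the managed buffer on the device.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;

        // One allocation, visible from both host and device; pages migrate on demand.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        // Hint the migration before the launch instead of paying page faults inside the kernel.
        int device = 0;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(data, n * sizeof(float), device);

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

        // Prefetch back so the CPU read below doesn't fault page by page.
        cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
        cudaDeviceSynchronize();

        printf("data[0] = %f\n", data[0]);   // also the memory you can poke at from gdb
        cudaFree(data);
        return 0;
    }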
Yeah, sorry if I was unclear: some folks thought that cudaMallocHost() et al. and pinned memory were "impure". That you should instead have a unified sense of "allocate", and that it could sometimes be host, sometimes device, sometimes migrate.
The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).
So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".
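For concreteness, that basic model is just the classic two-step dance: allocate on the device, copy in, launch, copy out. A toy sketch (kernel and sizes invented for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add_one(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1024;
        float h[n];
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));                            // "device malloc"
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

        add_one<<<(n + 255) / 256, 256>>>(d, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
        printf("h[0] = %f\n", h[0]);
        cudaFree(d);
        return 0;
    }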
Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
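To illustrate what I mean by hiding the overhead (a rough sketch from memory; the kernel, chunking, and stream count are arbitrary): pin the host buffer with cudaMallocHost(), then issue cudaMemcpyAsync() and the kernel on streams so the transfer for one chunk overlaps the compute on another.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n, float f) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= f;
    }

    int main() {
        const int n = 1 << 20;
        const int half = n / 2;

        float *h = nullptr, *d = nullptr;
        cudaMallocHost(&h, n * sizeof(float));   // pinned host memory, needed for truly async copies
        cudaMalloc(&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // Two chunks on two streams: the copy for one half can overlap the kernel for the other.
        for (int k = 0; k < 2; ++k) {
            int off = k * half;
            cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            scale<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half, 2.0f);
            cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                            cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();

        printf("h[0] = %f\n", h[0]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }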