But a lockless algorithm would have the same amount of cache-line bouncing in the uncontended case.
The LWN series on lockless algorithms goes into a lot of detail about the atomic primitives they use, but apart from ring buffers, linked lists, and RCU, I didn't actually see any mention of higher-level lockless algorithms. For single-producer, single-consumer queues and for linked lists, lockless algorithms are known and performant, but many others are either not that performant or come with severe restrictions. For example, RCU sounds great, but you have to worry about grace periods, and you should only use it when reads vastly outnumber writes.
Just looking at this initially, thread_safe_var_init is not correct. There is no guarantee whatsoever that `*vpp = vp` actually does anything: the compiler is perfectly free to never materialize vp at all and just build the object in place through vpp, which would let readers observe partial initialization, and the situation is even worse on ARM, where the hardware can reorder the stores. You need some sort of memory barrier between the initialization and that last assignment, and the pointer should probably be volatile/std::atomic.
I've added use of Helgrind's `ANNOTATE_HAPPENS_BEFORE()` and `ANNOTATE_HAPPENS_AFTER()` macros to the functions in `atomics.c` and now this is Helgrind-clean as well as TSAN-clean.
The only bugs found by TSAN (as noted in another reply) were the one you found and two assert()s that read shared state without atomics. Those two asserts could be a real problem if one ran a non-NDEBUG build in production.
> O(N log(N)) where N is the maximum number of live threads that have read the variable and M is the number of values that have been set and possibly released).
If you reference N twice and M zero times, why did you define M? Did you make a typo in the O() or is this intentional?
> Most threads only ever need to call thread_safe_var_get().
> reference count the data.
Does getting a new value automatically release a reference to the old one (so it's no longer safe to read)?