Compare to polyhedral optimization [0], supported in GCC via Graphite [1] and in LLVM via Polly [2].
These have a lower ceiling than hand-optimized assembler, but they automate the tuning of how loops should be nested and tiled to get the most out of the cache hierarchy. Considering they can do this for general-purpose number-crunching loops, they are rather nice, but there is still some integration work needed, especially for LLVM, where Polly lacks a natural place in the existing optimization pipeline. A rough illustration of the kind of restructuring they automate is sketched below.
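To make that concrete, here is a hand-written sketch of the transformation a polyhedral optimizer derives on its own: a naive matrix multiply next to a cache-tiled version of the same loop nest. The N and TILE values are arbitrary choices for illustration, not defaults from Graphite, Polly, or any paper.

```c
#include <stddef.h>

#define N    512
#define TILE 64   /* illustrative tile size; a real optimizer picks this per target */

/* Naive i-j-k loop nest: the inner loop walks B column-wise, so once N is
 * large it touches a fresh cache line on almost every iteration. */
void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Tiled (blocked) version: the same iteration space, reordered so each
 * TILE x TILE block of A, B, and C is reused many times while it is hot.
 * Assumes N is a multiple of TILE to keep the example short. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++)
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

In practice you write the naive version and let the compiler do the blocking: GCC applies Graphite under -floop-nest-optimize, and Clang builds that include Polly can enable it with -mllvm -polly.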
Polyhedral optimisation is cool (Facebook has been using it to great effect for ML kernels recently [1]), but it’s not the end of the story. It’s complementary to this paper, which seems to be about learning an effective and transferable cost model to guide the optimisation process (you could use that learned cost model in a polyhedral optimiser).
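A rough sketch of how a learned cost model would slot into such an optimiser, with every type and function name (Schedule, predict_cost, select_schedule) invented for illustration rather than taken from Graphite, Polly, or the paper: the optimiser enumerates candidate schedules and asks the model to rank them instead of relying on a hand-written heuristic.

```c
#include <stddef.h>
#include <float.h>

/* A candidate loop schedule: tile sizes plus an id for the loop order. */
typedef struct {
    int tile_i, tile_j, tile_k;
    int interchange;
} Schedule;

/* Stand-in for the learned model: in reality this would run a trained
 * predictor over features extracted from the schedule and the target. */
static double predict_cost(const Schedule *s) {
    /* toy placeholder so the example is self-contained */
    return 1.0 / (double)(s->tile_i * s->tile_j) + 0.01 * s->interchange;
}

/* Pick whichever candidate the model believes is cheapest. */
static Schedule select_schedule(const Schedule *candidates, size_t n) {
    size_t best = 0;
    double best_cost = DBL_MAX;
    for (size_t i = 0; i < n; i++) {
        double c = predict_cost(&candidates[i]);
        if (c < best_cost) { best_cost = c; best = i; }
    }
    return candidates[best];
}
```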
[0]: https://en.wikipedia.org/wiki/Polytope_model
[1]: https://gcc.gnu.org/wiki/Graphite
[2]: https://polly.llvm.org/