Unfortunately, although SMC could be shorter, it's even slower, since the pipeline gets flushed every time a write occurs to a location within the current fetch window (not sure exactly how large that is, but it's not small).
Yes, and the instruction cache becomes stale as well. I guess one way to avoid that is to have 16 code blocks back-to-back, one per register, and then jmp into the block that uses the right register. JMPs are pretty cheap, and the target is likely to be in cache anyway.
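
A rough sketch of the 16-block idea in C, just to show the shape of it. Everything here is made up for illustration: regfile_t is a hypothetical array standing in for the 16 hardware registers, add_to_reg is a hypothetical name, and the add is a placeholder for whatever the real operation is. In actual assembly each block would hard-code a different register. Compilers typically lower a dense switch like this to an indirect jump through a table, which is roughly the "jmp into the right block" approach, though that's not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for 16 CPU registers. */
typedef struct {
    uint32_t r[16];
} regfile_t;

/* One fixed code block per register; no code is rewritten at run time. */
#define REG_BLOCK(n) case n: rf->r[n] += amount; break;

/* Dispatch to the block for `reg` instead of patching the instruction. */
void add_to_reg(regfile_t *rf, unsigned reg, uint32_t amount)
{
    assert(reg < 16); /* only 16 blocks exist */

    switch (reg) {
        REG_BLOCK(0)  REG_BLOCK(1)  REG_BLOCK(2)  REG_BLOCK(3)
        REG_BLOCK(4)  REG_BLOCK(5)  REG_BLOCK(6)  REG_BLOCK(7)
        REG_BLOCK(8)  REG_BLOCK(9)  REG_BLOCK(10) REG_BLOCK(11)
        REG_BLOCK(12) REG_BLOCK(13) REG_BLOCK(14) REG_BLOCK(15)
    }
}
```

The cost is code size (16 copies of the block), but nothing ever writes into the instruction stream, so there's no flush to pay for.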