If one is willing to write self-modifying code, it is possible to overwrite the VEX.vvvv field of the instruction encoding with the register selector. The field is stored in one's complement, so to select XMM0/YMM0 you'd write VEX.vvvv = 1111, and for XMM15/YMM15 you'd write VEX.vvvv = 0000.
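To make the encoding concrete, here is a minimal C sketch of patching the vvvv field. It assumes the caller already knows which byte is the last byte of the VEX prefix (the second byte of a 2-byte C5 prefix, or the third byte of a 3-byte C4 prefix), where vvvv sits in bits 6:3. The helper name vex_set_vvvv is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the last VEX prefix byte with its vvvv field
   (bits 6:3) replaced by the one's complement of `reg` (0..15).
   All other bits (R or W, L, pp) are left unchanged. */
static uint8_t vex_set_vvvv(uint8_t vex_last_byte, unsigned reg)
{
    assert(reg < 16);
    uint8_t inverted = (uint8_t)(~reg & 0xFu);      /* vvvv is stored inverted */
    return (uint8_t)((vex_last_byte & ~0x78u) | (inverted << 3));
}
```

For instance, vaddps xmm0, xmm1, xmm2 assembles to C5 F0 58 C2; calling vex_set_vvvv(0xF0, 15) gives 0x80, which changes the first source operand to xmm15.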
My first thought was that this was an interesting approach, but that there was no speed advantage because of the cache flush and pipeline reset. But maybe it could be made to work...
What if, instead of overwriting an existing instruction, you generated the instructions you needed and appended them to a new writable and executable buffer? Then every 100/1000/1000000 characters, you terminate the buffer with a 'ret' and 'call' into it. The generated code branchlessly updates the set of XMM/YMM/ZMM registers you are using without touching memory. Repeat until you've read to the end of your input, then store the vector register results by writing their contents out to RAM in a known order. Perhaps you cycle between a few buffers so that the address you are calling into is never cached.
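A rough C sketch of the generation step might look like the following. It only emits bytes and does not execute them; mapping the buffer executable (for example with mmap and PROT_EXEC on POSIX) and calling into it is left out. The instruction chosen, vpaddd xmmN, xmmN, xmm1, is just a placeholder for whatever per-register update is wanted, and the names emit_vpaddd and gen_block are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write "vpaddd xmmN, xmmN, xmm1" (2-byte VEX form) for N = reg at `out`.
   Returns the number of bytes written (always 4). */
static size_t emit_vpaddd(uint8_t *out, unsigned reg)
{
    assert(reg < 16);
    out[0] = 0xC5;                                   /* 2-byte VEX escape */
    out[1] = (uint8_t)(((~reg & 0x8u) << 4)          /* inverted R: bit 3 of dest */
                     | ((~reg & 0xFu) << 3)          /* inverted vvvv: first source */
                     | 0x01u);                       /* L=0 (128-bit), pp=01 (66h) */
    out[2] = 0xFE;                                   /* opcode: PADDD */
    out[3] = (uint8_t)(0xC0u | ((reg & 7u) << 3) | 1u); /* ModRM: reg=N, rm=xmm1 */
    return 4;
}

/* Emit one instruction per entry of regs[], then a terminating ret.
   Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t gen_block(uint8_t *buf, size_t cap, const uint8_t *regs, size_t n)
{
    if (cap < 1 || n > (cap - 1) / 4)
        return 0;
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += emit_vpaddd(buf + len, regs[i]);
    buf[len++] = 0xC3;                               /* ret */
    return len;
}
```

Each input character would pick one register index, so the cost per character on the generation side is a handful of byte stores with no branch on the register number.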
Overkill, but it seems like this might actually be pretty fast, presuming you can generate the instructions fast enough. I've considered this technique before for fast integer decompression when trying to avoid branch mispredictions, but never got so far as to actually test it.
Unfortunately, although SMC could be shorter, it's even slower, since the pipeline gets flushed every time a write hits a location within the current fetch window (not sure exactly how long that is, but it's not small).
Yes, and the instruction cache becomes stale as well. I guess one way to avoid this is to have 16 code blocks back-to-back and then do a jmp into the block that uses the right register. JMPs are pretty cheap, and the target is likely to be in cache anyway.
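One way to picture the 16-block layout is as fixed-stride blocks, so the jump target is just base + reg * stride. The C sketch below builds such a table as raw bytes purely to show the layout; in practice the blocks would more likely be written once in assembly. The 8-byte stride, the int3 padding, the placeholder instruction vpaddd xmmN, xmmN, xmm1, and the names block_offset and build_blocks are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NUM_REGS = 16, BLOCK_STRIDE = 8 };  /* assumed stride, power of two */

/* Byte offset of the block that handles register `reg`. */
static size_t block_offset(unsigned reg)
{
    assert(reg < NUM_REGS);
    return (size_t)reg * BLOCK_STRIDE;
}

/* Fill `code` (NUM_REGS * BLOCK_STRIDE bytes) with 16 back-to-back blocks,
   each "vpaddd xmmN, xmmN, xmm1; ret" padded out to the stride with int3. */
static void build_blocks(uint8_t *code)
{
    memset(code, 0xCC, NUM_REGS * BLOCK_STRIDE);     /* int3 padding */
    for (unsigned reg = 0; reg < NUM_REGS; reg++) {
        uint8_t *p = code + block_offset(reg);
        p[0] = 0xC5;                                 /* 2-byte VEX escape */
        p[1] = (uint8_t)(((~reg & 0x8u) << 4)        /* inverted R */
                       | ((~reg & 0xFu) << 3)        /* inverted vvvv */
                       | 0x01u);                     /* L=0, pp=01 */
        p[2] = 0xFE;                                 /* opcode: PADDD */
        p[3] = (uint8_t)(0xC0u | ((reg & 7u) << 3) | 1u); /* ModRM */
        p[4] = 0xC3;                                 /* ret */
    }
}
```

Since the blocks never change after they are built, nothing is written near the code being fetched, and dispatch costs one shift, one add, and one indirect jump or call.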