It's likely I have absolutely no idea what I'm talking about here, but:
If you're trying to compile a language with features like continuations to machine code, I wonder if it would be a good idea to ignore the CPU's stack, allocate everything on the heap, and roll your own linked list of frames to store your local variables. Function calls become significantly slower because each one has to call malloc to get a new frame, but the upside would be that capturing or reordering the stack becomes extremely cheap: it's just pointer manipulation and no longer involves memcpy.
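To make the idea a bit more concrete, here is a rough sketch in C of what such heap-allocated frames might look like. Everything in it is made up for illustration: the `Frame` layout, the function names, and especially the reference counting, which is only my assumption about how frames could get freed (a real implementation would more likely lean on a garbage collector). The point is only that capturing a continuation copies one pointer instead of the stack contents.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical frame layout: one heap-allocated activation record per call. */
typedef struct Frame {
    struct Frame *parent; /* caller's frame, i.e. the rest of the "stack" */
    int refcount;         /* assumption: frames are reference counted */
    int local;            /* a single local variable, for illustration */
} Frame;

/* "Calling" a function: malloc a frame and link it to the caller's frame. */
static Frame *frame_push(Frame *parent, int local) {
    Frame *f = malloc(sizeof *f);
    if (f == NULL) abort();
    f->parent = parent;
    f->refcount = 1;                 /* held by the running code */
    f->local = local;
    if (parent) parent->refcount++;  /* the child now points at its parent */
    return f;
}

/* Capturing a continuation: keep one more pointer to the current frame. */
static Frame *continuation_capture(Frame *current) {
    current->refcount++;
    return current;
}

/* Drop one reference and free every frame nothing points at any more. */
static void frame_release(Frame *f) {
    while (f != NULL) {
        assert(f->refcount > 0);
        if (--f->refcount != 0) break;
        Frame *parent = f->parent;
        free(f);
        f = parent;
    }
}

/* Walk from a frame back to the root, summing the locals along the way. */
static int chain_sum(const Frame *f) {
    int sum = 0;
    for (; f != NULL; f = f->parent) sum += f->local;
    return sum;
}
```

"Returning" from a call is `frame_release` on the current frame, and reinstating a captured continuation is just making the saved pointer the current frame again. Because frames are shared rather than copied, a frame that has already returned stays alive for as long as some continuation still points at it.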