Going back to first principles on this debugging, which is definitely not going to be done tonight.
1) The baseline configuration (MULT_AREA_OPT=0), regardless of the REGFILE_OUT_REG setting, works in hardware. Its answers match the NaCl curve25519 implementation for every query I've tried, and it is interoperable with OpenSSH (when I put the bitstream on a board, I can SSH to it).
With that in mind...
2) The original test vector in my simulation that I had thought was wrong (because it wasn't lining up with observed behavior) is in fact correct.
So *none* of my simulations, even with my recent hacks, are giving correct output. But the RTL synthesizes correctly and works on the FPGA (at least on Kintex-7... not yet tested on Trion).
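
For cross-checking a test vector without involving either the simulator or the board, a small software reference helps. The following is a minimal sketch of X25519 following the Montgomery ladder in RFC 7748, not the NaCl code used above; the names `x25519`, `decode_scalar`, `decode_u` and `BASE` are my own for illustration. It is not constant-time and is only meant for checking vectors.

```python
# Reference X25519 (RFC 7748 Montgomery ladder), for checking test vectors only.
# Not constant-time; do not use for real key handling.
P = 2**255 - 19   # field prime
A24 = 121665      # (486662 - 2) / 4

def decode_scalar(k: bytes) -> int:
    # Clamp the 32-byte little-endian scalar as RFC 7748 specifies.
    b = bytearray(k)
    b[0] &= 248
    b[31] &= 127
    b[31] |= 64
    return int.from_bytes(b, "little")

def decode_u(u: bytes) -> int:
    # Mask the top bit of the u-coordinate and reduce mod p.
    b = bytearray(u)
    b[31] &= 127
    return int.from_bytes(b, "little") % P

def x25519(k: bytes, u: bytes) -> bytes:
    scalar = decode_scalar(k)
    x1 = decode_u(u)
    x2, z2, x3, z3 = 1, 0, x1, 1
    swap = 0
    for t in reversed(range(255)):
        bit = (scalar >> t) & 1
        swap ^= bit
        if swap:  # conditional swap (branching, so not constant-time)
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = bit
        a = (x2 + z2) % P
        aa = a * a % P
        b = (x2 - z2) % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        d = (x3 - z3) % P
        da = d * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    # Affine result x2 / z2 via Fermat inversion.
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")

BASE = (9).to_bytes(32, "little")  # standard base point u = 9

if __name__ == "__main__":
    # Arbitrary example scalars, not taken from the testbench.
    a = bytes(range(32))
    b = bytes(range(32, 64))
    shared_ab = x25519(a, x25519(b, BASE))
    shared_ba = x25519(b, x25519(a, BASE))
    print(shared_ab.hex(), shared_ab == shared_ba)
```

Feeding the simulation's scalar and u-coordinate through `x25519` and comparing against both the testbench's expected value and the hardware's answer gives a third opinion on which side is wrong.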