@roytam1@miniwa.moe @ellenor2000@mastodon.top Hand-tuning can potentially make it faster. There's a large body of published literature on optimizing ChaCha20 for embedded systems, and much of it could probably be applied back to the 386. I remember reading about an optimized version of Karatsuba multiplication that uses the fewest steps.