This instruction:
mov [rDest + <index>], ch
under these conditions, on a slightly overclocked machine that has "warmed up", seems to have roughly a 1-in-10,000 chance of storing the contents of CL to memory instead of CH.
(This was "fun" to debug.)
The workaround: when we detect Raptor Lake CPUs, we now do
shr ecx, 8
mov [rDest + <index>], cl
instead. This costs more front-end and uop bandwidth, but the loop is mainly latency-limited, and this store sequence is off the critical path.
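
To make the register arithmetic concrete, here is a minimal C sketch of why the two sequences store the same byte, plus one possible way to recognize the affected CPUs. All function names are hypothetical. The detection helper is an assumption, not the check actually used here: it decodes family and model from the EAX value of CPUID leaf 1 and compares against the model numbers commonly listed for Raptor Lake (0xB7, 0xBA, 0xBF).

```c
#include <stdint.h>

/* Low byte of ecx, i.e. what the CL register holds. */
static uint8_t reg_cl(uint32_t ecx)
{
    return (uint8_t)(ecx & 0xFF);
}

/* Bits 8..15 of ecx, i.e. what the CH register holds.
   This is the byte `mov [rDest + <index>], ch` is supposed to store. */
static uint8_t reg_ch(uint32_t ecx)
{
    return (uint8_t)((ecx >> 8) & 0xFF);
}

/* Models the workaround: `shr ecx, 8`, then store from CL.
   Returns the byte that gets stored. Note that ecx is modified,
   so the old low byte is no longer available afterwards. */
static uint8_t workaround_store_value(uint32_t *ecx)
{
    *ecx >>= 8;
    return reg_cl(*ecx);
}

/* Hypothetical detection helper (an assumption, not the actual check):
   takes EAX from CPUID leaf 1 and reports whether the display
   family/model matches the assumed Raptor Lake model numbers. */
static int is_raptor_lake(uint32_t cpuid1_eax)
{
    uint32_t family     = (cpuid1_eax >> 8) & 0xF;
    uint32_t model      = (cpuid1_eax >> 4) & 0xF;
    uint32_t ext_model  = (cpuid1_eax >> 16) & 0xF;
    uint32_t ext_family = (cpuid1_eax >> 20) & 0xFF;

    /* Standard display family/model rules for CPUID leaf 1. */
    if (family == 0x6 || family == 0xF)
        model |= ext_model << 4;
    if (family == 0xF)
        family += ext_family;

    /* Assumed model list for Raptor Lake parts. */
    return family == 6 &&
           (model == 0xB7 || model == 0xBA || model == 0xBF);
}
```

With ecx = 0x12345678, CH is 0x56 and CL is 0x78, so the faulty behavior would write 0x78 where 0x56 was intended. After the shift, ecx is 0x00123456 and CL holds 0x56, which is why storing CL then produces the intended byte.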