@camelcdr With a patch I am working on, aarch64 can vectorize this, but only with -fno-vect-cost-model. The generated code is bad. It looks like GCC does not realize it could unroll the loop 4x to get reasonable code generation (or, with my patch, just 2x).