@camelcdr With a patch I am working on, aarch64 can vectorize this, but only with -fno-vect-cost-model. The generated code is bad. It looks like GCC does not realize it could unroll the loop 4x to get reasonable code generation (or, with my patch, just 2x).
pinskia (pinskia@hachyderm.io)'s status on Tuesday, 19-Nov-2024 11:33:10 JST
pinskia (pinskia@hachyderm.io)'s status on Tuesday, 19-Nov-2024 11:50:47 JST
@camelcdr Though, looking at it again, ARMv9-a's SVE should be able to optimize it in a reasonable fashion, I think. But neither GCC nor LLVM is able to handle it with SVE either.