Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication
I’ve implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance improvements with newer instruction sets (AVX512 > AVX2 > SSE), the actual performance difference is surprisingly small.