How can I optimize this “dot product” function using SIMD? It’s Mat4x4 * Vec4, but with large-stride memory access
I’m trying to get the best possible speedup for this function, but I can’t write SIMD code that beats the compiler’s auto-vectorizer. I want a hand-written SIMD version that is faster than the auto-vectorized one, and I’m completely stuck right now: