What is the most efficient way to pack elements in AVX?
Suppose we have 16 × 256 bits of data in 16 __m256i registers, ix[16], each holding 32 8-bit elements, and I want to pack them in the following way:
data details:
ix[0] : a0, a1, a2, … , a30, a31
ix[1] : b0, b1, b2, … , b30, b31
…
ix[15] : z0, z1, z2, … , z30, z31
expected result, in consecutive memory:
a0, b0, c0, …, z0, a1, b1, c1, …, z1, a2, …, a31, …, z31.
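To pin down the layout precisely, here is a minimal scalar sketch of the result I am after. It is only a reference to check a faster version against, not the efficient solution I am asking for. The names `pack_reference`, `NUM_REGS` and `BYTES_PER_REG` are made up for illustration, and it assumes the 16 registers have first been stored back to back into a 512-byte buffer (for example with `_mm256_storeu_si256`), so that byte j of ix[i] sits at `in[i * 32 + j]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_REGS = 16, BYTES_PER_REG = 32 };

/* Scalar reference for the desired packing (hypothetical helper).
 * in  : the 16 registers stored consecutively, in[i * 32 + j] = byte j of ix[i]
 * out : 512 bytes in the order a0, b0, ..., z0, a1, b1, ..., z1, ..., z31
 * In other words, a 16x32 byte matrix is transposed into a 32x16 one. */
static void pack_reference(const uint8_t *in, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t j = 0; j < BYTES_PER_REG; ++j) {     /* element index within a register */
        for (size_t i = 0; i < NUM_REGS; ++i) {      /* register index */
            out[j * NUM_REGS + i] = in[i * BYTES_PER_REG + j];
        }
    }
}
```

So the first 16 output bytes are element 0 of ix[0] through ix[15], the next 16 are element 1 of each register, and so on.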
Apparently _mm256_extract_epi8 can do this, but I think there must be a more efficient way. If you have any ideas, please answer. Thanks a lot!