How to perform parallel addition with carry (overflow) using AVX?

I want to perform eight parallel additions of 16-bit values using AVX SIMD. The addition must handle overflow, i.e. ‘add with carry’, as the old x86 “adc” mnemonic does.

I have implemented about half of the AVX solution myself; what is missing is carry handling that is also performed by AVX instructions. My current solution:

```c
#include <emmintrin.h>  /* __m128i */
#include <stdint.h>
#include <stdio.h>

typedef union _uint16vec uint16vec, *uint16vec_ptr;

union __attribute__((aligned(16))) _uint16vec
{
  __m128i   x;
  uint16_t  y[8];
};

__m128i parallel_add_with_carry ( __m128i n1, __m128i n2 )
{
  volatile uint16vec  res, stupid_carry;
  uint32_t            i;

  stupid_carry.x = n1; /* load 8x uint16_t for carry adjustment below */

  __asm__
  (
    "movdqa %1, %%xmm0             \n\t"
    "movdqa %2, %%xmm1             \n\t"
    "paddw  %%xmm1, %%xmm0         \n\t"  /* eight 16-bit adds, wrapping modulo 2^16 */
    "movdqa %%xmm0, %0             \n\t"
    : "=m" (res.x)                      /* output */
    : "m" (n1), "m" (n2)                /* inputs */
    : "xmm0", "xmm1"                    /* GCC, please clobber XMM0 and XMM1 */
  );

  /* if any of the eight uint16_t in the result is less than
   * the previous value, then we have the overflow situation...
   */
  for (i = 0; i < 8; i++)
    res.y[i] += (res.y[i] < stupid_carry.y[i]) ? 1 : 0;

  return res.x;
}

void test ( void )
{
  uint16vec   v1 = {0}, v2 = {0}, res;

  v1.y[0] = 0x000A; v2.y[0] = 0x0014; /* 10+20 = 30 (0x001E), no overflow */
  v1.y[1] = 0xFFF0; v2.y[1] = 0x0013; /* 0xFFF0 + 0x0013 = 0x0003 -> overflow -> 0x0004 */

  res.x = parallel_add_with_carry(v1.x, v2.x);

  fprintf(stdout, "%04X | %04X\n", res.y[0], res.y[1]);
}
```

The object code GCC emits for the function’s epilogue is terrible, even with -O3. Is there a better, AVX-supported solution to the ‘add with carry’ problem?

• Could the ‘vpcmp<CC>uw’ compare instruction with <CC>=LT (less than) help?
• How can I use the ‘K’ mask registers (K0..K7) for that?
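To make the two questions above concrete, here is a minimal sketch of how a mask-register version might look with intrinsics. It rests on several assumptions: GCC or Clang on x86-64, and a CPU with AVX-512BW and AVX-512VL, since the mask registers and the unsigned word compare belong to AVX-512 rather than to AVX or AVX2. All function names are hypothetical. `_mm_cmplt_epu16_mask` is the intrinsic form of the unsigned 16-bit compare with the LT predicate, and `_mm_mask_add_epi16` adds only in lanes whose mask bit is set, copying the first argument elsewhere. A scalar fallback with the same per-lane arithmetic as the loop in the question is chosen at run time when the CPU lacks these extensions.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper; only valid on CPUs with AVX-512BW and AVX-512VL. */
static __attribute__((target("avx512bw,avx512vl")))
__m128i add16_carry_avx512_sketch(__m128i a, __m128i b)
{
    __m128i  sum   = _mm_add_epi16(a, b);           /* wraps modulo 2^16 */
    __mmask8 carry = _mm_cmplt_epu16_mask(sum, a);  /* bit i set where sum[i] < a[i] */
    /* Add 1 only in lanes flagged by the mask; other lanes keep sum. */
    return _mm_mask_add_epi16(sum, carry, sum, _mm_set1_epi16(1));
}

/* Portable fallback: same per-lane arithmetic as the loop in the question. */
static __m128i add16_carry_scalar(__m128i a, __m128i b)
{
    uint16_t x[8], y[8];
    _mm_storeu_si128((__m128i *)x, a);
    _mm_storeu_si128((__m128i *)y, b);
    for (int i = 0; i < 8; i++) {
        uint16_t s = (uint16_t)(x[i] + y[i]);
        x[i] = (uint16_t)(s + (s < x[i]));          /* +1 if the add wrapped */
    }
    return _mm_loadu_si128((const __m128i *)x);
}

/* Hypothetical dispatcher: picks the AVX-512 path only if the CPU has it. */
static __m128i add16_carry_dispatch(__m128i a, __m128i b)
{
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl"))
        return add16_carry_avx512_sketch(a, b);
    return add16_carry_scalar(a, b);
}
```

Calling `add16_carry_dispatch(v1.x, v2.x)` with the test values from the question should give 0x001E in lane 0 and 0x0004 in lane 1 on either path.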

My idea was to provide a 128-bit vector { 0x0001,0x0001,…,0x0001} as a temporary variable and to add it (the carry vector) to the eight uint16_t values if and only if the preceding compare operation resulted in ‘less than’ for the corresponding uint16_t in the 128-bit vector.
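The compare-then-add idea can also be sketched without mask registers or inline assembly, using only SSE2 intrinsics (which compilers encode as VEX/AVX instructions when building with -mavx). This is a sketch under assumptions, not a tuned implementation: the function name is hypothetical, and instead of a separate vector of 0x0001 values it reuses the compare result directly, because a true lane is 0xFFFF (that is, -1) and subtracting -1 adds 1. SSE2 has only signed 16-bit compares, so both operands get their sign bit flipped first, which turns the unsigned comparison into a signed one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: lane-wise 16-bit add, plus 1 in every lane that wrapped. */
static __m128i add16_carry_sketch(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi16(INT16_MIN);  /* 0x8000 in every lane */
    __m128i sum = _mm_add_epi16(a, b);               /* wraps modulo 2^16 */

    /* Unsigned "sum < a" via a signed compare on sign-flipped operands:
       lanes become 0xFFFF where the add overflowed, 0x0000 elsewhere. */
    __m128i carry = _mm_cmpgt_epi16(_mm_xor_si128(a, sign),
                                    _mm_xor_si128(sum, sign));

    /* 0xFFFF is -1 as a 16-bit value, so subtracting the mask adds 1
       exactly in the overflowed lanes. */
    return _mm_sub_epi16(sum, carry);
}
```

With the values from the test above, lane 0 (0x000A + 0x0014) stays 0x001E, and lane 1 (0xFFF0 + 0x0013) wraps to 0x0003 and is then bumped to 0x0004.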

I browsed the Intel documentation and found nice add instructions that just copy source parts of the vector if a condition is met.

Any help with this ‘AVX thing’ would be highly appreciated. Thanks.
