I want to perform eight parallel additions of 16-bit values using AVX SIMD. Addition **with overflow** is required, i.e. ‘add with carry’, as performed by the old x86 “adc” instruction.

I have implemented half of the AVX solution myself; the carry handling, which should also be done with AVX instructions, is still missing. My current solution:

```
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

typedef union _uint16vec uint16vec, *uint16vec_ptr;

union __attribute__((aligned(16))) _uint16vec
{
    __m128i  x;
    uint16_t y[8];
};

__m128i parallel_add_with_carry ( __m128i n1, __m128i n2 )
{
    volatile uint16vec res, stupid_carry;
    uint32_t i;

    stupid_carry.x = n1; /* load 8x uint16_t for carry adjustment below */

    __asm__
    (
        "movdqa %1, %%xmm0 \n\t"
        "movdqa %2, %%xmm1 \n\t"
        "vpaddw %%xmm0, %%xmm1, %%xmm0 \n\t"
        "movdqa %%xmm0, %0 \n\t"
        : "=m" (res.x)          /* output */
        : "m" (n1), "m" (n2)    /* inputs */
        : "xmm0", "xmm1"        /* tell GCC that XMM0 and XMM1 are clobbered */
    );

    /* if any of the eight uint16_t in the result is less than
     * the previous value, then we have the overflow situation...
     */
    for (i = 0; i < 8; i++)
        res.y[i] += (res.y[i] < stupid_carry.y[i]) ? 1 : 0;

    return res.x;
}

void test ( void )
{
    uint16vec v1 = {0}, v2 = {0}, res;

    v1.y[0] = 0x000A; v2.y[0] = 0x0014; /* 10+20 = 30 (0x001E), no overflow */
    v1.y[1] = 0xFFF0; v2.y[1] = 0x0013; /* 0xFFF0 + 0x0013 = 0x0003 -> overflow -> 0x0004 */

    res.x = parallel_add_with_carry(v1.x, v2.x);

    fprintf(stdout, "%04X | %04X\n", res.y[0], res.y[1]);
}

int main ( void )
{
    test();
    return 0;
}
```

The object code GCC emits for the function’s epilogue is terrible (even with -O3). Is there a better AVX-based solution to the ‘add with carry’ problem?

- Might the ‘vpcmp<CC>uw’ compare instruction with <CC>=LT (less than) help?
- How can I use these ‘K’ (mask) registers (K0..K7) for that?

My idea was to keep a 128-bit vector {0x0001, 0x0001, …, 0x0001} as a temporary (the carry vector) and add it to the eight uint16_t values, but only in those lanes where the preceding compare reported ‘less than’ for that specific uint16_t.
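To make this idea concrete, here is a minimal sketch using plain SSE2 intrinsics instead of inline assembly (as far as I know, GCC emits the VEX-encoded forms such as `vpaddw` when compiling with `-mavx`). The function names `add16_carry_wrap` and `add16_carry_wrap_arrays` are hypothetical. It assumes there is no unsigned 16-bit compare at this instruction-set level, so the sign bits are flipped and a signed compare is used instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: lane-wise 16-bit add; each lane that wrapped
 * around gets 1 added, like the scalar fix-up loop in the question. */
static __m128i add16_carry_wrap(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi16((short)0x8000);
    const __m128i ones = _mm_set1_epi16(1);   /* the carry vector */

    __m128i sum = _mm_add_epi16(a, b);

    /* Unsigned "sum < a" is emulated by flipping the sign bit of both
     * operands and comparing signed. True lanes become 0xFFFF. */
    __m128i lt = _mm_cmpgt_epi16(_mm_xor_si128(a, sign),
                                 _mm_xor_si128(sum, sign));

    /* Keep a 1 only in the lanes that overflowed, then add it. */
    __m128i carry = _mm_and_si128(lt, ones);
    return _mm_add_epi16(sum, carry);
}

/* Hypothetical convenience wrapper for plain arrays (unaligned access). */
static void add16_carry_wrap_arrays(const uint16_t a[8],
                                    const uint16_t b[8],
                                    uint16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, add16_carry_wrap(va, vb));
}
```

Since a true lane of the compare result is 0xFFFF, i.e. -1, the AND and the second add could probably be replaced by a single `_mm_sub_epi16(sum, lt)`. I kept the explicit carry vector here because it matches the idea as described.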

I browsed the Intel documentation and found nice add instructions that, depending on a per-element condition, either add or just copy the corresponding source element.
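A rough sketch of how the unsigned compare, a mask register and such a conditional add might fit together. This assumes a CPU with AVX-512BW and AVX-512VL and a GCC or Clang compiler, because it relies on the `target` function attribute and `__builtin_cpu_supports`. All function names are hypothetical, and the scalar fallback only exists so the sketch still runs on machines without AVX-512.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: AVX-512BW + AVX-512VL are available when this is called. */
__attribute__((target("avx512bw,avx512vl")))
static void add16_carry_wrap_avx512(const uint16_t a[8],
                                    const uint16_t b[8],
                                    uint16_t out[8])
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = _mm_add_epi16(va, vb);

    /* Unsigned less-than compare; the result is one bit per lane
     * held in a mask (K) register. */
    __mmask8 k = _mm_cmplt_epu16_mask(sum, va);

    /* Masked add: lanes with a set mask bit get sum + 1,
     * all other lanes are copied from the first argument (sum). */
    __m128i res = _mm_mask_add_epi16(sum, k, sum, _mm_set1_epi16(1));

    _mm_storeu_si128((__m128i *)out, res);
}

/* Portable reference version, same behaviour lane by lane. */
static void add16_carry_wrap_scalar(const uint16_t a[8],
                                    const uint16_t b[8],
                                    uint16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        uint16_t sum = (uint16_t)(a[i] + b[i]);
        out[i] = (uint16_t)(sum + (sum < a[i] ? 1 : 0));
    }
}

/* Pick the AVX-512 path only if the running CPU reports support. */
static void add16_carry_wrap_dispatch(const uint16_t a[8],
                                      const uint16_t b[8],
                                      uint16_t out[8])
{
    if (__builtin_cpu_supports("avx512bw") &&
        __builtin_cpu_supports("avx512vl"))
        add16_carry_wrap_avx512(a, b, out);
    else
        add16_carry_wrap_scalar(a, b, out);
}
```

As far as I understand it, the mask registers and the unsigned word compare belong to AVX-512 rather than to AVX or AVX2, so this path would not help on CPUs that only support AVX.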

Support with this ‘AVX thing’ is highly appreciated. Thanks.