How do you get access to machine-level atomic increments?

On modern machines, I expected an atomic increment with relaxed ordering to be about as fast as a normal increment. However, AtomicUsize::fetch_add(1, Ordering::Relaxed) seems to be 5-10x slower than a non-atomic increment.

I ran a test with AtomicUsize on both a Mac ARM-1 and an AWS t3.large. I believe both machines have instructions for atomic increments.
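
A comparison along these lines could be sketched as follows. This is only a minimal single-threaded illustration, not the exact benchmark behind the numbers above: the function names `atomic_increments` and `plain_increments` and the iteration count are assumptions made for the example. `std::hint::black_box` is used so the compiler does not collapse the non-atomic loop into a single addition.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

/// Increment an atomic counter `n` times with relaxed ordering.
/// Returns the final count and the elapsed time.
/// (Hypothetical helper, named for this example only.)
fn atomic_increments(n: usize) -> (usize, Duration) {
    let counter = AtomicUsize::new(0);
    let start = Instant::now();
    for _ in 0..n {
        counter.fetch_add(1, Ordering::Relaxed);
    }
    let elapsed = start.elapsed();
    (counter.load(Ordering::Relaxed), elapsed)
}

/// Increment a plain (non-atomic) counter `n` times.
/// `black_box` keeps the optimizer from folding the loop away.
/// (Hypothetical helper, named for this example only.)
fn plain_increments(n: usize) -> (usize, Duration) {
    let mut counter: usize = 0;
    let start = Instant::now();
    for _ in 0..n {
        counter = black_box(counter + 1);
    }
    let elapsed = start.elapsed();
    (counter, elapsed)
}

fn main() {
    // Iteration count chosen arbitrarily for illustration.
    let n = 10_000_000;

    let (atomic_total, atomic_time) = atomic_increments(n);
    let (plain_total, plain_time) = plain_increments(n);

    println!("atomic: {} increments in {:?}", atomic_total, atomic_time);
    println!("plain:  {} increments in {:?}", plain_total, plain_time);
}
```

Timings from a sketch like this are only meaningful in an optimized build (for example `cargo run --release`), and the ratio will vary with the CPU and compiler version.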


