Why does a for-loop copy not achieve peak CPU-RAM bandwidth on one core?
I would expect copying an array with a simple for loop to achieve my machine’s peak memory bandwidth, but it does not. I ran the following example code with a 3 GB input, making sure that it did not swap. It achieved 13 GB/s (10 runs, standard deviation < 1 GB/s).