How can the rep stosb instruction execute faster than the equivalent loop? - performance

How can the instruction rep stosb execute faster than this code?
Clear: mov byte [edi],AL ; Write the value in AL to memory
inc edi ; Bump EDI to next byte in the buffer
dec ecx ; Decrement ECX by one position
jnz Clear ; And loop again until ECX is 0
Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?

In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.
(Note this only applies to stos and movs, not repe cmpsb or repne scasb. Those are still slow, unfortunately: at best about 2 cycles per byte on Skylake, which is pathetic compared to AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.
See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)
rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets, and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).
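For example, a minimal GNU C inline-asm sketch of that approach (bulk with rep stosq, 0-7 byte tail with rep stosb; the function name and structure are illustrative only, not tuned code):
#include <stddef.h>
#include <stdint.h>

static void memset_rep_stosq(void *dst, unsigned char val, size_t n)
{
    uint64_t v = 0x0101010101010101ULL * val;   /* replicate the byte into a qword */
    void *d = dst;
    size_t qwords = n / 8, tail = n % 8;
    /* rep stosq: RAX = pattern, RDI = destination (advanced), RCX = count (counted down) */
    asm volatile ("rep stosq" : "+D" (d), "+c" (qwords) : "a" (v) : "memory");
    /* rep stosb for the remaining 0-7 bytes; AL is the low byte of the same pattern */
    asm volatile ("rep stosb" : "+D" (d), "+c" (tail) : "a" (v) : "memory");
}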
When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.
See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.
See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)
A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition was a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)
One major feature of the so-called Fast String operations (rep movs and rep stos) on Intel P6 and SnB-family CPUs is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)
IDK how good AMD's implementation is.
(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)
See the optimization resources (esp. Agner Fog's guides) linked from the x86 tag wiki.
Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out of order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op itself, not between it and other stores; i.e. you still don't need sfence after rep movs.
So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.
For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.
See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).

If your CPU has the CPUID ERMSB bit, then the rep movsb and rep stosb instructions are executed differently than on older processors.
See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).
Both the manual and my tests show that the benefits of rep stosb compared to generic 32-bit register moves, on a CPU of the Skylake microarchitecture, appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup cost of rep stosb is very high - about 35 cycles. However, this speed difference has diminished on the Ice Lake microarchitecture launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested with a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; under 32-bit, strings have to be at least 4KB for rep movsb to start outperforming other methods.
To get the benefits of rep stosb on processors with the CPUID ERMSB bit, the following conditions should be met:
the destination buffer has to be aligned to a 16-byte boundary;
if the length is a multiple of 64, it can produce even higher performance;
the direction bit should be set "forward" (set by the cld instruction).
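The ERMSB and FSRM feature bits can be checked from C before choosing a code path; a minimal sketch with GCC/Clang's <cpuid.h> follows, assuming the usual bit positions (CPUID leaf 7, subleaf 0: EBX bit 9 for ERMSB, EDX bit 4 for FSRM):
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 7 not supported");
        return 1;
    }
    int ermsb = (ebx >> 9) & 1;   /* Enhanced REP MOVSB/STOSB */
    int fsrm  = (edx >> 4) & 1;   /* Fast Short REP MOV (Ice Lake and later) */
    printf("ERMSB: %d  FSRM: %d\n", ermsb, fsrm);
    return 0;
}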
According to the Intel Optimization Manual, on Skylake ERMSB begins to outperform memory stores via a regular register when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a high internal startup cost - about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and is not necessarily the case for other CPU microarchitectures.
On some processors, but not on others, when the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches, i.e., than using MMX or SSE registers. When the destination buffer is misaligned, memset() performance using ERMSB can degrade about 20% relative to the aligned case on processors based on the Intel microarchitecture code-named Ivy Bridge. In contrast, a SIMD implementation of memset will experience only negligible degradation when the destination is misaligned, according to Intel's optimization manual.
Benchmarks
I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2 or L3, depending on the size of the buffer). The number of iterations was chosen so that the total execution time was about two seconds.
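For illustration, a minimal harness of this kind could look like the following (a simplified sketch, not the exact benchmark code; the block size and iteration count correspond to the 32K row in the first table below, and the compiler barrier keeps the fills from being optimized away):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t block = 32 * 1024;               /* block size under test, varied per run */
    uint64_t iters = 3168750;               /* chosen so the run lasts about two seconds */
    void *buf = aligned_alloc(64, block);   /* same destination, so it stays in cache */
    double t0 = now_sec();
    for (uint64_t i = 0; i < iters; i++) {
        memset(buf, 0, block);              /* swap in the memset variant being measured */
        asm volatile ("" ::: "memory");     /* keep the compiler from eliding the fill */
    }
    double dt = now_sec() - t0;
    printf("%llu blocks of %zu bytes: %.4f secs %.4f Megabytes/sec\n",
           (unsigned long long)iters, block, dt,
           (double)block * iters / dt / (1024.0 * 1024.0));
    free(buf);
    return 0;
}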
Skylake
On an Intel Core i5-6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz max turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6MB L3 cache, I could obtain ~100 GB/sec on REP STOSB with 32K blocks.
The memset() implementation that uses REP STOSB:
1297920000 blocks of 16 bytes: 13.6022 secs 1455.9909 Megabytes/sec
0648960000 blocks of 32 bytes: 06.7840 secs 2919.3058 Megabytes/sec
1622400000 blocks of 64 bytes: 16.9762 secs 5833.0883 Megabytes/sec
817587402 blocks of 127 bytes: 8.5698 secs 11554.8914 Megabytes/sec
811200000 blocks of 128 bytes: 8.5197 secs 11622.9306 Megabytes/sec
804911628 blocks of 129 bytes: 9.1513 secs 10820.6427 Megabytes/sec
407190588 blocks of 255 bytes: 5.4656 secs 18117.7029 Megabytes/sec
405600000 blocks of 256 bytes: 5.0314 secs 19681.1544 Megabytes/sec
202800000 blocks of 512 bytes: 2.7403 secs 36135.8273 Megabytes/sec
101400000 blocks of 1024 bytes: 1.6704 secs 59279.5229 Megabytes/sec
3168750 blocks of 32768 bytes: 0.9525 secs 103957.8488 Megabytes/sec (!), i.e., 103 GB/s
2028000 blocks of 51200 bytes: 1.5321 secs 64633.5697 Megabytes/sec
413878 blocks of 250880 bytes: 1.7737 secs 55828.1341 Megabytes/sec
19805 blocks of 5242880 bytes: 2.6009 secs 38073.0694 Megabytes/sec
The memset() implementation that uses MOVDQA [RCX],XMM0:
1297920000 blocks of 16 bytes: 3.5795 secs 5532.7798 Megabytes/sec
0648960000 blocks of 32 bytes: 5.5538 secs 3565.9727 Megabytes/sec
1622400000 blocks of 64 bytes: 15.7489 secs 6287.6436 Megabytes/sec
817587402 blocks of 127 bytes: 9.6637 secs 10246.9173 Megabytes/sec
811200000 blocks of 128 bytes: 9.6236 secs 10289.6215 Megabytes/sec
804911628 blocks of 129 bytes: 9.4852 secs 10439.7473 Megabytes/sec
407190588 blocks of 255 bytes: 6.6156 secs 14968.1754 Megabytes/sec
405600000 blocks of 256 bytes: 6.6437 secs 14904.9230 Megabytes/sec
202800000 blocks of 512 bytes: 5.0695 secs 19533.2299 Megabytes/sec
101400000 blocks of 1024 bytes: 4.3506 secs 22761.0460 Megabytes/sec
3168750 blocks of 32768 bytes: 3.7269 secs 26569.8145 Megabytes/sec (!) i.e., 26 GB/s
2028000 blocks of 51200 bytes: 4.0538 secs 24427.4096 Megabytes/sec
413878 blocks of 250880 bytes: 3.9936 secs 24795.5548 Megabytes/sec
19805 blocks of 5242880 bytes: 4.5892 secs 21577.7860 Megabytes/sec
Please note that a drawback of using the XMM0 register is that it is 128 bits (16 bytes), while I could have used the YMM0 register of 256 bits (32 bytes). Anyway, stosb uses the non-RFO protocol. Intel x86 CPUs have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger and implemented them with 64-bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.
Anyway, even if you compare just the two methods that I have provided, and even though the second method is far from ideal, as you can see, on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks rep stosb begins to outperform the other method, and the difference is very significant starting from 512-byte blocks and longer, provided that you are clearing the same memory block again and again within the cache.
Therefore, for REP STOSB, the maximum speed was 103957 (one hundred three thousand nine hundred fifty-seven) Megabytes per second, while with MOVDQA [RCX],XMM0 it was just 26569 (twenty-six thousand five hundred sixty-nine) Megabytes per second.
As you see, the highest performance was on 32K blocks, which matches the 32K L1 data cache of the CPU on which I ran the benchmarks.
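For reference, minimal C sketches of the two kinds of fill kernels compared above (simplified: the real code handles tails and misalignment; the MOVDQA variant here assumes a 16-byte aligned destination and a size that is a multiple of 16):
#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

/* rep stosb kernel: AL = fill byte, RDI = destination, RCX = byte count */
static void fill_rep_stosb(void *dst, unsigned char val, size_t n)
{
    asm volatile ("rep stosb" : "+D" (dst), "+c" (n) : "a" (val) : "memory");
}

/* 16-byte store kernel in the spirit of MOVDQA [RCX],XMM0 */
static void fill_movdqa(void *dst, unsigned char val, size_t n)
{
    __m128i v = _mm_set1_epi8((char)val);
    char *p = (char *)dst;
    for (size_t i = 0; i < n; i += 16)
        _mm_store_si128((__m128i *)(p + i), v);
}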
Ice Lake
REP STOSB vs AVX-512 store
I have also done tests on an Intel i7-1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), with a base frequency of 1.3 GHz and a max turbo frequency of 3.90 GHz. It supports the AVX512F instruction set. It has 4 x 32K L1 instruction cache, 4 x 48K L1 data cache, 4 x 512K L2 cache and 8 MB L3 cache.
Destination alignment
On 32K blocks zeroized by rep stosb, performance started at 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s for a destination aligned to 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/s for destinations aligned to 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed did not rise further, even if the destination was aligned to 32KB (e.g. $7FF4FDD00000), and was still 224850 MB/s.
There was no difference in speed between rep stosb and rep stosq.
On buffers aligned by 32K, the speed of AVX-512 store was exactly the same as for rep stosb, for loops starting from 2 stores in a loop (227777 MB/sec) and didn't grow for loops unrolled for 4 and even 16 stores. However, for a loop of just 1 store the speed was a little bit lower - 203145 MB/sec.
However, if the destination buffer was misaligned by just 1 byte, the speed of AVX512 store dropped dramatically, i.e. more than 2 times, to 93811 MB/sec, in contrast to rep stosb on similar buffers, which gave 175231 MB/sec.
Buffer Size
For 1K (1024-byte) blocks, AVX-512 (205039 MB/s) was 3 times faster than rep stosb (71817 MB/s).
And for 512-byte blocks, AVX-512 performance was about the same as for larger blocks (194181 MB/s), while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favor of AVX-512.
For 2K (2048-byte) blocks, AVX-512 had 210696 MB/s, while rep stosb had 123207 MB/s, almost twice as slow. Again, there was no difference between rep stosb and rep stosq.
For 4K (4096) blocks, AVX-512 had 225179 MB/s, while rep stosb: 180384 MB/s, almost catching up.
For 8K (8192) blocks, AVX-512 had 222259 MB/s, while rep stosb: 194358 MB/s, close!
For 32K (32768-byte) blocks, AVX-512 had 228432 MB/s, rep stosb: 220515 MB/s - now, at last! We are approaching the L1 data cache size of my CPU - 48 KB! This is 220 Gigabytes per second!
For 64K (65536) blocks, AVX-512 had 61405 MB/s, rep stosb: 70395 MB/s!
Such a huge drop when we run out of the L1 data cache! And it was evident that, from this point on, rep stosb begins to outperform AVX-512 stores.
Now let's check the L2 cache size (512K per core). For 512K blocks, AVX-512 made 62907 MB/s and rep stosb made 70653 MB/s. That's where rep stosb begins to outperform AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
Now let's take a huge buffer of 1GB (1073741824 bytes). With AVX-512, the speed was 14319 MB/s; with rep stosb it was 27412 MB/s, i.e. twice as fast as AVX-512!
I've also tried to use non-temporal instructions for filling the 32K buffers, vmovntdq [rcx], zmm31, but the performance was about 4 times slower than with plain vmovdqa64 [rcx], zmm31. How can I take advantage of vmovntdq when filling memory buffers? Does the buffer have to be of some specific size for vmovntdq to be worthwhile?
Also, if the destination buffers are aligned to at least 64 bytes, there is no performance difference between vmovdqa64 and vmovdqu64. Therefore, I do have a question: is the instruction vmovdqa64 only needed for debugging and safety when we have vmovdqu64?
Figure 1: Speed of iterative store to the same buffer, MB/s
block AVX stosb
----- ----- ------
0.5K 194181 38682
1K 205039 71817
2K 210696 123207
4K 225179 180384
8K 222259 194358
32K 228432 220515
64K 61405 70395
512K 62907 70653
1G 14319 27412
Summary of the performance of repeatedly clearing the same memory block within the cache
rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only when repeatedly clearing the same memory buffer larger than the L1 data cache, i.e. 48K on the Intel i7-1065G7 CPU. On small memory buffers, AVX-512 stores are much faster: 3 times faster for 1KB, and 5 times faster for 512 bytes.
However, the AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.
Therefore, I have found that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, i.e. 48 KB in the case of the Intel i7-1065G7 CPU. This conclusion is valid at least on Ice Lake CPUs. The earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2KB buffers should also be re-tested for newer microarchitectures.
Clearing different memory buffers, each only once
My previous benchmarks filled the same buffer many times in a row. A better benchmark might be to allocate many different buffers and fill each buffer only once, so as not to interfere with the cache.
In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only caveat is that the total data must stay well below the physical memory limit; under Windows 10 64-bit, with 16 GB of total physical RAM, the following benchmarks kept the total data size below 8 GB. When I allocated about 12 GB, performance dropped about 20 times, regardless of the method: Windows began to discard purged memory pages and probably did other work as memory was about to run out. The 8MB L3 cache of the i7-1065G7 did not seem to matter for these benchmarks at all. All that matters is that you don't come close to the physical memory limit, and how that is handled depends on your operating system. As I said, under Windows 10, taking half of physical memory was fine, but taking 3/4 of the available memory slowed my benchmark about 20 times; I didn't even try to take more than that. The total memory size is 16 GB, of which about 12 GB was available according to the task manager.
Here is the benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean "AVX-512" normal stores, and by "stosb" I mean "rep stosb".
Figure 2: Speed of store to the multiple buffers, once each, MB/s
block AVX stosb
----- ---- ----
0.5K 3641 2759
1K 4709 3963
2K 12133 13163
4K 8239 10295
8K 3534 4675
16K 3396 3242
32K 3738 3581
64K 2953 3006
128K 3150 2857
256K 3773 3914
512K 3204 3680
1024K 3897 4593
2048K 4379 3234
4096K 3568 4970
8192K 4477 5339
Conclusion on clearing memory that is not in the cache
If your memory is not already in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.
Using non-temporal stores to clear memory that is not in the cache
I was zeroizing 6-10 GB of memory, split into a sequence of buffers aligned to 64 bytes. No buffer was zeroized twice. Smaller buffers had some overhead, and since I had only 16 GB of physical memory, I zeroized less memory in total with smaller buffers. I used various tests for buffer sizes from 256 bytes up to 8 GB per buffer.
I took 3 different methods (the first two are sketched in C after the list):
Normal AVX-512 store by vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
Non-temporal AVX-512 store by vmovntdq [rcx+imm], zmm31 (same loop of 4 stores);
rep stosb.
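A sketch of what the first two kernels boil down to, using AVX-512 intrinsics rather than the hand-written assembly (vmovdqa64 roughly corresponds to _mm512_store_si512 and vmovntdq to _mm512_stream_si512; this assumes a 64-byte aligned destination and a size that is a multiple of 256, i.e. four stores per loop as in the original; the third method is the rep stosb wrapper shown earlier):
#include <immintrin.h>
#include <stddef.h>

static void zero_avx512_normal(void *dst, size_t n)
{
    __m512i z = _mm512_setzero_si512();
    char *p = (char *)dst;
    for (size_t i = 0; i < n; i += 256) {                  /* unrolled by 4 stores, as in the asm */
        _mm512_store_si512((__m512i *)(p + i),       z);
        _mm512_store_si512((__m512i *)(p + i +  64), z);
        _mm512_store_si512((__m512i *)(p + i + 128), z);
        _mm512_store_si512((__m512i *)(p + i + 192), z);
    }
}

static void zero_avx512_nontemporal(void *dst, size_t n)
{
    __m512i z = _mm512_setzero_si512();
    char *p = (char *)dst;
    for (size_t i = 0; i < n; i += 256) {
        _mm512_stream_si512((__m512i *)(p + i),       z);
        _mm512_stream_si512((__m512i *)(p + i +  64), z);
        _mm512_stream_si512((__m512i *)(p + i + 128), z);
        _mm512_stream_si512((__m512i *)(p + i + 192), z);
    }
    _mm_sfence();   /* NT stores are weakly ordered; fence before other code relies on the data */
}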
For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.
Then, from 256KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and from then on the situation didn't change: the winner was the non-temporal AVX-512 store, then came rep stosb, and then the normal AVX-512 store.
Figure 3. Speed of store to the multiple buffers, once each, MB/s by three different methods: normal AVX-512 store, nontemporal AVX-512 store and rep stosb.
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb
Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb
Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb
Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb
Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb
Avoiding AVX-SSE transition penalties
For all the AVX-512 code, I've used the ZMM31 register, because SSE registers go from 0 to 15, so the AVX-512 registers 16 to 31 do not have SSE counterparts and thus do not incur the transition penalty.

Related

Performance of my MPI code does not improve when I use two NUMA nodes (dual Xeon chips)

I have a computer, a Precision Tower 7810 with dual Xeon E5-2680 v3 @ 2.50GHz, 48 threads in total.
Here is result of $lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3300,0000
CPU min MHz: 1200,0000
BogoMIPS: 4988.40
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
My MPI code is based on basic MPI (Isend, Irecv, Wait, Bcast). Fundamentally, the data is distributed and sent to all processes. On each process, the data is used to calculate something and its value is changed. After that, the data on each process is exchanged among all processes. This work is repeated until a limit is reached.
Now, the main issue is that when I increase the number of processes within the limit of one chip (24 threads), performance improves. However, performance does not improve once the number of processes exceeds 24.
An example:
$mpiexec -n 6 ./mywork : 72s
$mpiexec -n 12 ./mywork : 46s
$mpiexec -n 24 ./mywork : 36s
$mpiexec -n 32 ./mywork : 36s
$mpiexec -n 48 ./mywork : 35s
I have tried both OpenMPI and MPICH, and the result is the same. So I think the issue is the physical connection between the two chips (the NUMA nodes). This is just my assumption; I have never used a real supercomputer. I hope someone knows about this issue and can help me. Thank you for reading.

Cache hits and misses on AVX-512 multicore but not single core

Following is the loop body of a NASM program (loop body means I am not showing the parts that instantiate cores and shared memory, read the input data, write the final results to file). This program is a shared object called from a C wrapper. Line numbers are shown for nine of the lines; they correspond to the line numbers referenced in the notes below.
mov rax,255
kmovq k7,rax
label_401:
cmp r11,r10
jge label_899
vmovupd zmm14,[r12+r11] ;[185]
add r11,r9 ; stride ;[186]
vmulpd zmm13,zmm14,zmm31 ; [196]
vmulpd zmm9,zmm14,zmm29 ; [207]
vmulpd zmm8,zmm13,zmm30
mov r8,1
Exponent_Label_0:
vmulpd zmm7,zmm29,zmm29
add r8,1
cmp r8,2 ;rdx
jl Exponent_Label_0
vmulpd zmm3,zmm7,zmm8
vsubpd zmm0,zmm9,zmm3
vmulpd zmm1,zmm0,zmm28
VCVTTPD2QQ zmm0{k7},zmm1 ; [240]
VCVTUQQ2PD zmm2{k7},zmm0 ; [241]
vsubpd zmm3,zmm1,zmm2
vmulpd zmm4,zmm3,zmm27 ; [243]
VCVTTPD2QQ zmm5{k7}{z},zmm4
VPCMPGTQ k2,zmm5,zmm26
VPCMPEQQ k3 {k7},zmm5,zmm26
KADDQ k1,k2,k3
VCVTQQ2PD zmm2{k7},zmm0 ; [252]
vmulpd zmm1{k7},zmm2,zmm25
vmovupd zmm2,zmm1
VADDPD zmm2{k1},zmm1,zmm25
vmovapd [r15+r14],zmm2 ; [266]
add r14,r9 ; stride
jmp label_401
The program uses AVX-512 register-to-register instructions exclusively between the data read at line 185 and the point where the final results are written to a shared memory buffer at line 266. I ran this with 1 core and with 4 cores, but the 4-core version is 2-3 times slower than the single core. I profiled it with Linux perf to understand why AVX-512 is 2-3x slower with multicore than with a single core.
The perf reports shown below were done by running all 65 PEBS counters with perf record / perf annotate -- to see results by source code line -- and perf stat to get the full count. Each perf record and perf stat counter was a separate run, and the results are aggregated by source code line, with the count from perf stat shown below each.
Each instruction is followed by the source code line number. For perf record instructions it shows the percentage of that counter attributable to the source line, and the total count of such instructions (from perf stat) in parentheses at the end of each line.
My main question is why we see cache hits and misses with multicore on AVX-512 instructions that are all register-to-register instructions, but not with the same instructions on single core. There should not be any cache hits or misses for an instruction that is entirely within registers. Each core has its own set of registers so I would not expect any cache activity where the instructions are all register-to-register. We see virtually no cache activity in all-register instructions when run with only a single core.
1. Line 186 - add r11,r9
mem_inst_retired.all_loads 75.00% (447119383)
mem_inst_retired.all_stores 86.36% (269650353)
mem_inst_retired.split_loads 71.43% (6588771)
mem_load_retired.l1_hit 57.14% (443561879)
Single core (line 177) - add r11,r9
mem_inst_retired.all_stores 24.00% (267231461)
This instruction (add r11,r9) adds two registers. When run with a single-core we don't see any cache hits/misses or memory loads, but with multicore we do. Why are there cache hits and memory load instructions here with multicore but not with a single core?
2. Line 196 - vmulpd zmm13,zmm14,zmm31
mem_inst_retired.split_loads 28.57% (6588771)
mem_load_retired.fb_hit 100.00% (8327967)
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 66.67% (11033416)
Single core (line 187) - vmulpd zmm13,zmm14,zmm31
mem_load_retired.fb_hit 187 100.00% (8889146)
This instruction (vmulpd zmm13,zmm14,zmm31) is all registers, but again it shows L1 hits and misses and split loads with multicore but not with a single core.
3. Line 207 - vmulpd zmm9,zmm14,zmm29
mem_load_retired.l1_hit 14.29% (443561879)
mem_load_retired.l1_miss 33.33% (11033416)
rs_events.empty_end 25.00% (37013411)
Single core (line 198):
mem_inst_retired.all_stores 24.00% (267231461)
mem_inst_retired.stlb_miss_stores 22.22%
This instruction (vmulpd zmm9,zmm14,zmm29) is the same instruction as the one described above it (vmulpd, all registers), but again it shows L1 hits and misses and split loads with multicore but not with a single core. The single core does show second-level TLB misses and store instructions retired, but no cache activity.
4. Line 240 - VCVTTPD2QQ zmm0{k7},zmm1
mem_inst_retired.all_loads 23.61% (447119383)
mem_inst_retired.split_loads 26.67% (6588771)
mem_load_l3_hit_retired.xsnp_hitm 28.07% (1089506)
mem_load_l3_hit_retired.xsnp_none 12.90% (1008914)
mem_load_l3_miss_retired.local_dram 40.00% (459610)
mem_load_retired.fb_hit 29.21% (8327967)
mem_load_retired.l1_miss 19.82% (11033416)
mem_load_retired.l2_hit 10.22% (12323435)
mem_load_retired.l2_miss 24.84% (2606069)
mem_load_retired.l3_hit 19.70% (700800)
mem_load_retired.l3_miss 21.05% (553670)
Single core line 231:
mem_load_retired.l1_hit 25.00% (429499496)
mem_load_retired.l3_hit 50.00% (306278)
This line (VCVTTPD2QQ zmm0{k7},zmm1) is register-to-register. The single core shows L1 and L3 activity, but the multicore has much more cache activity.
5. Line 241 - VCVTUQQ2PD zmm2{k7},zmm0
mem_load_l3_hit_retired.xsnp_hitm 21.05% (1089506)
mem_load_l3_miss_retired.local_dram 10.00% (459610)
mem_load_retired.fb_hit 10.89% (8327967)
mem_load_retired.l2_miss 13.07% (2606069)
mem_load_retired.l3_miss 10.53%
Single core line 232:
Single core has no cache hits or misses reported
mem_load_retired.l1_hit 12.50% (429499496)
All-register instruction (VCVTUQQ2PD zmm2{k7},zmm0) that shows a lot of cache activity with multicore but only a small number of L1 hits with single core (12.5%). I would not expect to see any cache hits/misses or load/store instructions with an all-register instruction.
6. Line 243 - vmulpd zmm4,zmm3,zmm27
br_inst_retired.all_branches_pebs 12.13% (311104072)
Single core line 234:
mem_load_l3_hit_retired.xsnp_none 100.00% (283620)
Why do we see branch instructions for an all-register mul instruction?
7. Line 252 - VCVTQQ2PD zmm2{k7},zmm0
br_inst_retired.all_branches_pebs 16.62% (311104072)
mem_inst_retired.all_stores 21.22% (269650353)
Single core line 243:
Single core also has branch instructions
br_inst_retired.all_branches_pebs 22.16% (290445009)
For a register-to-register instruction (VCVTQQ2PD zmm2{k7},zmm0), why do we see branch instructions? This instruction does not branch, nor is it preceded or followed by a branch.
8. Line 266 - vmovapd [r15+r14],zmm2
br_inst_retired.all_branches_pebs 43.56% (311104072)
mem_inst_retired.all_loads 48.67% (447119383)
mem_inst_retired.all_stores 43.09% (269650353)
mem_inst_retired.split_loads 41.30% (6588771)
mem_inst_retired.stlb_miss_loads 11.36% (487591)
mem_inst_retired.stlb_miss_stores 12.50% (440729)
mem_load_l3_hit_retired.xsnp_hitm 33.33% (1089506)
mem_load_l3_hit_retired.xsnp_none 56.45% (1008914)
mem_load_l3_miss_retired.local_dram 35.00% (459610)
mem_load_retired.fb_hit 39.60% (8327967)
mem_load_retired.l1_hit 48.75% (443561879)
mem_load_retired.l1_miss 51.65% (11033416)
mem_load_retired.l2_hit 71.51% (12323435)
mem_load_retired.l2_miss 45.10% (2606069)
mem_load_retired.l3_hit 59.09% (700800)
mem_load_retired.l3_miss 47.37% (553670)
Single core line 257:
mem_inst_retired.all_loads 84.86% (426023012)
mem_inst_retired.all_stores 59.28% (267231461)
mem_inst_retired.split_loads 89.92% (6477955)
mem_load_l3_miss_retired.local_dram 100.00% (372586)
mem_load_retired.fb_hit 92.80% (8889146)
mem_load_retired.l1_hit 54.17% (429499496)
mem_load_retired.l1_miss 91.30% (4170386)
mem_load_retired.l2_hit 100.00% (4564407)
mem_load_retired.l2_miss 100.00% (476024)
mem_load_retired.l3_hit 33.33% (306278)
This line (vmovapd [r15+r14],zmm2) may be the line most likely to affect the difference between single core and multicore. Here we transfer the final results to a memory buffer that is shared by all cores. Because there is memory movement, we expect to see cache activity with both multicore and single core. The single core uses a single buffer created with malloc. For multicore it's posix shared memory because that ran significantly faster than with an array created with malloc.
Both single core and multicore were run on an Intel Xeon Gold 6140 CPU @ 2.30GHz, which has two FMA units for AVX-512.
To summarize, my questions are: (1) why do we see cache activity on register-to-register instructions with AVX-512 multicore but not single core (except rare cases); and (2) is there any way to bypass cache entirely at vmovapd [r15+r14],zmm2 and go straight to memory to avoid cache misses? Posix shared memory was an improvement but that doesn't do it completely. Finally, are there any other reason(s) why AVX-512 would be so much slower with multicore than with a single core?
UPDATE: the access pattern for this code is dictated by AVX - the stride is (64 x number of cores) bytes. With 4 cores, core 0 begins at 0, reads and processes 64 bytes, then jumps by 256 (64x4); core 1 begins at 64, reads and processes 64 bytes, then jumps by 256, etc.
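In C-like terms, a simplified sketch of what each core does (the names and the use of intrinsics are mine for illustration; the real code is the NASM listing above):
#include <immintrin.h>
#include <stddef.h>

/* Core 'core' of 'ncores' starts at byte offset 64*core and advances by 64*ncores. */
static void process_stride(const double *in, double *out, size_t nbytes,
                           int core, int ncores)
{
    size_t stride = 64 * (size_t)ncores;
    for (size_t off = 64 * (size_t)core; off + 64 <= nbytes; off += stride) {
        __m512d v = _mm512_loadu_pd((const double *)((const char *)in + off)); /* one 64-byte chunk */
        /* ... the vmulpd/vsubpd/convert arithmetic from the listing goes here ... */
        _mm512_storeu_pd((double *)((char *)out + off), v);   /* store to the shared buffer */
    }
}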

Impala - out of memory exception. Slow queries

Can someone help me? I'm running a cluster of 5 Impala nodes for my API. I now get a lot of 'out of memory' exceptions when I run queries.
Failed to get minimum memory reservation of 3.94 MB on daemon r5c3s4.colo.vm:22000 for query 924d155863398f6b:c4a3470300000000 because it would exceed an applicable memory limit. Memory is likely oversubscribed. Reducing query concurrency or configuring admission control may help avoid this error. Memory usage:
, node=[r4c3s2]
Process: Limit=55.00 GB Total=49.79 GB Peak=49.92 GB, node=[r4c3s2]
Buffer Pool: Free Buffers: Total=0, node=[r4c3s2]
Buffer Pool: Clean Pages: Total=4.21 GB, node=[r4c3s2]
Buffer Pool: Unused Reservation: Total=-4.21 GB, node=[r4c3s2]
Free Disk IO Buffers: Total=1.19 GB Peak=1.52 GB, node=[r4c3s2]
However, it says only 23.83 GB of 150.00 GB are used. The queries have also become really slow. This problem occurred out of nowhere. Does anyone have an explanation for that?
Here is all the memory information I got from the "/memz?detailed=true" page of one node:
Memory Usage
Memory consumption / limit: 23.83 GB / 150.00 GB
Breakdown
Process: Limit=150.00 GB Total=23.83 GB Peak=58.75 GB
Buffer Pool: Free Buffers: Total=72.69 MB
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=-71.94 MB
Free Disk IO Buffers: Total=1.61 GB Peak=1.67 GB
RequestPool=root.default: Total=20.77 GB Peak=59.92 GB
Query(2647a4f63d37fdaa:690ad3b500000000): Reservation=20.67 GB ReservationLimit=120.00 GB OtherMemory=101.21 MB Total=20.77 GB Peak=20.77 GB
Unclaimed reservations: Reservation=71.94 MB OtherMemory=0 Total=71.94 MB Peak=139.94 MB
Fragment 2647a4f63d37fdaa:690ad3b50000001c: Reservation=0 OtherMemory=114.48 KB Total=114.48 KB Peak=855.48 KB
AGGREGATION_NODE (id=9): Total=102.12 KB Peak=102.12 KB
Exprs: Total=102.12 KB Peak=102.12 KB
EXCHANGE_NODE (id=8): Total=0 Peak=0
DataStreamRecvr: Total=0 Peak=0
DataStreamSender (dst_id=10): Total=872.00 B Peak=872.00 B
CodeGen: Total=3.50 KB Peak=744.50 KB
Fragment 2647a4f63d37fdaa:690ad3b500000014: Reservation=0 OtherMemory=243.31 KB Total=243.31 KB Peak=1.57 MB
AGGREGATION_NODE (id=3): Total=102.12 KB Peak=102.12 KB
Exprs: Total=102.12 KB Peak=102.12 KB
AGGREGATION_NODE (id=7): Total=119.12 KB Peak=119.12 KB
Exprs: Total=119.12 KB Peak=119.12 KB
EXCHANGE_NODE (id=6): Total=0 Peak=0
DataStreamRecvr: Total=0 Peak=0
DataStreamSender (dst_id=8): Total=6.81 KB Peak=6.81 KB
CodeGen: Total=7.25 KB Peak=1.34 MB
Fragment 2647a4f63d37fdaa:690ad3b50000000c: Reservation=2.32 GB OtherMemory=349.48 KB Total=2.32 GB Peak=2.32 GB
AGGREGATION_NODE (id=2): Total=119.12 KB Peak=119.12 KB
Exprs: Total=119.12 KB Peak=119.12 KB
AGGREGATION_NODE (id=5): Reservation=2.32 GB OtherMemory=199.74 KB Total=2.32 GB Peak=2.32 GB
Exprs: Total=120.12 KB Peak=120.12 KB
EXCHANGE_NODE (id=4): Total=0 Peak=0
DataStreamRecvr: Total=336.00 B Peak=549.14 KB
DataStreamSender (dst_id=6): Total=6.44 KB Peak=6.44 KB
CodeGen: Total=15.85 KB Peak=3.10 MB
Fragment 2647a4f63d37fdaa:690ad3b500000004: Reservation=18.29 GB OtherMemory=100.52 MB Total=18.38 GB Peak=18.38 GB
AGGREGATION_NODE (id=1): Reservation=18.29 GB OtherMemory=334.12 KB Total=18.29 GB Peak=18.29 GB
Exprs: Total=148.12 KB Peak=148.12 KB
HDFS_SCAN_NODE (id=0): Total=100.17 MB Peak=178.15 MB
Exprs: Total=4.00 KB Peak=4.00 KB
DataStreamSender (dst_id=4): Total=6.75 KB Peak=6.75 KB
CodeGen: Total=9.72 KB Peak=2.92 MB
RequestPool=fe-eval-exprs: Total=0 Peak=12.00 KB
Untracked Memory: Total=1.44 GB
tcmalloc
------------------------------------------------
MALLOC: 24646559936 (23504.8 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 725840992 ( 692.2 MiB) Bytes in central cache freelist
MALLOC: + 4726720 ( 4.5 MiB) Bytes in transfer cache freelist
MALLOC: + 208077600 ( 198.4 MiB) Bytes in thread cache freelists
MALLOC: + 105918656 ( 101.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 25691123904 (24501.0 MiB) Actual memory used (physical + swap)
MALLOC: + 53904392192 (51407.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 79595516096 (75908.2 MiB) Virtual address space used
MALLOC:
MALLOC: 133041 Spans in use
MALLOC: 842 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
System
Physical Memory: 252.41 GB
Transparent Huge Pages Config:
enabled: always [madvise] never
defrag: [always] madvise never
khugepaged defrag: 1
Process and system memory metrics
Name Value Description
memory.anon-huge-page-bytes 19.01 GB Total bytes of anonymous (a.k.a. transparent) huge pages used by this process.
memory.mapped-bytes 113.09 GB Total bytes of memory mappings in this process (the virtual memory size).
memory.num-maps 18092 Total number of memory mappings in this process.
memory.rss 24.51 GB Resident set size (RSS) of this process, including TCMalloc, buffer pool and Jvm.
memory.thp.defrag [always] madvise never The system-wide 'defrag' setting for Transparent Huge Pages.
memory.thp.enabled always [madvise] never The system-wide 'enabled' setting for Transparent Huge Pages.
memory.thp.khugepaged-defrag 1 The system-wide 'defrag' setting for khugepaged.
memory.total-used 23.83 GB Total memory currently used by TCMalloc and buffer pool.
Buffer pool memory metrics
Name Value Description
buffer-pool.clean-page-bytes 0 Total bytes of clean page memory cached in the buffer pool.
buffer-pool.clean-pages 0 Total number of clean pages cached in the buffer pool.
buffer-pool.clean-pages-limit 12.00 GB Limit on number of clean pages cached in the buffer pool.
buffer-pool.free-buffer-bytes 72.69 MB Total bytes of free buffer memory cached in the buffer pool.
buffer-pool.free-buffers 177 Total number of free buffers cached in the buffer pool.
buffer-pool.limit 120.00 GB Maximum allowed bytes allocated by the buffer pool.
buffer-pool.reserved 20.67 GB Total bytes of buffers reserved by Impala subsystems
buffer-pool.system-allocated 20.67 GB Total buffer memory currently allocated by the buffer pool.
buffer-pool.unused-reservation-bytes 71.94 MB Total bytes of buffer reservations by Impala subsystems that are currently unused
JVM aggregate memory metrics
Name Value Description
jvm.total.committed-usage-bytes 1.45 GB Jvm total Committed Usage Bytes
jvm.total.current-usage-bytes 903.10 MB Jvm total Current Usage Bytes
jvm.total.init-usage-bytes 1.92 GB Jvm total Init Usage Bytes
jvm.total.max-usage-bytes 31.23 GB Jvm total Max Usage Bytes
jvm.total.peak-committed-usage-bytes 2.09 GB Jvm total Peak Committed Usage Bytes
jvm.total.peak-current-usage-bytes 1.48 GB Jvm total Peak Current Usage Bytes
jvm.total.peak-init-usage-bytes 1.92 GB Jvm total Peak Init Usage Bytes
jvm.total.peak-max-usage-bytes 31.41 GB Jvm total Peak Max Usage Bytes
JVM heap memory metrics
Name Value Description
jvm.heap.committed-usage-bytes 1.37 GB Jvm heap Committed Usage Bytes
jvm.heap.current-usage-bytes 827.25 MB Jvm heap Current Usage Bytes
jvm.heap.init-usage-bytes 2.00 GB Jvm heap Init Usage Bytes
jvm.heap.max-usage-bytes 26.67 GB Jvm heap Max Usage Bytes
jvm.heap.peak-committed-usage-bytes 0 Jvm heap Peak Committed Usage Bytes
jvm.heap.peak-current-usage-bytes 0 Jvm heap Peak Current Usage Bytes
jvm.heap.peak-init-usage-bytes 0 Jvm heap Peak Init Usage Bytes
jvm.heap.peak-max-usage-bytes 0 Jvm heap Peak Max Usage Bytes
JVM non-heap memory metrics
Name Value Description
jvm.non-heap.committed-usage-bytes 76.90 MB Jvm non-heap Committed Usage Bytes
jvm.non-heap.current-usage-bytes 75.68 MB Jvm non-heap Current Usage Bytes
jvm.non-heap.init-usage-bytes 2.44 MB Jvm non-heap Init Usage Bytes
jvm.non-heap.max-usage-bytes -1.00 B Jvm non-heap Max Usage Bytes
jvm.non-heap.peak-committed-usage-bytes 0 Jvm non-heap Peak Committed Usage Bytes
jvm.non-heap.peak-current-usage-bytes 0 Jvm non-heap Peak Current Usage Bytes
jvm.non-heap.peak-init-usage-bytes 0 Jvm non-heap Peak Init Usage Bytes
jvm.non-heap.peak-max-usage-bytes 0 Jvm non-heap Peak Max Usage Bytes
Process: Limit=150.00 GB Total=23.83 GB Peak=58.75 GB
This is caused by the memory limit being exceeded ("Memory limit exceeded").
Change the settings memory.soft_limit_in_bytes, memory.limit_in_bytes, mem_limit and default_pool_mem_limit to 0 or -1.
A value of -1 or 0 represents unlimited.

Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.
The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE
static inline void *__movsb(void *d, const void *s, size_t n) {
    asm volatile ("rep movsb"
                  : "=D" (d),
                    "=S" (s),
                    "=c" (n)
                  : "0" (d),
                    "1" (s),
                    "2" (n)
                  : "memory");
    return d;
}
When I use this however, the bandwidth is much less than with memcpy.
__movsb gets 15 GB/s and memcpy gets 26 GB/s with my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4 @ 2400 MHz dual channel 32 GB, GCC 6.2.
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
Here is the code I used to test this.
//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>
static inline void *__movsb(void *d, const void *s, size_t n) {
    asm volatile ("rep movsb"
                  : "=D" (d),
                    "=S" (s),
                    "=c" (n)
                  : "0" (d),
                    "1" (s),
                    "2" (n)
                  : "memory");
    return d;
}
int main(void) {
    int n = 1<<30;
    //char *a = malloc(n), *b = malloc(n);
    char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
    memset(a,2,n), memset(b,1,n);
    __movsb(b,a,n);
    printf("%d\n", memcmp(b,a,n));
    double dtime;
    dtime = -omp_get_wtime();
    for(int i=0; i<10; i++) __movsb(b,a,n);
    dtime += omp_get_wtime();
    printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
    dtime = -omp_get_wtime();
    for(int i=0; i<10; i++) memcpy(b,a,n);
    dtime += omp_get_wtime();
    printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
}
The reason I am interested in rep movsb is based on these comments:
Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not...
rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)
What's missing/sub-optimal in this memcpy implementation?
Here are my results on the same system from tinymembench.
C copy backwards : 7910.6 MB/s (1.4%)
C copy backwards (32 byte blocks) : 7696.6 MB/s (0.9%)
C copy backwards (64 byte blocks) : 7679.5 MB/s (0.7%)
C copy : 8811.0 MB/s (1.2%)
C copy prefetched (32 bytes step) : 9328.4 MB/s (0.5%)
C copy prefetched (64 bytes step) : 9355.1 MB/s (0.6%)
C 2-pass copy : 6474.3 MB/s (1.3%)
C 2-pass copy prefetched (32 bytes step) : 7072.9 MB/s (1.2%)
C 2-pass copy prefetched (64 bytes step) : 7065.2 MB/s (0.8%)
C fill : 14426.0 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 14198.0 MB/s (1.1%)
C fill (shuffle within 32 byte blocks) : 14422.0 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 14178.3 MB/s (1.0%)
---
standard memcpy : 12784.4 MB/s (1.9%)
standard memset : 30630.3 MB/s (1.1%)
---
MOVSB copy : 8712.0 MB/s (2.0%)
MOVSD copy : 8712.7 MB/s (1.9%)
SSE2 copy : 8952.2 MB/s (0.7%)
SSE2 nontemporal copy : 12538.2 MB/s (0.8%)
SSE2 copy prefetched (32 bytes step) : 9553.6 MB/s (0.8%)
SSE2 copy prefetched (64 bytes step) : 9458.5 MB/s (0.5%)
SSE2 nontemporal copy prefetched (32 bytes step) : 13103.2 MB/s (0.7%)
SSE2 nontemporal copy prefetched (64 bytes step) : 13179.1 MB/s (0.9%)
SSE2 2-pass copy : 7250.6 MB/s (0.7%)
SSE2 2-pass copy prefetched (32 bytes step) : 7437.8 MB/s (0.6%)
SSE2 2-pass copy prefetched (64 bytes step) : 7498.2 MB/s (0.9%)
SSE2 2-pass nontemporal copy : 3776.6 MB/s (1.4%)
SSE2 fill : 14701.3 MB/s (1.6%)
SSE2 nontemporal fill : 34188.3 MB/s (0.8%)
Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.
In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.
When I do
sudo cpufreq-set -r -g performance
I sometimes see over 20 GB/s with rep movsb.
with
sudo cpufreq-set -r -g powersave
the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.
I checked the frequency (using turbostat) with and without SpeedStep enabled, with performance and with powersave for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).
SpeedStep idle 1 core 4 core
powersave OFF 0.8 2.6 2.6
performance OFF 2.6 2.6 2.6
powersave ON 0.8 3.5 3.1
performance ON 3.5 3.5 3.1
This shows that with powersave even with SpeedStep disabled the CPU
still clocks down to the idle frequency of 0.8 GHz. It's only with performance without SpeedStep that the CPU runs at a constant frequency.
I used e.g. sudo cpufreq-set -r performance (because cpufreq-set was giving strange results) to change the power settings. This turns turbo back on, so I had to disable turbo afterwards.
This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.
Partly, this is a call to share results - if you can run Tinymembench and share the results along with details of your CPU and RAM configuration it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.
History and Official Advice
The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them into line with or even faster than competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.
For example, in guides preceding the introduction of Ivy Bridge, the typical advice is to avoid them or use them very carefully1.
The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:
The specific variant of the implementation is chosen at execution time
based on data layout, alignment and the counter (ECX) value. For
example, MOVSB/STOSB with the REP prefix should be used with counter
value less than or equal to three for best performance.
So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with 2 explicit byte/word movs if you know the size is exactly three).
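For instance, a copy of at most 3 bytes needs no rep at all; a tiny sketch (illustrative, not from the guide) using one word move plus one byte move:
#include <stddef.h>
#include <string.h>

/* Copy n <= 3 bytes with at most one 2-byte and one 1-byte move. */
static void copy_upto3(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n >= 2) {
        unsigned short w;
        memcpy(&w, s, 2);      /* compiles to a single 2-byte load */
        memcpy(d, &w, 2);      /* ...and a single 2-byte store */
    }
    if (n & 1)
        d[n - 1] = s[n - 1];   /* the odd trailing byte (sizes 1 and 3) */
}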
They go on to say:
String MOVE/STORE instructions have multiple data granularities. For
efficient data movement, larger data granularities are preferable.
This means better efficiency can be achieved by decomposing an
arbitrary counter value into a number of double words plus single byte
moves with a count value less than or equal to 3.
This certainly seems wrong on current hardware with ERMSB, where rep movsb is at least as fast, or faster, than the rep movsd or rep movsq variants for large copies.
In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated incrementally for each architecture (and purport to cover nearly two decades' worth of architectures even in the current manual), and old sections are often not updated to replace or qualify advice that doesn't apply to the current architecture.
They then go on to cover ERMSB explicitly in section 3.7.6.
I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.
Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.
Technical Considerations
This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.
Advantages for rep movs
When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
Avoiding the RFO request when it knows the entire cache line will be overwritten.
Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and simplify other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering4.
In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes5, and rep movs could use them internally.
Disadvantages
rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.
The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.
Test Results
Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):
C copy backwards : 8284.8 MB/s (0.3%)
C copy backwards (32 byte blocks) : 8273.9 MB/s (0.4%)
C copy backwards (64 byte blocks) : 8321.9 MB/s (0.8%)
C copy : 8863.1 MB/s (0.3%)
C copy prefetched (32 bytes step) : 8900.8 MB/s (0.3%)
C copy prefetched (64 bytes step) : 8817.5 MB/s (0.5%)
C 2-pass copy : 6492.3 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 6516.0 MB/s (2.4%)
C 2-pass copy prefetched (64 bytes step) : 6520.5 MB/s (1.2%)
---
standard memcpy : 12169.8 MB/s (3.4%)
standard memset : 23479.9 MB/s (4.2%)
---
MOVSB copy : 10197.7 MB/s (1.6%)
MOVSD copy : 10177.6 MB/s (1.6%)
SSE2 copy : 8973.3 MB/s (2.5%)
SSE2 nontemporal copy : 12924.0 MB/s (1.7%)
SSE2 copy prefetched (32 bytes step) : 9014.2 MB/s (2.7%)
SSE2 copy prefetched (64 bytes step) : 8964.5 MB/s (2.3%)
SSE2 nontemporal copy prefetched (32 bytes step) : 11777.2 MB/s (5.6%)
SSE2 nontemporal copy prefetched (64 bytes step) : 11826.8 MB/s (3.2%)
SSE2 2-pass copy : 7529.5 MB/s (1.8%)
SSE2 2-pass copy prefetched (32 bytes step) : 7122.5 MB/s (1.0%)
SSE2 2-pass copy prefetched (64 bytes step) : 7214.9 MB/s (1.4%)
SSE2 2-pass nontemporal copy : 4987.0 MB/s
Some key takeaways:
The rep movs methods are faster than all the other methods which aren't "non-temporal"[7], and considerably faster than the "C" approaches which copy 8 bytes at a time.
The "non-temporal" methods are faster, by up to about 26%, than the rep movs ones - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE load/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 26 GB/s for stores).
There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy) but it probably doesn't matter due to the above note.
The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single thread, which changes the behavior dramatically.
rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
Haswell
Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):
C copy : 6777.8 MB/s (0.4%)
standard memcpy : 10487.3 MB/s (0.5%)
MOVSB copy : 9393.9 MB/s (0.2%)
MOVSD copy : 9155.0 MB/s (1.6%)
SSE2 copy : 6780.5 MB/s (0.4%)
SSE2 nontemporal copy : 10688.2 MB/s (0.3%)
The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.
When should you use rep movs?
Finally, a stab at your actual question: when or why should you use it? This draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.
Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against the built-in memcpy), but some apply to both.
Restrictions on available instructions
In some environments there is a restriction on certain instructions or using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.
A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...
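For example, a kernel-friendly memset that touches no SSE/AVX state can be written directly around rep stosb. A minimal sketch using GCC-style inline asm (the same constraint style as the rep movsb wrapper shown later in this document):

#include <stddef.h>

static void rep_stosb_memset(void *dst, unsigned char val, size_t n)
{
    __asm__ __volatile__ ("rep stosb"
                          : "+D" (dst), "+c" (n)   /* RDI = destination, RCX = count */
                          : "a" (val)              /* AL  = byte value to store      */
                          : "memory");
}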
Future Proofing
A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.
So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).
How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and on a judgement, which is very difficult to make, of how fast these instructions are likely to be in the future. At least Intel is kind of guiding users in this direction, though, by committing to at least reasonable performance in the future (15.3.3.6):
REP MOVSB and REP STOSB will continue to perform reasonably well on
future processors.
Overlapping with subsequent work
This benefit won't show up in a plain memcpy benchmark of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.
This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:
When the count is known to be at least a thousand byte or more, using
enhanced REP MOVSB/STOSB can provide another advantage to amortize the
cost of the non-consuming code. The heuristic can be understood
using a value of Cnt = 4096 and memset() as example:
• A 256-bit SIMD implementation of memset() will need to issue/execute/retire
128 instances of 32-byte store operation with VMOVDQA, before the
non-consuming instruction sequences can make their way to retirement.
• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a
long micro-op flow provided by hardware, but retires as one
instruction. There are many store_data operation that must complete
before the result of memset() can be consumed. Because the completion
of store data operation is de-coupled from program-order retirement, a
substantial part of the non-consuming code stream can process through
the issue/execute and retirement, essentially cost-free if the
non-consuming sequence does not compete for store buffer resources.
So Intel is saying that some uops of the code after rep movsb can issue and execute while lots of the copy's stores are still in flight and the rep movsb as a whole hasn't retired yet, so uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.
The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.
There doesn't seem to be much detailed information about how very long microcoded instructions like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?
When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute[8] while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet.
This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).
Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).
As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.
Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on which kinds of copies or surrounding code benefit the most (hot or cold destination or source, high-ILP or low-ILP high-latency code afterwards).
Code Size
The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.
Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.
Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.
Architecture Specific Optimization
You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).
That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.
Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:
the big weakness of doing fast strings in microcode was [...] the
microcode fell out of tune with every generation, getting slower and
slower until somebody got around to fixing it. Just like a library memcpy
falls out of tune. I suppose that it is possible that one of the
missed opportunities was to use 128-bit loads and stores when they
became available, and so on.
In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled code this is easy, but for statically compiled binaries it does require platform-specific dispatch, though that often already exists (sometimes implemented at link time), or the -mtune argument can be used to make a static decision.
Simplicity
Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).
Latency Bound Platforms
Memory-throughput-bound algorithms[9] can actually operate in two main overall regimes: DRAM-bandwidth bound or concurrency/latency bound.
The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretic bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.
You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).
The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.
In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
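A back-of-the-envelope sketch of the two limits described above (the numbers are just the example values from the text, not measurements):

#include <stdio.h>

int main(void)
{
    /* DRAM limit: data rate (GT/s) x bytes per transfer x channels.         */
    double dram_gbs = 2.133 * 8 * 2;                 /* DDR4-2133, 2 channels */

    /* Concurrency limit: outstanding cache lines / memory latency.          */
    double line_bytes    = 64.0;
    double latency_ns    = 50.0;                     /* assumed load latency  */
    double max_in_flight = 10.0;                     /* ~L1 line fill buffers */

    printf("DRAM bandwidth limit        : %.1f GB/s\n", dram_gbs);
    printf("One request at a time       : %.2f GB/s\n", line_bytes / latency_ns);
    printf("Concurrency limit (10 LFBs) : %.1f GB/s\n",
           line_bytes * max_in_flight / latency_ns);
    return 0;
}

With these example numbers, the DRAM limit is ~34 GB/s while the concurrency limit works out to ~13 GB/s, which is why the per-core number often matters more than the headline DRAM figure.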
Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4-channel system). Only some recent-generation 2-channel "client" chips, which seem to have a better uncore and perhaps more line fill buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be among them.
Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8-core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.
Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important, since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt by increasing latency: the handoff time for the line fill buffer may be longer than a scenario where prefetch brings the RFO line into LLC (or even L2) and then the store completes in LLC for an effectively lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.
So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance), and perhaps rep movsb wins there (if it gets the best of both worlds).
Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...
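For concreteness, the "SSE2 nontemporal copy" style of routine referred to in the results above looks roughly like the following sketch (my own illustration; it assumes 16-byte-aligned pointers and a size that is a multiple of 16, so real code needs head/tail handling):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

static void sse2_nt_copy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_load_si128(s + i);  /* ordinary (temporal) load          */
        _mm_stream_si128(d + i, v);         /* NT store: bypasses the cache and
                                               avoids the read-for-ownership     */
    }
    _mm_sfence();  /* order the weakly-ordered NT stores before later stores */
}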
References
Other good sources of info not integrated in the above.
comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last read/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).
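A sketch of that overlapping trick (my own illustration, not code from the linked thread): a copy of 9 to 16 bytes becomes two 8-byte loads and stores, with the second pair anchored to the end of the region so the middle bytes may be written twice:

#include <stdint.h>
#include <string.h>

static void copy_9_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    memcpy(&head, src, 8);                         /* first 8 bytes              */
    memcpy(&tail, (const char *)src + n - 8, 8);   /* last 8 bytes (may overlap) */
    memcpy(dst, &head, 8);                         /* fixed-size memcpy compiles */
    memcpy((char *)dst + n - 8, &tail, 8);         /* to a single 8-byte mov     */
}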
1 Presumably the intention is to restrict it to cases where, for example, code-size is very important.
2 See Section 3.7.5: REP Prefix and Data Movement.
3 It is key to note this applies only for the various stores within the single instruction itself: once complete, the block of stores still appear ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.
4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.
5 This was common in the latter part of the 32-bit era, where many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably rep movs microcode can still use 256b loads/stores.
6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.
7 I'm making the assumption that "standard" memcpy is choosing a non-temporal approach, which is highly likely for this size of buffer.
8 That isn't necessarily obvious, since it could be the case that the uop stream that is generated by the rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that however - uops from subsequent instructions can mingle with uops from the microcoded rep movsb.
9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (as opposed to purely latency-bound loads such as pointer chasing).
Enhanced REP MOVSB (Ivy Bridge and later)
The Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (ERMSB). We still need to check the corresponding CPUID bit. ERMSB was intended to allow us to copy memory fast with rep movsb.
The cheapest versions of later processors (Kaby Lake Celeron and Pentium, released in 2017) don't have AVX, which could have been used for fast memory copy, but they still have Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on Skylake, copy about twice as many bytes per CPU cycle with REP MOVSB as previous generations of microarchitectures.
Enhanced REP MOVSB (ERMSB), before the Ice Lake microarchitecture with Fast Short REP MOV (FSRM), was only faster than an AVX copy or a general-purpose register copy if the block size was at least 256 bytes. For blocks below 64 bytes, it was much slower, because ERMSB has a high internal startup cost of about 35 cycles. The FSRM feature was intended to make blocks below 128 bytes also quick.
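To check for these features at run time, read the structured extended feature flags: ERMS is reported in EBX bit 9 and FSRM in EDX bit 4 of CPUID leaf 7, sub-leaf 0. A minimal sketch using the __get_cpuid_count() helper from GCC/Clang's <cpuid.h> (available in reasonably recent compilers):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 7, sub-leaf 0: structured extended feature flags. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("ERMSB: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
    printf("FSRM : %s\n", (edx & (1u << 4)) ? "yes" : "no");
    return 0;
}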
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (applies to processors which did not yet have FSRM):
startup cost is 35 cycles;
both the source and destination addresses have to be aligned to a 16-Byte boundary;
the source region should not overlap with the destination region;
the length has to be a multiple of 64 to produce higher performance;
the direction has to be forward (CLD).
As I said earlier, REP MOVSB (on processors before FSRM) begins to outperform other methods when the length is at least 256 bytes, but to see the clear benefit over AVX copy, the length has to be more than 2048 bytes. Also, it should be noted that merely using AVX (256-bit registers) or AVX-512 (512-bit registers) for memory copy may sometimes have dire consequences like AVX/SSE transition penalties or reduced turbo frequency. So the REP MOVSB is a safer way to copy memory than AVX.
On the effect of alignment for REP MOVSB vs. AVX copy, the Intel Manual gives the following information:
if the source buffer is not aligned, the impact on ERMSB implementation versus 128-bit AVX is similar;
if the destination buffer is not aligned, the effect on ERMSB implementation can be 25% degradation, while 128-bit AVX implementation of memory copy may degrade only 5%, relative to 16-byte aligned scenario.
I have made tests on Intel Core i5-6600, under 64-bit, and I have compared REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits L1 cache:
REP MOVSB memory copy
- 1622400000 data blocks of 32 bytes took 17.9337 seconds to copy; 2760.8205 MB/s
- 1622400000 data blocks of 64 bytes took 17.8364 seconds to copy; 5551.7463 MB/s
- 811200000 data blocks of 128 bytes took 10.8098 seconds to copy; 9160.5659 MB/s
- 405600000 data blocks of 256 bytes took 5.8616 seconds to copy; 16893.5527 MB/s
- 202800000 data blocks of 512 bytes took 3.9315 seconds to copy; 25187.2976 MB/s
- 101400000 data blocks of 1024 bytes took 2.1648 seconds to copy; 45743.4214 MB/s
- 50700000 data blocks of 2048 bytes took 1.5301 seconds to copy; 64717.0642 MB/s
- 25350000 data blocks of 4096 bytes took 1.3346 seconds to copy; 74198.4030 MB/s
- 12675000 data blocks of 8192 bytes took 1.1069 seconds to copy; 89456.2119 MB/s
- 6337500 data blocks of 16384 bytes took 1.1120 seconds to copy; 89053.2094 MB/s
MOV RAX... memory copy
- 1622400000 data blocks of 32 bytes took 7.3536 seconds to copy; 6733.0256 MB/s
- 1622400000 data blocks of 64 bytes took 10.7727 seconds to copy; 9192.1090 MB/s
- 811200000 data blocks of 128 bytes took 8.9408 seconds to copy; 11075.4480 MB/s
- 405600000 data blocks of 256 bytes took 8.4956 seconds to copy; 11655.8805 MB/s
- 202800000 data blocks of 512 bytes took 9.1032 seconds to copy; 10877.8248 MB/s
- 101400000 data blocks of 1024 bytes took 8.2539 seconds to copy; 11997.1185 MB/s
- 50700000 data blocks of 2048 bytes took 7.7909 seconds to copy; 12710.1252 MB/s
- 25350000 data blocks of 4096 bytes took 7.5992 seconds to copy; 13030.7062 MB/s
- 12675000 data blocks of 8192 bytes took 7.4679 seconds to copy; 13259.9384 MB/s
So, even on 128-byte blocks, REP MOVSB (on processors before FSRM) is slower than just a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
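For reference, the plain 64-bit register copy being compared against is roughly the following (an assumed sketch, not the author's exact test code; n must be a nonzero multiple of 8):

#include <stddef.h>

static void qword_copy(void *dst, const void *src, size_t n)
{
    __asm__ __volatile__ (
        "1:\n\t"
        "mov (%[s]), %%rax\n\t"   /* load 8 bytes from the source   */
        "mov %%rax, (%[d])\n\t"   /* store them to the destination  */
        "add $8, %[s]\n\t"
        "add $8, %[d]\n\t"
        "sub $8, %[n]\n\t"
        "jnz 1b\n\t"
        : [d] "+r" (dst), [s] "+r" (src), [n] "+r" (n)
        :
        : "rax", "memory", "cc");
}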
Fast Short REP MOV (FSRM)
The Ice Lake microarchitecture, launched in September 2019, introduced Fast Short REP MOV (FSRM). This feature can be tested for with a CPUID bit. It was intended for strings of 128 bytes and less to also be quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; under 32-bit, strings have to be at least 4 KB for rep movsb to start outperforming other methods.
Normal (not enhanced) REP MOVS on Nehalem (2009-2013)
Surprisingly, previous architectures (Nehalem and later, up to, but not including, Ivy Bridge), which didn't yet have Enhanced REP MOVSB, had a relatively fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementation for large blocks, as long as the blocks were not large enough to exceed the L1 cache.
The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information, related to the Nehalem microarchitecture (Intel Core i5, i7 and Xeon processors released in 2009 and 2010) and later microarchitectures, including Sandy Bridge, manufactured up to 2013.
REP MOVSB
The latency of REP MOVSB is 9 cycles if ECX < 4; otherwise, REP MOVSB with ECX > 9 has a 50-cycle startup cost:
tiny string (ECX < 4): the latency of REP MOVSB is 9 cycles;
small string (ECX is between 4 and 9): no official information in the Intel manual, probably more than 9 cycles but less than 50 cycles;
long string (ECX > 9): 50-cycle startup cost.
MOVSW/MOVSD/MOVSQ
Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):
Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
Fast string (ECX >= 76, excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of the REP string operation will vary if one of the 16-byte data transfers spans a cache line boundary:
= Split-free: the latency consists of a startup cost of about 40 cycles, and every 64 bytes of data adds 4 cycles.
= Cache splits: the latency consists of a startup cost of about 35 cycles, and every 64 bytes of data adds 6 cycles.
Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
Therefore, according to Intel, for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ. Anyway, my tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.
According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.
Conclusion: if you just need to copy data that fits L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!
REP MOVSD/MOVSQ is the universal solution that works excellently on all Intel processors (no ERMSB required) if the data fits in the L1 cache.
Here are the tests of REP MOVS* when the source and destination were in the L1 cache, for blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/
Yonah (2006-2008)
REP MOVSB 10.91 B/c
REP MOVSW 10.85 B/c
REP MOVSD 11.05 B/c
Nehalem (2009-2010)
REP MOVSB 25.32 B/c
REP MOVSW 19.72 B/c
REP MOVSD 27.56 B/c
REP MOVSQ 27.54 B/c
Westmere (2010-2011)
REP MOVSB 21.14 B/c
REP MOVSW 19.11 B/c
REP MOVSD 24.27 B/c
Ivy Bridge (2012-2013) - with Enhanced REP MOVSB (all subsequent CPUs also have Enhanced REP MOVSB)
REP MOVSB 28.72 B/c
REP MOVSW 19.40 B/c
REP MOVSD 27.96 B/c
REP MOVSQ 27.89 B/c
SkyLake (2015-2016)
REP MOVSB 57.59 B/c
REP MOVSW 58.20 B/c
REP MOVSD 58.10 B/c
REP MOVSQ 57.59 B/c
Kaby Lake (2016-2017)
REP MOVSB 58.00 B/c
REP MOVSW 57.69 B/c
REP MOVSD 58.00 B/c
REP MOVSQ 57.89 B/c
I have presented test results for both SkyLake and Kaby Lake just for the sake of confirmation - these architectures have the same cycle-per-instruction data.
Cannon Lake, mobile (May 2018 - February 2020)
REP MOVSB 107.44 B/c
REP MOVSW 106.74 B/c
REP MOVSD 107.08 B/c
REP MOVSQ 107.08 B/c
Cascade Lake, server (April 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.51 B/c
REP MOVSD 58.51 B/c
REP MOVSQ 58.20 B/c
Comet Lake, desktop, workstation, mobile (August 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.62 B/c
REP MOVSD 58.72 B/c
REP MOVSQ 58.72 B/c
Ice Lake, mobile (September 2019)
REP MOVSB 102.40 B/c
REP MOVSW 101.14 B/c
REP MOVSD 101.14 B/c
REP MOVSQ 101.14 B/c
Tremont, low power (September, 2020)
REP MOVSB 119.84 B/c
REP MOVSW 121.78 B/c
REP MOVSD 121.78 B/c
REP MOVSQ 121.78 B/c
Tiger Lake, mobile (October, 2020)
REP MOVSB 93.27 B/c
REP MOVSW 93.09 B/c
REP MOVSD 93.09 B/c
REP MOVSQ 93.09 B/c
As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is fastest, albeit only slightly faster than REP MOVSD/MOVSQ, but no doubt on all processors since Nehalem, REP MOVSD/MOVSQ works very well - you don't even need "Enhanced REP MOVSB", since on Ivy Bridge (2013), which has Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock figure as on Nehalem (2010), which does not, while REP MOVSB in fact only became very fast with Skylake (2015) - twice as fast as on Ivy Bridge. So this "Enhanced REP MOVSB" bit in CPUID may be confusing: it only shows that REP MOVSB per se is OK, not that any REP MOVS* is faster.
The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO). But this protocol is no longer used on Ivy Bridge, which has ERMSB. According to Andy Glew's comments on an answer to "why are complicated memcpy/memset superior?" (a Peter Cordes answer), a cache protocol feature that is not available to regular code was once used on older processors, but no longer on Ivy Bridge. And there comes an explanation of why the startup costs are so high for REP MOVS*: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol - they did not violate memory ordering, unlike ERMSB in Ivy Bridge.
As for rep movsb vs. rep movsq: on some processors with ERMSB rep movsb is slightly faster (e.g., Xeon E3-1246 v3), on others rep movsq is faster (Skylake), and on others they are the same speed (e.g., i7-1065G7). However, I would go for rep movsq rather than rep movsb anyway.
Please also note that this answer is only relevant for the cases where the source and the destination data fits L1 cache. Depending on circumstances, the particularities of memory access (cache, etc.) should be taken into consideration.
Please also note that the information in this answer is only related to Intel processors and not to the processors by other manufacturers like AMD that may have better or worse implementations of REP MOVS* instructions.
Tinymembench results
Here are some of the tinymembench results to show relative performance of the rep movsb and rep movsd.
Intel Xeon E5-1650V3
Haswell microarchitecture, ERMS, AVX-2, released in September 2014 for $583, base frequency 3.5 GHz, max turbo frequency 3.8 GHz (one core), L2 cache 6 × 256 KB, L3 cache 15 MB, supports up to 4×DDR4-2133; 8 modules of 32768 MB DDR4 ECC registered RAM installed (256 GB total).
C copy backwards : 7268.8 MB/s (1.5%)
C copy backwards (32 byte blocks) : 7264.3 MB/s
C copy backwards (64 byte blocks) : 7271.2 MB/s
C copy : 7147.2 MB/s
C copy prefetched (32 bytes step) : 7044.6 MB/s
C copy prefetched (64 bytes step) : 7032.5 MB/s
C 2-pass copy : 6055.3 MB/s
C 2-pass copy prefetched (32 bytes step) : 6350.6 MB/s
C 2-pass copy prefetched (64 bytes step) : 6336.4 MB/s
C fill : 11072.2 MB/s
C fill (shuffle within 16 byte blocks) : 11071.3 MB/s
C fill (shuffle within 32 byte blocks) : 11070.8 MB/s
C fill (shuffle within 64 byte blocks) : 11072.0 MB/s
---
standard memcpy : 11608.9 MB/s
standard memset : 15789.7 MB/s
---
MOVSB copy : 8123.9 MB/s
MOVSD copy : 8100.9 MB/s (0.3%)
SSE2 copy : 7213.2 MB/s
SSE2 nontemporal copy : 11985.5 MB/s
SSE2 copy prefetched (32 bytes step) : 7055.8 MB/s
SSE2 copy prefetched (64 bytes step) : 7044.3 MB/s
SSE2 nontemporal copy prefetched (32 bytes step) : 11794.4 MB/s
SSE2 nontemporal copy prefetched (64 bytes step) : 11813.1 MB/s
SSE2 2-pass copy : 6394.3 MB/s
SSE2 2-pass copy prefetched (32 bytes step) : 6255.9 MB/s
SSE2 2-pass copy prefetched (64 bytes step) : 6234.0 MB/s
SSE2 2-pass nontemporal copy : 4279.5 MB/s
SSE2 fill : 10745.0 MB/s
SSE2 nontemporal fill : 22014.4 MB/s
Intel Xeon E3-1246 v3
Haswell, ERMS, AVX-2, 3.50GHz
C copy backwards : 6911.8 MB/s
C copy backwards (32 byte blocks) : 6919.0 MB/s
C copy backwards (64 byte blocks) : 6924.6 MB/s
C copy : 6934.3 MB/s (0.2%)
C copy prefetched (32 bytes step) : 6860.1 MB/s
C copy prefetched (64 bytes step) : 6875.6 MB/s (0.1%)
C 2-pass copy : 6471.2 MB/s
C 2-pass copy prefetched (32 bytes step) : 6710.3 MB/s
C 2-pass copy prefetched (64 bytes step) : 6745.5 MB/s (0.3%)
C fill : 10812.1 MB/s (0.2%)
C fill (shuffle within 16 byte blocks) : 10807.7 MB/s
C fill (shuffle within 32 byte blocks) : 10806.6 MB/s
C fill (shuffle within 64 byte blocks) : 10809.7 MB/s
---
standard memcpy : 10922.0 MB/s
standard memset : 28935.1 MB/s
---
MOVSB copy : 9656.7 MB/s
MOVSD copy : 9430.1 MB/s
SSE2 copy : 6939.1 MB/s
SSE2 nontemporal copy : 10820.6 MB/s
SSE2 copy prefetched (32 bytes step) : 6857.4 MB/s
SSE2 copy prefetched (64 bytes step) : 6854.9 MB/s
SSE2 nontemporal copy prefetched (32 bytes step) : 10774.2 MB/s
SSE2 nontemporal copy prefetched (64 bytes step) : 10782.1 MB/s
SSE2 2-pass copy : 6683.0 MB/s
SSE2 2-pass copy prefetched (32 bytes step) : 6687.6 MB/s
SSE2 2-pass copy prefetched (64 bytes step) : 6685.8 MB/s
SSE2 2-pass nontemporal copy : 5234.9 MB/s
SSE2 fill : 10622.2 MB/s
SSE2 nontemporal fill : 22515.2 MB/s (0.1%)
Intel Xeon Skylake-SP
Skylake, ERMS, AVX-512, 2.1 GHz (Xeon Gold 6152 at base frequency, no turbo)
MOVSB copy : 4619.3 MB/s (0.6%)
SSE2 fill : 9774.4 MB/s (1.5%)
SSE2 nontemporal fill : 6715.7 MB/s (1.1%)
Intel Xeon E3-1275V6
Kaby Lake, released in March 2017 for $339, base frequency 3.8 GHz, max turbo frequency 4.2 GHz, L2 cache 4 × 256 KB, L3 cache 8 MB, 4 cores (8 threads); 4 RAM modules of 16384 MB DDR4 ECC installed, but it can use only 2 memory channels.
MOVSB copy : 11720.8 MB/s
SSE2 fill : 15877.6 MB/s (2.7%)
SSE2 nontemporal fill : 36407.1 MB/s
Intel i7-1065G7
Ice Lake, AVX-512, ERMS, FSRM, 1.37 GHz (worked at the base frequency, turbo mode disabled)
MOVSB copy : 7322.7 MB/s
SSE2 fill : 9681.7 MB/s
SSE2 nontemporal fill : 16426.2 MB/s
AMD EPYC 7401P
Released in June 2017 at US $1075, based on the Zen gen.1 microarchitecture, 24 cores (48 threads), base frequency 2.0 GHz, max turbo boost 3.0 GHz (a few cores) or 2.8 GHz (all cores); cache: L1 - 64 KB inst. & 32 KB data per core, L2 - 512 KB per core, L3 - 64 MB (8 MB per CCX); DDR4-2666, 8 channels, but only 4 RAM modules of 32768 MB DDR4 ECC reg. installed.
MOVSB copy : 7718.0 MB/s
SSE2 fill : 11233.5 MB/s
SSE2 nontemporal fill : 34893.3 MB/s
AMD Ryzen 7 1700X (4 RAM modules installed)
MOVSB copy : 7444.7 MB/s
SSE2 fill : 11100.1 MB/s
SSE2 nontemporal fill : 31019.8 MB/s
AMD Ryzen 7 Pro 1700X (2 RAM modules installed)
MOVSB copy : 7251.6 MB/s
SSE2 fill : 10691.6 MB/s
SSE2 nontemporal fill : 31014.7 MB/s
AMD Ryzen 7 Pro 1700X (4 RAM modules installed)
MOVSB copy : 7429.1 MB/s
SSE2 fill : 10954.6 MB/s
SSE2 nontemporal fill : 30957.5 MB/s
Conclusion
REP MOVSD/MOVSQ is the universal solution that works relatively well on all Intel processors for large memory blocks of at least 4KB (no ERMSB required) if the destination is aligned by at least 64 bytes.
REP MOVSD/MOVSQ works even better on newer processors, starting from Skylake. And, for Ice Lake or newer microarchitectures, it works perfectly for even very small strings of at least 64 bytes.
You say that you want:
an answer that shows when ERMSB is useful
But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:
implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.
So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as bad as it has in some previous CPUs.
However just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.
Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the CPU's cache), sometimes the 'benefit' of AVX et al. is going to be offset by the impact it has on the rest of your code. It depends on what you are doing.
You also ask:
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.
If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.
Or you can just... Well, you asked us not to say it.
This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.
In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.
(I only wish there was a memrepeat(), the opposite of memcpy() compared to memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)
I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out if I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.
The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src, dst, n triplets, where all src to src+n-1 and dst to dst+n-1 are completely within the current area.
A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with original data in current, applying all the triplets in the current set, using memcpy() provided by the C library, and copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.
Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, median makes the most sense in benchmarking, and provides sensible semantics -- the function is at least that fast at least half the time.)
To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) -- note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).
Unfortunately, I have not yet found any set where
static void rep_movsb(void *dst, const void *src, size_t n)
{
    /* "+D" = RDI (destination), "+S" = RSI (source), "+c" = RCX (count);
       the "memory" clobber tells the compiler the asm reads and writes memory. */
    __asm__ __volatile__ ( "rep movsb\n\t"
                           : "+D" (dst), "+S" (src), "+c" (n)
                           :
                           : "memory" );
}
would beat
static void normal_memcpy(void *dst, const void *src, size_t n)
{
memcpy(dst, src, n);
}
compiled with gcc -Wall -O2 -march=ivybridge -mtune=ivybridge (GCC 5.4.0), on the aforementioned Core i5-6200U laptop running a 64-bit Linux 4.4.0 kernel. Copying 4096-byte aligned and sized chunks comes close, however.
This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.
(At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)
This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and about whether the number of bytes to copy is a compile-time constant or a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.
GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy()/memset()/memmove() has already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.
GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least n bytes. An application or a library can choose which implementation of a function to use at run time, by creating a "resolver function" (that returns a function pointer), and defining the function using the ifunc (resolver) attribute.
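A hedged sketch of that resolver pattern (my own illustration, assuming an ELF/glibc target where the ifunc attribute is supported; my_memcpy, memcpy_erms and resolve_my_memcpy are made-up names):

#include <cpuid.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

static void *memcpy_plain(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);              /* fall back to the libc routine */
}

static void *memcpy_erms(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ __volatile__ ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    return ret;
}

/* The resolver runs once, at load time, and returns the implementation that
   all subsequent calls to my_memcpy() will be bound to. */
static memcpy_fn resolve_my_memcpy(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & (1u << 9)))
        return memcpy_erms;                  /* ERMS bit set: use rep movsb */
    return memcpy_plain;
}

void *my_memcpy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));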
One of the most common patterns I use in my code for this is
some_type *pointer = __builtin_assume_aligned(ptr, alignment);
where ptr is some pointer, alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.
Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
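A rough illustration of what that looks like for a plain copy (my own sketch; the prefetch distance of four cache lines ahead is an assumption that would need tuning per machine):

#include <stddef.h>
#include <string.h>

static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    const char *s = src;
    char *d = dst;
    size_t i;

    for (i = 0; i + 64 <= n; i += 64) {
        /* Hint a read of data we'll need soon, with low temporal locality.
           Prefetching a little past the end is harmless; prefetch doesn't fault. */
        __builtin_prefetch(s + i + 4 * 64, 0, 0);
        memcpy(d + i, s + i, 64);                  /* copy one cache line */
    }
    memcpy(d + i, s + i, n - i);                   /* remaining tail bytes */
}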
There are far more efficient ways to move data. These days, the compiler will generate architecture-specific code for memcpy, optimized based on the memory alignment of the data and other factors. This allows better use of non-temporal cache instructions, XMM and other registers in the x86 world.
Hard-coding rep movsb prevents this use of intrinsics.
Therefore, for something like a memcpy, unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C level intrinsics), you are far better off allowing the compiler to figure it out for you.
As a general memcpy() guide:
a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: Compiler can use normal mov instructions and avoid the startup overheads.
b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.
c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb. Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.
d) For all other cases use something like this:
mov edx,0
.again:
pushad
.nextByte:
pushad
popad
mov al,[esi]
pushad
popad
mov [edi],al
pushad
popad
inc esi
pushad
popad
inc edi
pushad
popad
loop .nextByte
popad
inc edx
cmp edx,1000
jb .again
Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.

Cache bandwidth per tick for modern CPUs

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?
Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.
P.S. This question concerns the maximal rate of load/store instructions in assembly. There is a theoretical rate of loading (if every instruction issued per tick were the widest possible load), but in practice the processor can sustain only part of that, a practical limit on loading.
For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine read and write in single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can L2 and L3 read and write ports be used in single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latency tool or by lmbench's lat_mem_rd - both use a long linked-list walk to correctly measure modern out-of-order cores like the Intel Core i7:
L1 L2 L3, cycles; mem link
Core 2 3 15 -- 66 ns http://www.anandtech.com/show/2542/5
Core i7-xxx 4 11 39 40c+67ns http://www.anandtech.com/show/2542/5
Itanium 1 5-6 12-17 130-1000 (cycles)
Itanium2 2 6-10 20 35c+160ns http://www.7-cpu.com/cpu/Itanium2.html
AMD K8 12 40-70c +64ns http://www.anandtech.com/show/2139/3
Intel P4 2 19 43 200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3 20 180 (cycles) --//--
AthlonFX-51 3 13 125 (cycles) --//--
POWER4 4 12-20 ?? hundreds cycles --//--
Haswell 4 11-12 36 36c+57ns http://www.realworldtech.com/haswell-cpu/5/
And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.
Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?

Resources