What setup does REP do? - performance

Quoting Intel® 64 and IA-32 architectures optimization reference manual, §2.4.6 "REP String Enhancement":
The performance characteristics of using REP string can be attributed to two components:
startup overhead and data transfer throughput.
[...]
For REP string of larger granularity data transfer, as ECX value
increases, the startup overhead of REP String exhibit step-wise increase:
Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
(emphasis mine)
There is no further mention of such a startup cost. What is it? What does it do, and why does it always take more time?

Note that only rep movs and rep stos are fast. repe/ne cmps and scas on current CPUs only loop 1 element at a time. (https://agner.org/optimize/ has some perf numbers, like 2 cycles per RCX count for repe cmpsb). They still have some microcode startup overhead, though.
The rep movs microcode has several strategies to choose from. If the src and dest don't overlap closely, the microcoded loop can transfer data in larger 64-bit chunks. (This is the so-called "fast strings" feature introduced with P6 and occasionally re-tuned for later CPUs that support wider loads/stores). But if dest is only one byte from src, rep movs has to produce the exact same result you'd get from that many separate movs instructions.
So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.
According to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, conditional branches in microcode aren't subject to branch-prediction. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same rep movs with the same alignment and size.
He supervised the initial rep string implementation in P6, so he should know. :)
REP MOVS uses a cache protocol feature that is not available to
regular code. Basically like SSE streaming stores, but in a manner
that is compatible with normal memory ordering rules, etc. // The
"large overhead for choosing and setting up the right method" is
mainly due to the lack of microcode branch prediction. I have long
wished that I had implemented REP MOVS using a hardware state machine
rather than microcode, which could have completely eliminated the
overhead.
By the way, I have long said that one of the things that hardware can do
better/faster than software is complex multiway branches.
Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996,
which I supervised. The P6 fast strings took REP MOVSB and larger, and
implemented them with 64 bit microcode loads and stores and a no-RFO
cache protocol. They did not violate memory ordering, unlike ERMSB in
IvB.
The big weakness of doing fast strings in microcode was (a) microcode
branch mispredictions, and (b) the microcode fell out of tune with
every generation, getting slower and slower until somebody got around
to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.
In retrospect, I should have written a self-tuning infrastructure, to
get reasonably good microcode on every generation. But that would not
have helped use new, wider, loads and stores, when they became
available. // The Linux kernel seems to have such an autotuning
infrastructure, that is run on boot. // Overall, however, I advocate
hardware state machines that can smoothly transition between modes,
without incurring branch mispredictions. // It is debatable whether
good microcode branch prediction would obviate this.
Based on this, my best guess at a specific answer is: the fast-path through the microcode (as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.
Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. Fortunately, that's all we need to make good choices. Intel's manual, and http://agner.org/optimize/, have good info on how to use rep movs.
Fun fact: without ERMSB (new in IvB): rep movsb is optimized for small-ish copies. It takes longer to start up than rep movsd or rep movsq for large (more than a couple hundred bytes, I think) copies, and even after that may not achieve the same throughput.
The optimal sequence for large aligned copies without ERMSB and without SSE/AVX (e.g. in kernel code) may be rep movsq and then clean-up with something like an unaligned mov that copies the last 8 bytes of the buffer, possibly overlapping with the last aligned chunk of what rep movsq did. (basically use glibc's small-copy memcpy strategy). But if the size might be smaller than 8 bytes, you need to branch unless it's safe to copy more bytes than needed. Or rep movsb is an option for cleanup if small code-size matters more than performance. (rep will copy 0 bytes if RCX = 0).
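As a minimal sketch of that strategy in C with GNU inline asm (my code, assuming n >= 8 and non-overlapping buffers; none of this is from a particular library):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Bulk copy with rep movsq, then an unaligned 8-byte copy of the buffer's last
// 8 bytes, which may overlap the last qword rep movsq already wrote (harmless).
static inline void copy_repmovsq_tail(void *dst, const void *src, size_t n)
{
    void *d = dst;
    const void *s = src;
    size_t qwords = n / 8;
    asm volatile ("rep movsq"
                  : "+D" (d), "+S" (s), "+c" (qwords)
                  : : "memory");
    uint64_t tail;
    memcpy(&tail, (const char *)src + n - 8, 8);   // one unaligned 8-byte load
    memcpy((char *)dst + n - 8, &tail, 8);         // one unaligned 8-byte store
}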
A SIMD vector loop is often at least slightly faster than rep movsb even on CPUs with Enhanced Rep Move/Stos B. Especially if alignment isn't guaranteed. (Enhanced REP MOVSB for memcpy, and see also Intel's optimization manual. Links in the x86 tag wiki)
Further details: I think there's some discussion somewhere on SO about testing how rep movsb affects out-of-order exec of surrounding instructions, how soon uops from later instructions can get into the pipeline. I think we found some info in an Intel patent that shed some light on the mechanism.
Microcode can use a kind of predicated load and store uop that lets it issue a bunch of uops without initially knowing the value of RCX. If it turns out RCX was a small value, some of those uops choose not to do anything.
I've done some testing of rep movsb on Skylake. It seems consistent with that initial-burst mechanism: below a certain threshold of size like 96 bytes or something, IIRC performance was nearly constant for any size. (With small aligned buffers hot in L1d cache). I had rep movs in a loop with an independent imul dependency chain, testing that it can overlap execution.
But then there was a significant dropoff beyond that size, presumably when the microcode sequencer finds out that it needs to emit more copy uops. So I think when the rep movsb microcoded-uop reaches the front of the IDQ, it gets the microcode sequencer to emit enough load + store uops for some fixed size, and a check to see if that was sufficient or if more are needed.
This is all from memory, I didn't re-test while updating this answer. If this doesn't match reality for anyone else, let me know and I'll check again.
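For reference, a rough C re-creation of that kind of experiment might look like the following; the copy size, iteration count and imul-chain length are my guesses, and rdtscp gives reference cycles rather than core clocks, so treat the result as relative only:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static char srcbuf[256], dstbuf[256];

int main(void)
{
    uint64_t x = 1, mult = 3;
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < 10000000; i++) {
        size_t n = 64;               // below the suspected "initial burst" threshold
        void *d = dstbuf;
        const void *s = srcbuf;
        asm volatile ("rep movsb"
                      : "+D" (d), "+S" (s), "+c" (n)
                      : : "memory");
        // Independent 5-deep imul dependency chain; if rep movsb overlaps with
        // out-of-order execution of surrounding code, total time per iteration
        // should be close to the max of the two costs, not their sum.
        asm volatile ("imul %1, %0\n\timul %1, %0\n\timul %1, %0\n\t"
                      "imul %1, %0\n\timul %1, %0"
                      : "+r" (x) : "r" (mult));
    }
    uint64_t t1 = __rdtscp(&aux);
    printf("~%.1f ref cycles/iter (x=%llu)\n",
           (t1 - t0) / 10000000.0, (unsigned long long)x);
    return 0;
}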

The quote that you have given only applies to the Nehalem microarchitecture (Intel Core i5, i7 and Xeon processors released in 2009 and 2010), and Intel is explicit about it.
Before Nehalem, REP MOVSB was even slower. Intel is silent on what happened in subsequent microarchitectures, but then, with the Ivy Bridge microarchitecture (processors released in 2012 and 2013), Intel introduced Enhanced REP MOVSB (we still need to check the corresponding CPUID bit), which allows us to copy memory fast.
The cheapest versions of later processors - the Kaby Lake "Celeron" and "Pentium" released in 2017 - don't have AVX, which could have been used for fast memory copy, but they still have Enhanced REP MOVSB. That's why REP MOVSB is very beneficial on processors released since 2013.
Surprisingly, Nehalem processors had a quite fast REP MOVSD/MOVSQ implementation (but not REP MOVSW/MOVSB) for very large blocks - just 4 cycles to copy each subsequent 64 bytes of data (if the data is aligned to cache line boundaries) after we've paid a startup cost of 40 cycles - which is excellent when we copy 256 bytes or more, and you don't need to use XMM registers!
Thus, on Nehalem microarchitecture, REP MOVSB/MOVSW is almost useless, but REP MOVSD/MOVSQ is excellent when we need to copy more than 256 bytes of data and the data is aligned to cache line boundaries.
On previous Intel microarchitectures (before 2008) the startup costs are even higher. Intel x86 processors have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a non-RFO (Read For Ownership) cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge.
The Ice Lake microarchitecture, launched in September 2019, introduced Fast Short REP MOV (FSRM). This feature can be tested by a CPUID bit. It was intended for strings of 128 bytes and less to also be quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; under 32-bit, strings have to be at least 4KB for rep movsb to start outperforming other methods.
Here are tests of REP MOVS* where the source and destination were in the L1 cache, using blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/
Yonah (2006-2008)
REP MOVSB 10.91 B/c
REP MOVSW 10.85 B/c
REP MOVSD 11.05 B/c
Nehalem (2009-2010)
REP MOVSB 25.32 B/c
REP MOVSW 19.72 B/c
REP MOVSD 27.56 B/c
REP MOVSQ 27.54 B/c
Westmere (2010-2011)
REP MOVSB 21.14 B/c
REP MOVSW 19.11 B/c
REP MOVSD 24.27 B/c
Ivy Bridge (2012-2013) - with Enhanced REP MOVSB
REP MOVSB 28.72 B/c
REP MOVSW 19.40 B/c
REP MOVSD 27.96 B/c
REP MOVSQ 27.89 B/c
SkyLake (2015-2016) - with Enhanced REP MOVSB
REP MOVSB 57.59 B/c
REP MOVSW 58.20 B/c
REP MOVSD 58.10 B/c
REP MOVSQ 57.59 B/c
Kaby Lake (2016-2017) - with Enhanced REP MOVSB
REP MOVSB 58.00 B/c
REP MOVSW 57.69 B/c
REP MOVSD 58.00 B/c
REP MOVSQ 57.89 B/c
As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another.
According to Intel, on Nehalem, the REP MOVSB startup cost for strings larger than 9 bytes is 50 cycles, while for REP MOVSW/MOVSD/MOVSQ it is 35 to 40 cycles - so REP MOVSB has a larger startup cost; tests have shown that the overall performance is worst for REP MOVSW, not REP MOVSB, on Nehalem and Westmere.
On Ivy Bridge, SkyLake and Kaby Lake, the results are the opposite for these instructions: REP MOVSB is faster than REP MOVSW/MOVSD/MOVSQ, albeit just slightly. On Ivy Bridge REP MOVSW is still a laggard, but on SkyLake and Kaby Lake REP MOVSW isn't worse than REP MOVSD/MOVSQ.
Please note that I have presented test results for both SkyLake and Kaby Lake, taken from the instlatx64 site just for the sake of confirmation - these architectures show the same bytes-per-cycle figures.
Conclusion: you may use MOVSD/MOVSQ for very large memory blocks, since it produces sufficient results on all Intel microarchitectures from Yonah to Kaby Lake. Although on Yonah and earlier an SSE copy may produce better results than REP MOVSD, for the sake of universality REP MOVSD is preferred. Besides that, REP MOVS* may internally use different algorithms to work with the cache, which are not available to normal instructions.
As for REP MOVSB for very small strings (less than 9 bytes or less than 4 bytes) - I would not even recommend it. On Kaby Lake, a single MOVSB even without REP is 4 cycles; on Yonah it is 5 cycles. Depending on context, you can do better with just normal MOVs.
The startup cost does not increase with size, as you have written. It is the latency of the overall instruction to complete the whole sequence of bytes that increases - which is quite obvious: the more bytes you need to copy, the more cycles it takes, i.e. the overall latency, not just the startup cost. Intel did not disclose the startup cost for small strings; it only specified it for strings of 76 bytes and more, for Nehalem. For example, take this data about Nehalem:
The latency for MOVSB is 9 cycles if ECX < 4. So it takes exactly 9 cycles to copy any string of 1, 2 or 3 bytes. This is not that bad - for example if you need to copy a tail and you don't want to use overlapping stores. Just 9 cycles to determine the size (between 1 and 3) and actually copy the data - it is hard to achieve this with normal instructions and all that branching - and for a 3-byte copy, if you didn't copy previous data, you will have to use 2 loads and 2 stores (word+byte), and since we have at most one store unit, we won't be much faster with normal MOV instructions.
Intel is silent on what latency REP MOVSB has if ECX is between 4 and 9.
Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles to copy the whole string - not just a startup cost of 20 cycles. So it takes about 20 cycles to copy the whole string (ECX <= 12), and thus we get a higher throughput per byte than with REP MOVSB with ECX < 4.
ECX >= 76 with REP MOVSD/MOVSQ - yes, here we DO have a startup cost of 40 cycles, but this is more than reasonable, since we later copy each 64 bytes of data in just 4 cycles (for example, a 1024-byte aligned copy would be roughly 40 + (1024/64)*4 = 104 cycles, i.e. about 10 bytes per cycle). I'm not an Intel engineer authorized to explain WHY there is a startup cost, but I suppose that it is because for these strings, REP MOVS* uses (according to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, quoted in Peter Cordes' answer) a cache protocol feature that is not available to regular code. As the quote explains: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that the Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol - it did not violate memory ordering, unlike ERMSB in Ivy Bridge.

This patent shows that the decoder is able to determine whether the last move to rcx was immediate or whether it was modified in a manner such that the value in rcx is unknown at the decoder. It does this by setting a bit upon decoding an immediate mov to rcx and also calls this a 'fast string bit' and stores the immediate value in a register. The bit is cleared when it decodes an instruction that modifies rcx in an unknown manner. If the bit is set then it jumps to a position in a separate microcode routine which might be a size of 12 repetitions -- it jumps to repetition 7 if rcx = 5 i.e. the immediate value in the register it keeps is 5. This is a fast implementation that doesn't contain microbranches. If it is not set, in line with the SGX paper which talks about a 'microcode assist' for larger arrays, then it may emit a uop that traps to the slow looping microcode routine at retire, when the value of rcx is known, although this is more of a 'trap' uop that always traps rather than a uop that may result in an 'assist' being required. Alternatively, as the patent suggests ('otherwise, the instruction translator 206 transfers control to the looping REP MOVS microinstruction sequence') the MSROM could instead execute the slow routine inline and immediately, and it just continues issuing repetitions and looping until the branch mispredicts and is finally corrected to not taken and the microcode ends.
I would assume that the micro-branch in the main body of the regular (looping) MSROM procedure would be statically predicted taken by the uop itself (in the opcode), since this is a loop that's going to execute multiple times and mispredict once. This fast method would therefore only eliminate the branch misprediction at the end of the sequence as well as the micro-branch instruction per iteration, which reduces the number of uops. The main bulk of misprediction happens in the setup Peter mentions, which appears to be the setup of P6 'fast strings' (apparently unrelated to the term 'fast string' in the patent, which came after P6), or indeed ERMSB, which I think only happens in the slow (looping) routine mentioned by the patent. In the slow routine, if ecx >= 76, then it can be enhanced and goes through an initial setup process, but seemingly ecx needs to be above a certain size for it to actually be faster with the overhead of the startup process of 'fast strings' or ERMSB. This would entail the value of ecx being known, which is likely just a regular ecx comparison and jump that may mispredict. Apparently this slow routine enhancement also uses a different cache protocol, as discussed.
The microbranch misprediction is costly because it has to flush the whole pipeline, refetch the rep movs instruction and then resume decoding at the mispredicted micro-ip, returning to the MSROM procedure after it may have already finished decoding and other uops were being decoded behind it. The BOB can likely be used with microbranch mispredictions too, where it would be more beneficial than with a macrobranch misprediction. The RAT snapshot is likely associated with the ROB entry of every branch instruction.

Just from the description it sounds to me that there is an optimal transfer size of 16 bytes, so if you are transferring 79 bytes that is 4*16 + 15. So, not knowing more about alignment, that could mean that there is a cost for the 15 bytes either up front or at the end (or split), and the four 16-byte transfers are faster than the fractions of 16. Kind of like high gear in your car vs shifting up through the gears to high gear.
Look at an optimized memcpy in glibc or gcc or other places. They transfer up to a few individual bytes, then maybe do 16-bit transfers until they get to an optimally aligned address (32-bit, 64-bit, or 128-bit aligned), then they can do multi-word transfers for the bulk of the copy, then they downshift: maybe one 32-bit transfer, maybe one 16-bit, maybe 1 byte, to cover the lack of alignment on the back end.
It sounds like rep does the same kind of thing: inefficient single transfers to get to an optimized alignment size, then large transfers until near the end, then maybe some small individual transfers to cover the last fraction.

Related

Why are the x86 bit-string manipulation instructions slow with a memory destination? (BTS, BTR, BTC)

Agner finds that the x86 bit manipulation instructions (btr bts btc, no lock) applied to a memory operand are slower than other read-modify-write instructions (like add, xor, etc.) on most processors where they are supported. Why is this? The instructions seem quite straightforward to implement.
Is it because the address actually loaded from is not the same as that specified by the memory operand, and this confuses some frontend mechanism for tracking memory accesses? This seems plausible, but I wouldn't expect it to affect throughput (at least, not by so much); only latency.
Is it because the address actually loaded from is not the same as that specified by the memory operand
Yes, pretty clearly that's the thing that separates it from a memory-destination shift.
The reg-reg version is 1 uop with 1 cycle latency on Intel, running on execution ports 0 or 6 on Intel Haswell and later for example, same as shifts. (Decoding an index to a 1-hot mask is cheaper than a general shifter, but since there are shift units presumably Intel just uses those.)
AMD for some reason runs bts reg,reg as 2 uops, slower than simple shifts. IDK why, maybe something about the FLAGS setting.
bts mem, imm8 is also pretty normal, 3 front-end uops on Intel. xor mem, imm8 is only 2 front-end uops, but that's because it can micro-fuse the load+xor. not mem is 3 front-end uops, only micro-fusing the store-address and store-data uops.
and this confuses some frontend mechanism for tracking memory accesses?
No. The front-end doesn't track memory accesses, that's the back end.
It's partly slow because it's implemented as multiple uops; that hurts even when you do one surrounded by different instructions. On Intel Haswell and Alder Lake (and probably all in between), it's 10 front-end uops for bts mem, r32, vs. 3 for bts mem, imm8
Since it can't use the usual address-generation hardware directly, it's implemented in microcode as multiple uops, presumably something like LEA into a temporary from the normal addressing mode, and adding (bit_index>>6) * 4 to that to index by dwords or something like that. Oh, maybe the reason it's 10 uops is that it always wants to access the aligned dword containing the bit, not just a multiple-of-4 offset from the address in the [] addressing mode for something like [rax + rdx*4 + 123].
Doing it manually is more efficient for the normal case where you know the start of the bitstring is aligned, so you can shr the bit-index to get a dword index for load / bts reg,reg (1 uop) / store. That takes fewer uops
than bts [mem], reg. Note that bts reg,reg truncates / wraps the bit-index, so if you arrange things correctly that modulo comes for free. For example a Sieve of Eratosthenes. Also How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
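As a concrete illustration of that load / bts reg,reg / store approach (a minimal sketch, assuming the bitmap base is 32-bit aligned and the index is in range):

#include <stddef.h>
#include <stdint.h>

// Equivalent of load / bts reg,reg / store: the shift picks the dword, and the
// bit index is taken modulo 32 "for free" by the mask, just like bts reg,reg
// wrapping its bit-index.
static inline void set_bit(uint32_t *bitmap, size_t i)
{
    bitmap[i >> 5] |= (uint32_t)1 << (i & 31);
}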
But Agner Fog and https://uops.info/ both measure a throughput of 5 cycles on Haswell / Alder Lake P-cores, significantly worse than the front-end bottleneck (or any per-port back-end bottleneck) would account for.
I don't know what accounts for that. The actual load and store uops should just be normal, with inputs coming from internal temporary registers but still a normal load uop and store uop as far as the addresses in the store buffer and load buffer are concerned. (Together, Intel calls that a Memory order buffer = MOB.)
I don't expect it to be a special case of memory-dependency prediction, since that happens when a load uop executes (and there are previous store-address uops not yet executed, so the addresses of some previous stores are still unknown).
TODO: run some experiments to see what if any other instructions mixed in with bts mem,reg will slow it down, competing for whatever resource it bottlenecks on.
It doesn't look like a benchmarking error on the part of https://uops.info/ (e.g. using the same address every time and stalling on store-forwarding latency). Their testing included some unrolled sequences using different offsets. e.g. Haswell throughput testing for bts m64, r64 measured 6.02 or 6.0 cycle throughput with the same address every time (bts qword ptr [r14], r8), or an average of 5.0 cycles per BTS when unrolling a repeated sequence like bts [r14],r8 ; bts [r14+0x8],r8 ; ... ; bts [r14+0x38],r8. Even for a sequence of 16 independent instructions covering two adjacent cache lines, it was still the same 5 cycles per iteration.

MOVSD performance depends on arguments

I just noticed that pieces of my code exhibit different performance when copying memory. A test showed that memory copy performance degrades if the address of the destination buffer is greater than the address of the source. Sounds ridiculous, but the following code shows the difference (Delphi):
const
  MEM_CHUNK = 50 * 1024 * 1024;
  ROUNDS_COUNT = 100;

LpSrc := VirtualAlloc(0, MEM_CHUNK, MEM_COMMIT, PAGE_READWRITE);
LpDest := VirtualAlloc(0, MEM_CHUNK, MEM_COMMIT, PAGE_READWRITE);

QueryPerformanceCounter(LTick1);
for i := 0 to ROUNDS_COUNT - 1 do
  CopyMemory(LpDest, LpSrc, MEM_CHUNK);
QueryPerformanceCounter(LTick2);
// show timings

QueryPerformanceCounter(LTick1);
for i := 0 to ROUNDS_COUNT - 1 do
  CopyMemory(LpSrc, LpDest, MEM_CHUNK);
QueryPerformanceCounter(LTick2);
// show timings
Here CopyMemory is based on MOVSD. The results:
Starting Memory Bandwidth Test...
LpSrc 0x06FC0000
LpDest 0x0A1C0000
src->dest Transfer: 5242880000 bytes in 1,188 sec #4,110 GB/s.
dest->src Transfer: 5242880000 bytes in 0,805 sec #6,066 GB/s.
src->dest Transfer: 5242880000 bytes in 1,142 sec #4,275 GB/s.
dest->src Transfer: 5242880000 bytes in 0,832 sec #5,871 GB/s.
Tried on two systems, the results are consistent no matter how many times repeated.
Never saw anything like that. Was unable to google it. Is this a known behavior? Is this just another cache-related peculiarity?
Update:
Here are the final results with page-aligned buffers and forward direction of MOVSD (DF=0):
Starting Memory Bandwidth Test...
LpSrc 0x06F70000
LpDest 0x0A170000
src->dest Transfer: 5242880000 bytes in 0,781 sec #6,250 GB/s.
dest->src Transfer: 5242880000 bytes in 0,731 sec #6,676 GB/s.
src->dest Transfer: 5242880000 bytes in 0,750 sec #6,510 GB/s.
dest->src Transfer: 5242880000 bytes in 0,735 sec #6,640 GB/s.
src->dest Transfer: 5242880000 bytes in 0,742 sec #6,585 GB/s.
dest->src Transfer: 5242880000 bytes in 0,750 sec #6,515 GB/s.
... and so on.
Here the transfer rates are constant.
Normally fast-strings or ERMSB microcode makes rep movsb/w/d/q and rep stosb/w/d/q fast for large counts (copying in 16, 32, or maybe even 64-byte chunks). And possibly with an RFO-avoiding protocol for the stores. (Other repe/repne scas/cmps are always slow).
Some conditions of the inputs can interfere with that best-case, notably having DF=1 (backward) instead of the normal DF=0.
rep movsd performance can depend on alignment of src and dst, including their relative misalignment. Apparently having both pointers = 32*n + same is not too bad, so most of the copy can be done after reaching an alignment boundary. (Absolute misalignment, but the pointers are aligned relative to each other. i.e. dst-src is a multiple of 32 or 64 bytes).
Performance does not depend on src > dst or src < dst per se. If the pointers are within 16 or 32 bytes of overlapping, that can also force a fall-back to 1 element at a time.
Intel's optimization manual has a section about memcpy implementations and comparing rep movs with well-optimized SIMD loops. Startup overhead is one of the biggest downsides for rep movs, but so are misalignments that it doesn't handle well. (IceLake's "fast short rep" feature presumably addresses that.)
I did not disclose the CopyMemory body - and it indeed copies backwards (DF=1) to avoid overlaps.
Yup, there's your problem. Only copy backwards if there would be actual overlap you need to avoid, not just based on which address is higher. And then do it with SIMD vectors, not rep movsd.
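A minimal sketch of that rule (my own code, not the asker's CopyMemory): only fall back to a backward copy when the destination actually starts inside the source region.

#include <stddef.h>
#include <stdint.h>

void copy_choose_direction(char *dst, const char *src, size_t n)
{
    // A forward copy is unsafe only when src < dst < src + n; a mere dst > src
    // with no overlap is no reason to set DF=1.
    if ((uintptr_t)dst - (uintptr_t)src < n) {
        for (size_t i = n; i-- > 0; )      // true overlap: copy backwards
            dst[i] = src[i];               // (ideally with SIMD, not rep movsd)
    } else {
        for (size_t i = 0; i < n; i++)     // forward: the direction rep movs
            dst[i] = src[i];               // and ERMSB are optimized for
    }
}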
rep movsd is only fast with DF=0 (ascending addresses), at least on Intel CPUs. I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes from page-aligned buffers with rep movsb runs in:
174M cycles with cld (DF=0 forwards). About 42 ms at about 4.1 GHz, or about 90 GiB/s of L1d read+write bandwidth achieved. That's about 23 bytes per cycle, so the startup overhead of each rep movsb seems to be hurting us. An AVX copy loop should achieve close to 32 bytes per cycle in this easy case of pure L1d cache hits, even with a branch mispredict on loop exit from an inner loop.
4161M cycles with std (DF=1 backwards). about 1010ms at about 4.1GHz, or about 3.77GiB/s read+write. About 0.98 bytes / cycle, consistent with rep movsb being totally un-optimized. (1 count per cycle, so rep movsd would be about 4x that bandwidth with cache hits.)
uops_executed perf counter also confirms that it's spending many more uops when copying backwards. (This was inside a dec ebp / jnz loop in long mode under Linux. The same test loop as Can x86's MOV really be "free"? Why can't I reproduce this at all? built with NASM, with the buffers in the BSS. The loop did cld or std / 2x lea / mov ecx, 4096 / rep movsb. Hoisting cld out of the loop didn't make much difference.)
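For anyone who wants to try something similar without setting up the NASM loop, here is a rough C re-creation (my own sketch, not the original test; rdtscp counts reference cycles, so compare the two numbers to each other rather than to the cycle counts above):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

#define SIZE 4096
#define REPS 1000000L

static uint64_t time_rep_movsb(int backwards, char *dst, const char *src)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < REPS; i++) {
        size_t n = SIZE;
        // With DF=1, RSI/RDI must point at the last byte to be copied.
        char *d = backwards ? dst + SIZE - 1 : dst;
        const char *s = backwards ? src + SIZE - 1 : src;
        if (backwards)
            asm volatile ("std\n\trep movsb\n\tcld"
                          : "+D" (d), "+S" (s), "+c" (n) : : "memory");
        else
            asm volatile ("rep movsb"
                          : "+D" (d), "+S" (s), "+c" (n) : : "memory");
    }
    uint64_t t1 = __rdtscp(&aux);
    return (t1 - t0) / REPS;
}

int main(void)
{
    char *src = aligned_alloc(4096, SIZE), *dst = aligned_alloc(4096, SIZE);
    memset(src, 1, SIZE);
    memset(dst, 0, SIZE);
    printf("forward  (DF=0): ~%llu ref cycles per 4096-byte copy\n",
           (unsigned long long)time_rep_movsb(0, dst, src));
    printf("backward (DF=1): ~%llu ref cycles per 4096-byte copy\n",
           (unsigned long long)time_rep_movsb(1, dst, src));
    free(src); free(dst);
    return 0;
}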
You were using rep movsd which copies 4 bytes at a time, so for backwards copying we can expect 4 bytes / cycle if they hit in cache. And you were probably using large buffers so cache misses bottleneck the forward direction to not much faster than backwards. But the extra uops from backward copy would hurt memory parallelism: fewer cache lines are touched by the load uops that fit in the out-of-order window. Also, some prefetchers work less well going backwards, in Intel CPUs. The L2 streamer works in either direction, but I think L1d prefetch only goes forward.
Related: Enhanced REP MOVSB for memcpy Your Sandybridge is too old for ERMSB, but Fast Strings for rep movs/rep stos has existed since original P6. Your Clovertown Xeon from ~2006 is pretty much ancient by today's standards. (Conroe/Merom microarchitecture). Those CPUs might be so old that a single core of a Xeon can saturate the meagre memory bandwidth, unlike today's many-core Xeons.
My buffers were page-aligned. For downward, I tried having the initial RSI/RDI point to the last byte of a page so the initial pointers were not aligned but the total region to be copied was. I also tried lea rdi, [buf+4096] so the starting pointers were page-aligned, so [buf+0] didn't get written. Neither made backwards copy any faster; rep movs is just garbage with DF=1; use SIMD vectors if you need to copy backwards.
Usually a SIMD vector loop can be at least as fast as rep movs, if you can use vectors as wide as the machine supports. That means having SSE, AVX, and AVX512 versions... In portable code without runtime dispatching to a memcpy implementation tuned for the specific CPU, rep movsd is often pretty good, and should be even better on future CPUs like IceLake.
You don't actually need page alignment for rep movs to be fast. IIRC, 32-byte aligned source and destination is sufficient. But also 4k aliasing could be a problem: if dst & 4095 is slightly higher than src & 4095, the load uops might internally have to wait some extra cycles for the store uops because the fast-path mechanism for detecting when a load is reloading a recent store only looks at page-offset bits.
Page alignment is one way to make sure you get the optimal case for rep movs, though.
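If you want to sanity-check a pair of pointers against the two conditions above, a sketch like this works; the 64-byte window in the aliasing check is my assumption, since the exact store-forwarding distance isn't specified here:

#include <stdbool.h>
#include <stdint.h>

// Relative (not absolute) alignment: dst - src is a multiple of 64 bytes.
static bool relatively_aligned_64(const void *src, const void *dst)
{
    return (((uintptr_t)dst - (uintptr_t)src) & 63) == 0;
}

// Possible 4k aliasing: dst's page offset is slightly above src's, so loads from
// src in a forward copy can false-positive against recent stores to dst.
static bool may_4k_alias(const void *src, const void *dst)
{
    uintptr_t s = (uintptr_t)src & 4095, d = (uintptr_t)dst & 4095;
    return d > s && d - s < 64;      // "slightly higher" threshold: an assumption
}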
Normally you get best performance from a SIMD loop, but only if you use SIMD vectors as wide as the machine supports (like AVX, or maybe even AVX512). And you should choose NT stores vs. normal depending on the hardware and the surrounding code.

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance critical part of my code and came across the sequence of instructions (in AT&T syntax):
add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx
I thought I finally had a use case for xchg which would allow me to shave an instruction and write:
add %rbx, %rax
xchg %rax, %rdx
However, to my dismay I found from Agner Fog's instruction tables that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.
Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:
mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx
where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.
Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.
Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.
Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).
xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
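In C terms, the Fibonacci unrolling trick mentioned above looks like this trivial sketch (the asm in question is just the two adds in the loop body):

#include <stdint.h>

// Two Fibonacci steps per iteration: a and b simply trade roles, so no register
// swap (xchg) is ever needed.  Assumes n is even just to keep the sketch short.
static uint64_t fib(unsigned n)
{
    uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; i += 2) {
        a += b;      // add rax, rdx
        b += a;      // add rdx, rax
    }
    return a;
}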
Since xchg is still slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years. Unlike fxch or mov-elimination, a design-change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.
Integer registers are complicated by partial-register stuff, unlike x87
There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.
Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.
AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).
See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)
I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.
xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.
If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)
Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)
We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.
Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; for example, mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64 which does produce a flag result.
Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.
Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).
Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.
The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352: Floating point register alias table FXCH and retirement floating point register array, which describes Intel's P6 uarch.)
One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.
(Thanks Andy Glew (@krazyglew); I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.
Interesting tidbit: the patent describes integer registers as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certainly uses one of these as a temporary.)
We might be able to get some insight from looking at what AMD does.
Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).
Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.
Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).
Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).
Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.
What Intel might be able to do cheaply if they wanted to:
It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.
Stuff like xchg ah,al is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.
xchg latency in practice on current hardware
Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though.
In fact, one direction has 2c latency and the other direction has ~1c latency.
00000000004000e0 <_start.loop>:
4000e0: 48 87 d1 xchg rcx,rdx # slow version
4000e3: 48 83 c1 01 add rcx,0x1
4000e7: 48 83 c1 01 add rcx,0x1
4000eb: 48 87 ca xchg rdx,rcx
4000ee: 48 83 c2 01 add rdx,0x1
4000f2: 48 83 c2 01 add rdx,0x1
4000f6: ff cd dec ebp
4000f8: 7f e6 jg 4000e0 <_start.loop>
This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23 cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.
If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
Other side-topics from comments and the question
The 3 micro-ops throws off my 4-1-1-1 cadence
Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.
Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)
Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.
I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.
That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.
Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.
Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.
As discussed in comments, a single uop produces at most 1 register result
As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?
Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.
However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.
I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).
Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)
Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.
ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.
The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE
static inline void *__movsb(void *d, const void *s, size_t n) {
asm volatile ("rep movsb"
: "=D" (d),
"=S" (s),
"=c" (n)
: "0" (d),
"1" (s),
"2" (n)
: "memory");
return d;
}
When I use this however, the bandwidth is much less than with memcpy.
__movsb gets 15 GB/s and memcpy gets 26 GB/s on my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4@2400 MHz dual channel 32 GB, GCC 6.2.
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
Here is the code I used to test this.
//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>
static inline void *__movsb(void *d, const void *s, size_t n) {
asm volatile ("rep movsb"
: "=D" (d),
"=S" (s),
"=c" (n)
: "0" (d),
"1" (s),
"2" (n)
: "memory");
return d;
}
int main(void) {
int n = 1<<30;
//char *a = malloc(n), *b = malloc(n);
char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
memset(a,2,n), memset(b,1,n);
__movsb(b,a,n);
printf("%d\n", memcmp(b,a,n));
double dtime;
dtime = -omp_get_wtime();
for(int i=0; i<10; i++) __movsb(b,a,n);
dtime += omp_get_wtime();
printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
dtime = -omp_get_wtime();
for(int i=0; i<10; i++) memcpy(b,a,n);
dtime += omp_get_wtime();
printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
}
The reason I am interested in rep movsb is based off these comments
Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not...
rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)
What's missing/sub-optimal in this memcpy implementation?
Here are my results on the same system from tinymembench.
C copy backwards : 7910.6 MB/s (1.4%)
C copy backwards (32 byte blocks) : 7696.6 MB/s (0.9%)
C copy backwards (64 byte blocks) : 7679.5 MB/s (0.7%)
C copy : 8811.0 MB/s (1.2%)
C copy prefetched (32 bytes step) : 9328.4 MB/s (0.5%)
C copy prefetched (64 bytes step) : 9355.1 MB/s (0.6%)
C 2-pass copy : 6474.3 MB/s (1.3%)
C 2-pass copy prefetched (32 bytes step) : 7072.9 MB/s (1.2%)
C 2-pass copy prefetched (64 bytes step) : 7065.2 MB/s (0.8%)
C fill : 14426.0 MB/s (1.5%)
C fill (shuffle within 16 byte blocks) : 14198.0 MB/s (1.1%)
C fill (shuffle within 32 byte blocks) : 14422.0 MB/s (1.7%)
C fill (shuffle within 64 byte blocks) : 14178.3 MB/s (1.0%)
---
standard memcpy : 12784.4 MB/s (1.9%)
standard memset : 30630.3 MB/s (1.1%)
---
MOVSB copy : 8712.0 MB/s (2.0%)
MOVSD copy : 8712.7 MB/s (1.9%)
SSE2 copy : 8952.2 MB/s (0.7%)
SSE2 nontemporal copy : 12538.2 MB/s (0.8%)
SSE2 copy prefetched (32 bytes step) : 9553.6 MB/s (0.8%)
SSE2 copy prefetched (64 bytes step) : 9458.5 MB/s (0.5%)
SSE2 nontemporal copy prefetched (32 bytes step) : 13103.2 MB/s (0.7%)
SSE2 nontemporal copy prefetched (64 bytes step) : 13179.1 MB/s (0.9%)
SSE2 2-pass copy : 7250.6 MB/s (0.7%)
SSE2 2-pass copy prefetched (32 bytes step) : 7437.8 MB/s (0.6%)
SSE2 2-pass copy prefetched (64 bytes step) : 7498.2 MB/s (0.9%)
SSE2 2-pass nontemporal copy : 3776.6 MB/s (1.4%)
SSE2 fill : 14701.3 MB/s (1.6%)
SSE2 nontemporal fill : 34188.3 MB/s (0.8%)
Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.
In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.
When I do
sudo cpufreq-set -r -g performance
I sometimes see over 20 GB/s with rep movsb.
with
sudo cpufreq-set -r -g powersave
the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.
I checked the frequency (using turbostat) with and without SpeedStep enabled, with performance and with powersave for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).
SpeedStep idle 1 core 4 core
powersave OFF 0.8 2.6 2.6
performance OFF 2.6 2.6 2.6
powersave ON 0.8 3.5 3.1
performance ON 3.5 3.5 3.1
This shows that with powersave even with SpeedStep disabled the CPU
still clocks down to the idle frequency of 0.8 GHz. It's only with performance without SpeedStep that the CPU runs at a constant frequency.
I used e.g sudo cpufreq-set -r performance (because cpufreq-set was giving strange results) to change the power settings. This turns turbo back on so I had to disable turbo after.
This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.
Partly, this is a call to share results - if you can run Tinymembench and share the results along with details of your CPU and RAM configuration it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.
History and Official Advice
The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them into line with or even faster than competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.
For example, in guides preceding the introduction of Ivy Bridge, the typical advice is to avoid them or use them very carefully1.
The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:
The specific variant of the implementation is chosen at execution time
based on data layout, alignment and the counter (ECX) value. For
example, MOVSB/STOSB with the REP prefix should be used with counter
value less than or equal to three for best performance.
So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with 2 explicit byte, word movs if you know the size is exactly three).
They go on to say:
String MOVE/STORE instructions have multiple data granularities. For
efficient data movement, larger data granularities are preferable.
This means better efficiency can be achieved by decomposing an
arbitrary counter value into a number of double words plus single byte
moves with a count value less than or equal to 3.
This certainly seems wrong on current hardware with ERMSB where rep movsb is at least as fast, or faster, than the movd or movq variants for large copies.
In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.
They then go on to cover ERMSB explicitly in section 3.7.6.
I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.
Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.
Technical Considerations
This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.
Advantages for rep movs
When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
Avoiding the RFO request when it knows the entire cache line will be overwritten.
Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering4.
In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes5 and rep movs could use them internally.
Disadvantages
rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.
The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.
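As a hedged illustration of that last point, here is the kind of copy where the compiler has all of the information up front (the type and function names are invented for the example). With the size and alignment known at compile time, compilers typically expand the memcpy inline into a handful of plain or vector moves, with none of the runtime size/alignment/overlap checks that rep movs performs:
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte, 32-byte-aligned message type. */
typedef struct {
    _Alignas(32) uint8_t payload[64];
} msg_t;

void copy_msg(msg_t *dst, const msg_t *src)
{
    /* Size (64) and alignment (32) are compile-time facts here, so the
       compiler can emit, e.g., two 32-byte vector loads/stores inline. */
    memcpy(dst, src, sizeof *dst);
}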
Test Results
Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):
C copy backwards : 8284.8 MB/s (0.3%)
C copy backwards (32 byte blocks) : 8273.9 MB/s (0.4%)
C copy backwards (64 byte blocks) : 8321.9 MB/s (0.8%)
C copy : 8863.1 MB/s (0.3%)
C copy prefetched (32 bytes step) : 8900.8 MB/s (0.3%)
C copy prefetched (64 bytes step) : 8817.5 MB/s (0.5%)
C 2-pass copy : 6492.3 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 6516.0 MB/s (2.4%)
C 2-pass copy prefetched (64 bytes step) : 6520.5 MB/s (1.2%)
---
standard memcpy : 12169.8 MB/s (3.4%)
standard memset : 23479.9 MB/s (4.2%)
---
MOVSB copy : 10197.7 MB/s (1.6%)
MOVSD copy : 10177.6 MB/s (1.6%)
SSE2 copy : 8973.3 MB/s (2.5%)
SSE2 nontemporal copy : 12924.0 MB/s (1.7%)
SSE2 copy prefetched (32 bytes step) : 9014.2 MB/s (2.7%)
SSE2 copy prefetched (64 bytes step) : 8964.5 MB/s (2.3%)
SSE2 nontemporal copy prefetched (32 bytes step) : 11777.2 MB/s (5.6%)
SSE2 nontemporal copy prefetched (64 bytes step) : 11826.8 MB/s (3.2%)
SSE2 2-pass copy : 7529.5 MB/s (1.8%)
SSE2 2-pass copy prefetched (32 bytes step) : 7122.5 MB/s (1.0%)
SSE2 2-pass copy prefetched (64 bytes step) : 7214.9 MB/s (1.4%)
SSE2 2-pass nontemporal copy : 4987.0 MB/s
Some key takeaways:
The rep movs methods are faster than all the other methods which aren't "non-temporal"7, and considerably faster than the "C" approaches which copy 8 bytes at a time.
The "non-temporal" methods are faster, by up to about 26% than the rep movs ones - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE load/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 26 GB/s for stores).
There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy) but it probably doesn't matter due to the above note.
The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single-thread, which changes the behavior dramatically.
rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
Haswell
Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):
C copy : 6777.8 MB/s (0.4%)
standard memcpy : 10487.3 MB/s (0.5%)
MOVSB copy : 9393.9 MB/s (0.2%)
MOVSD copy : 9155.0 MB/s (1.6%)
SSE2 copy : 6780.5 MB/s (0.4%)
SSE2 nontemporal copy : 10688.2 MB/s (0.3%)
The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.
When should you use rep movs?
Finally, a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.
Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.
Restrictions on available instructions
In some environments there is a restriction on certain instructions or using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.
A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...
Future Proofing
A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.
So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).
How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and a very difficult to make judgement of how fast these instructions are likely to be in the future. At least Intel is kind of guiding uses in this direction though, by committing to at least reasonable performance in the future (15.3.3.6):
REP MOVSB and REP STOSB will continue to perform reasonably well on
future processors.
Overlapping with subsequent work
Because an entire rep movsb retires as a single instruction while many of its stores are still in flight, independent code after the copy can start making progress through the out-of-order machinery earlier than it could after an equivalent explicit copy loop. This benefit won't show up in a plain memcpy benchmark, of course, which by definition has no subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.
This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:
When the count is known to be at least a thousand byte or more, using
enhanced REP MOVSB/STOSB can provide another advantage to amortize the
cost of the non-consuming code. The heuristic can be understood
using a value of Cnt = 4096 and memset() as example:
• A 256-bit SIMD implementation of memset() will need to issue/execute/retire 128 instances of 32-byte store operation with VMOVDQA, before the non-consuming instruction sequences can make their way to retirement.
• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a
long micro-op flow provided by hardware, but retires as one
instruction. There are many store_data operation that must complete
before the result of memset() can be consumed. Because the completion
of store data operation is de-coupled from program-order retirement, a
substantial part of the non-consuming code stream can process through
the issue/execute and retirement, essentially cost-free if the
non-consuming sequence does not compete for store buffer resources.
So Intel is saying that once rep movsb has issued all of its uops, but while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.
The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.
There doesn't seem to be much detailed information about how very long microcoded instructions like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?
When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet.
This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).
Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).
As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.
Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kinds of copies or surrounding code benefit most (hot or cold destination or source; high-ILP or low-ILP, high-latency code after).
Code Size
The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.
Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.
Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.
Architecture Specific Optimization
You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).
That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.
Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:
the big weakness of doing fast strings in microcode was [...] the
microcode fell out of tune with every generation, getting slower and
slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the
missed opportunities was to use 128-bit loads and stores when they
became available, and so on.
In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy, but for statically compiled binaries it requires platform-specific dispatch; that often already exists (sometimes implemented at link time), or the -mtune argument can be used to make a static decision.
Simplicity
Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).
Latency Bound Platforms
Memory throughput bound algorithms9 can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.
The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretic bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.
You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).
The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.
In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4-channel system). Only some recent-generation 2-channel "client" cores, which seem to have a better uncore and perhaps more line fill buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be among them.
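As a rough sanity check on those per-core numbers, here is a minimal back-of-the-envelope sketch (essentially Little's law), using assumed values of 10 line fill buffers and ~50 ns effective memory latency:
#include <stdio.h>

int main(void)
{
    /* Assumed values, for illustration only. */
    const double fill_buffers = 10.0;   /* outstanding 64-byte lines */
    const double line_bytes   = 64.0;
    const double latency_ns   = 50.0;   /* effective memory latency  */

    /* bytes per nanosecond is numerically GB/s (decimal) */
    double gbps = fill_buffers * line_bytes / latency_ns;
    printf("concurrency-limited bandwidth ~= %.1f GB/s per core\n", gbps);
    return 0;
}
With those assumptions the ceiling is about 12.8 GB/s per core; prefetch into L2 reduces the effective latency in that formula, which is presumably how cores reach the upper end of the 12 - 20 GB/s range mentioned above.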
Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8-core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.
Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt since they may increase latency: the handoff time for the line fill buffer may be longer than a scenario where prefetch brings the RFO line into LLC (or even L2) and the store then completes in LLC, for an effectively lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.
So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins there (if it gets the best of both worlds).
Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...
References
Other good sources of info not integrated in the above.
comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last read/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).
1 Presumably the intention is to restrict it to cases where, for example, code-size is very important.
2 See Section 3.7.5: REP Prefix and Data Movement.
3 It is key to note this applies only for the various stores within the single instruction itself: once complete, the block of stores still appear ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.
4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.
5 This was common in the latter part of the 32-bit era, when many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium and Celeron brands have AVX disabled, but presumably the rep movs microcode can still use 256b loads/stores.
6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.
7 I'm making the assumption that "standard" memcpy is choosing a non-temporal approach, which is highly likely for this size of buffer.
8 That isn't necessarily obvious, since it could be the case that the uop stream that is generated by the rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that however - uops from subsequent instructions can mingle with uops from the microcoded rep movsb.
9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (as opposed to purely latency bound loads such as pointer chasing).
Enhanced REP MOVSB (Ivy Bridge and later)
Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (ERMSB). We still need to check the corresponding CPUID bit to confirm the feature is actually present. ERMSB was intended to allow us to copy memory fast with rep movsb.
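As a minimal sketch of that check, assuming GCC or Clang (whose <cpuid.h> provides __get_cpuid_count), the ERMS bit and the FSRM bit discussed further below can be read like this:
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper header */

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Structured extended feature flags: CPUID leaf 7, subleaf 0.
       ERMS (Enhanced REP MOVSB/STOSB) = EBX bit 9
       FSRM (Fast Short REP MOV)       = EDX bit 4 */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 7 not supported");
        return 1;
    }
    printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
    printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
    return 0;
}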
The cheapest versions of later processors, the Kaby Lake Celeron and Pentium released in 2017, don't have AVX, which could otherwise have been used for fast memory copy, but they still have Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on SkyLake, copy about twice as many bytes per CPU cycle with REP MOVSB as previous generations of microarchitectures.
Enhanced REP MOVSB (ERMSB) before the Ice Lake microarchitecture with Fast Short REP MOV (FSRM) was only faster than an AVX copy or a general-purpose register copy if the block size is at least 256 bytes. For blocks below 64 bytes, it was much slower, because there is a high internal startup cost in ERMSB, about 35 cycles. The FSRM feature was intended to make blocks below 128 bytes also quick.
See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (applies to processors which did not yet have FSRM):
startup cost is 35 cycles;
both the source and destination addresses have to be aligned to a 16-Byte boundary;
the source region should not overlap with the destination region;
the length has to be a multiple of 64 to produce higher performance;
the direction has to be forward (CLD).
As I said earlier, REP MOVSB (on processors before FSRM) begins to outperform other methods when the length is at least 256 bytes, but to see the clear benefit over AVX copy, the length has to be more than 2048 bytes. Also, it should be noted that merely using AVX (256-bit registers) or AVX-512 (512-bit registers) for memory copy may sometimes have dire consequences like AVX/SSE transition penalties or reduced turbo frequency. So the REP MOVSB is a safer way to copy memory than AVX.
On the effect of alignment for REP MOVSB vs. AVX copy, the Intel Manual gives the following information:
if the source buffer is not aligned, the impact on ERMSB implementation versus 128-bit AVX is similar;
if the destination buffer is not aligned, the effect on ERMSB implementation can be 25% degradation, while 128-bit AVX implementation of memory copy may degrade only 5%, relative to 16-byte aligned scenario.
I ran tests on an Intel Core i5-6600, under 64-bit, comparing REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits in the L1 cache:
REP MOVSB memory copy
- 1622400000 data blocks of 32 bytes took 17.9337 seconds to copy; 2760.8205 MB/s
- 1622400000 data blocks of 64 bytes took 17.8364 seconds to copy; 5551.7463 MB/s
- 811200000 data blocks of 128 bytes took 10.8098 seconds to copy; 9160.5659 MB/s
- 405600000 data blocks of 256 bytes took 5.8616 seconds to copy; 16893.5527 MB/s
- 202800000 data blocks of 512 bytes took 3.9315 seconds to copy; 25187.2976 MB/s
- 101400000 data blocks of 1024 bytes took 2.1648 seconds to copy; 45743.4214 MB/s
- 50700000 data blocks of 2048 bytes took 1.5301 seconds to copy; 64717.0642 MB/s
- 25350000 data blocks of 4096 bytes took 1.3346 seconds to copy; 74198.4030 MB/s
- 12675000 data blocks of 8192 bytes took 1.1069 seconds to copy; 89456.2119 MB/s
- 6337500 data blocks of 16384 bytes took 1.1120 seconds to copy; 89053.2094 MB/s
MOV RAX... memory copy
- 1622400000 data blocks of 32 bytes took 7.3536 seconds to copy; 6733.0256 MB/s
- 1622400000 data blocks of 64 bytes took 10.7727 seconds to copy; 9192.1090 MB/s
- 811200000 data blocks of 128 bytes took 8.9408 seconds to copy; 11075.4480 MB/s
- 405600000 data blocks of 256 bytes took 8.4956 seconds to copy; 11655.8805 MB/s
- 202800000 data blocks of 512 bytes took 9.1032 seconds to copy; 10877.8248 MB/s
- 101400000 data blocks of 1024 bytes took 8.2539 seconds to copy; 11997.1185 MB/s
- 50700000 data blocks of 2048 bytes took 7.7909 seconds to copy; 12710.1252 MB/s
- 25350000 data blocks of 4096 bytes took 7.5992 seconds to copy; 13030.7062 MB/s
- 12675000 data blocks of 8192 bytes took 7.4679 seconds to copy; 13259.9384 MB/s
So, even on 128-byte blocks, REP MOVSB (on processors before FSRM) is slower than just a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
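For reference, a minimal C sketch of a qword-at-a-time loop in the spirit of the MOV RAX version above (the original benchmark code isn't shown here, so this is only an approximation; n is assumed to be a multiple of 8):
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy 8 bytes per iteration through a 64-bit temporary; a compiler
   typically lowers each 8-byte memcpy to a single 64-bit load or store.
   At higher optimization levels the whole loop may itself be recognized
   and rewritten, so treat this purely as an illustration. */
static void qword_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, s + i, sizeof tmp);
        memcpy(d + i, &tmp, sizeof tmp);
    }
}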
Fast Short REP MOV (FSRM)
The Ice Lake microarchitecture launched in September 2019 introduced Fast Short REP MOV (FSRM). This feature can be tested by a CPUID bit (the FSRM bit in the sketch above). It was intended to make strings of 128 bytes and less also quick but, in fact, strings below 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; under 32-bit, strings have to be at least 4KB for rep movsb to start outperforming other methods.
Normal (not enhanced) REP MOVS on Nehalem (2009-2013)
Surprisingly, earlier architectures (Nehalem and later, up to, but not including, Ivy Bridge), which didn't yet have Enhanced REP MOVSB, had a relatively fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementation for large blocks, as long as the blocks were not large enough to exceed the L1 cache.
The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information related to the Nehalem microarchitecture - Intel Core i5, i7 and Xeon processors released in 2009 and 2010 - and later microarchitectures, including Sandy Bridge, manufactured up to 2013.
REP MOVSB
The latency for MOVSB is 9 cycles if ECX < 4. Otherwise, REP MOVSB with ECX > 9 has a 50-cycle startup cost.
tiny string (ECX < 4): the latency of REP MOVSB is 9 cycles;
small string (ECX is between 4 and 9): no official information in the Intel manual, probably more than 9 cycles but less than 50 cycles;
long string (ECX > 9): 50-cycle startup cost.
MOVSW/MOVSD/MOVSQ
Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):
Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string latency will vary if one of the 16-byte data transfer spans across cache line boundary:
= Split-free: the latency consists of a startup cost of about 40 cycles, and every 64 bytes of data adds 4 cycles.
= Cache splits: the latency consists of a startup cost of about 35 cycles, and every 64 bytes of data adds 6 cycles.
Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
Therefore, according to Intel, for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ. Anyway, my tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.
According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.
Conclusion: if you just need to copy data that fits in the L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!
REP MOVSD/MOVSQ is the universal solution that works excellently on all Intel processors (no ERMSB required) if the data fits in the L1 cache.
Here are tests of REP MOVS* when the source and destination were in the L1 cache, using blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/
Yonah (2006-2008)
REP MOVSB 10.91 B/c
REP MOVSW 10.85 B/c
REP MOVSD 11.05 B/c
Nehalem (2009-2010)
REP MOVSB 25.32 B/c
REP MOVSW 19.72 B/c
REP MOVSD 27.56 B/c
REP MOVSQ 27.54 B/c
Westmere (2010-2011)
REP MOVSB 21.14 B/c
REP MOVSW 19.11 B/c
REP MOVSD 24.27 B/c
Ivy Bridge (2012-2013) - with Enhanced REP MOVSB (all subsequent CPUs also have Enhanced REP MOVSB)
REP MOVSB 28.72 B/c
REP MOVSW 19.40 B/c
REP MOVSD 27.96 B/c
REP MOVSQ 27.89 B/c
SkyLake (2015-2016)
REP MOVSB 57.59 B/c
REP MOVSW 58.20 B/c
REP MOVSD 58.10 B/c
REP MOVSQ 57.59 B/c
Kaby Lake (2016-2017)
REP MOVSB 58.00 B/c
REP MOVSW 57.69 B/c
REP MOVSD 58.00 B/c
REP MOVSQ 57.89 B/c
I have presented test results for both SkyLake and Kaby Lake just for the sake of confirmation - these architectures have the same cycle-per-instruction data.
Cannon Lake, mobile (May 2018 - February 2020)
REP MOVSB 107.44 B/c
REP MOVSW 106.74 B/c
REP MOVSD 107.08 B/c
REP MOVSQ 107.08 B/c
Cascade lake, server (April 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.51 B/c
REP MOVSD 58.51 B/c
REP MOVSQ 58.20 B/c
Comet Lake, desktop, workstation, mobile (August 2019)
REP MOVSB 58.72 B/c
REP MOVSW 58.62 B/c
REP MOVSD 58.72 B/c
REP MOVSQ 58.72 B/c
Ice Lake, mobile (September 2019)
REP MOVSB 102.40 B/c
REP MOVSW 101.14 B/c
REP MOVSD 101.14 B/c
REP MOVSQ 101.14 B/c
Tremont, low power (September, 2020)
REP MOVSB 119.84 B/c
REP MOVSW 121.78 B/c
REP MOVSD 121.78 B/c
REP MOVSQ 121.78 B/c
Tiger Lake, mobile (October, 2020)
REP MOVSB 93.27 B/c
REP MOVSW 93.09 B/c
REP MOVSD 93.09 B/c
REP MOVSQ 93.09 B/c
As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is fastest, albeit just slightly faster than REP MOVSD/MOVSQ, but no doubt that on all processors since Nehalem, REP MOVSD/MOVSQ works very well - you don't even need "Enhanced REP MOVSB", since, on Ivy Bridge (2013) with Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock data as on Nehalem (2010) without Enhanced REP MOVSB, while in fact REP MOVSB became very fast only since SkyLake (2015) - twice as fast as on Ivy Bridge. So this Enhanced REP MOVSB bit in the CPUID may be confusing - it only shows that REP MOVSB per se is OK, but not that any REP MOVS* is faster.
The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. According to Andy Glew's comments on Peter Cordes' answer to "Why are complicated memcpy/memset superior?", on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO), but this protocol is no longer used on Ivy Bridge, which has ERMSB. There also comes an explanation of why the startup costs are so high for REP MOVS*: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that the Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol; they did not violate memory ordering, unlike ERMSB in Ivy Bridge.
As for rep movsb vs rep movsq: on some processors with ERMSB rep movsb is slightly faster (e.g., Xeon E3-1246 v3), on others rep movsq is faster (Skylake), and on others it is the same speed (e.g. i7-1065G7). However, I would go for rep movsq rather than rep movsb anyway.
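As a minimal sketch (not the author's code) of a rep movsq-based copy with a byte tail, using GCC-style inline assembly in the same spirit as the rep movsb snippet later in this document; the forward direction (DF = 0) is assumed, as the ABI guarantees on function entry:
#include <stddef.h>

static void rep_movsq_copy(void *dst, const void *src, size_t n)
{
    size_t qwords = n >> 3;   /* bulk: 8 bytes at a time */
    size_t tail   = n & 7;    /* 0..7 remaining bytes    */

    /* RDI/RSI advance as the copy proceeds, so the second statement
       continues exactly where the first one stopped. */
    __asm__ __volatile__ ("rep movsq"
                          : "+D" (dst), "+S" (src), "+c" (qwords)
                          : : "memory");
    __asm__ __volatile__ ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (tail)
                          : : "memory");
}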
Please also note that this answer is only relevant for the cases where the source and the destination data fit in the L1 cache. Depending on circumstances, the particularities of memory access (cache, etc.) should be taken into consideration.
Please also note that the information in this answer is only related to Intel processors and not to the processors by other manufacturers like AMD that may have better or worse implementations of REP MOVS* instructions.
Tinymembench results
Here are some of the tinymembench results to show relative performance of the rep movsb and rep movsd.
Intel Xeon E5-1650V3
Haswell microarchitecture, ERMS, AVX-2, released in September 2014 for $583, base frequency 3.5 GHz, max turbo frequency 3.8 GHz (one core), L2 cache 6 × 256 KB, L3 cache 15 MB, supports up to 4×DDR4-2133, 8 modules of 32768 MB DDR4 ECC reg installed (256 GB total RAM).
C copy backwards : 7268.8 MB/s (1.5%)
C copy backwards (32 byte blocks) : 7264.3 MB/s
C copy backwards (64 byte blocks) : 7271.2 MB/s
C copy : 7147.2 MB/s
C copy prefetched (32 bytes step) : 7044.6 MB/s
C copy prefetched (64 bytes step) : 7032.5 MB/s
C 2-pass copy : 6055.3 MB/s
C 2-pass copy prefetched (32 bytes step) : 6350.6 MB/s
C 2-pass copy prefetched (64 bytes step) : 6336.4 MB/s
C fill : 11072.2 MB/s
C fill (shuffle within 16 byte blocks) : 11071.3 MB/s
C fill (shuffle within 32 byte blocks) : 11070.8 MB/s
C fill (shuffle within 64 byte blocks) : 11072.0 MB/s
---
standard memcpy : 11608.9 MB/s
standard memset : 15789.7 MB/s
---
MOVSB copy : 8123.9 MB/s
MOVSD copy : 8100.9 MB/s (0.3%)
SSE2 copy : 7213.2 MB/s
SSE2 nontemporal copy : 11985.5 MB/s
SSE2 copy prefetched (32 bytes step) : 7055.8 MB/s
SSE2 copy prefetched (64 bytes step) : 7044.3 MB/s
SSE2 nontemporal copy prefetched (32 bytes step) : 11794.4 MB/s
SSE2 nontemporal copy prefetched (64 bytes step) : 11813.1 MB/s
SSE2 2-pass copy : 6394.3 MB/s
SSE2 2-pass copy prefetched (32 bytes step) : 6255.9 MB/s
SSE2 2-pass copy prefetched (64 bytes step) : 6234.0 MB/s
SSE2 2-pass nontemporal copy : 4279.5 MB/s
SSE2 fill : 10745.0 MB/s
SSE2 nontemporal fill : 22014.4 MB/s
Intel Xeon E3-1246 v3
Haswell, ERMS, AVX-2, 3.50GHz
C copy backwards : 6911.8 MB/s
C copy backwards (32 byte blocks) : 6919.0 MB/s
C copy backwards (64 byte blocks) : 6924.6 MB/s
C copy : 6934.3 MB/s (0.2%)
C copy prefetched (32 bytes step) : 6860.1 MB/s
C copy prefetched (64 bytes step) : 6875.6 MB/s (0.1%)
C 2-pass copy : 6471.2 MB/s
C 2-pass copy prefetched (32 bytes step) : 6710.3 MB/s
C 2-pass copy prefetched (64 bytes step) : 6745.5 MB/s (0.3%)
C fill : 10812.1 MB/s (0.2%)
C fill (shuffle within 16 byte blocks) : 10807.7 MB/s
C fill (shuffle within 32 byte blocks) : 10806.6 MB/s
C fill (shuffle within 64 byte blocks) : 10809.7 MB/s
---
standard memcpy : 10922.0 MB/s
standard memset : 28935.1 MB/s
---
MOVSB copy : 9656.7 MB/s
MOVSD copy : 9430.1 MB/s
SSE2 copy : 6939.1 MB/s
SSE2 nontemporal copy : 10820.6 MB/s
SSE2 copy prefetched (32 bytes step) : 6857.4 MB/s
SSE2 copy prefetched (64 bytes step) : 6854.9 MB/s
SSE2 nontemporal copy prefetched (32 bytes step) : 10774.2 MB/s
SSE2 nontemporal copy prefetched (64 bytes step) : 10782.1 MB/s
SSE2 2-pass copy : 6683.0 MB/s
SSE2 2-pass copy prefetched (32 bytes step) : 6687.6 MB/s
SSE2 2-pass copy prefetched (64 bytes step) : 6685.8 MB/s
SSE2 2-pass nontemporal copy : 5234.9 MB/s
SSE2 fill : 10622.2 MB/s
SSE2 nontemporal fill : 22515.2 MB/s (0.1%)
Intel Xeon Skylake-SP
Skylake, ERMS, AVX-512, 2.1 GHz (Xeon Gold 6152 at base frequency, no turbo)
MOVSB copy : 4619.3 MB/s (0.6%)
SSE2 fill : 9774.4 MB/s (1.5%)
SSE2 nontemporal fill : 6715.7 MB/s (1.1%)
Intel Xeon E3-1275V6
Kaby Lake, released in March 2017 for $339, base frequency 3.8 GHz, max turbo frequency 4.2 GHz, L2 cache 4 × 256 KB, L3 cache 8 MB, 4 cores (8 threads), 4 RAM modules of 16384 MB DDR4 ECC installed, but it can use only 2 memory channels.
MOVSB copy : 11720.8 MB/s
SSE2 fill : 15877.6 MB/s (2.7%)
SSE2 nontemporal fill : 36407.1 MB/s
Intel i7-1065G7
Ice Lake, AVX-512, ERMS, FSRM, 1.37 GHz (worked at the base frequency, turbo mode disabled)
MOVSB copy : 7322.7 MB/s
SSE2 fill : 9681.7 MB/s
SSE2 nontemporal fill : 16426.2 MB/s
AMD EPYC 7401P
Released in June 2017 at US $1075, based on the Zen gen.1 microarchitecture, 24 cores (48 threads), base frequency: 2.0 GHz, max turbo boost: 3.0 GHz (few cores) or 2.8 GHz (all cores); cache: L1 - 64 KB inst. & 32 KB data per core, L2 - 512 KB per core, L3 - 64 MB, 8 MB per CCX, DDR4-2666 8 channels, but only 4 RAM modules of 32768 MB each of DDR4 ECC reg. installed.
MOVSB copy : 7718.0 MB/s
SSE2 fill : 11233.5 MB/s
SSE2 nontemporal fill : 34893.3 MB/s
AMD Ryzen 7 1700X (4 RAM modules installed)
MOVSB copy : 7444.7 MB/s
SSE2 fill : 11100.1 MB/s
SSE2 nontemporal fill : 31019.8 MB/s
AMD Ryzen 7 Pro 1700X (2 RAM modules installed)
MOVSB copy : 7251.6 MB/s
SSE2 fill : 10691.6 MB/s
SSE2 nontemporal fill : 31014.7 MB/s
AMD Ryzen 7 Pro 1700X (4 RAM modules installed)
MOVSB copy : 7429.1 MB/s
SSE2 fill : 10954.6 MB/s
SSE2 nontemporal fill : 30957.5 MB/s
Conclusion
REP MOVSD/MOVSQ is the universal solution that works relatively well on all Intel processors for large memory blocks of at least 4KB (no ERMSB required) if the destination is aligned by at least 64 bytes.
REP MOVSD/MOVSQ works even better on newer processors, starting from Skylake. And, for Ice Lake or newer microarchitectures, it works perfectly for even very small strings of at least 64 bytes.
You say that you want:
an answer that shows when ERMSB is useful
But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:
implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.
So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as bad as it has in some previous CPUs.
However just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.
Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the cpu's cache), sometimes the 'benefit' of AVX et al is going to be offset by the impact it has on the rest of your code. Depends on what you are doing.
You also ask:
Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?
It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.
If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.
Or you can just... Well, you asked us not to say it.
This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.
In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.
(I only wish there was a memrepeat(), the opposite of memcpy() compared to memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)
I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out if I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.
The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src, dst, n triplets, where all src to src+n-1 and dst to dst+n-1 are completely within the current area.
A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with original data in current, applying all the triplets in the current set, using memcpy() provided by the C library, and copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.
Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, median makes the most sense in benchmarking, and provides sensible semantics -- the function is at least that fast at least half the time.)
To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) -- note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).
Unfortunately, I have not yet found any set where
static void rep_movsb(void *dst, const void *src, size_t n)
{
__asm__ __volatile__ ( "rep movsb\n\t"
: "+D" (dst), "+S" (src), "+c" (n)
:
: "memory" );
}
would beat
static void normal_memcpy(void *dst, const void *src, size_t n)
{
memcpy(dst, src, n);
}
using gcc -Wall -O2 -march=ivybridge -mtune=ivybridge with GCC 5.4.0 on the aforementioned Core i5-6200U laptop running a linux-4.4.0 64-bit kernel. Copying 4096-byte aligned and sized chunks comes close, however.
This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.
(At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)
This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and about whether the number of bytes to copy is a compile-time constant or a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.
GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy()/memset()/memmove() have already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.
GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least n bytes. An application or a library can choose which implementation of a function to use at run time, by creating a "resolver function" (that returns a function pointer), and defining the function using the ifunc (resolver) attribute.
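As a hedged sketch of that ifunc mechanism on ELF/glibc targets (all names invented; the resolver keys off __builtin_cpu_supports("avx2") purely to show the shape, since I'm not aware of a corresponding builtin check for ERMS):
#include <stddef.h>
#include <string.h>

/* Two hypothetical copy implementations to choose between. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *d = dst;
    __asm__ __volatile__ ("rep movsb" : "+D" (d), "+S" (src), "+c" (n) : : "memory");
    return dst;
}

static void *copy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* The resolver runs once at load time and returns the implementation to bind. */
static void *(*resolve_my_copy(void)) (void *, const void *, size_t)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? copy_rep_movsb : copy_generic;
}

/* my_copy is dispatched once, at load time, via the resolver above. */
void *my_copy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_copy")));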
One of the most common patterns I use in my code for this is
some_type *pointer = __builtin_assume_aligned(ptr, alignment);
where ptr is some pointer, alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.
Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
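A minimal sketch of the __builtin_prefetch pattern for a plain linear copy; the prefetch distance and the assumption that n is a multiple of 64 are arbitrary choices for illustration, not tuned values:
#include <stddef.h>
#include <string.h>

#define PF_DIST 256  /* prefetch this many bytes ahead (made-up value) */

static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i += 64) {             /* one cache line per step */
        /* Hint a read with low temporal locality; prefetching slightly past
           the end of the source is harmless since this is only a hint. */
        __builtin_prefetch(s + i + PF_DIST, 0, 0);
        memcpy(d + i, s + i, 64);
    }
}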
There are far more efficient ways to move data. These days, the implementation of memcpy generates architecture-specific code from the compiler, optimized based on the memory alignment of the data and other factors. This allows better use of non-temporal cache instructions, and of XMM and other registers, in the x86 world.
When you hard-code rep movsb, you prevent this use of intrinsics.
Therefore, for something like a memcpy, unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C level intrinsics), you are far better off allowing the compiler to figure it out for you.
As a general memcpy() guide:
a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: Compiler can use normal mov instructions and avoid the startup overheads.
b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.
c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb. Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.
d) For all other cases use something like this:
mov edx,0
.again:
pushad
.nextByte:
pushad
popad
mov al,[esi]
pushad
popad
mov [edi],al
pushad
popad
inc esi
pushad
popad
inc edi
pushad
popad
loop .nextByte
popad
inc edx
cmp edx,1000
jb .again
Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.

Where is the L1 memory cache of Intel x86 processors documented?

I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure. In particular, most web sites (including Intel.com) that post processor specs do not include any reference to L1 cache. Is this because the L1 cache does not exist or is this information for some reason considered unimportant? Are there any articles or discussions about the elimination of the L1 cache?
[edit]
After running various tests and diagnostic programs (mostly those discussed in the answers below), I have concluded that my Q9300 seems to have a 32K L1 data cache. I still haven't found a clear explanation as to why this information is so difficult to come by. My current working theory is that the details of L1 caching are now being treated as trade secrets by Intel.
It is near impossible to find specs on Intel caches. When I was teaching a class on caches last year, I asked friends inside Intel (in the compiler group) and they couldn't find specs.
But wait!!! Jed, bless his soul, tells us that on Linux systems, you can squeeze lots of information out of the kernel:
grep . /sys/devices/system/cpu/cpu0/cache/index*/*
This will give you associativity, set size, and a bunch of other information (but not latency).
For example, I learned that although AMD advertises their 128K L1 cache, my AMD machine has a split I and D cache of 64K each.
Two suggestions which are now mostly obsolete thanks to Jed:
AMD publishes a lot more information about its caches, so you can at least get some information about a modern cache. For example, last year's AMD L1 caches delivered two words per cycle (peak).
The open-source tool valgrind has all sorts of cache models inside it, and it is invaluable for profiling and understanding cache behavior. It comes with a very nice visualization tool kcachegrind which is part of the KDE SDK.
For example: in Q3 2008, AMD K8/K10 CPUs use 64 byte cache lines, with a 64kB each L1I/L1D split cache. L1D is 2-way associative and exclusive with L2, with latency of 3 cycles. L2 cache is 16-way associative and latency is about 12 cycles.
AMD Bulldozer-family CPUs use a split L1 with a 16kiB 4-way associative L1D per cluster (2 per core).
Intel CPUs have kept L1 the same for a long time (from Pentium M to Haswell to Skylake, and presumably many generations after that): Split 32kB each I and D caches, with L1D being 8-way associative. 64 byte cache lines, matching the burst-transfer size of DDR DRAM. Load-use latency is ~4 cycles.
Also see the x86 tag wiki for links to more performance and microarchitectural data.
This Intel Manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual has a decent discussion of cache considerations.
Page 46, Section 2.2.5.1 Intel® 64 and IA-32 Architectures Optimization Reference Manual
Even MicroSlop is waking up to the need for more tools to monitor cache usage and performance, and has a GetLogicalProcessorInformation() function and example (...while blazing new trails in creating ridiculously long function names in the process) that I think I'll code up.
UPDATE I: Haswell increases cache load performance 2X, from Inside the Tock; Haswell's Architecture
If there were any doubt how critical it is to make the best possible use of cache, this presentation by Cliff Click, formerly of Azul, should dispel any and all doubt. In his words, "memory is the new disk!".
UPDATE II: SkyLake's significantly improved cache performance specifications.
You are looking at the consumer specifications, not the developer specifications. Here is the documentation you want. The cache sizes vary by processor family sub-models, so they typically are not in the IA-32 development manuals, but you can easily look them up on NewEgg and such.
Edit: More specifically: Chapter 10 of Volume 3A (Systems Programming Guide), Chapter 7 of the Optimization Reference Manual, and potentially something in the TLB page-caching manual, although I would assume that one is further out from the L1 than you care about.
I did some more investigating. There is a group at ETH Zurich who built a memory-performance evaluation tool which might be able to get information about the size at least (and maybe also associativity) of L1 and L2 caches. The program works by trying different read patterns experimentally and measuring the resulting throughput. A simplified version was used for the popular textbook by Bryant and O'Hallaron.
L1 caches exist on these platforms. This will almost definitely remain true until memory and front side bus speeds exceed the speed of the CPU, which is very likely a long way off.
On Windows, you can use the GetLogicalProcessorInformation to get some level of cache information (size, line size, associativity, etc.) The Ex version on Win7 will give even more data, like which cores share which cache. CpuZ also gives this information.
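A hedged C sketch of querying the cache descriptors that way on Windows (minimal error handling; field names are from the documented CACHE_DESCRIPTOR structure):
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* First call fails with ERROR_INSUFFICIENT_BUFFER and reports the size needed. */
    GetLogicalProcessorInformation(NULL, &len);

    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len)) {
        free(info);
        return 1;
    }

    for (DWORD i = 0; i < len / sizeof(*info); i++) {
        if (info[i].Relationship != RelationCache)
            continue;
        const CACHE_DESCRIPTOR *c = &info[i].Cache;
        printf("L%u %s cache: %lu KiB, %u-byte lines, %u-way\n",
               (unsigned)c->Level,
               c->Type == CacheData ? "data" :
               c->Type == CacheInstruction ? "instruction" : "unified/other",
               (unsigned long)(c->Size / 1024),
               (unsigned)c->LineSize,
               (unsigned)c->Associativity);
    }
    free(info);
    return 0;
}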
Locality of Reference has a major impact on the performance of some algorithms; the size and speed of the L1, L2 (and, on newer CPUs, L3) caches obviously play a large part in this. Matrix multiplication is one such algorithm.
Intel Manual Vol. 2 specifies the following formula to compute cache size:
This Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
Where the Ways, Partitions, Line_Size and Sets are queried using cpuid with eax set to 0x04.
Providing the header file declaration
x86_cache_size.h:
unsigned int get_cache_line_size(unsigned int cache_level);
The implementation looks as follows:
;1st argument (EDI) - the cache index used as the CPUID leaf 4 subleaf
;(an enumeration index, not strictly the cache level)
;despite the declared name, the function returns the total cache size in bytes
get_cache_line_size:
push rbx
;select the cache index (CPUID subleaf goes in ECX)
mov ecx, edi
;CPUID leaf 4: deterministic cache parameters
mov eax, 0x04
cpuid
;line size = EBX[11:0] + 1
mov eax, ebx
and eax, 0xfff
inc eax
;partitions = EBX[21:12] + 1
shr ebx, 12
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;ways of associativity = EBX[31:22] + 1
shr ebx, 10
mov edx, ebx
and edx, 0x3ff
inc edx
mul edx
;number of sets = ECX + 1 (CPUID leaf 4 returns sets - 1 in ECX)
inc ecx
mul ecx
pop rbx
ret
Which on my machine works as follows:
#include "x86_cache_size.h"
int main(void){
unsigned int L1_cache_size = get_cache_line_size(1);
unsigned int L2_cache_size = get_cache_line_size(2);
unsigned int L3_cache_size = get_cache_line_size(3);
//L1 size = 32768, L2 size = 262144, L3 size = 8388608
printf("L1 size = %u, L2 size = %u, L3 size = %u\n", L1_cache_size, L2_cache_size, L3_cache_size);
}

Resources