Packing two DWORDs into a QWORD to save store bandwidth

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:
top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rsi, rdx and rdi somehow
cmp ...
jne top
On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since it achieves an IPC of only 2 (one store, one load).
One idea that naturally arises is to combine two DWORD loads into a single QWORD store which is possible since the stores are contiguous. Something like this could work:
top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or rax, rbx
mov QWORD [rdi], rax
; increment pointers and loop, as before
Basically, do the two loads and use two ALU ops to combine them into a single QWORD, which we can write with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORDs, and at an issue width of 4 fused-domain uops per clock that's 1.25 cycles per QWORD, or 0.625 cycles per DWORD.
Already much better than the first option, but I can't help thinking there is a better option for this shuffling. For example, we are wasting uop throughput by using plain loads: it feels like we should be able to fold at least some of the ALU ops into the loads via memory source operands, but I was mostly stymied on Intel: shl on memory only has an RMW form, and shlx and rorx don't micro-fuse.
It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing garbage out of the low DWORD.
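A minimal sketch of that idea (hypothetical, not from the original post); the cleanup still costs ALU uops, which is exactly what we were trying to avoid:
mov  eax, DWORD [rsi]     ; zero-extends into RAX
mov  rbx, QWORD [rdx-4]   ; the desired dword lands in bits 63:32
shr  rbx, 32              ;
shl  rbx, 32              ; two uops just to clear the garbage low dword
or   rax, rbx
mov  QWORD [rdi], rax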
I'm interested in scalar code, and code for both the base x86-64 instruction set and better versions if possible with useful extensions like BMI.

Regarding the idea of getting the shift for free by making the second load a QWORD load offset by -4: if wider loads are OK for correctness and performance (cache-line splits...), we can use shld:
top:
mov  eax, DWORD [rsi]
mov  rbx, QWORD [rdx-4]   ; unaligned(?) 64-bit load: the [rdx] dword lands in bits 63:32
shld rax, rbx, 32         ; 1 uop on Intel SnB-family, 0.5c recip throughput
mov  QWORD [rdi], rax     ; note: this stores the [rdx] dword in the low half;
                          ; swap the two loads' pointers to match the original store order
MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).
top:
movd mm0, DWORD [rsi]
punpckldq mm0, QWORD [rdx] ; 1 micro-fused uop on Intel SnB-family
movq QWORD [rdi], mm0
; required after the loop, making it only worth-while for long-running loops
emms
punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.
You could also use the 128-bit AVX version, but that's even more likely to cross a cache-line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.
To avoid CL-splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason for MMX.
top:
movd xmm0, DWORD [rsi]
movd xmm1, DWORD [rdx] ; SSE2
punpckldq xmm0, xmm1
; or pinsrd xmm0, DWORD [rdx], 1 ; 2 uops not micro-fused
movq QWORD [rdi], xmm0
Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
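For example, a minimal sketch (hypothetical, not from the answer) of a gather-based body, assuming the eight source locations can be expressed as dword indices (in ymm1) from a single base register (here rbp):
vpcmpeqd   ymm2, ymm2, ymm2             ; all-ones mask: gather all 8 elements
vpgatherdd ymm0, [rbp + ymm1*4], ymm2   ; 8 dword loads; the mask in ymm2 is
                                        ; clobbered, so it must be re-created
                                        ; every iteration
vmovdqu    [rdi], ymm0                  ; one 32-byte store for 8 DWORDs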

Related

How can I know which registers WinAPI functions use for arguments?

I'm writing a function in x86 assembly that should be callable from C code, and I'm wondering which registers I have to restore before I return to the caller.
Currently I'm only restoring esp and ebp, while the return value is in eax.
Are there any other registers I should be concerned about, or could I leave whatever pleases me in them?
Using Microsoft's 32 bit ABI (cdecl or stdcall or other calling conventions), EAX, EDX and ECX are scratch registers (call clobbered). The other general-purpose integer registers are call-preserved.
The condition codes in EFLAGS are call-clobbered. DF=0 is required on call/return so you can use rep movsb without a cld first. The x87 stack must be empty on call, or on return from a function that doesn't return an FP value. (FP return values go in st0, with the x87 stack empty other than that.) XMM6 and 7 are call-preserved, the rest are call-clobbered scratch registers.
Outside of Windows, most 32-bit calling conventions (including i386 System V on Linux) agree with this choice of EAX, EDX and ECX as call-clobbered, but all the xmm registers are call-clobbered.
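As a concrete illustration, here's a minimal (hypothetical) 32-bit cdecl function in NASM syntax showing the save/restore pattern:
global double_first_arg
double_first_arg:               ; int double_first_arg(int x)
    push ebx                    ; EBX is call-preserved, so save it before use
    mov  ebx, [esp + 8]         ; first arg: above the return addr and saved EBX
    lea  eax, [ebx + ebx]       ; return value goes in EAX (call-clobbered)
    pop  ebx                    ; restore EBX before returning
    ret                         ; cdecl: the caller pops the arguments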
For x64 under Windows, you only need to restore RBX, RBP, RDI, RSI, R12, R13, R14, and R15; XMM6..XMM15 are call-preserved. (And you have to reserve 32 bytes of shadow space for use by the callee, whether or not there are any args that don't fit in registers.)
See https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention for more details.
Other OSes use the x86-64 System V ABI (see figure 3.4), where the call-preserved integer registers are RBP, RBX, RSP, R12, R13, R14, and R15. All the XMM/YMM/ZMM registers are call-clobbered.
EFLAGS and the x87 stack are the same as in 32-bit conventions: DF=0, condition flags are clobbered, and x87 stack is empty. (x86-64 conventions return FP values in XMM0, so the x87 stack registers always need to be empty on call/return.)
For links to official calling convention docs, see https://stackoverflow.com/tags/x86/info
32-bit: EBX, ESI, EDI, EBP
64-bit Windows: RBX, RSI, RDI, RBP, R12-R15, XMM6-XMM15
64-bit Linux, BSD, Mac: RBX, RBP, R12-R15
For details see "Software optimization resources" by Agner Fog; calling conventions are described in his calling_conventions.pdf.
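For example, a minimal (hypothetical) x86-64 System V function sketch showing the save/restore pattern, using RBX just to illustrate it:
global sum_array
sum_array:                      ; long sum_array(const long *p, size_t n)
    push rbx                    ; RBX is call-preserved: save it before use
    xor  eax, eax               ; return value accumulates in RAX
    xor  ebx, ebx               ; index in RBX
.loop:
    cmp  rbx, rsi               ; args arrive in RDI, RSI (call-clobbered)
    jae  .done
    add  rax, [rdi + rbx*8]
    inc  rbx
    jmp  .loop
.done:
    pop  rbx                    ; restore RBX; RSP is back where it started
    ret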
If you are unsure about the registers' situation, these instructions could save the day easily (in 16- and 32-bit mode only; they are invalid in 64-bit mode).
PUSHA/PUSHAD -- Push all General Registers
POPA/POPAD -- Pop all General Registers
These instructions push and pop all of the general-purpose registers, including SI/ESI and DI/EDI, in a fixed order.
The order for PUSHA/PUSHAD instruction is as follows.
Opcode Instruction Clocks Description
60 PUSHA 18 Push AX, CX, DX, BX, original SP, BP, SI, and DI
60 PUSHAD 18 Push EAX, ECX, EDX, EBX, original ESP, EBP, ESI, and EDI
And the order for POPA/POPAD instruction is as follows. (in reverse order)
Opcode Instruction Clocks Description
61 POPA 24 Pop DI, SI, BP, SP, BX, DX, CX, and AX
61 POPAD 24 Pop EDI, ESI, EBP, ESP(***), EBX, EDX, ECX, and EAX
*** The ESP value is discarded instead of loaded into ESP.
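For instance, a minimal (hypothetical) 32-bit routine using them:
clobber_everything:
    pushad                  ; save all 8 GP registers (32 bytes on the stack)
    ; ... freely use EAX..EDI here ...
    popad                   ; restore them all; the saved ESP image is
                            ; discarded rather than loaded into ESP
    ret                     ; note POPAD also restored EAX, so a return
                            ; value would have to be set after it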

Small branches in modern CPUs

How do modern CPUs like Kaby Lake handle small branches (in the code below, the jump to label .LBB1_67)? From what I know, the branch should not be harmful, because the jump distance is smaller than the 16-byte fetch-block size, which is the size of the decoding window.
Or is it possible that due to some macro op fusion the branch will be completely elided?
sbb rdx, qword ptr [rbx - 8]
setb r8b
setl r9b
mov rdi, qword ptr [rbx]
mov rsi, qword ptr [rbx + 8]
vmovdqu xmm0, xmmword ptr [rbx + 16]
cmp cl, 18
je .LBB1_67
mov r9d, r8d
.LBB1_67: # in Loop: Header=BB1_63 Depth=1
vpcmpeqb xmm0, xmm0, xmmword ptr [rbx - 16]
vpmovmskb ecx, xmm0
cmp ecx, 65535
sete cl
cmp rdi, qword ptr [rbx - 32]
sbb rsi, qword ptr [rbx - 24]
setb dl
and dl, cl
or dl, r9b
There are no special cases for short branch distances in any x86 CPUs. Even an unconditional jmp to the next instruction (architecturally a nop) needs correct branch prediction to be handled efficiently; if you put enough of those in a row, you run out of BTB entries and performance falls off a cliff. (See: Slow jmp-instruction.)
Fetch/decode is only a minor problem; yes a very short branch within the same cache line will still hit in L1i and probably uop cache. But it's unlikely that the decoders would special-case a predicted-taken forward jump and make use of pre-decode instruction-boundary finding from one block that included both the branch and the target.
When the instruction is being decoded to uops and fed into the front-end, register values aren't available; those are only available in the out-of-order execution back-end.
The major problem is that when the instructions after .LBB1_67: execute, the architectural state is different depending on whether the branch was taken or not.
And so is the micro-architectural state (RAT = Register Allocation Table).
Either:
r9 depends on the sbb/setl result (mov r9d, r8d didn't run)
r9 depends on the sbb/setb result (mov r9d, r8d did run)
Conditional branches are called "control dependencies" in computer-architecture terminology. Branch-prediction + speculative execution avoids turning control dependencies into data dependencies. If the je was predicted not taken, the setl result (the old value of r9) is overwritten by mov and is no longer available anywhere.
There's no way to recover from this after detecting a misprediction in the je (actually should have been taken), especially in the general case. Current x86 CPUs don't try to look for the fall-through path rejoining the taken path or figuring out anything about what it does.
If cl wasn't ready for a long time, so a mispredict wasn't discovered for a long time, many instructions after the or dl, r9b could have executed using the wrong inputs. In the general case the only way to reliably + efficiently recover is to discard all work done on instructions from the "wrong" path. Detecting that vpcmpeqb xmm0, [rbx - 16] for example still runs either way is hard, and not looked for. (Modern Intel, since Sandybridge, has a Branch Order Buffer (BOB) that snapshots the RAT on branches, allowing efficient rollback to the branch miss as soon as execution detects it while still allowing out-of-order execution on earlier instructions to continue during the rollback. Before that a branch miss had to roll back to the retirement state.)
Some CPUs for some non-x86 ISAs (e.g. PowerPC, I think) have experimented with turning forward branches that skip exactly 1 instruction into predication (a data dependency) instead of speculating past them. For example, Dynamic Hammock Predication for Non-predicated Instruction Set Architectures discusses this idea, including deciding whether to predicate or not on a per-branch basis. If your branch-prediction history says this branch predicts poorly, predicating it instead could be good. (A hammock branch is one that jumps forward over one or a couple of instructions. Detecting the exactly-1-instruction case is trivial on an ISA with fixed-width instruction words, like a RISC, but hard on x86.)
In this case, x86 has the cmovcc instruction, an ALU select operation that produces one of two inputs depending on a flag condition. cmovne r9d, r8d instead of cmp/je would make this immune to branch mispredictions (note the inverted condition: the mov ran on the not-taken path), but at the cost of introducing a data dependency on cl and r8d for instructions that use r9d. Intel CPUs don't try to do this for you.
(On Broadwell and later Intel, cmov is only 1 uop, down from 2. cmp/jcc macro-fuses into 1 uop, and the mov itself is also 1 uop, so in the not-taken case the cmov version is no worse for the front-end. And in the taken case, a taken branch can introduce bubbles in the pipeline even if predicted correctly, depending on how high-throughput the code is: whether queues between stages can absorb it.)
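A minimal sketch of that transformation (hypothetical, not from the compiler output above):
; branchy form, from the question:
    cmp  cl, 18
    je   .LBB1_67
    mov  r9d, r8d           ; executes only when cl != 18
.LBB1_67:

; branchless form: cmov with the inverted condition
    cmp    cl, 18
    cmovne r9d, r8d         ; r9d = r8d when cl != 18, else r9 is unchanged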
See gcc optimization flag -O3 makes code slower than -O2 for a case where CMOV is slower than a branch because introducing a data dependency is bad.

Do complex addressing modes have extra overhead for loads from memory?

Is there a difference in performance between these mov load instructions? Do the more complex addressing modes have extra overhead (latency or throughput) compared to the simple ones?
# AT&T syntax                   # Intel syntax:
movq (%rsi), %rax               mov rax, [rsi]
movq (%rdi, %rsi), %rax         mov rax, [rdi + rsi]
movq (%rdi, %rsi, 4), %rax      mov rax, [rdi + rsi*4]
Yes, there is an overhead for "complex addressing" on recent Intel CPUs. The cost is one additional cycle of latency (e.g., 5 cycles for a normal GP load using complex addressing versus 4 cycles with simple addressing).
Simple addressing is anything of the form [reg + offset] where the immediate offset is between 0 and 2047 inclusive.
Complex addressing is anything other than simple addressing.
In particular, any addressing mode with two registers, like your examples [rdi + rsi] or [rdi + rsi*4], is complex addressing and costs an extra cycle.
There is an exceptional case: if the index register1 is zeroed via a zeroing idiom (like xor edi, edi, but not like mov edi, 0) you don't pay the complex addressing penalty.
1 The index register is the one multiplied by 1, 2, 4 or 8, i.e., rsi in [rdi + rsi*4]. In the case where neither register has an explicit multiplier, like [rdi + rsi], the multiplier is 1 and you'll have to check your assembler to see which register it treats as the index and which as the base. nasm seems to use the second register as the index.
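A minimal (hypothetical) pair of pointer-chasing loops that would expose the difference, assuming a buffer in which each qword holds the address (or offset) of the next element:
; simple addressing: 4-cycle load-use latency per step
.chase_simple:
    mov rax, [rax]              ; each load's address depends on the last
    dec ecx
    jnz .chase_simple

; complex (two-register) addressing: 5 cycles per step
.chase_complex:
    mov rax, [rdi + rax]        ; rax holds an offset from the base rdi
    dec ecx
    jnz .chase_complex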
Depending on which specific CPU; mostly "no, there's no extra overhead". However...
Most CPUs have out-of-order cores, which means they execute instructions in whatever order is fastest rather than in the order the instructions are given. For this to work, one instruction (e.g. movq (%rdi, %rsi, 4), %rax) can't happen until the things it depends on are finished (e.g. the values in rdi and rsi are known).
For example, these 2 instructions can occur in parallel (because the second instruction doesn't depend on the first):
movq (%rdi), %rdi
movq (%rsi), %rax
And these 2 instructions can't occur in parallel (the second instruction has to wait until the first instruction completes):
movq (%rdi), %rdi
movq (%rdi, %rsi), %rax
Also note that the bottleneck for a piece of code may not be execution. If the bottleneck is instruction fetch then larger instructions will be worse; if the bottleneck is instruction decode then more complex instructions can be worse; if the bottleneck is data cache bandwidth then anything that reads/writes to memory can be worse, etc.
Basically; you can't look at individual instructions in isolation and decide if they're better/worse. You have to look at entire sequences of multiple instructions so that you can know about any dependencies on previous instructions (and their latencies); and you have to know what the bottleneck is (e.g. from performance monitoring tools); and if you know all this then you can make an "educated guess" that's only really useful for a small number of CPUs (because different CPUs have different characteristics).

Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

I cannot understand why the first loop takes ~1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner Fog's tools and with perf. According to IACA it should take 1 cycle, and my theoretical computations agree.
This takes 1 cycle per iteration.
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this takes 2 cycles per iteration. But why?
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My CPU is IvyBridge.
movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.
So your first loop is 4 fused-domain uops, and can issue at one iteration per clock.
The nop is a 5th fused-domain uop (even though it doesn't take any execution port, so it's 0 unfused-domain uops). Uops issue in groups of up to 4, and on SnB/IvB the loop buffer can't split one iteration's uops across an issue group with the next iteration's, so a 5-uop loop body issues as 4 + 1: the front-end can only issue the loop at one iteration per 2 clocks.
See also the x86 tag wiki for more links to how CPUs work.
The 3rd loop is probably slow because mov rbx, [array+16] is probably loading from the same cache line that movnti evicts. That happens every time the fill buffer it's storing into is flushed. (Not on every movnti; apparently it can rewrite some bytes in the same fill buffer.)

memcpy performance differences between 32 and 64 bit processes

We have Core2 machines (Dell T5400) with XP64. We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2 GByte/s; however, memcpy in a 64-bit process achieves about 2.2 GByte/s (or 2.4 GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of the 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.
My question is: what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?
Thanks for any insight. Also raised on Intel forums.
I think the following can explain it:
To copy data from memory to a register and back to memory, you do
mov eax, [address]
mov [address2], eax
This moves 32 bits (4 bytes) from address to address2. The same goes for 64-bit in 64-bit mode:
mov rax, [address]
mov [address2], rax
This moves 64 bits (8 bytes) from address to address2. mov itself, regardless of whether it is 64-bit or 32-bit, has a latency of 0.5 and a throughput of 0.5 according to Intel's specs. Latency is how many clock cycles the instruction takes to travel through the pipeline, and throughput is how long the CPU has to wait before accepting the same instruction again. As you can see, it can do two movs per clock cycle; however, it has to wait half a clock cycle between two movs, thus it can effectively only do one mov per clock cycle (or am I wrong here and misinterpreting the terms? See the PDF here for details).
Of course a mov reg, mem can take longer than 0.5 cycles, depending on whether the data is in first- or second-level cache, or not in cache at all and needs to be grabbed from memory. However, the latency figure above ignores this fact (as the PDF I linked states); it assumes all data necessary for the mov is already present (otherwise the latency increases by however long it takes to fetch the data from wherever it is right now - this might be several clock cycles and is completely independent of the command being executed, says the PDF on page 482/C-30).
What is interesting is that whether the mov is 32- or 64-bit plays no role. That means unless the memory bandwidth becomes the limiting factor, 64-bit movs are equally fast to 32-bit movs, and since it takes only half as many movs to move the same amount of data from A to B when using 64 bits, the throughput can (in theory) be twice as high (the fact that it's not is probably because memory is not unlimitedly fast).
Okay, now you might think that using the larger SSE registers should give faster throughput, right? AFAIK the xmm registers are not 256 but 128 bits wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? Either the data you want to move is 128-bit aligned or it isn't. Depending on that, you either move it using
movdqa xmm1, [address]
movdqa [address2], xmm1
or if not aligned
movdqu xmm1, [address]
movdqu [address2], xmm1
Well, movdqa/movdqu have a latency of 1 and a throughput of 1. So the instructions take twice as long to execute, and the waiting time after the instructions is twice as long as for a normal mov.
And something else we haven't even taken into account is the fact that the CPU actually splits instructions into micro-ops and can execute those in parallel. Now it starts getting really complicated... even too complicated for me.
Anyway, I know from experience that loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up the transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.
I finally got to the bottom of this (and Die in Sente's answer was on the right lines - thanks).
In the below, dst and src are 512 MByte std::vectors.
I'm using the Intel 10.1.029 compiler and CRT.
On 64bit both
memcpy(&dst[0],&src[0],dst.size())
and
memcpy(&dst[0],&src[0],N)
where N is previously declared const size_t N=512*(1<<20);
call
__intel_fast_memcpy
the bulk of which consists of:
000000014004ED80 lea rcx,[rcx+40h]
000000014004ED84 lea rdx,[rdx+40h]
000000014004ED88 lea r8,[r8-40h]
000000014004ED8C prefetchnta [rdx+180h]
000000014004ED93 movdqu xmm0,xmmword ptr [rdx-40h]
000000014004ED98 movdqu xmm1,xmmword ptr [rdx-30h]
000000014004ED9D cmp r8,40h
000000014004EDA1 movntdq xmmword ptr [rcx-40h],xmm0
000000014004EDA6 movntdq xmmword ptr [rcx-30h],xmm1
000000014004EDAB movdqu xmm2,xmmword ptr [rdx-20h]
000000014004EDB0 movdqu xmm3,xmmword ptr [rdx-10h]
000000014004EDB5 movntdq xmmword ptr [rcx-20h],xmm2
000000014004EDBA movntdq xmmword ptr [rcx-10h],xmm3
000000014004EDBF jge 000000014004ED80
and runs at ~2200 MByte/s.
But on 32bit
memcpy(&dst[0],&src[0],dst.size())
calls
__intel_fast_memcpy
the bulk of which consists of
004447A0 sub ecx,80h
004447A6 movdqa xmm0,xmmword ptr [esi]
004447AA movdqa xmm1,xmmword ptr [esi+10h]
004447AF movdqa xmmword ptr [edx],xmm0
004447B3 movdqa xmmword ptr [edx+10h],xmm1
004447B8 movdqa xmm2,xmmword ptr [esi+20h]
004447BD movdqa xmm3,xmmword ptr [esi+30h]
004447C2 movdqa xmmword ptr [edx+20h],xmm2
004447C7 movdqa xmmword ptr [edx+30h],xmm3
004447CC movdqa xmm4,xmmword ptr [esi+40h]
004447D1 movdqa xmm5,xmmword ptr [esi+50h]
004447D6 movdqa xmmword ptr [edx+40h],xmm4
004447DB movdqa xmmword ptr [edx+50h],xmm5
004447E0 movdqa xmm6,xmmword ptr [esi+60h]
004447E5 movdqa xmm7,xmmword ptr [esi+70h]
004447EA add esi,80h
004447F0 movdqa xmmword ptr [edx+60h],xmm6
004447F5 movdqa xmmword ptr [edx+70h],xmm7
004447FA add edx,80h
00444800 cmp ecx,80h
00444806 jge 004447A0
and runs at ~1350 MByte/s only.
HOWEVER
memcpy(&dst[0],&src[0],N)
where N is previously declared const size_t N=512*(1<<20); compiles (on 32bit) to a direct call to
__intel_VEC_memcpy
the bulk of which consists of
0043FF40 movdqa xmm0,xmmword ptr [esi]
0043FF44 movdqa xmm1,xmmword ptr [esi+10h]
0043FF49 movdqa xmm2,xmmword ptr [esi+20h]
0043FF4E movdqa xmm3,xmmword ptr [esi+30h]
0043FF53 movntdq xmmword ptr [edi],xmm0
0043FF57 movntdq xmmword ptr [edi+10h],xmm1
0043FF5C movntdq xmmword ptr [edi+20h],xmm2
0043FF61 movntdq xmmword ptr [edi+30h],xmm3
0043FF66 movdqa xmm4,xmmword ptr [esi+40h]
0043FF6B movdqa xmm5,xmmword ptr [esi+50h]
0043FF70 movdqa xmm6,xmmword ptr [esi+60h]
0043FF75 movdqa xmm7,xmmword ptr [esi+70h]
0043FF7A movntdq xmmword ptr [edi+40h],xmm4
0043FF7F movntdq xmmword ptr [edi+50h],xmm5
0043FF84 movntdq xmmword ptr [edi+60h],xmm6
0043FF89 movntdq xmmword ptr [edi+70h],xmm7
0043FF8E lea esi,[esi+80h]
0043FF94 lea edi,[edi+80h]
0043FF9A dec ecx
0043FF9B jne ___intel_VEC_memcpy+244h (43FF40h)
and runs at ~2100MByte/s (and proving 32bit isn't somehow bandwidth limited).
I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32-bit builds; I now don't have any problems getting >2 GByte/s on 32- or 64-bit; the trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).
It seems a bit strange that the 32-bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version (if you step into memcpy there is the most incredible amount of CPUID checking and heuristic logic, e.g. comparing the number of bytes to be copied with the cache size, before it goes anywhere near your actual data), but at least I understand the observed behaviour now (and it's not SysWow64 or H/W related).
Of course, you really need to look at the actual machine instructions being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.
My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; my guess is that the faster library routine was written using SSE non-temporal stores.
If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary - the bits being read are overwritten immediately - you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written, making a one-way trip to memory instead of a round trip.
I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...
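For reference, a minimal (hypothetical) sketch of such a non-temporal copy inner loop; this is not the ICC CRT's actual code:
; assumes 16-byte-aligned src (rsi) and dst (rdi), byte count in rcx (multiple of 16)
.copy:
    movdqa  xmm0, [rsi]         ; normal cached 16-byte load
    movntdq [rdi], xmm0         ; streaming store: no read-for-ownership, the
                                ; destination makes a one-way trip to memory
    add     rsi, 16
    add     rdi, 16
    sub     rcx, 16
    jnz     .copy
    sfence                      ; order the NT stores before later ordinary stores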
Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.
My off-the-cuff guess is that the 64 bit processes are using the processor's native 64-bit memory size, which optimizes the use of the memory bus.
Thanks for the positive feedback! I think I can partly explain what's going here.
Using the non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.
On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.
So if you're designing a runtime library, and you can assume that the application that called memcpy is going to use the destination buffer immediately after the memcpy call, then you'll want to provide the movdqa version. This effectively optimizes away the trip from memory back into the CPU that would follow the movntdq version, and all of the instructions following the call will run faster.
But on the other hand, if the destination buffer is large compared to the processor's cache, that optimization doesn't work and the movntdq version would give you faster application benchmarks.
So the ideal memcpy would have multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; when the destination buffer is large compared to the processor's cache, use movntdq. It sounds like this is what's happening in the 32-bit library.
Of course, none of this has anything to do with the differences between 32-bit and 64-bit.
My conjecture is that the 64-bit library just isn't as mature. The developers just haven't gotten around to providing both routines in that version of library yet.
I don't have a reference in front of me, so I'm not absolutely positive on the timings/instructions, but I can still give the theory. If you're doing a memory move in 32-bit mode, you'll do something like a rep movsd, which moves a single 32-bit value every clock cycle. In 64-bit mode, you can do a rep movsq, which does a single 64-bit move every clock cycle. That instruction is not available to 32-bit code, so you'd be doing twice as many rep movsd iterations (at 1 cycle apiece) for half the execution speed.
VERY much simplified, ignoring all the memory bandwidth/alignment issues, etc, but this is where it all begins...
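A minimal (hypothetical) illustration of the two rep-string copies being compared (source in rsi/esi, destination in rdi/edi, byte count assumed in rdx/edx, per the string-instruction convention):
; 32-bit mode:
    mov ecx, edx            ; byte count
    shr ecx, 2              ; -> dword count
    rep movsd               ; copy 4 bytes per iteration

; 64-bit mode:
    mov rcx, rdx
    shr rcx, 3              ; -> qword count
    rep movsq               ; copy 8 bytes per iteration (no 32-bit equivalent)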
Here's an example of a memcpy routine geared specifically for 64-bit architectures.
#include <stdint.h>
#include <stddef.h>

void uint8copy(void *dest, void *src, size_t n){
    uint64_t *ss = (uint64_t *)src;    /* cast to a pointer, not to uint64_t */
    uint64_t *dd = (uint64_t *)dest;
    n = n * sizeof(uint8_t) / sizeof(uint64_t);   /* byte count -> qword count */
    while(n--)
        *dd++ = *ss++;                 /* copy one 64-bit word per iteration */
}//end uint8copy()
(Note that any trailing bytes are dropped when n isn't a multiple of 8.)
The full article is here:
http://www.godlikemouse.com/2008/03/04/optimizing-memcpy-routines/
