I have a large number of 64-bit values in memory. Unfortunately they might not be aligned to 64-bit addresses. My goal is to change the endianness of all those values, i.e. to swap/reverse their bytes.
I know about the bswap instruction, which swaps the bytes of a 32- or 64-bit register. But as it requires a register argument, I cannot pass it my memory address. Of course I can first load the memory into a register, then swap, then write it back:
mov rax, qword [rsi]
bswap rax
mov qword [rsi], rax
But is that even correct, given that the address might be unaligned?
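As a reference point, that exact sequence is what you get from the portable C idiom: memcpy through a temporary (which makes the possibly-unaligned access explicit to the compiler) plus a byte-swap. A sketch, assuming the GCC/clang __builtin_bswap64 builtin:

```c
#include <stdint.h>
#include <string.h>

/* Byte-swap one 64-bit value at a possibly unaligned address.
   memcpy is the strict-aliasing-safe way to do an unaligned access;
   compilers typically emit mov/bswap/mov (or movbe) for this. */
static void bswap64_inplace(void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);      /* unaligned load */
    v = __builtin_bswap64(v);     /* one BSWAP instruction */
    memcpy(p, &v, sizeof v);      /* unaligned store */
}
```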
Another possibility is to do the swaps manually:
mov al, byte [rsi + 0]
mov bl, byte [rsi + 7]
mov byte [rsi + 0], bl
mov byte [rsi + 7], al
mov al, byte [rsi + 1]
mov bl, byte [rsi + 6]
mov byte [rsi + 1], bl
mov byte [rsi + 6], al
mov al, byte [rsi + 2]
mov bl, byte [rsi + 5]
mov byte [rsi + 2], bl
mov byte [rsi + 5], al
mov al, byte [rsi + 3]
mov bl, byte [rsi + 4]
mov byte [rsi + 3], bl
mov byte [rsi + 4], al
That's obviously a lot more instructions. But is it slower, too?
But all in all I'm still pretty inexperienced in x86-64, so I wonder: What is the fastest way to byte swap a 64 bit value in memory? Is one of the two options I described optimal? Or is there a completely different approach that is even faster?
PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed. Some other array tells me what size of integer to expect next. So this "description" could say "one 32 bit int, two 64 bit ints, one 16 bit int, then one 64 bit int again". I am just mentioning this here to tell you that (as far as I can tell), using SIMD instructions is not possible as I actually have to inspect the size of each integer before reading.
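For concreteness, here is a plain scalar reference version of what I'm doing, with a hypothetical sizes[] array standing in for my size description (each entry is an element's width in bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse each element of a densely packed buffer in place.
   sizes[i] gives the byte width (1, 2, 4 or 8) of element i. */
static void flip_mixed(uint8_t *buf, const uint8_t *sizes, size_t count) {
    for (size_t i = 0; i < count; i++) {
        uint8_t w = sizes[i];
        for (uint8_t a = 0, b = (uint8_t)(w - 1); a < b; a++, b--) {
            uint8_t t = buf[a];        /* swap the outermost pair */
            buf[a] = buf[b];
            buf[b] = t;
        }
        buf += w;                      /* next element starts right after */
    }
}
```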
What is the fastest way to byte swap a 64 bit value in memory?
The mov/bswap/mov version and the movbe/mov version are about the same on most Intel processors. Based on the µop count, it seems movbe decodes to mov + bswap, except on Atom. For Ryzen, movbe may be better. Manually swapping bytes around is much slower, except in certain edge cases where a large load/store is very slow, such as when it crosses a 4K boundary pre-Skylake.
pshufb is a reasonable option even to replace a single bswap, though that wastes half of the work the shuffle could do.
PS: My real situation is a bit more complicated. I do have a large byte array, but it contains differently sized integers, all densely packed.
In this general case, with sizes taken dynamically from another data stream, the big new issue is branching on the size. Even in scalar code that can be avoided, by byte-reversing a 64-bit block, shifting it right by 8 - size bytes, then merging it with the un-reversed bytes and advancing by size. That could be worked out, but it's a waste of time to try: the SIMD version will be better.
A SIMD version could use pshufb and a table of shuffle masks indexed by a "size pattern", for example an 8-bit integer where every 2 bits indicate the size of an element. pshufb then reverses the elements that are wholly contained in the 16-byte window that it's looking at, and leaves the rest alone (those unchanged bytes at the tail will be written back too, but that's OK). Then we advance by the number of bytes that was actually processed.
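Building that mask table is cheap one-time setup code. A sketch in plain C, under the assumption that a size pattern packs four 2-bit codes (element 0 in the low bits, code c meaning a (1 << c)-byte element); the tail of each mask gets identity indices so pshufb copies those bytes through unchanged, and the return value is the per-pattern byte count for the lengths table:

```c
#include <stdint.h>

/* Compute the 16-byte pshufb mask and processed-byte count for one
   size pattern.  Elements wholly contained in the 16-byte window get
   byte-reversed shuffle indices; remaining bytes map to themselves. */
static unsigned make_mask(uint8_t pat, uint8_t mask[16]) {
    unsigned pos = 0;
    for (int e = 0; e < 4; e++) {
        unsigned sz = 1u << ((pat >> (2 * e)) & 3);
        if (pos + sz > 16)
            break;                     /* element not wholly contained */
        for (unsigned b = 0; b < sz; b++)
            mask[pos + b] = (uint8_t)(pos + sz - 1 - b);   /* reverse */
        pos += sz;
    }
    for (unsigned b = pos; b < 16; b++)
        mask[b] = (uint8_t)b;          /* identity: leave the tail alone */
    return pos;                        /* bytes actually processed */
}
```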
For maximum convenience, those size patterns (as well as the corresponding byte counts) should be supplied in such a way that the actual Endianness Flipper itself can consume exactly one of them per iteration, without anything heavy such as extracting a byte-unaligned sequence of 8 bits and determining dynamically how many bits to consume. That's also possible, but at a significantly higher cost: about 4x as slow in my test, limited by the loop-carried dependency from "extract 8 bits at the current bit-index" through "find the bit-index increment by table lookup" into the next iteration, at about 16 cycles per iteration, though still only 60% of the time that equivalent scalar code took.
Using an unpacked (1 byte per size) representation would make the extraction easier (just an unaligned dword load), but requires packing the result to index the shuffle-mask table, for example with pext. That would be reasonable for Intel CPUs, but pext is extremely slow on AMD Ryzen. An alternative that is fine for both AMD and Intel would be to do the unaligned dword load, then extract the 8 interesting bits using a multiply/shift trick:
mov eax, [rdi]
imul eax, eax, 0x01041040
shr eax, 24
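The trick works because each of the four partial products deposits one 2-bit code into its own bit pair of the top byte, and the partial products below bit 24 sum to less than 2^24, so no carry can disturb the result. A C model of the same extraction:

```c
#include <stdint.h>

/* Pack four 2-bit codes, one per byte of x (byte 0 = first code),
   into 8 contiguous bits: the C equivalent of the imul/shr above. */
static uint32_t pack_sizes(uint32_t x) {
    return (x * 0x01041040u) >> 24;    /* b0 | b1<<2 | b2<<4 | b3<<6 */
}
```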
An extra trick that should be used, at least in the Convenient Input case (otherwise we're stuck with 5x worse performance anyway and this trick won't be relevant), is reading the data for the next iteration before storing the result of the current iteration. Without that trick, the store will often "step on the toes" of the next iteration's load (because we advance less than 16 bytes, so the load reads some of the bytes that the store left unchanged but had to write anyway), forcing a memory dependency between them which holds up the next iteration. The performance difference is large, about 3x.
Then the Endianness Flipper could look something like this:
void flipEndiannessSSSE3(char* buffer, size_t totalLength, uint8_t* sizePatterns, uint32_t* lengths, __m128i* masks)
{
    size_t i = 0;
    size_t j = 0;
    __m128i data = _mm_loadu_si128((__m128i*)buffer);
    while (i < totalLength) {
        int sizepattern = sizePatterns[j];
        __m128i permuted = _mm_shuffle_epi8(data, masks[sizepattern]);
        size_t next_i = i + lengths[j++];
        data = _mm_loadu_si128((__m128i*)&buffer[next_i]);
        _mm_storeu_si128((__m128i*)&buffer[i], permuted);
        i = next_i;
    }
}
For example, Clang 10 with -O3 -march=haswell turns that into
test rsi, rsi
je .LBB0_3
vmovdqu xmm0, xmmword ptr [rdi]
xor r9d, r9d
xor r10d, r10d
.LBB0_2: # =>This Inner Loop Header: Depth=1
movzx eax, byte ptr [rdx + r10]
shl rax, 4
vpshufb xmm1, xmm0, xmmword ptr [r8 + rax]
mov eax, dword ptr [rcx + 4*r10]
inc r10
add rax, r9
vmovdqu xmm0, xmmword ptr [rdi + rax]
vmovdqu xmmword ptr [rdi + r9], xmm1
mov r9, rax
cmp rax, rsi
jb .LBB0_2
.LBB0_3:
ret
LLVM-MCA thinks that takes about 3.3 cycles per iteration, on my PC (4770K, tested with a uniform mix of 1, 2, 4 and 8 byte sized elements) it was a little slower, closer to 3.7 cycles per iteration, but that's still good: that's just under 1.2 cycles per element.
Let's say you want to find the first occurrence of a value1 in a sorted array. For small arrays (where things like binary search don't pay off), you can achieve this by simply counting the number of values less than that value: the result is the index you are after.
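In C, that counting formulation of the lower-bound search is simply (sketched here with unsigned elements):

```c
#include <stddef.h>

/* Index of the first element >= key in a sorted array ("lower bound"),
   computed branchlessly by counting elements below key. */
static size_t lower_bound_count(const unsigned *arr, size_t n, unsigned key) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (arr[i] < key);       /* comparison result is 0 or 1 */
    return count;
}
```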
In x86 you can use adc (add with carry) for an efficient branch-free2 implementation of that approach (with the start pointer in rdi, the length in rsi, and the value to search for in edx):
xor eax, eax
lea rdi, [rdi + rsi*4] ; pointer to end of array = base + length
neg rsi ; we loop from -length to zero
loop:
cmp [rdi + 4 * rsi], edx
adc rax, 0 ; only a single uop on Sandybridge-family even before BDW
inc rsi
jnz loop
The answer ends up in rax. If you unroll that (or if you have a fixed, known input size), only the cmp; adc pair of instructions gets repeated, so the overhead approaches 2 simple instructions per comparison (and the sometimes-fused load). (See: Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?)
However, this only works for unsigned comparisons, where the carry flag holds the result of the comparison. Is there any equivalently efficient sequence for counting signed comparisons? Unfortunately, there doesn't seem to be an "add 1 if less than" instruction: adc, sbb and the carry flag are special in that respect.
I am interested in the general case where the elements have no particular order, and also in the case where the array is sorted, if the sortedness assumption leads to a simpler or faster implementation.
1 Or, if the value doesn't exist, the first greater value. I.e., this is the so called "lower bound" search.
2 Branch-free approaches necessarily do the same amount of work each time - in this case examining the entire array - so this approach only makes sense when the arrays are small and the cost of a branch misprediction is large relative to the total search time.
PCMPGT + PADDD or PSUBD is probably a really good idea for most CPUs, even for small sizes, maybe with a simple scalar cleanup. Or even just purely scalar, using movd loads, see below.
For scalar integer, avoiding XMM regs, use SETCC to create a 0/1 integer from any flag condition you want. xor-zero a tmp register (potentially outside the loop) and SETCC into the low 8 of that, if you want to use 32 or 64-bit ADD instructions instead of only 8-bit.
cmp/adc reg,0 is basically a peephole optimization for the below / carry-set condition. AFAIK, there is nothing as efficient for signed-compare conditions. At best 3 uops for cmp/setcc/add, vs. 2 for cmp/adc. So unrolling to hide loop overhead is even more important.
See the bottom section of What is the best way to set a register to zero in x86 assembly: xor, mov or and? for more details about how to zero-extend SETCC r/m8 efficiently but without causing partial-register stalls. And see Why doesn't GCC use partial registers? for a reminder of partial-register behaviour across uarches.
Yes, CF is special for a lot of things. It's the only condition flag that has set/clear/complement (stc/clc/cmc) instructions1. There's a reason that bt/bts/etc. instructions set CF, and that shift instructions shift into it. And yes, ADC/SBB can add/sub it directly into another register, unlike any other flag.
OF can be read similarly with ADOX (Intel since Broadwell, AMD since Ryzen), but that still doesn't help us because it's strictly OF, not the SF!=OF signed-less-than condition.
This is typical for most ISAs, not just x86. (AVR and some others can set/clear any condition flag because they have an instruction that takes an immediate bit-position in the status register. But they still only have ADC/SBB for directly adding the carry flag to an integer register.)
ARM 32-bit can do a predicated addlt r0, r0, #1 using any condition-code, including signed less-than, instead of an add-with-carry with immediate 0. ARM does have ADC-immediate which you could use for the C flag here, but not in Thumb mode (where it would be useful to avoid an IT instruction to predicate an ADD), so you'd need a zeroed register.
AArch64 can do a few predicated things, including increment with cinc with arbitrary condition predicates.
But x86 can't. We only have cmovcc and setcc to turn conditions other than CF==1 into integers. (Or with ADOX, for OF==1.)
Footnote 1: Some status flags in EFLAGS like interrupts IF (sti/cli), direction DF (std/cld), and alignment-check (stac/clac) have set/clear instructions, but not the condition flags ZF/SF/OF/PF or the BCD-carry AF.
cmp [rdi + 4 * rsi], edx will un-laminate even on Haswell/Skylake because of the indexed addressing mode, since it doesn't have a read/write destination register (so it's not like add reg, [mem]).
If tuning only for Sandybridge-family, you might as well just increment a pointer and decrement the size counter. Although this does save back-end (unfused-domain) uops for RS-size effects.
In practice you'd want to unroll with a pointer increment.
You mentioned sizes from 0 to 32, so we need to skip the loop if RSI = 0. The code in your question is just a do{}while which doesn't do that. NEG sets flags according to the result, so we can JZ on that. You'd hope that it could macro-fuse because NEG is exactly like SUB from 0, but according to Agner Fog it doesn't on SnB/IvB. So that costs us another uop in the startup if you really do need to handle size=0.
Using integer registers
The standard way to implement integer += (a < b) or any other flag condition is what compilers do (Godbolt):
xor edx,edx ; can be hoisted out of a short-running loop, but compilers never do that
; but an interrupt-handler saving/restoring RDX would destroy its xor-zeroed status
cmp/test/whatever ; flag-setting code here
setcc dl ; zero-extended to a full register because of earlier xor-zeroing
add eax, edx
Sometimes compilers (especially gcc) will use setcc dl / movzx edx,dl, which puts the MOVZX on the critical path. This is bad for latency, and mov-elimination doesn't work on Intel CPUs when they use (part of) the same register for both operands.
For small arrays, if you don't mind having only an 8-bit counter, you could just use 8-bit add so you don't have to worry about zero-extension inside the loop.
; slower than cmp/adc: 5 uops per iteration so you'll definitely want to unroll.
; requires size<256 or the count will wrap
; use the add eax,edx version if you need to support larger size
count_signed_lt: ; (int *arr, size_t size, int key)
xor eax, eax
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
jz .return ; if(-size == 0) return 0;
; xor ecx, ecx ; tmp destination for SETCC
.loop:
cmp [rdi + 4 * rsi], edx
setl cl ; can't use DL: that would clobber the key in EDX.
        ; false dependency on old RCX on CPUs other than P6-family
add al, cl
; add eax, ecx ; boolean condition zero-extended into RCX if it was xor-zeroed
inc rsi
jnz .loop
.return:
ret
Alternatively using CMOV, making the loop-carried dep chain 2 cycles long (or 3 cycles on Intel before Broadwell, where CMOV is 2 uops):
;; 3 uops without any partial-register shenanigans (or 4 because of unlamination)
;; but creates a 2 cycle loop-carried dep chain
cmp [rdi + 4 * rsi], edx
lea ecx, [rax + 1] ; tmp = count+1
cmovl eax, ecx ; count = arr[i]<key ? count+1 : count
So at best (with loop unrolling and a pointer-increment allowing cmp to micro-fuse) this takes 3 uops per element instead of 2.
SETCC is a single uop, so this is 5 fused-domain uops inside the loop. That's much worse on Sandybridge/IvyBridge, and still runs at worse than 1 per clock on later SnB-family. (Some ancient CPUs had slow setcc, like Pentium 4, but it's efficient on everything we still care about.)
When unrolling, if you want this to run faster than 1 cmp per clock, you have two choices: use separate registers for each setcc destination, creating multiple dep chains for the false dependencies, or use one xor ecx,ecx inside the loop to break the loop-carried false dependency into multiple short dep chains that only couple the setcc results of nearby loads (probably coming from the same cache line). You'll also need multiple accumulators because add latency is 1c.
Obviously you'll need to use a pointer-increment so cmp [rdi], edx can micro-fuse with a non-indexed addressing mode, otherwise the cmp/setcc/add is 4 uops total, and that's the pipeline width on Intel CPUs.
There's no partial-register stall from the caller reading EAX after writing AL, even on P6-family, because we xor-zeroed it first. Sandybridge won't rename it separately from RAX because add al,cl is a read-modify-write, and IvB and later never rename AL separately from RAX (only AH/BH/CH/DH). CPUs other than P6 / SnB-family don't do partial-register renaming at all, only partial flags.
The same applies for the version that reads ECX inside the loop. But an interrupt-handler saving/restoring RCX with push/pop would destroy its xor-zeroed status, leading to partial-register stalls every iteration on P6-family. This is catastrophically bad, so that's one reason compilers never hoist the xor-zeroing. They usually don't know if a loop will be long-running or not, and won't take the risk. By hand, you'd probably want to unroll and xor-zero once per unrolled loop body, rather than once per cmp/setcc.
You can use SSE2 or MMX for scalar stuff
Both are baseline on x86-64. Since you're not gaining anything (on SnB-family) from folding the load into the cmp, you might as well use a scalar movd load into an XMM register. MMX has the advantage of smaller code-size, but requires EMMS when you're done. It also allows unaligned memory operands, so it's potentially interesting for simpler auto-vectorization.
Until AVX512, we only have comparison for greater-than available, so it would take an extra movdqa xmm,xmm instruction to do key > arr[i] without destroying key, instead of arr[i] > key. (This is what gcc and clang do when auto-vectorizing).
AVX would be nice, for vpcmpgtd xmm0, xmm1, [rdi] to do key > arr[i], like gcc and clang use with AVX. But that's a 128-bit load, and we want to keep it simple and scalar.
We can decrement key and use (arr[i]<key) = (arr[i] <= key-1) = !(arr[i] > key-1). We can count elements where the array element is greater than key-1, and subtract that from the size. So we can make do with just SSE2 without costing extra instructions.
If key was already the most-negative number (so key-1 would wrap), then no array elements can be less than it. This does introduce a branch before the loop if that case is actually possible.
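The identity is easy to sanity-check in C; it holds for any key > INT_MIN, which is exactly what the overflow check below guards against:

```c
/* For signed key > INT_MIN:  (x < key) == !(x > key - 1).
   This lets pcmpgtd (SSE2 only has greater-than) express less-than:
   count elements > key-1 and subtract that count from the size. */
static int lt_via_gt(int x, int key) {
    return !(x > key - 1);
}
```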
; signed version of the function in your question
; using the low element of XMM vectors
count_signed_lt: ; (int *arr, size_t size, int key)
; actually only works for size < 2^32
dec edx ; key-1
jo .key_eq_int_min
movd xmm2, edx ; not broadcast, we only use the low element
movd xmm1, esi ; counter = size, decrement toward zero on elements >= key
;; pxor xmm1, xmm1 ; counter
;; mov eax, esi ; save original size for a later SUB
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
.loop:
movd xmm0, [rdi + 4 * rsi]
pcmpgtd xmm0, xmm2 ; xmm0 = arr[i] gt key-1 = arr[i] >= key = not less-than
paddd xmm1, xmm0 ; counter += 0 or -1
;; psubd xmm1, xmm0 ; -0 or -(-1) to count upward
inc rsi
jnz .loop
movd eax, xmm1 ; size - count(elements > key-1)
ret
.key_eq_int_min:
xor eax, eax ; no array elements are less than the most-negative number
ret
This should be the same speed as your loop on Intel SnB-family CPUs, plus a tiny bit of extra overhead outside. It's 4 fused-domain uops, so it can issue at 1 per clock. A movd load uses a regular load port, and there are at least 2 vector ALU ports that can run PCMPGTD and PADDD.
Oh, but on IvB/SnB the macro-fused inc/jnz requires port 5, while PCMPGTD / PADDD both only run on p1/p5, so port 5 throughput will be a bottleneck. On HSW and later the branch runs on port 6, so we're fine for back-end throughput.
It's worse on AMD CPUs where a memory-operand cmp can use an indexed addressing mode without a penalty. (And on Intel Silvermont, and Core 2 / Nehalem, where memory-source cmp can be a single uop with an indexed addressing mode.)
And on Bulldozer-family, a pair of integer cores share a SIMD unit, so sticking to integer registers could be an even bigger advantage. That's also why int<->XMM movd/movq has higher latency, again hurting this version.
Other tricks:
Clang for PowerPC64 (included in the Godbolt link) shows us a neat trick: zero or sign-extend to 64-bit, subtract, and then grab the MSB of the result as a 0/1 integer that you add to counter. PowerPC has excellent bitfield instructions, including rldicl. In this case, it's being used to rotate left by 1, and then zero all bits above that, i.e. extracting the MSB to the bottom of another register. (Note that PowerPC documentation numbers bits with MSB=0, LSB=63 or 31.)
If you don't disable auto-vectorization, it uses Altivec with a vcmpgtsw / vsubuwm loop, which I assume does what you'd expect from the names.
# PowerPC64 clang 9-trunk -O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=power9
# signed int version
# I've added "r" to register names, leaving immediates alone, because clang doesn't have `-mregnames`
... setup
.LBB0_2: # do {
lwzu r5, 4(r6) # zero-extending load and update the address register with the effective-address. i.e. pre-increment
extsw r5, r5 # sign-extend word (to doubleword)
sub r5, r5, r4 # 64-bit subtract
rldicl r5, r5, 1, 63 # rotate-left doubleword immediate then clear left
add r3, r3, r5 # retval += MSB of (int64_t)arr[i] - key
bdnz .LBB0_2 # } while(--loop_count);
I think clang could have avoided the extsw inside the loop if it had used an arithmetic (sign-extending) load. The only lwa that updates the address register (saving an increment) seems to be the indexed form lwaux RT, RA, RB, but if clang put 4 in another register it could use it. (There doesn't seem to be a lwau instruction.) Maybe lwaux is slow or maybe it's a missed optimization. I used -mcpu=power9 so even though that instruction is POWER-only, it should be available.
This trick could sort of help for x86, at least for a rolled-up loop.
It takes 4 uops this way per compare, not counting loop overhead. Despite x86's pretty bad bitfield extract capabilities, all we actually need is a logical right-shift to isolate the MSB.
count_signed_lt: ; (int *arr, size_t size, int key)
xor eax, eax
movsxd rdx, edx
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
.loop:
movsxd rcx, dword [rdi + 4 * rsi] ; 1 uop, pure load
sub rcx, rdx ; (int64_t)arr[i] - key
shr rcx, 63 ; extract MSB
add eax, ecx ; count += MSB of (int64_t)arr[i] - key
inc rsi
jnz .loop
ret
This doesn't have any false dependencies, but neither does 4-uop xor-zero / cmp / setl / add. The only advantage here is that this is 4 uops even with an indexed addressing mode. Some AMD CPUs may run MOVSXD through an ALU as well as a load port, but on Ryzen it has the same latency as a regular load.
If you have fewer than 64 iterations, you could do something like this if only throughput matters, not latency. (But you can probably still do better with setl)
.loop
movsxd rcx, dword [rdi + 4 * rsi] ; 1 uop, pure load
sub rcx, rdx ; (int64_t)arr[i] - key
shld rax, rcx, 1 ; 3 cycle latency
inc rsi / jnz .loop
popcnt rax, rax ; turn the bitmap of compare results into an integer
But the 3-cycle latency of shld makes this a showstopper for most uses, even though it's only a single uop on SnB-family. The rax->rax dependency is loop-carried.
There's a trick to convert a signed comparison into an unsigned comparison and vice versa by toggling the top bit of both operands:
bool signedLessThan(int a, int b)
{
    return ((unsigned)a ^ INT_MIN) < ((unsigned)b ^ INT_MIN); // or add 0x80000000U to both
}
It works because the ranges in 2's complement are still linear, just with the signed and unsigned halves swapped; flipping the top bit of both operands turns a signed comparison into an unsigned one. So the simplest way may be XORing before comparison:
xor eax, eax
xor edx, 0x80000000 ; adjusting the search value
lea rdi, [rdi + rsi*4] ; pointer to end of array = base + length
neg rsi ; we loop from -length to zero
loop:
mov ecx, [rdi + 4 * rsi]
xor ecx, 0x80000000
cmp ecx, edx
adc rax, 0 ; only a single uop on Sandybridge-family even before BDW
inc rsi
jnz loop
If you can modify the array then just do the conversion before checking
The ADX extension has ADOX, which uses OF as its carry. Unfortunately the signed less-than condition needs SF!=OF, not OF alone, so you can't just use it like this
xor ecx, ecx
loop:
cmp [rdi + 4 * rsi], edx
adox rax, rcx ; rcx=0; ADOX is not available with an immediate operand
and you would need some further bit manipulation to correct the result.
In the case the array is guaranteed to be sorted, one could use cmovl with an "immediate" value representing the correct value to add. There are no immediates for cmovl, so you'll have to load them into registers beforehand.
This technique makes sense when unrolled, for example:
; load constants
mov r11, 1
mov r12, 2
mov r13, 3
mov r14, 4
loop:
xor ecx, ecx
cmp [rdi + 0], edx
cmovl rcx, r11
cmp [rdi + 4], edx
cmovl rcx, r12
cmp [rdi + 8], edx
cmovl rcx, r13
cmp [rdi + 12], edx
cmovl rcx, r14
add rax, rcx
; update rdi, test loop condition, etc
jcc loop
You have 2 uops per comparison, plus overhead. There is a 4-cycle (BDW and later) dependency chain between the cmovl instructions, but it is not carried.
One disadvantage is that you have to set up the 1,2,3,4 constants outside of the loop. It also doesn't work as well if not unrolled (you need to amortize the add rax, rcx accumulation).
Assuming the array is sorted, you could make separate code branches for positive and negative needles. You will need a branch instruction at the very start, but after that, you can use the same branch-free implementation you would use for unsigned numbers. I hope that is acceptable.
needle >= 0:
go through the array in ascending order
begin by counting every negative array element
proceed with the positive numbers just like you would in the unsigned scenario
needle < 0:
go through the array in descending order
begin by skipping every positive array element
proceed with the negative numbers just like you would in the unsigned scenario
Unfortunately, in this approach you cannot unroll your loops.
An alternative would be to go through each array twice; once with the needle, and again to find the number of positive or negative elements (using a 'needle' that matches the minimum signed integer).
(unsigned) count the elements < needle
(unsigned) count the elements >= 0x80000000
add the results
if needle < 0, subtract the array length from the result
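A scalar C sketch of that two-pass scheme (the function name is mine; both passes are folded into one loop here, since each uses only unsigned compares):

```c
#include <stddef.h>
#include <stdint.h>

/* Count signed elements < key using only unsigned comparisons. */
static size_t count_lt_twopass(const int32_t *arr, size_t n, int32_t key) {
    size_t lt = 0, neg = 0;
    for (size_t i = 0; i < n; i++) {
        lt  += (uint32_t)arr[i] < (uint32_t)key;  /* unsigned element < needle */
        neg += (uint32_t)arr[i] >= 0x80000000u;   /* element is negative */
    }
    size_t r = lt + neg;
    if (key < 0)
        r -= n;   /* needle negative: every element was counted once extra */
    return r;
}
```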
There's probably a lot to be optimized in my code below; I'm quite rusty at this.
; NOTE: no need to initialize eax here!
lea rdi, [rdi + rsi*4] ; pointer to end of array = base + length
neg rsi ; we loop from -length to zero
mov ebx, 80000000h ; minimum signed integer (need this in the loop too)
bt edx, 31 ; CF = sign bit: set if needle negative
sbb eax, eax ; -1 if needle negative, otherwise zero
and eax, esi ; -length if needle negative, otherwise zero
loop:
cmp [rdi + 4 * rsi], edx
adc rax, 0 ; +1 if element < needle
cmp [rdi + 4 * rsi], ebx
cmc
adc rax, 0 ; +1 if element >= 0x80000000
inc rsi
jnz loop
I have a highly optimized function called repeatedly in the inner loop, written with SSE2/AVX2 accelerations. After some refinement, I am now approaching the theoretical best performance (based on instruction latency and throughput). However, the performance is not quite portable. The problem is that there are more than 16 __m128i/__m256i variables. Of course, only 16 of them can live in registers; the rest spill to the stack. The function is more or less like the following,
void eval(size_t n, __m128i *rk /* other inputs */)
{
    __m128i xmmk0 = rk[0];
    // ...
    __m128i xmmk6 = rk[6];
    __m128i xmmk[Rounds - 6];
    // copy rk[7] to rk[Rounds] into xmmk
    while (n >= 8) {
        n -= 8;
        __m128i xmm0 = /* initialize state xmm0 */;
        // do the same for xmm1 - xmm7

        // round 0
        xmm0 = /* a few instructions involving xmm0 and xmmk0 */;
        // do the same for xmm1 - xmm7
        // do the same for rounds 1 to 6, using xmmk1, ..., xmmk6

        // round 7, copy xmmk[0] to a temporary __m128i variable
        xmm0 = /* a few instructions involving xmm0 and xmmk[0] */;
        // do the same for xmm1 - xmm7
        // do the same for rounds 8 to Rounds, using xmmk[1], ..., xmmk[Rounds - 7]
    }
}
There are more than 16 __m128i variables involved. The best performance I can get is when the compiler allocates xmm0 to xmm7 (the states) and xmmk0 to xmmk6 (the first 7 round constants) in registers, and uses the remaining register as a temporary that loads round constants for the later rounds. When written in a way similar to the above, GCC/clang do exactly that, but Intel ICPC allocates some of the xmm0 to xmm7 variables on the stack and introduces some unnecessary memory movement. If instead I write it in a way similar to the following
    __m128i xmmk[Rounds + 1]; // copy from input rk
    // let the compiler figure out which of these live on the stack and which in registers
Then GCC/ICPC do a good job of register allocation, while clang falls into a situation similar to that of ICPC in the previous case.
Of course, declaring a variable of type __m128i does not make it a register, nor does declaring it in a stack array prevent the compiler from allocating a register for it.
I was able to write an ASM implementation which does exactly what I want. However, the actual function involves some compile time constants specified as template policies. So it is more desirable to implement them in C++ with intrinsics.
What I want to know is if there's a way to influence how these compilers perform register allocation. Usually the difference in performance is merely a few cycles, because of the fast L1 cache. However, when the code is highly optimized and the inner loop takes only a hundred cycles, a dozen cycles of unnecessary memory movement translate into a 20% overall performance difference. As far as I know, there's no way to make compilers do things exactly as in hand-written assembly. But it would help a lot if I could at least give them some hints. For example, I know that the latency of loading the round constants can be hidden by other instructions, so it is preferable to allocate them on the stack rather than the state variables xmm0 to xmm7.
gcc 5.3 with -O3 -mavx -mtune=haswell for x86-64 makes surprisingly bulky code to handle potentially-misaligned inputs for code like:
// convenient simple example of compiler input
// I'm not actually interested in this for any real program
void floatmul(float *a) {
    for (int i = 0; i < 1024; i++)
        a[i] *= 2;
}
clang uses unaligned load/store instructions, but gcc does a scalar intro/outro and an aligned vector loop: it peels off the first up-to-7 unaligned iterations, fully unrolling them into a sequence of
vmovss xmm0, DWORD PTR [rdi]
vaddss xmm0, xmm0, xmm0 ; multiply by two
vmovss DWORD PTR [rdi], xmm0
cmp eax, 1
je .L13
vmovss xmm0, DWORD PTR [rdi+4]
vaddss xmm0, xmm0, xmm0
vmovss DWORD PTR [rdi+4], xmm0
cmp eax, 2
je .L14
...
This seems pretty terrible, esp. for CPUs with a uop cache. I reported a gcc bug about this, with a suggestion for smaller/better code that gcc could use when peeling unaligned iterations. It's probably still not optimal, though.
This question is about what actually would be optimal with AVX. I'm asking about general-case solutions that gcc and other compilers could/should use. (I didn't find any gcc mailing list hits with discussion about this, but didn't spend long looking.)
There will probably be multiple answers, since what's optimal for -mtune=haswell will probably be different from what's optimal for -mtune=bdver3 (steamroller). And then there's the question of what's optimal when allowing instruction set extensions (e.g. AVX2 for 256b integer stuff, BMI1 for turning a count into a bitmask in fewer instructions).
I'm aware of Agner Fog's Optimizing Assembly guide, Section 13.5 Accessing unaligned data and partial vectors. He suggests either using unaligned accesses, doing an overlapping write at the start and/or end, or shuffling data from aligned accesses (but PALIGNR only takes an imm8 count, so 2x pshufb / por). He discounts VMASKMOVPS as not useful, probably because of how badly it performs on AMD. I suspect that if tuning for Intel, it's worth considering. It's not obvious how to generate the correct mask, hence the question title.
It might turn out that it's better to simply use unaligned accesses, like clang does. For short buffers, the overhead of aligning might kill any benefit from avoiding cacheline splits for the main loop. For big buffers, main memory or L3 as the bottleneck may hide the penalty for cacheline splits. If anyone has experimental data to back this up for any real code they've tuned, that's useful information too.
VMASKMOVPS does look usable for Intel targets. (The SSE version is horrible, with an implicit non-temporal hint, but the AVX version doesn't have that. There's even a new intrinsic to make sure you don't get the SSE version for 128b operands: _mm_maskstore_ps.) The AVX version is only a little bit slow on Haswell:
3 uops / 4c latency / 1-per-2c throughput as a load.
4 uops / 14c latency / 1-per-2c throughput as a 256b store.
4 uops / 13c latency / 1-per-1c throughput as a 128b store.
The store form is still unusably slow on AMD CPUs, both Jaguar (1 per 22c tput) and Bulldozer-family: 1 per 16c on Steamroller (similar on Bulldozer), or 1 per ~180c throughput on Piledriver.
But if we do want to use VMASKMOVPS, we need a vector with the high bit set in each element that should actually be loaded/stored. PALIGNR and PSRLDQ (for use on a vector of all-ones) only take compile-time-constant counts.
Notice that the other bits don't matter: it doesn't have to be all-ones, so scattering some set bits out to the high bits of the elements is a possibility.
You could turn an integer bitmask like (1 << (vector1.size() & 3)) - 1 into a vector mask on the fly (see the Q&A "is there an inverse instruction to the movemask instruction in intel avx2?"). Or:
Load a mask for VMASKMOVPS from a window into a table. AVX2, or AVX1 with a few extra instructions or a larger table.
The mask can also be used for ANDPS in registers in a reduction that needs to count each element exactly once. As Stephen Canon points out in comments on the OP, pipelining loads can allow overlapping unaligned stores to work even for a rewrite-in-place function like the example I picked, so VMASKMOVPS is NOT the best choice here.
This should be good on Intel CPUs, esp. Haswell and later for AVX2.
Agner Fog's method for getting a pshufb mask actually provided an idea that is very efficient: do an unaligned load taking a window of data from a table. Instead of a giant table of masks, use an index as a way of doing a byte-shift on data in memory.
Masks in LSB-first byte order (as they're stored in memory), not the usual notation for {X3,X2,X1,X0} elements in a vector. As written, they line up with an aligned window including the start/end of the input array in memory.
start misalign count = 0: mask = all-ones (Aligned case)
start misalign count = 1: mask = {0,-1,-1,-1,-1,-1,-1,-1} (skip one in the first 32B)
...
start misalign count = 7: mask = {0, 0, 0, 0, 0, 0, 0,-1} (skip all but one in the first 32B)
end misalign count = 0: no trailing elements. mask = all-ones (Aligned case).
this is the odd case, not similar to count=1. A couple extra instructions for this special case is worth avoiding an extra loop iteration and a cleanup with a mask of all-zeros.
end misalign count = 1: one trailing element. mask = {-1, 0, 0, 0, 0, 0, 0, 0}
...
end misalign count = 7: seven trailing elems. mask = {-1,-1,-1,-1,-1,-1,-1, 0}
Untested code, assume there are bugs
section .data
align 32 ; preferably no cache-line boundaries inside the table
; byte elements, to be loaded with pmovsx. all-ones sign-extends
DB 0, 0, 0, 0, 0, 0, 0, 0
masktable_intro: ; index with 0..-7
DB -1, -1, -1, -1, -1, -1, -1, -1
masktable_outro: ; index with -8(aligned), or -1..-7
DB 0, 0, 0, 0, 0, 0, 0, 0
; the very first and last 0 bytes are not needed, since we avoid an all-zero mask.
section .text
global floatmul ; (float *rdi)
floatmul:
mov eax, edi
and eax, 0x1c ; 0x1c = 7 << 2 = 0b11100
lea rdx, [rdi + 4096 - 32] ; one full vector less than the end address (calculated *before* masking for alignment).
;; replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)
and rdi, ~0x1c ; Leave the low 2 bits alone, so this still works on misaligned floats.
shr eax, 2 ; misalignment-count, in the range [0..7]
neg rax
vpmovsxbd ymm0, [masktable_intro + rax] ; Won't link on OS X: Need a separate LEA for RIP-relative
vmaskmovps ymm1, ymm0, [rdi]
vaddps ymm1, ymm1, ymm1 ; *= 2.0
vmaskmovps [rdi], ymm0, ymm1
;;; also prepare the cleanup mask while the table is still hot in L1 cache
; if the loop count known to be a multiple of the vector width,
; the alignment of the end will be the same as the alignment of the start
; so we could just invert the mask
; vpxor xmm1, xmm1, xmm1 ; doesn't need an execution unit
; vpcmpeqd ymm0, ymm1, ymm0
; In the more general case: just re-generate the mask from the one-past-the-end addr
mov eax, edx
xor ecx, ecx ; prep for setcc
and eax, 0x1c ; sets ZF when aligned
setz cl ; rcx=1 in the aligned special-case, else 0
shr eax, 2
lea eax, [rax + rcx*8] ; 1..7, or 8 in the aligned case
neg rax
vpmovsxbd ymm0, [masktable_outro + rax]
.loop:
add rdi, 32
vmovups ymm1, [rdi] ; Or vmovaps if you want to fault if the address isn't 4B-aligned
vaddps ymm1, ymm1, ymm1 ; *= 2.0
vmovups [rdi], ymm1
cmp rdi, rdx ; while( (p+=8) < (start+1024-8) )
jb .loop ; 5 fused-domain uops, yuck.
; use the outro mask that we generated before the loop for insn scheduling / cache locality reasons.
vmaskmovps ymm1, ymm0, [rdi]
vaddps ymm1, ymm1, ymm1 ; *= 2.0
vmaskmovps [rdi], ymm0, ymm1
ret
; vpcmpeqd ymm1, ymm1, ymm1 ; worse way to invert the mask: dep-chain breaker but still needs an execution unit to make all-ones instead of all-zeros.
; vpxor ymm0, ymm0, ymm1
This does require a load from a table, which can miss in L1 cache, and 15B of table data. (Or 24B if the loop count is also variable, and we have to generate the end mask separately).
Either way, after the 4 instructions to generate the misalignment-count and the aligned start address, getting the mask only takes a single vpmovsxbd instruction. (The ymm, mem form can't micro-fuse, so it's 2 uops). This requires AVX2.
Without AVX2:
vmovdqu from a 60B table of DWORDs (DD) instead of Bytes (DB). This would actually save an insn relative to AVX2: address & 0x1c is the index, without needing a right-shift by two. The whole table still fits in a cache line, but without room for other constants the algo might use.
Or:
2x vpmovsxbd into two 128b registers ([masktable_intro + rax] and [masktable_intro + rax + 4])
vinsertf128
Or: (more insns, and more shuffle-port pressure, but less load-port pressure, almost certainly worse)
vpmovsxbw into a 128b register
vpunpcklwd / vpunpckhwd into two xmm regs (src1=src2 for both)
vinsertf128
Overhead:
Integer ops: 5 uops at the start to get an index and align the start pointer. 7 uops to get the index for the end mask. Total of 12 GP-register uops beyond simply using unaligned, if the loop element count is a multiple of the vector width.
AVX2: Two 2-fused-domain-uop vector insns to go from [0..7] index in a GP reg to a mask in a YMM reg. (One for the start mask, one for the end mask). Uses a 24B table, accessed in an 8B window with byte granularity.
AVX: Six 1-fused-domain-uop vector insns (three at the start, three at the end). With RIP-relative addressing for the table, four of those instructions will be [base+index] and won't micro-fuse, so an extra two integer insns might be better.
The code inside the loop is replicated 3 times.
TODO: write another answer generating the mask on the fly, maybe as bytes in a 64b reg, then unpacking it to 256b. Maybe with a bit-shift, or BMI2's BZHI(-1, count)?
AVX-only: Unaligned accesses at the start/end, pipelining loads to avoid problems when rewriting in place.
Thanks to @StephenCanon for pointing out that this is better than VMASKMOVPS for anything that VMASKMOVPS could do to help with looping over unaligned buffers.
This is maybe a bit much to expect a compiler to do as a loop transformation, esp. since the obvious way can make Valgrind unhappy (see below).
section .text
global floatmul ; (float *rdi)
floatmul:
lea rdx, [rdi + 4096 - 32] ; one full vector less than the end address (calculated *before* masking for alignment).
;; replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)
vmovups ymm0, [rdi] ; first vector
vaddps ymm0, ymm0, ymm0 ; *= 2.0
; don't store yet
lea rax, [rdi+32]
and rax, ~0x1c ; 0x1c = 7 << 2 = 0b11100 ; clear those bits.
vmovups ymm1, [rax] ; first aligned vector, for use by first loop iteration
vmovups [rdi], ymm0 ; store the first unaligned vector
vmovups ymm0, [rdx] ; load the *last* unaligned vector
.loop:
;; on entry: [rax] is already loaded into ymm1
vaddps ymm1, ymm1, ymm1 ; *= 2.0
vmovups [rax], ymm1 ; vmovaps would fault if p%4 != 0
add rax, 32
vmovups ymm1, [rax]
cmp rax, rdx ; while( (p+=8) < (endp-8) );
jb .loop
; discard ymm1. It includes data from beyond the end of the array (aligned case: same as ymm0)
vaddps ymm0, ymm0, ymm0 ; the last 32B, which we loaded before the loop
vmovups [rdx], ymm0
ret
; End alignment:
; a[] = XXXX XXXX ABCD E___ _ = garbage past the end
; ^rdx
; ^rax ^rax ^rax ^rax(loop exit)
; ymm0 = BCDE
; ymm1 loops over ..., XXXX, ABCD, E___
; The last load off the end of the array includes garbage
; because we pipeline the load for the next iteration
Doing a load from the end of the array at the start of the loop seems a little weird, but hopefully it doesn't confuse the hardware prefetchers, or slow down getting the beginning of the array streaming from memory.
Overhead:
2 extra integer uops total (to set up the aligned-start). We're already using the end pointer for the normal loop structure, so that's free.
2 extra copies of the loop body (load/calc/store). (First and last iteration peeled).
Compilers probably won't be happy about emitting code like this, when auto-vectorizing. Valgrind will report accesses outside of array bounds, and does so by single-stepping and decoding instructions to see what they're accessing. So merely staying within the same page (and cache line) as the last element in the array isn't sufficient. Also note that if the input pointer isn't 4B-aligned, we can potentially read into another page and segfault.
To keep Valgrind happy, we could stop the loop two vector widths early, to do the special-case load of the unaligned last vector-width of the array. That would require duplicating the loop body an extra time (insignificant in this example, but it's trivial on purpose.) Or maybe avoid pipelining by having the intro code jump into the middle of the loop. (That may be sub-optimal for the uop-cache, though: (parts of) the loop body may end up in the uop cache twice.)
TODO: write a version that jumps into the loop mid-way.
In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:
// b is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);
I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although I'm not limited by L1 latency or bandwidth at this point, I'm very interested to know whether there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the required shuffling. Is that possible?
(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)
TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). This is very good on Haswell and later, where vbroadcastsd ymm,[mem] doesn't require an ALU shuffle operation, and it's usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).
It also means you don't need to care about alignment at all, beyond natural alignment for a double.
The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.
(It's usually best to use set1(x), not _mm256_broadcast_sd(&x); If the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose to store -> broadcast-load or to do two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
Source Agner Fog's insn tables (and microarch pdf):
Intel Haswell and later:
vbroadcastsd ymm,[mem] and other broadcast-load insns are 1uop instructions that are handled entirely by a load port (the broadcast happens "for free").
The total cost of doing four broadcast-loads this way is 4 instructions. Fused-domain: 4 uops. Unfused-domain: 4 uops for p2/p3. Throughput: two vectors per cycle.
Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.
Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:
## Haswell maximum broadcast throughput with AVX1
vbroadcastsd ymm0, [rsi]
vbroadcastsd ymm1, [rsi+8]
vbroadcastf128 ymm2, [rsi+16] # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2
vunpcklpd ymm2, ymm2,ymm2
vbroadcastsd ymm4, [rsi+32] # or vaddpd ymm0, [rdx+something]
#add rsi, 40
Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.
AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.
## Haswell maximum broadcast throughput with AVX2
vmovups ymm3, [rsi]
vbroadcastsd ymm0, xmm3 # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd ymm1, ymm3, 0b01_01_01_01 # NASM syntax for 0x55
vpermpd ymm2, ymm3, 0b10_10_10_10
vpermpd ymm3, ymm3, 0b11_11_11_11
vbroadcastsd ymm4, [rsi+32]
vbroadcastsd ymm5, [rsi+40]
vbroadcastsd ymm6, [rsi+48]
vbroadcastsd ymm7, [rsi+56]
vbroadcastsd ymm8, [rsi+64]
vbroadcastsd ymm9, [rsi+72]
vbroadcastsd ymm10,[rsi+80] # or vaddpd ymm0, [rdx + whatever]
#add rsi, 88
AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.
I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.
Intel SnB/IvB:
vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).
vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits for unaligned 32B-loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.
4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (also needed for store-address generation). That only happens on the same bank of two different cache lines in the same cycle.
Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.
SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.
## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd ymm0, [rsi]
add rsi, 8
one per clock, saturating p5 but only using half the capacity of p23.
We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:
vbroadcastf128 ymm2, [rsi+16] # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2 # 1 uop: p5
vunpcklpd ymm2, ymm2,ymm2 # 1 uop: p5
Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.
In my matrix multiplication code I only have to use the broadcast once per kernel, but if you really want to load four doubles in one instruction and then broadcast them to four registers, you can do it like this:
#include <stdio.h>
#include <immintrin.h>
int main() {
double in[] = {1,2,3,4};
double out[4];
__m256d x4 = _mm256_loadu_pd(in);
__m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);
__m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);
_mm256_storeu_pd(out,broad1);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad2);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad3);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
_mm256_storeu_pd(out,broad4);
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}
Edit: Here is another solution based on Paul R's suggestion.
__m256d t1 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d t2 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);
Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.
__m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0]));
__m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2]));
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);
__m256d b0000 = _mm256_permute_pd(b0101, 0);
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);
__m256d b2222 = _mm256_permute_pd(b2323, 0);
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);
In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
Edit: gcc compiles the two loads and the first 2 permutes into
vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
Paul R's suggestion of using _mm256_broadcast_pd() can be written as:
__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
which compiles into
vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11
and is faster than doing two vmovapd+vperm2f128 (tested).
In my code, which is bound by vector execution ports instead of L1 cache accesses, this is still slightly slower than 4 _mm256_broadcast_sd(), but I imagine that L1 bandwidth-constrained code can benefit greatly from this.