How do I broadcast the lowest word of a __m256i? - intrinsics

I am trying to write AVX2 code using intrinsics. I want to know how to use Intel intrinsics to broadcast the lowest word of a YMM register to an entire YMM register. I know that in assembly I could just write
vpbroadcastw ymm1, xmm0
because the lowest word of ymm0 is also the lowest word of xmm0. I have a variable x that holds a value in a YMM register. But
_mm256_broadcastw_epi16((__m128i) x)
where x is an __m256i, returns an error about converting between types of different sizes:
rq_recip3_new.c:381:5: error: can’t convert a value of type ‘__m256i {aka __vector(4) long long int}’ to vector type ‘__vector(2) long long int’ which has different size
I don't think this matters but my machines use gcc 6.4.1 and 7.3 (Fedora 25 and Ubuntu LTS 16.04 respectively).

The following should work:
__m256i broadcast_word(__m256i x){
    return _mm256_broadcastw_epi16(_mm256_castsi256_si128(x));
}
With intrinsics, _mm256_castsi256_si128 is the right way to cast from 256 to 128 bits.
With Godbolt Compiler Explorer this compiles to (gcc 7.3):
broadcast_word:
vpbroadcastw ymm0, xmm0
ret
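
If it helps, here is a minimal (hypothetical) test harness for the answer's function; the broadcast_word definition is repeated so the snippet stands alone, and it assumes compilation with -mavx2:

#include <immintrin.h>
#include <stdio.h>

__m256i broadcast_word(__m256i x){
    return _mm256_broadcastw_epi16(_mm256_castsi256_si128(x));
}

int main(void)
{
    /* Lowest 16-bit element is 7; after broadcasting, all lanes should hold 7. */
    __m256i x = _mm256_setr_epi16(7, 1, 2, 3, 4, 5, 6, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16);
    __m256i b = broadcast_word(x);

    short out[16];
    _mm256_storeu_si256((__m256i *)out, b);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* expected output: sixteen 7s */
    printf("\n");
    return 0;
}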

Related

How to extract the high 256 bits of a 512-bit __m512i using AVX-512 intrinsics?

In AVX2 intrinsic programming, we can use _mm256_extracti128_si256 to extract the high or low 128 bits of a 256-bit register, but I can't find such an intrinsic for a 512-bit register to extract the high or low 256 bits. How can I extract the high 256 bits of a 512-bit __m512i using AVX2 intrinsics?
There are no AVX2 intrinsics to operate on a type that's new in AVX-512, __m512i.
There are AVX-512 intrinsics for vextracti32x8 (_mm512_extracti32x8_epi32) and vextracti64x4 (_mm512_extracti64x4_epi64) and mask / maskz versions (which is why two different element-size versions of the instructions exist).
You can find them by searching in Intel's intrinsics guide for __m256i _mm512 (i.e. for _mm512 intrinsics that return an __m256i). Or you can search for vextracti64x4; the intrinsics guide search feature works on asm mnemonics. Or look through https://www.felixcloutier.com/x86/ which is scraped from Intel's vol.2 asm manual; search for extract quickly gets to some likely-looking asm instructions; each entry has a section for intrinsics (for instructions that have intrinsics).
There's also _mm512_castsi512_si256 for the low half, of course. That's a no-op (zero asm instructions), so there aren't different element-size versions of it with merge-masking or zero-masking.
__m256i low = _mm512_castsi512_si256 (v512);
__m256i high = _mm512_extracti32x8_epi32 (v512, 1);
There's also vshufi64x2 and similar to shuffle with 128-bit granularity, with an 8-bit immediate control mask to grab lanes from 2 source vectors. (Like shufps does for 4-byte elements). That and a cast can get a 256-bit vector from any combination of parts you want.
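For example, here's a hedged sketch (the function name is mine) of getting the upper 256 bits with vshufi64x2 plus a cast, as an alternative to the extract intrinsic; the immediate 0x0E picks 128-bit lanes 2 and 3 of the first source for the low half of the shuffle result:

#include <immintrin.h>

static inline __m256i high256_via_shuffle(__m512i v)
{
    /* Lanes 2 and 3 of v move down to lanes 0 and 1; the upper half of the
       shuffle result is a don't-care here. */
    __m512i hi = _mm512_shuffle_i64x2(v, v, 0x0E);
    return _mm512_castsi512_si256(hi);
}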

Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?

Can a scaled 64bit/32bit division performed by the hardware 128bit/64bit division instruction, such as:
; Entry arguments: Dividend in EAX, Divisor in EBX
shl rax, 32 ;Scale up the Dividend by 2^32
xor rdx,rdx
and rbx, 0xFFFFFFFF ;Clear any garbage that might have been in the upper half of RBX
div rbx ; RAX = RDX:RAX / RBX
...be faster in some special cases than the scaled 64bit/32bit division performed by the hardware 64bit/32bit division instruction, such as:
; Entry arguments: Dividend in EAX, Divisor in EBX
mov edx,eax ;Scale up the Dividend by 2^32
xor eax,eax
div ebx ; EAX = EDX:EAX / EBX
By "some special cases" I mean unusual dividends and divisors.
I am interested in comparing the div instruction only.
You're asking about optimizing uint64_t / uint64_t C division to a 64b / 32b => 32b x86 asm division, when the divisor is known to be 32-bit. The compiler must of course avoid the possibility of a #DE exception on a perfectly valid (in C) 64-bit division, otherwise it wouldn't have followed the as-if rule. So it can only do this if it's provable that the quotient will fit in 32 bits.
Yes, that's a win or at least break-even. On some CPUs it's even worth checking for the possibility at runtime because 64-bit division is so much slower. But unfortunately current x86 compilers don't have an optimizer pass to look for this optimization even when you do manage to give them enough info that they could prove it's safe. e.g. if (edx >= ebx) __builtin_unreachable(); doesn't help last time I tried.
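A minimal sketch of that kind of hint (the function name is mine); the promise would make a 32-bit div safe, since x < y guarantees the quotient of (x << 32) / y fits in 32 bits, but current GCC and clang still emit div r64:

#include <stdint.h>

uint32_t scaled_div(uint32_t x, uint32_t y)
{
    if (x >= y)
        __builtin_unreachable();   /* promise: quotient fits in 32 bits */
    /* Still compiles to a 128/64-bit div on current GCC/clang. */
    return (uint32_t)(((uint64_t)x << 32) / y);
}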
For the same inputs, 32-bit operand-size will always be at least as fast
16-bit or 8-bit could maybe be slower than 32-bit because they may have a false dependency when writing their output, but writing a 32-bit register zero-extends to 64 bits to avoid that. (That's why mov ecx, ebx is a good way to zero-extend ebx to 64-bit, better than and with a mask that isn't encodeable as a 32-bit sign-extended immediate, as harold pointed out.) But other than partial-register shenanigans, 16-bit and 8-bit division are generally as fast as 32-bit, or at least not worse.
On AMD CPUs, division performance doesn't depend on operand-size, just the data. 0 / 1 with 128/64-bit should be faster than the worst case of any smaller operand-size. AMD's integer-division instruction is only 2 uops (presumably because it has to write 2 registers), with all the logic done in the execution unit.
16-bit / 8-bit => 8-bit division on Ryzen is a single uop (because it only has to write AH:AL = AX).
On Intel CPUs, div/idiv is microcoded as many uops. About the same number of uops for all operand-sizes up to 32-bit (Skylake = 10), but 64-bit is much much slower. (Skylake div r64 is 36 uops, Skylake idiv r64 is 57 uops). See Agner Fog's instruction tables: https://agner.org/optimize/
div/idiv throughput for operand-sizes up to 32-bit is fixed at 1 per 6 cycles on Skylake. But div/idiv r64 throughput is one per 24-90 cycles.
See also Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux for a specific performance experiment where modifying the REX.W prefix in an existing binary to change div r64 into div r32 made a factor of ~3 difference in throughput.
And Why does Clang do this optimization trick only from Sandy Bridge onward? shows clang opportunistically using 32-bit division when the dividend is small, when tuning for Intel CPUs. But you have a large dividend and a large-enough divisor, which is a more complex case. That clang optimization is still zeroing the upper half of the dividend in asm, never using a non-zero or non-sign-extended EDX.
I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer.
I'm assuming you cast that 32-bit integer to uint64_t first, to avoid UB and get a normal uint64_t / uint64_t in the C abstract machine.
That makes sense: your way wouldn't be safe; it would fault with #DE whenever edx >= ebx. x86 division faults when the quotient overflows AL / AX / EAX / RAX, instead of silently truncating. There's no way to disable that.
So compilers normally only use idiv after cdq or cqo, and div only after zeroing the high half, unless you use an intrinsic or inline asm to open yourself up to the possibility of your code faulting. In C, x / y only faults if y = 0 (or for signed, INT_MIN / -1 is also allowed to fault; see footnote 1).
GNU C doesn't have an intrinsic for wide division, but MSVC has _udiv64. (With gcc/clang, division wider than 1 register uses a helper function which does try to optimize for small inputs. But this doesn't help for 64/32 division on a 64-bit machine, where GCC and clang just use the 128/64-bit division instruction.)
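A hedged sketch of the MSVC route, assuming _udiv64's documented shape (64-bit dividend, 32-bit divisor, 32-bit quotient returned, remainder written through a pointer); it maps directly onto one div with EDX:EAX as the dividend, so it will fault if the quotient overflows 32 bits:

#include <intrin.h>
#include <stdint.h>

uint32_t scaled_div_msvc(uint32_t x, uint32_t y)
{
    unsigned int rem;
    /* Caller must guarantee x < y, or this faults with #DE. */
    return _udiv64((unsigned __int64)x << 32, y, &rem);
}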
Even if there were some way to promise the compiler that your divisor would be big enough to make the quotient fit in 32 bits, current gcc and clang don't look for that optimization in my experience. It would be a useful optimization for your case (if it's always safe), but compilers won't look for it.
Footnote 1: To be more specific, ISO C describes those cases as "undefined behaviour"; some ISAs like ARM have non-faulting division instructions. C UB means anything can happen, including just truncation to 0 or some other integer result. See Why does integer division by -1 (negative one) result in FPE? for an example of AArch64 vs. x86 code-gen and results. Allowed to fault doesn't mean required to fault.
Can 128bit/64bit hardware unsigned division be faster in some cases than 64bit/32bit division on x86-64 Intel/AMD CPUs?
In theory, anything is possible (e.g. maybe in 50 years time Nvidia creates an 80x86 CPU that ...).
However, I can't think of a single plausible reason why a 128bit/64bit division would ever be faster than (not merely equivalent to) a 64bit/32bit division on x86-64.
I suspect this because I assume that C compiler authors are very smart, and so far I have failed to make the popular C compilers generate the latter code when dividing an unsigned 32-bit integer (shifted left 32 bits) by another 32-bit integer. It always compiles to the 128bit/64bit div instruction. P.S. The left shift compiles fine to shl.
Compiler developers are smart, but compilers are complex and the C language rules get in the way. For example, if you just do a = b/c; (with b being 64-bit and c being 32-bit), the language's rules are that c gets promoted to 64-bit before the division happens, so it ends up being a 64-bit divisor in some kind of intermediate language, and that makes it hard for the back-end translation (from intermediate language to assembly language) to tell that the 64-bit divisor could be a 32-bit divisor.
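As a concrete illustration of that promotion (the function name is mine): even though c is only 32 bits, the division below is a full uint64_t / uint64_t in the C abstract machine, so x86-64 compilers emit div r64:

#include <stdint.h>

uint64_t quotient(uint64_t b, uint32_t c)
{
    /* c is converted (promoted) to uint64_t before the division happens. */
    return b / c;
}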

Why doesn't gcc or clang on ARM use "Division by Invariant Integers using Multiplication" trick?

There is a well-known trick to perform division by an invariant integer actually without doing division at all, but instead doing multiplication. This has been discussed on Stack Overflow as well in Perform integer division using multiplication and in Why does GCC use multiplication by a strange number in implementing integer division?
However, I recently tested the following code both on AMD64 and on ARM (Raspberry Pi 3 model B):
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char **argv)
{
    volatile uint64_t x = 123456789;
    volatile uint64_t y = 0;
    struct timeval tv1, tv2;
    int i;
    gettimeofday(&tv1, NULL);
    for (i = 0; i < 1000*1000*1000; i++)
    {
        y = (x + 999) / 1000;
    }
    gettimeofday(&tv2, NULL);
    printf("%g MPPS\n", 1e3 / ( tv2.tv_sec - tv1.tv_sec +
                                (tv2.tv_usec - tv1.tv_usec) / 1e6));
    return 0;
}
The code is horribly slow on ARM architectures. In contrast, on AMD64 it's extremely fast. I noticed that on ARM, it calls __aeabi_uldivmod, whereas on AMD64 it actually does not divide at all but instead does the following:
.L2:
movq (%rsp), %rdx
addq $999, %rdx
shrq $3, %rdx
movq %rdx, %rax
mulq %rsi
shrq $4, %rdx
subl $1, %ecx
movq %rdx, 8(%rsp)
jne .L2
The question is: why? Is there some particular feature of the ARM architecture that makes this optimization infeasible? Or is ARM just rare enough that optimizations like this haven't been implemented?
Before people start making suggestions in the comments, I'll say that I tried both gcc and clang, and also tried the -O2 and -O3 optimization levels.
On my AMD64 laptop, it gives 1181.35 MPPS, whereas on the Raspberry Pi it gives 5.50628 MPPS. This is over 2 orders of magnitude difference!
gcc only uses a multiplicative inverse for division of register-width or narrower. You're testing x86-64 against ARM32, so uint64_t gives x86-64 a huge advantage in this case.
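As a quick way to see that claim in action, a hedged sketch (the function name is mine): with a register-width uint32_t, gcc targeting ARM32 should use the multiply-by-inverse trick (umull plus shifts) instead of calling a library division routine, unlike the uint64_t loop above which goes through __aeabi_uldivmod:

#include <stdint.h>

uint32_t div1000_u32(uint32_t x)
{
    /* Register-width division by a constant: expected to compile to a
       multiplicative-inverse sequence on both ARM32 and x86. */
    return (x + 999u) / 1000u;
}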
Extended-precision multiply would probably be worth it on 32-bit CPUs with high-throughput multiply, like modern x86, and maybe also your Cortex-A7 ARM if it has a multiplier that's much better pipelined than its divider.
It would take multiple mul instructions to get the high half of a 64b x 64b => 128b full multiply result using only 32x32 => 64b as a building block. (IIRC ARM32 has this.)
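A hedged sketch of what that decomposition looks like in portable C (the function name is mine): the high 64 bits of a 64x64-bit product built from four 32x32 => 64-bit partial products, which is roughly what a 32-bit target would have to do with umull-style instructions:

#include <stdint.h>

uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Four 32x32 => 64-bit partial products. */
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum the cross terms and propagate the carry into the high half. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}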
However, this is not what gcc or clang choose to do, at any optimization level.
If you want to handicap your x86 CPU, compile 32-bit code using -m32. x86 gcc -O3 -m32 will use __udivdi3. I wouldn't call this "fair", though, because a 64-bit CPU is much faster at 64-bit arithmetic, and a Cortex-A7 doesn't have a 64-bit mode available.
OTOH, a 32-bit-only x86 CPU couldn't be much faster than current x86 CPUs in 32-bit mode; the main cost of the extra transistors that go unused in 32-bit mode is die area and power, not so much top-end clock speed. Some low-power-budget CPUs (like ULV laptop chips) might sustain max turbo for longer if they were designed without support for long mode (x86-64), but that's pretty minor.
So it might be interesting to benchmark 32-bit x86 vs. 32-bit ARM, just to learn something about the microarchitectures maybe. But if you care about 64-bit integer performance, definitely compile for x86-64 not x86-32.

Why does gcc compile _mm256_permute2f128_ps to a vinsertf128 instruction?

This instruction is part of the assembly output of a C program (gcc -O2). From the result, I understand that ymm6 is source operand 1: all of it is copied to ymm9, and then xmm1 is copied into the upper 128 bits [255:128]. I read the Intel manual, but it uses Intel assembly syntax, not AT&T, and I don't want to use Intel syntax. So ymm8, ymm2 and ymm6 here are SRC1 -- is this true?
vshufps $68, %ymm0, %ymm8, %ymm6
vshufps $68, %ymm4, %ymm2, %ymm1
vinsertf128 $1, %xmm1, %ymm6, %ymm9
And the main question is why gcc changed the instruction
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
to
vinsertf128 $1, %xmm1, %ymm6, %ymm9
and
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
to
vperm2f128 $49, %ymm1, %ymm6, %ymm1
How can I prevent this optimization? I tried -O0, but it doesn't work.
So ymm8, ymm2 and ymm6 here is SRC1. is this true?
Yes, the middle operand is always src1 in a 3-operand instruction in both syntaxes.
AT&T: op %src2, %src1, %dest
Intel: op dest, src1, src2
I don't want to use Intel syntax
Tough. The only really good documentation I know of for exactly what every instruction does is the Intel insn ref manual. I used to think AT&T syntax was better, because the $ and % decorators remove ambiguity. I do like that, but otherwise prefer the Intel syntax now. The rules for each are simple enough that you can easily mentally convert, or "think" in whichever one you're reading ATM.
Unless you're actually writing GNU C inline asm, you can just use gcc -masm=intel and objdump -Mintel to get GNU-flavoured asm using intel mnemonics, operand order, and so on. The assembler directives are still gas style, not NASM. Use http://gcc.godbolt.org/ to get nicely-formatted asm output for code with only the essential labels left in.
gcc and clang both have some understanding of what the intrinsics actually do, so internally they translate the intrinsic to some data movement. When it comes time to emit code, they see that said data movement can be done with vinsertf128, so they emit that.
On some CPUs (Intel SnB-family), both instructions have equal performance, but on AMD Bulldozer-family (which only has 128b ALUs), vinsertf128 is much faster than vperm2f128. (source: see Agner Fog's guides, and other links at the x86 tag wiki). They both take 6 bytes to encode, including the immediate, so there's no code-size difference. vinsertf128 is always a better choice than a vperm2f128 that does identical data movement.
gcc and clang don't have a "literal translation of intrinsics to instructions" mode, because it would take extra work to implement. If you care exactly which instructions the compiler uses, that's what inline asm is for.
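For example, a hedged GNU C inline-asm sketch (assuming the default AT&T syntax and an AVX-enabled build) that pins down vperm2f128 with the 0x20 control byte, where the intrinsic would leave the instruction choice to the compiler:

#include <immintrin.h>

static inline __m256 perm2f128_low_halves(__m256 a, __m256 b)
{
    __m256 r;
    /* Low half of r = low 128 bits of a, high half of r = low 128 bits of b,
       i.e. the same data movement as _mm256_permute2f128_ps(a, b, 0x20);
       the compiler cannot swap this for vinsertf128 because the instruction
       is spelled out literally. */
    asm("vperm2f128 $0x20, %2, %1, %0" : "=x"(r) : "x"(a), "x"(b));
    return r;
}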
Keep in mind that -O0 doesn't mean "no optimization". It still has to transform through a couple internal representations before emitting asm.
Examination of the instructions that bind to port 5 in the instruction analysis report shows that the instructions were broadcasts and vpermilps. The broadcasts can only execute on port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions reduces the pressure on port 5 because vinsertf128 can execute on port 0. (From the IACA User Guide.)

Unnecessary instructions generated for _mm_movemask_epi8 intrinsic in x64 mode

The intrinsic function _mm_movemask_epi8 from SSE2 is defined by Intel with the following prototype:
int _mm_movemask_epi8 (__m128i a);
This intrinsic function directly corresponds to the pmovmskb instruction, which is generated by all compilers.
According to this reference, the pmovmskb instruction can write the resulting integer mask to either a 32-bit or a 64-bit general purpose register in x64 mode. In any case, only 16 lower bits of the result can be nonzero, i.e. the result is surely within range [0; 65535].
Speaking of the intrinsic function _mm_movemask_epi8, its returned value is of type int, which is a signed 32-bit integer on most platforms. Unfortunately, there is no alternative function which returns a 64-bit integer in x64 mode. As a result:
The compiler usually generates the pmovmskb instruction with a 32-bit destination register (e.g. eax).
The compiler cannot assume that the upper 32 bits of the whole register (e.g. rax) are zero.
The compiler inserts an unnecessary instruction (e.g. mov eax, eax) to zero the upper half of the 64-bit register, given that the register is later used as a 64-bit value (e.g. as an index into an array).
An example of code and generated assembly with such a problem can be seen in this answer. Also the comments to that answer contain some related discussion. I regularly experience this problem with MSVC2013 compiler, but it seems that it is also present on GCC.
The questions are:
Why is this happening?
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated internally by the CPU so that they don't actually occupy execution units? (Agner Fog's instruction tables document mentions such a possibility.)
Why is this happening?
gcc's internal instruction definitions that tell it what pmovmskb does must be failing to inform it that the upper 32 bits of rax will always be zero. My guess is that it's treated like a function-call return value, where the ABI allows a function returning a 32-bit int to leave garbage in the upper 32 bits of rax.
GCC does know that 32-bit operations in general zero-extend for free, but this missed optimization is widespread for intrinsics, also affecting scalar intrinsics like _mm_popcnt_u32.
There's also the issue of gcc not knowing that the actual result only has set bits in the low 16 of its 32-bit int result (unless you used AVX2 vpmovmskb ymm). So actual sign extension is unnecessary; implicit zero extension is totally fine.
Is there any way to reliably avoid generation of unnecessary instructions on popular compilers? In particular, when result is used as index, i.e. in x = array[_mm_movemask_epi8(xmmValue)];
No, other than fixing gcc. Has anyone reported this as a compiler missed-optimization bug?
clang doesn't have this bug. I added code to Paul R's test to actually use the result as an array index, and clang is still fine.
gcc always either zero- or sign-extends (to a different register in this case, perhaps because it wants to "keep" the 32-bit value in the bottom of RAX, not because it's optimizing for mov-elimination).
Casting to unsigned helps with GCC6 and later; it will use the pmovmskb result directly as part of an addressing mode, although also returning it results in a mov rax, rdx.
And with older GCC, the cast at least gets it to use mov instead of movsxd or cdqe.
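A hedged sketch of that cast (the function and array names are mine), for the array-index use case from the question:

#include <emmintrin.h>
#include <stdint.h>

int32_t lookup(const int32_t *table, __m128i v)
{
    /* Casting to unsigned tells GCC6+ that no sign extension is needed,
       so the mask can feed the addressing mode directly. */
    unsigned mask = (unsigned)_mm_movemask_epi8(v);
    return table[mask];
}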
What is the approximate cost of unnecessary instructions like mov eax, eax on modern CPU architectures? Is there any chance that these instructions are completely eliminated internally by the CPU so that they don't actually occupy execution units? (Agner Fog's instruction tables document mentions such a possibility.)
mov same,same is never eliminated on SnB-family microarchitectures or AMD Zen. mov ecx, eax would be eliminated. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details.
Even if it doesn't take an execution unit, it still takes a slot in the fused-domain part of the pipeline, and a slot in the uop-cache. And code-size. If you're close to the front-end 4 fused-domain uops per clock limit (pipeline width), then it's a problem.
It also costs an extra 1c of latency in the dep chain.
(Back-end throughput is not a problem, though. On Haswell and newer, it can run on port6 which has no vector execution units. On AMD, the integer ports are separate from the vector ports.)
gcc.godbolt.org is a great online resource for testing this kind of issue with different compilers.
clang seems to do the best with this, e.g.
#include <xmmintrin.h>
#include <cstdint>

int32_t test32(const __m128i v) {
    int32_t mask = _mm_movemask_epi8(v);
    return mask;
}

int64_t test64(const __m128i v) {
    int64_t mask = _mm_movemask_epi8(v);
    return mask;
}
generates:
test32(long long __vector(2)): # #test32(long long __vector(2))
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)): # #test64(long long __vector(2))
vpmovmskb eax, xmm0
ret
Whereas gcc generates an extra cdqe instruction in the 64-bit case:
test32(long long __vector(2)):
vpmovmskb eax, xmm0
ret
test64(long long __vector(2)):
vpmovmskb eax, xmm0
cdqe
ret
