How to extract the high 256 bits of a 512-bit __m512i using AVX-512 intrinsics?

In AVX2 intrinsic programming, we can use _mm256_extracti128_si256 to extract the high or low 128 bits of a 256-bit register, but I can't find such an intrinsic for 512-bit registers to extract the high/low 256 bits. How do I extract the high 256 bits of a 512-bit __m512i using AVX2 intrinsics?

There are no AVX2 intrinsics to operate on a type that's new in AVX-512, __m512i.
There are AVX-512 intrinsics for vextracti32x8 (_mm512_extracti32x8_epi32) and vextracti64x4 (_mm512_extracti64x4_epi64) and mask / maskz versions (which is why two different element-size versions of the instructions exist).
You can find them by searching in Intel's intrinsics guide for __m256i _mm512 (i.e. for _mm512 intrinsics that return an __m256i). Or you can search for vextracti64x4; the intrinsics guide search feature works on asm mnemonics. Or look through https://www.felixcloutier.com/x86/ which is scraped from Intel's vol.2 asm manual; search for extract quickly gets to some likely-looking asm instructions; each entry has a section for intrinsics (for instructions that have intrinsics).
There's also _mm512_castsi512_si256 for the low half, of course. That's a no-op (zero asm instructions), so there aren't different element-size versions of it with merge-masking or zero-masking.
__m256i low  = _mm512_castsi512_si256(v512);
__m256i high = _mm512_extracti32x8_epi32(v512, 1);
There's also vshufi64x2 and similar to shuffle with 128-bit granularity, with an 8-bit immediate control mask to grab lanes from 2 source vectors. (Like shufps does for 4-byte elements). That and a cast can get a 256-bit vector from any combination of parts you want.
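For example, something like this (just a sketch; the wrapper name and lane choice are mine) pulls 128-bit lanes 1 and 3 of a __m512i out as a single __m256i:
#include <immintrin.h>

__m256i lanes_1_and_3(__m512i v512)
{
    // vshufi64x2: result lane 0 = source lane 1, result lane 1 = source lane 3
    // (the upper two selections don't matter since we only keep the low 256 bits)
    __m512i shuffled = _mm512_shuffle_i64x2(v512, v512, _MM_SHUFFLE(0, 0, 3, 1));
    return _mm512_castsi512_si256(shuffled);  // free cast
}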

Related

AVX512 exchange low 256 bits and high 256 bits in zmm register

Is there any AVX-512 intrinsics that exchange low 256 bits and high 256 bits in zmm register?
I have a 512bit zmm register with double values. What I want to do is swap zmm[0:255] and zmm[256:511].
__m512d a = {10, 20, 30, 40, 50, 60, 70, 80};
__m512d b = _some_AVX_512_intrinsic(a);
// GOAL: b to be {50, 60, 70, 80, 10, 20, 30, 40}
There is a function that works in the ymm register, but I couldn't find any permute function that works in the zmm register.
You're looking for vshuff64x2 which can shuffle in 128-bit chunks from 2 sources, using an immediate control operand. It's the AVX-512 version of vperm2f128 which you found, but AVX-512 has two versions: one with masking by 32-bit elements, one with masking by 64-bit elements. (The masking is finer-grained than the shuffle, so you can merge or zero on a per-double basis while doing this.) Also integer and FP versions of the same shuffles, like vshufi32x4.
The intrinsic is _mm512_shuffle_f64x2(a, a, _MM_SHUFFLE(1, 0, 3, 2)).
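As a self-contained sketch (the wrapper name is mine):
#include <immintrin.h>

__m512d swap_halves(__m512d a)
{
    // vshuff64x2 with an immediate: take 128-bit chunks 2,3 of a, then chunks 0,1 of a
    return _mm512_shuffle_f64x2(a, a, _MM_SHUFFLE(1, 0, 3, 2));
}
// With a = {10, 20, 30, 40, 50, 60, 70, 80} this returns {50, 60, 70, 80, 10, 20, 30, 40}.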
Note that on Intel Ice Lake CPUs, storing in two 32-byte halves with vmovupd / vextractf64x4 mem, zmm, 1 might be nearly as efficient, if you're storing. The vextract can't micro-fuse the store-address and store-data uops, but no shuffle port is involved on Intel including Skylake-X. (Unlike Zen4 I think). And Intel Ice Lake and later can sustain 2x 32-byte stores per clock, vs. 1x 64-byte aligned store per clock, if both stores are to the same cache line. (It seems the store buffer can commit two stores to the same cache line if they're both at the head of the queue.)
If the data's coming from memory, loading an __m256 + vinsertf64x4 is cheap, especially on Zen 4, but on Intel it's 2 uops: one load, one for any vector ALU port (p0 or p5). A merge-masked 256-bit broadcast might be cheaper if the mask register can stay set across loop iterations, like _mm512_mask_broadcast_f64x4(_mm512_castpd256_pd512(low), 0b11110000, _mm256_loadu_pd(addr+4)). That still takes an ALU uop on Skylake-X and Ice Lake, but it can micro-fuse with the load.
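Spelled out as a sketch (addr is a hypothetical double* source, as above):
__m512d load_two_halves(const double* addr)
{
    __m512d lo = _mm512_castpd256_pd512(_mm256_loadu_pd(addr));
    // Merge-masked vbroadcastf64x4: the upper 4 doubles come from the second 32-byte load,
    // the lower 4 are kept from lo. In a loop the 0b11110000 mask can stay live in a k register.
    return _mm512_mask_broadcast_f64x4(lo, 0b11110000, _mm256_loadu_pd(addr + 4));
}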
Other instructions that can do the same shuffle include valignq with a rotate count of 4 qwords (using the same vector for both inputs).
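For reference, the valignq version of the half-swap could look like this (a sketch, with casts because the intrinsic is integer-typed):
__m512d swap_halves_valignq(__m512d a)
{
    __m512i ai = _mm512_castpd_si512(a);
    // valignq: rotate the 8 qwords right by 4, using the same vector for both inputs
    return _mm512_castsi512_pd(_mm512_alignr_epi64(ai, ai, 4));
}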
Or of course any variable-control shuffle like vpermpd, but unlike for __m256d (4 doubles), 8 elements is too wide for an arbitrary shuffle with an 8-bit control.
On existing AVX-512 CPUs, a 2-input shuffle like valignq or vshuff64x2 is as efficient as vpermpd with a control vector, including on Zen 4, which has wide shuffle units and so isn't super slow for lane-crossing shuffles the way Zen 1 was. Maybe on Xeon Phi (KNL) it would be worth loading a control vector for vpermpd if you have to do this repeatedly and can't just load or store in 2 halves. (https://agner.org/optimize/ and https://uops.info/)

Store lower 16 bits of each AVX 32-bit element to memory

I have 8 integer values in an AVX __m256i vector, all capped at 0xffff, so the upper 16 bits of each element are zero.
Now I want to store these 8 values as 8 consecutive uint16_t values.
How can I write them to memory in this way? Can I somehow convert an __m256i value of 8 packed integers into a __m128i value that holds 8 packed shorts?
I am targeting AVX2 intrinsics, but if it can be done in AVX intrinsics, even better.
With AVX2, use _mm256_packus_epi32 + _mm256_permute4x64_epi64 (vpermq) to fix up the in-lane behaviour of packing two __m256i inputs, like @chtz said. Then you can store all 32 bytes of output produced from 64 bytes of input.
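A sketch of that AVX2 path (the function name is mine; both inputs hold 32-bit values already capped at 0xffff):
__m256i pack_two(__m256i a, __m256i b)
{
    // vpackusdw works in-lane: the 64-bit chunks come out as {a0..a3, b0..b3, a4..a7, b4..b7}
    __m256i packed = _mm256_packus_epi32(a, b);
    // vpermq puts them back in order, {a0..a7, b0..b7}, ready for one 32-byte store
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}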
With AVX1, extract the high half of one vector and pack it against the low half into a __m128i with _mm_packus_epi32. That still costs 2 shuffle instructions but produces half the width of output data from them. (Although it's fine on Zen 1, where YMM registers are treated as 2x 128-bit halves anyway and vextractf128 is cheaper than on CPUs where it's an actual shuffle.)
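Sketch of that AVX1 variant (one __m256i in, one __m128i of 8 shorts out):
__m128i pack_one_avx1(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);       // free
    __m128i hi = _mm256_extractf128_si256(v, 1);  // vextractf128
    return _mm_packus_epi32(lo, hi);              // SSE4.1 packusdw; no lane fixup needed
}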
Of course, with only AVX1 you're unlikely to have integer data in a __m256i unless it was loaded from memory, in which case you should just do 16-byte _mm_loadu_si128 loads in the first place. But with AVX2 it is probably worth doing 32-byte loads even though that means you need 2 shuffles per store instead of 1, especially if any of your inputs aren't aligned by 16.

32 byte aligned allocator for aligned eigen map

I use the Eigen Map<> template class to reinterpret chunks of C++ arrays as fixed-size Eigen arrays. It seems that Eigen's aligned allocator provides 16-byte-aligned allocation. What is the proper way to deal with AVX? Should I build my own allocator?
using Block = Eigen::Array<float, 8, 1>;
using Map = Eigen::Map<Block, Eigen::Aligned32>;
template <class T> using allocator = Eigen::aligned_allocator<T>;
std::vector<float, allocator<float>> X(10000);
Map myMap(&X[0]); // should be 32-byte aligned for AVX
Thank you for your help
The documentation is outdated: internally, Eigen::aligned_allocator is a wrapper around Eigen::internal::aligned_malloc, which returns memory aligned as defined by EIGEN_MAX_ALIGN_BYTES:
EIGEN_MAX_ALIGN_BYTES - Must be a power of two, or 0. Defines an upper
bound on the memory boundary in bytes on which dynamically and
statically allocated data may be aligned by Eigen. If not defined, a
default value is automatically computed based on architecture,
compiler, and OS. This option is typically used to enforce binary
compatibility between code/libraries compiled with different SIMD
options. For instance, one may compile AVX code and enforce ABI
compatibility with existing SSE code by defining
EIGEN_MAX_ALIGN_BYTES=16. In the other way round, since by default AVX
implies 32 bytes alignment for best performance, one can compile SSE
code to be ABI compatible with AVX code by defining
EIGEN_MAX_ALIGN_BYTES=32.
So basically, if you compile with -mavx, you'll get 32-byte-aligned pointers.
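A quick way to sanity-check that (a sketch; assumes Eigen 3.3+ where Eigen::Aligned32 exists, and a build with -mavx so EIGEN_MAX_ALIGN_BYTES defaults to 32):
#include <Eigen/Core>
#include <vector>
#include <cassert>
#include <cstdint>

using Block = Eigen::Array<float, 8, 1>;
using Map   = Eigen::Map<Block, Eigen::Aligned32>;

int main()
{
    std::vector<float, Eigen::aligned_allocator<float>> X(10000);
    assert(reinterpret_cast<std::uintptr_t>(X.data()) % 32 == 0);  // holds when EIGEN_MAX_ALIGN_BYTES >= 32
    Map myMap(X.data());  // the Aligned32 promise is now valid
    myMap += 1.0f;
}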

Why is the register length static in any CPU

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking whether it was possible to have a CPU with no definite register size. That makes no sense, since the number and size of the registers are a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register, or pair registers together.
x86 does both, for example: add al, 9 uses only 8 bits of the 64-bit rax, and div rbx pairs rdx:rax to form a 128-bit dividend.
The reason this scheme isn't more widespread is that it comes with a lot of trade-offs.
More registers means more bits needed to address them, simply put: longer instructions.
Longer instructions mean less code density, more complex decoders and less performance.
Furthermore, most elementary operations (the logical ones, addition, subtraction) are already implemented to operate on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time: we cannot issue eight 8-bit additions to a 64-bit ALU at the same time.
So there wouldn't be any improvement, neither in latency nor in throughput.
Accessing partial registers is useful for effectively increasing the number of available registers: for example, if an algorithm works with 16-bit data, the programmer could use a single physical 64-bit register to store four items and operate on them independently (but not in parallel).
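As a toy illustration of that last point (plain C++, names are mine, not tied to any particular ISA):
#include <cstdint>

// Four 16-bit items packed into one 64-bit value; each is extracted, updated and
// re-inserted on its own: denser register use, but no parallelism.
uint16_t get16(uint64_t packed, unsigned i) { return uint16_t(packed >> (16 * i)); }

uint64_t set16(uint64_t packed, unsigned i, uint16_t v)
{
    uint64_t mask = uint64_t(0xFFFF) << (16 * i);
    return (packed & ~mask) | (uint64_t(v) << (16 * i));
}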
ISAs with variable-length instructions can also benefit from using partial registers, because that usually means smaller immediate values: an instruction that sets a register to a specific value usually has an immediate operand matching the size of the register being loaded (though RISC ISAs usually sign-extend or zero-extend a smaller immediate).
Architectures like ARM (and presumably others as well) support half-precision floats. The idea is to do what you were speculating about and what @Margaret explained: with half-precision floats you can pack two float values into a single register, needing less bandwidth at the cost of reduced accuracy.

Why is SSE aligned read + shuffle slower than unaligned read on some CPUs but not on others?

While trying to optimize misaligned reads necessary for my finite differences code, I changed unaligned loads like this:
__m128 pm1 = _mm_loadu_ps(&H[k-1]);
into this aligned read + shuffle code:
__m128 p0 = _mm_load_ps(&H[k]);
__m128 pm4 = _mm_load_ps(&H[k-4]);
__m128 pm1 = _mm_shuffle_ps(p0, p0, 0x90);     // move 3 floats to higher positions
__m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03);  // get missing lowest float
pm1 = _mm_move_ss(pm1, tpm1);                  // pack lowest float with 3 others
where H is 16-byte aligned; there was also a similar change for H[k+1] and H[k±3], and a movlhps & movhlps optimization for H[k±2] (here's the full code of the loop).
I found that on my Core i7-930 the optimization for reading H[k±3] was fruitful, while adding the next optimization, for ±1, slowed my loop down (by a few percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed it down (by a few percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including the one with movlhps/movhlps, led to a slowdown.
Why is it so different on different CPUs? Is it because of the increase in code size, so that the loop might not fit in some instruction cache? Or is it because some CPUs are insensitive to misaligned reads while others are much more sensitive? Or maybe shuffles are slow on some CPUs?
Every couple of years Intel comes out with a new microarchitecture. The number of execution units may change, and instructions that previously could execute on only one execution unit may have 2 or 3 ports available in newer processors. The latency of an instruction might change too, for example when a shuffle execution unit is added.
Intel goes into some detail in their Optimization Reference Manual; here's the link, and below I've copied the relevant sections.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
section 3.5.2.7 Floating-Point/SIMD Operands
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional μops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer μops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for a unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.
section 6.5.1.2 Data Swizzling
Swizzling data from SoA to AoS format can apply to a number of application domains, including 3D geometry, video and imaging. Two different swizzling techniques can be adapted to handle floating-point and integer data. Example 6-3 illustrates a swizzle function that uses SHUFPS, MOVLHPS, MOVHLPS instructions.
The technique in Example 6-3 (loading 16 bytes, using SHUFPS and copying halves of XMM registers) is preferable over an alternate approach of loading halves of each vector using MOVLPS/MOVHPS on newer microarchitectures. This is because loading 8 bytes using MOVLPS/MOVHPS can create code dependency and reduce the throughput of the execution engine. The performance considerations of Example 6-3 and Example 6-4 often depends on the characteristics of each microarchitecture. For example, in Intel Core microarchitecture, executing a SHUFPS tend to be slower than a PUNPCKxxx instruction. In Enhanced Intel Core microarchitecture, SHUFPS and PUNPCKxxx instruction all executes with 1 cycle throughput due to the 128-bit shuffle execution unit. Then the next important consideration is that there is only one port that can execute PUNPCKxxx vs. MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both techniques improves on Intel Core microarchitecture over previous microarchitectures due to 3 ports for executing SIMD instructions. Both techniques improves further on Enhanced Intel Core microarchitecture due to the 128-bit shuffle unit.
On older CPUs misaligned loads have a large performance penalty: they generate two bus read cycles plus some additional fix-up afterwards, which typically makes them 2x or more slower than aligned loads. However, on more recent CPUs (e.g. Core i7) the penalty for misaligned loads is almost negligible. So if you need to support both old and new CPUs, you'll probably want to handle misaligned loads differently for each.
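In practice that can mean keeping both variants around and choosing per target CPU; a rough sketch reusing the question's own shuffle sequence (helper names are mine):
#include <xmmintrin.h>

// Newer cores (Nehalem / Core i7 and later): the plain unaligned load is usually fine.
__m128 load_pm1_simple(const float* H, int k)
{
    return _mm_loadu_ps(&H[k - 1]);
}

// Cores where unaligned loads are expensive (e.g. Core 2, where the question measured a win):
// aligned loads + shuffles; assumes H is 16-byte aligned and k is a multiple of 4.
__m128 load_pm1_shuffled(const float* H, int k)
{
    __m128 p0   = _mm_load_ps(&H[k]);
    __m128 pm4  = _mm_load_ps(&H[k - 4]);
    __m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);    // shift 3 floats up one slot
    __m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03);  // bring H[k-1] into element 0
    return _mm_move_ss(pm1, tpm1);
}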
