Rotate all packed bytes in an AVX2 / AVX-512 register with a minimum of instructions

Is it possible to do byte rotations using AVX2/AVX-512 instructions in less than 5 instructions?
Looking for answers in assembly code as I'm not familiar enough with intrinsics.
In AVX-512, direct rotations can be performed with a single instruction! Unfortunately WORD and BYTE data types are not covered. Shifts can be made down to WORD sizes, so that is a minimum of 3 instructions (2 shifts and 1 add/or/xor). The best I have come up with so far is a minimum of 5 instructions as listed below.
zmm0 is loaded with 64 different byte values
SL3 = memory location of 64 continuous bytes of f8h value (for masking)
SR5 = memory location of 64 continuous bytes of 07h value (for masking)
vpslld zmm1, zmm0, 3    ; shift DWORDs left by 3 bits
vpandd zmm1, zmm1, SL3  ; clear low 3 bits, all bytes
vpsrld zmm2, zmm0, 5    ; shift DWORDs right by 5 bits
vpandd zmm2, zmm2, SR5  ; clear high 5 bits, all bytes
vpord  zmm1, zmm1, zmm2 ; combine left/right shifts for rotation result
Above is either a rotate left by 3 bits or a rotate right by 5 bits, and all 64 bytes yield correct results.
As I said, using WORD size (16 bits) a rotation is a minimum of 3 instructions. But due to needing 5 instructions for byte rotation in a zmm register, I only achieved a 50% performance increase going from the 32-byte AVX2 implementation to the 64-byte AVX-512 algorithm.
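For comparison, here is a rough intrinsics sketch of the same 5-instruction sequence (rotate each byte left by 3), assuming AVX-512F and using set1 constants in place of the SL3/SR5 memory operands:

#include <immintrin.h>

// Sketch only: rotate every byte of a __m512i left by 3 (equivalently right by 5).
// The two set1 masks play the role of SL3 (0xf8) and SR5 (0x07) above.
static inline __m512i rotl3_bytes(__m512i v) {
    __m512i hi = _mm512_and_si512(_mm512_slli_epi32(v, 3), _mm512_set1_epi8((char)0xf8));
    __m512i lo = _mm512_and_si512(_mm512_srli_epi32(v, 5), _mm512_set1_epi8(0x07));
    return _mm512_or_si512(hi, lo);
}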

Related

AVX512 exchange low 256 bits and high 256 bits in zmm register

Is there any AVX-512 intrinsic that exchanges the low 256 bits and high 256 bits of a zmm register?
I have a 512-bit zmm register with double values. What I want to do is swap zmm[0:255] and zmm[256:511].
__m512d a = {10, 20, 30, 40, 50, 60, 70, 80};
__m512d b = _some_AVX_512_intrinsic(a);
// GOAL: b to be {50, 60, 70, 80, 10, 20, 30, 40}
There is a function that works on ymm registers (vperm2f128), but I couldn't find any permute function that works on zmm registers.
You're looking for vshuff64x2 which can shuffle in 128-bit chunks from 2 sources, using an immediate control operand. It's the AVX-512 version of vperm2f128 which you found, but AVX-512 has two versions: one with masking by 32-bit elements, one with masking by 64-bit elements. (The masking is finer-grained than the shuffle, so you can merge or zero on a per-double basis while doing this.) Also integer and FP versions of the same shuffles, like vshufi32x4.
The intrinsic is _mm512_shuffle_f64x2(a,a, _MM_SHUFFLE(1,0, 3,2))
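A minimal compilable sketch of that intrinsic (the test harness and printed output are my own illustration, assuming AVX-512F):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d a = _mm512_setr_pd(10, 20, 30, 40, 50, 60, 70, 80);
    // Select the 128-bit chunks in order 2,3,0,1 -> swaps the 256-bit halves.
    __m512d b = _mm512_shuffle_f64x2(a, a, _MM_SHUFFLE(1, 0, 3, 2));
    double out[8];
    _mm512_storeu_pd(out, b);
    for (int i = 0; i < 8; i++) printf("%g ", out[i]);   // 50 60 70 80 10 20 30 40
    printf("\n");
    return 0;
}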
Note that on Intel Ice Lake CPUs, storing in two 32-byte halves with vmovupd / vextractf64x4 mem, zmm, 1 might be nearly as efficient, if you're storing. The vextract can't micro-fuse the store-address and store-data uops, but no shuffle port is involved on Intel including Skylake-X. (Unlike Zen4 I think). And Intel Ice Lake and later can sustain 2x 32-byte stores per clock, vs. 1x 64-byte aligned store per clock, if both stores are to the same cache line. (It seems the store buffer can commit two stores to the same cache line if they're both at the head of the queue.)
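If you are storing anyway, a sketch of that two-half store (my wrapper, assuming AVX-512F), writing the high half of the source to the low addresses so the stored result comes out swapped:

#include <immintrin.h>

// Sketch: store a __m512d with its 256-bit halves swapped, as two 32-byte stores.
static inline void store_swapped_halves(double *dst, __m512d a) {
    _mm256_storeu_pd(dst,     _mm512_extractf64x4_pd(a, 1));   // high half -> dst[0..3]
    _mm256_storeu_pd(dst + 4, _mm512_castpd512_pd256(a));      // low half  -> dst[4..7]
}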
If the data's coming from memory, loading an __m256d + vinsertf64x4 is cheap, especially on Zen4, but on Intel it's 2 uops: one load, one for any vector ALU port (p0 or p5). A merge-masked 256-bit broadcast might be cheaper if the mask register can stay set across loop iterations. Like _mm512_mask_broadcast_f64x4(_mm512_castpd256_pd512(low), 0b11110000, _mm256_loadu_pd(addr+4)). That still takes an ALU uop on Skylake-X and Ice Lake, but it can micro-fuse with the load.
Other instructions that can do the same shuffle include valignq with a rotate count of 4 qwords (using the same vector for both inputs).
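A hedged sketch of the valignq form (my wrapper; the casts are needed because the intrinsic is integer-typed, assuming AVX-512F):

#include <immintrin.h>

// Sketch: swap the 256-bit halves by rotating the 8 qwords by 4 with valignq,
// using the same vector for both inputs.
static inline __m512d swap_halves_valignq(__m512d a) {
    __m512i ai = _mm512_castpd_si512(a);
    return _mm512_castsi512_pd(_mm512_alignr_epi64(ai, ai, 4));
}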
Or of course any variable-control shuffle like vpermpd, but unlike for __m256d (4 doubles), 8 elements is too wide for an arbitrary shuffle with an 8-bit control.
On existing AVX-512 CPUs, a 2-input shuffle like valignq or vshuff64x2 is as efficient as vpermpd with a control vector, including on Zen4; it has wide shuffle units, so it isn't super slow for lane-crossing stuff like Zen1 was. Maybe on Xeon Phi (KNL) it might be worth loading a control vector for vpermpd if you have to do this repeatedly and can't just load in 2 halves or store in 2 halves. (https://agner.org/optimize/ and https://uops.info/)

Why is LOOP faster than DEC,JNZ on 8086?

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.
But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.
Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.
It's not decode that's the problem, it's usually fetch on 8086.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with its 8-bit bus, unlike 8086's 16-bit bus.)
dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.
8086 performance can be approximated by counting memory accesses and multiplying by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)
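As a rough worked example of that approximation (my numbers, using the byte counts above): dec cx + jnz rel8 is 3 bytes of code per iteration, i.e. roughly 1.5 to 2 word fetches at 4 cycles each, about 6 to 8 cycles of bus time just for code; loop rel8 is 2 bytes, i.e. one word fetch, about 4 cycles.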
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC   Decrement
operand  bytes      8088   186   286     386     486   Pentium
r8       2          3      3     2       2       1     1 UV
r16      1          3      3     2       2       1     1 UV
r32      1          3      3     2       2       1     1 UV
mem      2+d(0,2)   23+EA  15    7       6       3     3 UV

Jcc   Jump on condition code
operand  bytes      8088   186   286     386     486   Pentium
near8    2          4/16   4/13  3/7+m   3/7+m   1/3   1 PV
near16   3          -      -     -       3/7+m   1/3   1 PV

LOOP  Loop control with CX counter
operand  bytes      8088   186   286     386     486   Pentium
short    2          5/17   5/15  4/8+m   11+m    6/7   5/6 NP
So even ignoring code-fetch differences:
dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
taken loop takes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next takes some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles, bizarrely less than nop (3 cycles).
I presume that its decoding should also be more complex
There's no reason that the decoding is more complex for the loop instruction.  This instruction has to do multiple things, but decoding is not at issue — it should decode as easily as JMP, since there's just the opcode and the one operand, the branch target, like JMP.
Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.
Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.
I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.
According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).
The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

Store lower 16 bits of each AVX 32-bit element to memory

I have 8 integer values in an AVX value __m256i which are all capped at 0xffff, so the upper 16 bits are all zero.
Now I want to store these 8 values as 8 consecutive uint16_t values.
How can I write them to memory in this way? Can I somehow convert an __m256i value of 8 packed integers into a __m128i value that holds 8 packed shorts?
I am targeting AVX2 intrinsics, but if it can be done in AVX intrinsics, even better.
With AVX2, use _mm256_packus_epi32 + _mm256_permute4x64_epi64 (vpermq) to fix up the in-lane behaviour of packing two __m256i inputs, like @chtz said. Then you can store all 32 bytes of output from 64 bytes of input.
With AVX1, extract the high half of one vector and use _mm_packus_epi32 to pack down into a __m128i. That still costs 2 shuffle instructions but produces half the width of output data from them. (Although it's good on Zen1, where YMM registers get treated as 2x 128-bit halves anyway, and vextractf128 is cheaper on Zen1 than on CPUs where it's an actual shuffle.)
Of course, with only AVX1 you're unlikely to have integer data in a __m256i unless it was loaded from memory, in which case you should just do _mm_loadu_si128 in the first place. But with AVX2 it is probably worth doing 32 byte loads even though that means you need 2 shuffles per store instead of 1. Especially if any of your inputs aren't aligned by 16.
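A sketch of the AVX2 path (my wrapper; it assumes both __m256i inputs already have their elements capped at 0xffff as in the question):

#include <immintrin.h>
#include <stdint.h>

// Sketch: pack the low 16 bits of 16 dword elements (two __m256i inputs)
// into one 32-byte store. vpackusdw packs within 128-bit lanes, so a
// vpermq fixes the qword order afterwards.
static inline void store_u16_from_u32(uint16_t *dst, __m256i a, __m256i b) {
    __m256i packed  = _mm256_packus_epi32(a, b);
    __m256i ordered = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
    _mm256_storeu_si256((__m256i *)dst, ordered);
}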

What percentage of the bits are used for data in a 32 kB (32,768 byte) direct-mapped write-back cache with a 64-byte cache line?

I encountered this problem, and the solution said:
"32 bit address bits, 64 byte line means we have 6 bits for the word address in the line that aren't in the tag, 32,768 bytes in the cache at 64 byte lines is 512 total lines, which means we have 12 bits of address for the cache index, write back means we need a dirty bit, and we always need a valid bit. So each line has 64*8=512 data bits, 32-6-12=14 tag bits, and 2 flag bits: data/total bits = 512/(512+14+2)=512/528."
When I tried to solve the problem I got 32 kB / 64 bytes = 512 lines in total, i.e. 2^9 = 512. In addition, with a 64-byte cache line and 1 word = 4 bytes, there are 64/4 = 16 words per line, i.e. 2^4.
To my understanding, the total number of bits in a cache is given by (number of entries/lines in the cache) * (tag address bits + data) -> 2^9 * ((32-9-4+2) + 16*32). Thus, the number of data bits per cache line is 512 (16 words * 32 bits per word), and the tag is 32-9-4+2 = 21 (the 9 is the cache index for a direct-mapped cache, the 4 is to address each word, and the 2 is the valid bit and dirty bit).
Effectively, the answer should be 512/533 and not 512/528.
Correct?
512 lines = 9 bits not 12 as they claim, so you are right on this point.
However, they are right that 64-byte lines give 6 bits for the block offset — though it is a byte offset, not a word offset as they say.
So, 32-6-9 = 17 tag bits, plus the 2 for dirty & valid.
FYI, there's nothing in the above problem that indicates a conversion from bytes to words. While it is true that there will be 16 x 32-bit words per line (i.e. 64 bytes per line), it is irrelevant: we should presume that the 32-bit address is a byte address unless otherwise stated. (It would be unusual to state cache size in bytes for a word-addressable (not byte-addressable) machine; it would also be unusual for a 32-bit machine to be word addressable — some teaching architectures like LC-3 are word addressable, however they are 16-bit; other word-addressable machines have odd sizes like 12 or 18 or 36-bit words — though those pre-date caches!)
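Putting the corrected numbers together (my arithmetic, following the answer above): each line holds 512 data bits, 17 tag bits, and 2 flag bits, so data/total = 512 / (512 + 17 + 2) = 512/531 ≈ 96.4%, rather than 512/528 or 512/533.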

Purpose of setting the least significant bits to 0 in MMIX memory operations?

In the documentation of the MMIX machine, mmix-doc page 3, paragraph 4:
We use the notation M_{2^t}[k] to stand for a number consisting of 2^t consecutive bytes starting at location k ∧ (2^64 − 2^t). (The notation k ∧ (2^64 − 2^t) means that the least significant t bits of k are set to 0, and only the least 64 bits of the resulting address are retained.
...
The notation M_{2^t}[k] is just a formal symbolism to express an address divisible by 2^t.
This is confirmed just after the definition
All accesses to 2^t-byte quantities by MMIX are aligned, in the sense that the first byte is a multiple of 2^t.
Most architectures, especially RISC ones, require a memory access to be aligned: the address must be a multiple of the size accessed.
So, for example, reading a 64-bit word (an octa in MMIX notation) from memory requires the address to be divisible by 8, because MMIX memory is byte addressable(1) and there are 8 bytes in an octa.
If all the possible data sizes are powers of two, we see a pattern emerge:
Multiples of 2   Multiples of 4   Multiples of 8
0000             0000             0000
0010             0100             1000
0100             1000
0110             1100
1000
1010
1100
1110
Multiples of 2 = 2^1 have the least bit always set to zero(2), multiples of 4 = 2^2 have the two least bits set to zero, multiples of 8 = 2^3 have the three least bits set to zero, and so on.
In general, multiples of 2^t have the least t bits set to zero.
You can formally prove this by induction over t.
A way to align a 64-bit number (the size of the MMIX address space) is to clear its lower t bits; this can be done by performing an AND operation with a mask of the form
111...1000...0
\_____/\_____/
 64 - t    t
Such a mask can be expressed as 2^64 - 2^t.
2^64 is a big number for an example, so let's pretend the address space is only 2^5.
Let's say we have the address 17h, or 10111b in binary, and we want to align it to an octa boundary.
Octas are 8 bytes = 2^3, so we need to clear the lower 3 bits and preserve the other 2 bits.
The mask to use is 11000b, or 18h in hexadecimal. This number is 2^5 - 2^3 = 32 - 8 = 24 = 18h.
If we perform the boolean AND of 17h and 18h we get 10h, which is the aligned address.
This explains the notation k ∧ (2^64 − 2^t) used shortly after; the "wedge" symbol ∧ is a logical AND.
So this notation just "pictures" the steps necessary to align the address k.
Note that the notation k ∨ (2^t − 1) is also introduced; this is the complementary operation, ∨ is the OR, and the overall effect is to set the lower t bits to 1.
This is the greatest address occupied by an aligned access of size 2^t.
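A small C sketch of both operations (my example, using uint64_t arithmetic rather than MMIX notation):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint64_t k = 0x17;   // the example address 17h = 10111b
    unsigned t = 3;      // octa access: 2^3 = 8 bytes
    uint64_t first = k & ~((UINT64_C(1) << t) - 1);   // k AND (2^64 - 2^t) = 10h
    uint64_t last  = k |  ((UINT64_C(1) << t) - 1);   // k OR  (2^t  - 1)   = 17h
    printf("aligned start: %#" PRIx64 ", last byte: %#" PRIx64 "\n", first, last);
    return 0;
}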
The notation itself is used to explain endianness.
If you wonder why aligned accesses are important, it has to do with the hardware implementation.
Long story short, the CPU's interface to the memory has a predefined width, say 64 bits, despite the memory being byte addressable.
So the CPU accesses the memory in blocks of 64 bits, each one starting at an address that is a multiple of 64 bits (i.e. aligned on 8 bytes).
Accessing an unaligned location may require the CPU to perform two accesses:
Suppose the CPU reads an octa at address 2: it needs the bytes at addresses 2 through 9.
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F ...
        \______A______/ \______B______/
The CPU reads the octa at 0 (access A) and the octa at 8 (access B), then combines the two reads.
RISC machines tend to avoid this complexity and forbid unaligned accesses entirely.
(1) Quoting: "If k is any unsigned octabyte, M[k] is a 1-byte quantity".
(2) 2^0 = 1 is the only odd power of two, so you can guess that by removing it we only get even numbers.

Resources