Store lower 16 bits of each AVX 32-bit element to memory - intrinsics

I have 8 integer values in an AVX value __m256i which are all capped at 0xffff, so the upper 16 bits are all zero.
Now I want to store these 8 values as 8 consecutive uint16_t values.
How can I write them to memory in this way? Can I somehow convert an __m256i value of 8 packed integers into a __m128i value that holds 8 packed shorts?
I am targeting AVX2 intrinsics, but if it can be done in AVX intrinsics, even better.

With AVX2, use _mm256_packus_epi32 + _mm256_permute4x64_epi64 to fix up the in-lane behaviour of packing two __m256i inputs, like @chtz said. Then you can store all 32 bytes of output from 64 bytes of input.
With AVX1, extract the high half of the vector and use _mm_packus_epi32 to pack down into a __m128i. That still costs 2 shuffle instructions but produces half the width of data output from them. (Although it's good on Zen 1, where YMM registers get treated as 2x 128-bit halves anyway, and vextractf128 is cheaper there than on CPUs where it's an actual shuffle.)
Of course, with only AVX1 you're unlikely to have integer data in a __m256i unless it was loaded from memory, in which case you should just do _mm_loadu_si128 in the first place. But with AVX2 it is probably worth doing 32 byte loads even though that means you need 2 shuffles per store instead of 1. Especially if any of your inputs aren't aligned by 16.
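A minimal sketch of both versions, assuming the input values are already capped at 0xffff as described (the function names here are mine, not standard):

#include <immintrin.h>
#include <stdint.h>

// AVX2: 64 bytes of u32 input -> 32 bytes of u16 output.
// vpackusdw packs within each 128-bit lane, so the qwords come out as
// a0-3, b0-3, a4-7, b4-7; vpermq puts them back in source order.
static inline void store16_u16_avx2(uint16_t *dst, __m256i a, __m256i b)
{
    __m256i packed = _mm256_packus_epi32(a, b);
    __m256i fixed  = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
    _mm256_storeu_si256((__m256i *)dst, fixed);
}

// AVX1 (plus SSE4.1 for packusdw): one __m256i -> 8 consecutive uint16_t.
static inline void store8_u16_avx1(uint16_t *dst, __m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extractf128_si256(v, 1);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi32(lo, hi));
}

Both rely on unsigned saturation being a no-op for values that already fit in 16 bits.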

Related

how to extract high 256 bits of 512 bits __m512i using avx512 intrinsic?

In AVX2 intrinsic programming, we can use _mm256_extracti128_si256 to extract the high/low 128 bits of a 256-bit register, but I can't find such an intrinsic for a 512-bit register to extract the high/low 256 bits. How do I extract the high 256 bits of a 512-bit __m512i using AVX2 intrinsics?
There are no AVX2 intrinsics to operate on a type that's new in AVX-512, __m512i.
There are AVX-512 intrinsics for vextracti32x8 (_mm512_extracti32x8_epi32) and vextracti64x4 (_mm512_extracti64x4_epi64) and mask / maskz versions (which is why two different element-size versions of the instructions exist).
You can find them by searching in Intel's intrinsics guide for __m256i _mm512 (i.e. for _mm512 intrinsics that return an __m256i). Or you can search for vextracti64x4; the intrinsics guide search feature works on asm mnemonics. Or look through https://www.felixcloutier.com/x86/ which is scraped from Intel's vol.2 asm manual; search for extract quickly gets to some likely-looking asm instructions; each entry has a section for intrinsics (for instructions that have intrinsics).
There's also _mm512_castsi512_si256 for the low half, of course. That's a no-op (zero asm instructions), so there aren't different element-size versions of it with merge-masking or zero-masking.
__m256i low = _mm512_castsi512_si256 (v512);
__m256i high = _mm512_extracti32x8_epi32 (v512, 1);
There's also vshufi64x2 and similar to shuffle with 128-bit granularity, with an 8-bit immediate control mask to grab lanes from 2 source vectors. (Like shufps does for 4-byte elements). That and a cast can get a 256-bit vector from any combination of parts you want.
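For example, a sketch (the variable and function names are mine) that grabs lanes 1 and 3 of one vector as a __m256i:

#include <immintrin.h>

// vshufi64x2 gathers any four 128-bit lanes from two sources; a cast
// then takes the low 256 bits. Result lanes 2-3 come from v_b here but
// are discarded by the cast.
__m256i lanes_1_3(__m512i v_a, __m512i v_b)
{
    __m512i mixed = _mm512_shuffle_i64x2(v_a, v_b, _MM_SHUFFLE(0, 0, 3, 1));
    return _mm512_castsi512_si256(mixed);
}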

AVX512 exchange low 256 bits and high 256 bits in zmm register

Is there any AVX-512 intrinsic that exchanges the low 256 bits and high 256 bits of a zmm register?
I have a 512-bit zmm register with double values. What I want to do is swap zmm[0:255] and zmm[256:511].
__m512d a = {10, 20, 30, 40, 50, 60, 70, 80};
__m512d b = _some_AVX_512_intrinsic(a);
// GOAL: b to be {50, 60, 70, 80, 10, 20, 30, 40}
There is a function (vperm2f128) that works on ymm registers, but I couldn't find any permute function that works on zmm registers.
You're looking for vshuff64x2 which can shuffle in 128-bit chunks from 2 sources, using an immediate control operand. It's the AVX-512 version of vperm2f128 which you found, but AVX-512 has two versions: one with masking by 32-bit elements, one with masking by 64-bit elements. (The masking is finer-grained than the shuffle, so you can merge or zero on a per-double basis while doing this.) Also integer and FP versions of the same shuffles, like vshufi32x4.
The intrinsic is _mm512_shuffle_f64x2(a, a, _MM_SHUFFLE(1, 0, 3, 2)).
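A minimal demo of that, assuming AVX-512F hardware and compiler support:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d a = _mm512_setr_pd(10, 20, 30, 40, 50, 60, 70, 80);
    __m512d b = _mm512_shuffle_f64x2(a, a, _MM_SHUFFLE(1, 0, 3, 2));
    double out[8];
    _mm512_storeu_pd(out, b);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);   // prints: 50 60 70 80 10 20 30 40
    printf("\n");
    return 0;
}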
Note that on Intel Ice Lake CPUs, if you're storing the result anyway, storing in two 32-byte halves with vmovupd / vextractf64x4 mem, zmm, 1 might be nearly as efficient. The vextract can't micro-fuse the store-address and store-data uops, but no shuffle port is involved on Intel, including Skylake-X (unlike Zen 4, I think). And Intel Ice Lake and later can sustain 2x 32-byte stores per clock, vs. 1x 64-byte aligned store per clock, if both stores are to the same cache line. (It seems the store buffer can commit two stores to the same cache line if they're both at the head of the queue.)
If the data's coming from memory, loading a __m256d + vinsertf64x4 is cheap, especially on Zen 4, but on Intel it's 2 uops: one load, one for any vector ALU port (p0 or p5). A merge-masked 256-bit broadcast might be cheaper if the mask register can stay set across loop iterations, like _mm512_mask_broadcast_f64x4(_mm512_castpd256_pd512(low), 0b11110000, _mm256_loadu_pd(addr+4)). That still takes an ALU uop on Skylake-X and Ice Lake, but it can micro-fuse with the load.
Other instructions that can do the same shuffle include valignq with a rotate count of 4 qwords (using the same vector for both inputs).
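A sketch of the valignq version (the function name is mine; casts are needed because valignq is an integer shuffle):

#include <immintrin.h>

// Rotate the concatenation a:a right by 4 qwords = swap the 256-bit halves.
static inline __m512d swap_halves_valignq(__m512d a)
{
    __m512i ai = _mm512_castpd_si512(a);
    return _mm512_castsi512_pd(_mm512_alignr_epi64(ai, ai, 4));
}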
Or of course any variable-control shuffle like vpermpd, but unlike for __m256d (4 doubles), 8 elements is too wide for an arbitrary shuffle with an 8-bit control.
On existing AVX-512 CPUs, a 2-input shuffle like valignq or vshuff64x2 is just as efficient as vpermpd with a control vector, including on Zen 4, which has wide shuffle units so lane-crossing stuff isn't super slow like it was on Zen 1. Maybe on Xeon Phi (KNL) it might be worth loading a control vector for vpermpd if you have to do this repeatedly and can't just load or store in 2 halves. (https://agner.org/optimize/ and https://uops.info/)

Why is the register length static in any CPU

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking whether it was possible to have a CPU with no definitive register size. That makes no sense, since the number and size of the registers is a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register, or pair registers together.
x86 does both, for example: add al, 9 uses only 8 bits of the 64-bit rax, and div rbx pairs rdx:rax to form a 128-bit register.
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More registers mean more bits needed to address them; simply put, longer instructions.
Longer instructions mean lower code density, more complex decoders, and less performance.
Furthermore, most elementary operations, like the logical ones, addition, and subtraction, are already implemented to operate on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time; we cannot issue eight 8-bit additions in a 64-bit ALU at the same time.
So there would be no improvement in either latency or throughput.
Accessing partial registers is useful for the programmer to expand the number of available registers: for example, if an algorithm works with 16-bit data, the programmer can use a single physical 64-bit register to store four items and operate on them independently (but not in parallel), as sketched below.
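A C-level sketch of that idea (the names are mine):

#include <stdint.h>

// Four 16-bit items packed into one 64-bit register-sized value,
// updated one at a time -- independently, but not in parallel.
static uint64_t set_item16(uint64_t packed, int i, uint16_t v)
{
    int shift = i * 16;
    packed &= ~(0xffffULL << shift);   // clear slot i
    packed |= (uint64_t)v << shift;    // insert the new value
    return packed;
}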
ISAs with variable-length instructions can also benefit from using partial registers, because that usually means smaller immediate values: for example, an instruction that sets a register to a specific value usually has an immediate operand that matches the size of the register being loaded (though RISCs usually sign-extend or zero-extend it).
Architectures like ARM (and presumably others as well) support half-precision floats. The idea is to do what you were speculating and what @Margaret explained: with half-precision floats, you can pack two float values into a single register, thereby requiring less bandwidth at the cost of reduced accuracy.
Reference:
[1] ARM
[2] GCC

256 bit fixed point arithmetic, the future?

Just some silly musings, but if computers were able to efficiently calculate 256-bit arithmetic, say if they had a 256-bit architecture, I reckon we'd be able to do away with floating point. I also wonder if there'd be any reason to progress past a 256-bit architecture. My basis for this is rather flimsy, but I'm confident you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256-bit type that used 127 or 128 bits for the integer part, 127 or 128 bits for the fractional part, and of course a sign bit. If you had hardware capable of calculating, storing, and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: if you were working with lengths and represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distance with a precision smaller than a Planck length!
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?
SIMD will make narrow types valuable forever. If you can do a 256-bit add, you can do eight 32-bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8-bit adds.
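For instance, with AVX2 intrinsics the same 256-bit datapath does either width; the element size only controls where carry propagation is cut:

#include <immintrin.h>

__m256i add_8x32(__m256i a, __m256i b) { return _mm256_add_epi32(a, b); } // 8 x 32-bit adds
__m256i add_32x8(__m256i a, __m256i b) { return _mm256_add_epi8(a, b); }  // 32 x 8-bit adds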
Hardware multiplier circuits are a lot more expensive to make wider, so it's not safe to assume that a 256b * 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023 for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
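A tiny demo of those binary64 limits (the printed values are the standard IEEE ones):

#include <stdio.h>
#include <float.h>

int main(void)
{
    // Normalized binary64 range is about 2^-1022 .. 2^1023, with a
    // 53-bit mantissa at every magnitude.
    printf("DBL_MIN=%g DBL_MAX=%g DBL_MANT_DIG=%d\n",
           DBL_MIN, DBL_MAX, DBL_MANT_DIG);
    // prints: DBL_MIN=2.22507e-308 DBL_MAX=1.79769e+308 DBL_MANT_DIG=53
    return 0;
}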
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52-bit * 52-bit + 64-bit => 64-bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52-bit input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64-bit integer multipliers. (A full vector width of 64-bit full multipliers would be even more expensive than vpmullq; a compromise design like this even for 64-bit integers should be a big hint that wide multipliers are expensive.) Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
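A hedged sketch of how those instructions are used (requires AVX512IFMA; the accumulator layout is illustrative, not a complete bignum routine):

#include <immintrin.h>

// Accumulate 52-bit x 52-bit products into 64-bit accumulators: one
// instruction adds the low 52 bits of each product, the other the high.
static inline void madd52(__m512i *acc_lo, __m512i *acc_hi,
                          __m512i b, __m512i c)
{
    *acc_lo = _mm512_madd52lo_epu64(*acc_lo, b, c);
    *acc_hi = _mm512_madd52hi_epu64(*acc_hi, b, c);
}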
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where they are used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)
128-bit computers? It's also about addressing memory, and when we run out of 64 bits for addressing it. Currently there are servers with 4 TB of memory; that requires about 42 address bits (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so at least 2 * 22 = 44 years; and since memory prices are likely not dropping that fast, it will be more than 50 years before we run out of 64-bit addressing capability.

Why do bytes exist? Why don't we just use bits?

A byte consists of 8 bits on most systems.
A byte typically represents the smallest data type a programmer may use. Depending on language, the data types might be called char or byte.
There are some types of data (booleans, small integers, etc) that could be stored in fewer bits than a byte. Yet using less than a byte is not supported by any programming language I know of (natively).
Why does this minimum of using 8 bits to store data exist? Why do we even need bytes? Why don't computers just use increments of bits (1 or more bits) rather than increments of bytes (multiples of 8 bits)?
Just in case anyone asks: I'm not worried about it. I do not have any specific needs. I'm just curious.
Because at the hardware level, memory is naturally organized into addressable chunks. Small chunks mean you can have fine-grained things like 4-bit numbers; large chunks allow for more efficient operation (typically a CPU moves things around in 'chunks' or multiples thereof). In particular, larger addressable chunks make for bigger address spaces: if I have chunks that are 1 bit, then an address range of 1 - 500 only covers 500 bits, whereas 500 8-bit chunks cover 4000 bits.
Note - it was not always 8 bits. I worked on a machine that thought in 6 bits. (good old octal)
Paper tape (~1950's) was 5 or 6 holes (bits) wide, maybe other widths.
Punched cards (the newer kind) were 12 rows of 80 columns.
1960s:
B-5000 -- 48-bit "words" with 6-bit characters
CDC-6600 -- 60-bit words with 6-bit characters
IBM 7090 -- 36-bit words with 6-bit characters
There were 12-bit machines; etc.
1970-1980s, "micros" enter the picture:
Intel 4004 - 4-bit chunks
8008, 8086, Z80, 6502, etc - 8 bit chunks
68000 - 16-bit words, but still 8-bit bytes
486 - 32-bit words, but still 8-bit bytes
today - 64-bit words, but still 8-bit bytes
future - 128, etc, but still 8-bit bytes
Get the picture? Americans figured that characters could be stored in only 6 bits.
Then we discovered that there was more in the world than just English.
So we floundered around with 7-bit ascii and 8-bit EBCDIC.
Eventually, we decided that 8 bits was good enough for all the characters we would ever need. ("We" were not Chinese.)
The IBM-360 came out as the dominant machine in the '60s-'70s; it was based on an 8-bit byte. (It sort of had 32-bit words, but that became less important than the almighty byte.)
It seemed such a waste to use 8 bits when all you really needed was 7 bits to store all the characters you ever needed.
IBM, in the mid-20th century "owned" the computer market with 70% of the hardware and software sales. With the 360 being their main machine, 8-bit bytes was the thing for all the competitors to copy.
Eventually, we realized that other languages existed and came up with Unicode/utf8 and its variants. But that's another story.
A good way for me to write something late at night!
Your points are perfectly valid; however, history will always be that insane intruder who would have ruined your plans long before you were born.
For the purposes of explanation, let's imagine a fictitious machine with an architecture by the name of Bitel(TM) Inside or something of the like. The Bitel specifications mandate that the Central Processing Unit (CPU, i.e., the microprocessor) shall access memory in one-bit units. Now, let's say a given instance of a Bitel-operated machine has a memory unit holding 32 billion bits (our fictitious equivalent of a 4GB RAM unit).
Now, let's see why Bitel, Inc. got into bankruptcy:
The binary code of any given program would be gigantic (the compiler would have to manipulate every single bit!)
32-bit addresses would be (even more) limited to hold just 512MB of memory. 64-bit systems would be safe (for now...)
Memory accesses would literally be a bottleneck: by the time the CPU has fetched all 48 bits it needs to process a single ADD instruction, the floppy would have already spun for too long, and you know what happens next...
Who the **** really needs to optimize a single bit? (See previous bankruptcy justification).
If you need to handle single bits, learn to use bitwise operators!
Programmers would go crazy as both coffee and RAM get too expensive. At the moment, this is a perfect synonym of apocalypse.
The C standard is holy and sacred, and it mandates that the minimum addressable unit (i.e., char) shall be at least 8 bits wide.
8 is a perfect power of 2. (1 is another one, but meh...)
In my opinion, it's an issue of addressing. To access individual bits of data, you would need eight times as many addresses (adding 3 bits to each address) compared to accessing individual bytes. The byte is generally the smallest practical unit to hold a number in a program (it has only 256 possible values).
Some CPUs use words to address memory instead of bytes. That's their natural data type, so 16 or 32 bits. If Intel CPUs did that it would be 64 bits.
8-bit bytes are traditional because the first popular home computers used 8 bits. 256 values are enough to do a lot of useful things, while 16 (4 bits) are not quite enough.
And, once a thing goes on for long enough it becomes terribly hard to change. This is also why your hard drive or SSD likely still pretends to use 512 byte blocks. Even though the disk hardware does not use a 512 byte block and the OS doesn't either. (Advanced Format drives have a software switch to disable 512 byte emulation but generally only servers with RAID controllers turn it off.)
Also, Intel/AMD CPUs have so much extra silicon doing so much extra decoding work that the slight difference in 8 bit vs 64 bit addressing does not add any noticeable overhead. The CPU's memory controller is certainly not using 8 bits. It pulls data into cache in long streams and the minimum size is the cache line, often 64 bytes aka 512 bits. Often RAM hardware is slow to start but fast to stream so the CPU reads kilobytes into L3 cache, much like how hard drives read an entire track into their caches because the drive head is already there so why not?
First of all, C and C++ do have native support for bit-fields.
#include <iostream>

struct S {
    // will usually occupy 2 bytes:
    // 3 bits: value of b1
    // 2 bits: unused
    // 6 bits: value of b2
    // 2 bits: value of b3
    // 3 bits: unused
    unsigned char b1 : 3, : 2, b2 : 6, b3 : 2;
};

int main()
{
    std::cout << sizeof(S) << '\n'; // usually prints 2
}
Probably the answer lies in performance and memory alignment, and the fact that (I reckon partly because the byte is called char in C) the byte is the smallest part of a machine word that can hold a 7-bit ASCII character. Text operations are common, so a special type for plain text has value in a programming language.
Why bytes?
What is so special about 8 bits that it deserves its own name?
Computers do process all data as bits, but they prefer to process bits in byte-sized groupings. Or to put it another way: a byte is how much a computer likes to "bite" at once.
The byte is also the smallest addressable unit of memory in most modern computers. A computer with byte-addressable memory cannot store an individual piece of data smaller than a byte.
What's in a byte?
A byte represents different types of information depending on the context. It might represent a number, a letter, or a program instruction. It might even represent part of an audio recording or a pixel in an image.
