How would I define the __builtin_blendvps256 GCC intrinsic in Ada using GNAT? - gcc

I am trying to define a library in Ada (built on GNAT specifically) for x86 ISA extensions. (This question is specific to AVX/AVX2).
Here is some example code:
-- 256-bit Vector of Single Precision Floating Point Numbers
type Vector_256_Float_32 is array (0 .. 7) of IEEE_Float_32 with
  Alignment => 32, Size => 256, Object_Size => 256;
pragma Machine_Attribute (Vector_256_Float_32, "vector_type");
pragma Machine_Attribute (Vector_256_Float_32, "may_alias");

-- 256-bit Vector of 32-bit Signed Integers
type Vector_256_Integer_32 is array (0 .. 7) of Integer_32 with
  Alignment => 32, Size => 256, Object_Size => 256;
pragma Machine_Attribute (Vector_256_Integer_32, "vector_type");
pragma Machine_Attribute (Vector_256_Integer_32, "may_alias");

-- 256-bit Vector of 32-bit Unsigned Integers
type Vector_256_Unsigned_32 is array (0 .. 7) of Unsigned_32 with
  Alignment => 32, Size => 256, Object_Size => 256;
pragma Machine_Attribute (Vector_256_Unsigned_32, "vector_type");
pragma Machine_Attribute (Vector_256_Unsigned_32, "may_alias");

function vblendvps
  (Left, Right, Mask : Vector_256_Float_32)
   return Vector_256_Float_32 with
  Inline_Always => True, Convention => Intrinsic, Import => True,
  External_Name => "__builtin_ia32_blendvps256";
For the sake of education, I want to know how to do this in assembly.
I have tried to define the vblendvps function using the Asm function from System.Machine_Code. However, as I am not knowledgeable about assembly programming, I am struggling to find resources on how to do this.
This is what I have so far:
with System.Machine_Code; use System.Machine_Code;

function vblendvps
  (Left, Right, Mask : Vector_256_Float_32)
   return Vector_256_Float_32
is
   result : Vector_256_Float_32;
begin
   Asm
     (Template => "vblendvps %3, %0, %1, %2",
      Outputs  => Vector_256_Float_32'Asm_Output ("=g", result),
      Inputs   =>
        (Vector_256_Float_32'Asm_Input ("g", Left),
         Vector_256_Float_32'Asm_Input ("g", Right),
         Vector_256_Unsigned_32'Asm_Input ("g", Mask)));
   return result;
end vblendvps;
When compiling the complete code, I get
Error: too many memory references for `vblendvps'
I believe this means that I need to move the arguments from memory to registers, but I am not sure. If there are some helpful references that explain every instruction, I would greatly appreciate that. (I had quite some trouble looking up the arguments to vblendvps).
My understanding is that the instruction has the form (using ymm registers in my case)
vblendvps RESULT, LEFT, RIGHT, MASK
Please let me know how I would do this. Even if it is not in Ada, I'm sure I can figure out how to translate it.
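For reference, here is a minimal, untested sketch in C++ (GCC extended asm) of what a register-constrained version might look like. It assumes the "too many memory references" error comes from the "g" constraint allowing memory operands, that GCC's "x" constraint selects an SSE/AVX register, and that the default AT&T syntax reverses the Intel operand order (mask, src2, src1, destination). The helper name is made up:
#include <immintrin.h>

/* Hypothetical helper, not from the question: blends left/right per the sign
   bit of each mask element.  The "x" constraints keep every operand in a
   YMM register, so the assembler never sees a memory operand. */
static inline __m256 blendv_ps_asm(__m256 left, __m256 right, __m256 mask)
{
    __m256 result;
    __asm__ ("vblendvps %3, %2, %1, %0"   /* AT&T order: mask, src2, src1, dst */
             : "=x" (result)
             : "x" (left), "x" (right), "x" (mask));
    return result;
}
In GNAT the equivalent constraints would presumably be "=x" in Asm_Output and "x" in Asm_Input, but treat that as an assumption to verify.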

Related

Delphi Radix Sort supporting negative integers

Lately I used the great sorting algorithm made by Rebuilder from Habr.com. It served me well, but it sorts only positive integers, and recently I ran into the need to sort negatives as well. For now I use QuickSort, but since the array is very large (10k+ elements), I wonder if one could modify RadixSort for this task.
Here is the procedure as it stands now. The comment translations are mine; sorry if I got something wrong.
procedure RSort(var m: array of Longword);
  //--------------------------------------------------
  procedure Sort_step(var source, dest, offset: array of Longword; const num: Byte);
  var
    i, temp : Longword;
    k       : Byte;
  begin
    for i := High(source) downto 0 do
    begin
      temp := source[i];
      k := temp SHR num;
      dec(offset[k]);
      dest[offset[k]] := temp;
    end;
  end;
  //--------------------------------------------------
  // Объявляем массив корзин первым, для выравнивания на стеке
  // Creating the bin array first for aligning at the stack
var
  s      : array[0..3] of array[0..255] of Longword;
  i, k   : Longword;
  // Смещение байт внутри переменной k
  // Byte offset inside of variable k
  offset : array[0..3] of Byte absolute k;
  m_temp : array of Longword;
begin
  SetLength(m_temp, Length(m));
  // Быстрая очистка корзин
  // Quick bin clear
  FillChar(s[0], 256 * 4 * SizeOf(Longword), 0);
  // Заполнение корзин
  // Filling bins
  for i := 0 to High(m) do
  begin
    k := m[i];
    Inc(s[0, offset[0]]);
    Inc(s[1, offset[1]]);
    Inc(s[2, offset[2]]);
    Inc(s[3, offset[3]]);
  end;
  // Пересчёт смещений для корзин
  // Recalculating bin offsets
  for i := 1 to 255 do
  begin
    Inc(s[0, i], s[0, i-1]);
    Inc(s[1, i], s[1, i-1]);
    Inc(s[2, i], s[2, i-1]);
    Inc(s[3, i], s[3, i-1]);
  end;
  // Вызов сортировки по байтам от младших к старшим
  // Sorting by byte, from least to most
  Sort_step(m, m_temp, s[0], 0);
  Sort_step(m_temp, m, s[1], 8);
  Sort_step(m, m_temp, s[2], 16);
  Sort_step(m_temp, m, s[3], 24);
  SetLength(m_temp, 0);
end;
Link: https://habr.com/ru/post/484224/
I found some helpful advice on the Internet, including on Stack Overflow, but I ran into two problems:
There are too many different solutions and I can't choose the optimal one for Delphi.
I lack the knowledge and skill to implement them correctly. I've tried some and got wrong results.
So, could someone modify the given function and explain to me what they did and why?
A simple approach is to complement the sign bit somewhere during the process. Note this only affects the most significant "digit" (usually a byte for a fast radix sort). The most significant "digit" could be handled in a separate loop from the code that handles the other "digits".
The simplest approach would be to make an initial pass to complement the sign bit of every element in the array, do the radix sort, then make a final pass to complement the sign bit of every element again.
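As an illustration of that simplest approach (sketched in C++ rather than Delphi; the radix-sort call in the middle is a placeholder for the existing RSort):
#include <cstdint>
#include <cstddef>

// Flipping the sign bit maps signed order onto unsigned order, so an
// unchanged unsigned radix sort can be reused in between the two passes.
void radix_sort_signed(int32_t *a, std::size_t n)
{
    uint32_t *u = reinterpret_cast<uint32_t *>(a);  // signed/unsigned views may alias
    for (std::size_t i = 0; i < n; ++i) u[i] ^= 0x80000000u;  // complement sign bits
    // ... run the existing unsigned radix sort (RSort) on u[0 .. n-1] here ...
    for (std::size_t i = 0; i < n; ++i) u[i] ^= 0x80000000u;  // restore
}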

Faster lookup tables using AVX2

I'm trying to speed up an algorithm that performs a series of table lookups. I'd like to use SSE2 or AVX2. I've tried using the _mm256_i32gather_epi32 intrinsic, but it is 31% slower. Does anyone have any suggestions for improvements or a different approach?
Timings:
C code = 234
Gathers = 340
static const int32_t g_tables[2][64]; // values between 0 and 63
template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
const int32_t * lut = g_tables[which];
// Leave this code for Broadwell or Skylake since it's 31% slower than C code
// (gather is 12 for Haswell, 7 for Broadwell and 5 for Skylake)
#if 0
if (sizeof(T) == sizeof(int16_t)) {
__m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7;
__m128i sse0, sse1, sse2, sse3, sse4, sse5, sse6, sse7;
__m256i mask = _mm256_set1_epi32(0xffff);
avx0 = _mm256_loadu_si256((__m256i *)(lut));
avx1 = _mm256_loadu_si256((__m256i *)(lut + 8));
avx2 = _mm256_loadu_si256((__m256i *)(lut + 16));
avx3 = _mm256_loadu_si256((__m256i *)(lut + 24));
avx4 = _mm256_loadu_si256((__m256i *)(lut + 32));
avx5 = _mm256_loadu_si256((__m256i *)(lut + 40));
avx6 = _mm256_loadu_si256((__m256i *)(lut + 48));
avx7 = _mm256_loadu_si256((__m256i *)(lut + 56));
avx0 = _mm256_i32gather_epi32((int32_t *)(src), avx0, 2);
avx1 = _mm256_i32gather_epi32((int32_t *)(src), avx1, 2);
avx2 = _mm256_i32gather_epi32((int32_t *)(src), avx2, 2);
avx3 = _mm256_i32gather_epi32((int32_t *)(src), avx3, 2);
avx4 = _mm256_i32gather_epi32((int32_t *)(src), avx4, 2);
avx5 = _mm256_i32gather_epi32((int32_t *)(src), avx5, 2);
avx6 = _mm256_i32gather_epi32((int32_t *)(src), avx6, 2);
avx7 = _mm256_i32gather_epi32((int32_t *)(src), avx7, 2);
avx0 = _mm256_and_si256(avx0, mask);
avx1 = _mm256_and_si256(avx1, mask);
avx2 = _mm256_and_si256(avx2, mask);
avx3 = _mm256_and_si256(avx3, mask);
avx4 = _mm256_and_si256(avx4, mask);
avx5 = _mm256_and_si256(avx5, mask);
avx6 = _mm256_and_si256(avx6, mask);
avx7 = _mm256_and_si256(avx7, mask);
sse0 = _mm_packus_epi32(_mm256_castsi256_si128(avx0), _mm256_extracti128_si256(avx0, 1));
sse1 = _mm_packus_epi32(_mm256_castsi256_si128(avx1), _mm256_extracti128_si256(avx1, 1));
sse2 = _mm_packus_epi32(_mm256_castsi256_si128(avx2), _mm256_extracti128_si256(avx2, 1));
sse3 = _mm_packus_epi32(_mm256_castsi256_si128(avx3), _mm256_extracti128_si256(avx3, 1));
sse4 = _mm_packus_epi32(_mm256_castsi256_si128(avx4), _mm256_extracti128_si256(avx4, 1));
sse5 = _mm_packus_epi32(_mm256_castsi256_si128(avx5), _mm256_extracti128_si256(avx5, 1));
sse6 = _mm_packus_epi32(_mm256_castsi256_si128(avx6), _mm256_extracti128_si256(avx6, 1));
sse7 = _mm_packus_epi32(_mm256_castsi256_si128(avx7), _mm256_extracti128_si256(avx7, 1));
_mm_storeu_si128((__m128i *)(dst), sse0);
_mm_storeu_si128((__m128i *)(dst + 8), sse1);
_mm_storeu_si128((__m128i *)(dst + 16), sse2);
_mm_storeu_si128((__m128i *)(dst + 24), sse3);
_mm_storeu_si128((__m128i *)(dst + 32), sse4);
_mm_storeu_si128((__m128i *)(dst + 40), sse5);
_mm_storeu_si128((__m128i *)(dst + 48), sse6);
_mm_storeu_si128((__m128i *)(dst + 56), sse7);
}
else
#endif
{
for (int32_t i = 0; i < 64; i += 4)
{
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
*dst++ = src[*lut++];
}
}
}
You're right that gather is slower than a PINSRD loop on Haswell. It's probably nearly break-even on Broadwell. (See also the x86 tag wiki for perf links, especially Agner Fog's insn tables, microarch pdf, and optimization guide)
If your indices are small, or you can slice them up, pshufb can be used as a parallel LUT with 4-bit indices. It gives you sixteen 8-bit table entries, but you can use stuff like punpcklbw to combine two vectors of byte results into one vector of 16-bit results. (Use separate tables for the high and low halves of the LUT entries, with the same 4-bit indices.)
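As a small illustration of the pshufb-as-LUT idea (the table here is a nibble popcount, not anything from the question; SSSE3 intrinsics):
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Sixteen 8-bit table entries, indexed by a 4-bit value in each byte.
// Each input byte must already be in the range 0..15.
static inline __m128i popcount_of_nibbles(__m128i nibbles)
{
    const __m128i table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4);
    return _mm_shuffle_epi8(table, nibbles);   // pshufb: per-byte table lookup
}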
This kind of technique gets used for Galois Field multiplies, when you want to multiply every element of a big buffer of GF16 values by the same value (e.g. for Reed-Solomon error correction codes). Like I said, getting any benefit from this requires taking advantage of special properties of your use-case.
AVX2 can do two 128b pshufbs in parallel, in each lane of a 256b vector. There is nothing better until AVX512F: __m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b). There are byte (vpermi2b in AVX512VBMI), word (vpermi2w in AVX512BW), dword (this one, vpermi2d in AVX512F), and qword (vpermi2q in AVX512F) element size versions. This is a full cross-lane shuffle, indexing into two concatenated source registers. (Like AMD XOP's vpperm).
The two different instructions behind the one intrinsic (vpermt2d / vpermi2d) give you a choice of overwriting the table with the result, or overwriting the index vector. The compiler will pick based on which inputs are reused.
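A hypothetical sketch of using it as a 32-entry dword LUT (AVX512F; the helper name is illustrative):
#include <immintrin.h>

// Two ZMM registers hold a 32-entry table of 32-bit values; idx holds sixteen
// indices in 0..31.  Indices 0..15 select from table_lo, 16..31 from table_hi.
static inline __m512i lut32_lookup(__m512i table_lo, __m512i table_hi, __m512i idx)
{
    return _mm512_permutex2var_epi32(table_lo, idx, table_hi);   // vpermt2d / vpermi2d
}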
Your specific case:
*dst++ = src[*lut++];
The lookup table is actually src, not the variable you've called lut. lut is walking through an array that is used as a shuffle-control mask for src.
You should make g_tables an array of uint8_t for best performance. The entries are only 0..63, so they fit. Zero-extending loads into full registers are as cheap as normal loads, so it just reduces the cache footprint. To use it with AVX2 gathers, use vpmovzxbd. The intrinsic is frustratingly difficult to use as a load, because there's no form that takes an int64_t *, only __m256i _mm256_cvtepu8_epi32 (__m128i a) which takes a __m128i. This is one of the major design flaws with intrinsics, IMO.
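In practice the workaround looks something like this (a hedged sketch; the helper name is made up):
#include <immintrin.h>
#include <cstdint>

// Load 8 uint8_t indices and zero-extend them to 32 bits for use with a gather.
// _mm256_cvtepu8_epi32 only accepts a __m128i, so go through an 8-byte load.
static inline __m256i load_eight_byte_indices(const uint8_t *idx)
{
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(idx));
    return _mm256_cvtepu8_epi32(bytes);   // vpmovzxbd ymm, qword ptr
}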
I don't have any great ideas for speeding up your loop. Scalar code is probably the way to go here. The SIMD code shuffles 64 int16_t values into a new destination, I guess. It took me a while to figure that out, because I didn't find the if (sizeof...) line right away, and there are no comments. :( It would be easier to read if you used sane variable names, not avx0... Using x86 gather instructions for elements smaller than 4B certainly requires annoying masking. However, instead of pack, you could use a shift and OR.
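For instance, something along these lines could replace the pack step (sketch only; note that the two gather results end up interleaved within each 32-bit lane, so the index order would have to be chosen to match):
#include <immintrin.h>

// After masking, every 32-bit lane holds a 16-bit result, so two gather
// results can be merged with a shift and an OR instead of a pack.
static inline __m256i merge_with_shift_or(__m256i even_results, __m256i odd_results)
{
    return _mm256_or_si256(even_results, _mm256_slli_epi32(odd_results, 16));
}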
You could make an AVX512 version for sizeof(T) == sizeof(int8_t) or sizeof(T) == sizeof(int16_t), because all of src will fit into one or two zmm registers.
If g_tables were being used as a LUT, AVX512 could do it easily, with vpermi2b. You'd have a hard time without AVX512, though, because a 64-byte table is too big for pshufb. Using four lanes (16B) of pshufb for each input lane could work: mask off indices outside 0..15, then indices outside 16..31, etc., with pcmpgtb or something. Then you have to OR all four lanes together. So this sucks a lot.
Possible speedups: design the shuffle by hand
If you're willing to design a shuffle by hand for a specific value of g_tables, there are potential speedups that way. Load a vector from src, shuffle it with a compile-time constant pshufb or pshufd, then store any contiguous blocks in one go. (Maybe with pextrd or pextrq, or even better movq from the bottom of the vector. Or even a full-vector movdqu).
Actually, loading multiple src vectors and shuffling between them is possible with shufps. It works fine on integer data, with no slowdowns except on Nehalem (and maybe also on Core2). punpcklwd / dq / qdq (and the corresponding punpckhwd etc) can interleave elements of vectors, and give different choices for data movement than shufps.
If it doesn't take too many instructions to construct a few full 16B vectors, you're in good shape.
If g_tables can take on too many possible values, it might be possible to JIT-compile a custom shuffle function. This is probably really hard to do well, though.

Change array type: 8bit type to 6bit type

I have two types and two arrays of those types in file.ads:
type Ebit is mod 2**8;
type Sbit is mod 2**6;
type Data_Type is array (Positive range <>) of Ebit;
type Changed_Data_Type is array (Positive range <>) of Sbit;
and a function:
function ChangeDataType (D : in Data_Type) return Changed_Data_Type
  with
    Pre  => D'Length rem 3 = 0 and D'Last < Positive'Last / 4,
    Post => ChangeDataType'Result'Length = 4 * (D'Length / 3);
OK, I can understand all of this.
For example, if we have an array of 65, 66, 65, 65, 66, 65 as 8-bit values, the function should give us 16, 20, 9, 1, 16, 20, 9, 1 as 6-bit values.
I don't know how I can build the 6-bit array from the 8-bit array.
My idea for a solution is, for example, to take the values bit by bit:
fill all bits in the 6-bit type with 0 (probably the default)
if the first bit (2**1) is 1, set bit (2**1) in the 6-bit type to 1;
and do some iterations
But I don't know how to do this; there is always a problem with the types. Is this a good idea, or can I do it an easier way? I spent last night trying to write this, but without success.
Edit:
I wrote some code; it is working, but I have a problem with the array initialization.
function ChangeDataType (D : in Data_Type) return Changed_Data_Type
is
   length : Natural := (4 * (D'Length / 3));
   ER     : Changed_Data_Type (length);
   Temp   : Ebit;
   Temp1  : Ebit;
   Temp2  : Ebit;
   Actual : Ebit;
   n      : Natural;
   k      : Natural;
begin
   n := 0;
   k := 0;
   Temp  := 2#00000000#;
   Temp1 := 2#00000000#;
   Temp2 := 2#00000000#;
   Array_Loop :
   for k in D'Range loop
      case n is
         when 0 =>
            Actual := D (k);
            Temp1  := Actual / 2**2;
            ER (k) := Sbit (Temp1);
            Temp   := Actual * (2**4);
            n := 2;
         when 2 =>
            Actual := D (k);
            Temp1  := Actual / 2**4;
            Temp2  := Temp1 or Temp;
            ER (k) := Sbit (Temp2);
            Temp   := Actual * (2**2);
            n := 4;
         when 4 =>
            Actual := D (k);
            Temp1  := Actual / 2**6;
            Temp2  := Temp1 or Temp;
            ER (k) := Sbit (Temp2);
            n := 6;
         when 6 =>
            Temp1  := Actual * (2**2);
            Temp2  := Actual / 2**2;
            ER (k) := Sbit (Temp2);
            n := 0;
         when others =>
            n := 0;
      end case;
   end loop Array_Loop;
   return ER;
end;
If I understand what you're asking... it's that you want to re-pack the same 8-bit data into 6-bit values such that the "leftover" bits of the first Ebit become the first bits (highest or lowest?) of the second Sbit.
One way you can do this, at least for fixed-size arrays (e.g. your 6 words * 8 bits, 8 words * 6 bits example), is by specifying the exact layout in memory for each array type, using packing and representation aspects (or pragmas, before Ada 2012), which are nicely described here.
I haven't tested the following, but it may serve as a starting point.
type Ebit is mod 2**8;
type Sbit is mod 2**6;
for Ebit'Size use 8;
for Sbit'Size use 6;
type Data_Type is array (1 .. 6) of Ebit
with Alignment => 0; -- this should pack tightly
type Changed_Data_Type is array (1 .. 8) of Sbit
with Alignment => 0;
Then you can instantiate the generic Unchecked_Conversion function with the two array types, and use that function to convert from one array to the other.
with Ada.Unchecked_Conversion;
function Change_Type is new Ada.Unchecked_Conversion(Data_Type, Changed_Data_Type);
declare
   Packed_Bytes : Changed_Data_Type := Change_Type (Original_Bytes);
begin ...
In terms of code generated, it's not slow, because Unchecked_Conversion doesn't do anything, except tell the compile-time type checking to look the other way.
I view Unchecked_Conversion like the "I meant to do that" look my cat gives me after falling off the window ledge. Again...
Alternatively, if you wish to avoid copying, you can declare Original_Bytes as aliased, and use a similar trick with access types and Unchecked_Access to overlay both arrays on the same memory (like a Union in C). I think this is what DarkestKhan calls "array overlays" in a comment below. See also section 3 of this rather dated page which describes the technique further. It notes the overlaid variable must not only be declared aliased but also volatile so that accesses to one view aren't optimised into registers, but reflect any changes made via the other view. Another approach to overlays is in the Ada Wikibook here.
Now this may be vulnerable to endianness considerations, i.e. it may work on some platforms but not others. The second reference above gives an example of a record with exact bit-alignment of its members: we can at least take the Bit_Order aspect, as in with Alignment => 0, Bit_Order => Low_Order_First; for the arrays above...
-- code stolen from "Rationale" ... see link above p.11
type RR is record
   Code : Opcode;
   R1   : Register;
   R2   : Register;
end record
  with Alignment => 2, Bit_Order => High_Order_First;

for RR use record
   Code at 0 range 0 .. 7;
   R1   at 1 range 0 .. 3;
   R2   at 1 range 4 .. 7;
end record;
One thing that's not clear to me is whether there's a formulaic way to specify the exact layout of each element in an array, as is done for a record here, or even whether there's a need to. If necessary, one workaround would be to replace the arrays above with records. But I'd love to see a better answer if there is one.
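For reference, independent of how the layout is specified, the repacking in the question's example (65, 66, 65 becoming 16, 20, 9, 1) is just the following bit arithmetic, sketched here in C-style code purely for illustration (not Ada):
#include <cstdint>

// Three 8-bit values are regrouped into four 6-bit values, high bits first.
void repack_three_bytes(const uint8_t in[3], uint8_t out[4])
{
    out[0] =   in[0] >> 2;                            // top 6 bits of byte 0
    out[1] = ((in[0] & 0x03u) << 4) | (in[1] >> 4);   // low 2 of byte 0, top 4 of byte 1
    out[2] = ((in[1] & 0x0Fu) << 2) | (in[2] >> 6);   // low 4 of byte 1, top 2 of byte 2
    out[3] =   in[2] & 0x3Fu;                         // low 6 bits of byte 2
}
With 65, 66, 65 as input this produces 16, 20, 9, 1, matching the question's example.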

How to design a decoder that will have extra outputs?

For an application I am creating, I would like to use a decoder that helps write to one of 42 registers. In order to account for all possible registers, I need a 6-bit input, since the ceiling of lg(42) is 6.
However, this will create a 6-to-64 decoder, leaving me with 12 extra outputs that I do not know how to handle. I know that in VHDL I can write a case statement for it:
case input is
  when "000000" => output <= reg0;
  when "000001" => output <= reg1;
  .
  .
  .
  when others   => output <= ???;
end case;
Hopefully everything else will be designed so that an input > 41 does not occur, but how should the code be written to handle that case? Is there a way to handle it without somehow stopping the application? Or, as an alternative, is there a way to write a decoder that has only 42 outputs?
An easier way to write this is:
type regs_type is array (integer range <>) of std_logic_vector(7 downto 0);
signal regs : regs_type (0 to 41) := (others => (others => '0'));
...
output <= regs(to_integer(unsigned(input)));
This assumes 'input' is a std_logic_vector and that your registers are 8 bits wide.
Then use the regs array for your registers 0-41. I suppose if you wanted to be explicit about registers 42+, you could create an array of size 64, and leave the upper elements unconnected, but I believe the above code would achieve the same thing.
If your registers actually have meaningful names, not just reg0 etc., you can have a separate block of code connecting these to the regs array, for example:
regs(0) <= setup_reg;
regs(1) <= data_out;
and so on. If I were doing it this way, I would have defined constants for the regs index values, for example:
constant SETUP_REG_ADDRESS : integer := 0;
constant DATA_OUT_ADDRESS : integer := 1;
...
regs(SETUP_REG_ADDRESS) <= setup_reg;
regs(DATA_OUT_ADDRESS) <= data_out;
Alternatively, if you wanted to keep the case statement, you could write your others clause as
when others => output <= (others => '-');
This 'don't care' value allows the tools to do whatever is the most efficient in these cases that you believe to be unreachable anyway. If you were concerned about something undefined being assigned to output if input somehow did exceed 41, you could always replace the '-' with a '0'.

Why do x64 projects use a default packing alignment of 16?

If you compile the following code in an x64 project in VS2012 without any /Zp flags:
#pragma pack(show)
then the compiler will spit out:
value of pragma pack(show) == 16
If the project uses Win32, then the compiler will spit out:
value of pragma pack(show) == 8
What I don't understand is that the largest natural alignment of any type (i.e. long long and pointers) in Win64 is 8. So why not just make the default alignment 8 for x64?
Somewhat related to that, why would anyone ever use /Zp16?
EDIT:
Here's an example to show what I'm talking about. Even though pointers have a natural alignment of 8 bytes on x64, Zp1 can force them onto a 1-byte boundary.
struct A
{
    char  a;
    char* b;
};
// Zp16
// Offset of a == 0
// Offset of b == 8
// Zp1
// Offset of a == 0
// Offset of b == 1
Now if we take an example that uses SSE:
struct A
{
    char   a;
    char*  b;
    __m128 c; // uses __declspec(align(16)) in xmmintrin.h
};
// Zp16
// Offset of a == 0
// Offset of b == 8
// Offset of c == 16
// Zp1
// Offset of a == 0
// Offset of b == 1
// Offset of c == 16
If __m128 were truly a builtin type, then I'd expect the offset to be 9 with Zp1. But since it uses __declspec(align(16)) in its definition in xmmintrin.h, that trumps any Zp settings.
So here's my question worded a little differently: is there a type for 'c' that has a natural alignment of 16B but will have an offset of 9 in the previous example?
The MSDN page here includes the following relevant information about your question "why not make the default alignment 8 for x64?":
Writing applications that use the latest processor instructions introduces some new constraints and issues. In particular, many new instructions require that data must be aligned to 16-byte boundaries. Additionally, by aligning frequently used data to the cache line size of a specific processor, you improve cache performance. For example, if you define a structure whose size is less than 32 bytes, you may want to align it to 32 bytes to ensure that objects of that structure type are efficiently cached.
Why do x64 projects use a default packing alignment of 16?
On x64 the floating point is performed in the SSE unit. You state that the largest type has alignment 8. But that is not correct. Some of the SSE intrinsic types, for example __m128, have alignment of 16.
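A quick illustrative check (C++11; struct A is the example from the question, reproduced here as an assumption):
#include <xmmintrin.h>   // __m128
#include <cstddef>       // offsetof

struct A
{
    char   a;
    char  *b;
    __m128 c;   // the header declares this type with 16-byte alignment
};

static_assert(alignof(__m128) == 16, "__m128 requires 16-byte alignment");
static_assert(offsetof(A, c) % 16 == 0, "c stays 16-byte aligned regardless of /Zp");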
