Am I using _mm_srl_epi32 wrong? - visual-studio-2005

In this small code example:
__m128i twos = _mm_set_epi32(2,3,1,2);
__m128i foo = _mm_set_epi32(128,128,128,128);
__m128i shifted = _mm_srl_epi32(foo,twos);
"shifted" is full of zeroes, while I expect it two be full of four 32-bit integers with the values 32,16,64, and 32, respectively. Am I using the intrinsic wrong?

Yes, you are using it incorrectly. The second argument to _mm_srl_epi32() specifies the number of bits to shift the first argument by, but it isn't a vector argument as you might expect, letting you shift each 32-bit integer by a different number of bits. Instead, the 128-bit argument is truncated to its low 64 bits, and that value is used as a single shift count applied to all 4 integers in the first argument. In your case, the lower 64 bits are 0x0000000100000002, which is a very large positive count. This results in all of the elements of foo becoming zero, as all of their bits are shifted out.
A good place to find all of the little details on every instruction out there is Intel's AVX Programmer's Reference. While the title may be somewhat of a misnomer, the document contains descriptions of all SSE/SSE2/.../AVX/AVX2 instructions and descriptions of their intrinsics available in Intel's C++ compiler (which are typically also available in gcc and others). Searching for _mm_srl_epi32 in the document yields a clear explanation on exactly what the instruction does.
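For reference, here is a minimal sketch of the two working alternatives: _mm_srli_epi32 (SSE2) when every element uses the same shift count, and _mm_srlv_epi32 when each element needs its own count (the latter requires AVX2, so it is not an option on a 2005-era toolchain; the function name is just for illustration):

#include <immintrin.h>

__m128i shift_examples(void)
{
    __m128i foo = _mm_set_epi32(128, 128, 128, 128);

    /* Shift every element right by the same immediate count (SSE2). */
    __m128i all_by_two = _mm_srli_epi32(foo, 2);        /* {32, 32, 32, 32} */

    /* Per-element variable shift; requires AVX2 (vpsrlvd). */
    __m128i counts   = _mm_set_epi32(2, 3, 1, 2);
    __m128i per_lane = _mm_srlv_epi32(foo, counts);     /* {32, 16, 64, 32} */

    return _mm_add_epi32(all_by_two, per_lane);         /* just to use both results */
}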

Related

How can I further understand the compilation process used by gcc?

I was trying to reverse engineer some PSP programs developed using the free
pspsdk
https://sourceforge.net/projects/minpspw/
I created a function to see how MIPS handles more than 4 arguments (a0-a3).
Everyone I know has told me that they get passed on the stack.
To my surprise, that 5th argument was actually passed in register t0, and the compiler didn't even use the stack!
It also inlined a function without even using a jal or jump to it (an obvious optimization).
Although the function does exist somewhere in memory (you can double-check by printing a function pointer to it), the code that was actually executed was inlined without the need for a function call instruction.
But that doesn't really help a reverse-engineering attempt...
There is a man page for this version of gcc, and it takes only seconds to install, if anyone is able to point me to its documentation on how compilation works, if there is one.
It's so long I don't even know how to find the relevant information reliably.
How arguments are passed is specified by the ABI (application binary interface), so you have to find the respective documents.
Moreover, there is more than one such ABI for MIPS, namely n32 and n64. In the case of mips-gcc, some of the decisions are commented in the GCC sources, for example in ./gcc/config/mips/mips.h:
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.
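As an illustration (not from the GCC sources), here is roughly where the integer arguments of a six-argument call land under the ABIs mentioned above; the function name is just for the example:

/* Illustrative only: where the integer arguments of a call like
 * test(1, 2, 3, 4, 5, 6) typically land.
 *
 *   o32 ABI        : args 1-4 in $a0-$a3 ($4-$7), args 5-6 on the stack
 *   EABI / n32/n64 : args 1-8 in $4-$11, i.e. $a0-$a3 followed by the
 *                    registers o32 names $t0-$t3 - so the 5th argument
 *                    lands in $t0, matching what the pspsdk toolchain did
 */
int test(int a, int b, int c, int d, int e, int f)
{
    return a + b + c + d + e + f;
}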

Intrinsic _mm512_round_ps is missing for AVX512

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?
What would be a good workaround?
apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?
Thanks!
TL:DR: AVX512F
__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);
related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.
The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.
You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).
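As a hedged sketch of the imm8 layout described above (assuming the unsigned interpretation of M), rounding to the nearest multiple of 0.5 just sets M=1 in the upper four bits; the function name is for illustration:

#include <immintrin.h>

__m512 roundscale_examples(__m512 v)
{
    /* M = 0 fraction bits: round to the nearest integer, suppress exceptions. */
    __m512 nearest_int  = _mm512_roundscale_ps(v,
                              _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

    /* M = 1 in imm8[7:4]: round to the nearest multiple of 0.5. */
    __m512 nearest_half = _mm512_roundscale_ps(v,
                              (1 << 4) | _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

    return _mm512_add_ps(nearest_int, nearest_half);    /* just to use both results */
}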

Any byte sequences that can not be present in valid x86 code?

I'm looking for a byte sequence (or sequences), to inject into an x86 program compiled using GCC, that cannot show up in the binary as a by-product of compilation.
The reason is that I want these byte sequences to act as "labels", so that I can recognize them later during inspection.
Is it possible to construct patterns of bytes, so that, searching through the binary, these patterns will not show up except with very small probability (I prefer probability zero). In other words, I want to minimize the number of false positives!
There are sequences that today are not a valid encoding of any instruction.
Rather than digging through the opcode tables in the Intel Manual Volume 2, you can exploit two facts of the x86 architecture:
The maximum instruction length is 15 bytes.
You can repeat prefixes.
These should also be more stable across generations than reserved opcodes.
The sequence 666666666666666666666666666666 (15 operand-size override prefixes, but any prefix will do) will generate an exception if executed, because together with an opcode it would exceed the 15-byte limit and thus cannot form a valid instruction.
For what it's worth, there is a specific instruction that fulfills the role of invalid instruction: ud2.
Its presence in a binary module is possible, but it is more idiomatic than an arbitrary invalid encoding and it is standard; for example, Linux uses it to mark a bug, because if ud2 is reached in the execution flow, the code behind it cannot be valid.
That said, if I got you right, that's not going to be useful to you.
You want to skip the process of decoding the instructions and scan the code section of the binary instead.
There is no guarantee that the code section will contain only code; for example, ARM compilers generate literal pools - that's definitely uncommon on x86 though.
However, compilers usually align functions to a specific boundary (usually 16 bytes); this can be done in several ways - like stretching the previous function or with mere padding.
This padding can be a sequence of bytes of any value - hence arbitrary bytes can be present in the code section.
Long story short, there is no universal byte sequence that appears with probability zero in the code section.
Everything that is not in the execution flow can have any value.
We will deal with probability later; for now let's assume the 66..66h sequence appears rarely enough in an executable.
You can't just use it directly, as 66..66h can be part of two instructions and thus be a valid sequence:
mov rax, 6666666666666666h   ; 8 of the 15 66h bytes form the immediate
db 66h, 66h, 66h, 66h        ; the remaining 7 bytes act as prefixes...
db 66h, 66h, 66h
nop                          ; ...on this nop, which is still one valid instruction
is valid.
This is due to the immediate operands of instructions - the biggest immediate can be 8 bytes in length (as of today), so the sequence must be lengthened to 15 + 8 = 23 bytes.
If you really want to be safe against future features, you can use a sequence of 14 + 15 = 29 bytes (for the 15-byte instruction length limit).
It's possible to find 23/29 bytes of value 66h in the code section or in the whole binary.
But how probable is that?
If the bytes in a binary were uniformly random then the probability would be astronomically small: 256^-23 = 2^-184.
Well, the point is that the bytes in a binary are not uniformly random.
You can open a file with an embedded icon to confirm that.
You can make the probability arbitrarily small by stretching the sequence - it's up to you to find a compromise between the length and an acceptable number of false positives.
It's unclear what you want to do, but here is some advice:
Most, if not all, building tools support generating a map file.
It is a file with all the symbols/names and their addresses.
If you could use actual labels (with a prefix and a random suffix) you'd collect them easily after the build.
Most output formats can be enriched with meta-information.
You can add an ELF/PE section with a table of offsets to the locations you want to mark.
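If inline assembly is an option, one way to plant such a marker with GCC is to jump over a run of prefix bytes so it sits in .text but is never executed. This is only a sketch; the macro name is mine and the 23-byte length is the value discussed above:

/* Hypothetical helper: emits 23 bytes of 66h into .text, jumped over at
 * run time so it is never executed. */
#define INJECT_MARKER()            \
    __asm__ volatile(              \
        "jmp 1f    \n\t"           \
        ".rept 23  \n\t"           \
        ".byte 0x66\n\t"           \
        ".endr     \n\t"           \
        "1:")

void traced_function(void)
{
    INJECT_MARKER();   /* find it later by scanning the binary for 23 x 66h */
    /* ... real code ... */
}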

Addition efficiency proportional to size of operands?

Question that just popped into my head, and I don't think I've seen an answer on here. Is the time taken by a binary addition algorithm proportional to the size of the operands?
Obviously, adding 1101011010101010101101010 and 10110100101010010101 is going to take longer than 1 + 1, but my question refers more to the smaller values. Is there a negligible difference, no difference, a theoretical difference?
At what point, with these sorts of rudimentary calculations, should we start looking into more efficient methods of calculation? E.g., exponentiation by squaring with large exponents for calculating huge powers.
How we see the binary patterns...
1101011010101010101101010 (big)
10110100101010010101 (medium)
1 (small)
How a 32bit computer sees the binary patterns...
00000001101011010101010101101010 32bit,
00000000000010110100101010010101 32bit,
00000000000000000000000000000001 i'm lovin it
On a 32bit system, all the above numbers will take the same time (no. of CPU instructions) to be added. As all of them fit within the basic computational block i.e. the 32bit CPU register.
How a 16bit computer sees the binary patterns...
1
+1 = ?
0000000000000001 i'm lovin it
0000000000000001 i'm lovin it
00000001101011010101010101101010
+00000000000010110100101010010101 = ?
00000001101011010101010101101010 too BIG for me!
00000000000010110100101010010101 too BIG for me!
On a 16bit system, as the larger numbers will NOT fit in a 16bit register, it will need an additional pass (to add the significant bits that remain after the first 16 LSBs are added).
Step 1: ADD the least significant bits
0101010101101010
0100101010010101
Step 2: ADD the rest (remember the carry bit from the previous operation)
000000000000000C
0000000110101101
0000000000001011
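In C, the same two-pass idea looks roughly like this (hedged sketch; a compiler for a 16-bit target emits the equivalent add / add-with-carry pair by itself, and the function name is mine):

#include <stdint.h>

/* Adding two 32-bit values using only 16-bit halves, mirroring Step 1
 * and Step 2 above. */
uint32_t add32_on_16bit(uint16_t a_lo, uint16_t a_hi,
                        uint16_t b_lo, uint16_t b_hi)
{
    uint16_t lo    = (uint16_t)(a_lo + b_lo);          /* Step 1: low halves  */
    uint16_t carry = (uint16_t)(lo < a_lo);            /* carry out of Step 1 */
    uint16_t hi    = (uint16_t)(a_hi + b_hi + carry);  /* Step 2: high halves */
    return ((uint32_t)hi << 16) | lo;
}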
We can start thinking of optimising the mathematical operations on numbers once the numbers no longer fit in the basic computation unit of the system, i.e. the CPU register.
Modern hardware architectures are developed keeping this in mind and support SIMD instructions. Compilers will often employ them (SSE on x86, NEON on ARM) when they see such a case, e.g. 128-bit decryption logic being run on a 32-bit system.
Also, it is not ONLY the size of the operands that matters: the size of the result also determines whether the system can accomplish the operation in one step, so the operation being performed needs to be taken into consideration as well.
For example, on a 32bit system, adding two 30bit numbers can be definitely carried out using the regular operations as the result is guaranteed to NOT exceed a 32bit register. But multiplying the same two 30bit numbers may result in a number that does NOT fit within 32bits.
In the absence of such a guarantee of being able to store the result in a single computational unit, to ensure validity of the result for all possible values, the architecture (and the compiler) must:
go the long way i.e. multi-step mathematical operations
or
employ SIMD optimisations
or
define and implement custom mechanisms
(like register-pairs EDX:EAX to hold the result on x86)
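A small C sketch of that last point (function names are mine): the sum of two 30-bit values always fits in 32 bits, while the product must be widened, which on 32-bit x86 the compiler typically handles with a widening multiply into the EDX:EAX pair.

#include <stdint.h>

uint32_t add30(uint32_t a, uint32_t b)
{
    return a + b;               /* two 30-bit values: the sum fits in 32 bits */
}

uint64_t mul30(uint32_t a, uint32_t b)
{
    /* The product can need up to 60 bits, so the result type must be
     * widened; on 32-bit x86 this typically becomes a widening MUL that
     * leaves the result in the EDX:EAX register pair. */
    return (uint64_t)a * b;
}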
In practice, there's no (or completely negligible) difference between adding different integers that fit in the processor words as that should always be a fixed-time operation.
In theory, the complexity for adding two unsigned integers should be O(log(n)) where n is the bigger of the two. As such, you need to go pretty high before mere additions become a problem.
As for where exactly to draw the line between simple and complex algorithms for computing numbers, I don't have an exact answer. However, the GMP library comes to mind. From what I understand, they've carefully chosen their algorithms and under what circumstances to use each in terms of performance. You may want to look into what they did.
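For instance, a minimal GMP sketch (hedged; only the standard mpz_* calls are used) that adds the two binary numbers from the question:

#include <gmp.h>
#include <stdio.h>

/* GMP chooses the underlying algorithm based on operand size; the caller
 * just uses the mpz_* API. */
int main(void)
{
    mpz_t a, b, sum;
    mpz_init_set_str(a, "1101011010101010101101010", 2);   /* base 2 */
    mpz_init_set_str(b, "10110100101010010101", 2);
    mpz_init(sum);

    mpz_add(sum, a, b);
    gmp_printf("%Zd\n", sum);

    mpz_clear(a); mpz_clear(b); mpz_clear(sum);
    return 0;
}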
I somewhat disagree with the above answers. It very much depends on the context.
For simple integer arithmetic (for loop counters etc), then on 64bit machines that computation will be done using 64bit general purpose registers (RSI/RCX/etc). In those cases, there is no difference in speed between an 8bit or 64bit addition.
If however you are processing arrays of integers, and assuming the compiler has been able to optimise the code nicely, then yes, smaller is faster (but not for the reason you think).
In the AVX2 instruction set, you have access to 4 integer addition instructions:
__m256i _mm256_add_epi8 (__m256i a, __m256i b); // 32 x 8bit
__m256i _mm256_add_epi16(__m256i a, __m256i b); // 16 x 16bit
__m256i _mm256_add_epi32(__m256i a, __m256i b); // 8 x 32bit
__m256i _mm256_add_epi64(__m256i a, __m256i b); // 4 x 64bit
You'll notice that all of them operate on 256bits at a time, which means you can process 4 integer additions if you're using 64bit, compared to 32 additions if you are using 8bit integers. (As mentioned above, you'd need to make sure you have enough precision). They all take the same number of clock cycles to compute - 1clk.
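As a hedged illustration, a loop like the following processes 32 byte-sized additions per _mm256_add_epi8; the function name and the assumption that n is a multiple of 32 are mine:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Adds two arrays of 8-bit integers, 32 elements per iteration.
 * Pointers may be unaligned (loadu/storeu). */
void add_u8_arrays(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi8(va, vb));
    }
}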
There are also other effects of using smaller data types, which are mainly better CPU cache usage, and a reduced number of memory reads/writes.
However, back to your original question on bit-by-bit computation. Prior to the new AVX-512 instruction set, this might have seemed a little silly, but the new instruction set contains a ternary-logic instruction. With it, it is possible to compute 512 additions on numbers of any bit length fairly easily.
#include <immintrin.h>

// Bit-sliced full adder: each __m512i holds one bit position of 512 independent numbers.
inline __m512i add(__m512i x, __m512i y, __m512i carry_in)
{
    // sum bit = x XOR y XOR carry_in (truth table 0x96)
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0x96);
}
inline __m512i adc(__m512i x, __m512i y, __m512i carry_in)
{
    // carry out = majority(x, y, carry_in) (truth table 0xE8)
    return _mm512_ternarylogic_epi32(carry_in, y, x, 0xE8);
}
__m512i A[NUM_BITS];        // A[i] holds bit i of 512 different numbers
__m512i B[NUM_BITS];
__m512i RESULT[NUM_BITS];
__m512i CARRY = _mm512_setzero_si512();   // integer zero (not _ps)
for (int i = 0; i < NUM_BITS; ++i)
{
    RESULT[i] = add(A[i], B[i], CARRY);   // sum bit using the incoming carry
    CARRY     = adc(A[i], B[i], CARRY);   // carry for the next bit position
}
In this particular example (which, to be honest, probably has very limited real-world usage!), the time it takes to perform the 512 additions is indeed directly proportional to NUM_BITS.

Data type compatibility with NEON intrinsics

I am working on ARM optimizations using the NEON intrinsics, from C++ code. I understand and master most of the typing issues, but I am stuck on this one:
The instruction vzip_u8 returns a uint8x8x2_t value (in fact an array of two uint8x8_t). I want to assign the returned value to a plain uint16x8_t. I see no appropriate vreinterpretq intrinsic to achieve that, and simple casts are rejected.
Some definitions to answer clearly...
NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide).
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
uint16x8_t is a type which requires 128-bit storage, thus it needs to be in a quadword register.
ARM NEON Intrinsics has a definition called vector array data type in the ARM® C Language Extensions:
... for use in load and store operations, in
table-lookup operations, and as the result type of operations that return a pair of vectors.
vzip instruction
... interleaves the elements of two vectors.
vzip Dd, Dm
and has an intrinsic like
uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t)
From these we can conclude that uint8x8x2_t is actually a list of two arbitrarily numbered doubleword registers, because the vzip instruction doesn't have any requirement on the order of its input registers.
Now the answer is...
uint8x8x2_t can live in two non-consecutive doubleword registers, while uint16x8_t is a data structure occupying two consecutive doubleword registers, of which the first has an even index (D0-D31 -> Q0-Q15).
Because of this you can't cast a vector array data type holding two doubleword registers to a quadword register... easily.
The compiler may be smart enough to assist you, or you can just force the conversion; however, I would check the resulting assembly for correctness as well as performance.
You can construct a 128-bit vector from two 64-bit vectors using the vcombine_* intrinsics, so you can achieve what you want like this (and, if you specifically need a uint16x8_t, the uint8x16_t result can then be reinterpreted with vreinterpretq_u16_u8):
#include <arm_neon.h>
uint8x16_t f(uint8x8_t a, uint8x8_t b)
{
uint8x8x2_t tmp = vzip_u8(a,b);
uint8x16_t result;
result = vcombine_u8(tmp.val[0], tmp.val[1]);
return result;
}
I have found a workaround: given that the val member of the uint8x8x2_t type is an array, it is therefore seen as a pointer. Casting and dereferencing the pointer works! [Whereas taking the address of the data raises an "address of temporary" warning.]
uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val;
It turns out that this compiles and executes as it should (at least in the case I have tried). I haven't looked at the assembly code, so I cannot guarantee it is implemented properly (I mean keeping the value in a register instead of writing/reading to/from memory).
I was facing the same kind of problem, so I introduced a flexible data type.
I can now therefore define the following:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
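A minimal sketch of what such a wrapper could look like (union-based, 128-bit specialization only; it relies on union type punning, which GCC documents as supported, and the real NeonVectorType may differ):

#include <arm_neon.h>

template <typename MainType> struct NeonVectorType;   // specialized per width

template <> struct NeonVectorType<uint8x16_t>
{
    union {                       // all views share the same 128 bits
        uint8x16_t  u8x16;
        uint8x8x2_t u8x8x2;       // pair of D registers, same total size
        uint16x8_t  u16x8;
        uint32x4_t  u32x4;
    };

    NeonVectorType(uint8x16_t v)  { u8x16  = v; }
    NeonVectorType(uint8x8x2_t v) { u8x8x2 = v; }
    NeonVectorType(uint16x8_t v)  { u16x8  = v; }

    operator uint8x16_t() const { return u8x16; }
    operator uint16x8_t() const { return u16x8; }
    operator uint32x4_t() const { return u32x4; }
};

// Usage: the zipped pair is stored via the uint8x8x2_t view and read back
// as a single uint16x8_t without an explicit cast.
uint16x8_t zip_to_u16(uint8x8_t a, uint8x8_t b)
{
    NeonVectorType<uint8x16_t> tmp = vzip_u8(a, b);
    return tmp;
}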
It's a bug in GCC (now fixed) in the 4.5 and 4.6 series.
Bugzilla link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252
Please take the fix from this bug, apply it to the gcc source, and rebuild it.
