How are Mathematical Equality Operators Handled at the Machine-Code Level?

So I wanted to ask a rather existential question today, and it's one that I feel as though most programmers skip over and just accept as something that works, without really asking the question of "how" it works. The question is rather simple: how is the >= operator compiled down to machine code, and what does that machine code look like? Down at the very bottom, it must be a greater than test, mixed with an "is equal" test. But how is this actually implemented? Thinking about it seems rather paradoxical, because at the very bottom there cannot be a > or == test. There needs to be something else. I want to know what this is.
How do computers test for equality and greater than at the fundamental level?

Indeed there is no > or == test as such. Instead, the lowest level comparison in assembler works by binary subtraction.
On x86, the instruction for integer comparisons is CMP. It is really the one instruction to rule them all. How it works is described, for example, in the 80386 Programmer's Reference Manual:
CMP subtracts the second operand from the first but, unlike the SUB instruction, does not store the result; only the flags are changed.
CMP is typically used in conjunction with conditional jumps and the SETcc instruction. (Refer to Appendix D for the list of signed and unsigned flag tests provided.) If an operand greater than one byte is compared to an immediate byte, the byte value is first sign-extended.
Basically, CMP A, B (In Intel operand ordering) calculates A - B, and then discards the result. However, in an x86 ALU, arithmetic operations set condition flags inside the flag register of the CPU based on the result of the operation. The flags relevant to arithmetic operations are
Bit  Name  Function
 0   CF    Carry Flag -- Set on high-order bit carry or borrow; cleared otherwise.
 6   ZF    Zero Flag -- Set if result is zero; cleared otherwise.
 7   SF    Sign Flag -- Set equal to high-order bit of result (0 if positive, 1 if negative).
11   OF    Overflow Flag -- Set if result is too large a positive number or too small a negative number (excluding sign-bit) to fit in destination operand; cleared otherwise.
For example if the result of calculation is zero, the Zero Flag ZF is set. CMP A, B executes A - B and discards the result. The result of subtraction is 0 iff A == B. Thus the ZF will be set only when the operands are equal, cleared otherwise.
Carry flag CF is set iff the unsigned subtraction results in a borrow, i.e. A - B would be < 0 with A and B treated as unsigned numbers -- in other words, iff A < B as unsigned.
Sign flag is set whenever the MSB of the result is set, i.e. when the result, interpreted as a 2's complement signed number, is negative. However, consider the 8-bit subtraction 01111111 (127) - 10000000 (-128): the result is 11111111, which interpreted as an 8-bit signed 2's complement number is -1, even though 127 - (-128) should be 255. A signed integer overflow happened. So the sign flag alone doesn't tell which of the signed quantities was greater - the overflow flag OF records whether a signed overflow happened in the previous arithmetic operation.
Now, depending on the place where this is used, a Byte Set on Condition SETcc or a Jump if Condition is Met Jcc instruction is used to decode the flags and act on them. If the boolean value is used to set a variable, then a clever compiler would use SETcc; Jcc would be a better match for an if...else.
Now, there are 2 choices for >=: either we want a signed comparison or an unsigned comparison.
int a, b;
bool r1, r2;
unsigned int c, d;
r1 = a >= b; // signed
r2 = c >= d; // unsigned
In Intel assembly the names of conditions for unsigned inequality use the words above and below; conditions for signed inequality use the words greater and less. Thus, for r2 the compiler could decide to use Set on Above or Equal, i.e. SETAE, which sets the target byte to 1 if CF=0. For r1 the result would be decoded by SETGE - Set Byte on Greater or Equal, which means SF=OF - i.e. the result of the subtraction interpreted in 2's complement is positive without overflow, or negative with overflow.
Finally an example:
#include <stdbool.h>
bool gte_unsigned(unsigned int a, unsigned int b) {
return a >= b;
}
The resulting optimized code on x86-64 Linux is:
cmp edi, esi
setae al
ret
Likewise for signed comparison
bool gte_signed(int a, int b) {
return a >= b;
}
The resulting assembly is
cmp edi, esi
setge al
ret

Here's a simple C function:
bool lt_or_eq(int a, int b)
{
return (a <= b);
}
On x86-64, GCC compiles this to:
.file "lt_or_eq.c"
.text
.globl lt_or_eq
.type lt_or_eq, @function
lt_or_eq:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
setle %al
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size lt_or_eq, .-lt_or_eq
The important part is the cmpl -8(%rbp), %eax; setle %al; sequence. Basically, it's using the cmp instruction to compare the two arguments numerically, setting the status flags (ZF, SF, OF, ...) based on that comparison. It then uses setle to set the %al register to 0 or 1, depending on the state of those flags. The caller gets the return value from the %al register.

First the computer needs to figure out the type of the data. In a language like C this happens at compile time; Python would dispatch to different type-specific tests at run time. Assuming we are in a compiled language, and that we know the values being compared are integers, the compiler would make sure the values are in registers and then issue something like:
SUBS r1, r2
BGE #target
subtracting the registers, and then checking for zero/underflow. These are built-in operations on the CPU. (Which I'm assuming here is ARM-like; there are many variations.)

Related

Mathematically find the value that is closest to 0

Is there a way to determine mathematically if a value is closer to 0 than another?
For example closerToZero(-2, 3) would return -2.
I tried by removing the sign and then compared the values for the minimum, but I would then be assigning the sign-less version of the initial numbers.
a and b are IEEE-754 compliant floating-point doubles (js number)
(64 bit => 1 bit sign 11 bit exponent 52 bit fraction)
min (a,b) => b-((a-b)&((a-b)>>52));
result = min(abs(a), abs(b));
// result has the wrong sign ...
The obvious algorithm is to compare the absolute values, and use that to select the original values.
If this absolutely needs to be branchless (e.g. for crypto security), be careful with ? : ternary. It often compiles to branchless asm but that's not guaranteed. (I assume that's why you tagged branch-prediction? If it was just out of performance concerns, the compiler will generally make good decisions.)
In languages with fixed-width 2's complement integers, remember that abs(INT_MIN) overflows a signed result of the same width as the input. In C and C++, abs() is inconveniently designed to return an int, and it's undefined behaviour to call it with the most-negative 2's complement integer. On systems with well-defined wrapping signed-int math (like gcc -fwrapv, or maybe Java), signed abs(INT_MIN) would overflow back to INT_MIN, giving wrong results if you do a signed compare, because INT_MIN is maximally far from 0.
Make sure you do an unsigned compare of the abs results so you correctly handle INT_MIN. (Or as @kaya3 suggests, map positive integers to negative, instead of negative to positive.)
Safe C implementation that avoids Undefined Behaviour:
unsigned absu(int x) {
return x<0? 0U - x : x;
}
int minabs(int a, int b) {
return absu(a) < absu(b) ? a : b;
}
Note that < vs. <= actually matters in minabs: that decides which one to select if their magnitudes are equal.
0U - x converts x to unsigned before the subtraction from 0, which would otherwise overflow. Converting negative signed-integer types to unsigned is well-defined in C and C++ as modulo reduction (unlike for floats, where it's UB IIRC). On 2's complement machines that means keeping the same bit-pattern unchanged.
This compiles nicely for x86-64 (Godbolt), especially with clang. (GCC avoids cmov even with -march=skylake, ending up with a worse sequence. Except for the final select after doing both absu operations, then it uses cmovbe which is 2 uops instead of 1 for cmovb on Intel CPUs, because it needs to read both ZF and CF flags. If it ended up with the opposite value in EAX already, it could have used cmovb.)
# clang -O3
absu:
mov eax, edi
neg eax # sets flags like sub-from-0
cmovl eax, edi # select on signed less-than condition
ret
minabs:
mov ecx, edi
neg ecx
cmovl ecx, edi # inlined absu(a)
mov eax, esi
mov edx, esi
neg edx
cmovl edx, esi # inlined absu(b)
cmp ecx, edx # compare absu results
cmovb eax, edi # select on unsigned Below condition.
ret
Fully branchless with both GCC and clang, with optimization enabled. It's a safe bet that other ISAs will be the same.
It might auto-vectorize decently, but x86 doesn't have SIMD unsigned integer compares until AVX512. (You can emulate by flipping the high bit to use signed integer pcmpgtd).
For float / double, abs is cheaper and can't overflow: just clear the sign bit, then use that comparison to select the original value.

Is this assembly code sorting numbers in ascending order?

I am trying to figure out what this assembly code is performing.
.equ SIZE =128
.equ TABLE_L =$60
.equ TABLE_H =$00
.def A =r13
.def B =r14
.def cnt2 =r15
.def cnt1 =r16
.def endL =r17
.def endH =r18
Outer:
mov ZL, endL
mov ZH, endH
mov cnt2, cnt1
inner_loop: ld A, Z
ld B, -Z
cp A, B
brlo L1
st Z, A
std Z+1, B
L1: dec cnt2
brne inner_loop
dec cnt1
brne Outer
ret
table:
I believe it may be sorting numbers in ascending order, but I am not sure. The table is left blank as I am not sure what values are stored there. I am trying to figure out what the code does based only on the code.
Yeah, looks like a simple Bubble Sort (without the "early out" optimization that bloats the code by checking if any swaps have happened; if you want better almost-sorted-input performance, use InsertionSort). See Bubble Sort: An Archaeological Algorithmic Analysis for a look at different forms of Bubble Sort, including this where the range of elements scanned decreases by 1 each outer iteration.
It has an interesting code-size advantage on AVR vs. other simple sorts, which is that the accesses are all local, so they can all use the Z register and don't have to do add-with-carry address calculation. (Although Insertion Sort should be similar.)
A is loaded from the higher address.
cp A,B / brlo skips the swap if the unsigned lower element is already at the higher address, so it's sorting the lowest (unsigned) elements to the end of the array. That's descending order.
The stores (if they happen) are to the same two locations the loads were from, so this is indeed a swap, not some buggy nonsense.

algorithm of addressing a triangle matrices memory using assembly

I was doing a project in ASM about pascal triangle using NASM
so in the project you need to calculate pascal triangle from line 0 to line 63
my first problem was where to store the results of the calculation -> memory
my second problem was what kind of layout to use in that memory. To explain what I mean, I had 3 ways. The first is to declare a full matrix, like this:
memoryLabl: resd 63*64 ; 63 rows of 64 columns each
but the problem with this way is that half of the matrix is never used, which makes my program inefficient, so let's go to the second available method,
which is to declare a label in memory for every line,
for example :
line0: dd 1
line1: dd 1,1
line2: dd 1,2,1 ; with pre-filled data for example purposes
...
line63: resd 64 ; reserve space for 64 dword entries
this way of doing it is like doing it by hand;
some others from the class tried to use a macro, as you can see here,
but I don't get it
so far so good
let's go to the last one, the one that I have used,
which is like the first one but using a triangular matrix. How is that?
By declaring only the amount of memory that I need:
to store lines 0 to 63 of the Pascal triangle it gives me a triangular matrix, because every new line adds one more cell
I have allocated 2080 dwords for the triangular matrix. Where does 2080 come from?
we have: line0 has 1 dword /* 1 number in the first line */
line1 has 2 dwords /* 2 numbers in the second line */
line2 has 3 dwords /* 3 numbers in the third line */
...
line63 has 64 dwords /* 64 numbers in the final line */
so in the end we have 2080 as the sum of them,
giving every number 1 dword
okay, now that we have created the memory to store the results, let's start the calculation
first: in the Pascal triangle, all the cells in row 0 (the first cell of each line) have value 1
I will do it in pseudo code so you understand how I put a 1 in all cells of row 0:
s=0
for(i=0;i<64;i++):
s = s+i
mov dword[x+s*4],1 /* x is addresses of triangle matrices */
the second part of the Pascal triangle is to have the last cell of each line equal to 1
I will use pseudo code to keep it simple
s=0
for(i=2;i<64;i++):
s = s+i
mov dword[x+s*4],1
I start from i equal to 2 because i = 0 and i = 1 would be line0 and line1, which are already full since they hold only one and two values respectively, as I said in the explanation above
so the two pieces of pseudo code will make my triangle look like this in memory:
1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
...
1 1
now comes the hard part: using these values to calculate and fill all the triangle cells
let's start with the idea
let's take cell[line][row]
we have cell[2][1] = cell[1][0]+cell[1][1]
and cell[3][1]= cell[2][0]+cell[2][1]
cell[3][2]= cell[2][1]+cell[2][2]
in general we have
cell[line][row] = cell[line-1][row-1] + cell[line-1][row]
my problem is that I could not express this relation with ASM instructions, because I have a triangular matrix, which is weird to work with. Can anyone help me break it down using a relation, very basic pseudo code, or asm code?
TL:DR: you just need to traverse the array sequentially, so you don't have to work out the indexing. See the 2nd section.
To randomly index into a (lower) triangular matrix, note that row r starts after a triangle of size r-1. A triangle of size n has n*(n+1)/2 total elements, using Gauss's formula for the sum of the numbers from 1 to n. So a triangle of size r-1 has (r-1)*r/2 elements. Indexing a column within a row is of course trivial, once we know the address of the start of the row.
Each DWORD element is 4 bytes wide, and we can take care of that scaling as part of the multiply, because lea lets us shift and add as well as put the result in a different register. We simplify n*(n-1)/2 elements * 4 bytes / elem to n*(n-1) * 2 bytes.
The above reasoning works for 1-based indexing, where row 1 has 1 element. To adjust for zero-based indexing we add 1 to the row index before the calculation, i.e. we want the size of a triangle with (r+1) - 1 = r rows, thus r*(r+1)/2 elements, or r*(r+1)/2 * 4 bytes. It helps to write the linear byte offsets into a triangle to quickly double-check the formula:
0
4 8
12 16 20
24 28 32 36
40 44 48 52 56
60 64 68 72 76 80
84 88 92 96 100 104 108
The 4th row, which we're calling "row 3", starts 24 bytes from the start of the whole array. That's (3+1)*(3+1-1) * 2 = (3+1)*3 * 2; yes the r*(r+1)/2 formula works.
;; given a row number in EDI, and column in ESI (zero-extended into RSI)
;; load triangle[row][col] into eax
lea ecx, [2*rdi + 2]
imul ecx, edi ; ecx = r*(r+1) * 2 bytes
mov eax, [triangle + rcx + rsi*4]
This assumes 32-bit absolute addressing is ok (see: 32-bit absolute addresses no longer allowed in x86-64 Linux?). If not, use a RIP-relative LEA to get the triangle base address in a register, and add that to rsi*4. x86 addressing modes can only have 3 components when one of them is a constant, but that is the case here for your static triangle, so we can take full advantage: a scaled index for the column, base as our calculated row offset, and the actual array address as the displacement.
Calculating the triangle
The trick here is that you only need to loop over it sequentially; you don't need random access to a given row/column.
You read one row while writing the one below. When you get to the end of a row, the next element is the start of the next row. The source and destination pointers will get farther and farther from each other as you go down the rows, because the destination is always 1 whole row ahead. And you know the length of a row = row number, so you can actually use the row counter as the offset.
global _start
_start:
mov esi, triangle ; src = address of triangle[0,0]
lea rdi, [rsi+4] ; dst = address of triangle[1,0]
mov dword [rsi], 1 ; triangle[0,0] = 1 special case: there is no source
.pascal_row: ; do {
mov rcx, rdi ; RCX = one-past-end of src row = start of dst row
xor eax, eax ; EAX = triangle[row-1][col-1] = 0 for first iteration
;; RSI points to start of src row: triangle[row-1][0]
;; RDI points to start of dst row: triangle[row ][0]
.column:
mov edx, [rsi] ; tri[r-1, c] ; will load 1 on the first iteration
add eax, edx ; eax = tri[r-1, c-1] + tri[r-1, c]
mov [rdi], eax ; store to triangle[row, col]
add rdi, 4 ; ++dst
add rsi, 4 ; ++src
mov eax, edx ; becomes col-1 src value for next iteration
cmp rsi, rcx
jb .column ; }while(src < end_src)
;; RSI points to one-past-end of src row, i.e. start of next row = src for next iteration
;; RDI points to last element of dst row (because dst row is 1 element longer than src row)
mov dword [rdi], 1 ; [r,r] = 1 end of a row
add rdi, 4 ; this is where dst-src distance grows each iteration
cmp rdi, end_triangle
jb .pascal_row
;;; triangle is constructed. Set a breakpoint here to look at it with a debugger
xor edi,edi
mov eax, 231
syscall ; Linux sys_exit_group(0), 64-bit ABI
section .bss
; you could just as well use resd 64*65/2
; but put a label on each row for debugging convenience.
ALIGN 16
triangle:
%assign i 0
%rep 64
row %+ i: resd i + 1
%assign i i+1
%endrep
end_triangle:
I tested this and it works: correct values in memory, and it stops at the right place. But note that integer overflow happens before you get down to the last row. This would be avoided if you used 64-bit integers (a simple change to register names and offsets; and don't forget to change resd to resq). 64 choose 32 is 1832624140942590534 = 2^60.66.
The %rep block to reserve space and label each row as row0, row1, etc. is from my answer to the question you linked about macros, much more sane than the other answer IMO.
You tagged this NASM, so that's what I used because I'm familiar with it. The syntax you used in your question was MASM (until the last edit). The main logic is the same in MASM, but remember that you need OFFSET triangle to get the address as an immediate, instead of loading from it.
I used x86-64 because 32-bit is obsolete, but I avoided too many registers, so you can easily port this to 32-bit if needed. Don't forget to save/restore call-preserved registers if you put this in a function instead of a stand-alone program.
Unrolling the inner loop could save some instructions copying registers around, as well as the loop overhead. This is a somewhat optimized implementation, but I mostly limited it to optimizations that make the code simpler as well as smaller / faster. (Except maybe for using pointer increments instead of indexing.) It took a while to make it this clean and simple. :P
Different ways of doing the array indexing would be faster on different CPUs. e.g. perhaps use an indexed addressing mode (relative to dst) for the loads in the inner loop, so only one pointer increment is needed. But if you want it to run fast, SSE2 or AVX2 vpaddd could be good. Shuffling with palignr might be useful, but probably also unaligned loads instead of some of the shuffling, especially with AVX2 or AVX512.
But anyway, this is my version; I'm not trying to write it the way you would, you need to write your own for your assignment. I'm writing for future readers who might learn something about what's efficient on x86. (See also the performance section in the x86 tag wiki.)
How I wrote that:
I started writing the code from the top, but quickly realized that off-by-one errors were going to be tricky, and I didn't want to just write it the stupid way with branches inside the loops for special cases.
What ended up helping was writing the comments for the pre and post conditions on the pointers for the inner loop. That made it clear I needed to enter the loop with eax=0, instead of with eax=1 and storing eax as the first operation inside the loop, or something like that.
Obviously each source value only needs to be read once, so I didn't want to write an inner loop that reads [rsi] and [rsi+4] or something. Besides, that would have made it harder to get the boundary condition right (where a non-existant value has to read as 0).
It took some time to decide whether I was going to have an actual counter in a register for row length or row number, before I ended up just using an end-pointer for the whole triangle. It wasn't obvious before I finished that using pure pointer increments / compares was going to save so many instructions (and registers when the upper bound is a build-time constant like end_triangle), but it worked out nicely.

How can I write an interpreter for 'eq' for Hack Assembly language?

I am reading and studying The Elements of Computing Systems but I am stuck at one point. A sample chapter can be found here.
Anyway, I am trying to implement a Virtual Machine (or a byte code to assembly translator) but I am stuck at one point.
You can find the assembly notation here.
The goal is to implement a translator that will translate a specific byte code to this assembly code.
An example I have done successfully is for the byte code
push constant 5
which is translated to:
#5
D=A
#256
M=D
As I said, the assembly language for Hack is found in the link I provided but basically:
#5 // Load constant 5 to Register A
D=A // Assign the value in Reg A to Reg D
#256// Load constant 256 to Register A
M=D // Store the value found in Register D to Memory Location[A]
Well this was pretty straightforward. By definition memory location 256 is the top of the stack. So
push constant 5
push constant 98
will be translated to:
#5
D=A
#256
M=D
#98
D=A
#257
M=D
which is all fine..
I also want to give one more example:
push constant 5
push constant 98
add
is translated to:
#5
D=A
#256
M=D
#98
D=A
#257
M=D
#257 // Here starts the translation for 'add' // Load top of stack to A
D=M // D = M[A]
#256 // Load top of stack to A
A=M // A = M[A]
D=D+A
#256
M=D
I think it is pretty clear.
However I have no idea how I can translate the byte code
eq
to Assembly. Definition for eq is as follows:
Three of the commands (eq, gt, lt) return Boolean values. The VM
represents true and false as -1 (minus one, 0xFFFF) and 0 (zero,
0x0000), respectively.
So I need to pop two values to registers A and D respectively, which is quite easy. But how am I supposed to create an Assembly code that will check against the values and push 1 if the result is true or 0 if the result is false?
The assembly code supported for Hack Computer is as follows:
I can do something like:
push constant 5
push constant 6
sub
which will hold the value 0 if the 2 values pushed to the stack are equal, or non-zero if not - but how does that help? I tried using D&A or D&M but that did not help much either.
I can also introduce a conditional jump but how am I supposed to know what instruction to jump to? Hack Assembly code does not have something like "skip the next 5 instructions" or etc..
[edit by Spektre] target platform summary as I see it
16bit Von Neumann architecture (address is 15 bits with 16 bit Word access)
Data memory 32KW (Read/Write)
Instruction (Program) memory 32KW (Read only)
native 16 bit registers A,D
general purpose 16 bit registers R0-R15 mapped to Data memory at 0x0000 - 0x000F
these are most likely used also for: SP(R0),LCL(R1),ARG(R2),This(R3),That(R4)
Screen is mapped to Data memory at 0x4000-0x5FFF (512x256 B/W pixels 8KW)
Keyboard is mapped to Data memory at 0x6000 (ASCII code of the last key hit?)
It appears there is another chapter which more definitively defines the Hack CPU. It says:
The Hack CPU consists of the ALU specified in chapter 2 and three
registers called data register (D), address register (A), and program
counter (PC). D and A are general-purpose 16-bit registers that can be
manipulated by arithmetic and logical instructions like A=D-1 , D=D|A
, and so on, following the Hack machine language specified in chapter
4. While the D-register is used solely to store data values, the contents of the A-register can be interpreted in three different ways,
depending on the instruction’s context: as a data value, as a RAM
address, or as a ROM address
So apparently "M" accesses are to RAM locations controlled by A. There's the indirect addressing I was missing. Now everything clicks.
With that confusion cleared up, now we can handle OP's question (a lot more easily).
Let's start with implementing subroutine calls with the stack.
; subroutine calling sequence
#returnaddress ; sets the A register
D=A
#subroutine
0 ; jmp
returnaddress:
...
subroutine: ; D contains return address
; all parameters must be passed in memory locations, e.g, R1-R15
; ***** subroutine entry code *****
#STK
AM=M+1 ; bump stack pointer; also set A to new SP value
M=D ; write the return address into the stack
; **** subroutine entry code end ***
<do subroutine work using any or all registers>
; **** subroutine exit code ****
#STK
AM=M-1 ; move stack pointer back
A=M ; fetch entry from stack
0; jmp ; jmp to return address
; **** subroutine exit code end ****
The "push constant" instruction can easily be translated to store into a dynamic location in the stack:
#<constant> ; sets A register
D=A ; save the constant someplace safe
#STK
AM=M+1 ; bump stack pointer; also set A to new SP value
M=D ; write the constant into the stack
If we wanted to make a subroutine to push constants:
pushR2: ; value to push in R2
#R15 ; save return address in R15
M=D ; we can't really use the stack,...
#R2 ; because we are pushing on it
D=M
#STK
AM=M+1 ; bump stack pointer; also set A to new SP value
M=D ; write the return address into the stack
#R15
A=M
0 ; jmp
And to call the "push constant" routine:
#<constant>
D=A
#R2
M=D
#returnaddress ; sets the A register
D=A
#pushR2
0 ; jmp
returnaddress:
To push a variable value X:
#X
D=M
#R2
M=D
#returnaddress ; sets the A register
D=A
#pushR2
0 ; jmp
returnaddress:
A subroutine to pop a value from the stack into the D register:
popD:
#R15 ; save return address in R15
M=D ; we can't really use the stack,...
#STK
AM=M-1 ; decrement stack pointer; also set A to new SP value
D=M ; fetch the popped value
#R15
A=M
0 ; jmp
Now, to do the "EQ" computation that was OP's original request:
EQ: ; compare values on top of stack, return boolean in D
#R15 ; save return address
M=D
#EQReturn1
D=A
#PopD
0; jmp
EQReturn1:
#R2
M=D ; save first popped value
#EQReturn2
D=A
#PopD
0; jmp
EQReturn2:
; here D has 2nd popped value, R2 has first
#R2
D=D-M
#EQDone
equal; jmp
#AddressOfXFFFF
D=M
EQDone: ; D contains 0 or FFFF here
#R15
A=M ; fetch return address
0; jmp
Putting it all together:
#5 ; push constant 5
D=A
#R2
M=D
#returnaddress1
D=A
#pushR2
0 ; jmp
returnaddress1:
#X ; now push X
D=M
#R2
M=D
#returnaddress2
D=A
#pushR2
0 ; jmp
returnaddress2:
#returnaddress3 ; pop and compare the values
D=A
#EQ
0 ; jmp
returnaddress3:
At this point, OP can generate code to push D onto the stack:
#R2 ; push D onto stack
M=D
#returnaddress4
D=A
#pushR2
0 ; jmp
returnaddress4:
or he can generate code to branch on the value of D:
#jmptarget
EQ ; jmp
As I wrote in the last comment, there is a branchless way: you compute the return value from the operands directly.
Let's take the easy operation like eq for now.
If I get it right, eq a,d is something like a = (a==d),
where true is 0xFFFF and false is 0x0000.
So: if a==d then a-d==0, and this can be used directly:
compute a = a-d
compute an OR cascade of all the bits of a
if the result is 0, return 0xFFFF
if the result is 1, return 0
this can be achieved by a table or by OR_cascade(a)-1
the OR cascade:
I do not see any bit shift operations in your description,
so you need to use a+a instead of a<<1,
and if a right shift is needed then you need to implement divide by 2.
So when I summarize this, eq a,d could look like this:
a=a-d;
a=(a|(a>>1)|(a>>2)|...|(a>>15))&1;
a=a-1;
you just need to encode this into your assembly
as you do not have division or shift directly supported, maybe this is better:
a=a-d;
a=(a|(a<<1)|(a<<2)|...|(a<<15))&0x8000;
a=(a>>15)-1;
the lower and greater comparisons are much more complicated:
you need to compute the carry flag of the subtraction,
or use the sign of the result (MSB of the result)
if you limit the operands to 15 bits then it is just the 15th bit
for full 16-bit operands you need to compute the 16th bit of the result
for that you need to know quite a bit about logic circuits and ALU summation principles
or divide the values into 8-bit halves and do a 2x8-bit subtraction cascade
so a=a-d becomes:
sub al,dl
sbc ah,dh
and the carry/sign is in the 8th bit of the result, which is accessible

Shift elements to the left of a SIMD register based on boolean mask [duplicate]

This question already has answers here:
AVX2 what is the most efficient way to pack left based on a mask?
(6 answers)
Closed 6 months ago.
This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector
I would like to create an optimal function with this signature:
__m256i PackLeft(__m256i inputVector, __m256i boolVector);
The desired behaviour is that on an input of 64-bit ints like this:
inputVector = {42, 17, 13, 3}
boolVector = {true, false, true, false}
It masks all values that have false in the boolVector and then repacks the values that remain to the left. On the output above, the return value should be:
{42, 13, X, X}
... Where X is "I don't care".
An obvious way to do this is to use _mm_movemask_epi8 to get an integer mask out of the bool vector, look up the shuffle mask in a table, and then do a shuffle with that mask.
However, I would like to avoid a lookup table if possible. Is there a faster solution?
This is covered quite well by Andreas Fredriksson in his 2015 GDC talk: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf
Starting on slide 104, he covers how to do this using only SSSE3 and then using just SSE2.
Just saw this problem - perhaps you have already solved it, but I am still writing up the logic for other programmers who may need to handle this situation.
The solution (in Intel ASM format) is given below. It consists of three steps :
Step 0: convert the 8-bit mask into a 64-bit mask, with each set bit in the original mask represented as 8 set bits in the expanded mask.
Step 1 : Use this expanded mask to extract the relevant bits from the source data
Step 2: since you require the data to be left packed, we shift the output left by the appropriate number of bits.
The code is as below :
; Step 0 : convert the 8 bit mask into a 64 bit mask
xor r8,r8
movzx rax,byte ptr mask_pattern
mov r9,rax ; save a copy of the mask - avoids a memory read in Step 2
mov rcx,8 ; size of mask in bit count
outer_loop :
shr al,1 ; get the least significant bit of the mask into CY
setnc dl ; set DL to 0 if CY=1, else 1
dec dl ; if the mask lsb was 1, DL becomes 0xFF, else it stays 0x00
shrd r8,rdx,8
loop outer_loop
; We get the mask duplicated in R8, except it now represents bytewise mask
; Step 1 : we extract the bits compressed to the lowest order bit
mov rax,qword ptr data_pattern
pext rax,rax,r8
; Now we do a left shift, as left-aligned output is required
popcnt r9,r9 ; get the count of bits set in the mask
mov rcx,8
sub cl,r9b ; compute 8-(count of bits set to 1 in the mask)
shl cl,3 ; convert the count of bits to count of bytes
shl rax,cl
;The required data is in RAX
Trust this helps
