RISC-V: how are branch instructions calculated? - cpu

I am trying to understand how a modern CPU works. I am focused on RISC-V. There are a few types of branches:
BEQ
BNE
BLT
BGE
BLTU
BGEU
I use the Venus simulator to test this, and I am also trying to simulate it myself; so far it works, but I cannot understand how branches are calculated.
From what I have read, the ALU has just one flag output - ZERO (apart from its math output) - which is active whenever the result is zero. But how can I determine whether the branch should be taken based just on the ZERO output? And how are the other conditions calculated?
Example code:
addi t0, zero, 9
addi t1, zero, 10
blt t0, t1, end
end:
Example of branches:
BEQ - subtract 2 numbers, if ZERO is active, branch
BNE - subtract 2 numbers, if ZERO is not active, branch
BLT - and here I am a little bit confused; should I subtract and then look at the sign bit, or what?
BGE / BGEU - and how to differentiate these? What math instructions should I use?

Yes, the ZERO output gives you equal / not-equal. You can also use XOR instead of SUB for equality comparisons if that runs faster (ready earlier in a partial clock cycle) and/or uses less power (fewer transistors switching).
Fun fact: MIPS only has eq / ne and signed-compare-against-zero branch conditions, all of which can be tested fast without carry propagation or any other cascading bits. That mattered because it checked branch conditions in the first half cycle of exec, in time to forward to fetch, keeping branch latency down to 1 cycle which the branch-delay slot hid on classic MIPS pipelines. For other conditions, like blt between two registers, you need slt and branch on that. RISC-V has true hardware instructions for blt between two registers, vs. MIPS's bltz against zero only.
Why use an ALU with only a zero output? That makes it unusable for comparisons other than exact equality.
You need other outputs to determine GT / GE / LE / LT (and their unsigned equivalents) from a subtract result.
For unsigned conditions, all you need is zero and a carry/borrow (unsigned overflow) flag.
The sign bit of the result on its own is not sufficient for signed conditions, because signed overflow is possible. For example, (-1) - (-2) = +1: -1 > -2 and the sign bit of the result is clear, as expected. But with 8-bit wraparound, 0x80 - 0x7F = +1 as well: the sign bit is also clear, even though -128 < 127. The sign bit of a number on its own is only useful when comparing against zero.
If you widen the result (by sign-extending the inputs and doing one more bit of add/sub) that makes signed overflow impossible so that 33rd bit is a signed-less-than result directly.
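In C, that widening trick is a two-liner (a minimal sketch, using 64-bit math to stand in for the 33rd bit):

#include <stdint.h>

int slt_widened(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - (int64_t)b;  // sign-extend, then subtract: overflow is impossible
    return diff < 0;                         // the sign of the wide result is the "33rd bit"
}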
You can also get a signed-less-than result from signed_overflow XOR signbit instead of actually widening + adding. You might also want an ALU output for signed overflow, if RISC-V has any architectural way for software to check for signed-integer overflow.
Signed overflow can be computed by looking at the carry in and carry out of the MSB (the sign bit). If those differ, you have overflow, i.e. OF = XOR of those two carries. See also http://teaching.idallen.com/dat2343/10f/notes/040_overflow.txt for a detailed look at unsigned carry vs. signed overflow with 2-bit and 4-bit examples.
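A minimal C sketch of that carry rule (the function name and 32-bit width are my own choices for illustration; for a subtract a - b, call it as add_overflow(a, ~b, 1)):

#include <stdint.h>

uint32_t add_overflow(uint32_t a, uint32_t b, uint32_t cin) {
    // carry into the MSB = carry out of the low 31 bits
    uint32_t cin_msb  = ((a & 0x7FFFFFFFu) + (b & 0x7FFFFFFFu) + cin) >> 31;
    // carry out of the MSB, using 64-bit math to capture bit 32
    uint32_t cout_msb = (uint32_t)(((uint64_t)a + b + cin) >> 32);
    return cin_msb ^ cout_msb;  // 1 means signed overflow
}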
In CPUs with a FLAGS register (e.g. x86 and ARM), those ALU outputs actually go into a special register with named bits. You can look at an x86 manual for conditional-jump instructions to see how condition names like l (signed less-than) or b (unsigned below) map to those flags:
signed conditions:
jl (aka RISC-V blt) : Jump if less (SF≠OF). That's the output sign bit not equal to the Overflow Flag, from a subtract / cmp.
jle : Jump if less or equal (ZF=1 or SF≠OF).
jge (aka RISC-V bge) : Jump if greater or equal (SF=OF).
jg (aka RISC-V bgt) : Jump if greater (ZF=0 and SF=OF).
If you decide to have your ALU just produce a "signed-less-than" output instead of separate SF and OF outputs, that's fine. SF==OF is just !(SF != OF).
(x86 also has some mnemonic synonyms for the same opcode, like jl = jnge. There are "only" 16 FLAGS predicates, including OF=0 alone (test for overflow, not a compare result), and the parity flag. You only care about the actual signed/unsigned compare conditions.)
If you think through some example cases, like testing that INT_MAX > INT_MIN, you'll see why these conditions make sense, like the example I showed above for 8-bit numbers.
unsigned:
jb (aka RISC-V bltu) : Jump if below (CF=1). That's just testing the carry flag.
jae (aka RISC-V bgeu) : Jump if above or equal (CF=0).
ja (aka RISC-V bgtu) : Jump if above (CF=0 and ZF=0).
(Note that x86 subtract sets CF = borrow output, so 1 - 2 sets CF=1. Some other ISAs (e.g. ARM) invert the carry flag for subtract. When implementing RISC-V this will all be internal to the CPU, not architecturally visible to software.)
I don't know if RISC-V actually has all of these different branch conditions, but x86 does.
There might be simpler ways to implement a signed or unsigned comparator than doing subtraction at all.
But if you already have an add/subtract ALU and want to piggyback on that then you might just want it to generate Carry and Signed-less-than outputs as well as Zero.
That way you don't need a separate sign-flag output, or to grab the MSB of the integer result. It's just one extra XOR gate inside the ALU to combine those two things.
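Tying that together, here is a rough C sketch of how all six RISC-V conditions can fall out of one subtraction (the struct and names are illustrative, not any real hardware interface):

#include <stdint.h>

typedef struct { int zero, carry, slt; } alu_flags;

alu_flags alu_sub(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    alu_flags f;
    f.zero  = (r == 0);                        // ZERO: beq / bne ((a ^ b) == 0 would work too)
    f.carry = (a < b);                         // borrow out: bltu / bgeu
    int sf = (int32_t)r < 0;                   // sign bit of the result
    int of = (((a ^ b) & (a ^ r)) >> 31) & 1;  // signed overflow of the subtract
    f.slt  = sf ^ of;                          // signed less-than: blt / bge
    return f;
}
// beq: zero   bne: !zero   bltu: carry   bgeu: !carry   blt: slt   bge: !slt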

You don't have to do subtraction to compare two (signed or unsigned) numbers.
You can use cascaded 7485 chips, for example.
With this chip you can do all the branch computations without doing any subtraction.
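As a C model of that approach (this mimics the MSB-first cascade of a magnitude comparator, not the 7485's actual logic equations):

#include <stdint.h>

// Scan from the MSB down; the first bit position where the operands
// differ decides the result. No adder or subtractor involved.
int cmp_unsigned(uint32_t a, uint32_t b) {  // returns -1, 0, or 1
    for (int i = 31; i >= 0; i--) {
        uint32_t ab = (a >> i) & 1, bb = (b >> i) & 1;
        if (ab != bb)
            return ab ? 1 : -1;
    }
    return 0;  // all bits equal
}
// Signed compare: same idea after flipping the sign bits,
// i.e. cmp_unsigned(a ^ 0x80000000u, b ^ 0x80000000u).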

Related

Invalid division, complains about the comma when compiling [duplicate]

In x86 assembly, most instructions have the following syntax:
operation dest, source
For example, add looks like
add rax, 10 ; adds 10 to the rax register
But mnemonics like mul and div only have a single operand - source - with the destination being hardcoded as rax. This forces you to set and keep track of the rax register anytime you want to multiply or divide, which can get cumbersome if you are doing a series of multiplications.
I'm assuming there is a technical reason pertaining to the hardware implementation for multiplication and division. Is there?
The destination register of mul and div is not rax, it's rdx:rax. This is because both instructions return two results; mul returns a double-width result and div returns both quotient and remainder (after taking a double-width input).
The x86 instruction encoding scheme only permits two operands to be encoded in legacy instructions (div and mul are old, dating back to the 8086). Having only one operand permits the bits dedicated to the other operand to be used as an extension to the opcode, making it possible to encode more instructions. As the div and mul instructions are not used too often and as one operand has to be hard-coded anyway, hard-coding the second one was seen as a good tradeoff.
This ties into the auxiliary instructions cbw, cwd, cwde, cdq, cdqe, and cqo used for preparing operands to the div instruction.
Note that later, additional forms of the imul instruction were introduced, permitting the use of two modr/m operands for the common cases of single-word (as opposed to double-word) multiplication (80386) and multiplication by a constant (80186). Even later, BMI2 introduced a VEX-encoded mulx instruction permitting three freely choosable operands, of which one source operand can be a memory operand.
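If it helps, here is a C sketch of what the one-operand forms compute, using GCC/Clang's unsigned __int128 as the double-width type (the function is illustrative only):

#include <stdint.h>

void mul_div_semantics(uint64_t rax_in, uint64_t src, uint64_t divisor) {
    // mul src: rdx:rax = rax * src (full 128-bit product)
    unsigned __int128 product = (unsigned __int128)rax_in * src;
    uint64_t rdx = (uint64_t)(product >> 64);
    uint64_t rax = (uint64_t)product;
    // div divisor: divides the 128-bit value in rdx:rax;
    // quotient -> rax, remainder -> rdx. (The real instruction
    // faults if the quotient doesn't fit in 64 bits.)
    unsigned __int128 dividend = ((unsigned __int128)rdx << 64) | rax;
    uint64_t quot = (uint64_t)(dividend / divisor);
    uint64_t rem  = (uint64_t)(dividend % divisor);
    (void)quot; (void)rem;
}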

Which sequence of instructions has better performance for zeroing one register or another?

I had an assignment from my professor and one part of it sparked a conversation on branchless programming. The goal was to convert this C code to MIPS assembly (assuming a and b were in registers $s0 and $s1, respectively) using the slt instruction.
if (a <= b)
a = 0;
else
b = 0;
The expected response was:
slt $t0,$s1,$s0 ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
bne $t0,$0,else ; Branch on not-equal (jump to `else` if $t0 != $0)
add $s0,$0,$0 ; $s0 = $0 + $0, aka $s0 = 0
j done ; Unconditional jump to `done`
else: sub $s1,$0,$0 ; $s1 = $0 - $0, aka `$s1 = 0`
done:
However, I submitted the following branchless solution (from what I understand, branchless programming is preferred in terms of performance):
slt $t0,$s1,$s0 ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
mul $s0,$s0,$t0 ; $s0 = $s0 * $t0 (truncated to 32 bits)
xori $t0,$t0,0x1 ; $t0 = XOR( $t0, 1 )
mul $s1,$s1,$t0 ; $s1 = $s1 * $t0 (truncated to 32 bits)
I understand this is a small case wherein any performance gain would be negligible, but I am wondering if the line of thinking I used was on the right track.
My professor pointed out that the mul instruction is complex (read: hardware intensive), thus (if implemented in hardware) any performance gains made by using fewer instructions would ultimately be lost due to that complexity. Is this true?
It's impossible to say, as you haven't specified a MIPS implementation, of which there have been many over the years.
However, multiplication can be more expensive, especially on earlier MIPS processors, where each multiply could cost two (or three) cycles.
An alternative along the lines of branchless code is to subtract one from the slt result and use and instead of mul, then invert the mask for the other operand. That will cost one extra instruction over the mul version, so it is unclear which will be faster, but it does involve "simpler" instructions.
slt $t0,$s1,$s0 ; Set on less-than (if $s1 < $s0 then set $t0 to 1, else 0).
addi $t0,$t0,-1 ; 1/true -> 0, 0/false -> -1
and $s1,$s1,$t0 ; mask is either 0 or -1: clears $s1 when a > b
nor $t0,$t0,$zero ; $t0 = ~$t0 (xori can't encode -1; MIPS logical immediates zero-extend)
and $s0,$s0,$t0 ; inverted mask: clears $s0 when a <= b
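The same masking trick as a C sketch (the pointer interface is just for illustration):

#include <stdint.h>

void select_zero(int32_t *a, int32_t *b) {
    int32_t t = (*b < *a);  // slt: 1 if a > b
    int32_t m = t - 1;      // 0 if a > b, else -1
    *b &= m;                // b = 0 when a > b
    *a &= ~m;               // a = 0 when a <= b
}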
Whether this is faster than the conditional logic depends on the processor and the data set, assuming this is in a loop (if it is not in a loop at some level, then the question is of lesser importance).
The data set will determine whether the branch prediction is effective or not.  Each time the branch prediction is wrong, it will cost a couple of cycles (of course depending on the processor implementation).  If the branch prediction is 100% effective, then the conditional logic would probably be best.  But if the data set defies branch prediction then several cycles of overhead per execution might be incurred, so the unconditional logic might be best.
It would be nice if processors had a few extra opcodes so that the subtraction of 1 was unnecessary, and likewise the inversion for the else part. In such a hypothetical case, we would have:
slt $t0, $s1, $s0
sameIfTrueOrZero $s0, $s0, $t0 # $s0 = ($t0 ? $s0 : 0)
sameIfFalseOrZero $s1, $s1, $t0 # $s1 = ($t0 ? 0 : $s1)
These would be simple instructions for hardware to implement; they would not tax the ALU, or the instruction and operand encodings.
Some more advanced processors may be able to fuse operations, or dual issue, on some of the above so that the execution cost might come down.
Interesting and clever use of mul by 0 or 1, but if you have MIPS32 extensions like mul (not classic MIPS mult/mflo) then you also have movz / movn conditional moves. In fact those were introduced with MIPS IV (first commercial release in 1994, R8000), earlier than 1999's MIPS 4K (MIPS32 ISA) and MIPS 5K (MIPS64 ISA).
movn and movz are very much like x86 cmov, AArch64 csel, and equivalent instructions on other ISAs. Since MIPS has a zero register, you can condition-move from it to conditionally zero something based on another register being zero (movz) or non-zero (movn).
That allows GCC to compile your logic into 3 instructions, when I put it in front of a tailcall that uses the values in their incoming registers.
int bar(int,int);
int foo(int a, int b){
    if (a <= b)
        a = 0;
    else
        b = 0;
    return bar(a, b);
}
Godbolt MIPS GCC 11.2 -O3 -march=mips4 -fno-delayed-branch
foo:
    slt $2,$5,$4
    movn $5,$0,$2
    movz $4,$0,$2
    j bar
    nop
This is pretty much unbeatable.
Note that next to your proposed version, the branchy version's dynamic instruction count is at worst 4 (no path of execution runs all 5 of its instructions), and at best 3 (if the bne is taken). Your version avoids branching, though, which is good, and it does save static code size.
(If you assemble for a real MIPS with an assembler that tries but fails to fill branch-delay slots here, the branchy version might get an extra NOP after one or both branches, widening the advantage of branchless.)
On a fast wide machine (like 4-wide MIPS R8000 or R10000, although those are MIPS IV CPUs with movn/z but not mul) it's certainly interesting to consider mul if you don't do this often; the front-end can decode these instructions and then get more into the pipeline while it chews on these. Modern x86 CPUs have fully pipelined integer multiply units with 3 cycle latency, 1 cycle throughput (so they can start a new multiply every clock cycle). But much older MIPS CPUs almost certainly didn't, likely not even fully pipelined, and probably higher latency. (This is why until MIPS32/64 they only had mult, which wrote the result to special hi/lo registers, to decouple that logic from the regular integer pipeline.)
I don't know where the tradeoff might be in terms of how predictable the branch would have to be (by the older simpler predictors in old MIPS CPUs) before that would be better than mul, for some realistic historical mul latency/throughput numbers. MIPS pipelines are shorter than modern CPUs; it helps some that the ISA is literally designed from the ground up to pipeline easily. But the branch delay slot only really helps scalar CPUs.
With a slow enough mul (or mult / mflo), even worst-case branch prediction might be better.
See also Modern Microprocessors: A 90-Minute Guide!
I didn't look at details of MIPS III CPUs to see if any had longer pipelines / superscalar or out-of-order exec which might favour less branchy code even without movn / movz
But as Erik showed in his answer, it's possible to just use bitwise operations even if you don't have MIPS IV movn/movz, so this consideration of mul only goes from purely hypothetical on MIPS IV to basically hypothetical on MIPS III CPUs.
(Or I guess if you have code that needs to be able to run on old MIPS CPUs, but is being run on a newer MIPS.)

intrinsic _mm512_round_ps is missing for AVX512

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?
What would be a good workaround?
apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?
Thanks!
TL:DR: AVX512F
__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);
related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.
The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.
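A short usage sketch (assumes AVX512F and a compiler flag like -mavx512f; the second call puts M=1 in the imm8's high four bits to round to the nearest 0.5, per the layout described above):

#include <immintrin.h>

__m512 round_examples(__m512 v) {
    __m512 nearest = _mm512_roundscale_ps(v,
        _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);             // 0 fraction bits
    __m512 halves  = _mm512_roundscale_ps(v,
        (1 << 4) | _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);  // M = 1
    return _mm512_add_ps(nearest, halves);  // just to use both results
}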
You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).

MASM Assembly 8086 - Implementing division without using the DIV instruction, for any number

I'm trying to implement a division operation without using the DIV instruction, for performance reasons. I saw this link:
http://what-when-how.com/microcontrollers/multiplication-and-division-microcontrollers/
and in the code of the division operation (which is not 8086 assembly) there are parts I don't understand.
I tried to write the code with the corresponding 8086 ASM instructions, hoping that I am not wrong.
In that example we have B -> Dividend (BX) and A -> Divisor (AX)
DIVIDE: PSHA (PUSH AX) ;Save the divisor
CLRA (XOR BX,BX) ;Expand dividend, fill with zeros
LDX #8 ;Initialize counter (I don't understand this part; maybe it initializes a counter = 8 because we have 8 bits here)
LOOP: ASLD ;Shift dividend and quotient left (?? SHL BX,1 ?)
CMPA 0,SP ;Check if subtraction will leave a positive result (?? subtraction between what?)
and I stop here, because if I don't understand this part it is useless to go ahead.
Can you help me? Is there another division algorithm better than the DIV instruction?
What you're looking for is called the "Russian Peasant Algorithm" and can be implemented using shifts and addition. However, it's not going to buy you any performance on a machine with a dedicated divide instruction. I implemented it on an 8051 years ago, but that didn't have multiply / divide instructions, so it made sense. You're not going to outperform microcoded multiplication and division in regular assembly code. They've already spent the time to optimize the div instruction quite heavily.
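Still, for reference, here is a C sketch of the shift-and-subtract (restoring) division that the linked pseudocode implements (16-bit widths are an arbitrary choice; assumes divisor != 0):

#include <stdint.h>

uint16_t divide16(uint16_t dividend, uint16_t divisor, uint16_t *remainder) {
    uint16_t quot = 0, rem = 0;
    for (int i = 15; i >= 0; i--) {
        rem = (uint16_t)((rem << 1) | ((dividend >> i) & 1));  // shift in the next bit
        quot <<= 1;
        if (rem >= divisor) {  // "check if subtraction will leave a positive result"
            rem -= divisor;
            quot |= 1;
        }
    }
    *remainder = rem;
    return quot;
}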

Pseudorandom generator in Assembly Language

I need a pseudorandom number generator algorithm for an assembler program assigned in a course, and I would prefer a simple algorithm. However, I cannot use an external library.
What is a good, simple pseudorandom number generator algorithm for assembly?
An easy one is to choose two big relatively-prime numbers a and b, then keep multiplying your random number by a and adding b. Use the modulo operation to keep the low bits as your random number, and keep the full value for the next iteration.
This algorithm is known as the linear congruential generator.
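As a C sketch (the constants here are common published choices, not the only valid ones; 32-bit unsigned wraparound supplies the modulus):

#include <stdint.h>

static uint32_t state = 1;  // any seed

uint32_t lcg_next(void) {
    state = state * 1664525u + 1013904223u;  // multiplier from Knuth's list (see below)
    return state >> 16;                      // prefer the high bits (see the note further down)
}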
Volume 2 of The Art of Computer Programming has a lot of information about pseudorandom number generation. The algorithms are demonstrated in assembler, so you can see for yourself which are simplest in assembler.
If you can link to an external library or object file, though, that would be your best bet. Then you could link to, e.g., Mersenne Twister.
Note that most pseudorandom number generators are not safe for cryptography, so if you need secure random number generation, you need to look beyond the basic algorithms (and probably should tap into OS-specific crypto APIs).
Simple code for testing; don't use for crypto.
From Testing Computer Software, page 138
With 32-bit maths, you don't need the MOD 2^32 operation.
RNG = (69069*RNG + 69069) MOD 2^32
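In C that's one line, since unsigned 32-bit math wraps for free (the seed value is arbitrary):

#include <stdint.h>

static uint32_t RNG = 1;

uint32_t rng_next(void) {
    return RNG = 69069u * RNG + 69069u;  // wraps mod 2^32 automatically
}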
Well - since I haven't seen a reference to the good old Linear Feedback Shift Register, I'll post some SSE-intrinsic-based C code. Just for completeness. I wrote that thing a couple of months ago to sharpen my SSE skills again.
#include <emmintrin.h>

static __m128i LFSR;

void InitRandom (int Seed)
{
    LFSR = _mm_cvtsi32_si128 (Seed);
}

int GetRandom (int NumBits)
{
    __m128i seed = LFSR;
    __m128i one  = _mm_cvtsi32_si128(1);
    __m128i mask;
    int i;
    for (i=0; i<NumBits; i++)
    {
        // generate xor of adjacent bits
        __m128i temp = _mm_xor_si128(seed, _mm_srli_epi64(seed,1));
        // generate xor of feedback bits 5,6 and 62,61
        __m128i NewBit = _mm_xor_si128( _mm_srli_epi64(temp,5),
                                        _mm_srli_epi64(temp,61));
        // Mask out single bit:
        NewBit = _mm_and_si128 (NewBit, one);
        // Shift & insert new result bit:
        seed = _mm_or_si128 (NewBit, _mm_add_epi64 (seed,seed));
    }
    // Write back seed...
    LFSR = seed;
    // generate mask of NumBits ones.
    mask = _mm_srli_epi64 (_mm_cmpeq_epi8(seed, seed), 64-NumBits);
    // return random number:
    return _mm_cvtsi128_si32 (_mm_and_si128(seed,mask));
}
Translating this code to assembler is trivial. Just replace the intrinsics with the real SSE instructions and add a loop around it.
Btw - the sequence this code generates repeats after 4.61169E+18 numbers. That's a lot more than you'll get via the prime method and 32-bit arithmetic. If unrolled, it's faster as well.
#jjrv
What you're describing is actually a linear congruential generator. The most random bits are the highest bits. To get a number from 0..N-1, you multiply the full value by N (32 bits by 32 bits giving 64 bits) and use the high 32 bits.
You shouldn't just use any number for a (the multiplier for progressing from one full value to the next); the numbers recommended in Knuth (Table 1, section 3.3.4, TAOCP vol. 2, 1981) are 1812433253, 1566083941, 69069 and 1664525.
You can just pick any odd number for b (the addend).
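That high-bits range reduction is a one-liner in C (sketch):

#include <stdint.h>

uint32_t scale_to_range(uint32_t full_value, uint32_t n) {
    return (uint32_t)(((uint64_t)full_value * n) >> 32);  // 0 .. n-1
}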
Why not use an external library??? That wheel has been invented a few hundred times, so why do it again?
If you need to implement an RNG yourself, do you need to produce numbers on demand -- i.e. are you implementing a rand() function -- or do you need to produce streams of random numbers -- e.g. for memory testing?
Do you need an RNG that is crypto-strength? How long does it have to go before it repeats? Do you have to absolutely, positively guarantee uniform distribution of all bits?
Here's a simple hack I used several years ago. I was working in embedded, needed to test RAM on power-up, and wanted really small, fast code and very little state, so I did this:
Start with an arbitrary 4-byte constant for your seed.
Compute the 32-bit CRC of those 4 bytes. That gives you the next 4 bytes
Feed back those 4 bytes into the CRC32 algorithm, as if they had been appended. The CRC32 of those 8 bytes is the next value.
Repeat as long as you want.
This takes very little code (although you need a table for the CRC32 function) and has very little state, but the pseudorandom output stream has a very long cycle time before it repeats. Also, it doesn't require SSE on the processor. And assuming you have the CRC32 function handy, it's trivial to implement.
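A sketch in C of that feedback loop, using a bitwise CRC-32 (polynomial 0xEDB88320) in place of the table-driven function; init and final-XOR conventions are ignored since they don't matter for this use:

#include <stdint.h>

static uint32_t crc32_step(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    return crc;
}

uint32_t crc_rng_next(uint32_t *state) {  // *state starts as the 4-byte seed
    uint32_t prev = *state, c = prev;
    for (int i = 0; i < 4; i++)  // feed the current 4 bytes back in, as if appended
        c = crc32_step(c, (uint8_t)(prev >> (8 * i)));
    return *state = c;
}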
Using MASM 6.15 to compile:
delay_function macro
    mov cx,0ffffh
    .repeat
        push cx
        mov cx,0f00h
        .repeat
            dec cx
        .until cx==0
        pop cx
        dec cx
    .until cx==0
endm

random_num macro
    mov cx,64 ;assume we want to get 64 random numbers
    mov si,0
get_num:
    push cx
    delay_function ;since the cpu clock is fast, we use delay_function
    mov ah,2ch
    int 21h
    mov ax,dx ;get clock 1/100 sec
    div num ;assume we want to get a number from 0~num-1
    mov arry[si],ah ;save to the array you set
    inc si
    pop cx
    loop get_num ;here we finish getting the random numbers
endm
Also, you can probably emulate a shift register with an XOR sum of separate bits, which will give you a pseudo-random sequence of numbers.
Linear congruential (X = AX+C mod M) PRNGs might be a good one to assign for an assembler course, as your students will have to deal with carry bits for intermediate AX results over 2^31 and with computing a modulus. If you are the student, they are fairly straightforward to implement in assembler and may be what the lecturer had in mind.
