How to optimise this 8-bit positional popcount using assembly? - go

This post is related to Golang assembly implementation of _mm_add_epi32, which adds paired elements of two [8]int32 lists and returns the updated first one.
According to the pprof profile, I found that passing [8]int32 by value is expensive, so I figured passing a pointer to the array would be much cheaper, and the benchmark result verified this. Here's the Go version:
func __mm_add_epi32_inplace_purego(x, y *[8]int32) {
	(*x)[0] += (*y)[0]
	(*x)[1] += (*y)[1]
	(*x)[2] += (*y)[2]
	(*x)[3] += (*y)[3]
	(*x)[4] += (*y)[4]
	(*x)[5] += (*y)[5]
	(*x)[6] += (*y)[6]
	(*x)[7] += (*y)[7]
}
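For reference, a minimal benchmark of this function looks something like the following sketch (the benchmark name is illustrative, placed in a _test.go file):
import "testing"

func BenchmarkAddInplace(b *testing.B) {
	var x, y [8]int32
	for i := 0; i < b.N; i++ {
		__mm_add_epi32_inplace_purego(&x, &y)
	}
}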
This function is called inside two levels of loops.
The algorithm computes a positional population count over an array of bytes.
Thanks to advice from @fuz, I know that writing the whole algorithm in assembly is the best choice and makes sense, but it's beyond my ability since I have never learned assembly programming.
However, it should be easy to optimize the inner loop with assembly:
counts := make([][8]int32, numRowBytes)
for i, b := range byteSlice {
	if b == 0 { // more than half of the elements in byteSlice are 0
		continue
	}
	expand := _expand_byte[b]
	__mm_add_epi32_inplace_purego(&counts[i], expand)
}
// expands a byte into its bits
var _expand_byte = [256]*[8]int32{
	&[8]int32{0, 0, 0, 0, 0, 0, 0, 0},
	&[8]int32{0, 0, 0, 0, 0, 0, 0, 1},
	&[8]int32{0, 0, 0, 0, 0, 0, 1, 0},
	&[8]int32{0, 0, 0, 0, 0, 0, 1, 1},
	&[8]int32{0, 0, 0, 0, 0, 1, 0, 0},
	...
}
Can you help write an assembly version of __mm_add_epi32_inplace_purego (this is enough for me), or even the whole loop? Thank you in advance.

The operation you want to perform is called a positional population count on bytes. This is a well-known operation used in machine learning and some research has been done on fast algorithms to solve this problem.
Unfortunately, the implementation of these algorithms is fairly involved. For this reason, I have developed a custom algorithm that is much simpler to implement but only yields roughly half the performance of the other methods. However, at a measured 10 GB/s, it should still be a decent improvement over what you had previously.
The idea of this algorithm is to gather corresponding bits from groups of 32 bytes using vpmovmskb and then to take a scalar population count which is then added to the corresponding counter. This allows the dependency chains to be short and a consistent IPC of 3 to be reached.
Note that compared to your algorithm, my code flips the order of bits around. You can change this by editing which counts array elements the assembly code accesses if you want. However, in the interest of future readers, I'd like to leave this code with the more common convention where the least significant bit is considered bit 0.
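To make the idea concrete, here is a rough scalar model of the same scheme in pure Go (an illustrative sketch, not the vectorised code; it gathers 8 bytes per step instead of 32 and ignores any tail bytes):
import "math/bits"

// For each group of 8 bytes, gather bit k of every byte into one mask
// word, popcount it, and add the result to counts[k] (bit 0 = LSB).
func pospopcntModel(counts *[8]int32, buf []byte) {
	for i := 0; i+8 <= len(buf); i += 8 {
		for k := 0; k < 8; k++ {
			var mask uint
			for j := 0; j < 8; j++ {
				mask |= uint(buf[i+j]>>k&1) << j
			}
			counts[k] += int32(bits.OnesCount(mask))
		}
	}
}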
Source code
The complete source code can be found on GitHub. I have meanwhile developed this algorithm idea into a portable library that can be used like this:
import "github.com/clausecker/pospop"
var counts [8]int
pospop.Count8(&counts, buf) // add positional popcounts for buf to counts
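For example, a complete toy program (a sketch; the expected output assumes the bit-0-is-LSB convention described above):
package main

import (
	"fmt"

	"github.com/clausecker/pospop"
)

func main() {
	var counts [8]int
	buf := []byte{3, 1} // 0b00000011, 0b00000001
	pospop.Count8(&counts, buf)
	fmt.Println(counts) // [2 1 0 0 0 0 0 0]
}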
The algorithm is provided in two variants and has been tested on a machine with a processor identified as “Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz.”
Positional Population Count 32 Bytes at a Time
The counters are kept in general purpose registers for best performance. Memory is prefetched well in advance for better streaming behaviour. The scalar tail is processed using a very simple SHRL/ADCL combination. A performance of up to 11 GB/s is achieved.
#include "textflag.h"
// func PospopcntReg(counts *[8]int32, buf []byte)
TEXT ·PospopcntReg(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $32, CX // pre-subtract 32 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 32 bytes from buf
PREFETCHT0 384(SI) // prefetch some data
ADDQ $32, SI // advance SI past them
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R15 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R14 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R13 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R12 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R11 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R10 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R9 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R8 // add to counter
SUBQ $32, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $32, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // CF=LSB, shift byte to the right
ADCL $0, R8 // add CF to R8
SHRL $1, AX
ADCL $0, R9 // add CF to R9
SHRL $1, AX
ADCL $0, R10 // add CF to R10
SHRL $1, AX
ADCL $0, R11 // add CF to R11
SHRL $1, AX
ADCL $0, R12 // add CF to R12
SHRL $1, AX
ADCL $0, R13 // add CF to R13
SHRL $1, AX
ADCL $0, R14 // add CF to R14
SHRL $1, AX
ADCL $0, R15 // add CF to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
Positional Population Count 96 Bytes at a Time with CSA
This variant performs all of the optimisations above but reduces 96 bytes to 64 using a single CSA step beforehand. As expected, this improves the performance by roughly 30% and achieves up to 16 GB/s.
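If you haven't seen the carry-save adder (CSA) trick before, here is a minimal scalar model in Go (an illustrative sketch only): three input words are reduced to a sum word and a carry word, and every set bit in the carry word stands for two set input bits, which is why the assembly below counts the carry popcount twice with LEAL (DX)(AX*2), AX.
// Reduce three words to two: per bit position,
// popcount(a)+popcount(b)+popcount(c) == popcount(sum) + 2*popcount(carry).
func csa(a, b, c uint32) (sum, carry uint32) {
	partial := a ^ b       // first half adder: partial sum
	carryAB := a & b       // first half adder: carry out
	sum = partial ^ c      // second half adder: final sum bit
	carrySC := partial & c // second half adder: carry out
	carry = carryAB | carrySC
	return sum, carry
}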
#include "textflag.h"
// func PospopcntRegCSA(counts *[8]int32, buf []byte)
TEXT ·PospopcntRegCSA(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $96, CX // pre-subtract 96 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 96 bytes from buf into Y0--Y2
VMOVDQU 32(SI), Y1
VMOVDQU 64(SI), Y2
ADDQ $96, SI // advance SI past them
PREFETCHT0 320(SI)
PREFETCHT0 384(SI)
VPXOR Y0, Y1, Y3 // first adder: sum
VPAND Y0, Y1, Y0 // first adder: carry out
VPAND Y2, Y3, Y1 // second adder: carry out
VPXOR Y2, Y3, Y2 // second adder: sum (full sum)
VPOR Y0, Y1, Y0 // full adder: carry out
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R15
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R14
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R13
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R12
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R11
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R10
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R9
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R8
SUBQ $96, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $96, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // is bit 0 set?
ADCL $0, R8 // add it to R8
SHRL $1, AX // is bit 0 set?
ADCL $0, R9 // add it to R9
SHRL $1, AX // is bit 0 set?
ADCL $0, R10 // add it to R10
SHRL $1, AX // is bit 0 set?
ADCL $0, R11 // add it to R11
SHRL $1, AX // is bit 0 set?
ADCL $0, R12 // add it to R12
SHRL $1, AX // is bit 0 set?
ADCL $0, R13 // add it to R13
SHRL $1, AX // is bit 0 set?
ADCL $0, R14 // add it to R14
SHRL $1, AX // is bit 0 set?
ADCL $0, R15 // add it to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
Benchmarks
Here are benchmarks for the two algorithms and a naïve reference implementation in pure Go. Full benchmarks can be found in the GitHub repository.
BenchmarkReference/10-12 12448764 80.9 ns/op 123.67 MB/s
BenchmarkReference/32-12 4357808 258 ns/op 124.25 MB/s
BenchmarkReference/1000-12 151173 7889 ns/op 126.76 MB/s
BenchmarkReference/2000-12 68959 15774 ns/op 126.79 MB/s
BenchmarkReference/4000-12 36481 31619 ns/op 126.51 MB/s
BenchmarkReference/10000-12 14804 78917 ns/op 126.72 MB/s
BenchmarkReference/100000-12 1540 789450 ns/op 126.67 MB/s
BenchmarkReference/10000000-12 14 77782267 ns/op 128.56 MB/s
BenchmarkReference/1000000000-12 1 7781360044 ns/op 128.51 MB/s
BenchmarkReg/10-12 49255107 24.5 ns/op 407.42 MB/s
BenchmarkReg/32-12 186935192 6.40 ns/op 4998.53 MB/s
BenchmarkReg/1000-12 8778610 115 ns/op 8677.33 MB/s
BenchmarkReg/2000-12 5358495 208 ns/op 9635.30 MB/s
BenchmarkReg/4000-12 3385945 357 ns/op 11200.23 MB/s
BenchmarkReg/10000-12 1298670 901 ns/op 11099.24 MB/s
BenchmarkReg/100000-12 115629 8662 ns/op 11544.98 MB/s
BenchmarkReg/10000000-12 1270 916817 ns/op 10907.30 MB/s
BenchmarkReg/1000000000-12 12 93609392 ns/op 10682.69 MB/s
BenchmarkRegCSA/10-12 48337226 23.9 ns/op 417.92 MB/s
BenchmarkRegCSA/32-12 12843939 80.2 ns/op 398.86 MB/s
BenchmarkRegCSA/1000-12 7175629 150 ns/op 6655.70 MB/s
BenchmarkRegCSA/2000-12 3988408 295 ns/op 6776.20 MB/s
BenchmarkRegCSA/4000-12 3016693 382 ns/op 10467.41 MB/s
BenchmarkRegCSA/10000-12 1810195 642 ns/op 15575.65 MB/s
BenchmarkRegCSA/100000-12 191974 6229 ns/op 16053.40 MB/s
BenchmarkRegCSA/10000000-12 1622 698856 ns/op 14309.10 MB/s
BenchmarkRegCSA/1000000000-12 16 68540642 ns/op 14589.88 MB/s

Related

Faster bitwise AND operation on byte slices

I'd like to perform a bitwise AND on every column of a byte matrix, which is stored in a [][]byte in Go. I created a repo with runnable test code.
It can be simplified to a bitwise AND operation on two byte slices of equal length. The simplest way is using a for loop to handle every pair of bytes:
func and(x, y []byte) []byte {
	z := make([]byte, len(x))
	for i := 0; i < len(x); i++ {
		z[i] = x[i] & y[i]
	}
	return z
}
However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark result below; an unrolled sketch follows after it):
BenchmarkLoop-16 14467 84265 ns/op
BenchmarkUnrollLoop-16 17668 67550 ns/op
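The unrolled version looks roughly like this (a sketch; the exact unroll factor in the repo may differ):
func andUnroll(x, y []byte) []byte {
	z := make([]byte, len(x))
	n := len(x) / 8 * 8
	for i := 0; i < n; i += 8 { // 8 elements per iteration
		z[i] = x[i] & y[i]
		z[i+1] = x[i+1] & y[i+1]
		z[i+2] = x[i+2] & y[i+2]
		z[i+3] = x[i+3] & y[i+3]
		z[i+4] = x[i+4] & y[i+4]
		z[i+5] = x[i+5] & y[i+5]
		z[i+6] = x[i+6] & y[i+6]
		z[i+7] = x[i+7] & y[i+7]
	}
	for i := n; i < len(x); i++ { // scalar tail
		z[i] = x[i] & y[i]
	}
	return z
}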
Any faster way? Go assembly?
Thank you in advance.
I wrote a Go assembly implementation using AVX2 instructions after two days of learning (Go) assembly.
The performance is good, about 10x the simple loop version, though optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome.
Note: the code and benchmark results have been updated.
I thank @PeterCordes for many valuable suggestions.
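One of the compatibility concerns: the assembly requires AVX, so callers should fall back to pure Go on older CPUs. A wrapper sketch using golang.org/x/sys/cpu (the exported name And is illustrative):
import "golang.org/x/sys/cpu"

// And computes x &= y element-wise, using the assembly when possible.
func And(x, y []byte) {
	if cpu.X86.HasAVX {
		AND(x, y) // assembly implementation below
		return
	}
	for i := range x { // portable fallback
		x[i] &= y[i]
	}
}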
#include "textflag.h"
// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
// pointer of x
MOVQ x_base+0(FP), AX
// length of x
MOVQ x_len+8(FP), CX
// pointer of y
MOVQ y_base+24(FP), DX
// --------------------------------------------
// end address of x, will not change: p + n
MOVQ AX, BX
ADDQ CX, BX
// end address for loop
// n <= 8, jump to tail
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// n < 32, jump to loop16
CMPQ CX, $0x00000020
JL loop16_start
// --------------------------------------------
// end address for loop32
MOVQ BX, CX
SUBQ $0x0000001f, CX
loop32:
// compute x & y, and save value to x
VMOVDQU (AX), Y0
VANDPS (DX), Y0, Y0
VMOVDQU Y0, (AX)
// move pointer
ADDQ $0x00000020, AX
ADDQ $0x00000020, DX
CMPQ AX, CX
JL loop32
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// n < 16, jump to loop8
CMPQ CX, $0x00000010
JL loop8_start
// --------------------------------------------
loop16_start:
// end address for loop16
MOVQ BX, CX
SUBQ $0x0000000f, CX
loop16:
// compute x & y, and save value to x
VMOVDQU (AX), X0
VANDPS (DX), X0, X0
VMOVDQU X0, (AX)
// move pointer
ADDQ $0x00000010, AX
ADDQ $0x00000010, DX
CMPQ AX, CX
JL loop16
// n <= 8, jump to tail
MOVQ BX, CX
SUBQ AX, CX
CMPQ CX, $0x00000008
JLE tail
// --------------------------------------------
loop8_start:
// end address for loop8
MOVQ BX, CX
SUBQ $0x00000007, CX
loop8:
// compute x & y, and save value to x
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
// move pointer
ADDQ $0x00000008, AX
ADDQ $0x00000008, DX
CMPQ AX, CX
JL loop8
// --------------------------------------------
tail:
// remaining elements (<=8): note this always performs one full
// 8-byte load and store, so when fewer than 8 bytes remain it reads
// and writes past the end of the slices; this is only safe if the
// backing arrays have at least 8 accessible bytes at this position
MOVQ (AX), BX
ANDQ (DX), BX
MOVQ BX, (AX)
RET
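To call the assembly from Go, a matching declaration is needed in the same package, something like this sketch (file names are assumptions; //go:noescape tells the compiler the assembly does not retain the slice pointers):
// and.go, alongside the assembly in and_amd64.s

//go:noescape
func AND(x []byte, y []byte)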
Benchmark result:
test data-size time
------------------- --------- -----------
BenchmarkGrailbio 8.00_B 4.654 ns/op
BenchmarkGoAsm 8.00_B 4.824 ns/op
BenchmarkUnrollLoop 8.00_B 6.851 ns/op
BenchmarkLoop 8.00_B 8.683 ns/op
BenchmarkGrailbio 16.00_B 5.363 ns/op
BenchmarkGoAsm 16.00_B 6.369 ns/op
BenchmarkUnrollLoop 16.00_B 10.47 ns/op
BenchmarkLoop 16.00_B 13.48 ns/op
BenchmarkGoAsm 32.00_B 6.079 ns/op
BenchmarkGrailbio 32.00_B 6.497 ns/op
BenchmarkUnrollLoop 32.00_B 17.46 ns/op
BenchmarkLoop 32.00_B 21.09 ns/op
BenchmarkGoAsm 128.00_B 10.52 ns/op
BenchmarkGrailbio 128.00_B 14.40 ns/op
BenchmarkUnrollLoop 128.00_B 56.97 ns/op
BenchmarkLoop 128.00_B 80.12 ns/op
BenchmarkGoAsm 256.00_B 15.48 ns/op
BenchmarkGrailbio 256.00_B 23.76 ns/op
BenchmarkUnrollLoop 256.00_B 110.8 ns/op
BenchmarkLoop 256.00_B 147.5 ns/op
BenchmarkGoAsm 1.00_KB 47.16 ns/op
BenchmarkGrailbio 1.00_KB 87.75 ns/op
BenchmarkUnrollLoop 1.00_KB 443.1 ns/op
BenchmarkLoop 1.00_KB 540.5 ns/op
BenchmarkGoAsm 16.00_KB 751.6 ns/op
BenchmarkGrailbio 16.00_KB 1342 ns/op
BenchmarkUnrollLoop 16.00_KB 7007 ns/op
BenchmarkLoop 16.00_KB 8623 ns/op

Never triggered if statements make code execution in benchmark faster? Why?

I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. In doing so, I came across a funny fact I can't explain. Imagine the following code:
// Package prime provides ...
package prime

// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
	if n <= 0 {
		return 0, false
	}
	if n == 1 {
		return 2, true
	}
	currentNumber := 1
	primeCounter := 1
	for n > primeCounter {
		currentNumber += 2
		if isPrime(currentNumber) {
			primeCounter++
		}
	}
	return currentNumber, primeCounter == n
}
// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
	// useless because never triggered, but makes it faster??
	if n < 2 {
		println("n < 2")
		return false
	}
	// useless because never triggered, but makes it faster??
	if n%2 == 0 {
		println("n%2")
		return n == 2
	}
	for i := 3; i*i <= n; i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}
In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 100 18114825 ns/op 0 B/op 0 allocs/op
If I remove the never-triggered if-statements, the benchmark gets slower:
Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8 50 21880749 ns/op 0 B/op 0 allocs/op
I have run the benchmark multiple times, changing the code back and forth, and always get more or less the same numbers. I can't think of a reason why these two if-statements should make the execution faster. Yes, it is micro-optimization, but I want to know: why?
Here is the whole exercise from exercism with test-cases: nth-prime
The Go version I am using is 1.12.1 linux/amd64, on Manjaro Linux with i3.
What happens is that the compiler is given some guarantees about the input when those ifs are added.
If those assertions are removed, the compiler has to add the checks itself, and it does so by validating on each iteration. We can take a look at the assembly code to prove it (by passing -gcflags=-S to the go test command).
With the if's:
0x004b 00075 (func.go:16) JMP 81
0x004d 00077 (func.go:16) LEAQ 2(BX), AX
0x0051 00081 (func.go:16) MOVQ AX, DX
0x0054 00084 (func.go:16) IMULQ AX, AX
0x0058 00088 (func.go:16) CMPQ AX, CX
0x005b 00091 (func.go:16) JGT 133
0x005d 00093 (func.go:17) TESTQ DX, DX
0x0060 00096 (func.go:17) JEQ 257
0x0066 00102 (func.go:17) MOVQ CX, AX
0x0069 00105 (func.go:17) MOVQ DX, BX
0x006c 00108 (func.go:17) CQO
0x006e 00110 (func.go:17) IDIVQ BX
0x0071 00113 (func.go:17) TESTQ DX, DX
0x0074 00116 (func.go:17) JNE 77
Without the if's:
0x0016 00022 (func.go:16) JMP 28
0x0018 00024 (func.go:16) LEAQ 2(BX), AX
0x001c 00028 (func.go:16) MOVQ AX, DX
0x001f 00031 (func.go:16) IMULQ AX, AX
0x0023 00035 (func.go:16) CMPQ AX, CX
0x0026 00038 (func.go:16) JGT 88
0x0028 00040 (func.go:17) TESTQ DX, DX
0x002b 00043 (func.go:17) JEQ 102
0x002d 00045 (func.go:17) MOVQ CX, AX
0x0030 00048 (func.go:17) MOVQ DX, BX
0x0033 00051 (func.go:17) CMPQ BX, $-1
0x0037 00055 (func.go:17) JEQ 64
0x0039 00057 (func.go:17) CQO
0x003b 00059 (func.go:17) IDIVQ BX
0x003e 00062 (func.go:17) JMP 69
0x0040 00064 (func.go:17) NEGQ AX
0x0043 00067 (func.go:17) XORL DX, DX
0x0045 00069 (func.go:17) TESTQ DX, DX
0x0048 00072 (func.go:17) JNE 24
Line 51 in the assembly code, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit. It is the guard the compiler inserts for a divisor of -1: IDIVQ faults when the most negative integer is divided by -1, so the NEGQ AX / XORL DX, DX fixup at offsets 64-67 handles that case manually.
Line 16, for i := 3; i*i <= n; i += 2, in the original Go code is translated the same in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions, and as a result, more work for the CPU in total.
Something similar was done in the encoding/base64 package, ensuring the loop won't receive a nil value. You can take a look here:
https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go
This check was added intentionally. In your case, you optimized it accidentally :)
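The general pattern looks like this toy sketch (not the actual base64 code; whether the compiler really drops the per-iteration guard depends on the Go version):
// Asserting properties of d before the loop lets the compiler prove
// the divisor is never 0 or -1, so the loop body can omit the
// corresponding runtime checks and fixup paths.
func sumMod(xs []int, d int) int {
	if d == 0 || d == -1 {
		return 0
	}
	s := 0
	for _, x := range xs {
		s += x % d
	}
	return s
}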

I want to make a bubble sort in assembly, and I don't understand why it's not working

I have a function "swapByRef" that works in this code (the code below just checks that the function swaps the values).
MODEL small
STACK 10h
DATA SEGMENT
a dw 12h
b dw 0A9h
DATA ENDS
CODE SEGMENT
ASSUME CS:CODE, DS:DATA
start:
mov ax, DATA
mov ds, ax
push offset a ;push the address of variable a onto the stack
push offset b ;push the address of variable b onto the stack
call swapByRef
exit:
mov ax, 4c00h
int 21h
swapByRef proc
mov bp, sp
mov bx, [bp + 2]
mov ax, [bx]
mov si, [bp + 4]
mov cx, [si]
mov [bx], cx
mov [si], ax
ret 4
swapByRef endP
CODE ENDS
END start
But in my code (the bubble sort code) the procedure doesn't swap the values in the array and the array does not get sorted.
MODEL small
STACK 100h
DATA SEGMENT
ARR dw 9,5,7,3,8
len dw 5
DATA ENDS
CODE SEGMENT
ASSUME CS:CODE, DS:DATA
start:
mov ax, DATA
mov ds, ax
mov bx, offset ARR
sorting:
mov ax, len
dec ax
cmp bx, ax
je redo
mov ax, ARR[bx]
mov dx, ARR[bx+2]
cmp al, ah
jg swap
jmp continue
swap:
push offset [ARR + bx]
push offset [ARR + bx + 2]
call swapByRef
continue:
inc bx
jmp sorting
redo:
cmp len, 0
je exit
mov ax, len
dec ax
mov len, ax
xor bl, bl
jmp sorting
exit:
mov ax, 4c00h
int 21h
swapByRef proc
mov bp, sp
mov bx, [bp + 2]
mov ax, [bx]
mov si, [bp + 4]
mov cx, [si]
mov [bx], cx
mov [si], ax
ret 4
swapByRef endP
CODE ENDS
END start
I've tried to debug it and still couldn't find the problem in my code...
Any help would be awesome, thanks.
mov bx, offset ARR
...
mov ax, ARR[bx]
mov dx, ARR[bx+2]
You're adding the offset to the array twice! You need to initialize BX=0.
mov ax, ARR[bx]
mov dx, ARR[bx+2]
cmp al, ah
jg swap
jmp continue
swap:
You've read the elements into AX and DX, so also compare AX and DX; cmp al, ah only compares the low and high bytes of the first element.
You can also write it shorter like this:
mov ax, ARR[bx]
cmp ax, ARR[bx+2]
jng continue
swap:
Given that the array contains words and that BX is an offset within the array, you need to change BX in steps of 2. Write add bx, 2 instead of inc bx.
This also means that it's best to set len dw 10 and modify it in steps of 2 using sub word ptr len, 2.
swapByRef proc
mov bp, sp
mov bx, [bp + 2]
mov ax, [bx]
mov si, [bp + 4]
mov cx, [si]
mov [bx], cx
mov [si], ax
ret 4
Your swapByRef proc destroys a lot of registers. Especially losing BX is problematic!
This is a general solution that doesn't clobber registers. Optimize as needed:
swapByRef proc
push bp
mov bp, sp
push ax
push bx
push cx
push si
mov bx, [bp + 4]
mov ax, [bx]
mov si, [bp + 6]
mov cx, [si]
mov [bx], cx
mov [si], ax
pop si
pop cx
pop bx
pop ax
pop bp
ret 4

Sorting a list of ten numbers with selection sort in assembly language

How do I convert this bubble sort method into a selection sort method?
[org 0x0100]
jmp start
data: dw 60, 55, 45, 50, 40, 35, 25, 30, 10, 0
swap: db 0
start: mov bx, 0 ; initialize array index to zero
mov byte [swap], 0 ; reset swap flag to no swaps
loop1: mov ax, [data+bx] ; load number in ax
cmp ax, [data+bx+2] ; compare with next number
jbe noswap ; no swap if already in order
mov dx, [data+bx+2] ; load second element in dx
mov [data+bx+2], ax ; store first number in second
mov [data+bx], dx ; store second number in first
mov byte [swap], 1 ; flag that a swap has been done
noswap: add bx, 2 ; advance bx to next index
cmp bx, 18 ; are we at last index
jne loop1 ; if not compare next two
cmp byte [swap], 1 ; check if a swap has been done
je start ; if yes make another pass
mov ax, 0x4c00 ; terminate program
int 0x21
The key here is to change your loop. Currently it swaps adjacent numbers. You need to change it to copy the rightmost element into a register, then shift the pre-existing sorted part of the array to the right until the shifted element is greater than or equal to the element you copied into the register. A Go sketch of that scheme follows.
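Expressed in Go for clarity (an illustrative sketch of the scheme described above, one pass over the sorted prefix):
// insertLast inserts the last element into the sorted prefix
// arr[:len(arr)-1] by shifting larger elements one place right.
func insertLast(arr []int) {
	v := arr[len(arr)-1]
	i := len(arr) - 1
	for i > 0 && arr[i-1] > v {
		arr[i] = arr[i-1] // shift right
		i--
	}
	arr[i] = v
}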
Maybe this will be helpful. I wrote this a long time ago. Real-mode Intel assembler.
MAIN.ASM
SSTACK SEGMENT PARA STACK 'STACK'
DW 128 DUP(?)
SSTACK ENDS
DSEG SEGMENT PUBLIC 'DATA'
S LABEL BYTE
ARR DB 'IHGFED27182392JASKD1O12312345CBA'
LEN EQU ($-S)
PUBLIC TMP
PUBLIC MIN
TMP DW ?
MIN DW ?
DSEG ENDS
CSEG SEGMENT 'CODE'
ASSUME CS:CSEG, SS:SSTACK, DS:DSEG
EXTRN OUTPUT:NEAR
EXTRN SORT:NEAR
START: MOV AX, DSEG
MOV DS, AX
MOV BX, OFFSET ARR
MOV CX, LEN
CALL OUTPUT
MOV AX, 60
CMP AX, 0
JZ NO_SORT
CMP AX, 1
JZ NO_SORT
MOV BX, OFFSET ARR
MOV CX, LEN
CALL SORT
NO_SORT: MOV BX, OFFSET ARR
MOV CX, LEN
CALL OUTPUT
MOV AH, 4CH
MOV AL, 0
INT 21H
CSEG ENDS
END START
SORT.ASM
DSEG SEGMENT PUBLIC 'DATA'
EXTRN TMP:WORD
EXTRN MIN:WORD
DSEG ENDS
CSEG SEGMENT 'CODE'
ASSUME CS:CSEG, DS:DSEG
PUBLIC SORT
SORT PROC; (AX - N, BX - ARRAY ADDRESS, CX - ARRAY LENGTH)
PUSH SI
PUSH DI
PUSH DX
CALL COMPARE_MIN
DEC CX
MOV AX, CX
XOR SI, SI
XOR DI, DI
L1:
PUSH CX
MOV MIN, SI
MOV TMP, DI
INC DI
MOV CX, AX
L2:
MOV DH, BYTE PTR[BX+DI]
PUSH SI
MOV SI, MIN
MOV DL, BYTE PTR[BX+SI]
POP SI
CMP DH, DL
JA OLD_MIN
NEW_MIN: MOV MIN, DI
OLD_MIN: INC DI
DEC CX
CMP CX, TMP
JNZ L2
SWAP: PUSH DI
MOV DI, MIN
MOV DL, BYTE PTR[BX+DI]
MOV DH, BYTE PTR[BX+SI]
MOV BYTE PTR [BX+SI], DL
MOV BYTE PTR [BX+DI], DH
POP DI
INC SI
MOV DI, SI
POP CX
LOOP L1
POP DX
POP DI
POP SI
RET
SORT ENDP
COMPARE_MIN PROC; (AX - A, CX - B, CX - MIN)
PUSH AX
CMP AX, CX
JB B__A
JA A__B
A__B: MOV CX, CX
JMP EX
B__A: MOV CX, AX
JMP EX
EX: POP AX
RET
COMPARE_MIN ENDP
CSEG ENDS
END
OUTPUT.ASM
CSEG SEGMENT 'CODE'
ASSUME CS:CSEG
PUBLIC OUTPUT
OUTPUT PROC ; (BX - ARRAY ADDRESS, CX - ARRAY LENGTH)
PUSH DX
PUSH SI
PUSH AX
XOR SI, SI
MOV AH, 02H
OUTARR: MOV DL,[BX+SI]
INT 21H
INC SI
LOOP OUTARR
MOV DL, 10
INT 21H
POP AX
POP SI
POP DX
RET
OUTPUT ENDP
CSEG ENDS
END

Jumping to random code when using IDIV

I am relatively new to assembler, but while writing code that works with arrays and calculates the average of each row, I encountered a problem which suggests I don't know how division really works. This is my code:
.model tiny
.code
.startup
Org 100h
Jmp Short Start
N Equ 2 ;columns
M Equ 3 ;rows
Matrix DW 2, 2, 3 ; elements
DW 4, 6, 6 ; elements
Vector DW M Dup (?)
S Equ Type Matrix
Start:
Mov Cx, M;20
Lea Di, Vector
Xor Si, Si
Cols: Push Cx
Mov Cx, N
Xor Bx, Bx
Xor Ax, Ax
Rows:
Add Ax, Matrix[Bx][Si]
Next:
Add Bx, S*M
Loop Rows
Add Si, S
Mov [Di], Ax
Add Di, S
Pop Cx
Loop Cols
Xor Bx, Bx
Mov Cx, M
Mov DX, 2
Print: Mov Ax, Vector[Bx]
IDiv Dx; div/idiv error here
Add Bx, S
Loop Print
.exit 0
There are no errors when compiling, and the row sums are computed correctly, but when the division happens the debugger shows the program jumping to apparently random code. Why is this happening and how can I resolve it?
On x86, IDiv with a 16-bit operand divides the 32-bit value Dx:Ax by that operand and raises an exception (interrupt) if the quotient is too large to fit in 16 bits. In your code, Dx holds the divisor 2, so it also serves as the high half of the dividend: Dx:Ax is a huge value, the quotient overflows 16 bits, and the resulting divide-error interrupt is what makes the debugger appear to jump to random code.
Try something like this, keeping the divisor out of Dx:
Mov Di, 2 ; divisor in Di instead of Dx
Print: Mov Ax, Vector[Bx]
Cwd ; sign extend Ax to Dx:Ax
IDiv Di ; Dx:Ax / Di -> quotient in Ax, remainder in Dx
