How to optimise this 8-bit positional popcount using assembly? - go
This post is related to "Golang assembly implement of _mm_add_epi32", which adds the paired elements of two [8]int32 arrays and returns the updated first one.
According to a pprof profile, I found that passing [8]int32 by value is expensive, so I think passing a pointer to the array is much cheaper, and the benchmark result verified this. Here's the Go version:
func __mm_add_epi32_inplace_purego(x, y *[8]int32) {
(*x)[0] += (*y)[0]
(*x)[1] += (*y)[1]
(*x)[2] += (*y)[2]
(*x)[3] += (*y)[3]
(*x)[4] += (*y)[4]
(*x)[5] += (*y)[5]
(*x)[6] += (*y)[6]
(*x)[7] += (*y)[7]
}
This function is called inside two nested loops.
The algorithm computes a positional population count over an array of bytes.
Thanks to advice from @fuz, I know that writing the whole algorithm in assembly is the best choice and makes sense, but it's beyond my ability since I have never learned assembly programming.
However, it should be easy to optimize the inner loop with assembly:
counts := make([][8]int32, numRowBytes)
for i, b = range byteSlice {
if b == 0 { // more than half of the elements in byteSlice are 0.
continue
}
expand = _expand_byte[b]
__mm_add_epi32_inplace_purego(&counts[i], expand)
}
// expands a byte into its bits
var _expand_byte = [256]*[8]int32{
&[8]int32{0, 0, 0, 0, 0, 0, 0, 0},
&[8]int32{0, 0, 0, 0, 0, 0, 0, 1},
&[8]int32{0, 0, 0, 0, 0, 0, 1, 0},
&[8]int32{0, 0, 0, 0, 0, 0, 1, 1},
&[8]int32{0, 0, 0, 0, 0, 1, 0, 0},
...
}
Can you help write an assembly version of __mm_add_epi32_inplace_purego (this would be enough for me), or even of the whole loop? Thank you in advance.
The operation you want to perform is called a positional population count on bytes. This is a well-known operation used in machine learning and some research has been done on fast algorithms to solve this problem.
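In pure Go, a naïve reference implementation of the operation looks like this (shown only to pin down the semantics; the helper name is made up, and bit 0 is the least significant bit here):

func pospopcntNaive(counts *[8]int32, buf []byte) {
    for _, b := range buf {
        for i := uint(0); i < 8; i++ {
            counts[i] += int32(b >> i & 1) // add 1 to counts[i] if bit i of b is set
        }
    }
}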
Unfortunately, the implementation of these fast algorithms is fairly involved. For this reason, I have developed a custom algorithm that is much simpler to implement but only yields roughly half the performance of the other methods. However, at a measured 10 GB/s, it should still be a decent improvement over what you had previously.
The idea of this algorithm is to gather corresponding bits from groups of 32 bytes using vpmovmskb and then to take a scalar population count which is then added to the corresponding counter. This allows the dependency chains to be short and a consistent IPC of 3 to be reached.
Note that compared to your algorithm, my code flips the order of bits around. You can change this by editing which counts array elements the assembly code accesses if you want. However, in the interest of future readers, I'd like to leave this code with the more common convention where the least significant bit is considered bit 0.
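To make the idea concrete, here is a rough pure-Go model of one 32-byte step of the vector loop (for illustration only; the helper name count32 is made up, and the real work is done by VPMOVMSKB, VPADDD and POPCNT in the assembly below):

import "math/bits"

func count32(counts *[8]int32, block *[32]byte) {
    b := *block
    for bit := 7; bit >= 0; bit-- { // the MSB of every byte is gathered first
        var mask uint32
        for i := range b {
            v := b[i]
            mask |= uint32(v>>7) << uint(i) // gather the MSB of every byte (VPMOVMSKB)
            b[i] = v << 1                   // shift each byte left by one place (VPADDD below)
        }
        counts[bit] += int32(bits.OnesCount32(mask)) // POPCNT + ADDL
    }
}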
Source code
The complete source code can be found on GitHub. The author has since developed this algorithm idea into a portable library that can be used like this:
import "github.com/clausecker/pospop"
var counts [8]int
pospop.Count8(&counts, buf) // add positional popcounts for buf to counts
The algorithm is provided in two variants and has been tested on a machine with a processor identified as “Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz.”
Positional Population Count 32 Bytes at a Time
The counters are kept in general purpose registers for best performance. Memory is prefetched well in advance for better streaming behaviour. The scalar tail is processed using a very simple SHRL/ADCL combination. A performance of up to 11 GB/s is achieved.
#include "textflag.h"
// func PospopcntReg(counts *[8]int32, buf []byte)
TEXT ·PospopcntReg(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $32, CX // pre-subtract 32 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 32 bytes from buf
PREFETCHT0 384(SI) // prefetch some data
ADDQ $32, SI // advance SI past them
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R15 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R14 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R13 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R12 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R11 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R10 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R9 // add to counter
VPADDD Y0, Y0, Y0 // shift Y0 left by one place
VPMOVMSKB Y0, AX // move MSB of Y0 bytes to AX
POPCNTL AX, AX // count population of AX
ADDL AX, R8 // add to counter
SUBQ $32, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $32, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // CF=LSB, shift byte to the right
ADCL $0, R8 // add CF to R8
SHRL $1, AX
ADCL $0, R9 // add CF to R9
SHRL $1, AX
ADCL $0, R10 // add CF to R10
SHRL $1, AX
ADCL $0, R11 // add CF to R11
SHRL $1, AX
ADCL $0, R12 // add CF to R12
SHRL $1, AX
ADCL $0, R13 // add CF to R13
SHRL $1, AX
ADCL $0, R14 // add CF to R14
SHRL $1, AX
ADCL $0, R15 // add CF to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
Positional Population Count 96 Bytes at a Time with CSA
This variant performs all of the optimisations above but reduces 96 bytes to 64 using a single CSA step beforehand. As expected, this improves the performance by roughly 30% and achieves up to 16 GB/s.
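The carry-save adder (CSA) step itself is just bitwise logic. A minimal sketch on plain integers (illustration only; the assembly applies the same logic to 32-byte vectors):

// csa compresses three inputs into a bitwise sum and a bitwise carry,
// so that at every bit position a + b + c == sum + 2*carry.
func csa(a, b, c uint64) (sum, carry uint64) {
    s := a ^ b
    sum = s ^ c               // bitwise sum of the three inputs
    carry = (a & b) | (c & s) // bitwise carry out, worth 2 at each position
    return sum, carry
}

Each positional count then receives popcount(sum mask) + 2*popcount(carry mask), which is exactly what the LEAL (DX)(AX*2), AX lines in the listing below compute.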
#include "textflag.h"
// func PospopcntRegCSA(counts *[8]int32, buf []byte)
TEXT ·PospopcntRegCSA(SB),NOSPLIT,$0-32
MOVQ counts+0(FP), DI
MOVQ buf_base+8(FP), SI // SI = &buf[0]
MOVQ buf_len+16(FP), CX // CX = len(buf)
// load counts into register R8--R15
MOVL 4*0(DI), R8
MOVL 4*1(DI), R9
MOVL 4*2(DI), R10
MOVL 4*3(DI), R11
MOVL 4*4(DI), R12
MOVL 4*5(DI), R13
MOVL 4*6(DI), R14
MOVL 4*7(DI), R15
SUBQ $96, CX // pre-subtract 96 bytes from CX
JL scalar
vector: VMOVDQU (SI), Y0 // load 96 bytes from buf into Y0--Y2
VMOVDQU 32(SI), Y1
VMOVDQU 64(SI), Y2
ADDQ $96, SI // advance SI past them
PREFETCHT0 320(SI)
PREFETCHT0 384(SI)
VPXOR Y0, Y1, Y3 // first adder: sum
VPAND Y0, Y1, Y0 // first adder: carry out
VPAND Y2, Y3, Y1 // second adder: carry out
VPXOR Y2, Y3, Y2 // second adder: sum (full sum)
VPOR Y0, Y1, Y0 // full adder: carry out
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R15
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R14
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R13
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R12
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R11
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R10
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
VPADDB Y0, Y0, Y0 // shift carry out bytes left
VPADDB Y2, Y2, Y2 // shift sum bytes left
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R9
VPMOVMSKB Y0, AX // MSB of carry out bytes
VPMOVMSKB Y2, DX // MSB of sum bytes
POPCNTL AX, AX // carry bytes population count
POPCNTL DX, DX // sum bytes population count
LEAL (DX)(AX*2), AX // sum popcount plus 2x carry popcount
ADDL AX, R8
SUBQ $96, CX
JGE vector // repeat as long as bytes are left
scalar: ADDQ $96, CX // undo last subtraction
JE done // if CX=0, there's nothing left
loop: MOVBLZX (SI), AX // load a byte from buf
INCQ SI // advance past it
SHRL $1, AX // is bit 0 set?
ADCL $0, R8 // add it to R8
SHRL $1, AX // is bit 0 set?
ADCL $0, R9 // add it to R9
SHRL $1, AX // is bit 0 set?
ADCL $0, R10 // add it to R10
SHRL $1, AX // is bit 0 set?
ADCL $0, R11 // add it to R11
SHRL $1, AX // is bit 0 set?
ADCL $0, R12 // add it to R12
SHRL $1, AX // is bit 0 set?
ADCL $0, R13 // add it to R13
SHRL $1, AX // is bit 0 set?
ADCL $0, R14 // add it to R14
SHRL $1, AX // is bit 0 set?
ADCL $0, R15 // add it to R15
DECQ CX // mark this byte as done
JNE loop // and proceed if any bytes are left
// write R8--R15 back to counts
done: MOVL R8, 4*0(DI)
MOVL R9, 4*1(DI)
MOVL R10, 4*2(DI)
MOVL R11, 4*3(DI)
MOVL R12, 4*4(DI)
MOVL R13, 4*5(DI)
MOVL R14, 4*6(DI)
MOVL R15, 4*7(DI)
VZEROUPPER // restore SSE-compatibility
RET
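For completeness, the Go side of these routines is just a pair of bodyless declarations. A minimal sketch, assuming the assembly files above live in the same package (the package name pospopcnt is made up here):

package pospopcnt

// PospopcntReg adds the positional population counts of buf to *counts
// (body provided by the first assembly listing above).
//go:noescape
func PospopcntReg(counts *[8]int32, buf []byte)

// PospopcntRegCSA does the same using the CSA variant.
//go:noescape
func PospopcntRegCSA(counts *[8]int32, buf []byte)

They are then called as, for example, PospopcntRegCSA(&counts, buf) on a var counts [8]int32.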
Benchmarks
Here are benchmarks for the two algorithms and a naïve reference implementation in pure Go. Full benchmarks can be found in the GitHub repository.
BenchmarkReference/10-12 12448764 80.9 ns/op 123.67 MB/s
BenchmarkReference/32-12 4357808 258 ns/op 124.25 MB/s
BenchmarkReference/1000-12 151173 7889 ns/op 126.76 MB/s
BenchmarkReference/2000-12 68959 15774 ns/op 126.79 MB/s
BenchmarkReference/4000-12 36481 31619 ns/op 126.51 MB/s
BenchmarkReference/10000-12 14804 78917 ns/op 126.72 MB/s
BenchmarkReference/100000-12 1540 789450 ns/op 126.67 MB/s
BenchmarkReference/10000000-12 14 77782267 ns/op 128.56 MB/s
BenchmarkReference/1000000000-12 1 7781360044 ns/op 128.51 MB/s
BenchmarkReg/10-12 49255107 24.5 ns/op 407.42 MB/s
BenchmarkReg/32-12 186935192 6.40 ns/op 4998.53 MB/s
BenchmarkReg/1000-12 8778610 115 ns/op 8677.33 MB/s
BenchmarkReg/2000-12 5358495 208 ns/op 9635.30 MB/s
BenchmarkReg/4000-12 3385945 357 ns/op 11200.23 MB/s
BenchmarkReg/10000-12 1298670 901 ns/op 11099.24 MB/s
BenchmarkReg/100000-12 115629 8662 ns/op 11544.98 MB/s
BenchmarkReg/10000000-12 1270 916817 ns/op 10907.30 MB/s
BenchmarkReg/1000000000-12 12 93609392 ns/op 10682.69 MB/s
BenchmarkRegCSA/10-12 48337226 23.9 ns/op 417.92 MB/s
BenchmarkRegCSA/32-12 12843939 80.2 ns/op 398.86 MB/s
BenchmarkRegCSA/1000-12 7175629 150 ns/op 6655.70 MB/s
BenchmarkRegCSA/2000-12 3988408 295 ns/op 6776.20 MB/s
BenchmarkRegCSA/4000-12 3016693 382 ns/op 10467.41 MB/s
BenchmarkRegCSA/10000-12 1810195 642 ns/op 15575.65 MB/s
BenchmarkRegCSA/100000-12 191974 6229 ns/op 16053.40 MB/s
BenchmarkRegCSA/10000000-12 1622 698856 ns/op 14309.10 MB/s
BenchmarkRegCSA/1000000000-12 16 68540642 ns/op 14589.88 MB/s
Related
Faster bitwise AND operation on byte slices
I'd like to perform a bitwise AND on every column of a byte matrix, which is stored as a [][]byte in Go. I created a repo with runnable test code. The problem can be simplified to a bitwise AND of two byte slices of equal length. The simplest way is to use a for loop to handle every pair of bytes:

func and(x, y []byte) []byte {
    z := make([]byte, len(x))
    for i := 0; i < len(x); i++ {
        z[i] = x[i] & y[i]
    }
    return z
}

However, it's very slow for long slices. A faster way is to unroll the for loop (check the benchmark result):

BenchmarkLoop-16         14467    84265 ns/op
BenchmarkUnrollLoop-16   17668    67550 ns/op

Any faster way? Go assembly? Thank you in advance.
I wrote a Go assembly implementation using AVX2 instructions after two days of learning (Go) assembly. The performance is good, about 10x the simple loop version, although optimizations for compatibility and performance are still needed. Suggestions and PRs are welcome. Note: the code and benchmark results have been updated. I appreciate @PeterCordes for many valuable suggestions.

#include "textflag.h"

// func AND(x []byte, y []byte)
// Requires: AVX
TEXT ·AND(SB), NOSPLIT|NOPTR, $0-48
    // pointer of x
    MOVQ x_base+0(FP), AX
    // length of x
    MOVQ x_len+8(FP), CX
    // pointer of y
    MOVQ y_base+24(FP), DX

    // --------------------------------------------
    // end address of x, will not change: p + n
    MOVQ AX, BX
    ADDQ CX, BX // end address for loop

    // n <= 8, jump to tail
    CMPQ CX, $0x00000008
    JLE  tail

    // n < 16, jump to loop8
    CMPQ CX, $0x00000010
    JL   loop8_start

    // n < 32, jump to loop16
    CMPQ CX, $0x00000020
    JL   loop16_start

    // --------------------------------------------
    // end address for loop32
    MOVQ BX, CX
    SUBQ $0x0000001f, CX

loop32:
    // compute x & y, and save value to x
    VMOVDQU (AX), Y0
    VANDPS  (DX), Y0, Y0
    VMOVDQU Y0, (AX)

    // move pointer
    ADDQ $0x00000020, AX
    ADDQ $0x00000020, DX
    CMPQ AX, CX
    JL   loop32

    // n <= 8, jump to tail
    MOVQ BX, CX
    SUBQ AX, CX
    CMPQ CX, $0x00000008
    JLE  tail

    // n < 16, jump to loop8
    CMPQ CX, $0x00000010
    JL   loop8_start

    // --------------------------------------------
loop16_start:
    // end address for loop16
    MOVQ BX, CX
    SUBQ $0x0000000f, CX

loop16:
    // compute x & y, and save value to x
    VMOVDQU (AX), X0
    VANDPS  (DX), X0, X0
    VMOVDQU X0, (AX)

    // move pointer
    ADDQ $0x00000010, AX
    ADDQ $0x00000010, DX
    CMPQ AX, CX
    JL   loop16

    // n <= 8, jump to tail
    MOVQ BX, CX
    SUBQ AX, CX
    CMPQ CX, $0x00000008
    JLE  tail

    // --------------------------------------------
loop8_start:
    // end address for loop8
    MOVQ BX, CX
    SUBQ $0x00000007, CX

loop8:
    // compute x & y, and save value to x
    MOVQ (AX), BX
    ANDQ (DX), BX
    MOVQ BX, (AX)

    // move pointer
    ADDQ $0x00000008, AX
    ADDQ $0x00000008, DX
    CMPQ AX, CX
    JL   loop8

    // --------------------------------------------
tail:
    // left elements (<=8)
    MOVQ (AX), BX
    ANDQ (DX), BX
    MOVQ BX, (AX)
    RET

Benchmark result:

test                  data-size   time
-------------------   ---------   -----------
BenchmarkGrailbio     8.00_B      4.654 ns/op
BenchmarkGoAsm        8.00_B      4.824 ns/op
BenchmarkUnrollLoop   8.00_B      6.851 ns/op
BenchmarkLoop         8.00_B      8.683 ns/op
BenchmarkGrailbio     16.00_B     5.363 ns/op
BenchmarkGoAsm        16.00_B     6.369 ns/op
BenchmarkUnrollLoop   16.00_B     10.47 ns/op
BenchmarkLoop         16.00_B     13.48 ns/op
BenchmarkGoAsm        32.00_B     6.079 ns/op
BenchmarkGrailbio     32.00_B     6.497 ns/op
BenchmarkUnrollLoop   32.00_B     17.46 ns/op
BenchmarkLoop         32.00_B     21.09 ns/op
BenchmarkGoAsm        128.00_B    10.52 ns/op
BenchmarkGrailbio     128.00_B    14.40 ns/op
BenchmarkUnrollLoop   128.00_B    56.97 ns/op
BenchmarkLoop         128.00_B    80.12 ns/op
BenchmarkGoAsm        256.00_B    15.48 ns/op
BenchmarkGrailbio     256.00_B    23.76 ns/op
BenchmarkUnrollLoop   256.00_B    110.8 ns/op
BenchmarkLoop         256.00_B    147.5 ns/op
BenchmarkGoAsm        1.00_KB     47.16 ns/op
BenchmarkGrailbio     1.00_KB     87.75 ns/op
BenchmarkUnrollLoop   1.00_KB     443.1 ns/op
BenchmarkLoop         1.00_KB     540.5 ns/op
BenchmarkGoAsm        16.00_KB    751.6 ns/op
BenchmarkGrailbio     16.00_KB    1342 ns/op
BenchmarkUnrollLoop   16.00_KB    7007 ns/op
BenchmarkLoop         16.00_KB    8623 ns/op
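As usual for Go assembly, the .s file needs a matching bodyless Go declaration. A minimal sketch (the package name is made up; an amd64 build constraint on both files is assumed):

package and

// AND computes x[i] &= y[i] in place; the body is implemented in the
// assembly listing above (requires AVX).
//go:noescape
func AND(x []byte, y []byte)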
Never triggered if statements make code execution in benchmark faster? Why?
I have recently started the Go track on exercism.io and had fun optimizing the "nth-prime" calculation. Actually, I came across a funny fact I can't explain. Imagine the following code:

// Package prime provides ...
package prime

// Nth function checks for the prime number on position n
func Nth(n int) (int, bool) {
    if n <= 0 {
        return 0, false
    }
    if n == 1 {
        return 2, true
    }
    currentNumber := 1
    primeCounter := 1
    for n > primeCounter {
        currentNumber += 2
        if isPrime(currentNumber) {
            primeCounter++
        }
    }
    return currentNumber, primeCounter == n
}

// isPrime function checks if a number
// is a prime number
func isPrime(n int) bool {
    // useless because never triggered, but makes it faster??
    if n < 2 {
        println("n < 2")
        return false
    }
    // useless because never triggered, but makes it faster??
    if n%2 == 0 {
        println("n%2")
        return n == 2
    }
    for i := 3; i*i <= n; i += 2 {
        if n%i == 0 {
            return false
        }
    }
    return true
}

In the private function isPrime I have two initial if-statements that are never triggered, because I only pass in odd numbers greater than 2. The benchmark returns the following:

Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8   100   18114825 ns/op   0 B/op   0

If I remove the never-triggered if-statements, the benchmark gets slower:

Running tool: /usr/bin/go test -benchmem -run=^$ -bench ^(BenchmarkNth)$
BenchmarkNth-8   50   21880749 ns/op   0 B/op   0

I have run the benchmark multiple times, changing the code back and forth, and always get more or less the same numbers, and I can't think of a reason why these two if-statements should make the execution faster. Yes, it is micro-optimization, but I want to know: why? Here is the whole exercise from exercism with test cases: nth-prime. The Go version I am using is 1.12.1 linux/amd64 on a Manjaro i3 Linux.
What happens is that the compiler is given some assertions about the input when those if's are added. If those assertions are lifted, the compiler has to add them itself; the way it does that is by validating the divisor on each iteration. We can take a look at the assembly code to prove it (by passing -gcflags=-S to the go test command).

With the if's:

0x004b 00075 (func.go:16)   JMP     81
0x004d 00077 (func.go:16)   LEAQ    2(BX), AX
0x0051 00081 (func.go:16)   MOVQ    AX, DX
0x0054 00084 (func.go:16)   IMULQ   AX, AX
0x0058 00088 (func.go:16)   CMPQ    AX, CX
0x005b 00091 (func.go:16)   JGT     133
0x005d 00093 (func.go:17)   TESTQ   DX, DX
0x0060 00096 (func.go:17)   JEQ     257
0x0066 00102 (func.go:17)   MOVQ    CX, AX
0x0069 00105 (func.go:17)   MOVQ    DX, BX
0x006c 00108 (func.go:17)   CQO
0x006e 00110 (func.go:17)   IDIVQ   BX
0x0071 00113 (func.go:17)   TESTQ   DX, DX
0x0074 00116 (func.go:17)   JNE     77

Without the if's:

0x0016 00022 (func.go:16)   JMP     28
0x0018 00024 (func.go:16)   LEAQ    2(BX), AX
0x001c 00028 (func.go:16)   MOVQ    AX, DX
0x001f 00031 (func.go:16)   IMULQ   AX, AX
0x0023 00035 (func.go:16)   CMPQ    AX, CX
0x0026 00038 (func.go:16)   JGT     88
0x0028 00040 (func.go:17)   TESTQ   DX, DX
0x002b 00043 (func.go:17)   JEQ     102
0x002d 00045 (func.go:17)   MOVQ    CX, AX
0x0030 00048 (func.go:17)   MOVQ    DX, BX
0x0033 00051 (func.go:17)   CMPQ    BX, $-1
0x0037 00055 (func.go:17)   JEQ     64
0x0039 00057 (func.go:17)   CQO
0x003b 00059 (func.go:17)   IDIVQ   BX
0x003e 00062 (func.go:17)   JMP     69
0x0040 00064 (func.go:17)   NEGQ    AX
0x0043 00067 (func.go:17)   XORL    DX, DX
0x0045 00069 (func.go:17)   TESTQ   DX, DX
0x0048 00072 (func.go:17)   JNE     24

Line 51 in the assembly code, 0x0033 00051 (func.go:17) CMPQ BX, $-1, is the culprit. Line 16 of the original Go code, for i := 3; i*i <= n; i += 2, is translated the same in both cases. But line 17, if n%i == 0, which runs on every iteration, compiles to more instructions and, as a result, more work for the CPU in total. Something similar was done in the encoding/base64 package by ensuring the loop won't receive a nil value; you can take a look here: https://go-review.googlesource.com/c/go/+/151158/3/src/encoding/base64/base64.go That check was added intentionally. In your case, you optimized it accidentally :)
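To see the kind of assertion being discussed, here is a minimal, hypothetical function (not from the question): once the compiler can prove the divisor is positive, it no longer needs to guard n % i against the divisor being 0 or -1 on every iteration.

func mod(n, i int) int {
    if i <= 0 {
        return 0 // from here on the compiler knows i > 0, so i != 0 and i != -1
    }
    return n % i // no zero/-1 fixup needed when i is provably positive
}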
I want to make a bubble sort in assembly, and I don't understand why it's not working
I have a function "swapByRef" that works in this code (this code just checks whether the function swaps the values):

MODEL small
STACK 10h

DATA SEGMENT
    a dw 12h
    b dw 0A9h
DATA ENDS

CODE SEGMENT
ASSUME CS:CODE, DS:DATA
start:
    mov ax, DATA
    mov ds, ax
    push offset a      ; push the address of variable a onto the stack
    push offset b      ; push the address of variable b onto the stack
    call swapByRef
exit:
    mov ax, 4c00h
    int 21h

swapByRef proc
    mov bp, sp
    mov bx, [bp + 2]
    mov ax, [bx]
    mov si, [bp + 4]
    mov cx, [si]
    mov [bx], cx
    mov [si], ax
    ret 4
swapByRef endP
CODE ENDS
END start

But in my code (the bubble sort code) the procedure doesn't swap the values in the array and the array does not get sorted.

MODEL small
STACK 100h

DATA SEGMENT
    ARR dw 9,5,7,3,8
    len dw 5
DATA ENDS

CODE SEGMENT
ASSUME CS:CODE, DS:DATA
start:
    mov ax, DATA
    mov ds, ax
    mov bx, offset ARR
sorting:
    mov ax, len
    dec ax
    cmp bx, ax
    je redo
    mov ax, ARR[bx]
    mov dx, ARR[bx + 2]
    cmp al, ah
    jg swap
    jmp continue
swap:
    push offset [ARR + bx]
    push offset [ARR + bx + 2]
    call swapByRef
continue:
    inc bx
    jmp sorting
redo:
    cmp len, 0
    je exit
    mov ax, len
    dec ax
    mov len, ax
    xor bl, bl
    jmp sorting
exit:
    mov ax, 4c00h
    int 21h

swapByRef proc
    mov bp, sp
    mov bx, [bp + 2]
    mov ax, [bx]
    mov si, [bp + 4]
    mov cx, [si]
    mov [bx], cx
    mov [si], ax
    ret 4
swapByRef endP
CODE ENDS
END start

I've tried to debug it and still couldn't find the problem in my code... Any help will be awesome, thanks.
mov bx, offset ARR
...
mov ax, ARR[bx]
mov dx, ARR[bx + 2]

You're adding the offset to the array twice! You need to initialize BX=0.

mov ax, ARR[bx]
mov dx, ARR[bx + 2]
cmp al, ah
jg swap
jmp continue
swap:

You've read the elements into AX and DX, so also compare AX and DX. You can write it shorter like this:

    mov ax, ARR[bx]
    cmp ax, ARR[bx+2]
    jng continue
swap:

Given that the array contains words and that BX is an offset within the array, you need to change BX in steps of 2. Write add bx, 2 instead of inc bx. This also means that it's best to set len dw 10 and modify it in steps of 2 using sub word ptr len, 2.

swapByRef proc
    mov bp, sp
    mov bx, [bp + 2]
    mov ax, [bx]
    mov si, [bp + 4]
    mov cx, [si]
    mov [bx], cx
    mov [si], ax
    ret 4

Your swapByRef proc destroys a lot of registers. Especially losing BX is problematic! This is a general solution that does not clobber registers; optimize as needed.

swapByRef proc
    push bp
    mov bp, sp
    push ax
    push bx
    push cx
    push si
    mov bx, [bp + 4]
    mov ax, [bx]
    mov si, [bp + 6]
    mov cx, [si]
    mov [bx], cx
    mov [si], ax
    pop si
    pop cx
    pop bx
    pop ax
    pop bp
    ret 4
Sorting a list of ten numbers with selection sort in assembly language
Sorting a list of ten numbers with selection sort in assembly language. How do I convert this bubble sort method into a selection sort method?

[org 0x0100]
    jmp start

data: dw 60, 55, 45, 50, 40, 35, 25, 30, 10, 0
swap: db 0

start:
    mov bx, 0              ; initialize array index to zero
    mov byte [swap], 0     ; reset swap flag to no swaps
loop1:
    mov ax, [data+bx]      ; load number in ax
    cmp ax, [data+bx+2]    ; compare with next number
    jbe noswap             ; no swap if already in order
    mov dx, [data+bx+2]    ; load second element in dx
    mov [data+bx+2], ax    ; store first number in second
    mov [data+bx], dx      ; store second number in first
    mov byte [swap], 1     ; flag that a swap has been done
noswap:
    add bx, 2              ; advance bx to next index
    cmp bx, 18             ; are we at last index
    jne loop1              ; if not compare next two
    cmp byte [swap], 1     ; check if a swap has been done
    je start               ; if yes make another pass
    mov ax, 0x4c00         ; terminate program
    int 0x21
The key here is to change your loop. Currently it's swapping adjacent numbers. You need to change it to copy the rightmost element into a register, and shift the pre-existing sorted array to the right until the element you just shifted is greater than or equal to the previously rightmost element you just copied into your register.
Maybe this will be helpful. I wrote this a long time ago. Real-mode Intel assembler.

MAIN.ASM

SSTACK SEGMENT PARA STACK 'STACK'
    DW 128 DUP(?)
SSTACK ENDS

DSEG SEGMENT PUBLIC 'DATA'
S LABEL BYTE
    ARR DB 'IHGFED27182392JASKD1O12312345CBA'
    LEN EQU ($-S)
    PUBLIC TMP
    PUBLIC MIN
    TMP DW ?
    MIN DW ?
DSEG ENDS

CSEG SEGMENT 'CODE'
    ASSUME CS:CSEG, SS:SSTACK, DS:DSEG
    EXTRN OUTPUT:NEAR
    EXTRN SORT:NEAR
START:
    MOV AX, DSEG
    MOV DS, AX
    MOV BX, OFFSET ARR
    MOV CX, LEN
    CALL OUTPUT
    MOV AX, 60
    CMP AX, 0
    JZ NO_SORT
    CMP AX, 1
    JZ NO_SORT
    MOV BX, OFFSET ARR
    MOV CX, LEN
    CALL SORT
NO_SORT:
    MOV BX, OFFSET ARR
    MOV CX, LEN
    CALL OUTPUT
    MOV AH, 4CH
    MOV AL, 0
    INT 21H
CSEG ENDS
END START

SORT.ASM

DSEG SEGMENT PUBLIC 'DATA'
    EXTRN TMP:WORD
    EXTRN MIN:WORD
DSEG ENDS

CSEG SEGMENT 'CODE'
    ASSUME CS:CSEG, DS:DSEG
    PUBLIC SORT
SORT PROC ; (AX - N, BX - ARRAY ADDRESS, CX - ARRAY LENGTH)
    PUSH SI
    PUSH DI
    PUSH DX
    CALL COMPARE_MIN
    DEC CX
    MOV AX, CX
    XOR SI, SI
    XOR DI, DI
L1:
    PUSH CX
    MOV MIN, SI
    MOV TMP, DI
    INC DI
    MOV CX, AX
L2:
    MOV DH, BYTE PTR[BX+DI]
    PUSH SI
    MOV SI, MIN
    MOV DL, BYTE PTR[BX+SI]
    POP SI
    CMP DH, DL
    JA OLD_MIN
NEW_MIN:
    MOV MIN, DI
OLD_MIN:
    INC DI
    DEC CX
    CMP CX, TMP
    JNZ L2
SWAP:
    PUSH DI
    MOV DI, MIN
    MOV DL, BYTE PTR[BX+DI]
    MOV DH, BYTE PTR[BX+SI]
    MOV BYTE PTR [BX+SI], DL
    MOV BYTE PTR [BX+DI], DH
    POP DI
    INC SI
    MOV DI, SI
    POP CX
    LOOP L1
    POP DX
    POP DI
    POP DI
    RET
SORT ENDP

COMPARE_MIN PROC ; (AX - A, CX - B, CX - MIN)
    PUSH AX
    CMP AX, CX
    JB B__A
    JA A__B
A__B:
    MOV CX, CX
    JMP EX
B__A:
    MOV CX, AX
    JMP EX
EX:
    POP AX
    RET
COMPARE_MIN ENDP
CSEG ENDS
END

OUTPUT.ASM

CSEG SEGMENT 'CODE'
    ASSUME CS:CSEG
    PUBLIC OUTPUT
OUTPUT PROC ; (BX - ARRAY ADDRESS, CX - ARRAY LENGTH)
    PUSH DX
    PUSH SI
    PUSH AX
    XOR SI, SI
    MOV AH, 02H
OUTARR:
    MOV DL, [BX+SI]
    INT 21H
    INC SI
    LOOP OUTARR
    MOV DL, 10
    INT 21H
    POP AX
    POP SI
    POP DX
    RET
OUTPUT ENDP
CSEG ENDS
END
Jumping to random code when using IDIV
I am relatively new to assembler, but when writing code that works with arrays and calculates the average of each row, I encountered a problem that suggests I don't know how division really works. This is my code:

.model tiny
.code
.startup
Org 100h
Jmp Short Start

N Equ 2   ; columns
M Equ 3   ; rows
Matrix DW 2, 2, 3   ; elements
       DW 4, 6, 6   ; elements
Vector DW M Dup (?)
S Equ Type Matrix

Start:
    Mov Cx, M ; 20
    Lea Di, Vector
    Xor Si, Si
Cols:
    Push Cx
    Mov Cx, N
    Xor Bx, Bx
    Xor Ax, Ax
Rows:
    Add Ax, Matrix[Bx][Si]
Next:
    Add Bx, S*M
    Loop Rows
    Add Si, S
    Mov [Di], Ax
    Add Di, S
    Pop Cx
    Loop Cols

    Xor Bx, Bx
    Mov Cx, M
    Mov Dx, 2
Print:
    Mov Ax, Vector[Bx]
    IDiv Dx   ; div/idiv error here
    Add Bx, S
    Loop Print
.exit 0

There are no errors when compiling. The elements are summed correctly, but when the division happens the debugger shows the program jumping to apparently random code. Why is this happening and how can I resolve it?
On the x86 architecture, IDiv with a 16-bit operand also takes Dx as the upper part of the integer to be divided, and it throws an exception (interrupt) if the quotient is too large to fit in 16 bits. Try something like this:

    Mov Di, 2
Print:
    Mov Ax, Vector[Bx]
    Cwd               ; sign-extend Ax into Dx:Ax
    IDiv Di