Is there a more efficient function for finding []byte similarities?

I am looking for an efficient way to find the prefix similarity between two byte slices. I am currently using this but am looking for a more efficient way if possible.
Thank you.
s1 -> [0 15 136 96 88 76 0 0 0 1]
s2 -> [0 15 136 96 246 1 255 255 255 255]
output -> [0 15 136 96]
func bytesSimilar(s1 []byte, s2 []byte) []byte {
    for !bytes.Equal(s1, s2) {
        s1 = s1[:len(s1)-1]
        s2 = s2[:len(s2)-1]
    }
    return s1
}
Benchmarking code:
func BenchmarkBytePrefix200(b *testing.B) {
    s1 := []byte{0, 15, 136, 96, 88, 76, 0, 0, 0, 1}
    s2 := []byte{0, 15, 136, 96, 246, 1, 255, 255, 255, 255}
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        bytePrefix(s1, s2)
    }
}
Results on an MBP:
BenchmarkBytePrefix200-8 48738078 29.5 ns/op 0 B/op 0 allocs/op

In my opinion, the expensive part of your code is this section, which repeatedly re-slices both slices and then re-compares them from scratch:
    s1 = s1[:len(s1)-1]
    s2 = s2[:len(s2)-1]
We can instead do a simple loop that exits early as soon as a differing byte is found. With this approach we need no extra allocations. It is a few more lines of code, but performs better.
The code is below:
func bytesSimilar2(s1 []byte, s2 []byte) []byte {
    l1 := len(s1)
    l2 := len(s2)
    least := l1
    if least > l2 {
        least = l2
    }
    count := 0
    for i := 0; i < least; i++ {
        if s1[i] == s2[i] {
            count++
            continue
        }
        break
    }
    if count == 0 {
        return []byte{}
    }
    return s1[:count]
}
func BenchmarkBytePrefix200v1(b *testing.B) {
    s1 := []byte{0, 15, 136, 96, 88, 76, 0, 0, 0, 1}
    s2 := []byte{0, 15, 136, 96, 246, 1, 255, 255, 255, 255}
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        bytesSimilar(s1, s2)
    }
}

func BenchmarkBytePrefix200v2(b *testing.B) {
    s1 := []byte{0, 15, 136, 96, 88, 76, 0, 0, 0, 1}
    s2 := []byte{0, 15, 136, 96, 246, 1, 255, 255, 255, 255}
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        bytesSimilar2(s1, s2)
    }
}
The comparison results are below: 38.7 ns/op vs 7.40 ns/op.
goos: darwin
goarch: amd64
pkg: git.kanosolution.net/kano/acl
BenchmarkBytePrefix200v1-8 27184414 38.7 ns/op 0 B/op 0 allocs/op
BenchmarkBytePrefix200v2-8 161031307 7.40 ns/op 0 B/op 0 allocs/op
PASS

In case bytePrefix is the same as bytesSimilar in your question, here is another variant:
func BytesSimilarNew(s1 []byte, s2 []byte) []byte {
    n := len(s1)
    if len(s2) < n {
        n = len(s2) // guard: don't index past the end of the shorter slice
    }
    for i := 0; i < n; i++ {
        if s1[i]^s2[i] > 0 {
            return s1[:i]
        }
    }
    return s1[:n] // the first n bytes all match
}
then the comparison:
BenchmarkBytePrefix200
BenchmarkBytePrefix200-8 28900861 36.5 ns/op 0 B/op 0 allocs/op
BenchmarkByteSimilarNew200
BenchmarkByteSimilarNew200-8 237646268 5.06 ns/op 0 B/op 0 allocs/op
PASS


optimize iteration over a list of sorted periods

Given a list of periods, already sorted and containing no duplicates:
periods := periods{
    period{min: 0, max: time.Millisecond},
    period{min: time.Millisecond, max: time.Millisecond * 1},
    period{min: time.Millisecond * 1, max: time.Millisecond * 2},
    period{min: time.Millisecond * 2, max: time.Millisecond * 7},
    period{min: time.Millisecond * 7, max: 0},
}
with the types periods and period defined as
type periods []period

func (ks periods) index(v time.Duration) period {
    for i := 0; i < len(ks); i++ {
        if ks[i].contains(v) {
            return ks[i]
        }
    }
    return period{}
}

type period struct {
    min time.Duration
    max time.Duration
}

func (k period) String() string {
    if k.max == 0 && k.max < k.min {
        return fmt.Sprintf("%v-", k.min)
    }
    return fmt.Sprintf("%v-%v", k.min, k.max)
}

func (k period) contains(t time.Duration) bool {
    if t <= 0 && k.min == 0 {
        return true
    }
    return t > k.min && (k.max == 0 || t <= k.max)
}
The full code is available at https://play.golang.org/p/cDmQ7Ho6hUI
Can you suggest solution(s) to improve the search implementation in the periods.index function?
Also, can you provide a factored-out solution so that the implementation can be reused?
A solution using generics is OK, since I can still specialize it using code generation.
A benchmark is included:
func BenchmarkIndex(b *testing.B) {
    periods := periods{
        period{min: 0, max: 8000},
        period{min: 8000, max: 16000},
        period{min: 16000, max: 24000},
        period{min: 24000, max: 32000},
        period{min: 32000, max: 40000},
        period{min: 40000, max: 48000},
        period{min: 48000, max: 56000},
        period{min: 56000, max: 64000},
        period{min: 64000, max: 72000},
        period{min: 72000, max: 80000},
        period{min: 80000, max: 0},
    }
    inputs := []time.Duration{
        time.Duration(0),
        time.Duration(72000 + 1),
        time.Duration(80000 + 1),
    }
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        for _, input := range inputs {
            _ = periods.index(input)
        }
    }
}
It's unlikely you can achieve better performance than a simple, sequential search for a list as small as yours (11 elements).
If your slice were much bigger (e.g. hundreds or even thousands of periods), then, since your periods are sorted as you claim, you could use binary search.
Binary search is implemented in sort.Search(). You basically just have to provide a lessOrContains() implementation for period. This is how it could look:
func (k period) lessOrContains(t time.Duration) bool {
    return k.max == 0 || t <= k.max
}
Now an index() function using binary search:
func (ks periods) indexBinary(v time.Duration) period {
    idx := sort.Search(len(ks), func(i int) bool {
        return ks[i].lessOrContains(v)
    })
    if idx < len(ks) && ks[idx].contains(v) {
        return ks[idx]
    }
    return period{}
}
Now on to benchmarking. Let's create a createPeriods() helper function that creates either a small or big periods slice:
func createPeriods(big bool) periods {
    ps := periods{
        period{min: 0, max: 8000},
        period{min: 8000, max: 16000},
        period{min: 16000, max: 24000},
        period{min: 24000, max: 32000},
        period{min: 32000, max: 40000},
        period{min: 40000, max: 48000},
        period{min: 48000, max: 56000},
        period{min: 56000, max: 64000},
        period{min: 64000, max: 72000},
        period{min: 72000, max: 80000},
        period{min: 80000, max: 0},
    }
    if big {
        psbig := periods{}
        for i := 0; i < 1000; i++ {
            psbig = append(psbig, period{time.Duration(i), time.Duration(i + 1)})
        }
        psbig = append(psbig, ps[1:]...)
        ps = psbig
    }
    return ps
}
Now let's write benchmark functions for all cases:
func BenchmarkIndexSmall(b *testing.B) {
    benchmarkIndexImpl(b, false)
}

func BenchmarkIndexBinarySmall(b *testing.B) {
    benchmarkIndexBinaryImpl(b, false)
}

func BenchmarkIndexBig(b *testing.B) {
    benchmarkIndexImpl(b, true)
}

func BenchmarkIndexBinaryBig(b *testing.B) {
    benchmarkIndexBinaryImpl(b, true)
}

func benchmarkIndexImpl(b *testing.B, big bool) {
    periods := createPeriods(big)
    inputs := []time.Duration{
        time.Duration(0),
        time.Duration(72000 + 1),
        time.Duration(80000 + 1),
    }
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        for _, input := range inputs {
            _ = periods.index(input)
        }
    }
}

func benchmarkIndexBinaryImpl(b *testing.B, big bool) {
    periods := createPeriods(big)
    inputs := []time.Duration{
        time.Duration(0),
        time.Duration(72000 + 1),
        time.Duration(80000 + 1),
    }
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        for _, input := range inputs {
            _ = periods.indexBinary(input)
        }
    }
}
And now let's see the benchmark results:
BenchmarkIndexSmall-8 44408948 25.50 ns/op 0 B/op 0 allocs/op
BenchmarkIndexBinarySmall-8 18441049 58.35 ns/op 0 B/op 0 allocs/op
BenchmarkIndexBig-8 562202 1908 ns/op 0 B/op 0 allocs/op
BenchmarkIndexBinaryBig-8 9234846 125.1 ns/op 0 B/op 0 allocs/op
As you can see, index() is faster than indexBinary() for small lists with 11 elements (25 ns vs 58 ns).
But when the list gets big (more than a thousand periods, 1010 in the above benchmark), then indexBinary() outperforms index() by more than an order of magnitude (125 ns vs 2000 ns), and this difference grows as the list gets bigger.
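The question also asked for a factored-out, reusable solution. With Go 1.18+ generics, the binary-search pattern above can be extracted into a helper; the names below (searchSorted, span) are illustrative sketches of mine, not part of the original code, and span specializes the period idea to plain ints for brevity:

```go
package main

import (
	"fmt"
	"sort"
)

// searchSorted is a reusable binary-search helper over any sorted slice:
// pred must be monotonic (false…false, true…true) across xs, and match
// verifies that the first pred-true element really is a hit.
func searchSorted[T any](xs []T, pred func(T) bool, match func(T) bool) (T, bool) {
	idx := sort.Search(len(xs), func(i int) bool { return pred(xs[i]) })
	if idx < len(xs) && match(xs[idx]) {
		return xs[idx], true
	}
	var zero T
	return zero, false
}

// span mirrors the period type from the question, specialized to int.
type span struct{ min, max int }

func (s span) contains(v int) bool {
	if v <= 0 && s.min == 0 {
		return true
	}
	return v > s.min && (s.max == 0 || v <= s.max)
}

func main() {
	spans := []span{{0, 8000}, {8000, 16000}, {16000, 0}}
	s, ok := searchSorted(spans,
		func(s span) bool { return s.max == 0 || 9000 <= s.max }, // lessOrContains analogue
		func(s span) bool { return s.contains(9000) })
	fmt.Println(s, ok) // {8000 16000} true
}
```

A code-generation step could then stamp out specialized versions of searchSorted for concrete types, as the asker suggested.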

Go code gives a different result locally than on the Go Playground

I am using Go to implement the algorithm described below:
Given an array in which one number appears exactly once and every other number appears exactly three times, find the number that appears only once.
My code is listed below:
import (
    "testing"
)

func findBySum(arr []int) int {
    result := 0
    sum := [32]int{}
    for i := 0; i < 32; i++ {
        for _, v := range arr {
            sum[i] += (v >> uint(i)) & 0x1
        }
        sum[i] %= 3
        sum[i] <<= uint(i)
        result |= sum[i]
    }
    return result
}

func TestThree(t *testing.T) {
    // except one number, all other numbers appear three times
    a1 := []int{11, 222, 444, 444, 222, 11, 11, 17, -123, 222, -123, 444, -123}  // unique number is 17
    a2 := []int{11, 222, 444, 444, 222, 11, 11, -17, -123, 222, -123, 444, -123} // unique number is -17
    t.Log(findBySum(a1))
    t.Log(findBySum(a2))
}
However, the result on my PC is wrong, while the same code running at https://play.golang.org/p/hEseLZVL617 is correct, and I do not know why.
Result on my PC:
Result on https://play.golang.org/p/hEseLZVL617:
As we can see, when the unique number is positive both results are right, but when the unique number is negative the result on my PC is wrong while the online result is right.
I think it has something to do with the bit operations in my code, but I can't find the root cause.
I used IDEA 2019.1.1, and my Go version is listed below:
I don't know why the same code works fine online but not on my local PC. Can anyone help me analyze this? Thanks in advance!
The size of int is platform-dependent: it may be 32-bit or 64-bit. On the Go Playground it is 32-bit; on your local machine it is 64-bit.
If we change your example to use int64 explicitly instead of int, the result is the same on the Go Playground too:
func findBySum(arr []int64) int64 {
    result := int64(0)
    sum := [32]int64{}
    for i := int64(0); i < 32; i++ {
        for _, v := range arr {
            sum[i] += (v >> uint64(i)) & 0x1
        }
        sum[i] %= 3
        sum[i] <<= uint(i)
        result |= sum[i]
    }
    return result
}

func TestThree(t *testing.T) {
    // except one number, all other numbers appear three times
    a1 := []int64{11, 222, 444, 444, 222, 11, 11, 17, -123, 222, -123, 444, -123}  // unique number is 17
    a2 := []int64{11, 222, 444, 444, 222, 11, 11, -17, -123, 222, -123, 444, -123} // unique number is -17
    t.Log(findBySum(a1))
    t.Log(findBySum(a2))
}
You perform bitwise operations that assume 32-bit integer size. To get correct results locally (where your architecture and thus size of int and uint is 64-bit), change all ints to int32 and uint to uint32:
func findBySum(arr []int32) int32 {
    result := int32(0)
    sum := [32]int32{}
    for i := int32(0); i < 32; i++ {
        for _, v := range arr {
            sum[i] += (v >> uint32(i)) & 0x1
        }
        sum[i] %= 3
        sum[i] <<= uint(i)
        result |= sum[i]
    }
    return result
}

func TestThree(t *testing.T) {
    // except one number, all other numbers appear three times
    a1 := []int32{11, 222, 444, 444, 222, 11, 11, 17, -123, 222, -123, 444, -123}  // unique number is 17
    a2 := []int32{11, 222, 444, 444, 222, 11, 11, -17, -123, 222, -123, 444, -123} // unique number is -17
    t.Log(findBySum(a1))
    t.Log(findBySum(a2))
}
Lesson: if you perform calculations whose result depend on the representation size, always be explicit, and use fixed-size numbers like int32, int64, uint32, uint64.

Is there a more efficient way of multiplying byte arrays?

I developed a Go package, BESON, for doing operations on big numbers.
The Multiply operation code:
func Multiply(a []byte, b []byte) {
    ans := make([]byte, len(a)+len(b))
    bits := nbits(b)
    // walk the bits of b from most significant to least significant
    for i := int(bits) - 1; i >= 0; i-- {
        byteNum := i >> 3
        bitNum := uint(i & 7)
        LeftShift(ans, 1, 0)
        if (b[byteNum] & (1 << bitNum)) > 0 {
            Add(ans, a)
        }
    }
    copy(a, ans)
}
My approach is shift-and-add: for each bit of b, the accumulator is shifted left by one, and a is added whenever that bit of b is set.
Is there a more efficient way to implement Multiply?
Edit
The BESON package represents a big number as a byte array. For example, it represents a 128-bit unsigned integer as a byte array of size 16. So multiplying two 128-bit unsigned integers is actually multiplying two byte arrays.
Example:
input: a, b
a = []byte{ 204, 19, 46, 255, 0, 0, 0, 0 }
b = []byte{ 117, 10, 68, 47, 0, 0, 0, 0 }
Multiply(a, b)
fmt.Println(a)
output: a (the result is written back to a)
[60 4 5 35 76 72 29 47]
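For comparison, the standard library's math/big implements multiplication with more advanced algorithms (including Karatsuba for large operands). A sketch of the same operation via big.Int, assuming the byte slices are little-endian as in the example above (the helper names reverse and multiplyBig are mine, not part of BESON):

```go
package main

import (
	"fmt"
	"math/big"
)

// reverse returns a copy of b with the byte order flipped
// (the slices here are little-endian; big.Int.SetBytes expects big-endian).
func reverse(b []byte) []byte {
	r := make([]byte, len(b))
	for i, v := range b {
		r[len(b)-1-i] = v
	}
	return r
}

// multiplyBig multiplies two little-endian byte slices via math/big and
// writes the low len(a) bytes of the product back into a, mirroring the
// in-place behavior of the package's Multiply.
func multiplyBig(a, b []byte) {
	x := new(big.Int).SetBytes(reverse(a))
	y := new(big.Int).SetBytes(reverse(b))
	x.Mul(x, y)
	prod := reverse(x.Bytes()) // little-endian product
	for i := range a {
		if i < len(prod) {
			a[i] = prod[i]
		} else {
			a[i] = 0
		}
	}
}

func main() {
	a := []byte{204, 19, 46, 255, 0, 0, 0, 0}
	b := []byte{117, 10, 68, 47, 0, 0, 0, 0}
	multiplyBig(a, b)
	fmt.Println(a) // [60 4 5 35 76 72 29 47]
}
```

The conversions cost extra allocations per call, so for small fixed sizes a tuned in-place routine may still win; benchmarking both on realistic sizes would settle it.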

How to merge consecutive integers

I want to know if there's a good way to return true if two sets of integers can be merged into one consecutive run: {100, 101} can be merged with {103, 104, 102}, but {100, 101} and {103, 104, 105} cannot (102 is missing). My code so far, based on the question:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    slice := generateSlice(20)
    fmt.Println("\n--- Unsorted --- \n\n", slice)
    fmt.Println("\n--- Sorted ---\n\n", mergeSort(slice), "\n")
}

// Generates a slice of size, size filled with random numbers
func generateSlice(size int) []int {
    slice := make([]int, size, size)
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < size; i++ {
        slice[i] = rand.Intn(999) - rand.Intn(999)
    }
    return slice
}

func mergeSort(items []int) []int {
    var num = len(items)
    if num == 1 {
        return items
    }
    middle := int(num / 2)
    var (
        left  = make([]int, middle)
        right = make([]int, num-middle)
    )
    for i := 0; i < num; i++ {
        if i < middle {
            left[i] = items[i]
        } else {
            right[i-middle] = items[i]
        }
    }
    return merge(mergeSort(left), mergeSort(right))
}

func merge(left, right []int) (result []int) {
    result = make([]int, len(left)+len(right))
    i := 0
    for len(left) > 0 && len(right) > 0 {
        if left[0] < right[0] {
            result[i] = left[0]
            left = left[1:]
        } else {
            result[i] = right[0]
            right = right[1:]
        }
        i++
    }
    for j := 0; j < len(left); j++ {
        result[i] = left[j]
        i++
    }
    for j := 0; j < len(right); j++ {
        result[i] = right[j]
        i++
    }
    return
}
Output:
https://play.golang.org/p/oAtGTiUnxrE
The question:
A pumpung is a permutation of consecutive integers, possibly with repeated items.
Two pumpungs can be merged if they form a bigger pumpung. For example,
[100, 101] and [103, 102, 104],
[222, 221, 220, 219] and [221, 222, 223, 225, 224]
can be merged; whereas
[100, 101] and [103, 104, 105]
cannot.
Write a function, IsMergeable(pumpung1, pumpung2), returning true iff the given
pumpungs can be merged. You may assume that the arguments are really pumpungs.
For example,
package main

import "fmt"

const (
    maxInt = int(^uint(0) >> 1)
    minInt = -maxInt - 1
)

func isMerge(a1, a2 []int) bool {
    min1, max1 := maxInt, minInt
    for _, e1 := range a1 {
        if min1 > e1 {
            min1 = e1
        }
        if max1 < e1 {
            max1 = e1
        }
    }
    for _, e2 := range a2 {
        if e2 == min1-1 || e2 == max1+1 {
            return true
        }
    }
    return false
}

func main() {
    a1 := []int{100, 101}
    a2 := []int{103, 104, 102}
    a3 := []int{103, 104, 105}
    fmt.Println(a1, a2, isMerge(a1, a2))
    fmt.Println(a2, a1, isMerge(a2, a1))
    fmt.Println(a1, a3, isMerge(a1, a3))
    fmt.Println(a3, a1, isMerge(a3, a1))
    a4 := []int{222, 221, 220, 219}
    a5 := []int{221, 222, 223, 225, 224}
    fmt.Println(a4, a5, isMerge(a4, a5))
    fmt.Println(a5, a4, isMerge(a5, a4))
}
Playground: https://play.golang.org/p/rRGPoivhEWW
Output:
[100 101] [103 104 102] true
[103 104 102] [100 101] true
[100 101] [103 104 105] false
[103 104 105] [100 101] false
[222 221 220 219] [221 222 223 225 224] true
[221 222 223 225 224] [222 221 220 219] true
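Since each argument is guaranteed to be a pumpung (i.e. it covers a consecutive range), an alternative is to compare the two ranges directly: they can merge iff the ranges overlap or are adjacent. This formulation also returns true when one range is fully contained in the other, a case the adjacency-only scan does not report. A sketch with my own naming:

```go
package main

import "fmt"

// rangeOf returns the min and max of a non-empty slice.
func rangeOf(a []int) (lo, hi int) {
	lo, hi = a[0], a[0]
	for _, v := range a[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return
}

// isMergeable reports whether the two consecutive ranges overlap or are
// adjacent, i.e. whether their union is still one consecutive run.
func isMergeable(a1, a2 []int) bool {
	lo1, hi1 := rangeOf(a1)
	lo2, hi2 := rangeOf(a2)
	return lo2 <= hi1+1 && lo1 <= hi2+1
}

func main() {
	fmt.Println(isMergeable([]int{100, 101}, []int{103, 104, 102}))      // true
	fmt.Println(isMergeable([]int{100, 101}, []int{103, 104, 105}))      // false
	fmt.Println(isMergeable([]int{219, 220, 221, 222}, []int{220, 221})) // true (contained range)
}
```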

Matrix multiplication with goroutine drops performance

I am optimizing matrix multiplication via goroutines in Go.
My benchmark shows that introducing concurrency per row or per element significantly degrades performance:
goos: darwin
goarch: amd64
BenchmarkMatrixDotNaive/A.MultNaive-8 2000000 869 ns/op 0 B/op 0 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-8 100000 14467 ns/op 80 B/op 9 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-8 20000 77299 ns/op 528 B/op 65 allocs/op
I know the basics of cache locality, so it makes sense that per-element concurrency hurts performance. But why does per-row concurrency also hurt, even in this naive version?
In fact, I also wrote a block/tiling optimization whose vanilla version (without goroutine concurrency) is even worse than the naive version (not shown here; let's focus on the naive version first).
What am I doing wrong here? Why? How can I optimize it?
Multiplication:
package naive

import (
    "errors"
    "sync"
)

// Errors
var (
    ErrNumElements = errors.New("Error number of elements")
    ErrMatrixSize  = errors.New("Error size of matrix")
)

// Matrix is a 2d array
type Matrix struct {
    N    int
    data [][]float64
}

// New a size by size matrix
func New(size int) func(...float64) (*Matrix, error) {
    wg := sync.WaitGroup{}
    d := make([][]float64, size)
    for i := range d {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            d[i] = make([]float64, size)
        }(i)
    }
    wg.Wait()
    m := &Matrix{N: size, data: d}
    return func(es ...float64) (*Matrix, error) {
        if len(es) != size*size {
            return nil, ErrNumElements
        }
        for i := range es {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                m.data[i/size][i%size] = es[i]
            }(i)
        }
        wg.Wait()
        return m, nil
    }
}

// At access element (i, j)
func (A *Matrix) At(i, j int) float64 {
    return A.data[i][j]
}

// Set set element (i, j) with val
func (A *Matrix) Set(i, j int, val float64) {
    A.data[i][j] = val
}

// MultNaive matrix multiplication O(n^3)
func (A *Matrix) MultNaive(B, C *Matrix) (err error) {
    var (
        i, j, k int
        sum     float64
        N       = A.N
    )
    if N != B.N || N != C.N {
        return ErrMatrixSize
    }
    for i = 0; i < N; i++ {
        for j = 0; j < N; j++ {
            sum = 0.0
            for k = 0; k < N; k++ {
                sum += A.At(i, k) * B.At(k, j)
            }
            C.Set(i, j, sum)
        }
    }
    return
}

// ParalMultNaivePerRow matrix multiplication O(n^3) in concurrency per row
func (A *Matrix) ParalMultNaivePerRow(B, C *Matrix) (err error) {
    var N = A.N
    if N != B.N || N != C.N {
        return ErrMatrixSize
    }
    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := 0; j < N; j++ {
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }
        }(i)
    }
    wg.Wait()
    return
}

// ParalMultNaivePerElem matrix multiplication O(n^3) in concurrency per element
func (A *Matrix) ParalMultNaivePerElem(B, C *Matrix) (err error) {
    var N = A.N
    if N != B.N || N != C.N {
        return ErrMatrixSize
    }
    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }(i, j)
        }
    }
    wg.Wait()
    return
}
Benchmark:
package naive

import (
    "os"
    "runtime/trace"
    "testing"
)

type Dot func(B, C *Matrix) error

var (
    A = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{1, 2, 3, 4, 5, 6, 7, 8},
            []float64{9, 1, 2, 3, 4, 5, 6, 7},
            []float64{8, 9, 1, 2, 3, 4, 5, 6},
            []float64{7, 8, 9, 1, 2, 3, 4, 5},
            []float64{6, 7, 8, 9, 1, 2, 3, 4},
            []float64{5, 6, 7, 8, 9, 1, 2, 3},
            []float64{4, 5, 6, 7, 8, 9, 1, 2},
            []float64{3, 4, 5, 6, 7, 8, 9, 0},
        },
    }
    B = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{9, 8, 7, 6, 5, 4, 3, 2},
            []float64{1, 9, 8, 7, 6, 5, 4, 3},
            []float64{2, 1, 9, 8, 7, 6, 5, 4},
            []float64{3, 2, 1, 9, 8, 7, 6, 5},
            []float64{4, 3, 2, 1, 9, 8, 7, 6},
            []float64{5, 4, 3, 2, 1, 9, 8, 7},
            []float64{6, 5, 4, 3, 2, 1, 9, 8},
            []float64{7, 6, 5, 4, 3, 2, 1, 0},
        },
    }
    C = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
        },
    }
)

func BenchmarkMatrixDotNaive(b *testing.B) {
    f, _ := os.Create("bench.trace")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()
    tests := []struct {
        name string
        f    Dot
    }{
        {
            name: "A.MultNaive",
            f:    A.MultNaive,
        },
        {
            name: "A.ParalMultNaivePerRow",
            f:    A.ParalMultNaivePerRow,
        },
        {
            name: "A.ParalMultNaivePerElem",
            f:    A.ParalMultNaivePerElem,
        },
    }
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                tt.f(B, C)
            }
        })
    }
}
Performing an 8x8 matrix multiplication is relatively little work.
Goroutines (although they may be lightweight) do have overhead. If the work they do is "small", the overhead of launching, synchronizing and throwing them away may outweigh the performance gain of utilizing multiple cores / threads, and overall you might not gain performance by executing such small tasks concurrently (you may even do worse than without goroutines). Measure.
If we increase the matrix size to 80x80, running the benchmark we already see some performance gain in case of ParalMultNaivePerRow:
BenchmarkMatrixDotNaive/A.MultNaive-4 2000 1054775 ns/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-4 2000 709367 ns/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-4 100 10224927 ns/op
(As you see in the results, I have 4 CPU cores, running it on your 8-core machine might show more performance gain.)
When rows are small, you are using goroutines to do minimal work; you may improve performance by not "throwing away" goroutines once they're done with their "tiny" work, but "reusing" them instead. See this related question: Is this an idiomatic worker thread pool in Go?
Also see related / possible duplicate: Vectorise a function taking advantage of concurrency
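The worker-pool idea can be sketched as follows, using plain [][]float64 rather than the Matrix type from the question (so this is an illustrative sketch, not the original API): a fixed number of workers, one per CPU, pull row indices from a channel, so the same goroutines are reused for every row instead of being spawned per row.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelMulRows multiplies two n×n matrices with a fixed pool of
// workers; each worker repeatedly receives a row index and computes
// that row of the product, so goroutine launch cost is paid once per
// worker, not once per row.
func parallelMulRows(a, b [][]float64) [][]float64 {
	n := len(a)
	c := make([][]float64, n)
	rows := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range rows { // pull rows until the channel is closed
				row := make([]float64, n)
				for j := 0; j < n; j++ {
					sum := 0.0
					for k := 0; k < n; k++ {
						sum += a[i][k] * b[k][j]
					}
					row[j] = sum
				}
				c[i] = row // each worker writes a distinct row: no race
			}
		}()
	}
	for i := 0; i < n; i++ {
		rows <- i
	}
	close(rows)
	wg.Wait()
	return c
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	fmt.Println(parallelMulRows(a, b)) // [[19 22] [43 50]]
}
```

Even with a pool, the per-row work must be large enough to amortize the channel send/receive; for tiny matrices the sequential loop will still win, so measure on realistic sizes.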
