Count distinct values in array - performance tips


I'm having some issues optimizing a Go map.
I want to generate a frequency table (count the occurrences of each distinct value) from an array of strings. My code holds up nicely for small arrays, but once I start working with arrays of 100k+ elements, with many distinct values, it just isn't performant enough.
Right now, my approach is to generate an array with the distinct values, compare each row against each distinct value, and increment the counter (mapped to the string) on a match.
counter := make(map[string]int)
for _, distinct := range distinctStrArray {
	for _, row := range StrArray {
		if row == distinct {
			counter[distinct]++
		}
	}
}
I've tried another approach, with the input array sorted beforehand and a local counter (to minimize the number of writes to the map). This is a bit faster.
count := 0
for _, distinct := range distinctStrArray {
	for _, row := range StrArray {
		if row == distinct {
			count++
		}
	}
	counter[distinct] += count
	count = 0
}
Do you have any suggestions for optimizing a simple count(distinct) type of problem? I'm open to anything.
Thanks!

Without more context, I would dump the separate array of distinct values - generating it takes time, and using it necessitates the nested loop. Assuming there's no other purpose to the second array, I'd use something like:
counter := make(map[string]int)
for _, row := range StrArray {
	counter[row]++
}
If you need the list of distinct strings without the counts for some separate purpose, you can easily get it afterward:
distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
	distinctStrings[i] = k
	i++
}
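Scanning the array of distinct strings for every row is O(m) per row, where m is the number of distinct values, so the nested loops cost O(n*m), which approaches O(n^2) when most values are distinct. Go maps are hash tables, so access by key is O(1) on average, and the single pass is O(n) overall. That should be a significant improvement with larger datasets. But, as with any optimization: test, measure, analyze, optimize.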
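As a minimal sketch of the single-pass approach above, packaged as a runnable program: the function names countDistinct and sortedKeys and the sample data are assumptions made for illustration, not part of the original question.

```go
package main

import (
	"fmt"
	"sort"
)

// countDistinct returns how many times each string occurs in rows.
// It makes a single pass over the input, with one map update per row.
func countDistinct(rows []string) map[string]int {
	counter := make(map[string]int)
	for _, row := range rows {
		counter[row]++
	}
	return counter
}

// sortedKeys returns the keys of counter in ascending order, so that
// printed output is deterministic (map iteration order is not).
func sortedKeys(counter map[string]int) []string {
	keys := make([]string, 0, len(counter))
	for k := range counter {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	rows := []string{"go", "c", "go", "rust", "c", "go"} // sample data
	counter := countDistinct(rows)
	for _, k := range sortedKeys(counter) {
		fmt.Println(k, counter[k])
	}
}
```

Sorting the keys is only needed when the output order matters; the counting itself does not depend on it.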

Related

How do I print a 2-Dimensional array as a grid in Golang?

The 2D Array I am trying to print as a board
Note: I am a complete novice at Go and need this for a final project.
I am attempting to make the game Snakes and Ladders. I need to print a 10x10 2D array as a grid so that it looks more like a board.
I've tried using:
for row := 0; row < 10; row ++ 10{
} for column := 0; column < 10; column++{
fmt.Println()
}
But it fails.
Any function or any method to do so?

You are almost there: you need to pass the value you want to print to fmt.Println. Keep in mind that Println always appends a newline to the output, so use fmt.Print to print each value without one.
for row := 0; row < 10; row++ {
	for column := 0; column < 10; column++ {
		fmt.Print(board[row][column], " ")
	}
	fmt.Print("\n")
}
Bonus tip: instead of hardcoded sizes, you can use range to loop over each element, which works for arrays and slices of any size.

Range-Based Solution
Ranges save us from passing the length directly, and so could make the function re-usable for 2D arrays of differing heights and widths. (Go By Example range page).
A general purpose 2D matrix iterator
Using a range to loop over every value in a 2D array might look like ...
// Code for some "board" matrix of type [][]int, for example...
board := [][]int{
	{1, 2, 3},
	{4, 5, 6},
}
// First we iterate over "board", which is an array of rows:
for r, _ := range board {
	// Then we iterate over the items of each row:
	for c, colValue := range board[r] {
		// See string formatting docs at
		// https://gobyexample.com/string-formatting
		fmt.Printf("value at index [%d][%d]", r, c)
		fmt.Println(" is", colValue)
	}
}
What the underscores mean
The blank identifier _ stands in for a value you don't intend to use. Go's compiler rejects variables that are declared but never used, so an unused loop variable must either be omitted or written as _.
The variables r and c give us the integer indexes within the matrix, starting at 0 and counting up. The _ after r occupies the slot that would otherwise hold the whole row, which we never use later in the code. Because it is the second range variable, it could also be dropped entirely (for r := range board). (Yes, we could alternatively have written for r, row := range board and then ranged over row instead of board[r], in which case we would be using the row.)
We would also have had to write _ in place of c if we hadn't later used c in the Printf statement. Here is a simpler version with no index access:
// Just prints all the values.
for _, row := range board {
	for _, colValue := range row {
		fmt.Println(colValue)
	}
}
Why "colValue" and not "col"?
In this pattern, a telling name like "colValue" is used instead of "column" because at this inner point in the code we have drilled down to a single element, rather than a whole set of elements as when accessing a row with board[r].
In the simpler version above, we don't use the indices at all, so they are written as _.

Add []bytes append slice []bytes

I have started learning Go and there is something I don't quite understand; maybe I'm just confused and tired.
Here is my code. There is an array result (of encoded strings, 2139614 elements). I need to decode them and use them further. But after the loop runs, the resulting rips slice is twice as large as expected and the first half is completely empty. As a workaround, I take a slice of just the desired range.
Why does this happen?
It might be easier to decode result in place and overwrite it, but I don't know how to do that.
Maybe there is a completely different way that, as a beginner, I don't know yet.
result := []string{}
for i := range input {
	result = append(result, i)
}
sort.Strings(result)
rips := make([][]byte, 2139614)
for _, i := range result {
	c := Decode(i)
	c = c[1:37]
	rips = append(rips, c)
}
// len(result) == 2139614
for i := 2139610; i < 2139700; i++ {
	fmt.Println(i, rips[i])
}
resultrips := rips[2139614:]
for _, i := range resultrips {
	fmt.Println(i)
}
fmt.Println("All write: ", len(resultrips))
And another question: is this the right way to get an array of byte arrays? (I do it to avoid extra work; I will compare the values as raw bytes, since no encoding is involved.)
rips := make([][]byte, 2139614) // array of []byte
In the end, I need something like the set type in C++, to check whether an element is in my set.
In C++ the code was:
if (resultrips.count > 0) { ... }

When you write:
make([][]byte, 2139614)
This creates a slice with length and capacity both equal to 2139614, so it already holds 2139614 nil elements. append always adds after the last element, thereby increasing the length, which is why the first half of rips is empty. If you want to pre-allocate a large slice so that you can append into it, specify a length of 0 and pass the size as the capacity:
make([][]byte, 0, 2139614)
This pre-allocates room for 2139614 elements but leaves the length at 0, so subsequent append calls start at the beginning of the slice. After the first append it has a length of 1, and its capacity has not needed to grow.
Length vs capacity is covered in the Tour of Go: https://tour.golang.org/moretypes/13
A quick note based on the text of your question - remember that slices and arrays are not the same thing. Arrays have a compile-time fixed length and their capacity is synonymous with their length. Slices are backed by arrays but have runtime dynamic independent length and capacity.
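To illustrate the length versus capacity distinction, here is a small sketch. The function decodeAll is a hypothetical stand-in for the loop in the question, and the []byte(s) conversion is only a placeholder for the real Decode call, whose behavior is not shown in the question.

```go
package main

import "fmt"

// decodeAll is a hypothetical stand-in for the loop in the question:
// it pre-allocates capacity, then appends one entry per input string.
func decodeAll(inputs []string) [][]byte {
	out := make([][]byte, 0, len(inputs)) // length 0, capacity len(inputs)
	for _, s := range inputs {
		out = append(out, []byte(s)) // placeholder for Decode(s)
	}
	return out
}

func main() {
	// Length 3: three nil entries exist before anything is appended.
	withLen := make([][]byte, 3)
	withLen = append(withLen, []byte("x"))
	fmt.Println(len(withLen), withLen[0] == nil) // 4 true

	// Length 0, capacity 3: appends fill the slice from index 0.
	out := decodeAll([]string{"a", "b", "c"})
	fmt.Println(len(out), cap(out)) // 3 3
}
```

If the slice is created with a non-zero length instead, the alternative to append is to assign by index (rips[i] = c) inside the loop.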

Why is iterating over a map so much slower than iterating over a slice in Golang?

I was implementing a sparse matrix using a map in Go and noticed that my code started taking much longer to complete after this change. After ruling out other possible causes, it seems the culprit is the iteration over the map itself. Go Playground link (doesn't work for some reason).
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	z := 50000000
	a := make(map[int]int, z)
	b := make([]int, z)
	for i := 0; i < z; i++ {
		a[i] = i
		b[i] = i
	}
	t0 := time.Now()
	for key, value := range a {
		if key != value { // never happens
			fmt.Println("a", key, value)
		}
	}
	d0 := time.Now().Sub(t0)
	t1 := time.Now()
	for key, value := range b {
		if key != value { // never happens
			fmt.Println("b", key, value)
		}
	}
	d1 := time.Now().Sub(t1)
	fmt.Println(
		"a:", d0,
		"b:", d1,
		"diff:", math.Max(float64(d0), float64(d1))/math.Min(float64(d0), float64(d1)),
	)
}
Iterating over 50M items returns the following timings:
alix#local:~/Go/src$ go version
go version go1.3.3 linux/amd64
alix#local:~/Go/src$ go run b.go
a: 1.195424429s b: 68.588488ms diff: 17.777154632611037
I wonder, why is iterating over a map almost 20x as slow when compared to a slice?

This comes down to the representation in memory. How familiar are you with the representation of different data structures and the concept of algorithmic complexity? Iterating over an array or slice is simple: the values are contiguous in memory. Iterating over a map, however, requires traversing the key space and doing lookups into the hash-table structure.
Maps let you insert arbitrary keys without allocating a huge, mostly empty array, and lookups by key remain efficient even though each one costs more than indexing an array. That is why hash tables are sometimes preferred, even though arrays (and slices) have a faster constant-time (O(1)) lookup given an index.
It all comes down to whether you need the features of this or that data structure and whether you're willing to deal with the side-effects or gotchas involved.
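A rough way to observe the difference yourself is sketched below. The helper names sumMap and sumSlice and the element count are assumptions for illustration; the timings depend on the machine and Go version, so no expected figures are given.

```go
package main

import (
	"fmt"
	"time"
)

// sumSlice walks contiguous memory, one element after another.
func sumSlice(s []int) int {
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

// sumMap walks the hash table's internal buckets in unspecified order.
func sumMap(m map[int]int) int {
	total := 0
	for _, v := range m {
		total += v
	}
	return total
}

func main() {
	const n = 1000000 // smaller than in the question, to keep the run short
	m := make(map[int]int, n)
	s := make([]int, n)
	for i := 0; i < n; i++ {
		m[i] = i
		s[i] = i
	}

	t0 := time.Now()
	a := sumMap(m)
	dMap := time.Since(t0)

	t1 := time.Now()
	b := sumSlice(s)
	dSlice := time.Since(t1)

	fmt.Println("map:", a, dMap, "slice:", b, dSlice)
}
```

For more reliable numbers than a single timed run, the standard testing package's benchmark support is the usual tool.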

It seems reasonable to post my comment as an answer. The underlying structures whose iteration performance you're comparing are a hash table and an array (https://en.wikipedia.org/wiki/Hash_table vs https://en.wikipedia.org/wiki/Array_data_structure). The range abstraction is (speculation, I can't find the code) iterating over all the keys, accessing each value, and assigning the two to k, v. In case you're not familiar with it, array access is constant time because you just add sizeof(type)*i to the starting pointer to get the item. I don't know the internals of Go's map, but I know enough to say that its memory representation, and therefore its access, is nowhere near that efficient.
The spec's statement on the topic doesn't say much: http://golang.org/ref/spec#For_statements
If I find the time to look up the implementation of range for maps and slices/arrays, I will add some more technical details.

Check whether a string slice contains a certain value in Go

What is the best way to check whether a certain value is in a string slice? I would use a Set in other languages, but Go doesn't have one.
My best try is this so far:
package main

import "fmt"

func main() {
	list := []string{"a", "b", "x"}
	fmt.Println(isValueInList("b", list))
	fmt.Println(isValueInList("z", list))
}

func isValueInList(value string, list []string) bool {
	for _, v := range list {
		if v == value {
			return true
		}
	}
	return false
}
http://play.golang.org/p/gkwMz5j09n
This solution should be ok for small slices, but what to do for slices with many elements?

If you have a slice of strings in an arbitrary order, finding if a value exists in the slice requires O(n) time. This applies to all languages.
If you intend to search over and over again, you can use other data structures to make lookups faster. However, building these structures requires at least O(n) time, so you only benefit if you do more than one lookup using the data structure.
For example, you could load your strings into a map. Then lookups would take O(1) time. Insertions also take O(1) time making the initial build take O(n) time:
set := make(map[string]bool)
for _, v := range list {
	set[v] = true
}
fmt.Println(set["b"])
You can also sort your string slice and then do a binary search. Binary searches occur in O(log(n)) time. Building can take O(n*log(n)) time.
sort.Strings(list)
i := sort.SearchStrings(list, "b")
fmt.Println(i < len(list) && list[i] == "b")
Although a map lookup is asymptotically faster (O(1) on average versus O(log(n))), for small lists a binary search over a sorted slice may well be faster in practice because it avoids hashing. You need to benchmark it yourself.
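The two options above can be put side by side in one runnable sketch. The helper names buildSet and containsSorted are assumptions for illustration; sort.Strings and sort.SearchStrings are the standard library calls used in the snippets above.

```go
package main

import (
	"fmt"
	"sort"
)

// buildSet loads the strings into a map; building it takes O(n).
func buildSet(list []string) map[string]bool {
	set := make(map[string]bool, len(list))
	for _, v := range list {
		set[v] = true
	}
	return set
}

// containsSorted reports whether value is present in sorted,
// which must already be in ascending order.
func containsSorted(sorted []string, value string) bool {
	i := sort.SearchStrings(sorted, value)
	return i < len(sorted) && sorted[i] == value
}

func main() {
	list := []string{"x", "a", "b"}

	set := buildSet(list)
	fmt.Println(set["b"], set["z"]) // true false

	// Sort a copy so the caller's original order is preserved.
	sorted := append([]string(nil), list...)
	sort.Strings(sorted)
	fmt.Println(containsSorted(sorted, "b"), containsSorted(sorted, "z")) // true false
}
```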

To replace sets, use a map[string]struct{}. This is efficient and considered idiomatic: the empty struct "values" take up no space.
Initialize the set:
set := make(map[string]struct{})
Add an item:
set["item"] = struct{}{}
Check whether an item is present:
_, isPresent := set["item"]
Remove an item:
delete(set, "item")
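If these four operations are used in many places, they can be wrapped in a small type. This is only a sketch: StringSet and its method names are hypothetical, not something the standard library provides.

```go
package main

import "fmt"

// StringSet is a hypothetical helper type wrapping map[string]struct{}.
type StringSet map[string]struct{}

// Add inserts item into the set; adding it twice has no extra effect.
func (s StringSet) Add(item string) { s[item] = struct{}{} }

// Has reports whether item is in the set.
func (s StringSet) Has(item string) bool {
	_, ok := s[item]
	return ok
}

// Remove deletes item from the set; it is a no-op if item is absent.
func (s StringSet) Remove(item string) { delete(s, item) }

func main() {
	set := make(StringSet)
	set.Add("a")
	set.Add("b")
	fmt.Println(set.Has("a"), set.Has("z")) // true false

	set.Remove("a")
	fmt.Println(set.Has("a"), len(set)) // false 1
}
```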

You can use a map with, for example, a bool value:
m := map[string]bool{"a": true, "b": true, "x": true}
if m["a"] { // false if "a" is not in the map
	// it was in the map
}
There's also the sort package, so you could sort your slice and then binary search it.

how to sort data at the time of adding it, not later?

I am new to algorithms so please forgive me if this sounds basic or stupid.
I want to know this : instead of adding data into some kind of list and then performing a sort on the list, is there a method (data structure+algorithm) that lets me sort the data at the time of adding itself, or to put it another way, inserts the data in its proper place?
eg: if I want to add '3' to {1,5,6}, instead of adding it at the start or end and then sorting the list, I want '3' to go after '1' "directly".
thanks

If you use a binary search tree instead of an array, the sorting happens "automatically", because it is done by the insert method of the nodes. A binary search tree is therefore always sorted, and it's easy to traverse in order. The only problem is that when the incoming data is already (more or less) sorted, the tree becomes unbalanced, which is where red-black trees and other self-balancing variations come into play.
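A minimal sketch of that idea, using a plain unbalanced tree; the node type and the insert and inOrder functions are names chosen for illustration, and no balancing is attempted.

```go
package main

import "fmt"

// node is one element of an unbalanced binary search tree.
type node struct {
	value       int
	left, right *node
}

// insert places value in the subtree rooted at n and returns the
// (possibly new) subtree root. Smaller values go left, others right.
func insert(n *node, value int) *node {
	if n == nil {
		return &node{value: value}
	}
	if value < n.value {
		n.left = insert(n.left, value)
	} else {
		n.right = insert(n.right, value)
	}
	return n
}

// inOrder appends the values of the subtree rooted at n to out
// in ascending order and returns the extended slice.
func inOrder(n *node, out []int) []int {
	if n == nil {
		return out
	}
	out = inOrder(n.left, out)
	out = append(out, n.value)
	return inOrder(n.right, out)
}

func main() {
	var root *node
	for _, v := range []int{1, 5, 6, 3} {
		root = insert(root, v)
	}
	fmt.Println(inOrder(root, nil)) // [1 3 5 6]
}
```

Feeding this sketch already sorted input degrades the tree into a linked list, which is the unbalanced case mentioned above.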

You want to maintain a sorted array at all times, so you need to find the correct position in the sequence for every new element you add. Finding that position can be done efficiently (O(log n) complexity) with a modified binary search algorithm; note that inserting into an array still requires shifting the later elements, which is O(n).

There are basically two methods for inserting a value into a sorted list; which one you use depends on what kind of list you have:
Use binary search to locate where the value should be inserted, and insert the value there.
Loop from the end of the list, moving all higher values one step up, and put the value in the gap before the lowest higher value.
The first method would typically be used with a structure where you don't have to move items to do the insert, such as a binary tree. (Binary search needs random access, so on a plain linked list locating the position is still a linear walk.)

Yes, but that's usually slower than adding all the data and sorting it afterwards, because to insert each element in place as it is added you need to traverse the list every time.
With binary search you need not look at every element, but even then you often end up touching more elements than when you sort afterwards.
So from a performance point of view, sorted insert is usually inferior to sorting once at the end.

Here is Go code that sorts inputs on the fly.
It determines the position where the input fits via binary search, then splits the already sorted array at that position, appends the element to the first part, and rejoins the two parts.
Time complexity per input = log N (to determine the position) + 3N (to create the slices and append)
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	reader := bufio.NewReader(os.Stdin)
	var a []int
	for {
		fmt.Print("Please enter a value:")
		text, err := readLine(reader)
		if err != nil {
			// Stop on EOF or a read error instead of looping forever.
			fmt.Println(err)
			return
		}
		key, err := strconv.Atoi(text)
		if err != nil {
			fmt.Println("not an integer:", text)
			continue
		}
		pos := binarySearch(a, 0, len(a)-1, key)
		// Copy the part before pos, append the key, then the rest.
		p1 := append([]int{}, a[:pos]...)
		p1 = append(p1, key)
		p1 = append(p1, a[pos:]...)
		a = p1
		fmt.Println(a)
	}
}

// binarySearch returns the index at which key should be inserted
// into the sorted range a[low..high] so that a stays sorted.
func binarySearch(a []int, low int, high int, key int) int {
	if high == -1 { // empty slice
		return 0
	} else if key >= a[high] {
		return high + 1
	} else if key <= a[low] {
		return low
	}
	mid := (high + low) / 2
	if key < a[mid] {
		return binarySearch(a, low, mid-1, key)
	} else if key > a[mid] {
		return binarySearch(a, mid+1, high, key)
	}
	return mid // a[mid] == key
}

// readLine reads one line and strips the trailing line ending.
func readLine(reader *bufio.Reader) (string, error) {
	text, err := reader.ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimRight(text, "\r\n"), nil
}
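As an alternative sketch, the standard library's sort.SearchInts can replace the hand-written binary search, and a single copy can shift the tail in place instead of building two new slices. The function name insertSorted is an assumption for illustration; this is a different technique from the answer's code, not a restatement of it.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted inserts key into the ascending slice a and returns
// the result. Finding the position is O(log n); shifting is O(n).
func insertSorted(a []int, key int) []int {
	pos := sort.SearchInts(a, key) // first index with a[pos] >= key
	a = append(a, 0)               // grow the slice by one element
	copy(a[pos+1:], a[pos:])       // shift the tail one step right
	a[pos] = key
	return a
}

func main() {
	var a []int
	for _, v := range []int{5, 1, 6, 3} {
		a = insertSorted(a, v)
	}
	fmt.Println(a) // [1 3 5 6]
}
```

The built-in copy handles overlapping source and destination, which is what makes the in-place shift safe.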
