GO: Slice of unique structs effective reusable implementation - go

I often need to get rid of duplicates based on arbitrary equals function.
I need implementation that:
is fast and memory effective (does not create map)
is reusable and easy to use, think of slice.Sort() (github.com/bradfitz/slice)
it's not required to keep order of the original slice or preserve original slice
would be nice to minimize copying
Can this be implemented in go? Why this function is not part of some library I am aware of?
I was looking e.g. godash (github.com/zillow/godash) implementation uses map and does not allow arbitrary less and equal.
Here is how it should approximately look like.
Test:
import (
"reflect"
"testing"
)
type bla struct {
ID string
}
type blas []bla
func (slice blas) Less(i, j int) bool {
return slice[i].ID < slice[j].ID
}
func (slice blas) EqualID(i, j int) bool {
return slice[i].ID == slice[j].ID
}
func Test_Unique(t *testing.T) {
input := []bla{bla{ID: "d"}, bla{ID: "a"}, bla{ID: "b"}, bla{ID: "a"}, bla{ID: "c"}, bla{ID: "c"}}
expected := []bla{bla{ID: "a"}, bla{ID: "b"}, bla{ID: "c"}, bla{ID: "d"}}
Unique(input, blas(input).Less, blas(input).EqualID)
if !reflect.DeepEqual(expected, input) {
t.Errorf("2: Expected: %v but was %v \n", expected, input)
}
}
What I think will need to be used to implement this:
Only slices as data structure to keep it simple and for easy sorting.
Some reflection - the hard part for me! Since I am new to go.

Options
You can sort slice and check for adjacent nodes creation = O(n logn),lookup = O(log n) , insertion = O(n), deletion = O(n)
You can use a Tree and the original slice together creation = O(n logn),lookup = O(log n) , insertion = O(log n), deletion = O(log n)
In the tree implementation you may put only the index in tree nodes and evaluation of nodes will be done using the Equal/Less functions defined for the interface.
Here is an example with tree, here is the play link
You have to add more functions to make it usable ,and the code is not cache friendly so you may improve the code for make it cache friendly
How to use
Make the type representing slice implement Setter interface
set := NewSet(slice),creates a slice
now set.T has only unique values indexes
implement more functions to Set for other set operations
Code
type Set struct {
T Tree
Slice Setter
}
func NewSet(slice Setter) *Set {
set := new(Set)
set.T = Tree{nil, 0, nil}
set.Slice = slice
for i:=0;i < slice.Len();i++ {
insert(&set.T, slice, i)
}
return set
}
type Setter interface {
Len() int
At(int) (interface{},error)
Less(int, int) bool
Equal(int, int) bool
}
// A Tree is a binary tree with integer values.
type Tree struct {
Left *Tree
Value int
Right *Tree
}
func insert(t *Tree, Setter Setter, index int) *Tree {
if t == nil {
return &Tree{nil, index, nil}
}
if Setter.Equal(t.Value, index) {
return t
}
if Setter.Less(t.Value, index) {
t.Left = insert(t.Left, Setter, index)
return t
}
t.Right = insert(t.Right, Setter, index)
return t
}

Bloom filter is frequently used for equality test. There is https://github.com/willf/bloom for example, which awarded some stars on github. This particular implementation uses murmur3 for hashing and bitset for filter, so can be more efficient than map.

Related

Golang maps/hashmaps, optimizing for iteration speed

When migrating a production NodeJS application to Golang I've noticed that iteration of GO's native Map is actually slower than Node.
I've come up with an alternative solution that sacrifices removal/insertion speed with iteration speed instead, by exposing an array that can be iterated over and storing key=>index pairs inside a separate map.
While this solution works, and has a significant performance increase, I was wondering if there is a better solution to this that I could look into.
The setup I have is that its very rare something is removed from the hashmaps, only additions and replacements are common for which this implementation 'works', albeit feels like a workaround more than an actual solution.
The maps are always indexed by an integer, holding arbitrary data.
FastMap: 500000 Iterations - 0.153000ms
Native Map: 500000 Iterations - 4.988000ms
/*
Unordered hash map optimized for iteration speed.
Stores values in an array and holds key=>index mappings inside a separate hashmap
*/
type FastMapEntry[K comparable, T any] struct {
Key K
Value T
}
type FastMap[K comparable, T any] struct {
m map[K]int // Stores key => array index mappings
entries []FastMapEntry[K, T] // Array holding entries and their keys
len int // Total map size
}
func MakeFastMap[K comparable, T any]() *FastMap[K, T] {
return &FastMap[K, T]{
m: make(map[K]int),
entries: make([]FastMapEntry[K, T], 0),
}
}
func (m *FastMap[K, T]) Set(key K, value T) {
index, exists := m.m[key]
if exists {
// Replace if key already exists
m.entries[index] = FastMapEntry[K, T]{
Key: key,
Value: value,
}
} else {
// Store the key=>index pair in the map and add value to entries. Increase total len by one
m.m[key] = m.len
m.entries = append(m.entries, FastMapEntry[K, T]{
Key: key,
Value: value,
})
m.len++
}
}
func (m *FastMap[K, T]) Has(key K) bool {
_, exists := m.m[key]
return exists
}
func (m *FastMap[K, T]) Get(key K) (value T, found bool) {
index, exists := m.m[key]
if exists {
found = true
value = m.entries[index].Value
}
return
}
func (m *FastMap[K, T]) Remove(key K) bool {
index, exists := m.m[key]
if exists {
// Remove value from entries
m.entries = append(m.entries[:index], m.entries[index+1:]...)
// Remove key=>index mapping
delete(m.m, key)
m.len--
for i := index; i < m.len; i++ {
// Move all index mappings up, starting from current index
m.m[m.entries[i].Key] = i
}
}
return exists
}
func (m *FastMap[K, T]) Entries() []FastMapEntry[K, T] {
return m.entries
}
func (m *FastMap[K, T]) Len() int {
return m.len
}
The test code that was ran is:
// s.Variations is a native map holding ~500k records
start := time.Now()
iterations := 0
for _, variation := range s.Variations {
if variation.Id > 0 {
}
iterations++
}
log.Printf("Native Map: %d Iterations - %fms\n", iterations, float64(time.Since(start).Microseconds())/1000)
// Copy data into FastMap
fm := helpers.MakeFastMap[state.VariationId, models.ItemVariation]()
for key, variation := range s.Variations {
fm.Set(key, variation)
}
start = time.Now()
iterations = 0
for _, variation := range fm.Entries() {
if variation.Value.Id > 0 {
}
iterations++
}
log.Printf("FastMap: %d Iterations - %fms\n", iterations, float64(time.Since(start).Microseconds())/1000)
I think this kind of comparison and benchmarking is a little off-topic. Go implementation of map is quite different from your implementation, basically because it needs to cover a wider area of entries, the structs used in compile time are actually kind of heavy (not so much though, they basically store some information about the types you use in your map and so on), and the implementation approach is different! Go implementation of map is basically a hashmap (yours is not obviously, or it is, but the actual hashing implementation is delegated to the m map you hold internally).
One of the other factors makes you get this result is, if you take a look at this:
for _, variation := range fm.Entries() {
if variation.Value.Id > 0 {
}
iterations++
}
Basically, you're iterating over a slice, which is much easier and faster to iterate rather than a map, you have a view to an array, which holds elements of the same types next to each other, makes sense, right?
What you should do to make a better comparison would be something like this:
for _, y := range fastMap.m {
_ = fastMap.Entries()[y].Value + 1 // some simple calculation
}
If you're really looking for performance, a well written hash function and a fixed size array would be your best choice.

Sort 2D array of structs Golang

I want to create a consistent ordering for a 2D slice of structs, I am creating the 2D slice from a map so the order is always different.
My structs look like
// Hit contains the data for a hit.
type Hit struct {
Key string `json:"key"`
Data []Field `json:"data"`
}
// Hits stores a list of hits.
type Hits [][]Hit
I want to provide a consistent order for the contents of my Hits type.
I have tried:
func (c Hits) Len() int { return len(c) }
func (c Hits) Swap(i, j int) { c[i], c[j] = c[j], c[i] }
func (c Hits) Less(i, j int) bool { return strings.Compare(c[i][0].Key, c[j][0].Key) == -1 }
But the results still seem to come back in random order.
I was thinking of possibly hashing each item in the slice but thought there might be an easier option
The order of iteration over a map, because it's a hash table is rather indeterminate (it's not, really — insert items with the same keys in the same exact sequence into 2 maps and the order of iteration for each will be identical).
Assuming that your map is a map[string]Hit, to iterate it over in a determinate order, I would enumerate the set of keys in the map, sort that, and use that sorted set to enumerate the map.
Something like this:
package main
import (
"fmt"
"sort"
)
type Hit struct {
Key string `json:"key"`
Data []Field `json:"data"`
}
type Field struct {
Value string `json:"value"`
}
func main() {
var mapOfHits = getSomeHits()
var sortedHits = sortHits(mapOfHits)
for _, h := range sortedHits {
fmt.Println(h.Key)
}
}
func getSomeHits() map[string]Hit {
return make(map[string]Hit, 0)
}
func sortHits(m map[string]Hit) []Hit {
keys := make([]string, 0, len(m))
sorted := make([]Hit, 0, len(m))
for k := range m {
keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
sorted = append(sorted, m[k])
}
return sorted
}

golang: Insert to a sorted slice

What's the most efficient way of inserting an element to a sorted slice?
I tried a couple of things but all ended up using at least 2 appends which as I understand makes a new copy of the slice
Here is how to insert into a sorted slice of strings:
Go Playground Link to full example: https://play.golang.org/p/4RkVgEpKsWq
func Insert(ss []string, s string) []string {
i := sort.SearchStrings(ss, s)
ss = append(ss, "")
copy(ss[i+1:], ss[i:])
ss[i] = s
return ss
}
If the slice has enough capacity then there's no need for a new copy.
The elements after the insert position can be shifted to the right.
Only when the slice doesn't have enough capacity,
a new slice and copying all values will be necessary.
Keep in mind that slices are not designed for fast insertion.
So there won't be a miracle solution here using slices.
You could create a custom data structure to make this more efficient,
but obviously there will be other trade-offs.
One point that can be optimized in the process is finding the insertion point quickly. If the slice is sorted, then you can use binary search to perform this in O(log n) time.
However, this might not matter much,
considering the expensive operation of copying the end of the slice,
or reallocating when necessary.
I like #likebike's answer but it only works for strings. Here is the generic version that will work for a slice of any ordered type (requires Go 1.18):
func Insert[T constraints.Ordered](ts []T, t T) []T {
var dummy T
ts = append(ts, dummy) // extend the slice
i, _ := slices.BinarySearch(ts, t) // find slot
copy(ts[i+1:], ts[i:]) // make room
ts[i] = t
return ts
}
Note that this uses the package golang.org/x/exp/slices but this will almost certainly be included in the std Go library in Go 1.19.
Try it in the Go Playground
There are two parts to the problem: finding where to insert the value and inserting the value.
Use the sort package search functions to efficiently find the insertion index using binary search.
Use a single call to append to efficiently insert a value into a slice:
// insertAt inserts v into s at index i and returns the new slice.
func insertAt(data []int, i int, v int) []int {
if i == len(data) {
// Insert at end is the easy case.
return append(data, v)
}
// Make space for the inserted element by shifting
// values at the insertion index up one index. The call
// to append does not allocate memory when cap(data) is
// greater ​than len(data).
data = append(data[:i+1], data[i:]...)
// Insert the new element.
data[i] = v
// Return the updated slice.
return data
}
Here's the code for inserting a value a sorted slice:
func insertSorted(data []int, v int) []int {
i := sort.Search(len(data), func(i int) bool { return data[i] >= v })
return insertAt(data, i, v)
}
The code in this answer uses a slice of int. Adjust the type to match your actual data.
The call to sort.Search in this answer can be replaced with a call to the helper function sort.SearchInts. I show sort.Search in this answer because the function applies to a slice of any type.
If you do not want to add duplicate values, check the value at the search index before inserting:
func insertSortedNoDups(data []int, v int) []int {
i := sort.Search(len(data), func(i int) bool { return data[i] >= v })
if i < len(data) && data[i] == v {
return data
}
return insertAt(data, i, v)
}
You could use a heap:
package main
import (
"container/heap"
"sort"
)
type slice struct { sort.IntSlice }
func (s slice) Pop() interface{} { return 0 }
func (s *slice) Push(x interface{}) {
(*s).IntSlice = append((*s).IntSlice, x.(int))
}
func main() {
s := &slice{
sort.IntSlice{11, 10, 14, 13},
}
heap.Init(s)
heap.Push(s, 12)
println(s.IntSlice[0] == 10)
}
Note that a heap is not strictly sorted, but the "minimum element" is guaranteed
to be the first element. Also I did not implement the Pop function in my
example, you would want to do that.
https://golang.org/pkg/container/heap
There are two approaches mentioned here to insert into the slice when the position i is known:
data = append(data, "")
copy(data[i+1:], data[i:])
data[i] = s
and
data = append(data[:i+1], data[i:]...)
data[i] = s
I just benchmarked both with go1.18beta2, and the first solution is approximately 10% faster.
no dependency, generic data type with duplicated options. (go 1.18)
time complexity : Log2(n) + 1
import "golang.org/x/exp/constraints"
import "golang.org/x/exp/slices"
func InsertionSort[T constraints.Ordered](array []T, value T, canDupicate bool) []T {
pos, isFound := slices.BinarySearch(array, value)
if canDupicate || !isFound {
array = slices.Insert(array, pos, value)
}
return array
}
full version : https://go.dev/play/p/P2_ou2Fqs37
play : https://play.golang.org/p/dUGmPurouxA
array1 := []int{1, 3, 4, 5}
//want to insert at index 1
insertAtIndex := 1
temp := append([]int{}, array1[insertAtIndex:]...)
array1 = append(array1[0:insertAtIndex], 2)
array1 = append(array1, temp...)
fmt.Println(array1)
You can try the below code. It basically uses the golang sort package
package main
import "sort"
import "fmt"
func main() {
data := []int{20, 21, 22, 24, 25, 26, 28, 29, 30, 31, 32}
var items = []int{23, 27}
for _, x := range items {
i := sort.Search(len(data), func(i int) bool { return data[i] >= x })
if i < len(data) && data[i] == x {
fmt.Println(i)
} else {
data = append(data, 0)
copy(data[i+1:], data[i:])
data[i] = x
}
fmt.Println(data)
}
}

Problems understanding usage of `interface{}` in Go

I'm trying to port an algorithm from Python to Go. The central part of it is a tree built using dicts, which should stay this way since each node can have an arbitrary number of children. All leaves are at the same level, so up the the lowest level the dicts contain other dicts, while the lowest level ones contain floats. Like this:
tree = {}
insert(tree, ['a', 'b'], 1.0)
print tree['a']['b']
So while trying to port the code to Go while learning the language at the same time, this is what I started with to test the basic idea:
func main() {
tree := make(map[string]interface{})
tree["a"] = make(map[string]float32)
tree["a"].(map[string]float32)["b"] = 1.0
fmt.Println(tree["a"].(map[string]float32)["b"])
}
This works as expected, so the next step was to turn this into a routine that would take a "tree", a path, and a value. I chose the recursive approach and came up with this:
func insert(tree map[string]interface{}, path []string, value float32) {
node := path[0]
l := len(path)
switch {
case l > 1:
if _, ok := tree[node]; !ok {
if l > 2 {
tree[node] = make(map[string]interface{})
} else {
tree[node] = make(map[string]float32)
}
}
insert(tree[node], path[1:], value) //recursion
case l == 1:
leaf := tree
leaf[node] = value
}
}
This is how I imagine the routine should be structured, but I can't get the line marked with "recursion" to work. There is either a compiler error, or a runtime error if I try to perform a type assertion on tree[node]. What would be the correct way to do this?
Go is perhaps not the ideal solution for generic data structures like this. The type assertions make it possible, but manipulating data in it requires more work that you are used to from python and other scripting languages.
About your specific issue: You are missing a type assertion in the insert() call. The value of tree[node] is of type interface{} at that point. The function expects type map[string]interface{}. A type assertion will solve that.
Here's a working example:
package main
import "fmt"
type Tree map[string]interface{}
func main() {
t := make(Tree)
insert(t, []string{"a", "b"}, 1.0)
v := t["a"].(Tree)["b"]
fmt.Printf("%T %v\n", v, v)
// This prints: float32 1
}
func insert(tree Tree, path []string, value float32) {
node := path[0]
len := len(path)
switch {
case len == 1:
tree[node] = value
case len > 1:
if _, ok := tree[node]; !ok {
tree[node] = make(Tree)
}
insert(tree[node].(Tree), path[1:], value) //recursion
}
}
Note that I created a new type for the map. This makes the code a little easier to follow. I also use the same 'map[string]interface{}` for both tree nodes and leaves. If you want to get a float out of the resulting tree, another type assertion is needed:
leaf := t["a"].(Tree)["b"] // leaf is of type 'interface{}`.
val := leaf.(float32)
well... the problem is that you're trying to code Go using Python idioms, and you're making a tree with... hashtables? Huh? Then you have to maintain that the keys are unique and do a bunch of otherstuff, when if you just made the set of children a slice, you get that sort of thing for free.
I wouldn't make a Tree an explicit map[string]interface{}. A tree and a node on a tree are really the same thing, since it's a recursive datatype.
type Tree struct {
Children []*Tree
Value interface{}
}
func NewTree(v interface{}) *Tree {
return &Tree{
Children: []*Tree{},
Value: v,
}
}
so to add a child...
func (t *Tree) AddChild(child interface{}) {
switch c := child.(type) {
case *Tree:
t.Children = append(t.Children, c)
default:
t.Children = append(t.Children, NewTree(c))
}
}
and if you wanted to implement some recursive function...
func (t *Tree) String() string {
return fmt.Sprint(t.Value)
}
func (t *Tree) PrettyPrint(w io.Writer, prefix string) {
var inner func(int, *Tree)
inner = func(depth int, child *Tree) {
for i := 0; i < depth; i++ {
io.WriteString(w, prefix)
}
io.WriteString(w, child.String()+"\n") // you should really observe the return value here.
for _, grandchild := range child.Children {
inner(depth+1, grandchild)
}
}
inner(0, t)
}
something like that. Any node can be made the root of some tree, since a subtree is just a tree itself. See here for a working example: http://play.golang.org/p/rEx43vOnXN
There are some articles out there like "Python is not Java" (http://dirtsimple.org/2004/12/python-is-not-java.html), and to that effect, Go is not Python.

Is there a way to avoid the implementation of the full sort.Interface for slices of structs?

If I have an array/slice of structs in Go and want to sort them using the sort package it seems to me that I need to implement the whole sort interface which contains 3 methods:
Len
Swap
Less
It seems that Len and Swap should always have the same implementation no matter the type of struct is in the array.
Is there a way to avoid having the implement Len and Swap every time or is this just a limitation from the lack of Generics in Go?
If you are implementing several different comparison operations on the same slice type, you can use embedding to avoid redefining Len and Swap each time. You can also use this technique to add parameters to the sort, for example to sort in reverse or not depending on some run-time value.
e.g.
package main
import (
"sort"
)
type T struct {
Foo int
Bar int
}
// TVector is our basic vector type.
type TVector []T
func (v TVector) Len() int {
return len(v)
}
func (v TVector) Swap(i, j int) {
v[i], v[j] = v[j], v[i]
}
// default comparison.
func (v TVector) Less(i, j int) bool {
return v[i].Foo < v[j].Foo
}
// TVectorBarOrdered embeds TVector and overrides
// its Less method so that it is ordered by the Bar field.
type TVectorBarOrdered struct {
TVector
}
func (v TVectorBarOrdered) Less(i, j int) bool {
return v.TVector[i].Bar < v.TVector[j].Bar
}
// TVectorArbitraryOrdered sorts in normal or reversed
// order depending on the order of its Reversed field.
type TVectorArbitraryOrdered struct {
Reversed bool
TVector
}
func (v TVectorArbitraryOrdered) Less(i, j int) bool {
if v.Reversed {
i, j = j, i
}
return v.TVector[i].Foo < v.TVector[j].Foo
}
func main() {
v := []T{{1, 3}, {0, 6}, {3, 2}, {8, 7}}
sort.Sort(TVector(v))
sort.Sort(TVectorBarOrdered{v})
sort.Sort(TVectorArbitraryOrdered{true, v})
}
Your own answer is right. In your case of an array or slice the implementations of Len() and Swap() are simple. Like len() Go could provide a native swap() here. But the interface which is used now can also be used for more complex data structures like BTrees. It still allows the Sort() function to work (like my parallel quicksort, which uses the same sort interface).
If you want to sort slices (for which Len and Swap always have the same implementation), the sort package now has a function that only requires an implementation of Less:
func Slice(slice interface{}, less func(i, j int) bool)
Although this is an old question, I'd like to point out
the github.com/bradfitz/slice
package.
But as an example or proof of concept only, I would not recommend this actually be used (it's documented with the word "gross"):
It uses gross, low-level operations to make it easy to sort arbitrary slices with only a less function, without defining a new type with Len and Swap operations.
In actual code, I find it completely trivial, quick, short, readable, and non-distracting to just do something like:
type points []point
func (p []points) Len() int { return len(p) }
func (p []points) Swap(i, j int) { p[i], p[j] = p[j], p[i] }
func (p []points) Less(i, j int) bool {
// custom, often multi-line, comparison code here
}
Here gofmt insists on a blank line between the type and func lines
but it has no problem with
multiple one-line functions with no blank lines
and it nicely lines up the function bodies.
I find this a nice readable compact form for such things.
As for your comment that:
It seems that Len and Swap should always have the same implementation no matter the type of struct is in the [slice]
just the other week I need a sort that kept pairs of elements in a slice together (for input to strings.NewReplacer) that required a trivial variation like:
type pairByLen []string
func (p pairByLen) Len() int { return len(p) / 2 }
func (p pairByLen) Less(i, j int) bool { return len(p[i*2]) > len(p[j*2]) }
func (p pairByLen) Swap(i, j int) {
p[i*2], p[j*2] = p[j*2], p[i*2]
p[i*2+1], p[j*2+1] = p[j*2+1], p[i*2+1]
}
This is not supported by an interface like the one in github.com/bradfitz/slice.
Again, I find this layout easy, compact, and readable.
Although (perhaps more so in this case), others may disagree.

Resources