Better sharding function for int64 keys in golang?

I am using the concurrent map from this repo, where it is possible to select the key type when creating a map with NewWithCustomShardingFunction; I just need to provide my own sharding function for int64 keys, which is what I am doing here.
I am also using the latest version of Go, so I can use generics, and decided to use concurrent-map with int64 keys by implementing my own sharding function.
import (
    cmap "github.com/orcaman/concurrent-map/v2"
)

func shardingFunc(key int64) uint32 {
    return uint32(key) // TODO - create a better sharding function that does not rely on how uint32 type conversion works
}

func main() {
    testMap := cmap.NewWithCustomShardingFunction[int64, *definitions.CustomerProduct](shardingFunc)
    // ... use the map ...
}
I wanted to know whether my sharding function is okay here for int64 keys, or whether I should use a better one. I don't want to end up in a situation where it gives me an index out of range error or any other issue.

The sharding function is a hash function. The function should uniformly distribute the keys over a 32-bit space.
If the low four bytes of your int64 values are uniformly distributed, then uint32(key) will work as a sharding function.
An example of where uint32(key) is a bad choice is where the low four bytes are constant across keys. For example, if the key values are like 0x0000000100000000, 0x0000000200000000, ..., then uint32(key) evaluates to zero for every key. That is not a uniform distribution.
If you don't know how your int64 keys are distributed, then it's best to use all of the key's bits in the shard function. Here's one that uses XOR:
func shardingFunc(key int64) uint32 {
    return uint32(key) ^ uint32(key>>32)
}

For something more robust, encode the key's eight bytes and hash them with crypto/sha256, then convert the first four bytes of the digest to a uint32:
// Requires the "crypto/sha256" and "encoding/binary" imports.
func shardingFunc(key int64) uint32 {
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], uint64(key)) // encode all 64 bits, including negative keys
    sum := sha256.Sum256(buf[:])
    return binary.BigEndian.Uint32(sum[:4]) // use the first four bytes of the digest
}
There are more efficient, although more verbose, ways of coding this, but hopefully you get the idea.

This is a more context-sensitive answer, since I know more background information through chat. It may not apply to others.
First, you won't get an "index out of range" error, because the concurrent map library always takes the remainder after dividing the hash by the number of shards.
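Conceptually, shard selection works like this (a sketch of the idea only; shardIndex and shardCount are illustrative names, not the library's API):
func shardIndex(key int64, shardCount uint32) uint32 {
    // Whatever uint32 the sharding function returns, the modulo keeps the
    // resulting index within [0, shardCount), so it can never be out of range.
    return shardingFunc(key) % shardCount
}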
Second, integer product IDs are generally sequential, which means they will naturally be distributed evenly across the shards.
There are some possible exceptions to this if you happen to have pathologically antagonistic update/access patterns, but realistically it will make no difference. Even in the worst-case scenario where only one shard is used, the performance will be effectively the same as using a regular Go map, since each shard has its own regular map internally.
If you do find yourself in a situation where performance really matters, you would be best off rolling your own concurrent map rather than using this library. I know it is billed as a "high-performance solution", but there is no such thing as a one-size-fits-all optimization; the phrase is an oxymoron.

Related

What is the internal implementation of make(map[type1]type2) in Golang? [duplicate]

This question already has answers here: Golang map internal implementation - how does it search the map for a key? (closed 3 years ago)
Golang is a natively compiled language, so it has more limitations than dynamic languages (like Python and Ruby).
When I initialize a map as m := make(map[string]int), the map m seems to be able to hold infinitely many key-value pairs.
But when I initialize a map with a map literal, or with make and a capacity, the map seemingly cannot hold infinitely many key-value pairs.
Some articles say that make without a capacity allocates a huge amount of memory for the map. But that cannot be right: if it were true, there would be giant memory consumption when initializing a single map, yet no matter which hardware monitoring tools I use, I see no difference in memory usage before and while my program runs.
// Size and SizeRecord are not defined in the question; they are inferred here from usage.
// Imports needed: "fmt", "math/rand", "strconv".
type Size struct{ W, H, D float64 }

type SizeRecord map[string]Size

func main() {
    Hello()
}

func Hello() {
    m := make(SizeRecord)
    l := 10000000
    for i := 0; i < l; i++ {
        m[strconv.Itoa(i)] = Size{float64(rand.Intn(100)), float64(rand.Intn(100)), float64(rand.Intn(100))}
    }
    fmt.Println(m)
}
The program takes a while to execute.
I also read the article Go maps in action, and it seemed to say (I don't know if I have understood correctly) that make without a capacity uses an alternative implementation to represent the map, exposed through the same unified interface as maps created with a limited capacity.
If my understanding is wrong, could anybody tell me what the correct one is?
If I am correct, why didn't Golang implement all maps in this way?
Your understanding of how maps work is incorrect. The spec says:
The initial capacity does not bound its size: maps grow to accommodate the number of items stored in them, with the exception of nil maps. A nil map is equivalent to an empty map except that no elements may be added.
So, the capacity is only a hint and it affects the initial size of the map. The map can grow as needed. There is no separate implementation for maps with a given capacity.
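A quick way to see that the hint does not cap the map (a minimal sketch, not from the original answer):
package main

import "fmt"

func main() {
    m := make(map[int]string, 2) // capacity hint of 2
    for i := 0; i < 10; i++ {
        m[i] = "x" // the map grows past the hint without any error
    }
    fmt.Println(len(m)) // 10
}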

Convention for modifying maps in go

In Go, is it more conventional to modify map values by reassigning them, or by using pointer values?
type Foo struct {
    Bar int
}
Reassignment:
foos := map[string]Foo{"a": Foo{1}}
v := foos["a"]
v.Bar = 2
foos["a"] = v
vs Pointers
foos := map[string]*Foo{"a": &Foo{1}}
foos["a"].Bar = 2
You may be (inadvertently) conflating matters here.
The reason to store pointers in a map is not to make "dot-field" modifications work—it is rather to preserve the exact placements of the values "kept" by a map.
One of the crucial properties of Go maps is that the values bound to their keys are not addressable. In other words, you cannot legally do something like
m := {"foo": 42}
p := &m["foo"] // this won't compile
The reason is that particular implementations of the Go language¹ are free to implement maps in a way that allows them to move around the values they hold. This is needed because Go maps are implemented as hash tables, and a hash table may have to relocate its entries into new buckets as entries are added or removed and the table grows.
Hence, if the language specification allowed taking the address of a value kept in a map, that would forbid the map from moving its values around.
This is precisely the reason why you cannot do "in place" modification of map values if they have struct types, and you have to replace them "wholesale".
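For example, reusing the Foo map from the question, a direct field assignment is rejected by the compiler:
foos := map[string]Foo{"a": {1}}
foos["a"].Bar = 2 // does not compile: map values are not addressable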
By extension, when you add an element to a map, the value is copied into the map, and it is copied (moved) again whenever the map shuffles its entries around.
Hence, the chief reason to store pointers in a map is to preserve the "identities" of the values "indexed" by the map (having them exist in only a single place in memory) and/or to prevent excessive memory operations.
Some types cannot even be sensibly copied without introducing a bug—sync.Mutex or a struct type containing one is a good example.
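As an illustration (a sketch with a hypothetical counter type, not taken from the question), storing pointers keeps exactly one mutex per entry:
package main

import "sync"

// counter is a hypothetical type; copying a counter value would copy its mutex.
type counter struct {
    mu sync.Mutex
    n  int
}

func main() {
    counters := map[string]*counter{"hits": {}} // one counter, one mutex
    c := counters["hits"]
    c.mu.Lock()
    c.n++
    c.mu.Unlock()
}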
Getting back to your question, using pointers with the map for the purpose you propose might be a nice hack, but be aware that this is a code smell: when deciding on values vs pointers regarding a map, you should be rather concerned with the considerations outlined above.
¹ There are at least two of them which are actively maintained: the "stock" one, dubbed "gc", and a part of GCC.

Using pointer to channel

Is it good practice to use a pointer to a channel? For example, I read data concurrently, pass it as map[string]string values over a channel, and process that channel inside getSameValues().
func getSameValues(results *chan map[string]string) []map[string]string {
    // len(*results) only reports how many values are currently buffered, so
    // appending is safer than indexing into a pre-sized slice.
    var datas []map[string]string
    for values := range *results {
        datas = append(datas, values)
    }
    return datas
}
The reason I do this is that there will be around millions of entries across the maps coming through the chan map[string]string, and there will be more than one map.
So I thought it would be a good approach to pass a pointer to the function so that the data is not copied, to save some memory.
I didn't find guidance on this in Effective Go, so I'm somewhat doubtful about my approach here.
It is poor practice to use pointers to channels, maps, functions, interfaces, or slices for efficiency.
Values of these types have a small fixed size independent of the length or capacity of the value. An internal pointer references the variable size data.
Channels, maps, and functions are the same size as a pointer. Therefore, the runtime cost of copying a value of these types is identical to copying a pointer to the value.
Interfaces are twice the size of a pointer, and slices are three times the size of a pointer. The cost of copying a value of these types is higher than copying a pointer, but that extra copying cost is typically no more than the cost of dereferencing a pointer.
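A rough way to check those sizes (a sketch; the exact numbers assume a typical 64-bit platform):
package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var (
        ch chan int
        m  map[string]int
        fn func()
        i  interface{}
        s  []int
    )
    fmt.Println(unsafe.Sizeof(ch), unsafe.Sizeof(m), unsafe.Sizeof(fn)) // 8 8 8
    fmt.Println(unsafe.Sizeof(i), unsafe.Sizeof(s))                     // 16 24
}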
In Go, there are six kinds of value that behave like references because they carry internal pointers: pointers, slices, maps, channels, interfaces and functions.
Copying a reference value and copying a pointer should be considered equal in terms of what the CPU has to actually do (at least as a good approximation).
So it is almost never useful to use pointers to channels, just like it is rarely useful to use pointers to maps.
Because your channel carries maps, the channel is a reference-like value and so are the maps, so all the CPU is doing is copying small headers and pointers around. In the case of the channel, it also handles goroutine synchronisation.
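For example, here is a sketch of the function from the question with the channel passed by value; only a pointer-sized channel header is copied, never the maps flowing through it:
func getSameValues(results chan map[string]string) []map[string]string {
    var datas []map[string]string
    for values := range results {
        datas = append(datas, values)
    }
    return datas
}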
For further reading, open Effective Go and search the page for the word 'reference'.
Everything in Golang is passed by value. Even pointers are a type whose value is a memory address, so they are values too.
(Extending Rick's answer) There are actually six types that hold pointer values, and taking a pointer to one of these types (i.e. a pointer to a pointer) doesn't help anyway:
pointers
slices
maps
channels
interfaces
functions

Is it safe to hold only an unsafe.Pointer to the first element of a slice, with no refs to the slice itself?

package main

import (
    "fmt"
    "runtime"
    "unsafe"
)

func getPoi() unsafe.Pointer {
    var a = []int{1, 2, 3}
    return unsafe.Pointer(&a[0])
}

func main() {
    p := getPoi()
    runtime.GC()
    // NOTE: the +8 offset reaches a[2] only where int is 4 bytes (as on the old
    // 32-bit playground); on 64-bit platforms int is 8 bytes and this reads a[1].
    fmt.Printf("Hello, playground %v\n", *(*int)(unsafe.Pointer(uintptr(p) + 8)))
}
output: 3
https://play.golang.org/p/-OQl7KeL9a
Just examining the abilities of unsafe pointers, trying to minimize the memory overhead of the slice header (three words, e.g. 12 bytes on a 32-bit platform).
I wonder whether this example is correct or not.
And if it is not, what exactly will go wrong after such actions? If it is not correct, why is the value still available even after an explicit call to GC?
Is there any approach that reaches minimum storage overhead for something like a 'slice of slices', as it would be in C (just an array of pointers to allocated arrays, where the overhead per row is sizeof(int*))?
It's possible that, by coincidence, this will work out for you, but I would regard it as unsafe and unsupported. The problem is: if the slice grows beyond its capacity and needs to be reallocated, what happens to your pointer? Really, if you want to optimize performance, you should be using an array. On top of its performance being inherently better and its memory footprint being smaller, this operation would always be safe.
Also, just generally speaking, I see people doing all kinds of stupid things to try to improve performance when their design is inherently poor (like using dynamic arrays or linked lists for no reason). If you need the dynamic growth of a slice, then an array isn't really an option (and using that pointer is also most likely unsafe), but in many cases developers just fail to size their collection appropriately out of, I don't know, laziness. I assume your example is contrived, but in such a scenario you have no reason to ever use a slice, since your collection's size is known at compile time. Even if it weren't, the size can often be determined at runtime before the allocation, and people just fail to do it for the convenience of using abstracted, dynamically sized collections.
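A minimal sketch of the safe alternative suggested above: an array (or a slice that is never appended to) is never reallocated, so element addresses stay stable and no unsafe is needed.
package main

import "fmt"

func main() {
    a := [3]int{1, 2, 3}
    p := &a[2]      // safe: the array's backing storage never moves
    fmt.Println(*p) // 3
}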

Maps in Go - how to avoid double key lookup?

Suppose I want to update some existing value in a map, or do something else if the key is not found. How do I do this, without performing 2 lookups? What's the golang equivalent of the following C++ code:
auto it = m.find(key);
if (it != m.end()) {
    // update the value, without performing a second lookup
    it->second = calc_new_value(it->second);
} else {
    // do something else
    m.insert(make_pair(key, 42));
}
Go does not expose the map's internal (key,value) pair data structure like C++ does, so you can't replicate this exactly.
One possible workaround would be to make the values of your map pointers, so you can keep the same values in the map but update what they point to. For example, if m is a map[int]*int, you could change a value with:
v := m[10]
*v = 42
With that said, I wouldn't be surprised if the savings from reducing the number of hash lookups will be eaten by the additional memory management overhead. So it would be worth benchmarking whatever solution you settle on.
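For completeness, here is a sketch of that workaround mirroring the C++ snippet; calcNewValue is a hypothetical stand-in for the update logic:
package main

import "fmt"

func calcNewValue(v int) int { return v + 1 } // hypothetical update logic

func main() {
    m := map[int]*int{}
    if v, ok := m[10]; ok {
        *v = calcNewValue(*v) // hit: update through the pointer, no second lookup
    } else {
        n := 42
        m[10] = &n // miss: insert the default value
    }
    fmt.Println(*m[10]) // 42
}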
You cannot. The situation is actually the same with Python dicts. However it shouldn't matter. Both lookup and assignment to a Go map are amortized O(1). Combining the two operations has the same time complexity.
