I'm creating a program which create random bson.M documents, and insert them in database.
The main goroutine generate the documents, and push them to a buffered channel. In the same time, two goroutines fetch the documents from the channel and insert them in database.
This process take a lot of memory and put too much pressure on garbage colelctor, so I'm trying to implement a memory pool to limit the number of allocations
Here is what I have so far:
package main
import (
"fmt"
"math/rand"
"sync"
"time"
"gopkg.in/mgo.v2/bson"
)
type List struct {
L []bson.M
}
func main() {
var rndSrc = rand.NewSource(time.Now().UnixNano())
pool := sync.Pool{
New: func() interface{} {
l := make([]bson.M, 1000)
for i, _ := range l {
m := bson.M{}
l[i] = m
}
return &List{L: l}
},
}
// buffered channel to store generated bson.M docs
var record = make(chan List, 3)
// start worker to insert docs in database
for i := 0; i < 2; i++ {
go func() {
for r := range record {
fmt.Printf("first: %v\n", r.L[0])
// do the insert ect
}
}()
}
// feed the channel
for i := 0; i < 100; i++ {
// get an object from the pool instead of creating a new one
list := pool.Get().(*List)
// re generate the documents
for j, _ := range list.L {
list.L[j]["key1"] = rndSrc.Int63()
}
// push the docs to the channel, and return them to the pool
record <- *list
pool.Put(list)
}
}
But it looks like one List is used 4 times before being regenerated:
> go run test.go
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:8767993090152084935 key2:8807650676784718781]
...
Why isn't the list regenerated each time ? How can I fix this ?
The problem is that you have created a buffered channel with var record = make(chan List, 3). Hence this code:
record <- *list
pool.Put(list)
May return immediately and the entry will be placed back into the pool before it has been consumed. Hence the underlying slice will likely be modified in another loop iteration before your consumer has had a chance to consume it. Although you are sending List as a value object, remember that the []bson.M is a pointer to an allocated array and will still be pointing to the same memory when you send a new List value. Hence why you are seeing the duplicate output.
To fix, modify your channel to send the List pointer make(chan *List, 3) and change your consumer to put the entry back in the pool once finished, e.g:
for r := range record {
fmt.Printf("first: %v\n", r.L[0])
// do the insert etc
pool.Put(r) // Even if error occurs
}
Your producer should then sent the pointer with the pool.Put removed, i.e.
record <- list
Related
I'm a newbee for golang, now need to read a big amount data in mysql, so I wanna use goroutine and channel to get data in high performance, but don't know how to avoid data duplication for each goroutine and make whole process stable. for instance, table schema is as below, I wanna get all records which create_time is smaller than 1000000000000000000, I wanna create 10 goroutines and read data concurrently, each goroutine do some business logic, how to design codes? thank u
id content last_id create_time
I would suggest you to create a goroutine that would publish the data to your channel. Then you can add listener go routines to handle the published data.
This can be done as following:
Main :
const GoroutineCount = 10
type SomeData []int
func main() {
ch := make(chan SomeData, 1)
go PublishData(ch)
for i := 0; i < GoroutineCount; i++ {
go ProcessData(ch)
}
}
For assumptions, I have used a simple slice of int as data. This can be slice of any struct as required.
Publish data to channel:
const ChunkSize = 1000
func PublishData(ch chan SomeData) {
// Assume having 10000 records in result set
// This has to come from db transaction
res := make([]int, 10000)
// split into chunks of 1000
chunks := GetChunk(res)
// write chunk data to channel
for i := range chunks {
ch <- chunks[i]
}
}
func GetChunk(input SomeData) []SomeData {
var result []SomeData
boundary := len(input)
index := 0
for index = 0; boundary >= ChunkSize; index+=ChunkSize {
boundary -= ChunkSize
lastIndex := index+ChunkSize
result = append(result, input[index:lastIndex])
}
boundary = len(input) % ChunkSize
if boundary > 0 {
lastIndex := index+ boundary
result = append(result, input[index:lastIndex])
}
return result
}
Process individual chunks as :
func ProcessData(ch chan SomeData) {
// Read single chunk
res := <-ch
// Process chunk data
fmt.Printf("len %d\n", len(res))
}
Code on go playground: https://play.golang.org/p/X9Q991h6ru_n
I am trying to iterate a loop and call go routine on an anonymous function and adding a waitgroup on each iteration. And passing a string to same anonymous function and appending the value to slice a. Since I am looping 10000 times length of the slice is expected to be 10000. But I see random numbers. Not sure what is the issue. Can anyone help me fix this problem?
Here is my code snippet
import (
"fmt"
"sync"
)
func main() {
var wg = new(sync.WaitGroup)
var a []string
for i := 0; i <= 10000; i++ {
wg.Add(1)
go func(s string) {
a = append(a, s)
wg.Done()
}("MaxPayne")
}
wg.Wait()
fmt.Println(len(a))
}
Notice how appending a slice, you actually make a new slice, and then assign it back to the slice variable. So you have un-controlled concurrent writing to the variable a. Concurrent writing to the same value is not safe in Go (and most languages). In order to make it safe, you can serialize the writes with a mutex.
Try:
var lock sync.Mutex
var a []string
and
lock.Lock()
a = append(a, s)
lock.Unlock()
For more information about how a mutex works, see the tour and the sync package.
Here is a pattern to achieve a similar result, but without needing a mutex and still being safe.
package main
import (
"fmt"
"sync"
)
func main() {
const sliceSize = 10000
var wg = new(sync.WaitGroup)
var a = make([]string, sliceSize)
for i := 0; i < sliceSize; i++ {
wg.Add(1)
go func(s string, index int) {
a[index] = s
wg.Done()
}("MaxPayne", i)
}
wg.Wait()
}
This isn't exactly the same as your other program, but here's what it does.
Create a slice that already has the desired size of 10,000 (each element is an empty string at this point)
For each number 0...9999, create a new goroutine that is given a specific index to write a specific string into
After all goroutines have exited and the waitgroup is done waiting, then we know that each index of the slice has successfully been filled.
The memory access is now safe even without a mutex, because each goroutine is only writing to it's respective index (and each goroutine gets a unique index). Therefore, none of these concurrent memory writes conflict with each other. After initially creating the slice with the desired size, the variable a itself doesn't need to be assigned to again, so the original memory race is eliminated.
I'm extracting from a long redshift table all its data in chunks, each chunk to a csv file. I want to control how many files are created at the "same" time (concurrently), i.e. if the whole process will create 10 files, I want to, let's say, create 4 files, wait until they are created and once they are "done", create another 4, and then the remaining 2.
How can I achieve this using channel/s?
I've tried to change the following slice to a channel, but I couldn't get it to work as I said, the implementation I did, did not wait/stop for the 4 first files to end before creating the following ones.
Right now I'm doing the following using WaitGroup:
package, imports, var, etc...
//Inside func main:
//Create a WaitGroup
var wg = sync.WaitGroup{}
//Opening the connection
db, err := sql.Open("postgres", connStr)
if err != nil {
panic(err.Error())
}
defer db.Close()
//Define chunks using a slice
chunkSizer := Slicer(totalRowsInTable, numberRowsByChunk) // e.g. []int{100, 100, 100... 100}
//Iterating over the array
for index, value := range chunkSizer {
wg.Add(1)
go ExtractorToCSV(db, queriedSection, expFileName)
if (index+1)% 4 == 0 { // <-- 4 is the maximum number of files created at the "same" time
wg.Wait()
}
wg.Wait() // <-- waits for the remaining files (last 2 in this case)
}
//Outside main
func ExtractorToCSV(db *sql.DB, queryToExtract, fileName string) {
//... do its process
wg.Done()
}
I've tried using a buffered channel of the size that I wanted to stop (4 in this case), but I didn't use it properly, I don't know...
Thanks in advance.
UPDATED - STOP CONDITION
You can use channel to hold the next line of code like this. This is minimum code that I write for you. Tweak it as you like
var doneCh = make(chan bool)
func main() {
WRITE_POOL := 4
for index, val := range RANGE {
go extractToFile(val)
if (index + 1) % WRITE_POOL == 0 {
// wait for doneCh to finish
// if the iteration is divisive of WRITE_POOL
<-doneCh
<-doneCh
<-doneCh
<-doneCh
} else if index == MAX - 1 {
// wait for whatever doneCh left to finish
// if current val is the last one
LEFT := MAX - index - 1
for i := 0; i < LEFT; i++ {
<-doneCh
}
}
}
}
func extractToFile(val int) {
os.Create(fmt.Sprintf("test-%d", val))
doneCh<-true
}
For better performance, try to :
Create data channel to main function can send the data to it and ExtractorToCSV can receive it.
Create ExtractorToCSV as goroutine and read from data channel, after ExtractorToCSV finish, send data to doneCh
Send db data to data channel and after ExtractorToCSV finished to write to file, send true to doneCh.
I will update this post if you need more example.
I have 16 go routines which return output , which is typically a struct.
struct output{
index int,
description string,
}
Now all these 16 go routines run in parallel, and the total expected output structs from all the go routines is expected to be a million. I have used the basic sorting of go lang it is very expensive to do that, could some one help me with the approach to take to sort the output based on the index and I need to write the "description" field on to a file based on the order of index.
For instance ,
if a go routine gives output as {2, "Hello"},{9,"Hey"},{4,"Hola"}, my output file should contain
Hello
Hola
Hey
All these go routines run in parallel and I have no control on the order of execution , hence I am passing the index to finally order the output.
One thing to consider before getting into the answer is your example code will not compile. To define a type of struct in Go, you would need to change your syntax to
type output struct {
index int
description string
}
In terms of a potential solution to your problem - if you already reliably have unique index's as well as the expected count of the result set - you should not have to do any sorting at all. Instead synchronize the go routines over a channel and insert the output in an allocated slice at the respective index. You can then iterate over that slice to write the contents to a file. For example:
ch := make(chan output) //each go routine will write to this channel
wg := new(sync.WaitGroup) //wait group to sync all go routines
//execute 16 goroutines
for i := 0; i < 16; i++ {
wg.Add(1)
go worker(ch, wg) //this is expecting each worker func to call wg.Done() when completing its portion of work
}
//create a "quit" channel that will be used to signal to the select statement below that your go routines are all done
quit := make(chan bool)
go func() {
wg.Wait()
quit <- true
}()
//initialize a slice with length and capacity to 1mil, the expected result size mentioned in your question
sorted := make([]string, 1000000, 1000000)
//use the for loop, select pattern to sync the results from your 16 go routines and insert them into the sorted slice
for {
select {
case output := <-ch:
//this is not robust - check notes below example
sorted[output.index] = output.description
case <-quit:
//implement a function you could pass the sorted slice to that will write the results
// Ex: writeToFile(sorted)
return
}
}
A couple notes on this solution: it is dependent upon you knowing the size of the expected result set. If you do not know what the size of the result set is - in the select statement you will need to check if the index is read from ch exceeds the length of the sorted slice and allocate additional space before inserting our you program will crash as a result of an out of bounds error
You could use the module Ordered-concurrently to merge your inputs and then print them in order.
https://github.com/tejzpr/ordered-concurrently
Example - https://play.golang.org/p/hkcIuRHj63h
package main
import (
concurrently "github.com/tejzpr/ordered-concurrently/v2"
"log"
"math/rand"
"time"
)
type loadWorker int
// The work that needs to be performed
// The input type should implement the WorkFunction interface
func (w loadWorker) Run() interface{} {
time.Sleep(time.Millisecond * time.Duration(rand.Intn(10)))
return w
}
func main() {
max := 10
inputChan := make(chan concurrently.WorkFunction)
output := concurrently.Process(inputChan, &concurrently.Options{PoolSize: 10, OutChannelBuffer: 10})
go func() {
for work := 0; work < max; work++ {
inputChan <- loadWorker(work)
}
close(inputChan)
}()
for out := range output {
log.Println(out.Value)
}
}
Disclaimer: I'm the module creator
I am making a cache wrapper around a database. To account for possibly slow database calls, I was thinking of a mutex per key (pseudo Go code):
mutexes = map[string]*sync.Mutex // instance variable
mutexes[key].Lock()
defer mutexes[key].Unlock()
if value, ok := cache.find(key); ok {
return value
}
value = databaseCall(key)
cache.save(key, value)
return value
However I don't want my map to grow too much. My cache is an LRU and I want to have a fixed size for some other reasons not mentioned here. I would like to do something like
delete(mutexes, key)
when all the locks on the key are over but... that doesn't look thread safe to me... How should I do it?
Note: I found this question
In Go, can we synchronize each key of a map using a lock per key? but no answer
A map of mutexes is an efficient way to accomplish this, however the map itself must also be synchronized. A reference count can be used to keep track of entries in concurrent use and remove them when no longer needed. Here is a working map of mutexes complete with a test and benchmark.
(UPDATE: This package provides similar functionality: https://pkg.go.dev/golang.org/x/sync/singleflight )
mapofmu.go
// Package mapofmu provides locking per-key.
// For example, you can acquire a lock for a specific user ID and all other requests for that user ID
// will block until that entry is unlocked (effectively your work load will be run serially per-user ID),
// and yet have work for separate user IDs happen concurrently.
package mapofmu
import (
"fmt"
"sync"
)
// M wraps a map of mutexes. Each key locks separately.
type M struct {
ml sync.Mutex // lock for entry map
ma map[interface{}]*mentry // entry map
}
type mentry struct {
m *M // point back to M, so we can synchronize removing this mentry when cnt==0
el sync.Mutex // entry-specific lock
cnt int // reference count
key interface{} // key in ma
}
// Unlocker provides an Unlock method to release the lock.
type Unlocker interface {
Unlock()
}
// New returns an initalized M.
func New() *M {
return &M{ma: make(map[interface{}]*mentry)}
}
// Lock acquires a lock corresponding to this key.
// This method will never return nil and Unlock() must be called
// to release the lock when done.
func (m *M) Lock(key interface{}) Unlocker {
// read or create entry for this key atomically
m.ml.Lock()
e, ok := m.ma[key]
if !ok {
e = &mentry{m: m, key: key}
m.ma[key] = e
}
e.cnt++ // ref count
m.ml.Unlock()
// acquire lock, will block here until e.cnt==1
e.el.Lock()
return e
}
// Unlock releases the lock for this entry.
func (me *mentry) Unlock() {
m := me.m
// decrement and if needed remove entry atomically
m.ml.Lock()
e, ok := m.ma[me.key]
if !ok { // entry must exist
m.ml.Unlock()
panic(fmt.Errorf("Unlock requested for key=%v but no entry found", me.key))
}
e.cnt-- // ref count
if e.cnt < 1 { // if it hits zero then we own it and remove from map
delete(m.ma, me.key)
}
m.ml.Unlock()
// now that map stuff is handled, we unlock and let
// anything else waiting on this key through
e.el.Unlock()
}
mapofmu_test.go:
package mapofmu
import (
"math/rand"
"strconv"
"strings"
"sync"
"testing"
"time"
)
func TestM(t *testing.T) {
r := rand.New(rand.NewSource(42))
m := New()
_ = m
keyCount := 20
iCount := 10000
out := make(chan string, iCount*2)
// run a bunch of concurrent requests for various keys,
// the idea is to have a lot of lock contention
var wg sync.WaitGroup
wg.Add(iCount)
for i := 0; i < iCount; i++ {
go func(rn int) {
defer wg.Done()
key := strconv.Itoa(rn)
// you can prove the test works by commenting the locking out and seeing it fail
l := m.Lock(key)
defer l.Unlock()
out <- key + " A"
time.Sleep(time.Microsecond) // make 'em wait a mo'
out <- key + " B"
}(r.Intn(keyCount))
}
wg.Wait()
close(out)
// verify the map is empty now
if l := len(m.ma); l != 0 {
t.Errorf("unexpected map length at test end: %v", l)
}
// confirm that the output always produced the correct sequence
outLists := make([][]string, keyCount)
for s := range out {
sParts := strings.Fields(s)
kn, err := strconv.Atoi(sParts[0])
if err != nil {
t.Fatal(err)
}
outLists[kn] = append(outLists[kn], sParts[1])
}
for kn := 0; kn < keyCount; kn++ {
l := outLists[kn] // list of output for this particular key
for i := 0; i < len(l); i += 2 {
if l[i] != "A" || l[i+1] != "B" {
t.Errorf("For key=%v and i=%v got unexpected values %v and %v", kn, i, l[i], l[i+1])
break
}
}
}
if t.Failed() {
t.Logf("Failed, outLists: %#v", outLists)
}
}
func BenchmarkM(b *testing.B) {
m := New()
b.ResetTimer()
for i := 0; i < b.N; i++ {
// run uncontended lock/unlock - should be quite fast
m.Lock(i).Unlock()
}
}
I wrote a simple similar implementation: mapmutex
But instead of a map of mutexes, in this implementation, a mutex is used to guard the map and each item in the map is used like a 'lock'. The map itself is just simple ordinary map.