How to optimize a large recursive task concurrently - go

I have a cron task that I want to implement in the best possible way in Go.
I need to store a large amount of JSON data from a web service in a sellers table.
After saving these sellers in the database, I need to query another large JSON web service, passing each seller ID as a parameter, and save the results to another table named customers.
Each customer has an initial state. If this state has changed according to the data from the second web service, I need to store the difference in another table, changes, to keep a history of changes.
Finally, if the change matches our conditions, I perform another task.
My current operation
var wg sync.WaitGroup
action.FetchSellers() // fetch large JSON and store it in the sellers table, ~2min
sellers := action.ListSellers()
for _, s := range sellers {
wg.Add(1)
go action.FetchCustomers(&wg, s) // fetch multiple large JSON and store in customers table and store notify... ~20sec
}
wg.Wait()
The first difficulty with this code is that I do not control the number of concurrent calls to the web service.
The second is that the action.FetchCustomers function does a lot of work that I think could be done concurrently.
The third difficulty is that I cannot resume from where an error occurred.
I need to run this code every hour, so it needs to be well built. Currently it works, but not in the best way.
I am considering using worker pools, as in the Go by Example: Worker Pools example, but I have trouble designing it.

Not to be a jerk, but I would use a queue for this kind of thing. I have already written a library for it and use it myself: github.com/AnikHasibul/queue
// Limit the max
maximumJobLimit := 50
// Open a new queue with the limit
q := queue.New(maximumJobLimit)
defer q.Close()
// simulate a large amount of jobs
for i := 0; i != 1000; i++ {
// Add a job to queue
q.Add()
// Run your long long long job here in a goroutine
go func(c int) {
// Must call Done() after finishing the job
defer q.Done()
time.Sleep(time.Second)
fmt.Println(c)
}(i)
}
// Wait for all jobs to finish
q.Wait()
// Done!
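The same limit can also be sketched with the standard library alone, using a buffered channel as a counting semaphore. The sketch below is only an illustration: Seller, fetchCustomers and runLimited are hypothetical names, and fetchCustomers merely simulates a failing call. Collecting the failed sellers is one possible way to approach the "resume after an error" concern from the question.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Seller is a hypothetical stand-in for the rows returned by action.ListSellers().
type Seller struct{ ID int }

// fetchCustomers is a hypothetical placeholder for action.FetchCustomers;
// here it simply fails for every seller whose ID is divisible by 7.
func fetchCustomers(s Seller) error {
	if s.ID%7 == 0 {
		return errors.New("webservice error")
	}
	return nil
}

// runLimited calls fetch for every seller with at most limit calls in flight
// and returns the sellers that failed so they can be retried later.
func runLimited(sellers []Seller, limit int, fetch func(Seller) error) []Seller {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed []Seller
	)
	sem := make(chan struct{}, limit) // counting semaphore
	for _, s := range sellers {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit calls are already running
		go func(s Seller) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := fetch(s); err != nil {
				mu.Lock()
				failed = append(failed, s)
				mu.Unlock()
			}
		}(s)
	}
	wg.Wait()
	return failed
}

func main() {
	sellers := make([]Seller, 0, 20)
	for i := 1; i <= 20; i++ {
		sellers = append(sellers, Seller{ID: i})
	}
	failed := runLimited(sellers, 5, fetchCustomers)
	fmt.Println("sellers to retry:", len(failed))
}
```

The failed sellers could be persisted and fed back into runLimited on the next hourly run; whether that fits depends on how the real FetchCustomers reports errors.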

Related

On which step can a goroutine be interrupted

I am writing some asynchronous code in Go which basically implements in-memory caching. I have a not very fast source which I query every minute (using a ticker), and I save the result into a cache struct field. This field can be queried from different goroutines asynchronously.
In order to avoid using mutexes when updating values from the source, I do not write to the same struct field which is being queried by other goroutines; instead I create another variable, fill it, and then assign it to the queried field. This works fine since the assignment operation is atomic and no race occurs.
The code looks like the following:
// this fires up when cache is created
func (f *FeaturesCache) goStartUpdaterDaemon(ctx context.Context) {
go func() {
defer kiterrors.RecoverFunc(ctx, f.logger(ctx))
ticker := time.NewTicker(updateFeaturesPeriod) // every minute
defer ticker.Stop()
for {
select {
case <-ticker.C:
f.refill(ctx)
case <-ctx.Done():
return
}
}
}()
}
func (f *FeaturesCache) refill(ctx context.Context) {
var newSources map[string]FeatureData
// some querying and processing logic
// save new data for future queries
f.features = newSources
}
Now I need to add another view of my data so I can also get it from the cache. Basically that means adding one more struct field which will be queried and filled in the same way as the previous one (features).
I need these 2 views of my data to be in sync, so it is undesirable to have, for example, new data in view 2 and old data in view 1, or the other way round.
So the only thing I need to change about refill is to add a new field. At first I did it this way:
func (f *FeaturesCache) refill(ctx context.Context) {
var newSources map[string]FeatureData
var anotherView map[string]DataView2
// some querying and processing logic
// save new data for future queries
f.features = newSources // line A
f.anotherView = anotherView // line B
}
However, I'm wondering whether this code satisfies my consistency requirements. I am worried that if the scheduler decides to interrupt the goroutine which runs refill between lines A and B (see the code above), then I might get inconsistency between the data views.
So I researched the problem. Many sources on the Internet say that the scheduler switches goroutines on syscalls and function calls. However, according to this answer https://stackoverflow.com/a/64113553/12702274 since Go 1.14 there is an asynchronous preemption mechanism in the Go scheduler which switches goroutines based on their running time in addition to the previously checked signals. That makes me think that it is actually possible for the refill goroutine to be interrupted between lines A and B.
Then I thought about surrounding those 2 assignments with a mutex: lock before line A, unlock after line B. However, it seems to me that this doesn't change things much. The goroutine may still be interrupted between lines A and B, and the data becomes inconsistent. The only thing the mutex achieves here is that 2 simultaneous refills do not conflict with each other, which is actually impossible because I run them in the same goroutine as the timer. Thus it is useless here.
So, is there any way I can ensure atomicity for two consecutive assignments?
If I understand your concern correctly, you don't want to lock the existing cached data while updating it (because the update takes time, you want the existing cached data to remain usable while another goroutine updates it, right?).
You also want the updates to f.features and f.anotherView to be atomic.
What about keeping your data in a map[int8]map[string]FeatureData and a map[int8]map[string]DataView2? Put the data under a new key each time and let the queries read from this key (newSearchIndex).
I have tried to explain it roughly in code (treat the following as pseudocode):
type FeaturesCache struct {
mu sync.RWMutex
features map[int8]map[string]FeatureData
anotherView map[int8]map[string]DataView2
oldSearchIndex int8
newSearchIndex int8
}
func (f *FeaturesCache) CreateNewIndex() int8 {
f.mu.Lock()
defer f.mu.Unlock()
return (f.newSearchIndex + 1) % 16 // the modulus 16 can be changed to suit your refill rate
}
func (f *FeaturesCache) SetNewIndex(newIndex int8) {
f.mu.Lock()
defer f.mu.Unlock()
f.oldSearchIndex = f.newSearchIndex
f.newSearchIndex = newIndex
}
func (f *FeaturesCache) refill(ctx context.Context) {
var newSources map[string]FeatureData
var anotherView map[string]DataView2
// some querying and processing logic
// save new data for future queries
newSearchIndex := f.CreateNewIndex()
f.features[newSearchIndex] = newSources
f.anotherView[newSearchIndex] = anotherView
f.SetNewIndex(newSearchIndex) // Direct queries to the newly cached data once both views are stored
f.features[f.oldSearchIndex] = nil
f.anotherView[f.oldSearchIndex] = nil
}
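A different technique from the index scheme above, shown here only as a sketch, is to bundle both views into one immutable struct and publish it with a single atomic pointer store, so a reader always sees two views that came from the same refill. It assumes Go 1.19 or newer for atomic.Pointer, and the snapshot type, the load method and the changed refill signature are hypothetical. Note that under the Go memory model a plain assignment that races with concurrent reads is a data race, which is why the pointer is stored and loaded atomically here.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FeatureData and DataView2 are placeholders for the types in the question.
type FeatureData struct{ Name string }
type DataView2 struct{ Count int }

// snapshot (hypothetical) bundles both views so they are always published together.
type snapshot struct {
	features    map[string]FeatureData
	anotherView map[string]DataView2
}

type FeaturesCache struct {
	current atomic.Pointer[snapshot] // requires Go 1.19 or newer
}

// refill publishes a complete new snapshot with one atomic store.
// In the real code the two maps would be built by the querying logic.
func (f *FeaturesCache) refill(features map[string]FeatureData, view map[string]DataView2) {
	f.current.Store(&snapshot{features: features, anotherView: view})
}

// load returns the latest snapshot, or nil before the first refill.
// Readers must treat the returned maps as read-only.
func (f *FeaturesCache) load() *snapshot {
	return f.current.Load()
}

func main() {
	var cache FeaturesCache
	cache.refill(
		map[string]FeatureData{"a": {Name: "feature-a"}},
		map[string]DataView2{"a": {Count: 1}},
	)
	s := cache.load() // both views come from the same refill
	fmt.Println(s.features["a"].Name, s.anotherView["a"].Count)
}
```

A reader that needs both views should call load once and use the returned snapshot for both lookups; calling it twice could return two different generations.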

Golang concurrency access to slice

My use case:
append items(small struct) to a slice in the main process
every 100 items I want to process items in a processor go routine (then pop them from slice)
items come in very fast and continuously
I read that if a variable (a slice in my case) is used by two or more goroutines and at least one of them writes to it, access must be synchronized (with a mutex or similar).
My questions:
If I do not protect reads and writes on the slice with a mutex, do I risk problems? (e.g. item 101 arrives while the processor is working on items 1-100)
What is the best concurrency technique to keep the incoming item flow "fluent"?
Disclaimer:
I do not want any event queueing, I need to process items in a given "bundle" size
Actually, you don't need a lock here. Here is working code:
package main
import (
"fmt"
"sync"
)
type myStruct struct {
Cpt int
}
func main() {
buf := make([]myStruct, 0, 100)
wg := sync.WaitGroup{}
// Main process
// Appending one million times
for i := 0; i < 1e6; i++ {
// Append the next item (no lock needed: only this goroutine touches buf)
buf = append(buf, myStruct{Cpt: i})
// Did we reach 100 items ?
if len(buf) >= 100 {
// Yes we did. Creating a slice from the buffer
processSlice := make([]myStruct, 100)
copy(processSlice, buf[0:100])
// Emptying buffer
buf = buf[:0]
// Running processor in parallel
// Adding one element to waitgroup
wg.Add(1)
go processor(&wg, processSlice)
}
}
// Waiting for all processors to finish
wg.Wait()
}
func processor(wg *sync.WaitGroup, processSlice []myStruct) {
// Removing one element to waitgroup when done
defer wg.Done()
// Doing some process
fmt.Printf("Processing items from %d to %d\n", processSlice[0].Cpt, processSlice[99].Cpt)
}
A few notes about your problem and this solution:
If you want minimal stop time in your feeding process (e.g. to respond as fast as possible to an HTTP call), then the minimum to do in the feeding loop is the copy, with the processor function run in a goroutine. To do so, you have to create a new process slice each time and copy the content of your buffer into it.
The sync.WaitGroup object is needed to ensure that the last processor function has ended before the program exits.
Note that this is not a perfect solution: if you run this pattern for a long time, and the input data comes in more than 100 times faster than the processor processes the slices, then there are going to be:
More and more processSlice instances in RAM -> risk of filling the RAM and hitting swap
More and more parallel processor goroutines -> the same risk for the RAM, plus more to process at the same time, which makes each call slower, so the problem feeds itself.
This will end with the system crashing at some point.
The solution is to have a limited number of workers, which ensures there is no crash. However, when all of these workers are busy, the feeding process will have to wait, which is not what you want. Still, this is a good way to absorb a load whose intensity changes over time.
In general, just remember that if you feed in more data than you can process in the same time, your program will at some point reach a limit where it can't keep up, so it has to either slow down input acquisition or crash. This is mathematical!
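The limited-worker variant mentioned above could look roughly like the following. This is a sketch under assumptions: processBatches and item are hypothetical names, the processing step is a placeholder, and, as in the code above, a trailing partial batch is not processed.

```go
package main

import (
	"fmt"
	"sync"
)

type item struct{ Cpt int }

// processBatches feeds n items to a fixed pool of workers in batches of
// batchSize and returns the number of batches processed.
func processBatches(n, batchSize, workers int) int {
	// Small buffer: once it is full, the feeding loop blocks (back-pressure).
	batches := make(chan []item, workers)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range batches {
				_ = b // real processing of the batch would go here
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	buf := make([]item, 0, batchSize)
	for i := 0; i < n; i++ {
		buf = append(buf, item{Cpt: i})
		if len(buf) == batchSize {
			batches <- buf                   // hand over the full batch
			buf = make([]item, 0, batchSize) // fresh buffer instead of a copy
		}
	}
	close(batches) // no more batches: workers leave their range loops
	wg.Wait()
	return processed
}

func main() {
	fmt.Println("batches processed:", processBatches(1000, 100, 4))
}
```

Memory use stays bounded by the number of workers plus the channel buffer, at the price of the feeding loop waiting when every worker is busy.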

Go Concurrency Circular Logic

I'm just getting into concurrency in Go and trying to create a dispatch go routine that will send jobs to a worker pool listening on the jobchan channel. If a message comes into my dispatch function via the dispatchchan channel and my other go routines are busy, the message is appended onto the stack slice in the dispatcher and the dispatcher will try to send again later when a worker becomes available, and/or no more messages are received on the dispatchchan. This is because the dispatchchan and the jobchan are unbuffered, and the go routine the workers are running will append other messages to the dispatcher up to a certain point and I don't want the workers blocked waiting on the dispatcher and creating a deadlock. Here's the dispatcher code I've come up with so far:
func dispatch() {
var stack []string
acount := 0
for {
select {
case d := <-dispatchchan:
stack = append(stack, d)
case c := <-mw:
acount = acount + c
case jobchan <-stack[0]:
if len(stack) > 1 {
stack[0] = stack[len(stack)-1]
stack = stack[:len(stack)-1]
} else {
stack = nil
}
default:
if acount == 0 && len(stack) == 0 {
close(jobchan)
close(dispatchchan)
close(mw)
wg.Done()
return
}
}
}
Complete example at https://play.golang.wiki/p/X6kXVNUn5N7
The mw channel is a buffered channel the same length as the number of worker go routines. It acts as a semaphore for the worker pool. If the worker routine is doing [m]eaningful [w]ork it throws int 1 on the mw channel and when it finishes its work and goes back into the for loop listening to the jobchan it throws int -1 on the mw. This way the dispatcher knows if there's any work being done by the worker pool, or if the pool is idle. If the pool is idle and there are no more messages on the stack, then the dispatcher closes the channels and returns control to the main func.
This is all good but the issue I have is that the stack itself could be zero length so the case where I attempt to send stack[0] to the jobchan, if the stack is empty, I get an out of bounds error. What I'm trying to figure out is how to ensure that when I hit that case, either stack[0] has a value in it or not. I don't want that case to send an empty string to the jobchan.
Any help is greatly appreciated. If there's a more idiomatic concurrency pattern I should consider, I'd love to hear about it. I'm not 100% sold on this solution but this is the farthest I've gotten so far.
This is all good but the issue I have is that the stack itself could be zero length so the case where I attempt to send stack[0] to the jobchan, if the stack is empty, I get an out of bounds error.
I can't reproduce it with your playground link, but it's believable, because at least one gofunc worker might have been ready to receive on that channel.
My output has been Msgcnt: 0, which is also easily explained, because gofunc might not have been ready to receive on jobschan when dispatch() runs its select. The order of these operations is not defined.
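On the out-of-bounds error itself: the Go specification says that the right-hand-side expression of a send case is evaluated when the select statement is entered, so stack[0] is indexed even if that case is never chosen. One common idiom, sketched below with hypothetical names (drain, in, out), is to send on a nil channel while the buffer is empty, because a send on a nil channel blocks forever and select therefore never picks that case. This sketch forwards items in FIFO order, which differs from the swap-with-last approach in the question.

```go
package main

import "fmt"

// drain forwards everything received on in to out, buffering items in a
// local slice while no receiver is ready. It returns once in is closed and
// the buffer is empty.
func drain(in <-chan string, out chan<- string) {
	var stack []string
	for in != nil || len(stack) > 0 {
		// Leave send nil while the stack is empty: that disables the
		// send case and avoids indexing an empty slice.
		var send chan<- string
		var next string
		if len(stack) > 0 {
			send = out
			next = stack[0]
		}
		select {
		case d, ok := <-in:
			if !ok {
				in = nil // a nil channel is never ready to receive
				continue
			}
			stack = append(stack, d)
		case send <- next:
			stack = stack[1:]
		}
	}
	close(out)
}

func main() {
	in := make(chan string)
	out := make(chan string)
	go drain(in, out)
	go func() {
		for _, s := range []string{"a", "b", "c"} {
			in <- s
		}
		close(in)
	}()
	for s := range out {
		fmt.Println(s)
	}
}
```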
trying to create a dispatch go routine that will send jobs to a worker pool listening on the jobchan channel
A channel needs no dispatcher. A channel is the dispatcher.
If a message comes into my dispatch function via the dispatchchan channel and my other go routines are busy, the message is [...] will [...] send again later when a worker becomes available, [...] or no more messages are received on the dispatchchan.
With a few creative edits, it was easy to turn that into something close to the definition of a buffered channel. It can be read from immediately, or it can take up to some "limit" of messages that can't be immediately dispatched. You do define limit, though it's not used elsewhere within your code.
In any function, defining a variable you don't read will result in a compile-time error like limit declared but not used. This stricture improves code quality and helps identify typos. But at package scope, you've gotten away with defining the unused limit as a "global" and thus avoided a useful error: you haven't limited anything.
Don't use globals. Use passed parameters to define scope, because the definition of scope is tantamount to functional concurrency as expressed with the go keyword. Pass the relevant channels defined in local scope to functions defined at package scope so that you can easily track their relationships. And use directional channels to enforce the producer/consumer relationship between your functions. More on this later.
Going back to "limit", it makes sense to limit the quantity of jobs you're queueing because all resources are limited, and accepting more messages than you have any expectation of processing requires more durable storage than process memory provides. If you don't feel obligated to fulfill those requests no matter what, don't accept "too many" of them in the first place.
So then, what purpose do dispatchchan and dispatch() serve? To store a limited number of pending requests, if any, before they can be processed, and then to send them to the next available worker? That's exactly what a buffered channel is for.
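As a small illustration of that point, a buffered channel accepts up to its capacity of pending jobs without any receiver, and a non-blocking send can refuse the rest. The names tryEnqueue and limit below are only for this sketch.

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send and reports whether the job was accepted.
func tryEnqueue(jobs chan<- string, job string) bool {
	select {
	case jobs <- job:
		return true
	default:
		return false // buffer full and no worker ready: reject the job
	}
}

func main() {
	const limit = 2
	jobs := make(chan string, limit) // holds at most limit pending jobs
	for _, j := range []string{"a", "b", "c"} {
		fmt.Println(j, tryEnqueue(jobs, j))
	}
}
```

With no worker receiving, the first two jobs are accepted and the third is rejected, which is the "don't accept too many" behavior described above.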
Circular Logic
Who "knows" when your program is done? main() provides the initial input, but you close all 3 channels in dispatch():
close(jobchan)
close(dispatchchan)
close(mw)
Your workers write to their own job queue so only when the workers are done writing to it can the incoming job queue be closed. However, individual workers also don't know when to close the jobs queue because other workers are writing to it. Nobody knows when your algorithm is done. There's your circular logic.
The mw channel is a buffered channel the same length as the number of worker go routines. It acts as a semaphore for the worker pool.
There's a race condition here. Consider the case where all n workers have just received the last n jobs. They've each read from jobschan and they're checking the value of ok. The dispatcher proceeds to run its select. Nobody is writing to dispatchchan or reading from jobschan right now so the default case is immediately matched. len(stack) is 0 and there's no current job so the dispatcher closes all channels including mw. At some point thereafter, a worker tries to write to a closed channel and panics.
So finally I'm ready to provide some code, but I have one more problem: I don't have a clear problem statement to write code around.
I'm just getting into concurrency in Go and trying to create a dispatch go routine that will send jobs to a worker pool listening on the jobchan channel.
Channels between goroutines are like the teeth that synchronize gears. But to what end do the gears turn? You're not trying to keep time, nor construct a wind-up toy. Your gears could be made to turn, but what would success look like? Their turning?
Let's try to define a more specific use case for channels: given an arbitrarily long set of durations coming in as strings on standard input*, sleep that many seconds in one of n workers. So that we actually have a result to return, we'll say each worker will return the start and end time the duration was run for.
So that it can run in the playground, I'll simulate standard input with a hard-coded byte buffer.
package main
import (
"bufio"
"bytes"
"fmt"
"os"
"strings"
"sync"
"time"
)
type SleepResult struct {
worker_id int
duration time.Duration
start time.Time
end time.Time
}
func main() {
var num_workers = 2
workchan := make(chan time.Duration)
resultschan := make(chan SleepResult)
var wg sync.WaitGroup
var resultswg sync.WaitGroup
resultswg.Add(1)
go results(&resultswg, resultschan)
for i := 0; i < num_workers; i++ {
wg.Add(1)
go worker(i, &wg, workchan, resultschan)
}
// playground doesn't have stdin
var input = bytes.NewBufferString(
strings.Join([]string{
"3ms",
"1 seconds",
"3600ms",
"300 ms",
"5s",
"0.05min"}, "\n") + "\n")
var scanner = bufio.NewScanner(input)
for scanner.Scan() {
text := scanner.Text()
if dur, err := time.ParseDuration(text); err != nil {
fmt.Fprintln(os.Stderr, "Invalid duration", text)
} else {
workchan <- dur
}
}
close(workchan) // we know when our inputs are done
wg.Wait() // and when our jobs are done
close(resultschan)
resultswg.Wait()
}
func results(wg *sync.WaitGroup, resultschan <-chan SleepResult) {
for res := range resultschan {
fmt.Printf("Worker %d: %s : %s => %s\n",
res.worker_id, res.duration,
res.start.Format(time.RFC3339Nano), res.end.Format(time.RFC3339Nano))
}
wg.Done()
}
func worker(id int, wg *sync.WaitGroup, jobchan <-chan time.Duration, resultschan chan<- SleepResult) {
var res = SleepResult{worker_id: id}
for dur := range jobchan {
res.duration = dur
res.start = time.Now()
time.Sleep(res.duration)
res.end = time.Now()
resultschan <- res
}
wg.Done()
}
Here I use 2 wait groups, one for the workers, one for the results. This makes sure I'm done writing all the results before main() ends. I keep my functions simple by having each function do exactly one thing at a time: main reads inputs, parses durations from them, and sends them off to the next worker. The results function collects results and prints them to standard output. The worker does the sleeping, reading from jobchan and writing to resultschan.
workchan can be buffered (or not, as in this case); it doesn't matter because the input will be read at the rate it can be processed. We can buffer as much input as we want, but we can't buffer an infinite amount. I've set channel sizes as big as 1e6 - but a million is a lot less than infinite. For my use case, I don't need to do any buffering at all.
main knows when the input is done and can close workchan. main also knows when jobs are done (wg.Wait()) and can close the results channel. Closing these channels is an important signal to the worker and results goroutines: they can distinguish between a channel that is empty and a channel that is guaranteed not to have any new additions.
for job := range jobchan {...} is shorthand for your more verbose:
for {
job, ok := <- jobchan
if !ok {
wg.Done()
return
}
...
}
Note that this code creates 2 workers, but it could create 20 or 2000, or even 1. The program functions regardless of how many workers are in the pool. It can handle any volume of input (though interminable input of course leads to an interminable program). It does not create a cyclic loop of output to input. If your use case requires jobs to create more jobs, that's a more challenging scenario that can typically be avoided with careful planning.
I hope this gives you some good ideas about how you can better use concurrency in your Go applications.
https://play.golang.wiki/p/cZuI9YXypxI

How to synchronize constant writing and periodically reading and updating

Defining the problem:
We have IoT devices, each of which sends us logs about a car's location. We want to compute the distance each car travels online! So whenever a log comes in (after putting it in a queue, etc.) we do this:
type Delta struct {
DeviceId string
time int64
Distance float64
}
var LastLogs = make(map[string]FullLog)
var Distances = make(map[string]Delta)
func addLastLog(l FullLog) {
LastLogs[l.DeviceID] = l
}
func AddToLogPerDay(l FullLog) {
//mutex.Lock()
if val, ok := LastLogs[l.DeviceID]; ok {
if distance, exist := Distances[l.DeviceID]; exist {
x := computingDistance(val, l)
Distances[l.DeviceID] = Delta{
DeviceId: l.DeviceID,
time: distance.time + 1,
Distance: distance.Distance + x,
}
} else {
Distances[l.DeviceID] = Delta{
DeviceId: l.DeviceID,
time: 1,
Distance: 0,
}
}
}
addLastLog(l)
}
This basically calculates the distance using a utility function, so in Distances each device ID is mapped to the distance traveled. Here is where the problem starts: while these distances are being added to the Distances map, I want a goroutine to put this data in the database. But since there are many devices and many logs, running this query for every log is not a good idea. So I need to do this every 5 seconds, which means that every 5 seconds I try to flush all the latest distances added to the map. I wrote this function:
func UpdateLogPerDayTable() {
for {
for _, distance := range Distances {
logs := model.HourPerDay{}
result := services.CarDBProvider.DB.Table(model.HourPerDay{}.TableName()).
Where("created_at >? AND device_id = ?", getCurrentData(), distance.DeviceId).
Find(&logs)
if result.Error != nil && !result.RecordNotFound() {
log.Infof("Something went wrong while checking the log: %v", result.Error)
} else {
if !result.RecordNotFound() {
logs.CountDistance = distance.Distance
logs.CountSecond = distance.time
err := services.CarDBProvider.DB.Model(&logs).
Update(map[string]interface{}{
"count_second": logs.CountSecond,
"count_distance": logs.CountDistance,
})
if err.Error != nil {
log.Infof("Something went wrong while updating the log: %v", err.Error)
}
} else if result.RecordNotFound() {
dayLog := model.HourPerDay{
Model: gorm.Model{},
DeviceId: distance.DeviceId,
CountSecond: int64(distance.time),
CountDistance: distance.Distance,
}
err := services.CarDBProvider.DB.Create(&dayLog)
if err.Error != nil {
log.Infof("Something went wrong while adding the log: %v", err.Error)
}
}
}
}
time.Sleep(time.Second * 5)
}
}
It is called as go utlis.UpdateLogPerDayTable(), in another goroutine. However, there are several problems here:
I don't know how to protect Distances so that when I add to it in one goroutine and read it somewhere else, everything is OK. (The problem is that I want to use Go channels and don't have any idea how to do it.)
How can I schedule tasks in Go for this problem?
I will probably add Redis to store all the devices that are online, so I can do the select query faster and just update the actual database. I would also add an expiry time in Redis so that if a device doesn't send any data for some time, it vanishes. Where should I put this code?
Sorry if my explanations weren't enough, but I really need some help, specifically with the code implementation.
Go has a really cool pattern using for / select over multiple channels. This allows you to batch distance writes using both a timeout and a max record size. Using this pattern requires using channels.
First thing is to model your distances as a channel:
distances := make(chan Delta)
Then you can keep track of the current batch:
var deltas []Delta
Then
ticker := time.NewTicker(time.Second * 5)
var deltas []Delta
for {
select {
case <-ticker.C:
// 5 seconds up flush to db
// reset deltas
case d := <-distances:
deltas = append(deltas, d)
if len(deltas) >= maxDeltasPerFlush {
// flush
// reset deltas
}
}
}
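Filled in, the loop above might look like the following sketch. The batch function, its flush callback and the closing of the channel as a stop signal are assumptions added for illustration, and the Delta type is reduced to two fields; a real flush would write to the database.

```go
package main

import (
	"fmt"
	"time"
)

type Delta struct {
	DeviceId string
	Distance float64
}

// batch collects deltas and calls flush when maxPerFlush items have
// accumulated, when interval elapses, or when distances is closed.
func batch(distances <-chan Delta, interval time.Duration, maxPerFlush int, flush func([]Delta)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var deltas []Delta
	flushNow := func() {
		if len(deltas) > 0 {
			flush(deltas)
			deltas = nil // start a new batch; the flushed slice is not reused
		}
	}
	for {
		select {
		case <-ticker.C:
			flushNow() // interval elapsed
		case d, ok := <-distances:
			if !ok {
				flushNow() // final partial batch
				return
			}
			deltas = append(deltas, d)
			if len(deltas) >= maxPerFlush {
				flushNow()
			}
		}
	}
}

func main() {
	distances := make(chan Delta)
	done := make(chan struct{})
	go func() {
		defer close(done)
		batch(distances, 5*time.Second, 3, func(b []Delta) {
			fmt.Println("flushing", len(b), "deltas") // a database write would go here
		})
	}()
	for i := 0; i < 7; i++ {
		distances <- Delta{DeviceId: "car-1", Distance: float64(i)}
	}
	close(distances)
	<-done
}
```

Because only the batch goroutine touches deltas, no mutex is needed for the slice.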
I don't know how to protect Distances so that when I add to it in one
goroutine and read it somewhere else, everything is OK. (The problem is
that I want to use Go channels and don't have any idea how to do it.)
If you intend to keep a map and share memory you need to protect it using mutual exclusion (mutex) to synchronize access between go routines. Using a channel allows you to send a copy to a channel, removing the need for synchronizing across the Delta Object. Depending on your architecture you could also create a pipeline of go routines connected by channels, which could make it so only a single go routine (monitor go routine) is accessing the Delta, also removing the need for synchronization.
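If the shared map is kept, the mutex option could be sketched as below. DistanceStore, Add and Snapshot are hypothetical names, and Delta is reduced to two fields; the idea is only that every access to the map goes through one lock.

```go
package main

import (
	"fmt"
	"sync"
)

type Delta struct {
	DeviceId string
	Distance float64
}

// DistanceStore guards the map with a mutex so that writers and the
// periodic flusher can run in different goroutines.
type DistanceStore struct {
	mu sync.Mutex
	m  map[string]Delta
}

func NewDistanceStore() *DistanceStore {
	return &DistanceStore{m: make(map[string]Delta)}
}

// Add accumulates the distance travelled by one device.
func (s *DistanceStore) Add(deviceID string, distance float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	d := s.m[deviceID]
	d.DeviceId = deviceID
	d.Distance += distance
	s.m[deviceID] = d
}

// Snapshot copies the current values so the caller can write them to the
// database without holding the lock.
func (s *DistanceStore) Snapshot() []Delta {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Delta, 0, len(s.m))
	for _, d := range s.m {
		out = append(out, d)
	}
	return out
}

func main() {
	store := NewDistanceStore()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.Add("car-1", 1.5)
		}()
	}
	wg.Wait()
	fmt.Println(store.Snapshot())
}
```

The flusher would call Snapshot every 5 seconds and perform the slow database work on the copy, so incoming logs are blocked only for the duration of the copy.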
How can I schedule tasks in go for this problem?
Using a channel as the primitive for how you pass Deltas to different go routines :)
I will probably add Redis to store all the devices that are online, so
I can do the select query faster and just update the actual
database. I would also add an expiry time in Redis so that if a device
doesn't send any data for some time, it vanishes. Where should I put this code?
This depends on your finished architecture. You could write a decorator for the select operation, which would check Redis first and then go to the DB. The client of this function wouldn't have to know about this. Write operations could be done the same way: write to the persistent store and then write back to Redis with the cached value and the expiration. Using decorators, the client wouldn't need to know about this; it would just perform the reads and writes, and the cache logic would be implemented inside the decorators. There are many ways to do this, and it's largely dependent on where your implementation settles.
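The read-side decorator could be sketched as follows. Everything here is hypothetical: DistanceReader, dbReader and cachedReader are illustration names, and an in-memory map with expiry times stands in for Redis, so no Redis client API is shown or implied.

```go
package main

import (
	"fmt"
	"time"
)

// DistanceReader is a hypothetical interface for the select operation.
type DistanceReader interface {
	Distance(deviceID string) (float64, error)
}

// dbReader stands in for the real database query and counts its calls.
type dbReader struct{ calls int }

func (d *dbReader) Distance(deviceID string) (float64, error) {
	d.calls++
	return 12.5, nil
}

type cacheEntry struct {
	value   float64
	expires time.Time
}

// cachedReader decorates another DistanceReader with an expiring cache.
// The map is an in-memory stand-in for Redis and is not safe for
// concurrent use without additional locking.
type cachedReader struct {
	next  DistanceReader
	ttl   time.Duration
	cache map[string]cacheEntry
}

func newCachedReader(next DistanceReader, ttl time.Duration) *cachedReader {
	return &cachedReader{next: next, ttl: ttl, cache: make(map[string]cacheEntry)}
}

func (c *cachedReader) Distance(deviceID string) (float64, error) {
	if e, ok := c.cache[deviceID]; ok && time.Now().Before(e.expires) {
		return e.value, nil // cache hit
	}
	v, err := c.next.Distance(deviceID) // cache miss: ask the wrapped reader
	if err != nil {
		return 0, err
	}
	c.cache[deviceID] = cacheEntry{value: v, expires: time.Now().Add(c.ttl)}
	return v, nil
}

func main() {
	db := &dbReader{}
	var reader DistanceReader = newCachedReader(db, time.Minute)
	reader.Distance("car-1")
	reader.Distance("car-1") // served from the cache
	fmt.Println("database calls:", db.calls)
}
```

Callers depend only on the DistanceReader interface, so the cache can be added or removed without touching them.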

Golang: limit concurrency levels of a blocking operation

I have the following scenario:
I am receiving a message on a channel telling me to upload a file. The upload is made by the blocking function uploadToServer. The zipGen channel may receive several messages per second, and I want to upload maximum 5 files simultaneously (not more, but possibly less - depending on how many messages are sent on zipGen by a third worker that is out of the scope of this question).
The listenToZips function runs inside a go routine (go listenToZips() on the file's init function):
func listenToZips() {
for {
select {
case zip := <-zipGen:
uploadToServer(zip) // this is blocking
}
}
}
If I launch go uploadToServer(zip) instead of just uploadToServer(zip) - I get too much concurrency (so for example my program will try to upload 10 files at the same time, but I want a maximum of 5).
On the other hand, without go uploadToServer(zip) (just using uploadToServer(zip) like in the above function), I only upload one file at a time (since the uploadToServer(zip) is blocking).
How can I achieve this level of control to allow me a max upload of 5 files simultaneously?
Thanks!
The simplest option - prespawn N goroutines that take input from the channel, and upload it, in a loop. In each goroutine's context the operation will be blocking, but N goroutines do this. Only one goroutine will receive each message, of course.
func listenToZips(concurrent int) {
for i:=0; i < concurrent; i++ {
// spawn a listener goroutine
go func() {
for {
select {
case zip := <-zipGen:
uploadToServer(zip) // this is blocking
}
}
}()
}
}
Of course you can then add a stop condition, probably using a different channel, but the basic idea is just the same.
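One possible stop condition, sketched here as an assumption rather than as part of the answer above, is to close the input channel itself and wait for the workers with a sync.WaitGroup. startUploaders and the upload callback are hypothetical names; upload stands in for the blocking uploadToServer.

```go
package main

import (
	"fmt"
	"sync"
)

// startUploaders starts n goroutines that call upload for every name
// received on zips. The returned WaitGroup completes once zips has been
// closed and every pending upload has finished.
func startUploaders(n int, zips <-chan string, upload func(string)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for zip := range zips { // loop ends when zips is closed and drained
				upload(zip)
			}
		}()
	}
	return wg
}

func main() {
	zips := make(chan string)
	wg := startUploaders(5, zips, func(name string) {
		fmt.Println("uploading", name) // stand-in for the blocking uploadToServer
	})
	for i := 0; i < 12; i++ {
		zips <- fmt.Sprintf("file-%d.zip", i)
	}
	close(zips) // stop condition: no more files will arrive
	wg.Wait()
}
```

At most five uploads run at the same time because only five goroutines ever receive from the channel.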
Try this:
https://github.com/korovkin/limiter
limiter := NewConcurrencyLimiter(10)
limiter.Execute(func() {
uploadToServer()
})
limiter.Wait()

Resources