I have a channel of thousands of IDs that need to be processed in parallel inside goroutines. How could I implement a lock so that goroutines cannot process the same id at the same time, should it be repeated in the channel?
package main

import (
    "fmt"
    "strconv"
    "sync"
    "time"
)

var wg sync.WaitGroup

func main() {
    var data []string
    for d := 0; d < 30; d++ {
        data = append(data, "id1")
        data = append(data, "id2")
        data = append(data, "id3")
    }
    chanData := createChan(data)
    for i := 0; i < 10; i++ {
        wg.Add(1)
        process(chanData, i)
    }
    wg.Wait()
}

func createChan(data []string) <-chan string {
    var out = make(chan string)
    go func() {
        for _, val := range data {
            out <- val
        }
        close(out)
    }()
    return out
}

func process(ids <-chan string, i int) {
    go func() {
        defer wg.Done()
        for id := range ids {
            fmt.Println(id + " (goroutine " + strconv.Itoa(i) + ")")
            time.Sleep(1 * time.Second)
        }
    }()
}
Edit:
All values need to be processed in any order, but "id1", "id2" & "id3" need to block so they cannot be processed by more than one goroutine at the same time.
The simplest solution here is to not send the duplicate values at all, and then no synchronization is required.
func createChan(data []string) <-chan string {
    seen := make(map[string]bool)
    var out = make(chan string)
    go func() {
        for _, val := range data {
            if seen[val] {
                continue
            }
            seen[val] = true
            out <- val
        }
        close(out)
    }()
    return out
}
I've found a solution. Someone has written a package (github.com/EagleChen/mapmutex) to do exactly what I needed:
package main

import (
    "fmt"
    "strconv"
    "sync"
    "time"

    "github.com/EagleChen/mapmutex"
)

var wg sync.WaitGroup
var mutex *mapmutex.Mutex

func main() {
    mutex = mapmutex.NewMapMutex()
    var data []string
    for d := 0; d < 30; d++ {
        data = append(data, "id1")
        data = append(data, "id2")
        data = append(data, "id3")
    }
    chanData := createChan(data)
    for i := 0; i < 10; i++ {
        wg.Add(1)
        process(chanData, i)
    }
    wg.Wait()
}

func createChan(data []string) <-chan string {
    var out = make(chan string)
    go func() {
        for _, val := range data {
            out <- val
        }
        close(out)
    }()
    return out
}

func process(ids <-chan string, i int) {
    go func() {
        defer wg.Done()
        for id := range ids {
            if mutex.TryLock(id) {
                fmt.Println(id + " (goroutine " + strconv.Itoa(i) + ")")
                time.Sleep(1 * time.Second)
                mutex.Unlock(id)
            }
        }
    }()
}
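If you'd rather not pull in a dependency, roughly the same per-ID locking can be sketched with just the standard library. This is a minimal sketch of my own (keyedMutex and lockID are made-up names, not the mapmutex API), and note that the inner map grows with the number of distinct IDs and is never cleaned up:

type keyedMutex struct {
    mu    sync.Mutex             // guards the locks map itself
    locks map[string]*sync.Mutex // one mutex per ID, created lazily
}

func (k *keyedMutex) lockID(id string) *sync.Mutex {
    k.mu.Lock()
    l, ok := k.locks[id]
    if !ok {
        l = &sync.Mutex{}
        k.locks[id] = l
    }
    k.mu.Unlock()
    l.Lock() // blocks while another goroutine holds this ID
    return l
}

A worker would then call l := km.lockID(id), process the ID, and l.Unlock(). Unlike TryLock above, this blocks until the ID is free instead of skipping it.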
Your problem as stated is difficult by definition and my first choice would be to re-architect the application to avoid it, but if that's not an option:
First, I assume that if a given ID is repeated you still want it processed twice, but not in parallel (if that's not the case and the 2nd instance must be ignored, it becomes even more difficult, because you have to remember every ID you have processed forever, so you don't run the task over it twice).
To achieve your goal, you must keep track of every ID that is being acted upon in a goroutine - a Go map is your best option here (note that its size will grow up to as many goroutines as you spin up in parallel!). The map itself must be protected by a lock, as it is modified from multiple goroutines.
Another simplification I'd take is that it is OK for an ID removed from the channel to be added back to it if it is found to be currently processed by another goroutine. Then we need a map[string]bool as the tracking device, plus a sync.Mutex to guard it. For simplicity, I assume the map, mutex and channel are global variables; but that may not be convenient for you - arrange access to those as you see fit (arguments to the goroutine, closure, etc.).
import "sync"

var idmap map[string]bool
var mtx sync.Mutex
var queue chan string

func process_one_id(id string) {
    busy := false
    mtx.Lock()
    if idmap[id] {
        busy = true
    } else {
        idmap[id] = true
    }
    mtx.Unlock()
    if busy { // put the ID back on the queue and exit
        queue <- id
        return
    }
    // ensure the 'busy' mark is cleared at the end:
    defer func() { mtx.Lock(); delete(idmap, id); mtx.Unlock() }()
    // do your processing here
    // ....
}
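For completeness, a hypothetical worker loop around this function might look like the sketch below. Two caveats that are my assumptions rather than part of the answer: queue should be buffered (a worker that requeues into an unbuffered channel it is also reading from can deadlock), and closing the channel is no longer straightforward once consumers can also produce.

func worker(wg *sync.WaitGroup) {
    defer wg.Done()
    for id := range queue { // queue should be buffered, e.g. make(chan string, 1000)
        process_one_id(id)
    }
}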
Related
I've been searching a lot but could not find an answer to my problem yet.
I need to make multiple calls to an external API, but with different parameters, concurrently.
And then for each call I need to init a struct for each dataset and process the data I receive from the API call. Bear in mind that I read each line of the incoming request and immediately start sending it to the channel.
The first problem I encountered, which was not obvious at the beginning due to the large quantity of data I'm receiving, is that each goroutine does not receive all the data that goes through the channel. (Which I learned through the research I've done.) So what I need is a way of requeuing/redirecting that data to the correct goroutine.
The function that sends the streamed response from a single dataset.
(I've cut useless parts of code that are out of context)
func (api *API) RequestData(ctx context.Context, c chan DWeatherResponse, dataset string, wg *sync.WaitGroup) error {
    for {
        line, err := reader.ReadBytes('\n')
        s := string(line)
        if err != nil {
            log.Printf("End of %s", dataset)
            return err
        }
        data, err := extractDataFromStreamLine(s, dataset)
        if err != nil {
            continue
        }
        c <- *data
    }
}
The function that will process the incoming data
func (s *StrikeStruct) Process(ch, requeue chan dweather.DWeatherResponse) {
    for {
        data, more := <-ch
        if !more {
            break
        }
        // data contains {dataset string, value float64, date time.Time}
        // The s.Parameter needs to match the dataset.
        // IMPORTANT PART: checks if the received data is part of this struct's dataset.
        // If not, I want to send it to another goroutine until it gets to the correct
        // one. There will be a max of 4 datasets, but still this might not be the best approach.
        if !api.GetDataset(s.Parameter, data.Dataset) {
            requeue <- data
            continue
        }
        // Do stuff with the data from this point
    }
}
Now on my own API endpoint I have the following:
ch := make(chan dweather.DWeatherResponse, 2)
requeue := make(chan dweather.DWeatherResponse)
final := make(chan strike.StrikePerYearResponse)
var wg sync.WaitGroup

for _, s := range args.Parameters.Strikes {
    strike := strike.StrikePerYear{
        Parameter:   strike.Parameter(s.Dataset),
        StrikeValue: s.Value,
    }
    // I receive and process the data in here
    go strike.ProcessStrikePerYear(ch, requeue, final, string(s.Dataset))
}

go func() {
    for {
        data, _ := <-requeue
        ch <- data
    }
}()

// Creates a goroutine for each dataset
for _, dataset := range api.Params.Dataset {
    wg.Add(1)
    go api.RequestData(ctx, ch, dataset, &wg)
}

wg.Wait()
close(ch)

// Once the data is all processed it is all appended
var strikes []strike.StrikePerYearResponse
for range args.Fetch.Datasets {
    strikes = append(strikes, <-final)
}
return strikes
The issue with this code is that as soon as I start receiving data from more than one endpoint, the requeue will block and nothing more happens. If I remove that requeue logic, data will be lost if it does not land on the correct goroutine.
My two questions are:
Why is the requeue blocking if it has a goroutine always ready to receive?
Should I take a different approach on how I'm processing the incoming data?
This is not a good way of solving your problem; you should change your approach. I suggest an implementation like the one below:
import (
    "fmt"
    "sync"
)

// answer for https://stackoverflow.com/questions/68454226/how-to-handle-multiple-goroutines-that-share-the-same-channel

var (
    finalResult = make(chan string)
)

// IData is used by the message dispatcher; every data struct must implement its methods
type IData interface {
    IsThisForMe() bool
    Process(*sync.WaitGroup)
}

// MainData can be your main struct, like StrikePerYear
type MainData struct {
    // add any props
    Id   int
    Name string
}

type DataTyp1 struct {
    MainData *MainData
}

func (d DataTyp1) IsThisForMe() bool {
    // you can check your condition on the incoming data here
    if d.MainData.Id == 2 {
        return true
    }
    return false
}

func (d DataTyp1) Process(wg *sync.WaitGroup) {
    d.MainData.Name = "processed by DataTyp1"
    // send the result to the final channel; you can change this as you want
    finalResult <- d.MainData.Name
    wg.Done()
}

type DataTyp2 struct {
    MainData *MainData
}

func (d DataTyp2) IsThisForMe() bool {
    // you can check your condition on the incoming data here
    if d.MainData.Id == 3 {
        return true
    }
    return false
}

func (d DataTyp2) Process(wg *sync.WaitGroup) {
    d.MainData.Name = "processed by DataTyp2"
    // send the result to the final channel; you can change this as you want
    finalResult <- d.MainData.Name
    wg.Done()
}

// dispatcher runs a new goroutine for each request.
// You can implement a worker pool to prevent running too many goroutines.
func dispatcher(incomingData *MainData, wg *sync.WaitGroup) {
    // based on your requirements you can keep or remove this goroutine
    go func() {
        var p IData
        p = DataTyp1{incomingData}
        if p.IsThisForMe() {
            go p.Process(wg)
            return
        }
        p = DataTyp2{incomingData}
        if p.IsThisForMe() {
            go p.Process(wg)
            return
        }
    }()
}

func main() {
    dummyDataArray := []MainData{
        {Id: 2, Name: "this data #2"},
        {Id: 3, Name: "this data #3"},
    }
    wg := sync.WaitGroup{}
    for i := range dummyDataArray {
        wg.Add(1)
        dispatcher(&dummyDataArray[i], &wg)
    }
    result := make([]string, 0)
    done := make(chan struct{})
    // data collector
    go func() {
    loop:
        for {
            select {
            case <-done:
                break loop
            case r := <-finalResult:
                result = append(result, r)
            }
        }
    }()
    wg.Wait()
    done <- struct{}{}
    for _, s := range result {
        fmt.Println(s)
    }
}
Note: this is just to open your mind to finding a better solution; for sure this is not production-ready code.
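As the dispatcher comment above notes, a worker pool is one way to bound the number of goroutines. A minimal sketch of that idea (startPool is my name, not from the code above):

// startPool launches n workers that drain a jobs channel; the dispatcher
// would send onto jobs instead of spawning a goroutine per item.
func startPool(n int, jobs <-chan *MainData, wg *sync.WaitGroup) {
    for i := 0; i < n; i++ {
        go func() {
            for d := range jobs {
                // same type-matching logic as dispatcher, bounded to n goroutines
                for _, p := range []IData{DataTyp1{d}, DataTyp2{d}} {
                    if p.IsThisForMe() {
                        p.Process(wg) // Process calls wg.Done()
                        break
                    }
                }
                // like the original dispatcher, an unmatched item would leave wg waiting
            }
        }()
    }
}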
So I'm new to channels, waitgroups, mutexes, etc., and tried to create an application that queries a slice of a struct for data and, if it finds the data, loads it into a map.
I'm basically trying to replicate a cache/db scenario (but having both in memory currently for ease of understanding).
Now, while querying, the data gets queried from both the db and the cache, and I've put up an RWMutex for this. But when reading the data stored in either the cache or the db from other goroutines (via channels), it reads from both the db goroutine and the cache goroutine. So what I did was: every time I get a read from the cache goroutine, I drain the db goroutine of one element.
package main

import (
    "fmt"
    "math/rand"
    "strconv"
    "sync"
    "time"
)

type Book struct {
    id   int
    name string
}

var cache = map[int]Book{}
var books []Book
var rnd = rand.New(rand.NewSource(time.Now().UnixNano()))

func main() {
    cacheCh := make(chan Book)
    dbCh := make(chan Book)
    wg := &sync.WaitGroup{}
    m := &sync.RWMutex{}
    loadDb()
    for i := 0; i < 10; i++ {
        id := rnd.Intn(10)
        wg.Add(1)
        go func(id int, wg *sync.WaitGroup, m *sync.RWMutex, ch chan<- Book) {
            if find, book := queryCache(id, m); find {
                fmt.Println("Found Book In Cache: ", book)
                ch <- book
            }
            wg.Done()
        }(id, wg, m, cacheCh)
        wg.Add(1)
        go func(id int, wg *sync.WaitGroup, m *sync.RWMutex, ch chan<- Book) {
            if find, book := queryDb(id, m); find {
                ch <- book
            }
            wg.Done()
        }(id, wg, m, dbCh)
        go func(dbCh, cacheCh <-chan Book) {
            var book Book
            select {
            case book = <-cacheCh:
                msg := <-dbCh
                fmt.Println("Drain DbCh From: ", msg, "\nBook From Cache: ", book.name)
            case book = <-dbCh:
                fmt.Println("Book From Database: ", book.name)
            }
        }(dbCh, cacheCh)
    }
    wg.Wait()
}

func queryCache(id int, m *sync.RWMutex) (bool, Book) {
    m.RLock()
    b, ok := cache[id]
    m.RUnlock()
    return ok, b
}

func queryDb(id int, m *sync.RWMutex) (bool, Book) {
    for _, val := range books {
        if val.id == id {
            m.Lock()
            cache[id] = val
            m.Unlock()
            return true, val
        }
    }
    var bnf Book
    return false, bnf
}

func loadDb() {
    var book Book
    for i := 0; i < 10; i++ {
        book.id = i
        book.name = "a" + strconv.Itoa(i)
        books = append(books, book)
    }
}
Also, I understand that in this code it will always query the db even if it finds a hit in the cache, which is not ideal. But this is just a test scenario where all I have prioritised is that the user receives the details as fast as possible (i.e. if the data is not present in the cache, it shouldn't wait for a response from the cache before querying the db).
Please help if possible; I'm pretty new to this, so it might be something trivial.
Sorry and thanks.
The issues I can see mainly relate to assumptions about the order in which things will execute:
You are assuming that the db/cache retrieval goroutines will return in order; there is no guarantee of this (in theory the last book requested could be added to the channel first, and the entries on the cache channel are likely to be ordered differently to those on the db channel).
The above point negates your extra msg := <-dbCh (because the database could return a result before the cache and, in that case, the channel is not drained). This leads to a deadlock.
Your code exits when the waitgroup completes, and this can, and probably will, happen before the goroutine that retrieves the results completes (the goroutine with the select does not utilise the wait group). This is not actually a problem, because your code hits a deadlock before that point.
The code below (playground) is my attempt to address these issues while leaving your code mostly as-is (I did create a getBook function because I think this makes it easier to understand). I'm not going to guarantee that my code is error-free, but hopefully it helps.
package main

import (
    "fmt"
    "math/rand"
    "strconv"
    "sync"
    "time"
)

type Book struct {
    id   int
    name string
}

var cache = map[int]Book{}
var books []Book
var rnd = rand.New(rand.NewSource(time.Now().UnixNano())) // Note: Not random on playground as time is simulated

func main() {
    wg := &sync.WaitGroup{}
    m := &sync.RWMutex{}
    loadDb()
    for i := 0; i < 10; i++ {
        id := rnd.Intn(10)
        wg.Add(1) // Could also add 10 before entering the loop
        go func(wg *sync.WaitGroup, m *sync.RWMutex, id int) {
            getBook(m, id)
            wg.Done() // The book has been retrieved; loop can exit when all books retrieved
        }(wg, m, id)
    }
    wg.Wait() // Wait until all books have been retrieved
}

// getBook will retrieve a book (from the cache or the db)
func getBook(m *sync.RWMutex, id int) {
    fmt.Printf("Requesting book: %d\n", id)
    cacheCh := make(chan Book)
    dbCh := make(chan Book)
    go func(id int, m *sync.RWMutex, ch chan<- Book) {
        if find, book := queryCache(id, m); find {
            //fmt.Println("Found Book In Cache: ", book)
            ch <- book
        }
        close(ch)
    }(id, m, cacheCh)
    go func(id int, m *sync.RWMutex, ch chan<- Book) {
        if find, book := queryDb(id, m); find {
            ch <- book
        }
        close(ch)
    }(id, m, dbCh)
    // Wait for a result from one of the above - Note that we make no assumptions
    // about the order of the results but do assume that dbCh will ALWAYS return a book.
    // We want to return from this function as soon as a result is received (because in reality
    // we would return the result)
    for {
        select {
        case book, ok := <-cacheCh:
            if !ok { // Book is not in the cache so we need to wait for the database query
                cacheCh = nil // Prevents select from considering this
                continue
            }
            fmt.Println("Book From Cache: ", book.name)
            drainBookChan(dbCh) // The database will always return something (in reality you would cancel a context here to stop the database query)
            return
        case book := <-dbCh:
            dbCh = nil // Nothing else will come through this channel
            fmt.Println("Book From Database: ", book.name)
            if cacheCh != nil {
                drainBookChan(cacheCh)
            }
            return
        }
    }
}

// drainBookChan starts a go routine to drain the channel passed in
func drainBookChan(bChan chan Book) {
    go func() {
        for range bChan {
        }
    }()
}

func queryCache(id int, m *sync.RWMutex) (bool, Book) {
    time.Sleep(time.Duration(rnd.Intn(100)) * time.Millisecond) // Introduce some randomness (otherwise everything comes from DB)
    m.RLock()
    b, ok := cache[id]
    m.RUnlock()
    return ok, b
}

func queryDb(id int, m *sync.RWMutex) (bool, Book) {
    time.Sleep(time.Duration(rnd.Intn(100)) * time.Millisecond) // Introduce some randomness (otherwise everything comes from DB)
    for _, val := range books {
        if val.id == id {
            m.Lock()
            cache[id] = val
            m.Unlock()
            return true, val
        }
    }
    var bnf Book
    return false, bnf
}

func loadDb() {
    var book Book
    for i := 0; i < 10; i++ {
        book.id = i
        book.name = "a" + strconv.Itoa(i)
        books = append(books, book)
    }
}
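As the comment inside getBook hints, a real implementation would cancel the losing query instead of draining its channel. A rough sketch of that shape, assuming a hypothetical context-aware queryDbCtx (and an import of "context"):

func getBookWithCancel(m *sync.RWMutex, id int) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel() // whichever query is still running when we return gets cancelled

    dbCh := make(chan Book, 1) // buffered: a late send never blocks, so no drain is needed
    go func() {
        defer close(dbCh)
        if find, book := queryDbCtx(ctx, id, m); find { // queryDbCtx is hypothetical
            dbCh <- book
        }
    }()
    // ... select on cacheCh and dbCh exactly as in getBook above, but call
    // cancel() instead of drainBookChan when one source wins
}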
Functions WithMutex and WithoutMutex are giving different results.
The WithoutMutex implementation is losing values even though I have a WaitGroup set up.
What could be wrong?
Do not run on Playground
P.S. I am on Windows 10 and Go 1.8.1
package main

import (
    "fmt"
    "sync"
)

var p = fmt.Println

type MuType struct {
    list []int
    *sync.RWMutex
}

var muData *MuType
var data *NonMuType

type NonMuType struct {
    list []int
}

func (data *MuType) add(i int, wg *sync.WaitGroup) {
    data.Lock()
    defer data.Unlock()
    data.list = append(data.list, i)
    wg.Done()
}

func (data *MuType) read() []int {
    data.RLock()
    defer data.RUnlock()
    return data.list
}

func (nonmu *NonMuType) add(i int, wg *sync.WaitGroup) {
    nonmu.list = append(nonmu.list, i)
    wg.Done()
}

func (nonmu *NonMuType) read() []int {
    return nonmu.list
}

func WithoutMutex() {
    nonmu := &NonMuType{}
    nonmu.list = make([]int, 0)
    var wg = sync.WaitGroup{}
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go nonmu.add(i, &wg)
    }
    wg.Wait()
    data = nonmu
    p(data.read())
}

func WithMutex() {
    mtx := &sync.RWMutex{}
    withMu := &MuType{list: make([]int, 0)}
    withMu.RWMutex = mtx
    var wg = sync.WaitGroup{}
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go withMu.add(i, &wg)
    }
    wg.Wait()
    muData = withMu
    p(muData.read())
}

func stressTestWOMU(max int) {
    p("Without Mutex")
    for ii := 0; ii < max; ii++ {
        WithoutMutex()
    }
}

func stressTest(max int) {
    p("With Mutex")
    for ii := 0; ii < max; ii++ {
        WithMutex()
    }
}

func main() {
    stressTestWOMU(20)
    stressTest(20)
}
Slices are not safe for concurrent writes, so I am in no way surprised that WithoutMutex does not appear to be consistent at all, and has dropped items.
The WithMutex version consistently has 10 items, but in jumbled orders. This is also to be expected, since the mutex protects it so that only one can append at a time. There is no guarantee as to which goroutine will run in which order though, so it is a race to see which of the rapidly spawned goroutines will get to append first.
The waitgroup does not do anything to control access or enforce ordering. It merely provides a signal at the end that everything is done.
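If the goal is just to collect results from many goroutines without a mutex, one common alternative is to have each goroutine send on a channel and do all the appending in a single reader. A minimal sketch in the same shape as WithoutMutex (imports as in the question):

func WithChannel() {
    ch := make(chan int)
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            ch <- i // channel sends are safe from many goroutines; no lock needed
        }(i)
    }
    // close the channel once all senders are done so the range below terminates
    go func() {
        wg.Wait()
        close(ch)
    }()
    var list []int
    for v := range ch { // a single reader appends, so the slice is never written concurrently
        list = append(list, v)
    }
    p(list) // ten items, in whatever order the goroutines happened to run
}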
I tried to implement a locking version of reading/writing from a map in golang, but it doesn't return the desired result.
package main

import (
    "fmt"
    "sync"
)

var m = map[int]string{}
var lock = sync.RWMutex{}

func StoreUrl(id int, url string) {
    for {
        lock.Lock()
        defer lock.Unlock()
        m[id] = url
    }
}

func LoadUrl(id int, ch chan string) {
    for {
        lock.RLock()
        defer lock.RUnlock()
        r := m[id]
        ch <- r
    }
}

func main() {
    go StoreUrl(125, "www.google.com")
    chb := make(chan string)
    go LoadUrl(125, chb)
    C := <-chb
    fmt.Println("Result:", C)
}
The output is:
Result:
Meaning the value is not returned via the channel, which I don't get. Without the locking/goroutines it seems to work fine. What did I do wrong?
The code can also be found here:
https://play.golang.org/p/-WmRcMty5B
Infinite loops without sleep or some kind of IO are always a bad idea.
In your code, if you put a print statement at the start of StoreUrl, you will find that it never gets printed, i.e. the goroutine was never started: the go call puts the info about this new goroutine in a run queue of the Go scheduler, but the scheduler hasn't run yet to schedule that task. How do you run the scheduler? Do sleep/IO/channel reading/writing.
Another problem is that your infinite loop takes the lock and then tries to take it again, which will cause a deadlock. Deferred calls only run after the function exits, and that function will never exit because of the infinite loop.
Below is modified code that uses sleep to make sure every execution thread gets time to do its job.
package main

import (
    "fmt"
    "sync"
    "time"
)

var m = map[int]string{}
var lock = sync.RWMutex{}

func StoreUrl(id int, url string) {
    for {
        lock.Lock()
        m[id] = url
        lock.Unlock()
        time.Sleep(1) // 1ns: just enough to yield to the scheduler
    }
}

func LoadUrl(id int, ch chan string) {
    for {
        lock.RLock()
        r := m[id]
        lock.RUnlock()
        ch <- r
    }
}

func main() {
    go StoreUrl(125, "www.google.com")
    time.Sleep(1)
    chb := make(chan string)
    go LoadUrl(125, chb)
    C := <-chb
    fmt.Println("Result:", C)
}
Edit: As @Jaun mentioned in the comments, you can also use runtime.Gosched() instead of sleep.
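For example, the writer loop with runtime.Gosched() instead of a sleep would look like this (a minimal variant of the code above; remember to import "runtime"):

func StoreUrl(id int, url string) {
    for {
        lock.Lock()
        m[id] = url
        lock.Unlock()
        runtime.Gosched() // yield the processor so other goroutines can run
    }
}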
The usage of defer is incorrect: deferred calls execute at the end of the function, not at the end of the for statement.
func StoreUrl(id int, url string) {
    for {
        func() {
            lock.Lock()
            defer lock.Unlock()
            m[id] = url
        }()
    }
}
or
func StoreUrl(id int, url string) {
    for {
        lock.Lock()
        m[id] = url
        lock.Unlock()
    }
}
We can't control the order in which goroutines run, so add time.Sleep() to control the order.
code here:
https://play.golang.org/p/Bu8Lo46SA2
I have a channel which stores received data, and I want to process it when one of the following conditions is met:
1. The channel reaches its capacity.
2. The timer fires (it is reset after each process).
I saw the post
Golang - How to know a buffered channel is full
Update:
I took inspiration from that post and OneOfOne's advice; here is the playground:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

var c chan int
var timer *time.Timer

const (
    capacity      = 5
    timerDuration = 3
)

func main() {
    c = make(chan int, capacity)
    timer = time.NewTimer(time.Second * timerDuration)
    go checkTimer()
    go sendRecords("A")
    go sendRecords("B")
    go sendRecords("C")
    time.Sleep(time.Second * 20)
}

func sendRecords(name string) {
    for i := 0; i < 20; i++ {
        fmt.Println(name+" sending record....", i)
        sendOneRecord(i)
        interval := time.Duration(rand.Intn(500))
        time.Sleep(time.Millisecond * interval)
    }
}

func sendOneRecord(record int) {
    select {
    case c <- record:
    default:
        fmt.Println("channel is full !!!")
        process()
        c <- record
        timer.Reset(time.Second * timerDuration)
    }
}

func checkTimer() {
    for {
        select {
        case <-timer.C:
            fmt.Println("3s timer ----------")
            process()
            timer.Reset(time.Second * timerDuration)
        }
    }
}

func process() {
    for i := 0; i < capacity; i++ {
        fmt.Println("process......", <-c)
    }
}
This seems to work fine, but I have a concern: I want to block channel writes from other goroutines while process() is running. Is the code above capable of doing so, or should I add a mutex at the beginning of the process method?
Any elegant solution?
As was mentioned by #OneOfOne, select is really the only way to check if a channel is full.
If you are using the channel to effect batch processing, you could always create an unbuffered channel and have a goroutine pull items and append to a slice.
When the slice reaches a specific size, process the items.
Here's an example on play
package main

import (
    "fmt"
    "sync"
    "time"
)

const BATCH_SIZE = 10

func batchProcessor(ch <-chan int) {
    batch := make([]int, 0, BATCH_SIZE)
    for i := range ch {
        batch = append(batch, i)
        if len(batch) == BATCH_SIZE {
            fmt.Println("Process batch:", batch)
            time.Sleep(time.Second)
            batch = batch[:0] // trim back to zero size
        }
    }
    fmt.Println("Process last batch:", batch)
}

func main() {
    var wg sync.WaitGroup
    ch := make(chan int)
    wg.Add(1)
    go func() {
        batchProcessor(ch)
        wg.Done()
    }()
    fmt.Println("Submitting tasks")
    for i := 0; i < 55; i++ {
        ch <- i
    }
    close(ch)
    wg.Wait()
}
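To also cover the timer condition from the original question, the batch loop can select between the channel and a timeout, flushing a partial batch when the timer fires. A hedged sketch along the same lines (time.After allocates a new timer each iteration, which is fine for a sketch but worth replacing with a reusable time.Timer in a hot loop):

func batchProcessorWithTimeout(ch <-chan int) {
    batch := make([]int, 0, BATCH_SIZE)
    flush := func() {
        if len(batch) > 0 {
            fmt.Println("Process batch:", batch)
            batch = batch[:0]
        }
    }
    for {
        select {
        case i, ok := <-ch:
            if !ok { // channel closed: process whatever is left and stop
                flush()
                return
            }
            batch = append(batch, i)
            if len(batch) == BATCH_SIZE {
                flush()
            }
        case <-time.After(3 * time.Second):
            flush() // timer fired since the last item arrived: process a partial batch
        }
    }
}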
No, select is the only way to do it:
func (t *T) Send(v *Val) {
    select {
    case t.ch <- v:
    default:
        // handle v directly
    }
}
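For a self-contained illustration of the same pattern, here is a version with concrete types in place of *T and *Val (my own stand-ins, not from the answer above):

type Queue struct {
    ch chan int
}

func (q *Queue) Send(v int) {
    select {
    case q.ch <- v: // fits in the buffer: queued for later processing
    default: // buffer is full: handle the value right here instead of blocking
        fmt.Println("channel full, handling directly:", v)
    }
}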