Background: currently I am investigating the mechanism of distributed lock in order to implement it in our email service to deduplicate in the time the QPS is high.
The main article I am reading is this one https://redis.io/docs/reference/patterns/distributed-locks/#retry-on-failure from redis official.
In this article it mentioned in the retry section that
When a client is unable to acquire the lock, it should try again after a random delay in order to try to desynchronize multiple clients trying to acquire the lock for the same resource at the same time (this may result in a split brain condition where nobody wins)
Also the faster a client tries to acquire the lock in the majority of Redis instances, the smaller the window for a split brain condition (and the need for a retry), so ideally the client should try to send the SET commands to the N instances at the same time using multiplexing.
The fact that when a client needs to retry a lock, it waits a time which is comparably greater than the time needed to acquire the majority of locks, in order to probabilistically make split brain conditions during resource contention unlikely.
I read a few recommended open source implementation of redlock from git, indeed there is delay between retry, copy and pasted here for your convenience.
func (m *Mutex) LockContext(ctx context.Context) error {
if ctx == nil {
ctx = context.Background()
}
value, err := m.genValueFunc()
if err != nil {
return err
}
for i := 0; i < m.tries; i++ {
if i != 0 {
select {
case <-ctx.Done():
// Exit early if the context is done.
return ErrFailed
case <-time.After(m.delayFunc(i)):
// Fall-through when the delay timer completes.
}
}
start := time.Now()
n, err := func() (int, error) {
ctx, cancel := context.WithTimeout(ctx, time.Duration(int64(float64(m.expiry)*m.timeoutFactor)))
defer cancel()
return m.actOnPoolsAsync(func(pool redis.Pool) (bool, error) {
return m.acquire(ctx, pool, value)
})
}()
if n == 0 && err != nil {
return err
}
now := time.Now()
until := now.Add(m.expiry - now.Sub(start) - time.Duration(int64(float64(m.expiry)*m.driftFactor)))
if n >= m.quorum && now.Before(until) {
m.value = value
m.until = until
return nil
}
_, err = func() (int, error) {
ctx, cancel := context.WithTimeout(ctx, time.Duration(int64(float64(m.expiry)*m.timeoutFactor)))
defer cancel()
return m.actOnPoolsAsync(func(pool redis.Pool) (bool, error) {
return m.release(ctx, pool, value)
})
}()
if i == m.tries-1 && err != nil {
return err
}
}
return ErrFailed
}
What I do not understand is that why the existence of the retry can help to avoid brain split?
Intuitively the backoff time(delay) for the retry is used to avoid overloading the redis instance.
Related
Suppose you have a basic toy system that finds and processes all files in a directory (for some definition of "processes"). A basic diagram of how it operates could look like:
If this were a real-world distributed system, the "arrows" could actually be unbounded queues, and then it just works.
In a self-contained go application, it's tempting to model the "arrows" as channels. However, due to the self-referential nature of "generating more work by needing to list subdirectories", it's easy to see that a naive implementation would deadlock. For example (untested, forgive compile errors):
func ListDirWorker(dirs, files chan string) {
for dir := range dirs {
for _, path := range ListDir(dir) {
if isDir(path) {
dirs <- path
} else {
files <- path
}
}
}
}
}
If we imagine we've configured just a single List worker, all it takes is for a directory to have two subdirectories to basically deadlock this thing.
My brain wants there to be "unbounded channels" in golang, but the creators don't want that. What's the correct idiomatic way to model this stuff? I imagine there's something simpler than implementing a thread-safe queue and using that instead of channels. :)
Had a very similar problem to solve. Needed:
finite number of recursive workers (bounded parallelism)
content.Context for early cancelations (enforce timeout limits etc.)
partial results (some goroutines hit errors while others did not)
crawl completion (worker clean-up etc.) via recursive depth tracking
Below I describe the problem and the gist of the solution I arrived at
Problem: scrape a HR LDAP directory with no pagination support. Server-side limits also precluded bulk queries greater than 100K records. Needed small queries to work around these limitations. So recursively navigated the tree from the top (CEO) - listing employees (nodes) and recursing on managers (branches).
To avoid deadlocks - a single workItem channel was used not only by workers to read (get) work, but also to write (delegate) to other idle workers. This approach allowed for fast worker saturation.
Note: not included here, but worth adding, is to use a common API rate-limiter to avoid multiple workers collectively abusing/exceeding any server-side API rate limits.
To start the crawl, create the workers and return a results channel and an error channel. Some notes:
c.in the workItem channel must be unbuffered for delegation to work (more on this later)
c.rwg tracks collective recursion depth for all worker. When it reaches zero, all recursion is done and the crawl is complete
func (c *Crawler) Crawl(ctx context.Context, root Branch, workers int) (<-chan Result, <-chan error) {
errC := make(chan error, 1)
c.rwg = sync.WaitGroup{} // recursion depth waitgroup (to determine when all workers are done)
c.rwg.Add(1) // add to waitgroups *BEFORE* starting workers
c.in = make(chan workItem) // input channel: shared by all workers (to read from and also to write to when they need to delegate)
c.out = make(chan Result) // output channel: where all workers write their results
go func() {
workerErrC := c.createWorkers(ctx, workers)
c.in <- workItem{
branch: root, // initial place to start crawl
}
for err := range workerErrC {
if err != nil {
// tally for partial results - or abort on first error (see werr)
}
}
// summarize crawl success/failure via a single write to errC
errC <- werr // nil, partial results, aborted early etc.
close(errC)
}
return c.out, errC
}
Create a finite number of individual workers. The returned error channel receives an error for each individual worker:
func (c *Crawler) createWorkers(ctx context.Context, workers int) (<-chan error) {
errC := make(chan error)
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++ {
i := i
go func() {
defer wg.Done()
var err error
defer func() {
errC <- err
}()
conn := Dial("somewhere:8080") // worker prep goes here (open network connect etc.)
for workItem := range c.in {
err = c.recurse(ctx, i+1, conn, workItem)
if err != nil {
return
}
}
}()
}
go func() {
c.rwg.Wait() // wait for all recursion to finish ...
close(c.in) // ... so safe to close input channel ...
wg.Wait() // ... wait for all workers to complete ...
close(errC) // .. finally signal to caller we're truly done
}()
return errC
}
recurse logic:
for any potentially blocking channel write, always check the ctx for cancelation, so we can abort early
c.in is deliberately unbuffered to ensure delegation works (see final note)
func (c *Crawler) recurse(ctx context.Context, workID int, conn *net.Conn, wi workItem) error {
defer c.rwg.Done() // decrement recursion count
select {
case <-ctx.Done():
return ctx.Err() // canceled/timeout etc.
case c.out <- Result{ /* Item: wi.. */}: // write to results channel (manager or employee)
}
items, err := getItems(conn) // WORKER CODE (e.g. get manager employees etc.)
if err != nil {
return err
}
for _, i := range items {
// leaf case
if i.IsLeaf() {
select {
case <-ctx.Done():
return ctx.Err()
case c.out <- Result{ Item: i.Leaf }:
}
continue
}
// branch case
wi := workItem{
branch: i.Branch,
}
c.rwg.Add(1) // about to recurse (or delegate-recursion)
select {
case c.in <- wi:
// delegated to another worker!
case <-ctx.Done(): // context canceled...
c.rwg.Done() // ... so undo above `c.rwg.Add(1)`
return ctx.Err()
default:
// no-one to delegated to (all busy) - so this worker will keep working
err = c.recurse(ctx, workID, conn, wi)
if err != nil {
return err
}
}
}
return nil
}
Delegation is key:
if a worker successfully writes to the worker channel, then it knows work has been delegated to another worker.
if it cannot, then the worker knows all workers are busy working (i.e. not waiting on work items) - and so it must recurse itself
So one gets both the benefits of recursion, but also leveraging a fixed-sized worker pool.
I am trying to parallelize a recursive problem in Go, and I am unsure what the best way to do this is.
I have a recursive function, which works like this:
func recFunc(input string) (result []string) {
for subInput := range getSubInputs(input) {
subOutput := recFunc(subInput)
result = result.append(result, subOutput...)
}
result = result.append(result, getOutput(input)...)
}
func main() {
output := recFunc("some_input")
...
}
So the function calls itself N times (where N is 0 at some level), generates its own output and returns everything in a list.
Now I want to make this function run in parallel. But I am unsure what the cleanest way to do this is. My Idea:
Have a "result" channel, to which all function calls send their result.
Collect the results in the main function.
Have a wait group, which determines when all results are collected.
The Problem: I need to wait for the wait group and collect all results in parallel. I can start a separate go function for this, but how do I ever quit this separate go function?
func recFunc(input string) (result []string, outputChannel chan []string, waitGroup &sync.WaitGroup) {
defer waitGroup.Done()
waitGroup.Add(len(getSubInputs(input))
for subInput := range getSubInputs(input) {
go recFunc(subInput)
}
outputChannel <-getOutput(input)
}
func main() {
outputChannel := make(chan []string)
waitGroup := sync.WaitGroup{}
waitGroup.Add(1)
go recFunc("some_input", outputChannel, &waitGroup)
result := []string{}
go func() {
nextResult := <- outputChannel
result = append(result, nextResult ...)
}
waitGroup.Wait()
}
Maybe there is a better way to do this? Or how can I ensure the anonymous go function, that collects the results, is quited when done?
tl;dr;
recursive algorithms should have bounded limits on expensive resources (network connections, goroutines, stack space etc.)
cancelation should be supported - to ensure expensive operations can be cleaned up quickly if a result is no longer needed
branch traversal should support error reporting; this allows errors to bubble up the stack & partial results to be returned without the entire recursion traversal to fail.
For asychronous results - whether using recursions or not - use of channels is recommended. Also, for long running jobs with many goroutines, provide a method for cancelation (context.Context) to aid with clean-up.
Since recursion can lead to exponential consumption of resources it's important to put limits in place (see bounded parallelism).
Below is a design patten I use a lot for asynchronous tasks:
always support taking a context.Context for cancelation
number of workers needed for the task
return a chan of results & a chan error (will only return one error or nil)
var (
workers = 10
ctx = context.TODO() // use request context here - otherwise context.Background()
input = "abc"
)
resultC, errC := recJob(ctx, workers, input) // returns results & `error` channels
// asynchronous results - so read that channel first in the event of partial results ...
for r := range resultC {
fmt.Println(r)
}
// ... then check for any errors
if err := <-errC; err != nil {
log.Fatal(err)
}
Recursion:
Since recursion quickly scales horizontally, one needs a consistent way to fill the finite list of workers with work but also ensure when workers are freed up, that they quickly pick up work from other (over-worked) workers.
Rather than create a manager layer, employ a cooperative peer system of workers:
each worker shares a single inputs channel
before recursing on inputs (subIinputs) check if any other workers are idle
if so, delegate to that worker
if not, current worker continues recursing that branch
With this algorithm, the finite count of workers quickly become saturated with work. Any workers which finish early with their branch - will quickly be delegated a sub-branch from another worker. Eventually all workers will run out of sub-branches, at which point all workers will be idled (blocked) and the recursion task can finish up.
Some careful coordination is needed to achieve this. Allowing the workers to write to the input channel helps with this peer coordination via delegation. A "recursion depth" WaitGroup is used to track when all branches have been exhausted across all workers.
(To include context support and error chaining - I updated your getSubInputs function to take a ctx and return an optional error):
func recFunc(ctx context.Context, input string, in chan string, out chan<- string, rwg *sync.WaitGroup) error {
defer rwg.Done() // decrement recursion count when a depth of recursion has completed
subInputs, err := getSubInputs(ctx, input)
if err != nil {
return err
}
for subInput := range subInputs {
rwg.Add(1) // about to recurse (or delegate recursion)
select {
case in <- subInput:
// delegated - to another goroutine
case <-ctx.Done():
// context canceled...
// but first we need to undo the earlier `rwg.Add(1)`
// as this work item was never delegated or handled by this worker
rwg.Done()
return ctx.Err()
default:
// noone available to delegate - so this worker will need to recurse this item themselves
err = recFunc(ctx, subInput, in, out, rwg)
if err != nil {
return err
}
}
select {
case <-ctx.Done():
// always check context when doing anything potentially blocking (in this case writing to `out`)
// context canceled
return ctx.Err()
case out <- subInput:
}
}
return nil
}
Connecting the Pieces:
recJob creates:
input & output channels - shared by all workers
"recursion" WaitGroup detects when all workers are idle
"output" channel can then safely be closed
error channel for all workers
kicks-off recursion workload by writing initial input to input channel
func recJob(ctx context.Context, workers int, input string) (resultsC <-chan string, errC <-chan error) {
// RW channels
out := make(chan string)
eC := make(chan error, 1)
// R-only channels returned to caller
resultsC, errC = out, eC
// create workers + waitgroup logic
go func() {
var err error // error that will be returned to call via error channel
defer func() {
close(out)
eC <- err
close(eC)
}()
var wg sync.WaitGroup
wg.Add(1)
in := make(chan string) // input channel: shared by all workers (to read from and also to write to when they need to delegate)
workerErrC := createWorkers(ctx, workers, in, out, &wg)
// get the ball rolling, pass input job to one of the workers
// Note: must be done *after* workers are created - otherwise deadlock
in <- input
errCount := 0
// wait for all worker error codes to return
for err2 := range workerErrC {
if err2 != nil {
log.Println("worker error:", err2)
errCount++
}
}
// all workers have completed
if errCount > 0 {
err = fmt.Errorf("PARTIAL RESULT: %d of %d workers encountered errors", errCount, workers)
return
}
log.Printf("All %d workers have FINISHED\n", workers)
}()
return
}
Finally, create the workers:
func createWorkers(ctx context.Context, workers int, in chan string, out chan<- string, rwg *sync.WaitGroup) (errC <-chan error) {
eC := make(chan error) // RW-version
errC = eC // RO-version (returned to caller)
// track the completeness of the workers - so we know when to wrap up
var wg sync.WaitGroup
wg.Add(workers)
for i := 0; i < workers; i++ {
i := i
go func() {
defer wg.Done()
var err error
// ensure the current worker's return code gets returned
// via the common workers' error-channel
defer func() {
if err != nil {
log.Printf("worker #%3d ERRORED: %s\n", i+1, err)
} else {
log.Printf("worker #%3d FINISHED.\n", i+1)
}
eC <- err
}()
log.Printf("worker #%3d STARTED successfully\n", i+1)
// worker scans for input
for input := range in {
err = recFunc(ctx, input, in, out, rwg)
if err != nil {
log.Printf("worker #%3d recurseManagers ERROR: %s\n", i+1, err)
return
}
}
}()
}
go func() {
rwg.Wait() // wait for all recursion to finish
close(in) // safe to close input channel as all workers are blocked (i.e. no new inputs)
wg.Wait() // now wait for all workers to return
close(eC) // finally, signal to caller we're truly done by closing workers' error-channel
}()
return
}
I can start a separate go function for this, but how do I ever quit this separate go function?
You can range over the output channel in the separate go-routine. The go-routine, in that case, will exit safely, when the channel is closed
go func() {
for nextResult := range outputChannel {
result = append(result, nextResult ...)
}
}
So, now the thing that we need to take care of is that the channel is closed after all the go-routines spawned as part of the recursive function call have successfully existed
For that, you can use a shared waitgroup across all the go-routines and wait on that waitgroup in your main function, as you are already doing. Once the wait is over, close the outputChannel, so that the other go-routine also exits safely
func recFunc(input string, outputChannel chan, wg &sync.WaitGroup) {
defer wg.Done()
for subInput := range getSubInputs(input) {
wg.Add(1)
go recFunc(subInput)
}
outputChannel <-getOutput(input)
}
func main() {
outputChannel := make(chan []string)
waitGroup := sync.WaitGroup{}
waitGroup.Add(1)
go recFunc("some_input", outputChannel, &waitGroup)
result := []string{}
go func() {
for nextResult := range outputChannel {
result = append(result, nextResult ...)
}
}
waitGroup.Wait()
close(outputChannel)
}
PS: If you want to have bounded parallelism to limit the exponential growth, check this out
I'm using Go routines to send queries to PostgreSQL master and slave nodes in parallel. The first host that returns a valid result wins. Error cases are outside the scope of this question.
The caller is the only one that cares about the contents of a *sql.Rows object, so intentionally my function doesn't do any operations on those. I use buffered channels to retrieve return objects from the Go routines, so there should be no Go routine leak. Garbage collection should take care of the rest.
There is a problem I haven't taught about properly: the Rows objects that remain behind in the channel are never closed. When I call this function from a (read only) transaction, tx.Rollback() returns an error for every instance of non-closed Rows object: "unexpected command tag SELECT".
This function is called from higher level objects:
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
rc := make(chan *sql.Rows, len(xs))
ec := make(chan error, len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
}
}(x)
}
var me MultiError
for i := 0; i < len(xs); i++ {
select {
case err := <-ec:
me.append(err)
case rows := <-rc: // Return on the first success
return rows, nil
}
}
return nil, me.check()
}
Executors can be *sql.DB, *sql.Tx or anything that complies with the interface:
type executor interface {
ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error)
QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error)
QueryRowContext(ctx context.Context, query string, args ...interface{}) *sql.Row
}
Rollback logic:
func (mtx MultiTx) Rollback() error {
ec := make(chan error, len(mtx))
for _, tx := range mtx {
go func(tx *Tx) {
err := tx.Rollback()
ec <- err
}(tx)
}
var me MultiError
for i := 0; i < len(mtx); i++ {
if err := <-ec; err != nil {
me.append(err)
}
}
return me.check()
}
MultiTx is a collection of open transactions on multiple nodes. It is a higher level object that calls multiQuery
What would be the best approach to "clean up" unused rows? Options I'm thinking about not doing:
Cancel the context: I believe it will work inconsistently, multiple queries might already have returned by the time cancel() is called
Create a deferred Go routine which continues to drain the channels and close the rows objects: If a DB node is slow to respond, Rollback() is still called before rows.Close()
Use a sync.WaitGroup somewhere in the MultiTx type, maybe in combination with (2): This can cause Rollback to hang if one of the nodes is unresponsive. Also, I wouldn't be sure how I would implement that.
Ignore the Rollback errors: Ignoring errors never sounds like a good idea, they are there for a reason.
What would be the recommended way of approaching this?
Edit:
As suggested by #Peter, I've tried canceling the context, but it seems this also invalidates all the returned Rows from the query. On rows.Scan I'm getting context canceled error at the higher level caller.
This is what I've done so far:
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
rc := make(chan *sql.Rows, len(xs))
ec := make(chan error, len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
cancel() // Cancel on success
}
}(x)
}
var (
me MultiError
rows *sql.Rows
)
for i := 0; i < len(xs); i++ {
select {
case err := <-ec:
me.append(err)
case r := <-rc:
if rows == nil { // Only use the first rows
rows = r
} else {
r.Close() // Cleanup remaining rows, if there are any
}
}
}
if rows != nil {
return rows, nil
}
return nil, me.check()
}
Edit 2:
#Adrian mentioned:
we can't see the code that's actually using any of this.
This code is reused by type methods. First there is the transaction type. The issues in this question are appearing on the Rollback() method above.
// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx []*Tx
// QueryContext runs sql.Tx.QueryContext on the tranactions in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mtx MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
return multiQuery(ctx, mtx2Exec(mtx), query, args...)
}
Then there is:
// MultiNode holds a slice of Nodes.
// All methods on this type run their sql.DB variant in one Go routine per Node.
type MultiNode []*Node
// QueryContext runs sql.DB.QueryContext on the Nodes in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mn MultiNode) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
return multiQuery(ctx, nodes2Exec(mn), query, args...)
}
These methods the public wrappers around the multiQuery() function. Now I realize that just sending the *Rows into a buffered channel to die, is actually a memory leak. In the transaction cases it becomes clear, as Rollback() starts to complain. But in the non-transaction variant, the *Rows inside the channel will never be garbage collected, as the driver might hold reference to it until rows.Close() is called.
I've written this package to by used by an ORM, sqlboiler. My higher level logic passes a MultiTX object to the ORM. From that point, I don't have any explicit control over the returned Rows. A simplistic approach would be that my higher level code cancels the context before Rollback(), but I don't like that:
It gives a non-intuitive API. This (idiomatic) approach would break:
ctx, cancel = context.WithCancel(context.Background())
defer cancel()
tx, _ := db.BeginTx(ctx)
defer tx.Rollback()
The ORM's interfaces also specify the regular, non-context aware Query() variants, which in my package's case will run against context.Background().
I'm starting to worry that this broken by design... Anyway, I will start by implementing a Go routine that will drain the channel and close the *Rows. After that I will see if I can implement some reasonable waiting / cancellation mechanism that won't affect the returned *Rows
I think that the function below will do what you require with the one provisio being that the context passed in should be cancelled when you are done with the results (otherwise one context.WithCancel will leak; I cannot see a way around that as cancelling it within the function will invalidate the returned sql.Rows).
Note that I have not had time to test this (would need to setup a database, implement your interfaces etc) so there may well be a bug hidden in the code (but I believe the basic algorithm is sound)
// queryResult holds the goroutine# and the result from that gorouting (need both so we can avoid cancelling the relevant context)
type queryResult struct {
no int
rows *sql.Rows
}
// multiQuery - Executes multiple queries and returns either the first to resutn a result or, if all fail, a multierror summarising the errors
// Important: This should be used for READ ONLY queries only (it is possible that more than one will complete)
// Note: The ctx passed in must be cancelled to avoid leaking a context (this routine cannot cancel the context used for the winning query)
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
noOfQueries := len(xs)
rc := make(chan queryResult) // Channel for results; unbuffered because we only want one, and only one, result
ec := make(chan error) // errors get sent here - goroutines must send a result or 1 error
defer close(ec) // Ensure the error consolidation go routine will complete
// We need a way to cancel individual goroutines as we do not know which one will succeed
cancelFns := make([]context.CancelFunc, noOfQueries)
// All goroutines must terminate before we exit (otherwise the transaction maybe rolled back before they are cancelled leading to "unexpected command tag SELECT")
var wg sync.WaitGroup
wg.Add(noOfQueries)
for i, x := range xs {
var queryCtx context.Context
queryCtx, cancelFns[i] = context.WithCancel(ctx)
go func(ctx context.Context, queryNo int, x executor) {
defer wg.Done()
rows, err := x.QueryContext(ctx, query, args...)
if err != nil {
ec <- err // Error collection go routine guaranteed to run until all query goroutines complete
return
}
select {
case rc <- queryResult{queryNo, rows}:
return
case <-ctx.Done(): // If another query has already transmitted its results these should be thrown away
rows.Close() // not strictly required because closed context should tidy up
return
}
}(queryCtx, i, x)
}
// Start go routine that will send a MultiError to a channel if all queries fail
mec := make(chan MultiError)
go func() {
var me MultiError
errCount := 0
for err := range ec {
me.append(err)
errCount += 1
if errCount == noOfQueries {
mec <- me
return
}
}
}()
// Wait for one query to succeed or all queries to fail
select {
case me := <-mec:
for _, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
cancelFn()
}
wg.Wait()
return nil, me.check()
case result := <-rc:
for i, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
if i != result.no { // do not cancel the query that returned a result
cancelFn()
}
}
wg.Wait()
return result.rows, nil
}
}
Thanks to the comments from #Peter and the answer of #Brits, I got fresh ideas on how to approach this.
Blue print
3 out of 4 proposals from the question were needed to be implemented.
1. Cancel the Context
mtx.QueryContext() creates a descendant context and sets the CancelFunc in the MultiTx object.
The cancelWait() helper cancels an old context and waits for MultiTX.Done if its not nil. It is called on Rollback() and before every new query.
2. Drain the channel
In multiQuery(), Upon obtaining the first successful Rows, a Go routine is launched to drain and close the remaining Rows. The rows channel no longer needs to be buffered.
An additional Go routine and a WaitGroup is used to close the error and rows channels.
3. Return a done channel
Instead of the proposed WaitGroup, multiQuery() returns a done channel. The channel is closed once the drain & close routine has finished. mtx.QueryContext() sets done the channel on the MultiTx object.
Errors
Instead of the select block, only drain the error channel if there are now Rows. The error needs to remain buffered for this reason.
Code
// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx struct {
tx []*Tx
done chan struct{}
cancels context.CancelFunc
}
func (m *MultiTx) cancelWait() {
if m.cancel != nil {
m.cancel()
}
if m.done != nil {
<-m.done
}
// reset
m.done, m.cancel = nil, nil
}
// Context creates a child context and appends CancelFunc in MultiTx
func (m *MultiTx) context(ctx context.Context) context.Context {
m.cancelWait()
ctx, m.cancel = context.WithCancel(ctx)
return ctx
}
// QueryContext runs sql.Tx.QueryContext on the tranactions in separate Go routines.
func (m *MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (rows *sql.Rows, err error) {
rows, m.done, err = multiQuery(m.context(ctx), mtx2Exec(m.tx), query, args...)
return rows, err
}
func (m *MultiTx) Rollback() error {
m.cancelWait()
ec := make(chan error, len(m.tx))
for _, tx := range m.tx {
go func(tx *Tx) {
err := tx.Rollback()
ec <- err
}(tx)
}
var me MultiError
for i := 0; i < len(m.tx); i++ {
if err := <-ec; err != nil {
me.append(err)
}
}
return me.check()
}
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, chan struct{}, error) {
rc := make(chan *sql.Rows)
ec := make(chan error, len(xs))
var wg sync.WaitGroup
wg.Add(len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
}
wg.Done()
}(x)
}
// Close channels when all query routines completed
go func() {
wg.Wait()
close(ec)
close(rc)
}()
rows, ok := <-rc
if ok { // ok will be false if channel closed before any rows
done := make(chan struct{}) // Done signals the caller that all remaining rows are properly closed
go func() {
for rows := range rc { // Drain channel and close unused Rows
rows.Close()
}
close(done)
}()
return rows, done, nil
}
// no rows, build error return
var me MultiError
for err := range ec {
me.append(err)
}
return nil, nil, me.check()
}
Edit: Cancel & wait for old contexts before every Query, as *sql.Tx is not Go routine save, all previous queries have to be done before a next call.
Several hundred MB of memory is allocated for 50 requests of 5 MB. Memory is allocated and is no longer released.
How can I clear my memory? Why can this happen?
I've tried on Ubuntu on my home pc and on VPS
package main
import (
"fmt"
"io/ioutil"
"net/http"
"time"
)
func main() {
fmt.Println("start")
for i := 0; i < 50; i++ {
go func() {
DoRequest()
}()
time.Sleep(10 * time.Millisecond)
}
time.Sleep(10 * time.Minute)
}
func DoRequest() error {
requestUrl := "https://blockchain.info/rawblock/0000000000000000000eebedea046425bd54626e6c56eb032e66e714d0141ea6"
req, err := http.NewRequest("GET", requestUrl, nil)
if err != nil {
return err
}
req.Header.Set("user-agent", "free")
httpClient := &http.Client{
Timeout: time.Second * 10,
}
resp, err := httpClient.Do(req)
if resp != nil {
defer resp.Body.Close()
}
if err != nil {
return err
}
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
return err
}
fmt.Println("bodylen", len(body))
return nil
}
Allocated somewhere 400MB
You are creating an http client for each go-routine.
Http client is designed to be create once & used many times. They are go-routine safe.
They allow for connection reuse & other efficiency savers.
Create the http client once in main (instead of in your go-routine) & then pass this single reference to all of your 50 go-routines.
Edit: Also, while it may not make a practical difference in your case, the order for a request is usually like so:
resp, err := httpClient.Do(req)
if err != nil {
return err // check error first
}
defer resp.Body.Close() // no error - so resp will *NOT* be nil - so this is safe
Edit 2: As #Adrian has mentioned: go's garbage collection is not instantaneous - nor should it be - as it is an expensive operation. If you no longer need a block of memory - simply don't reference it anymore. Let the GC do its job, so you can focus on yours!
If you're curious about the evolution of go's GC:
https://blog.golang.org/ismmkeynote (heavy on the technical side)
What kind of Garbage Collection does Go use?
for i := 0; i < 50; i++ {
go func() {
DoRequest()
}()
time.Sleep(10 * time.Millisecond)
}
Never create go-routines like this. Always make sure you create go-routines the way it not fill large ( all ) memory in any case ( including worst case )
Simple solution is control the count of go-routines can spawned ( or running ) at time.
You can pre-calculate memory to be occupied in worst case by multiplying max-number of go-routines you want to run at a time and max-memory can be used by one go-routine.
You can control instances of go-routines by using channles.
Refer first answer of this stackoverflow question
Always have x number of goroutines running at any time
Always use balanced solution between perforamce and required resources.
Update June 11,2019
Here is example go program
https://play.golang.org/p/HovNRgp6FxH
Let's assume I declared two maps and want to assign it in two different goroutines in error group. I don't perform any reads/write.
Should I protect assign operation with lock or I can omit it?
UPD3: In the Java Concurrency In Practice Brian Goetz's Part I Chapter 3 Shared Objects, mentioned:
Locking is not just about mutual exclusion; it is also memory
visibility. To ensure that all threads see the most up-to-date values
of shared mutable variables, the reading and writing threads must
synchronize on a common lock.
var (
mu sync.Mutex
one map[string]struct{}
two map[string]struct{}
)
g, gctx := errgroup.WithContext(ctx)
g.Go(func() error {
resp, err := invokeFirstService(gctx, request)
if err != nil {
return err
}
mu.Lock()
one = resp.One
mu.Unlock()
return nil
})
g.Go(func() error {
resp, err := invokeSecondService(gctx, request)
if err != nil {
return err
}
mu.Lock()
two = resp.Two
mu.Unlock()
return nil
})
if err := g.Wait(); err != nil {
return err
}
// UPD3: added lock and unlock section
m.Lock()
defer m.Unlock()
performAction(one, two)
UPD: added more context about variables
UPD2: what were my doubts: we have 3 goroutines - parent and two in the error group. there is no guarantee that our parent goroutine shared memory gets the last update after errgroup goroutines complete until we wrap access to shared memory with memory barriers
Group.Wait() blocks until all function calls from the Group.Go() method have returned, so this is a synchronization point. This ensures performAction(one, two) will not start before any writes to one and two are done, so in your example the mutex is unnecessary.
g, gctx := errgroup.WithContext(ctx)
g.Go(func() error {
// ...
one = resp.One
return nil
})
g.Go(func() error {
// ...
two = resp.Two
return nil
})
if err := g.Wait(); err != nil {
return err
}
// Here you can access one and two safely:
performAction(one, two)
If you would access one and two from other goroutines while the goroutines that write them run concurrently, then yes, you would need to lock them, e.g.:
// This goroutine runs concurrently, so all concurrent access must be synchronized:
go func() {
mu.Lock()
fmt.Println(one, two)
mu.Unlock()
}()
g, gctx := errgroup.WithContext(ctx)
g.Go(func() error {
// ...
mu.Lock()
one = resp.One
mu.Unlock()
return nil
})
g.Go(func() error {
// ...
mu.Lock()
two = resp.Two
mu.Unlock()
return nil
})
if err := g.Wait(); err != nil {
return err
}
// Note that you don't need to lock here
// if the first concurrent goroutine only reads one and two.
performAction(one, two)
Also note that in the above example you could use sync.RWMutex, and in the goroutine that reads them, RWMutex.RLock() and RWMutex.RUnlock() would also be sufficient.
In this case, only one goroutine can access the map. I don't think you need a lock.