I have a system handling about 100 req/sec, and sometimes it stops responding until I restart my Go program. I found out that this happens because I open a transaction in some places and never close it, so eventually every connection is occupied by an open transaction and I can't open another one. After discovering this, I added the following code:
defer func() {
    if r := recover(); r != nil {
        tx.Rollback()
        return
    }
    if err == nil {
        err = tx.Commit()
    } else {
        tx.Rollback()
    }
}()
This made my program run for a month without interruption, but it has just happened again, probably because of the same issue. Is there a better way to close transactions? Or maybe a way to close a transaction if it has been open for 1 minute?
If you want to roll back the transaction after 1 minute, you can use the standard Go timeout pattern (below).
However, there may be a better way to deal with your problem. You don't give a lot of information about it, but you are probably serving standard HTTP requests, which carry contexts. In that case, it may help you to know that a transaction started within a context (via BeginTx) is automatically rolled back when the context is cancelled. One way to make sure that a context is cancelled if something gets stuck is to give it a deadline.
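A minimal sketch of that approach, assuming database/sql and a standard HTTP handler (db stands in for your actual *sql.DB; the 1-minute limit is the one from your question):

func handler(w http.ResponseWriter, r *http.Request) {
    // Derive a context that is cancelled after 1 minute, or when the
    // client disconnects, whichever comes first.
    ctx, cancel := context.WithTimeout(r.Context(), time.Minute)
    defer cancel()

    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    // Safe even after a successful Commit: Rollback then simply
    // returns sql.ErrTxDone, so the transaction can never be leaked.
    defer tx.Rollback()

    // ... run your queries with tx.ExecContext(ctx, ...) ...

    if err := tx.Commit(); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
    }
}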
Excerpted from Channels - Timeouts. The original authors were Chris Lucas, Kwarrtz, and Rodolfo Carvalho. Attribution details can be found on the contributor page. The source is licensed under CC BY-SA 3.0 and may be found in the Documentation archive. Reference topic ID: 1263 and example ID: 6050.
Timeouts
Channels are often used to implement timeouts.
package main

import "time"

func main() {
    // Create a buffered channel to prevent a goroutine leak. The buffer
    // ensures that the goroutine below can eventually terminate, even if
    // the timeout is met. Without the buffer, the send on the channel
    // blocks forever, waiting for a read that will never happen, and the
    // goroutine is leaked.
    ch := make(chan struct{}, 1)

    go func() {
        time.Sleep(10 * time.Second)
        ch <- struct{}{}
    }()

    select {
    case <-ch:
        // Work completed before timeout.
    case <-time.After(1 * time.Second):
        // Work was not completed after 1 second.
    }
}
Related
I'm having an issue with disposing of a Kafka consumer at the end of program execution. Here is the code responsible for closing the consumer:
func (kc *KafkaConsumer) Dispose() {
    Sugar.Info("Disposing of consumer")
    kc.mu.Lock()
    kc.Consumer.Close()
    Sugar.Info("Disposed of consumer")
    kc.mu.Unlock()
}
As you might have noticed, I'm using a sync.Mutex, since the consumer is accessed by multiple goroutines. Below is another snippet, responsible for reading messages from Kafka:
func (kc *KafkaConsumer) Consume(signalChan chan os.Signal, ctx context.Context) {
    for {
        select {
        case sig := <-signalChan:
            Sugar.Info("Caught signal %v", sig)
            break
        case <-ctx.Done():
            Sugar.Info("Got context done message. Closing consumer...")
            kc.Dispose()
            break
        default:
            for {
                message, err := kc.Consumer.ReadMessage(-1)
                if err != nil {
                    Log.Error(err.Error())
                    return
                }
                Sugar.Infof("Got a new message %v", message)
                resp := make(chan *KafkaResponseEntity)
                go router.UseMessage(*message, resp, ctx)
                // Potential deadlock
                response := <-resp
                // Explicit commit of an offset in order to ensure
                // that the request has been successfully processed.
                kc.Consumer.Commit()
                Sugar.Info("Successfully commited an offset")
                Sugar.Infof("Just got a response %v", response)
                go producer.KP.Produce(response.PaymentId, response.Bytes, "some_random_topic")
            }
        }
    }
}
The problem is that when closing the consumer, program execution simply stalls.
Are there any issues here? Should I use a cond along with the mutex? I'd be very glad if you could provide a thorough explanation of what might have gone wrong in my code.
Thanks in advance.
I suspect this is hanging because of:
kc.Consumer.ReadMessage(-1)
The documentation states that this will block indefinitely, which is why the consumer never gets closed. The simplest approach would be to pass a positive time duration instead (e.g. 1 * time.Second), but then you may get timeout errors if messages are not consumed within the timeout. The timeout error is generally innocuous, but it is something to account for. From the linked documentation:
Timeout is returned as (nil, err) where err is
`err.(kafka.Error).Code() == kafka.ErrTimedOut`
I'm not yet sure of a good way to utilize the indefinite blocking and allow it to be interrupted. If anyone does know please post the findings!
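For what it's worth, a sketch of the timeout-based loop, assuming the confluent-kafka-go ReadMessage API used in the question (the context check between reads is one way to make Dispose reachable):

for {
    select {
    case <-ctx.Done():
        kc.Dispose()
        return
    default:
    }
    message, err := kc.Consumer.ReadMessage(1 * time.Second)
    if err != nil {
        // Timeouts are expected when no message arrives in time;
        // treat them as a cue to re-check the context above.
        if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrTimedOut {
            continue
        }
        Log.Error(err.Error())
        return
    }
    // ... process message as before ...
}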
Why does this receiver go routine refuse to terminate when the connection is closed
This runs as expected, but then randomly, once every 20 to 10,000 times it is called, a receiver will fail to shut down, which causes a goroutine leak, leading to 100% CPU.
Note: if I log all errors, I see read on a closed channel when conn.SetReadDeadline is commented out. When it is used, I see i/o timeout as the error.
This ran for 10k cycles, where the main process starts 11 pairs of these senders/receivers and they process 1000 jobs before the main process sends the shutdown signal. This setup ran overnight for more than 6 hours without any issue, up to the 10k-cycle mark, but this morning I can't get it to run more than 20 cycles without a receiver being flagged as not shutting down and exiting.
func sender(w worker, ch channel) {
    var j job
    for {
        select {
        case <-ch.quit: // shutdown broadcast, exit
            w.Close()
            ch.stopped <- w.id // debug, send stop confirmed
            return
        case j = <-w.job: // worker designated jobs
        case j = <-ch.spawner: // FCFS jobs
        }
        // ... prepare job ...
        w.WriteToUDP(buf, w.addr)
    }
}
func receiver(w worker, ch channel) {
    deadline := 100 * time.Millisecond
out:
    for {
        w.SetReadDeadline(time.Now().Add(deadline))
        // blocking read, should release on close (or deadline)
        n, err = w.c.Read(buf)

        select {
        case <-ch.quit: // shutdown broadcast, exit
            ch.stopped <- w.id + 100 // debug, receiver stop confirmed
            return
        default:
        }

        if n == 0 || err != nil {
            continue
        }

        update := &update{id: w.id}
        // ... process update logic ...

        select {
        case <-ch.quit: // shutting down
            break out
        case ch.update <- update:
        }
    }
}
I need a reliable way to get the receiver to shut down when it gets either the shutdown broadcast OR the connection is closed. Functionally, closing the connection should be enough, and it is the preferred method according to the net package documentation; see the Conn interface.
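For reference, a tiny illustration of that documented behavior, using a hypothetical local endpoint (closing a net.Conn makes a blocked Read return with an error):

conn, err := net.Dial("udp", "127.0.0.1:9999") // hypothetical address
if err != nil {
    log.Fatal(err)
}
go func() {
    time.Sleep(100 * time.Millisecond)
    conn.Close() // should unblock the Read below immediately
}()
buf := make([]byte, 1024)
_, err = conn.Read(buf) // returns "use of closed network connection"
fmt.Println(err)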
I upgraded to the most recent Go, which is 1.12.1, with no change.
Running on macOS in development and CentOS in production.
Has anyone run into this problem?
If so, how did you reliably fix it?
Possible Solution
My very verbose and icky solution that seems to work as a workaround is to do this:
1) Start the sender in a goroutine, as above (unchanged).
2) Start the receiver in a goroutine, as below.
func receive(w worker, ch channel) {
    request := make(chan []byte, 1)
    reader := make(chan []byte, 1)
    defer func() {
        close(request) // exit signaling
        w.c.Close()    // exit signaling
        //close(reader)
    }()

    go func() {
        // untried scenario, for if we still have leaks -> 100% cpu
        // we may need to be totally reliant on closing request or ch.quit
        // defer w.c.Close()
        deadline := 100 * time.Millisecond
        var n int
        var err error
        for buf := range request {
            for {
                select {
                case <-ch.quit: // shutdown signal
                    return
                default:
                }
                w.c.SetReadDeadline(time.Now().Add(deadline))
                n, err = w.c.Read(buf)
                if err != nil { // timeout or close
                    continue
                }
                break
            }
            select {
            case <-ch.quit: // shutdown signal
                return
            case reader <- buf[:n]:
                //default:
            }
        }
    }()

    var buf []byte
out:
    for {
        request <- make([]byte, messageSize)
        select {
        case <-ch.quit: // shutting down
            break out
        case buf = <-reader:
        }

        update := &update{id: w.id}
        // ... process update logic ...

        select {
        case <-ch.quit: // shutting down
            break out
        case ch.update <- update:
        }
    }
}
My question is: why does this horrendous version 2, which spawns a new goroutine to read from the blocking c.Read(buf), seem to work more reliably (it does not leak when the shutdown signal is sent), when the much simpler first version didn't? It seems to be essentially the same thing, given the blocking c.Read(buf).
Downvoting my question is NOT helpful; this is a legitimate and verifiably repeatable issue, and the question remains unanswered.
Thanks everyone for the responses.
So: there wasn't ever a stack trace. In fact, I got NO errors at all, not even a race detection, and it wasn't deadlocked; a goroutine just would not shut down and exit, and it wasn't consistently reproducible. I'd been running the same data for two weeks.
When a goroutine failed to report that it was exiting, it would simply spin out of control and drive the CPU to 100%, but only AFTER all the others had exited and the system had moved on. I never saw memory grow. CPU would gradually tick up to 200%, 300%, 400%, and that's when the system had to be rebooted.
I logged when a leak was happening, and it was always a different receiver. I'd get one leak after 380 prior successful runs (of 23 pairs of goroutines running in unison), the next time 1,832 before one receiver leaked, the next time only 23, with the exact same data being chewed on from the same starting point. The leaked receiver just spun out of control, but only after its group of 22 companions had all shut down and exited successfully and the system had moved to the next batch. It would not fail consistently, other than being guaranteed to leak at some point.
After many days, numerous rewrites, and a million log lines before/after every action, this finally seems to be what the issue was, and after digging through the library I'm not sure why exactly, nor WHY it only happens randomly.
For whatever reason, the golang.org/x/net/dns/dnsmessage library will randomly freak out if you parse and go straight to skipping questions without reading a question first. No idea why that matters; skipping questions should mean you don't care about that header section and want it marked as processed, and it works fine literally a million times in a row, but then doesn't. So you seem to be required to read a question BEFORE you can SkipAllQuestions, since this appears to have been the solution. I'm 18,525 batches in, and adding that turned the leaks off.
var p dnsmessage.Parser
h, err := p.Start(buf[:n])
if err != nil {
    continue // what!?
}
switch {
case h.RCode == dnsmessage.RCodeSuccess:
    q, err := p.Question() // should only have one question
    if q.Type != w.Type || err != nil {
        continue // what!?, impossible
    }
    // Note: if you do NOT do the above first, you're asking for pain! (tr)
    if err := p.SkipAllQuestions(); err != nil {
        continue // what!?
    }
    // Do not count as "received" until we have passed the above point and
    // validated that the response had a question that we could skip...
}
I'm trying to develop a simple API for Slack, and I want to return something to the user right away to avoid the 3000 ms timeout.
Here are my questions:
Why does "This should be printed to Slack first" not get printed right away? Instead I only get the last message, "The long and blocking process completed" (it does appear in the ngrok log, though).
Why is my function still hitting the 3000 ms limit even though I'm already using a goroutine? Is it because of the done channel?
func testFunc(w http.ResponseWriter, r *http.Request) {
    // Return to the user ASAP to avoid the 3000ms timeout.
    // But this doesn't work. Nothing is returned, but
    // the message appears in the ngrok log.
    fmt.Fprintln(w, "This should be printed to Slack first!")

    // Get the response URL.
    r.ParseForm()
    responseURL := r.FormValue("response_url")

    done := make(chan bool)

    go func() {
        fmt.Println("Warning! This is a long and blocking process.")
        time.Sleep(5 * time.Second)
        done <- true
    }()

    // This works! I received this message. But I still reached the 3000ms timeout.
    func(d bool) {
        if d == true {
            payload := map[string]string{"text": "The long and blocking process completed!"}
            j, err := json.Marshal(payload)
            if err != nil {
                w.WriteHeader(http.StatusInternalServerError)
            }
            http.Post(responseURL, "application/json", bytes.NewBuffer(j))
        }
    }(<-done)
}
http.ResponseWriter streams are buffered by default. If you want data to be sent to a client in realtime (e.g. HTTP SSE), you need to flush the stream after each 'event':
wf, ok := w.(http.Flusher)
if !ok {
http.Error(w, "Streaming unsupported!", http.StatusInternalServerError)
return
}
fmt.Fprintln(w, "This should be printed to Slack first!")
wf.Flush()
Flushing is expensive, so take advantage of Go's buffering. There is always an implicit flush once your handler finally exits (hence why you saw your output 'late').
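That implicit flush also suggests the simplest fix for the 3000 ms limit: let the handler return immediately and post the final result to the response_url from the goroutine, instead of blocking on <-done. A sketch under that assumption (not the only way to structure it):

func testFunc(w http.ResponseWriter, r *http.Request) {
    r.ParseForm()
    responseURL := r.FormValue("response_url")

    // This is flushed when the handler returns, well within 3000 ms.
    fmt.Fprintln(w, "This should be printed to Slack first!")

    go func() {
        time.Sleep(5 * time.Second) // the long and blocking process

        payload := map[string]string{"text": "The long and blocking process completed!"}
        j, err := json.Marshal(payload)
        if err != nil {
            log.Println(err) // the handler has returned; just log
            return
        }
        // Don't touch w here: deliver the result via the response URL.
        http.Post(responseURL, "application/json", bytes.NewBuffer(j))
    }()
    // No receive from a channel here, so the handler returns immediately.
}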
I am at the learning stage of Golang. Here is my understanding:
1. Any operation on an (unbuffered) channel is blocking.
2. You are writing to the channel in:
go func() {
    fmt.Println("Warning! This is a long and blocking process.")
    time.Sleep(5 * time.Second)
    done <- true
}()
The main function keeps executing until it tries to read from the channel, where it blocks, waiting to see whether something has been written to the channel. When the function above is done writing to the channel, control comes back and main starts executing again.
Note: Experts will be able to explain better.
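A tiny self-contained demonstration of that blocking receive (a sketch, not code from the question):

package main

import (
    "fmt"
    "time"
)

func main() {
    done := make(chan bool)
    go func() {
        time.Sleep(2 * time.Second)
        done <- true // unblocks the receive below
    }()
    fmt.Println("waiting on the channel")
    <-done // main blocks here until the goroutine sends
    fmt.Println("received; main continues")
}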
I'm implementing a feature where I need to read files from a directory, then parse and export them to a REST service at a regular interval. As part of the same work, I would like to gracefully handle program termination (SIGINT, SIGQUIT, SIGTERM, etc.).
Toward that end, I would like to know how to implement context-based cancellation of the process.
For executing the flow at a regular interval, I'm using gocron.
cmd/scheduler.go
func scheduleTask() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    s := gocron.NewScheduler()
    s.Every(10).Minutes().Do(processTask, ctx)
    s.RunAll()  // run immediately
    <-s.Start() // schedule
    for {
        select {
        case <-ctx.Done():
            fmt.Print("context done")
            s.Remove(processTask)
            s.Clear()
            cancel()
        default:
        }
    }
}

func processTask(ctx *context.Context) {
    task.Export(ctx)
}
task/export.go
func Export(ctx *context.Context) {
    pendingFiles, err := filepath.Glob("/tmp/pending/" + "*_task.json")
    // error handling
    // As there can be 100s of files, I would like to break the loop on
    // context.Done(), to return ASAP and clean up the resources here as well.
    for _, fileName := range pendingFiles {
        exportItem(fileName)
    }
}

func exportItem(fileName string) {
    data, err := ReadFile(fileName) // not shown here for brevity
    // err handling
    err = postHTTPData(string(data)) // not shown for brevity
    // err handling
}
For the process management side, I think the other component is the actual handling of signals and managing the context from those signals.
I'm not sure of the specifics of gocron (they have an example showing some of these concepts on their GitHub), but in general I think the steps involved are:
Registering an OS signal handler
Waiting to receive a signal
Canceling the top-level context in response to a signal
Example:
sigCh := make(chan os.Signal, 1)
defer close(sigCh)

signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGQUIT, syscall.SIGINT)

<-sigCh  // block until a signal arrives
cancel() // the cancel from the context.WithCancel that created the top-level ctx
I'm not sure how this will look in the context of gocron, but the context that the signal-handling code cancels should be a parent of the context that the task and job are given.
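On the task side, a sketch of making Export context-aware, which is what the inline comment in the question asks for (note: this assumes the idiomatic context.Context value rather than the *context.Context pointer used in the question):

func Export(ctx context.Context) error {
    pendingFiles, err := filepath.Glob("/tmp/pending/" + "*_task.json")
    if err != nil {
        return err
    }
    for _, fileName := range pendingFiles {
        // Check for cancellation between files so we stop ASAP
        // and clean up instead of exporting all remaining files.
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        exportItem(fileName)
    }
    return nil
}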
Worked this out myself just now. I've always felt the blog post on contexts was A LOT of material to try to understand, so a simpler demonstration would be nice.
There are many scenarios you may encounter; each one is different and will require adaptation. Here's one example:
Say you have a consumer of a channel that could run for an indeterminate amount of time:
indeterminateChannel := make(chan string)

for s := range indeterminateChannel {
    fmt.Println(s)
}
Your producer might look something like:
for {
    indeterminateChannel <- "Terry"
}
We don't control the producer, so we need some way to cut out of the print loop if the producer exceeds our time limit:
indeterminateChannel := make(chan string)
// Close the channel when our code exits so OUR for loop no longer occupies
// resources and the goroutine exits.
// The producer will have a problem, but we don't care about their problems,
// in this instance.
defer close(indeterminateChannel)

// We wait for this context to time out below the goroutine.
ctx, cancel := context.WithTimeout(context.TODO(), time.Minute*1)
defer cancel()

go func() {
    for s := range indeterminateChannel {
        fmt.Println(s)
    }
}()

<-ctx.Done() // wait for the context to terminate based on a timeout
You can also check ctx.Err() to see whether the context exited due to a timeout or because it was cancelled.
You might also want to learn about how to properly check if the context failed due to a deadline: How to check if an error is "deadline exceeded" error?
Or if the context was canceled: How to check if a request was cancelled
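For completeness, a small sketch of the checks those two links describe, using the standard errors package:

<-ctx.Done()
switch {
case errors.Is(ctx.Err(), context.DeadlineExceeded):
    fmt.Println("context hit its deadline")
case errors.Is(ctx.Err(), context.Canceled):
    fmt.Println("context was cancelled explicitly")
}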
(I don't believe my issue is a duplicate of this Q&A: go routine blocking the others one, because I'm running Go 1.9, which has the preemptive scheduler, whereas that question was asked about Go 1.2.)
My Go program calls into a C library, wrapped by another Go library, that makes a blocking call which can last over 60 seconds. I want to add a timeout so it returns in 3 seconds:
Old code with long block:
// InvokeSomething is part of a Go wrapper library that calls the C library's
// read_something function. I cannot change this code.
func InvokeSomething() ([]Something, error) {
    ret := clib.read_something(&input) // this can block for 60 seconds
    if ret.Code > 1 {
        return nil, CreateError(ret)
    }
    return ret.Something, nil
}

// This is my code, which I can change:
func MyCode() {
    something, err := InvokeSomething()
    // etc
}
Here is my code with a goroutine, channel, and timeout, based on this Go example: https://gobyexample.com/timeouts
type somethingResult struct {
    Something []Something
    Err       error
}

func MyCodeWithTimeout() {
    ch := make(chan somethingResult, 1)
    go func() {
        something, err := InvokeSomething() // blocks here for 60 seconds
        ret := somethingResult{something, err}
        ch <- ret
    }()
    select {
    case result := <-ch:
        // etc
    case <-time.After(time.Second * 3):
        // report timeout
    }
}
However, when I run MyCodeWithTimeout it still takes 60 seconds before it executes the case <-time.After(time.Second * 3) block.
I know that attempting to read from an unbuffered channel with nothing in it will block, but I created the channel with a buffer size of 1, so as far as I can tell I'm doing it correctly. I'm surprised the Go scheduler isn't preempting my goroutine, or does that depend on execution being in Go code rather than an external native library?
Update:
I read that the Go scheduler, at least as of 2015, is actually "semi-preemptive" and doesn't preempt OS threads that are in "external code": https://github.com/golang/go/issues/11462
you can think of the Go scheduler as being partially preemptive. It's by no means fully cooperative, since user code generally has no control over scheduling points, but it's also not able to preempt at arbitrary points
I heard that runtime.LockOSThread() might help, so I changed the function to this:
func MyCodeWithTimeout() {
    ch := make(chan somethingResult, 1)
    defer close(ch)
    go func() {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        something, err := InvokeSomething() // blocks here for 60 seconds
        ret := somethingResult{something, err}
        ch <- ret
    }()
    select {
    case result := <-ch:
        // etc
    case <-time.After(time.Second * 3):
        // report timeout
    }
}
...however it didn't help at all and it still blocks for 60 seconds.
Your proposed solution of locking the thread in the goroutine started by MyCodeWithTimeout() does not guarantee that MyCodeWithTimeout() will return after 3 seconds, for two reasons. First, there is no guarantee that the started goroutine will get scheduled and reach the point where it locks the thread to the goroutine. Second, even if the external call or syscall returns within 3 seconds, there is no guarantee that the other goroutine, the one running MyCodeWithTimeout(), will get scheduled to receive the result.
Instead, do the thread locking in MyCodeWithTimeout(), not in the goroutine it starts:
func MyCodeWithTimeout() {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    ch := make(chan somethingResult, 1)
    defer close(ch)
    go func() {
        something, err := InvokeSomething() // blocks here for 60 seconds
        ret := somethingResult{something, err}
        ch <- ret
    }()
    select {
    case result := <-ch:
        // etc
    case <-time.After(time.Second * 3):
        // report timeout
    }
}
Now when MyCodeWithTimeout() starts executing, it locks the goroutine to an OS thread, and you can be sure that this goroutine will notice the value sent on the timer's channel.
NOTE: This is better if you want it to return within 3 seconds, but it still gives no hard guarantee, as the timer that fires (sends a value on its channel) runs in its own goroutine, and the thread locking has no effect on the scheduling of that goroutine.
If you want a guarantee, you can't rely on other goroutines delivering the "exit" signal; you can only rely on it happening in the goroutine running the MyCodeWithTimeout() function (because, since you did the thread locking, you can be sure it gets scheduled).
An "ugly" solution which spins up CPU usage for a given CPU core would be:
for end := time.Now().Add(time.Second * 3); time.Now().Before(end); {
    // Do a non-blocking check:
    select {
    case result := <-ch:
        // Process result
    default: // Must have a default branch to be non-blocking
    }
}
Note that the "urge" of using time.Sleep() in this loop would take away the guarantee, as time.Sleep() may use goroutines in its implementation and certainly does not guarantee to return exactly after the given duration.
Also note that if you have 8 CPU cores and runtime.GOMAXPROCS(0) returns 8, and your goroutines are still "starving", this may work as a temporary solution, but you have more serious problems with your use of Go's concurrency primitives (or your lack of using them), and you should investigate and fix those. Locking a thread to a goroutine may even make things worse for the rest of the goroutines.