Race condition reading stdout and stderr of child process - go

In Go, I'm trying to:
start a subprocess
read from stdout and stderr separately
implement an overall timeout
After much googling, we've come up with some code that seems to do the job, most of the time. But there seems to be a race condition whereby some output is not read.
The problem seems to only occur on Linux, not Windows.
Following the simplest possible solution found with google, we tried creating a context with a timeout:
context.WithTimeout(context.Background(), 10*time.Second)
While this worked most of the time, we were able to find cases where it would just hang forever. There was some aspect of the child process that caused this to deadlock. (Something to do with grandchildren that were not sufficiently disassociated from the child process, and thus caused the child to never completely exit.)
Also, it seemed that in some cases the error that is returned when the timeout occurs would indicate a timeout, but would only be delivered after the process had actually exited (thus making the whole concept of the timeout useless).
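In outline, that attempt was roughly this (a sketch: exec.CommandContext is the standard-library call, and on timeout it kills only the immediate child, not any grandchildren):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, command, args...)
var out, eout bytes.Buffer
cmd.Stdout = &out // capture stdout and stderr separately
cmd.Stderr = &eout
err := cmd.Run() // on timeout only cmd.Process is killed, which is why this can hang

The code we have ended up with instead is below.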
func GetOutputsWithTimeout(command string, args []string, timeout int) (io.ReadCloser, io.ReadCloser, int, error) {
    start := time.Now()
    procLogger.Tracef("Initializing %s %+v", command, args)
    cmd := exec.Command(command, args...)

    // get pipes to standard output/error
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return emptyReader(), emptyReader(), -1, fmt.Errorf("cmd.StdoutPipe() error: %+v", err.Error())
    }
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return emptyReader(), emptyReader(), -1, fmt.Errorf("cmd.StderrPipe() error: %+v", err.Error())
    }

    // setup buffers to capture standard output and standard error
    var buf bytes.Buffer
    var ebuf bytes.Buffer

    // create a channel to capture any errors from wait
    done := make(chan error)

    // create a semaphore to indicate when both pipes are closed
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        if _, err := buf.ReadFrom(stdout); err != nil {
            procLogger.Debugf("%s: Error Slurping stdout: %+v", command, err)
        }
        wg.Done()
    }()
    go func() {
        if _, err := ebuf.ReadFrom(stderr); err != nil {
            procLogger.Debugf("%s: Error Slurping stderr: %+v", command, err)
        }
        wg.Done()
    }()

    // start process
    procLogger.Debugf("Starting %s", command)
    if err := cmd.Start(); err != nil {
        procLogger.Errorf("%s: failed to start: %+v", command, err)
        return emptyReader(), emptyReader(), -1, fmt.Errorf("cmd.Start() error: %+v", err.Error())
    }

    go func() {
        procLogger.Debugf("Waiting for %s (%d) to finish", command, cmd.Process.Pid)
        err := cmd.Wait() // this can be 'forced' by the killing of the process
        procLogger.Tracef("%s finished: errStatus=%+v", command, err) // err could be nil here
        // notify select of completion, and the status
        done <- err
    }()

    // Wait for timeout or completion.
    select {
    // Timed out
    case <-time.After(time.Duration(timeout) * time.Second):
        elapsed := time.Since(start)
        procLogger.Errorf("%s: timeout after %.1f\n", command, elapsed.Seconds())
        if err := TerminateTree(cmd); err != nil {
            return ioutil.NopCloser(&buf), ioutil.NopCloser(&ebuf), -1,
                fmt.Errorf("failed to kill %s, pid=%d: %+v",
                    command, cmd.Process.Pid, err)
        }
        wg.Wait() // this *should* take care of waiting for stdout and stderr to be collected after we killed the process
        return ioutil.NopCloser(&buf), ioutil.NopCloser(&ebuf), -1,
            fmt.Errorf("%s: timeout %d s reached, pid=%d process killed",
                command, timeout, cmd.Process.Pid)

    // Exited normally or with a non-zero exit code
    case err := <-done:
        wg.Wait() // this *should* take care of waiting for stdout and stderr to be collected after the process terminated naturally.
        elapsed := time.Since(start)
        procLogger.Tracef("%s: Done after %.1f\n", command, elapsed.Seconds())
        rc := -1
        // Note that we have to use go1.10 compatible mechanism.
        if err != nil {
            procLogger.Tracef("%s exited with error: %+v", command, err)
            exitErr, ok := err.(*exec.ExitError)
            if ok {
                ws := exitErr.Sys().(syscall.WaitStatus)
                rc = ws.ExitStatus()
            }
            procLogger.Debugf("%s exited with status %d", command, rc)
            return ioutil.NopCloser(&buf), ioutil.NopCloser(&ebuf), rc,
                fmt.Errorf("%s: process done with error: %+v",
                    command, err)
        } else {
            ws := cmd.ProcessState.Sys().(syscall.WaitStatus)
            rc = ws.ExitStatus()
        }
        procLogger.Debugf("%s exited with status %d", command, rc)
        return ioutil.NopCloser(&buf), ioutil.NopCloser(&ebuf), rc, nil
    }
    // NOTREACHED: should not reach this line!
}
Calling GetOutputsWithTimeout("uname", []string{"-mpi"}, 10) will return the expected single line of output most of the time. But sometimes it will return no output, as if the goroutine that reads stdout didn't start soon enough to "catch" all the output (or exited early?). The "most of the time" strongly suggests a race condition.
We will also sometimes see errors from the goroutines about "file already closed" (this seems to happen under the timeout condition, but will happen at other, "normal" times as well).
I would have thought that starting the goroutines before the cmd.Start() would have ensured that no output would be missed, and that using the WaitGroup would guarantee they would both complete before reading the buffers.
So how are we missing output? Is there still a race condition between the two "reader" goroutines and the cmd.Start()? Should we ensure those two are running using yet another WaitGroup?
Or is there a problem with the implementation of ReadFrom()?
Note that we are currently using go1.10 due to backward-compatibility problems with older OSs but the same effect occurs with go1.12.4.
Or are we overthinking this, and a simple implementation with context.WithTimeout() would do the job?

But sometimes it will return no output, as if the goroutine that reads stdout didn't start soon enough to "catch" all the output
This is impossible, because a pipe can't "lose" data. If the process is writing to stdout and the Go program isn't reading yet, the write simply blocks once the kernel's pipe buffer fills; nothing is discarded, and whatever is already in the pipe remains readable even after the writer exits.
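One way to see this for yourself is to sleep before reading: the data is still there afterwards, because the kernel buffers it until it is read, even after the writer has exited. A self-contained sketch (assumes a Unix-like system with an echo binary on the PATH):

package main

import (
    "fmt"
    "io/ioutil"
    "os/exec"
    "time"
)

func main() {
    cmd := exec.Command("echo", "hello")
    stdout, _ := cmd.StdoutPipe()
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    time.Sleep(2 * time.Second)       // the child wrote and exited long ago...
    data, _ := ioutil.ReadAll(stdout) // ...but the pipe still holds "hello\n"
    fmt.Printf("%q\n", data)
    cmd.Wait()
}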
The simplest way to approach the problem is:
Launch goroutines to collect stdout, stderr
Launch a timer that kills the process
Start the process
Wait for it to finish (or be killed by the timer) with .Wait()
If the timer fired, return a timeout error
Otherwise, handle the error from Wait
func GetOutputsWithTimeout(command string, args []string, timeout int) ([]byte, []byte, int, error) {
    cmd := exec.Command(command, args...)

    // get pipes to standard output/error
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, nil, -1, fmt.Errorf("cmd.StdoutPipe() error: %+v", err.Error())
    }
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return nil, nil, -1, fmt.Errorf("cmd.StderrPipe() error: %+v", err.Error())
    }

    // setup buffers to capture standard output and standard error
    var stdoutBuf, stderrBuf []byte

    // create 3 goroutines: stdout, stderr, timer.
    // Use a waitgroup to wait.
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        var err error
        if stdoutBuf, err = ioutil.ReadAll(stdout); err != nil {
            log.Printf("%s: Error Slurping stdout: %+v", command, err)
        }
        wg.Done()
    }()
    go func() {
        var err error
        if stderrBuf, err = ioutil.ReadAll(stderr); err != nil {
            log.Printf("%s: Error Slurping stderr: %+v", command, err)
        }
        wg.Done()
    }()

    t := time.AfterFunc(time.Duration(timeout)*time.Second, func() {
        cmd.Process.Kill()
    })

    // start process
    if err := cmd.Start(); err != nil {
        t.Stop()
        return nil, nil, -1, fmt.Errorf("cmd.Start() error: %+v", err.Error())
    }

    err = cmd.Wait()
    timedOut := !t.Stop()
    wg.Wait()

    // check if the timer timed out.
    if timedOut {
        return stdoutBuf, stderrBuf, -1,
            fmt.Errorf("%s: timeout %d s reached, pid=%d process killed",
                command, timeout, cmd.Process.Pid)
    }
    if err != nil {
        rc := -1
        if exitErr, ok := err.(*exec.ExitError); ok {
            rc = exitErr.Sys().(syscall.WaitStatus).ExitStatus()
        }
        return stdoutBuf, stderrBuf, rc,
            fmt.Errorf("%s: process done with error: %+v",
                command, err)
    }

    // cmd.Wait docs say that if err == nil, exit code is 0
    return stdoutBuf, stderrBuf, 0, nil
}
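For reference, calling it for the uname example from the question would look something like this (sketch):

stdout, stderr, rc, err := GetOutputsWithTimeout("uname", []string{"-mpi"}, 10)
if err != nil {
    log.Printf("uname failed (rc=%d): %v", rc, err)
}
fmt.Printf("stdout: %s", stdout) // expected: a single line of output
fmt.Printf("stderr: %s", stderr)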

Related

goroutine hangs when passing data

Given the following variables:
wg.Add(4)
pDayResCh := make(chan map[string]map[string]int)
pDayErrCh := make(chan error)
The following code hangs:
// Convert the previous, current, and next day prayers and put them in map
go func() {
    fmt.Println("executing first go routine") // TODO: Remove this line after debug
    data, err := dayPrayerMapConv(previousDayPrayers)
    fmt.Printf("first goroutine result %v\n", data) // TODO: Remove this line after debug
    if err != nil {
        fmt.Printf("first goroutine err != nil %v\n", err)
        pDayErrCh <- err
    }
    fmt.Printf("first goroutine putting data into channel")
    pDayResCh <- data
    fmt.Printf("first go routine finished") // TODO: Remove this line after debug
    wg.Done()
}()

pDayErr := <-pDayErrCh
close(pDayErrCh)
if pDayErr != nil {
    return pDayErr
}
fmt.Println("pday err finised")

p.PreviousDayPrayers = <-pDayResCh
close(pDayResCh)
This is the output of the debug print statements:
first goroutine result map[Asr:map[Hour:3 Minute:28] Dhuhr:map[Hour:12 Minute:23] Fajr:map[Hour:5 Minute:32] Isha:map[Hour:7 Minute:5] Maghrib:map[Hour:6 Minute:13]]
first goroutine putting data into channel
So there is data in the data variable that should have been passed into pDayResCh, but the send seems to get stuck. Why?
It was due to the condition in my goroutine where I check for the error:
if err != nil {
    fmt.Printf("first goroutine err != nil %v\n", err)
    pDayErrCh <- err
}
Because the error is nil, nothing is ever sent on pDayErrCh, so the main goroutine blocks forever on pDayErr := <-pDayErrCh while the worker blocks on its send to pDayResCh: a deadlock.
Removing the condition and sending the err regardless solves the issue.
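A sketch of that fix, using the same names as above (moving wg.Done into a defer is an extra safety tweak, not part of the original):

go func() {
    defer wg.Done()
    data, err := dayPrayerMapConv(previousDayPrayers)
    pDayErrCh <- err // send unconditionally, even when err is nil
    pDayResCh <- data
}()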

How to immediately exit from exec command via context cancellation?

I am streaming command output to a client with this code. The command is built with context cancellation. The client sends a "cancel" request to the server which notifies the client's cancelCh which triggers cancel().
The issue I'm having is that when the command is cancelled, the rest of the command output streams to the client as if the command was not cancelled. After the command has completed, exit status 1 is received, which shows that the command was indeed cancelled.
If I move the done channel to block after cmd.Wait() instead of before, I get the behavior I expect. The client immediately gets exit status 1 and no more data is sent. But that seems to cause a data race issue: https://github.com/golang/go/issues/19685. That issue is old but I think it's relevant.
What is the proper way to stream output to the client in real-time while also immediately exiting via context cancellation?
go func() {
    defer func() {
        cancel()
    }()
    <-client.cancelCh
}()

output := make(chan []byte)
go execute(cmd, output)

for data := range output {
    fmt.Fprintf(w, "data: %s\n\n", data)
    flusher.Flush()
}
func execute(cmd *exec.Cmd, output chan []byte) {
    defer close(output)

    cmdReader, err := cmd.StdoutPipe()
    if err != nil {
        output <- []byte(fmt.Sprintf("Error getting stdout pipe: %v", err))
        return
    }
    cmd.Stderr = cmd.Stdout

    scanner := bufio.NewScanner(cmdReader)
    done := make(chan struct{})
    go func() {
        for scanner.Scan() {
            output <- scanner.Bytes()
        }
        done <- struct{}{}
    }()

    err = cmd.Start()
    if err != nil {
        output <- []byte(fmt.Sprintf("Error executing: %v", err))
        return
    }

    <-done

    err = cmd.Wait()
    if err != nil {
        output <- []byte(err.Error())
    }
    //<-done
}

Will a goroutine leak happen with a buffered channel (capacity one) that has two senders but only one receiver?

I have a function that is used to forward a message between two io.ReadWriters. Once an error happens, I need to log the error and return. But I think I may have a goroutine leakage problem in my code:
func transport(rw1, rw2 io.ReadWriter) error {
    errc := make(chan error, 1) // only one buffer slot
    go func() {
        _, err := io.Copy(rw1, rw2)
        errc <- err
    }()
    go func() {
        _, err := io.Copy(rw2, rw1)
        errc <- err
    }()
    err := <-errc // only one error caught
    if err != nil && err == io.EOF {
        err = nil
    }
    return err
}
Because only one error can be caught in this function, will the second goroutine still exit and be garbage-collected normally? Or should I write one more receive from errc to collect the other error?
The value from one goroutine is received and the other is buffered. Both goroutines can send to the channel and exit. There is no leak.
You might want to receive both values to ensure that the application detects an error when the first goroutine to send succeeds and the second goroutine encounters an error:
var err error
for i := 0; i < 2; i++ {
    if e := <-errc; e != nil {
        err = e
    }
}
Because io.Copy does not return io.EOF, there's no need to check for io.EOF when collecting the errors.
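That contract is easy to verify (io.Copy reads until EOF and reports that case as a nil error):

n, err := io.Copy(ioutil.Discard, strings.NewReader("hello"))
fmt.Println(n, err) // prints: 5 <nil>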
The code can be simplified to use a single goroutine:
errc := make(chan error, 1)
go func() {
    _, err := io.Copy(rw1, rw2)
    errc <- err
}()
_, err := io.Copy(rw2, rw1)
if e := <-errc; e != nil {
    err = e
}

Leaking goroutine when a non-blocking readline hangs

Assuming you have a structure like this:
ch := make(chan string)
errCh := make(chan error)

go func() {
    line, _, err := bufio.NewReader(r).ReadLine()
    if err != nil {
        errCh <- err
    } else {
        ch <- string(line)
    }
}()

select {
case err := <-errCh:
    return "", err
case line := <-ch:
    return line, nil
case <-time.After(5 * time.Second):
    return "", TimeoutError
}
In the case of the 5 second timeout, the goroutine hangs until ReadLine returns, which may never happen. My project is a long-running server, so I don't want a buildup of stuck goroutines.
ReadLine will not return until either the process exits or the method reads a line. There's no deadline or timeout mechanism for pipes.
The goroutine will block if the call to ReadLine returns after the timeout. This can be fixed by using buffered channels:
ch := make(chan string, 1)
errCh := make(chan error, 1)
The application should call Wait to clean up resources associated with the command. The goroutine is a good place to call it:
go func() {
    line, _, err := bufio.NewReader(r).ReadLine()
    if err != nil {
        errCh <- err
    } else {
        ch <- string(line)
    }
    cmd.Wait() // <-- add this line
}()
This will cause the goroutine to block until the process exits, the very thing you are trying to avoid; the alternative, however, is that the application leaks resources for each command.
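Putting both fixes together, the pattern looks like this (same names as in the question: r is the pipe being read and TimeoutError is the caller's own error value):

ch := make(chan string, 1)   // buffered: the send below can never block
errCh := make(chan error, 1) // buffered: the send below can never block
go func() {
    line, _, err := bufio.NewReader(r).ReadLine()
    if err != nil {
        errCh <- err
    } else {
        ch <- string(line)
    }
    cmd.Wait() // release the command's resources once the process exits
}()
select {
case err := <-errCh:
    return "", err
case line := <-ch:
    return line, nil
case <-time.After(5 * time.Second):
    return "", TimeoutError
}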

Why goroutine leaks

I read Twelve Go Best Practices and encountered an interesting example on page 30.
func sendMsg(msg, addr string) error {
    conn, err := net.Dial("tcp", addr)
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = fmt.Fprint(conn, msg)
    return err
}

func broadcastMsg(msg string, addrs []string) error {
    errc := make(chan error)
    for _, addr := range addrs {
        go func(addr string) {
            errc <- sendMsg(msg, addr)
            fmt.Println("done")
        }(addr)
    }
    for _ = range addrs {
        if err := <-errc; err != nil {
            return err
        }
    }
    return nil
}

func main() {
    addr := []string{"localhost:8080", "http://google.com"}
    err := broadcastMsg("hi", addr)
    time.Sleep(time.Second)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("everything went fine")
}
The author mentions what happens to the code above:
the goroutine is blocked on the chan write
the goroutine holds a reference to the chan
the chan will never be garbage collected
Why is the goroutine blocked here? The main goroutine is blocked until it receives data from a goroutine, and then it continues the for loop, doesn't it?
Why will the errc chan never be garbage collected? Is it because I don't close the channel after the goroutines finish?
One problem I see is that inside broadcastMsg() after goroutines have started:
for _ = range addrs {
    if err := <-errc; err != nil {
        return err
    }
}
If a non-nil error is received from errc, broadcastMsg() returns immediately with that error and does not receive further values from the channel, which means the remaining goroutines will never get unblocked because errc is unbuffered.
Possible Fixes
A possible fix would be to use a buffered channel, big enough to not block any of the goroutines, in this case:
errc := make(chan error, len(addrs))
Or, even if a non-nil error is received from the channel, still proceed to receive as many times as there are goroutines sending on it:
var errRec error
for _ = range addrs {
    if err := <-errc; err != nil {
        if errRec == nil {
            errRec = err
        }
    }
}
return errRec
Or as mentioned in the linked talk on slide #33: use a "quit" channel to prevent the started goroutines to remain blocked after broadcastMsg() has completed/returned.
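Sketched, that quit-channel variant might look like this (quit is a name introduced here, following the idea from the talk):

errc := make(chan error)
quit := make(chan struct{})
defer close(quit) // runs when broadcastMsg returns, releasing any still-blocked senders
for _, addr := range addrs {
    go func(addr string) {
        select {
        case errc <- sendMsg(msg, addr):
        case <-quit: // broadcastMsg has already returned; give up on the send
        }
    }(addr)
}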
You have a list of two addresses (localhost, google). To each of these you're sending a message (hi), using one goroutine per address. And the goroutine sends error (which may be nil) to errc channel.
If you send something to a channel, you also need something that reads the values from that channel, otherwise it will block (unless it's a buffered channel, but even buffered channels block once their buffer is full).
So your reading loop looks like this:
for _ = range addrs {
    if err := <-errc; err != nil {
        return err
    }
}
If the first address returns an error which is not nil, the loop returns. The subsequent error values are never read from the channel, so the remaining goroutines block forever on their sends.
