When I use ZMQ to receive messages over TCP on a Mac, the program starts using 100% of the CPU after some hours. Using lsof, I found that the kqueue had gotten into a weird state:
myprogram 56486 user 13u KQUEUE count=4, state=0x8
and the program keeps waiting on something.
Can someone tell me what state=0x8 means for a kqueue?
Code details:
language: Golang
package: github.com/pebbe/zmq4
poller := zmq.NewPoller()
// sub is the zmq socket over TCP that receives message
poller.Add(sub, zmq.POLLIN)
for {
polled, _ := poller.Poll(zeroMQTimeout)
if len(polled) >= 1 && polled[0].Events&zmq.POLLIN != 0 {
msg, _ := sub.RecvMessageBytes(0)
go handleMsg(msg[1])
} else {
fmt.Printf("something happened at %s\n", time.Now())
}
}
By logging, I can see that the main goroutine hangs somewhere in the for loop, most likely at sub.RecvMessageBytes.
Using a blocking receive instead of a poller does not improve the situation.
Related
I have a server process that does lots of things. One of those is receiving UNIX datagram packets. After a while it stops receiving them, while the rest of the server's functionality appears to continue just fine.
The socket was established with:
unixSock, err := net.ListenUnixgram("unixgram", &net.UnixAddr{Name: "/tmp/controller_socket"})
and there is a loop receiving messages that looks like this:
func packetHandler(self *controller, c net.PacketConn) {
msg := make([]byte, MAX_MESSAGE_SIZE)
for {
msg = msg[:cap(msg)]
n, addr, err := c.ReadFrom(msg)
if err == io.EOF {
return
}
if n > 0 {
gotMessage(self, msg[:n], addr)
}
}
}
After the server starts this all runs just fine. Then, after some time (maybe a day or more) sending processes fail to send to the socket, as illustrated by this command:
$ echo foo | socat - UNIX-SENDTO:/tmp/controller_socket
2022/10/11 10:50:34 socat[1776033] E sendto(5, 0x55ae7454f000, 4, 0, AF=1 "/tmp/controller_socket", 24): Connection refused
netstat confirms that the socket is being listened on and with a receive queue of length 2:
unix 2 [ ] DGRAM 46822161 /tmp/controller_socket
and looking at the server process using /proc we see that fd=3 is indeed for that socket:
3 -> 'socket:[46822161]'
If I attach a debugger to the server process I can see a goroutine that is indeed attempting to read from the socket. It is blocked in c.ReadFrom() and setting a breakpoint on the following line confirms that it is not returning with an unhandled error. The debugger shows that c.(*net.UnixConn).conn.fd.pfd.Sysfd == 3.
In summary:
There is a UNIX datagram listener opened by the server process.
There is a goroutine in that process blocked on receive on that listener.
Senders get "Connection refused".
What has gone wrong?
go version go1.18.1 linux/amd64
Why does this receiver goroutine refuse to terminate when the connection is closed?
This runs as expected, but then randomly, somewhere between every 20 and every 10,000 calls, a receiver will fail to shut down, which causes a goroutine leak, leading to 100% CPU.
Note: If I log all errors, I see read on a closed channel when conn.SetReadDeadline is commented out. When it is used, I see i/o timeout as the error.
This ran for 10k cycles, where the main process starts 11 pairs of these senders/receivers and they process 1000 jobs before the main process sends the shutdown signal. This setup ran overnight for > 6 hours without any issue, up to the 10k-cycle mark, but this morning I can't get it to run more than 20 cycles without a receiver being flagged as not shutting down and exiting.
func sender(w worker, ch channel) {
var j job
for {
select {
case <-ch.quit: // shutdown broadcast, exit
w.Close()
ch.stopped <- w.id // debug, send stop confirmed
return
case j = <-w.job: // worker designated jobs
case j = <-ch.spawner: // FCFS jobs
}
// ... prepare job ...
w.WriteToUDP(buf, w.addr)
}
}
func receiver(w worker, ch channel) {
deadline := 100 * time.Millisecond
out:
for {
w.SetReadDeadline(time.Now().Add(deadline))
// blocking read, should release on close (or deadline)
n, err = w.c.Read(buf)
select {
case <-ch.quit: // shutdown broadcast, exit
ch.stopped <- w.id+100 // debug, receiver stop confirmed
return
default:
}
if n == 0 || err != nil {
continue
}
update := &update{id: w.id}
// ... process update logic ...
select {
case <-ch.quit: // shutting down
break out
case ch.update <- update:
}
}
}
I need a reliable way to get the receiver to shut down when it gets either the shutdown broadcast OR the conn is closed. Functionally, closing the channel should be enough and is the preferred method according to the Go package documentation, see the Conn interface.
I upgraded to the most recent Go, which is 1.12.1, with no change.
Running on macOS in development and CentOS in production.
Has anyone run into this problem?
If so, how did you reliably fix it?
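One thing worth checking in the first receiver: once the connection is closed, every later Read fails immediately, so the `continue` on error becomes a busy loop if the conn is ever closed before ch.quit is. Below is a minimal sketch of a receiver that treats a closed connection as terminal. It assumes Go 1.16 or newer for net.ErrClosed (newer than the 1.12.1 mentioned above). receiveUntilClosed and demo are hypothetical names, and a loopback UDP socket stands in for the real worker connection.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// receiveUntilClosed reads datagrams until quit is closed or conn is closed,
// and returns how many datagrams it read. Hypothetical helper, not from the post.
func receiveUntilClosed(conn *net.UDPConn, quit <-chan struct{}) int {
	buf := make([]byte, 1500)
	count := 0
	for {
		select {
		case <-quit: // shutdown broadcast
			return count
		default:
		}
		conn.SetReadDeadline(time.Now().Add(100 * time.Millisecond))
		n, err := conn.Read(buf)
		switch {
		case err == nil:
			if n > 0 {
				count++
			}
		case errors.Is(err, net.ErrClosed):
			// Closed conn: every later Read fails at once, so retrying would spin.
			return count
		default:
			var ne net.Error
			if errors.As(err, &ne) && ne.Timeout() {
				continue // deadline expired; loop around to re-check quit
			}
			return count // any other error: stop instead of retrying hot
		}
	}
}

// demo sends three datagrams over loopback, closes the listener, and reports
// whether the receiver goroutine exited within one second.
func demo() (count int, exited bool) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		panic(err)
	}
	quit := make(chan struct{}) // never closed here: only Close() ends the receiver
	done := make(chan int, 1)
	go func() { done <- receiveUntilClosed(conn, quit) }()

	sender, err := net.DialUDP("udp", nil, conn.LocalAddr().(*net.UDPAddr))
	if err != nil {
		panic(err)
	}
	defer sender.Close()
	for i := 0; i < 3; i++ {
		sender.Write([]byte("ping"))
	}
	time.Sleep(200 * time.Millisecond)
	conn.Close()

	select {
	case count = <-done:
		return count, true
	case <-time.After(time.Second):
		return 0, false
	}
}

func main() {
	count, exited := demo()
	fmt.Println("datagrams:", count, "receiver exited:", exited)
}
```

The point of the sketch is only the error classification: timeouts loop, everything else returns. Whether that is the cause of the leak described here is not established by the post.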
Possible Solution
My very verbose and icky solution, which possibly works as a workaround, is this:
1) start the sender in a goroutine (above, unchanged)
2) start the receiver in a goroutine, like this (below)
func receive(w worker, ch channel) {
request := make(chan []byte, 1)
reader := make(chan []byte, 1)
defer func() {
close(request) // exit signaling
w.c.Close() // exit signaling
//close(reader)
}()
go func() {
// untried scenario, in case we still have leaks -> 100% cpu
// we may need to be totally reliant on closing request or ch.quit
// defer w.c.Close()
deadline := 100 * time.Millisecond
var n int
var err error
for buf := range request {
for {
select {
case <-ch.quit: // shutdown signal
return
default:
}
w.c.SetReadDeadline(time.Now().Add(deadline))
n, err = w.c.Read(buf)
if err != nil { // timeout or close
continue
}
break
}
select {
case <-ch.quit: // shutdown signal
return
case reader <- buf[:n]:
//default:
}
}
}()
var buf []byte
out:
for {
request <- make([]byte, messageSize)
select {
case <-ch.quit: // shutting down
break out
case buf = <-reader:
}
update := &update{id: w.id}
// ... process update logic ...
select {
case <-ch.quit: // shutting down
break out
case ch.update <- update:
}
}
}
My question is: why does this horrendous version 2, which spawns a new goroutine to read from the blocking c.Read(buf), seem to work more reliably, meaning it does not leak when the shutdown signal is sent, when the much simpler first version didn't? The two seem to be essentially the same thing, given the blocking c.Read(buf).
Downgrading my question is NOT helpful when this is a legitimate and verifiably repeatable issue, the question remains unanswered.
Thanks everyone for the responses.
So. There wasn't ever a stack trace. In fact, I got NO errors at all, no race detection or anything, and it wasn't deadlocked. A goroutine just would not shut down and exit, and it wasn't consistently reproducible. I've been running the same data for two weeks.
When the goroutine failed to report that it was exiting, it would simply spin out of control and drive the CPU to 100%, but only AFTER all the others had exited and the system had moved on. I never saw memory grow. CPU would gradually tick up to 200%, 300%, 400%, and that's when the system had to be rebooted.
I logged whenever a leak happened, and it was always a different goroutine. I'd get one leak after 380 successful runs (of 23 pairs of goroutines running in unison), the next time 1832 runs before one receiver leaked, the next time only 23, with the exact same data being chewed on from the same starting point. The leaked receiver just spun out of control, but only after the group of 22 other companions had all shut down and exited successfully and the system had moved on to the next batch. It would not fail consistently, other than that it was guaranteed to leak at some point.
After many days, numerous rewrites, and a million log lines before and after every action, this finally seems to be what the issue was. Even after digging through the library I'm not sure exactly why, nor WHY it only happens randomly.
For whatever reason, the golang.org/x/net/dns/dnsmessage library will randomly freak out if you parse and go straight to skipping questions without reading a question first. I have no idea why that matters: skipping questions should just mean you don't care about that section and want it marked as processed. It works fine literally a million times in a row, and then it doesn't. So it seems you are required to read a question BEFORE you can call SkipAllQuestions, since that appears to have been the fix. I'm 18,525 batches in, and adding that turned the leaks off.
var p dnsmessage.Parser
h, err := p.Start(buf[:n])
if err != nil {
continue // what!?
}
switch {
case h.RCode == dnsmessage.RCodeSuccess:
q, err := p.Question() // should only have one question
if q.Type != w.Type || err != nil {
continue // what!?, impossible
}
// Note: if you do NOT do the above first, you're asking for pain! (tr)
if err := p.SkipAllQuestions(); err != nil {
continue // what!?
}
// Do not count as "received" until we have passed above point and
// validated that response had a question that we could skip...
I'm new to goroutines and trying to work out the idiomatic way to organise this code. My program will generate async status events that I want to transmit to a server over a websocket. Right now I have a global channel messagesToServer to receive the status messages. The idea is that it will send the data if we currently have a websocket open, or quietly drop it if the connection to the server is currently closed or unavailable.
Relevant snippets are below. I don't really like the non-blocking send - if for some reason my writer goroutine took a while to process a message I think it could end up dropping a quick second message for no reason?
But if I use a blocking send, sendStatusToServer could block something that shouldn't be blocked if the connection is offline. I could try to track connected/disconnected state but if a message was sent at the same time as the disconnection occurred I think there would be a race condition.
Is there a tidy way I can write this?
var (
messagesToServer chan common.StationStatus
)
// ...
func sendStatusToServer(msg common.StationStatus) {
// Must be non-blocking in case we're not connected
select {
case messagesToServer <- msg:
break
default:
break
}
}
// ...
// after making websocket connection
log.Println("Connected to central server");
finished := make(chan struct{})
// Writer
go func() {
for {
select {
case msg := <-messagesToServer:
var buff bytes.Buffer
enc := gob.NewEncoder(&buff)
err = enc.Encode(msg)
conn.WriteMessage(websocket.BinaryMessage, buff.Bytes()); // ignore errors by design
case <-finished:
return;
}
}
}()
// Reader as busy loop on this goroutine
for {
messageType, p, err := conn.ReadMessage()
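One possible middle ground between the non-blocking and the blocking send is a small buffered channel that never blocks the producer and, when full, discards the oldest queued status rather than the newest. The sketch below is only an illustration under assumptions: StationStatus here is a hypothetical stand-in for common.StationStatus, offer and drain are made-up names, and the buffer must have capacity of at least 1. The drop-oldest step is exact for a single producer; with several producers it is best effort.

```go
package main

import "fmt"

// StationStatus is a hypothetical stand-in for common.StationStatus.
type StationStatus struct{ Seq int }

// offer queues msg without ever blocking the caller. If the buffer is full it
// discards the oldest queued message and retries. Requires cap(q) >= 1.
func offer(q chan StationStatus, msg StationStatus) {
	for {
		select {
		case q <- msg:
			return
		default: // buffer full
		}
		select {
		case <-q: // drop the oldest entry to make room
		default: // the writer drained it in the meantime; just retry
		}
	}
}

// drain empties the queue without blocking and returns the sequence numbers,
// standing in for the writer goroutine catching up.
func drain(q chan StationStatus) []int {
	var seqs []int
	for {
		select {
		case m := <-q:
			seqs = append(seqs, m.Seq)
		default:
			return seqs
		}
	}
}

func main() {
	q := make(chan StationStatus, 4)
	for i := 1; i <= 6; i++ { // a burst of six while the writer is busy or offline
		offer(q, StationStatus{Seq: i})
	}
	fmt.Println(drain(q)) // prints [3 4 5 6]
}
```

With this shape a quick second message survives a slow writer as long as the burst fits in the buffer, and while disconnected the queue simply holds the most recent few statuses.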
I have some code that is a job dispatcher and is collating a large amount of data from lots of TCP sockets. This code is a result of an approach to Large number of transient objects - avoiding contention and it largely works with CPU usage down a huge amount and locking not an issue now either.
From time to time my application locks up and the "Channel length" log is the only thing that keeps repeating as data is still coming in from my sockets. However the count remains at 5000 and no downstream processing is taking place.
I think the issue might be a race condition and the line it is possibly getting hung up on is channel <- msg within the select of the jobDispatcher. Trouble is I can't work out how to verify this.
I suspect that as select can take items at random the goroutine is returning and the shutdownChan doesn't have a chance to process. Then data hits inboundFromTCP and it blocks!
Someone might spot something really obviously wrong here. And offer a solution hopefully!?
var MessageQueue = make(chan *trackingPacket_v1, 5000)
func init() {
go jobDispatcher(MessageQueue)
}
func addMessage(trackingPacket *trackingPacket_v1) {
// Send the packet to the buffered queue!
log.Println("Channel length:", len(MessageQueue))
MessageQueue <- trackingPacket
}
func jobDispatcher(inboundFromTCP chan *trackingPacket_v1) {
var channelMap = make(map[string]chan *trackingPacket_v1)
// Channel that listens for the strings that want to exit
shutdownChan := make(chan string)
for {
select {
case msg := <-inboundFromTCP:
log.Println("Got packet", msg.Avr)
channel, ok := channelMap[msg.Avr]
if !ok {
packetChan := make(chan *trackingPacket_v1)
channelMap[msg.Avr] = packetChan
go processPackets(packetChan, shutdownChan, msg.Avr)
packetChan <- msg
continue
}
channel <- msg
case shutdownString := <-shutdownChan:
log.Println("Shutting down:", shutdownString)
channel, ok := channelMap[shutdownString]
if ok {
delete(channelMap, shutdownString)
close(channel)
}
}
}
}
func processPackets(ch chan *trackingPacket_v1, shutdown chan string, id string) {
var messages = []*trackingPacket_v1{}
tickChan := time.NewTicker(time.Second * 1)
defer tickChan.Stop()
hasCheckedData := false
for {
select {
case msg := <-ch:
log.Println("Got a messages for", id)
messages = append(messages, msg)
hasCheckedData = false
case <-tickChan.C:
messages = cullChanMessages(messages)
if len(messages) == 0 {
messages = nil
shutdown <- id
return
}
// No point running checking when packets have not changed!!
if hasCheckedData == false {
processMLATCandidatesFromChan(messages)
hasCheckedData = true
}
case <-time.After(time.Duration(time.Second * 60)):
log.Println("This channel has been around for 60 seconds which is too much, kill it")
messages = nil
shutdown <- id
return
}
}
}
Update 01/20/16
I tried to rework with the channelMap as a global with some mutex locking but it ended up deadlocking still.
Slightly tweaked the code, still locks but I don't see how this one does!!
https://play.golang.org/p/PGpISU4XBJ
Update 01/21/17
After some recommendations I put this into a standalone working example so people can see. https://play.golang.org/p/88zT7hBLeD
It is a long running process so will need running locally on a machine as the playground kills it. Hopefully this will help get to the bottom of it!
I'm guessing that your problem is getting stuck doing this channel <- msg at the same time as the other goroutine is doing shutdown <- id.
Since neither the channel nor the shutdown channels are buffered, they block waiting for a receiver. And they can deadlock waiting for the other side to become available.
There are a couple of ways to fix it. You could declare both of those channels with a buffer of 1.
Or instead of signalling by sending a shutdown message, you could do what Google's context package does and send a shutdown signal by closing the shutdown channel. Look at https://golang.org/pkg/context/ especially WithCancel, WithDeadline and the Done functions.
You might be able to use context to remove your own shutdown channel and timeout code.
And JimB has a point about shutting down the goroutine while it might still be receiving on the channel. What you should do is send the shutdown message (or close, or cancel the context) and continue to process messages until your ch channel is closed (detect that with case msg, ok := <-ch:), which would happen after the shutdown is received by the sender.
That way you get all of the messages that were incoming until the shutdown actually happened, and should avoid a second deadlock.
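To make that concrete, here is a minimal single-worker sketch of the "keep receiving until ch is closed" idea. It is a simplification under assumptions, not the original program: messages are plain ints, the worker decides to quit after a hypothetical `limit` messages (standing in for the ticker and timeout logic), and dispatch and worker are illustrative names. The dispatcher still uses a blocking, unbuffered send like `channel <- msg` in the question.

```go
package main

import "fmt"

// worker handles messages from ch. After `limit` messages it asks to shut
// down, but it keeps receiving until the dispatcher closes ch, so the
// dispatcher's blocking send can never get stuck against it.
func worker(id string, ch <-chan int, shutdown chan<- string, limit int, done chan<- int) {
	handled := 0
	for handled < limit {
		if _, ok := <-ch; !ok {
			done <- handled
			return
		}
		handled++
	}
	req := shutdown
	for {
		select {
		case req <- id:
			req = nil // request delivered; a nil channel disables this case
		case _, ok := <-ch:
			if !ok { // dispatcher closed ch: now it is safe to exit
				done <- handled
				return
			}
			handled++
		}
	}
}

// dispatch sends up to total messages and returns how many were sent and how
// many the worker handled.
func dispatch(total, limit int) (sent, handled int) {
	ch := make(chan int)          // unbuffered, as in the question
	shutdown := make(chan string) // unbuffered, as in the question
	done := make(chan int, 1)
	go worker("w1", ch, shutdown, limit, done)
	for sent < total {
		select {
		case <-shutdown: // worker asked to stop: close its channel
			close(ch)
			return sent, <-done
		default:
		}
		ch <- sent // blocking send; safe because the worker is always receiving
		sent++
	}
	close(ch)
	return sent, <-done
}

func main() {
	sent, handled := dispatch(100, 3)
	fmt.Println("sent:", sent, "handled:", handled)
}
```

The exact counts vary from run to run, but every message that was sent is handled, and neither side can block forever on the other.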
I'm new to Go but in this code here
case msg := <-inboundFromTCP:
log.Println("Got packet", msg.Avr)
channel, ok := channelMap[msg.Avr]
if !ok {
packetChan := make(chan *trackingPacket_v1)
channelMap[msg.Avr] = packetChan
go processPackets(packetChan, shutdownChan, msg.Avr)
packetChan <- msg
continue
}
channel <- msg
Aren't you putting something in channel (unbuffered?) here
channel, ok := channelMap[msg.Avr]
So wouldn't you need to empty out that channel before you can add the msg here?
channel <- msg
Like I said, I'm new to Go so I hope I'm not being goofy. :)
When I close a browser, I want the websocket to disconnect within 3 seconds instead of 1 minute. The following just keeps writing into the void without error until, I guess, the TCP/IP timeout kicks in rather than the SetWriteDeadline.
f := func(ws *websocket.Conn) {
for {
select {
case msg := <-out:
ws.SetWriteDeadline(time.Now().Add(3 * time.Second))
if _, err := ws.Write([]byte(msg)); err != nil {
fmt.Println(err)
return
}
case <-time.After(3 * time.Second):
fmt.Println("timeout 3")
return
}
}
}
return websocket.Handler(f)
I need to wait for this err
write tcp [::1]:8080->[::1]:65459: write: broken pipe
before it finally closes the connection, which takes about a minute or more.
You are using the write deadline correctly. The deadline bounds the time for writing data to the TCP stack's buffers, not the time for the peer to receive the data (if it receives it at all).
To reliably detect closed connections, the application should send PINGs to the peer and wait for the expected PONGs. The package you are using does not support this functionality, but the Gorilla package does. The Gorilla chat application shows how to use PING and PONG to detect closed connections.
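The idea can be illustrated without any websocket library. The sketch below is not the Gorilla API: it is an application-level heartbeat over a plain net.Conn, with net.Pipe standing in for the real connection and the "PING\n"/"PONG\n" wire format, peerAlive, and probe all being assumptions made up for the example. The important part is that the deadline covers waiting for the reply, not just the write.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// peerAlive sends a ping and waits up to timeout for the matching pong.
// A write that succeeds proves nothing about the peer; only the reply does.
func peerAlive(conn net.Conn, timeout time.Duration) bool {
	conn.SetDeadline(time.Now().Add(timeout)) // bounds both the write and the read
	if _, err := conn.Write([]byte("PING\n")); err != nil {
		return false
	}
	reply := make([]byte, len("PONG\n"))
	if _, err := io.ReadFull(conn, reply); err != nil {
		return false // timed out or connection closed
	}
	return string(reply) == "PONG\n"
}

// probe runs one heartbeat against an in-memory peer that either answers the
// ping or reads it and stays silent (like a browser that has gone away).
func probe(responsive bool) bool {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go func() {
		ping := make([]byte, len("PING\n"))
		if _, err := io.ReadFull(server, ping); err != nil {
			return
		}
		if responsive {
			server.Write([]byte("PONG\n"))
		}
	}()
	return peerAlive(client, 200*time.Millisecond)
}

func main() {
	fmt.Println(probe(true), probe(false)) // prints: true false
}
```

In a real server the same check would run periodically, and the connection would be closed as soon as a pong is missed, which gives a detection time of roughly the ping interval plus the timeout.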