automatic gRPC unix reconnect after EOF - go

I have an application (let's call it client) connecting to another process (let's call it server) on the same machine via gRPC. The communication goes over unix socket.
If server is restarted, my client gets an EOF and does not re-establish the connection, although I expected the clientConn to handle the reconnection automatically.
Why isn't the dialer taking care of the reconnection?
I expect it to do so with the backoff params I passed.
Below some pseudo-MWE.
Run establish the initial connection, then spawns goroutineOne
goroutineOne waits for the connection to be ready and delegates the send to fooUpdater
fooUpdater streams the data, or returns in case of errors
for waitUntilReady I used the pseudo-code referenced by this answer to get a new stream.
func main() {
go func() {
if err := Run(ctx); err != nil {
log.Errorf("connection error: %v", err)
}
ctxCancel()
}()
// some wait logic
}
func Run(ctx context.Context) {
backoffConfig := backoff.Config{
BaseDelay: time.Duration(1 * time.Second),
Multiplier: backoff.DefaultConfig.Multiplier,
Jitter: backoff.DefaultConfig.Jitter,
MaxDelay: time.Duration(120 * time.Second),
}
myConn, err := grpc.DialContext(ctx,
"/var/run/foo.bar",
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithConnectParams(grpc.ConnectParams{Backoff: backoffConfig, MinConnectTimeout: time.Duration(1 * time.Second)}),
grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
d := net.Dialer{}
c, err := d.DialContext(ctx, "unix", addr)
if err != nil {
return nil, fmt.Errorf("connection to unix://%s failed: %w", addr, err)
}
return c, nil
}),
)
if err != nil {
return fmt.Errorf("could not establish socket for foo: %w", err)
}
defer myConn.Close()
return goroutineOne()
}
func goroutineOne() {
reconnect := make(chan struct{})
for {
if ready := waitUntilReady(ctx, myConn, time.Duration(2*time.Minute)); !ready {
return fmt.Errorf("myConn: %w, timeout: %s", ErrWaitReadyTimeout, "2m")
}
go func() {
if err := fooUpdater(ctx, dataBuffer, myConn); err != nil {
log.Errorf("foo updater: %v", err)
}
reconnect <- struct{}{}
}()
select {
case <-ctx.Done():
return nil
case <-reconnect:
}
}
}
func fooUpdater(ctx context.Context, dataBuffer custom.CircularBuffer, myConn *grpc.ClientConn) error {
clientStream, err := myConn.Stream(ctx) // custom pb code, returns grpc.ClientConn.NewStream(...)
if err != nil {
return fmt.Errorf("could not obtain stream: %w", err)
}
for {
select {
case <-ctx.Done():
return nil
case data := <-dataBuffer:
if err := clientStream.Send(data); err != nil {
return fmt.Errorf("could not send data: %w", err)
}
}
}
}
func waitUntilReady(ctx context.Context, conn *grpc.ClientConn, maxTimeout time.Duration) bool {
ctx, cancel := context.WithTimeout(ctx, maxTimeout)
defer cancel()
currentState := conn.GetState()
timeoutValid := true
for currentState != connectivity.Ready && timeoutValid {
timeoutValid = conn.WaitForStateChange(ctx, currentState)
currentState = conn.GetState()
// debug print currentState -> prints IDLE
}
return currentState == connectivity.Ready
}
Debugging hints also welcome :)

Based on the provided code and information, there might be an issue with how ctx.Done is being utilized.
The ctx.Done() is being used in fooUpdater and goroutineOnefunctions. When connection breaks, I believe that the ctx.Done() gets called in both functions, with the following execution order:
Connection breaks, the ctx.Done case in the fooUpdater function gets called, exiting the function. The select statement in the goroutineOne function also executes the ctx.Done case, which exists the function, and the client doesn't reconnect.
Try debugging it to check if both select case blocks get executed, but I believe that is the issue here.

According to the GRPC documentation, the connection is re-established if there is a transient failure otherwise it fails immediately. You can try to verify that the failure is transient by printing the connectivity state.
You should print the error code also to understand Why RPC failed.
Maybe what you have tried is not considered a transient failure.
Also, according to the following entry retry logic does not work with streams: grpc-java: Proper handling of retry on client for service streaming call
Here are the links to the corresponding docs:
https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html
https://pkg.go.dev/google.golang.org/grpc#section-readme
Also, check the following entry:
Ways to wait if server is not available in gRPC from client side

Related

Setting idletimeout in Go

I have a function in go which is handling connections which are coming through tcp and handled via ssh. I am trying to set an idle timeout by creating struct in the connection function.
Use case - a customer should be able to make a connection and upload/download multiple files
Reference - IdleTimeout in tcp server
Function code:
type Conn struct {
net.Conn
idleTimeout time.Duration
}
func HandleConn(conn net.Conn) {
var err error
rAddr := conn.RemoteAddr()
session := shortuuid.New()
config := LoadSSHServerConfig(session)
blocklistItem := blocklist.GetBlockListItem(rAddr)
if blocklistItem.IsBlocked() {
conn.Close()
atomic.AddInt64(&stats.Stats.BlockedConnections, 1)
return
}
func (c *Conn) Read(b []byte) (int, error) {
err := c.Conn.SetReadDeadline(time.Now().Add(c.idleTimeout))
if err != nil {
return 0, err
}
return c.Conn.Read(b)
}
sConn, chans, reqs, err := ssh.NewServerConn(conn, config)
if err != nil {
if err == io.EOF {
log.Errorw("SSH: Handshaking was terminated", log.Fields{
"address": rAddr,
"error": err,
"session": session})
} else {
log.Errorw("SSH: Error on handshaking", log.Fields{
"address": rAddr,
"error": err,
"session": session})
}
atomic.AddInt64(&stats.Stats.AuthorizationFailed, 1)
return
}
log.Infow("connection accepted", log.Fields{
"user": sConn.User(),
})
if user, ok := users[session]; ok {
log.Infow("SSH: Connection accepted", log.Fields{
"user": user.LogFields(),
"clientVersion": string(sConn.ClientVersion())})
atomic.AddInt64(&stats.Stats.AuthorizationSucceeded, 1)
// The incoming Request channel must be serviced.
go ssh.DiscardRequests(reqs)
// Key ID: sConn.Permissions.Extensions["key-id"]
handleServerConn(user, chans)
log.Infow("connection finished", log.Fields{"user": user.LogFields()})
log.Infow("checking connections", log.Fields{
//"cc": Stats.AcceptedConnections,
"cc2": &stats.Stats.AcceptedConnections})
// Remove connection from local cache
delete(users, session)
} else {
log.Infow("user not found from memory", log.Fields{"username": sConn.User()})
}
}
This code is coming from the Listen function:
func Listen() {
listener, err := net.Listen("tcp", sshListen)
if err != nil {
panic(err)
}
if useProxyProtocol {
listener = &proxyproto.Listener{
Listener: listener,
ProxyHeaderTimeout: time.Second * 10,
}
}
for {
// Once a ServerConfig has been configured, connections can be accepted.
conn, err := listener.Accept()
if err != nil {
log.Errorw("SSH: Error accepting incoming connection", log.Fields{"error": err})
atomic.AddInt64(&stats.Stats.FailedConnections, 1)
continue
}
// Before use, a handshake must be performed on the incoming net.Conn.
// It must be handled in a separate goroutine,
// otherwise one user could easily block entire loop.
// For example, user could be asked to trust server key fingerprint and hangs.
go HandleConn(conn)
}
}
Is that even possible to set a deadline for only the connections which have been idle for 20 secinds (no upload/downloads).
EDIT 1 : Following #LiamKelly's suggestions, I have made the changes in the code. Now the code is like
type SshProxyConn struct {
net.Conn
idleTimeout time.Duration
}
func (c *SshProxyConn) Read(b []byte) (int, error) {
err := c.Conn.SetReadDeadline(time.Now().Add(c.idleTimeout))
if err != nil {
return 0, err
}
return c.Conn.Read(b)
}
func HandleConn(conn net.Conn) {
//lines of code as above
sshproxyconn := &SshProxyConn{nil, time.Second * 20}
Conn, chans, reqs, err := ssh.NewServerConn(sshproxyconn, config)
//lines of code
}
But now the issue is that SSH is not happening. I am getting the error "Connection closed" when I try to do ssh. Is it still waiting for "conn" variable in the function call?
Is that even possible to set a deadline for only the connections which have been idle for 20 [seconds]
Ok so first a general disclaimer, I am going to assume go-protoproxy implements the Conn interface as we would expected. Also as you hinted at before, I don't think you can put a a struct method inside another function (I also recommend renaming it something unique to prevent Conn vs net.Conn confusion).
type SshProxyConn struct {
net.Conn
idleTimeout time.Duration
}
func (c *SshProxyConn) Read(b []byte) (int, error) {
err := c.Conn.SetReadDeadline(time.Now().Add(c.idleTimeout))
if err != nil {
return 0, err
}
return c.Conn.Read(b)
}
func HandleConn(conn net.Conn) {
This makes is more clear what your primary issue is, which you passed the normal net.Conn to your SSH server, not your wrapper class. So
sConn, chans, reqs, err := ssh.NewServerConn(conn, config)
should be EDIT
sshproxyconn := &SshProxyConn{conn, time.Second * 20}
Conn, chans, reqs, err := ssh.NewServerConn(sshproxyconn , config)

Cancelling a net.Listener via Context in Golang

I'm implementing a TCP server application that accepts incoming TCP connections in an infinite loop.
I'm trying to use Context throughout the application to allow shutting down, which is generally working great.
The one thing I'm struggling with is cancelling a net.Listener that is waiting on Accept(). I'm using a ListenConfig which, I believe, has the advantage of taking a Context when then creating a Listener. However, cancelling this Context does not have the intended effect of aborting the Accept call.
Here's a small app that demonstrates the same problem:
package main
import (
"context"
"fmt"
"net"
"time"
)
func main() {
lc := net.ListenConfig{}
ctx, cancel := context.WithCancel(context.Background())
go func() {
time.Sleep(2*time.Second)
fmt.Println("cancelling context...")
cancel()
}()
ln, err := lc.Listen(ctx, "tcp", ":9801")
if err != nil {
fmt.Println("error creating listener:", err)
} else {
fmt.Println("listen returned without error")
defer ln.Close()
}
conn, err := ln.Accept()
if err != nil {
fmt.Println("accept returned error:", err)
} else {
fmt.Println("accept returned without error")
defer conn.Close()
}
}
I expect that, if no clients connect, when the Context is cancelled 2 seconds after startup, the Accept() should abort. However, it just sits there until you Ctrl-C out.
Is my expectation wrong? If so, what is the point of the Context passed to ListenConfig.Listen()?
Is there another way to achieve the same goal?
I believe you should be closing the listener when your timeout runs out. Then, when Accept returns an error, check that it's intentional (e.g. the timeout elapsed).
This blog post shows how to do a safe shutdown of a TCP server without a context. The interesting part of the code is:
type Server struct {
listener net.Listener
quit chan interface{}
wg sync.WaitGroup
}
func NewServer(addr string) *Server {
s := &Server{
quit: make(chan interface{}),
}
l, err := net.Listen("tcp", addr)
if err != nil {
log.Fatal(err)
}
s.listener = l
s.wg.Add(1)
go s.serve()
return s
}
func (s *Server) Stop() {
close(s.quit)
s.listener.Close()
s.wg.Wait()
}
func (s *Server) serve() {
defer s.wg.Done()
for {
conn, err := s.listener.Accept()
if err != nil {
select {
case <-s.quit:
return
default:
log.Println("accept error", err)
}
} else {
s.wg.Add(1)
go func() {
s.handleConection(conn)
s.wg.Done()
}()
}
}
}
func (s *Server) handleConection(conn net.Conn) {
defer conn.Close()
buf := make([]byte, 2048)
for {
n, err := conn.Read(buf)
if err != nil && err != io.EOF {
log.Println("read error", err)
return
}
if n == 0 {
return
}
log.Printf("received from %v: %s", conn.RemoteAddr(), string(buf[:n]))
}
}
In your case you should call Stop when the context runs out.
If you look at the source code of TCPConn.Accept, you'll see it basically calls the underlying socket accept, and the context is not piped through there. But Accept is simple to cancel by closing the listener, so piping the context all the way isn't strictly necessary.

"context deadline exceeded" AFTER first client-to-server request & response

In a simple gRPC example, I connect to the server, request to send a text message, and in return a success message is sent. Once the server is live, the first client request is successful, with no errors.
However, after the first try, each subsequent request (identical to the first) return this error, and does not return a response as the results (the text message is still sent, but the generated ID is not sent back):
rpc error: code = DeadlineExceeded desc = context deadline exceeded
After debugging a little bit, I found that the error
func (c *messageSchedulerClient) SendText(ctx context.Context, in *TextRequest, opts ...grpc.CallOption) (*MessageReply, error) {
...
err := c.cc.Invoke(ctx, "/communication.MessageScheduler/SendText", in, out, opts...)
...
return nil, err
}
returns
rpc error: code = DeadlineExceeded desc = context deadline exceeded
Here is my client code:
func main() {
// Set up a connection to the server.
conn, err := grpc.Dial(address, grpc.WithInsecure(), grpc.WithBlock())
if err != nil {
log.Fatalf("did not connect: %v", err)
}
c := pb.NewMessageSchedulerClient(conn)
var r *pb.MessageReply
r, err = pbSendText(c, false)
if err != nil {
log.Fatalf("could not greet: %v", err)
}
log.Printf("Greeting: %s", r.GetId())
err = conn.Close()
if err != nil {
log.Printf("Connection Close Error: %v", err)
}
return
}
func pbSendText(c pb.MessageSchedulerClient, instant bool) (*pb.MessageReply, error) {
ctx, cancel := context.WithTimeout(context.Background(), time.Second * 5)
minute := timestamppb.New(time.Now().Add(time.Second * 30))
r, err := c.SendText(ctx, &pb.TextRequest{})
if err != nil {
log.Printf("DEBUG MESSAGE: Error after c.SendText(ctx, in): %v", err)
}
cancel()
return r, err
}
and the server code is...
func (s *MessageServer) SendText(ctx context.Context, in *pb.TextRequest) (*pb.MessageReply, error) {
return &pb.MessageReply{Id: GeneratePublicId()}, nil
}
func GrpcServe() {
lis, err := net.Listen("tcp", port)
if err != nil {
log.Fatalf("failed to listen: %v", err)
}
s := grpc.NewServer()
pb.RegisterMessageSchedulerServer(s, &MessageServer{})
if err := s.Serve(lis); err != nil {
log.Fatalf("failed to serve: %v", err)
}
return
}
const Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHJKMNOPQRSTUVWXYZ0123457"
// executes instantly, O(n)
func GeneratePublicId() string {
return id.NewWithAlphabet(Alphabet)
}
I've tried changing the context to .TODO from .Background. Doesn't work. I am SURE it's something so simple that I'm missing.
Again, the first time I execute the application from the client side, it works. But after the first client application execution, meaning application execution and beyond -- until I restart the gRPC Server, it gives me the error.
The GeneratePublicId function executes almost instantly, as it is O(n) and uses a random number function.

gobwas/ws Send opPing over net.Conn

Can somebody help me understand what am I doing wrong here, all I'm trying to do is write a Ping message over a net.Conn instance (server), and reply back with a Pong which is expected on a net.Conn instance (client).
I have annotated the code with some errors that I receive.
reader.go
func read(conn net.Conn) {
for {
conn.SetReadDeadline(time.Now().Add(2 * time.Second))
_, op, err := wsutil.ReadClientData(conn)
if err != nil {
log.Printf("wsmanager read: %v", err) // read: write pipe: deadline exceeded
return
}
if op != ws.OpPing {
continue
}
c.conn.SetWriteDeadline(time.Now().Add(3 * time.Second))
if err = wsutil.WriteServerMessage(c.conn, ws.OpPong, []byte{}); err != nil {
log.Printf("wsmanager: send pong error: %v", err)
return
}
}
}
// reader_test.go
client, server := net.Pipe()
go read(server) // starts the loop above
err := wsutil.WriteClientMessage(server, ws.OpPing, []byte{})
if err != nil {
t.Fatalf("failed sending pings message %v", err)
}
_, op, err := wsutil.ReadServerData(client)
if err != nil {
t.Errorf("exp no err, got %v", err)
}
if op != ws.OpPong {
t.Errorf("exp ws.OpPong, got %v", op)
}
Thank you for using this library :)
As the doc states, the ReadData functions read data from the connection; that is, application specific data, not the control messages. Control frames are handled implicitly in these functions. If you want to read any kind of message, you could use wsutil.Reader or the plain ws.Read functions.
https://godoc.org/github.com/gobwas/ws/wsutil#ReadClientData

Synchronizing a Test Server During Tests

Summary: I'm running into a race condition during testing where my server is not reliably ready to serve requests before making client requests against it. How can I block only until the listener is ready, and still maintain composable public APIs without requiring users to BYO net.Listener?
We see the following error as the goroutine that spins up our (blocking) server in the background isn't listening before we call client.Do(req) in the TestRun test function.
--- FAIL: TestRun/Server_accepts_HTTP_requests (0.00s)
/home/matt/repos/admission-control/server_test.go:64: failed to make a request: Get https://127.0.0.1:37877: dial tcp 127.0.0.1:37877: connect: connection refused
I'm not using httptest.Server directly as I'm attempting to test the blocking & cancellation characteristics of my own server componenent.
I create an httptest.NewUnstartedServer, clone its *tls.Config into a new http.Server after starting it with StartTLS(), and then close it, before calling *AdmissionServer.Run(). This also has the benefit of giving me a *http.Client with the matching RootCAs configured.
Testing TLS is important here as the daemon this exposes lives in a TLS-only environment.
func newTestServer(ctx context.Context, t *testing.T) *httptest.Server {
testHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintln(w, "OK")
})
testSrv := httptest.NewUnstartedServer(testHandler)
admissionServer, err := NewServer(nil, &noopLogger{})
if err != nil {
t.Fatalf("admission server creation failed: %s", err)
return nil
}
// We start the test server, copy its config out, and close it down so we can
// start our own server. This is because httptest.Server only generates a
// self-signed TLS config after starting it.
testSrv.StartTLS()
admissionServer.srv = &http.Server{
Addr: testSrv.Listener.Addr().String(),
Handler: testHandler,
TLSConfig: testSrv.TLS.Clone(),
}
testSrv.Close()
// We need a better synchronization primitive here that doesn't block
// but allows the underlying listener to be ready before
// serving client requests.
go func() {
if err := admissionServer.Run(ctx); err != nil {
t.Fatalf("server returned unexpectedly: %s", err)
}
}()
return testSrv
}
// Test that we can start a minimal AdmissionServer and handle a request.
func TestRun(t *testing.T) {
testSrv := newTestServer(context.TODO(), t)
t.Run("Server accepts HTTP requests", func(t *testing.T) {
client := testSrv.Client()
req, err := http.NewRequest(http.MethodGet, testSrv.URL, nil)
if err != nil {
t.Fatalf("request creation failed: %s", err)
}
resp, err := client.Do(req)
if err != nil {
t.Fatalf("failed to make a request: %s", err)
}
// Later sub-tests will test cancellation propagation, signal handling, etc.
For posterity, this is our composable Run function, that listens in a goroutine and then blocks on our cancellation & error channels in a for-select:
type AdmissionServer struct {
srv *http.Server
logger log.Logger
GracePeriod time.Duration
}
func (as *AdmissionServer) Run(ctx context.Context) error {
sigChan := make(chan os.Signal, 1)
defer close(sigChan)
signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)
// run in goroutine
errs := make(chan error)
defer close(errs)
go func() {
as.logger.Log(
"msg", fmt.Sprintf("admission control listening on '%s'", as.srv.Addr),
)
if err := as.srv.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
errs <- err
as.logger.Log(
"err", err.Error(),
"msg", "the server exited",
)
return
}
return
}()
// Block indefinitely until we receive an interrupt, cancellation or error
// signal.
for {
select {
case sig := <-sigChan:
as.logger.Log(
"msg", fmt.Sprintf("signal received: %s", sig),
)
return as.shutdown(ctx, as.GracePeriod)
case err := <-errs:
as.logger.Log(
"msg", fmt.Sprintf("listener error: %s", err),
)
// We don't need to explictly call shutdown here, as
// *http.Server.ListenAndServe closes the listener when returning an error.
return err
case <-ctx.Done():
as.logger.Log(
"msg", fmt.Sprintf("cancellation received: %s", ctx.Err()),
)
return as.shutdown(ctx, as.GracePeriod)
}
}
}
Notes:
There is a (simple) constructor for an *AdmissionServer: I've left it out for brevity. The AdmissionServer is composable and accepts a *http.Server so that it can be plugged into existing applications easily.
The wrapped http.Server type that we create a listener from doesn't itself expose any way to tell if its listening; at best we can try to listen again and catch the error (e.g. port already bound to another listener), which does not seem robust as the net package doesn't expose a useful typed error for this.
You can just attempt to connect to the server before starting the test suite, as part of the initialization process.
For example, I usually have a function like this in my tests:
// waitForServer attempts to establish a TCP connection to localhost:<port>
// in a given amount of time. It returns upon a successful connection;
// ptherwise exits with an error.
func waitForServer(port string) {
backoff := 50 * time.Millisecond
for i := 0; i < 10; i++ {
conn, err := net.DialTimeout("tcp", ":"+port, 1*time.Second)
if err != nil {
time.Sleep(backoff)
continue
}
err = conn.Close()
if err != nil {
log.Fatal(err)
}
return
}
log.Fatalf("Server on port %s not up after 10 attempts", port)
}
Then in my TestMain() I do:
func TestMain(m *testing.M) {
go startServer()
waitForServer(serverPort)
// run the suite
os.Exit(m.Run())
}

Resources