Problem with gRPC setup. Getting an intermittent RPC unavailable error - go

I have a gRPC server and client that work as expected most of the time, but I occasionally get a "transport is closing" error:
rpc error: code = Unavailable desc = transport is closing
I'm wondering if it's a problem with my setup. The client is pretty basic:
connection, err := grpc.Dial(address, grpc.WithInsecure(), grpc.WithBlock())
if err != nil {
    log.Fatalf("did not connect: %v", err)
}
defer connection.Close()
client := pb.NewAppClient(connection)
and calls are made with a timeout, like:
ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
defer cancel()
resp, err := client.MyGRPCMethod(ctx, params)
One other thing I'm doing is checking the connection to see if it's either open, idle or connecting, and reusing the connection if so. Otherwise, redialing.
No special configuration is happening on the server:
grpc.NewServer()
Are there any common mistakes setting up a grpc client/server that I might be making?

After much searching, I have finally arrived at an acceptable and logical solution to this problem.
The root cause is this: the underlying TCP connection is closed abruptly, but neither the gRPC client nor the server is 'notified' of the event.
The challenge exists at multiple levels:
The kernel's management of TCP sockets
Any intermediary load balancers/reverse proxies (run by cloud providers or otherwise) and how they manage TCP sockets
Your application layer itself and its networking requirements: whether it can reuse the same connection for future requests or not
My solution turned out to be fairly simple:
server = grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle: 5 * time.Minute, // <--- This fixes it!
    }),
)
This ensures that the gRPC server itself gracefully closes the underlying TCP socket before any abrupt kill by the kernel or an intermediary server (AWS and Google Cloud load balancers both have larger timeouts than 5 minutes).
An added bonus: anywhere you use multiple connections, leaks introduced by clients that forget to Close their connections will no longer affect your server either.
My $0.02: don't blindly trust any organisation's (even Google's) ability to design and maintain APIs. This is a classic case of defaults-gone-wrong.
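As an aside, the client can mirror this with its own keepalive pings so it notices a dead connection sooner. This is not part of the fix above, just a minimal sketch using grpc-go's keepalive.ClientParameters (the interval values are illustrative assumptions, not recommendations):

// import "google.golang.org/grpc/keepalive"
connection, err := grpc.Dial(address,
    grpc.WithInsecure(),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                30 * time.Second, // ping the server after this much idle time
        Timeout:             10 * time.Second, // wait this long for a ping ack before closing
        PermitWithoutStream: true,             // send pings even with no active RPCs
    }),
)

Note that servers may reject overly frequent pings (grpc-go enforces this with keepalive.EnforcementPolicy), so keep the interval conservative.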

One other thing I'm doing is checking the connection to see if it's either open, idle or connecting, and reusing the connection if so. Otherwise, redialing.
grpc will manage your connections for you, reconnecting when needed, so you should never need to monitor it after creating it unless you have very specific needs.
"transport is closing" has many different reasons for happening; please see the relevant question in our FAQ and let us know if you still have questions: https://github.com/grpc/grpc-go#the-rpc-failed-with-error-code--unavailable-desc--transport-is-closing

I had about the same issue earlier this year. After about 15 minutes, I had servers close the connection.
My solution, which is working, was to create the connection with grpc.Dial once in my main function and then create the pb.NewAppClient(connection) on each request. Since the connection was already established, latency wasn't an issue. After the request was done, I closed the client.
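A minimal sketch of that pattern, using the names from the question above:

// In main: dial once; a *grpc.ClientConn is safe for concurrent use
// and reconnects on its own.
connection, err := grpc.Dial(address, grpc.WithInsecure())
if err != nil {
    log.Fatalf("did not connect: %v", err)
}
defer connection.Close()

// Per request: stubs are cheap to create around the shared connection.
client := pb.NewAppClient(connection)
ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
defer cancel()
client.MyGRPCMethod(ctx, params)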

Related

What would happen if a process established multiple PostgreSQL connections and terminated without closing them?

I'm writing a DLL for purchased software.
The software will perform multi-threaded calculations on certain tasks.
My job is to output the relative result into a database.
However, due to the limited support of the software, it is kind of difficult to do multi-threaded output of the data.
The key problem is that there is no info on the last execution of the DLL function.
Therefore, the database connection will not be closed.
So may I ask: if I leave the connection open and terminate the process, what are the potential problems?
My platform is winserver 2008, and PostgreSQL 10.
I don't understand the background information you are giving, but I can answer the question:
If a PostgreSQL client process dies without closing the database (and TCP) connection, the PostgreSQL server process (“backend process”) that serves this connection will not realize this immediately.
Of course, as soon as the server tries to communicate with the client, e.g. to return some results, it will notice that the partner has gone away and will return an error.
However, often the backend process is idle, waiting for the client to send the next request. In this case, it would never notice that its partner has died. This could eventually cause max_connections to be exhausted by dead connections.
Because this is a common problem in networking, TCP provides the “keepalive” functionality: when a connection has been idle for a while (2 hours by default), the operating system will send a so-called “keepalive packet” and wait for a response from the other side. Sending keepalive packets is repeated several times (5 times by default) in short intervals (1 second by default), and if no response is received, the connection is closed by the operating system, the backend process receives an error message and terminates.
PostgreSQL provides parameters with which you can configure these settings on the server side: tcp_keepalives_idle, tcp_keepalives_count and tcp_keepalives_interval. If you set tcp_keepalives_idle to a shorter value, dead connections will be detected and removed faster, at the cost of a little added network traffic.
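As a hedged illustration (the values are arbitrary examples, not recommendations), these settings go in postgresql.conf:

# postgresql.conf -- detect dead clients after roughly 5.5 minutes of silence
tcp_keepalives_idle = 300      # seconds of idle time before the first probe
tcp_keepalives_interval = 10   # seconds between unanswered probes
tcp_keepalives_count = 3       # unanswered probes before the connection is dropped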

Why won't the RabbitMQ connection stay open when not in use?

I have used the http://github.com/streadway/amqp package in my application to handle connections to a remote RabbitMQ server. Everything is OK and works fine, but when a connection is idle for a long period of time, e.g. 6 hours, it gets closed. I check NotifyClose(make(chan *amqp.Error)) the whole time in my goroutine, and it returns:
Exception (501) Reason: "write tcp
192.168.133.53:55424->192.168.134.34:5672: write: broken pipe"
Why does this error happen? (Is there a problem in my code?)
How long can a connection be idle?
How can I prevent this problem?
As Cosmic Ossifrage says, the error is saying your RabbitMQ client has disconnected.
There are so many things that could sit between your client and server that can/will drop dormant connections that it's not worth focusing on how long your connection can be dormant for. You want to set the requested heartbeat interval in your connection manager.
https://www.rabbitmq.com/heartbeats.html
I'm not familiar with the framework you're using, but I see it has a defaultHeartbeat field in connection.go. You might need to experiment with the value to find the best balance: keeping the connection from being killed without hitting the server too often with keep-alive traffic.
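With streadway/amqp specifically, the heartbeat interval can be set explicitly through amqp.DialConfig. A minimal sketch (the 10-second value is an assumption, not a recommendation; the address is the one from the error above):

// Request a heartbeat so dormant connections generate periodic traffic.
conn, err := amqp.DialConfig("amqp://guest:guest@192.168.134.34:5672/", amqp.Config{
    Heartbeat: 10 * time.Second, // negotiated with the server; 0 disables heartbeats
})
if err != nil {
    log.Fatalf("dial: %v", err)
}
defer conn.Close()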

How to check if a Conn is active without sending/receiving data?

In Go/Golang, once a connection object (Conn) is created with the following code:
conn, err := net.Dial("tcp", "33.33.33.33:444")
if err != nil {
    // connection failed
}
I would like to preserve the conn value and later verify that the connection is still active. I don't want to reconnect from time to time just to check the connection, as that causes various TIME_WAITs on the OS, so overall my requirements are:
create a connection
preserve the connection object
capture if the connection drops for any reason
do not send or receive any data
Any thoughts on how to achieve this? Is there a way to detect that the connection has dropped without sending or receiving data or reconnecting?
I don't think it is possible to do without performing an operation. If it is infrequently used, when you try to read you may get an error if the client (or some proxy) closed the connection. If that happens then reconnect and retry.
Many protocols will bake in a heartbeat mechanism to facilitate this kind of thing. Then you can read constantly (with SetDeadline if you want) and know within a heartbeat frame that something went wrong.
For example, I use a Redis client that supports connection pooling. When I retrieve an idle connection from the pool, I immediately perform a PING operation. If that succeeds, I know the connection is ready to use. If not, I get another idle one, or connect anew.
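If you must probe a raw net.Conn in Go without a protocol-level ping, the closest approximation is a read with an immediate deadline. A hedged sketch (note it consumes a byte if data happens to be in flight, so it only suits strictly idle request/response connections):

// import "io", "net", "time"
func connAlive(conn net.Conn) bool {
    one := make([]byte, 1)
    conn.SetReadDeadline(time.Now())        // make the read return immediately
    defer conn.SetReadDeadline(time.Time{}) // clear the deadline afterwards
    if _, err := conn.Read(one); err == io.EOF {
        return false // peer closed the connection
    }
    return true // timeout (nothing pending) or data in flight
}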

Stale connection with Pheanstalk

I'm using beanstalkd to offload some work to other machines. The setup is a bit unusual, the server is on the internet (public ip) but the consumers are behind adsl lines on some peoples homes. So there is a linux server as client going out through a dynamic ip and connecting to the server to get a job. It's all PHP and I'm using pheanstalk library.
Everything runs smoothly for some time, but then the ADSL line changes its IP (every 24 hours the provider forces a disconnect-reconnect) and the client just hangs, never coming out of "reserve".
I thought that putting a timeout on the reserve would help, but it didn't. As it turns out, the client issues a command and blocks; it never checks the timeout itself. It issues a reserve-with-timeout (instead of a simple reserve), and it is the server's responsibility to return a TIMED_OUT response when the timeout occurs. The problem is, the connection is broken (but TCP/IP doesn't know about that yet, since neither side has tried to talk to the other), and if the client is blocked reading, it will never return.
The library seems to have support for some kind of timeouts locally (for example when trying to connect to server), but it does not seem to contemplate this scenario.
How could I detect the stale connection and force a reconnect? Is there some kind of keepalive on the protocol (and on the pheanstalk itself)?
Thanks!
You could try to close each connection right after the request is answered and reopen a new connection each time.
There is no close() function, but deleting the Pheanstalk object with unset($pheanstalk) will close it.
This explanation is quite helpful:
Pheanstalk (PHP client for beanstalk) - how do connections work?
I haven't tried it yet, but I came up with the idea of connecting to the beanstalk server through an SSH tunnel. We can enable the ServerAliveCountMax and ServerAliveInterval options on the tunnel, so that a network or server failure will cause the tunnel to close. This should then cause the pheanstalk client to report an error.
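For what it's worth, such a tunnel might look like this (the host and user names are placeholders; 11300 is beanstalkd's default port):

# Forward local port 11300 to beanstalkd through SSH; the tunnel drops
# after 3 missed 30-second aliveness probes, surfacing the failure.
ssh -N -L 11300:localhost:11300 \
    -o ServerAliveInterval=30 -o ServerAliveCountMax=3 \
    user@beanstalk-server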

What is the difference between "ORA-12571: TNS packet writer failure" and "ORA-03135: connection lost contact"?

I am working in an environment where we get production issues from time to time related to Oracle connections. We use ODP.NET from ASP.NET applications, and we suspect the firewall closes connections that have been in the connection pool too long.
Sometimes we get an "ORA-12571: TNS packet writer failure" error, and sometimes we get "ORA-03135: connection lost contact."
I was wondering if someone has run into this and/or has an understanding of the difference between the 2 errors.
Using a mobile phone analogy:
ORA-12571 (Failure): the call is dropped.
ORA-03135 (Connection Lost): the other party hung up.
My understanding is that 3135 occurs when a connection is lost. This doesn't tell you why the connection was lost, though. It may have been terminated by the server because the server failed to receive a response to a probe for a certain amount of time and assumed that the connection was dead. Or (I'm not sure about this) the exact reverse: the client failed to receive a probe response from the server for a certain amount of time, so it assumed the connection was lost. The "certain amount of time" is controlled by SQLNET.EXPIRE_TIME=[minutes] in sqlnet.ora.
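For reference, enabling that probe is a one-line server-side setting (the 10-minute value is purely illustrative):

# sqlnet.ora on the database server
SQLNET.EXPIRE_TIME = 10   # probe idle client connections every 10 minutes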
As for 12571, my (again vague) understanding is that there was a sudden failure to send a packet during communication with the server, and that this is typically caused by some software or hardware interfering with the connection (either by design, or by error). For instance, if you pull out your ethernet cable and then try to execute a query, you'll probably get this. Or if a firewall or anti-malware application decides to block the traffic.
