gRPC socket closed by Go-side server - go

I'm trying to connect my Erlang code to my Go code with gRPC, and I found that if I open too many gRPC connections at the same time, my socket gets closed by the Go server, which in turn stops my Erlang client (a gen_server). No error information is given to me, just a terse message on the Erlang side showing that http2_client socket closed by peer #Port<some port info>.
I'm sure the limit on concurrent streams that gRPC allows is not being reached (my debug log never shows us getting to that point). After a careful look I found the problem occurs in serveStreams in google.golang.org\grpc\server.go (line 830): my logs show that all the streams successfully reach var wg sync.WaitGroup, but they never finish the remaining work and return to defer st.Close().
Could someone kindly help me with this strange error, or at least point me in a direction I should look into?

I have managed to solve this problem, so I'm coming back to answer it myself.
It turned out that I was using a different logging system for debugging, so the default Go error messages were never collected or shown to me... Go kept screaming and shouting at me, but I was sitting in another room and couldn't hear it.
As for the socket problem: the whole connection was being closed because my third-party library http2_client.erl has no support for long-lived connections. After manually adding a heartbeat to each connection, all the problems are gone and gRPC works just fine.
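For anyone hitting the same thing from the Go side, there is also a related knob on the server: gRPC keepalives, which ping idle connections instead of letting them die silently. Below is a minimal sketch of that configuration; the values are placeholders for illustration, not what I actually used (my fix was the heartbeat on the Erlang client side).

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer(
		// Ping idle clients instead of silently dropping them.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // send a ping after 30s of inactivity
			Timeout: 10 * time.Second, // close the connection if no ack within 10s
		}),
		// Tolerate pings from clients (e.g. an Erlang heartbeat) even when
		// no stream is active, instead of treating them as abuse.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second,
			PermitWithoutStream: true,
		}),
	)

	// Register services here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```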

Related

ZeroMQ assertion failed: socket handle no longer valid for some reason

We have a Windows 10 C++ program using ZeroMQ that aborts very often, always on the same group of computers, due to assertion failures.
The assert statement is buried deep in the libzmq code.
On other machines, the same program runs fine without those problems (but in all fairness, that's with different OS build numbers and program configurations).
The assertion failure seems to happen because internal zeromq (socket and/or pipe based) connection(s)/handles get unexpectedly closed.
What could possibly cause something like that?
More information:
The assertion failure seems to have something to do with the channels/mailboxes that ZeroMQ uses for internal signaling. In older versions of the library this works with several loopback TCP sockets, while modern versions rely on a solution involving IOCP (I/O completion ports).
Here's a long-standing and possibly related issue where the original author himself talked about a similar crash that happened to him:
https://github.com/zeromq/libzmq/issues/1108
Working with the crash dumps of our application, I see that the stack trace leading to the assert statement usually starts right after an attempt to read from a socket (or socket file descriptor?). The read or receive call fails and then the library panics.
So, suddenly a socket handle no longer seems valid. Examples of errors that I see are "The resource is temporarily unavailable" and things like "Invalid handle/parameter".
Can it be that something or someone is forcefully closing the socket for us?
What could be causing this behavior?
This happens for an old version of ZeroMQ (4.0.10) as well as a modern one (4.3.5). This leads me to believe that the fault lies somewhere else, since such different implementations fail in roughly the same way.
When trying to reproduce the problem I can trigger a similar assertion failure on 4.0.x by using TCPView to manually force-close an internal TCP connection that ZeroMQ uses. The resulting assertion failure is instant and the crash dump looks identical to what happens in the wild.
But the modern version doesn't seem to use loopback sockets, so I couldn't close the "private" connections there. Maybe it uses pipes or Unix-style sockets instead (which, I have heard, is now possible on Windows 10).
For a moment I considered ephemeral port exhaustion as the reason for all this trouble, but that alone doesn't make sense to me: I wouldn't expect the OS to force-close existing connections; they should keep working, and you'd expect only new connections to fail.
As #user253751 suggested, the culprit turned out to be a particular piece of code in the application that closes the same HANDLE twice. A serious bug in our code, not in ZeroMQ!
On Windows, closed handle values get reused immediately, so anything opened right after the first CloseHandle is at risk of being unexpectedly closed when the second CloseHandle strikes, due to the bug.
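For illustration only, here is a hypothetical sketch of that double-close mechanism, written in Go with golang.org/x/sys/windows rather than the C++ of the real application; the event objects are just stand-ins for whatever handles the buggy code actually touched.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

func main() {
	// Open some handle (an event object, purely for illustration).
	h1, err := windows.CreateEvent(nil, 1, 0, nil)
	if err != nil {
		panic(err)
	}
	windows.CloseHandle(h1) // first close: fine

	// Another part of the program opens a new handle; Windows may hand
	// back the same numeric value that h1 had.
	h2, err := windows.CreateEvent(nil, 1, 0, nil)
	if err != nil {
		panic(err)
	}

	// Buggy code closes h1 a second time. If h2 reused the value,
	// this silently closes h2 instead, breaking its owner
	// (in our case, one of ZeroMQ's internal connections).
	windows.CloseHandle(h1)

	// Subsequent use of h2 then fails with an invalid-handle error.
	if err := windows.SetEvent(h2); err != nil {
		fmt.Println("h2 was closed out from under us:", err)
	} else {
		fmt.Println("no handle reuse this time; in the real app it eventually happens")
	}
}
```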

WebSocket. Which is correct close code for idle timeout?

The more I research this, the more I think of it as a hypothetical question.
In my application I try to handle all control frames correctly. While building the application I ran into one issue: the default Node.js HTTP server closes the socket after 120 seconds of inactivity. That's fine, I can easily disable this timeout, but why not make it actually controllable? So I implemented an interface to adjust the timeout delay. And now I have another issue: the server just breaks the connection. Silently. That is not really good practice for the WebSocket protocol; I should send a Close control frame first. But which status code should I provide?
The documentation describes a set of status codes, but broadly they boil down to: (1) the job is done, (2) the server/client is going down, (3) some error occurred, (4) reserved by the protocol:
https://www.rfc-editor.org/rfc/rfc6455#section-7.4.1
It's unclear to me which one to choose for an idle timeout. 1001 (going away) sounds like the closest fit, but I see nothing about this in the documentation, and I found no one who has ever asked this question.
So which one should I choose? Any ideas?
I was puzzled too. There seems to be no easily googleable answer here in 2022.
In my case I've decided to go with 1002 (protocol error), since not answering a ping is basically a protocol violation.
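To make the "send a Close frame first" part concrete, here is a minimal sketch in Go using github.com/gorilla/websocket (not the Node.js stack from the question); closeIdle is a hypothetical helper, not part of that library, and the status code is whichever one you settle on (1001, 1002, 1008, ...).

```go
package wsidle

import (
	"time"

	"github.com/gorilla/websocket"
)

// closeIdle sends a Close control frame carrying the given status code and
// reason, then closes the underlying connection, so the peer is told why
// it is being disconnected instead of just seeing the TCP socket drop.
func closeIdle(conn *websocket.Conn, code int, reason string) error {
	deadline := time.Now().Add(5 * time.Second)
	msg := websocket.FormatCloseMessage(code, reason)
	// Best effort: if the control frame can't be written, still close.
	if err := conn.WriteControl(websocket.CloseMessage, msg, deadline); err != nil {
		conn.Close()
		return err
	}
	return conn.Close()
}

// Example usage when an idle timer fires:
//   closeIdle(conn, websocket.CloseGoingAway, "idle timeout")       // 1001
//   closeIdle(conn, websocket.ClosePolicyViolation, "idle timeout") // 1008
//   closeIdle(conn, websocket.CloseProtocolError, "no pong")        // 1002
```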

Meteor: WebSocket is already in CLOSING or CLOSED state

I don't really know if this is just Meteor or Angular-Meteor, but I see this error message a lot (in the client console). Of course it is probably triggered by bad code on my part. However, I am wondering: when should this appear, and what situation does it try to describe?
Thanks and bye ...
I think it happens on auto-refresh of the local Meteor server (i.e. each time you modify a file).

Bittorrent protocol 'not available'/'end connection' response?

I like being able to use a torrent app to grab the latest TV show so that I can watch it at my leisure. The problem is that the structure of the protocol tends to cause a lot of incoming noise on my connection for some time after I close the client. Since I also like to play online games, this means I have to make sure my torrent client is shut off about an hour before I want to play (depending on how long the tracker advertises me to the swarm). Otherwise I get a horrible connection in the game because of the persistent flood of incoming torrent requests.
I threw together a small Ruby app to watch the incoming requests so I'd know when the UTP traffic let up:
http://pastebin.com/TbP4TQrK
The thought occurred to me, though, that there may be some response that I could send to notify the clients that I'm no longer participating in the swarm and that they should stop sending requests. I glanced over the protocol specifications but I didn't find anything of the sort. Does anyone more familiar with the protocol know if there's such a response?
Thanks in advance for any advice.
If a bunch of peers on the internet have your IP and think that you're in their swarm, they will try to contact you a few times before giving up. There's nothing you can do about that. Telling them to stop, one at a time, would probably end up using more bandwidth than just ignoring the UDP packets does.
Now, there are a few things you can do to mitigate it though:
Make sure your client sends stopped requests to all its trackers (a rough sketch of such an announce follows these suggestions). This is part of the protocol specification and most clients do this. If this is successful, the tracker won't tell anyone about you past that point. But peers remember having seen you, so it doesn't mean nobody will try to connect to you.
Turn off DHT. The DHT acts much like a tracker, except that it doesn't have the stopped message. It will take something like 15-30 minutes for your IP to time out once it's announced to the DHT.
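For illustration, here is a rough Go sketch of what a "stopped" announce to an HTTP tracker looks like (per BEP 3). announceStopped and every parameter value are placeholders; a real client sends this to each tracker it previously announced to, with its real info hash, peer ID and transfer counters.

```go
package tracker

import (
	"fmt"
	"net/http"
	"net/url"
)

// announceStopped tells an HTTP tracker that we are leaving the swarm
// (the "stopped" event), so it stops handing our address out to peers.
func announceStopped(announceURL string, infoHash, peerID [20]byte, port int) error {
	q := url.Values{}
	q.Set("info_hash", string(infoHash[:])) // raw 20 bytes; Encode percent-escapes them
	q.Set("peer_id", string(peerID[:]))
	q.Set("port", fmt.Sprint(port))
	q.Set("uploaded", "0")
	q.Set("downloaded", "0")
	q.Set("left", "0")
	q.Set("event", "stopped")

	resp, err := http.Get(announceURL + "?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("tracker returned %s", resp.Status)
	}
	return nil
}
```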
I think it might also be relevant to ask yourself whether these stray incoming 23-byte UDP packets really matter. Presumably you're not flooded by more than a few per second (probably less). Have you made any actual measurements, or is it mostly paranoia that makes you wait for them to let up?
I'm assuming you're playing some latency-sensitive FPS, in which case the server will most likely blast you with at least 10-50 full-MTU packets per second, without any congestion control. I would be surprised if you attract so many BitTorrent connection attempts that they would cause any of the game packets to be dropped.

Meaning/cause of RPC Exception 'No interfaces have been exported.'

We have a fairly standard client/server application built using MS RPC. Both client and server are implemented in C++. The client establishes a session to the server, then makes repeated calls to it over a period of time before finally closing the session.
Periodically, however, especially under heavy load conditions, we are seeing an RPC exception show up with code 1754: RPC_S_NOTHING_TO_EXPORT.
It appears that this happens in the middle of a session. The user is logged on for a while, making successful calls, then one of the calls inexplicably returns this error. As far as we can tell, the server receives no indication that anything went wrong - and it definitely doesn't see the call the client made.
The error also appears to be permanent: having the client retry the connection doesn't work either. However, if the user has multiple sessions active simultaneously between the same client and server, the other connections are unaffected.
In essence, I have two questions:
Does anyone know what RPC_S_NOTHING_TO_EXPORT means? The MSDN documentation simply says: "No interfaces have been exported." ... Huh? The session was working fine for numerous instances of the same call up until this point...
Does anyone have any ideas as to how to identify the real problem? Note: Capturing network traffic is something we would rather avoid, if possible, as the problem is sporadic enough that we would likely go through multiple gigabytes of traffic before running into an occurrence.
Capturing network traffic would be one of the best ways to tackle this issue. If you can't do that, could you dump the client process and debug with WinDBG or Visual Studio? Perhaps compare a dump when operating normally versus in the error state?

Resources