Maintaining websocket connections with Kubernetes / multiple pods (Tornado)

I've run into a bit of a constraint combining my Tornado web server's websockets implementation with my Kubernetes deployment (I'm using GCP). I'm running an auto-scaler on my Kubernetes engine that automatically spins up new pods as the load on the server increases - this typically happens for long-running jobs (e.g. training ML models).
The problem is that each pod keeps its own local state: a user's connection may exist between the client and pod 1, while the long-running job is processing on pod 2 and sending updates from there. Those updates get lost, as no connection exists between the client and pod 2.
I'm using a relatively boilerplate WS implementation, adding/removing user connections as follows:
class DefaultWebSocket(WebSocketHandler):
    user_connections = set()

    def check_origin(self, origin):
        return True

    def open(self):
        """
        Establish a websocket connection between client and server
        :param self: WebSocket object
        """
        DefaultWebSocket.user_connections.add(self)

    def on_close(self):
        """
        Close websocket connection between client and server
        :param self: WebSocket object
        """
        DefaultWebSocket.user_connections.remove(self)
Any ideas/thoughts - either high or low-level - on how to solve this problem would be much appreciated! I've looked into stuff like k8scale.io and socket.io, but would prefer a 'native' solution.
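One 'native' pattern is to keep the websocket fan-out local to each pod and move cross-pod delivery onto a shared broker: whichever pod runs the long-running job publishes updates to the broker, every pod subscribes, and each pod forwards updates only to the connections it holds. Below is a minimal sketch assuming an in-cluster Redis service and the DefaultWebSocket handler above; the "redis" hostname, the "job-updates" channel, and the helper names are placeholders, not part of the original code.

import json
import threading

import redis
from tornado.ioloop import IOLoop

REDIS = redis.Redis(host="redis", port=6379)  # assumed in-cluster Redis service
CHANNEL = "job-updates"                       # placeholder channel name

def publish_update(payload):
    # Call this from whichever pod is doing the long-running work.
    REDIS.publish(CHANNEL, json.dumps(payload))

def _broadcast(data):
    # Runs on this pod's IOLoop thread; writes only to locally held sockets.
    for conn in list(DefaultWebSocket.user_connections):
        conn.write_message(data)

def _relay(io_loop):
    # Background thread: block on Redis and hand each message to the IOLoop.
    pubsub = REDIS.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        io_loop.add_callback(_broadcast, message["data"].decode())

def start_relay():
    # Call once at startup, on the thread that runs the IOLoop.
    threading.Thread(target=_relay, args=(IOLoop.current(),), daemon=True).start()

With something like this in place it no longer matters which pod processes the job: each pod only ever writes to the sockets it owns, and the broker carries updates across pods.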

Related

The FIX client can receive incoming messages but cannot send outgoing heartbeat messages

We have built a FIX client. The FIX client can receive incoming messages but cannot send outgoing heartbeat messages or reply to the TestRequest message; after the last heartbeat was sent, something triggered the client side to stop sending heartbeats altogether.
FIX version: FIX 5.0
The same incident happened before, and we have a tcpdump for one session from that time.
We deploy every FIX session to a separate k8s pod.
We suspected a CPU resource issue because the load average was high around the time of the issue, but adding more CPU cores did not solve it; we think the load average was high because of the FIX reconnections.
We suspected an I/O issue because we use AWS EFS, shared by the 3 sessions for logging and the message store, but it was still not solved after we used pod affinity to assign the 3 sessions to different nodes.
It's not a network issue either, since we can receive FIX messages and the other sessions worked well at that time. We have also disabled SNAT in the k8s cluster.
We are using QuickFIX/J 2.2.0 to create the FIX client. We have 3 sessions, deployed to k8s pods on separate nodes:
a rate session to get FX prices from the server
an order session to get transaction (execution report) messages from the server; we only send logon/heartbeat/logout messages to the server
a back-office session to get market status
We use the Apache Camel quickfixj component to simplify our programming. It works well most of the time, but the sessions keep reconnecting to the FIX servers, roughly once a month; usually only 2 of the 3 sessions have issues.
heartbeatInt = 30s
The FIX event messages at the client side:
20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/172.28.65.164:44974, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/10.60.45.132:11050
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon
The FIX incoming messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=Quote1☺52=20201004-21:09:02.887☺56=TA_Quote1☺10=186☺
8=FIXT.1.1☺9=65☺35=0☺34=2514☺49=Quote1☺52=20201004-21:09:33.089☺56=TA_Quote1☺10=185☺
8=FIXT.1.1☺9=74☺35=1☺34=2515☺49=Quote1☺52=20201004-21:09:48.090☺56=TA_Quote1☺112=TEST☺10=203☺
----- 21:10:53.203 Already disconnected ----
8=FIXT.1.1☺9=87☺35=A☺34=1☺49=Quote1☺52=20201004-21:10:53.639☺56=TA_Quote1☺98=0☺108=30☺141=Y☺1137=9☺10=183☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=Quote1☺52=20201004-21:11:23.887☺56=TA_Quote1☺10=026☺
The FIX outgoing messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=TA_Quote1☺52=20201004-21:09:02.884☺56=Quote1☺10=183☺
---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---
8=FIXT.1.1☺9=134☺35=A☺34=1☺49=TA_Quote1☺52=20201004-21:10:53.433☺56=Quote1☺98=0☺108=30☺141=Y☺553=xxxx☺554=xxxxx☺1137=9☺10=098☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=TA_Quote1☺52=20201004-21:11:23.884☺56=Quote1☺10=023☺
8=FIXT.1.1☺9=62☺35=0☺34=3☺49=TA_Quote1☺52=20201004-21:11:53.884☺56=Quote1☺10=027☺
Thread dump taken when the TEST message from the server was received. BTW, the gist is from our development environment, which has the same deployment.
https://gist.github.com/hitxiang/345c8f699b4ad1271749e00b7517bef6
We had enabled debug logging in QuickFIX/J, but it did not give much information, only logs for the messages received.
The sequence of events in time order:
20201101-23:56:02.742 the outgoing heartbeat should have been sent at this time; it looks like it was being sent, but it hung on the I/O write (thread in RUNNING state)
20201101-23:56:18.651 test message from server side to trigger thread dump
20201101-22:57:45.654 server side began to close the connection
20201101-22:57:46.727 thread dump - right
20201101-23:57:48.363 logon message
20201101-22:58:56.515 thread dump - left
The right dump (2020-11-01T22:57:46.727Z) is from when it hangs; the left dump (2020-11-01T22:58:56.515Z) is from after the reconnection.
It looks like the storage we were using, AWS EFS, caused the issue.
But the feedback from AWS support was that nothing is wrong on the EFS side.
Maybe it's a network issue between the k8s EC2 instances and AWS EFS.
First, we made logging asynchronous for all sessions, which made the disconnections happen less often.
Second, for the market session, we wrote the sequence files to local disk; the disconnections at the market session were gone.
Third, we finally replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions. It works great now.
BTW, AWS EBS is not highly available across zones, but that's better than FIX disconnections.

client-mode="true" and retryInterval on the inbound adapter with Client Connection factory

In the Spring documentation (32.6 TCP Adapters) it is mentioned that if we use clientMode = "true", then the inbound adapter is responsible for the connection with the external server.
I have created a flow in which the TCP adapter with a client connection factory makes the connection with the external server. The code for the flow is:
IntegrationFlow flow = IntegrationFlows
        .from(Tcp.inboundAdapter(Tcp.nioClient(hostConnection.getIpAddress(),
                        Integer.parseInt(hostConnection.getPort()))
                    .serializer(customSerializer)
                    .deserializer(customSerializer)
                    .id(hostConnection.getConnectionNumber()))
                .clientMode(true)
                .retryInterval(1000)
                .errorChannel("testChannel")
                .id(hostConnection.getConnectionNumber() + "adapter"))
        .enrichHeaders(f -> f.header("CustomerCode", hostConnection.getConnectionNumber()))
        .channel(directChannel())
        .handle(Jms.outboundAdapter(ConnectionFactory())
                .destination(hostConnection.getConnectionNumber()))
        .get();

theFlow = this.flowContext.registration(flow).id(hostConnection.getConnectionNumber() + "outflow").register();
I have created multiple flows by iterating over the list of connections: the above code runs in a for loop and each flow is registered in the flow context with a unique ID.
My clients are created successfully with no issues and then establish their connections as supported by the topology.
Issue:
I counted the number of client connections created successfully: 7 client connections (7 integration flows) are made successfully and they initiate their connections themselves.
When I create the 8th client connection (the 8th flow is created and registered successfully), .clientMode(true) does not work: the client does not initiate the connection itself after the first failure. It tries to make the connection once; if it connects successfully there is no issue, but in case of failure it does not retry.
Also, the other 7 client connections that were created successfully stop initiating connections themselves once they get disconnected.
Note: there is no issue with the flows, only with the TCP adapters; they stop initiating the connection.
The flow is created and registered successfully; when I run a control bus command @adapter_id.retryConnection(), the adapter connects to the server.
I don't understand what the issue with my flow is, such that I can't initiate a connection after a particular count (seven), or whether there is a limitation on the number of clients.
One possibility is that the taskScheduler's thread pool is exhausted - that shouldn't happen with the above configuration, but it depends on what else is in the application. Take a thread dump (e.g. with jstack) to see what the taskScheduler threads are doing.
See the documentation for information about how to configure the threads in the scheduler. However, if increasing the pool size solves it, you should really figure out which task(s) are tying up scheduler threads for long-running work.
Also turn on DEBUG logging to see if it provides any clues.

Long-running Redshift transaction from Ruby

I run several SQL statements in a transaction using the Ruby pg gem. The problem I bumped into is that the connection times out on these queries due to the firewall setup. The solution proposed here does not work, because it requires a JDBC connection string, and I'm in Ruby (JRuby is not an option). Moving the driver program to AWS to remove the firewall is not an option either.
The code that I have is along the following lines:
conn = RedshiftHelper.get_redshift_connection
begin
  conn.transaction do
    # run my queries
  end
ensure
  conn.flush
  conn.finish
end
I'm now looking into the PG asynchronous API. I'm wondering if I can use is_busy to prevent the firewall from timing out, or something to that effect. I can't find good documentation on the topic though. I'd appreciate any hints.
PS: I have solved this problem for a single query - I can trigger it asynchronously and track its completion using the system STV_INFLIGHT Redshift table. A transaction does not work this way, as I have to keep the connection open.
Ok, I nailed it down. Here are the facts:
Redshift is based on Postgres 8.0. To check that, connect to the Redshift instance using psql and see that it reports "server version 8.0".
Keepalive requests are specified at the level of the TCP socket (link).
Postgres 8.0 does not support the keepalive options in the connection string (link to 9.0 release changes, section E.19.3.9.1 on libpq).
The pg gem in Ruby is a wrapper around libpq.
Based on the facts above, TCP keepalive cannot be enabled via the connection string with Redshift. However, pg allows you to retrieve the socket used by the established connection. This means that even though libpq does not set the keepalive options, we can still set them manually. The solution:
class Connection
  attr_accessor :socket, :pg_connection

  def initialize(conn, socket)
    @socket = socket
    @pg_connection = conn
  end

  def method_missing(m, *args, &block)
    @pg_connection.send(m, *args, &block)
  end

  def close
    @socket.close
    @pg_connection.close
  end

  def finish
    @socket.close
    @pg_connection.close
  end
end
def get_connection
  conn = PGconn.open(...)
  socket_descriptor = conn.socket
  socket = Socket.for_fd(socket_descriptor)
  # Use the TCP keep-alive feature
  socket.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, 1)
  # Maximum keep-alive probes before assuming the connection is lost
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, 5)
  # Interval (in seconds) between keep-alive probes
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, 2)
  # Maximum idle time (in seconds) before starting to send keep-alive probes
  socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, 2)
  socket.autoclose = true
  return Connection.new(conn, socket)
end
The reason I introduce a proxy Connection class is Ruby's tendency to garbage-collect IO objects (like sockets) when they go out of scope. We need the connection and the socket to stay in the same scope, which is achieved through this proxy class. My Ruby knowledge is not deep, so there may be a better way to handle the socket object.
This approach works, but I would be happy to learn if there are better/cleaner solutions.
The link you provided has the answer. I think you just want to follow the section at the top, which has settings for 3 different OSes; pick the one you are running the code on (the client to the Amazon service).
Look in the section "To change TCP/IP timeout settings" - these settings apply to the OS that your code is running on (i.e. the client for the Amazon service, which is probably your server).
Linux — If your client is running on Linux, run the following command as the root user.
-- details omitted --
Windows — If your client runs on Windows, edit the values for the following registry settings under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:
-- details omitted --
Mac — If your client is a Mac, create or modify the /etc/sysctl.conf file with the following values:
-- Details omitted --

HTTP streaming connection (SSE) client disconnect not detected with Sinatra/Thin on Heroku

I am attempting to deploy a Sinatra streaming SSE response application on the Cedar stack. Unfortunately, while it works perfectly in development, once deployed to Heroku the callback or errback never gets called when a connection is closed, leading to the connection pool filling up with stale connections (which never time out, because data is still being sent to them on the server side).
Relevant info from the Heroku documentation:
Long-polling and streaming responses
Cedar supports HTTP 1.1 features such as long-polling and streaming responses. An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated.
If you’re sending a streaming response, such as with server-sent events, you’ll need to detect when the client has hung up, and make sure your app server closes the connection promptly. If the server keeps the connection open for 55 seconds without sending any data, you’ll see a request timeout.
This is exactly what I would like to do -- detect when the client has hung up, and close the connection promptly. However, something about the Heroku routing layer seems to prevent Sinatra from detecting the stream close event as it would normally.
Some sample code that can be used to replicate this:
require 'sinatra/base'

class MyApp < Sinatra::Base
  set :path, '/tmp'
  set :environment, 'production'

  def initialize
    @connections = []
    EM::next_tick do
      EM::add_periodic_timer(1) do
        @connections.each do |out|
          out << "connections: " << @connections.count << "\n"
        end
        puts "*** connections: #{@connections.count}"
      end
    end
  end

  get '/' do
    stream(:keep_open) do |out|
      @connections << out
      puts "Stream opened from #{request.ip} (now #{@connections.size} open)"
      out.callback do
        @connections.delete(out)
        puts "Stream closed from #{request.ip} (now #{@connections.size} open)"
      end
    end
  end
end
I've put a sample app up at http://obscure-depths-3413.herokuapp.com/ using this code that illustrates the problem. When you connect, the number of connections will increment, but when you disconnect it never goes down. (Full source of the demo with Gemfile etc. is at https://gist.github.com/mroth/5853993)
I'm at wits end trying to debug this one. Anyone know how to fix it?
P.S. There appears to have been a similar bug in Sinatra, but it was fixed a year ago. Also, this issue only occurs in production on Heroku; it works fine when run locally.
P.S. 2: This occurs when iterating over the connection objects as well, for example by adding the following code:
EM::add_periodic_timer(10) do
  num_conns = @connections.count
  @connections.reject!(&:closed?)
  new_conns = @connections.count
  diff = num_conns - new_conns
  puts "Purged #{diff} connections!" if diff > 0
end
Works great locally, but the connections never appear as closed on Heroku.
An update: after working directly with the Heroku routing team (who are great guys!), this is now fixed in their new routing layer, and should work properly in any platform.
I would do this check by hand: periodically send an alive signal that the client should respond to if the message was received.
Please look at this simple chat implementation, https://gist.github.com/tlewin/5708745, which illustrates the concept.
The application communicates with the client using a simple JSON protocol. When the client receives the alive: true message, it posts back a response and the server stores the last communication time.

How to gracefully shut down or remove AWS instances from an ELB group

I have a cloud of server instances running at Amazon using their load balancer to distribute the traffic. Now I am looking for a good way to gracefully scale the network down, without causing connection errors on the browser's side.
As far as I know, any connections to an instance are rudely terminated when it is removed from the load balancer.
I would like a way to inform my instance about a minute before it gets shut down, or to have the load balancer stop sending traffic to the dying instance without terminating its existing connections.
My app is Node.js based, running on Ubuntu. I also have some special software running on it, so I prefer not to use the many PaaS offerings for Node.js hosting.
Thanks for any hints.
I know this is an old question, but it should be noted that Amazon has recently added support for connection draining, which means that when an instance is removed from the load balancer, it will complete the requests that were in progress before it was removed. No new requests will be routed to the removed instance. You can also supply a timeout for these requests, meaning any request that runs longer than the timeout window will be terminated after all.
To enable this behaviour, go to the Instances tab of your load balancer and change the Connection Draining behaviour.
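Connection draining can also be enabled programmatically. A minimal sketch using boto3 against the classic ELB API (the load balancer name and timeout below are placeholders):

import boto3

elb = boto3.client("elb", region_name="us-east-1")  # classic ELB client
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-load-balancer",            # placeholder name
    LoadBalancerAttributes={
        "ConnectionDraining": {"Enabled": True, "Timeout": 300}  # seconds
    },
)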
This idea uses the ELB's capability to detect an unhealthy node and remove it from the pool BUT it relies upon the ELB behaving as expected in the assumptions below. This is something I've been meaning to test for myself but haven't had the time yet. I'll update the answer when I do.
Process Overview
The following logic could be wrapped and run at the time the node needs to be shut down.
1. Block new HTTP connections to nodeX but continue to allow existing connections.
2. Wait for existing connections to drain, either by monitoring existing connections to your application or by allowing a "safe"* amount of time.
3. Initiate a shutdown on the nodeX EC2 instance using the EC2 API directly or abstracted scripts (a minimal sketch follows this list).
* "safe" according to your application, which may not be possible to determine for some applications.
Assumptions that need to be tested
We know that the ELB removes unhealthy instances from its pool. I would expect this to be graceful, so that:
A new connection to a recently closed port will be gracefully redirected to the next node in the pool.
When a node is marked bad, already-established connections to that node are unaffected.
Possible test cases:
Fire HTTP connections at the ELB (e.g. from a curl script), logging the results during a scripted opening and closing of one of the node's HTTP ports. You would need to experiment to find an acceptable amount of time that allows the ELB to reliably detect the state change.
Maintain a long HTTP session (e.g. a file download) while blocking new HTTP connections; the long session should hopefully continue.
1. How to block HTTP Connections
Use a local firewall on nodeX to block new sessions but continue to allow established sessions.
For example, using iptables:
iptables -A INPUT -j DROP -p tcp --syn --destination-port <web service port>
The recommended way for distributing traffic from your ELB is to have an equal number of instances across multiple availability zones. For example:
ELB
Instance 1 (us-east-a)
Instance 2 (us-east-a)
Instance 3 (us-east-b)
Instance 4 (us-east-b)
Now there are two ELB API actions of interest that allow you to programmatically (or via the control panel) detach instances:
Deregister an instance
Disable an availability zone (which subsequently disables the instances within that zone)
The ELB Developer Guide has a section that describes the effects of disabling an availability zone. A note in that section is of particular interest:
Your load balancer always distributes traffic to all the enabled Availability Zones. If all the instances in an Availability Zone are deregistered or unhealthy before that Availability Zone is disabled for the load balancer, all requests sent to that Availability Zone will fail until DisableAvailabilityZonesForLoadBalancer is called for that Availability Zone.
What's interesting about the above note is that it could imply that if you call DisableAvailabilityZonesForLoadBalancer, the ELB could instantly start sending requests only to the available zones - possibly resulting in a zero-downtime experience while you perform maintenance on the servers in the disabled availability zone.
The above 'theory' needs detailed testing or acknowledgement from an Amazon cloud engineer.
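For what it's worth, the call in question can be made with boto3 roughly as follows (a sketch only; the load balancer name and zone are placeholders), and EnableAvailabilityZonesForLoadBalancer reverses it once maintenance is done:

import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Stop routing to one zone while its instances are being maintained.
elb.disable_availability_zones_for_load_balancer(
    LoadBalancerName="my-load-balancer",   # placeholder name
    AvailabilityZones=["us-east-1b"],      # placeholder zone
)

# Later: elb.enable_availability_zones_for_load_balancer(...) to bring it back.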
It seems like there have already been a number of responses here, and some of them have good advice. But I think that in general your design is flawed. No matter how carefully you design your shutdown procedure to make sure a client's connection is closed before shutting down a server, you're still vulnerable:
The server could lose power.
A hardware failure causes the server to fail.
The connection could be closed by a network issue.
The client loses internet or wifi.
I could go on with the list, but my point is that instead of designing the system to always work correctly, design it to handle failures. If you design a system that can handle a server losing power at any time, then you've created a very robust system. This isn't a problem with the ELB; it's a problem with your current system architecture.
A caveat that was not discussed in the existing answers is that ELBs also use DNS records with 60 second TTLs to balance load between multiple ELB nodes (each having one or more of your instances attached to it).
This means that if you have instances in two different availability zones, you probably have two IP addresses for your ELB with a 60s TTL on their A records. When you remove the final instances from such an availability zone, your clients "might" still use the old IP address for at least a minute - faulty DNS resolvers might behave much worse.
Another case where ELBs have multiple IPs, with the same problem, is when a single availability zone contains a very large number of instances - too many for one ELB server to handle. In that case the ELB will also create another server and add its IP to the list of A records with a 60 second TTL.
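If you want to see how many A records a client is currently being handed for your ELB, here is a quick check from the client side (the hostname below is a placeholder):

import socket

# Returns (canonical_name, aliases, ip_addresses) for the ELB's DNS name.
_, _, ips = socket.gethostbyname_ex("my-elb-1234567890.us-east-1.elb.amazonaws.com")
print(ips)  # typically one address per ELB node / enabled zone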
I can't comment because of my low reputation. Here are some snippets I crafted that might be very useful for someone out there. They utilize the aws cli tool to check when an instance has been drained of connections.
You need an EC2 instance running the Python server below behind an ELB.
from flask import Flask
import time

app = Flask(__name__)

@app.route("/")
def index():
    return "ok\n"

@app.route("/wait/<int:secs>")
def wait(secs):
    time.sleep(secs)
    return str(secs) + "\n"

if __name__ == "__main__":
    app.run(
        host='0.0.0.0',
        debug=True)
Then run the following script from a local workstation against the ELB.
#!/bin/bash
which jq >> /dev/null || {
  echo "Get jq from http://stedolan.github.com/jq"
}

# Fill in following vars
lbname="ELBNAME"
lburl="http://ELBURL.REGION.elb.amazonaws.com/wait/30"
instanceid="i-XXXXXXX"

getState () {
  aws elb describe-instance-health \
    --load-balancer-name $lbname \
    --instance $instanceid | jq '.InstanceStates[0].State' -r
}

register () {
  aws elb register-instances-with-load-balancer \
    --load-balancer-name $lbname \
    --instance $instanceid | jq .
}

deregister () {
  aws elb deregister-instances-from-load-balancer \
    --load-balancer-name $lbname \
    --instance $instanceid | jq .
}

waitUntil () {
  echo -n "Wait until state is $1"
  while [ "$(getState)" != "$1" ]; do
    echo -n "."
    sleep 1
  done
  echo
}

# Actual Dance
# Make sure instance is registered. Check latency until node is deregistered
if [ "$(getState)" == "OutOfService" ]; then
  register >> /dev/null
fi

waitUntil "InService"

curl $lburl &
sleep 1

deregister >> /dev/null

waitUntil "OutOfService"
