Rails 3.0 intermittent Connection timed out, execution expired errors

We're on four Amazon EC2 instances (one load balancer, one db, and two app) and are constantly getting random timeouts. We get at least one a day, sometimes more. Here are some examples:
Errno::ETIMEDOUT: Connection timed out - connect(2)
/usr/local/rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/net/smtp.rb:546:in `initialize'
and
Timeout::Error: execution expired
[GEM_ROOT]/gems/activemodel-3.0.9/lib/active_model/attribute_methods.rb:354:in `match'
I'm not sure how to debug these as they are not related to application code or server load. CPU usage usually hovers below 10% with the biggest spike going up to 60%. The spikes are most likely due to running backups and do not correspond with the times of the timeout errors.
How can these types of errors be tracked down?

The first timeout looks like a legit connection timeout sending mail via SMTP. Are you hosting your own SMTP server or using a service?
Looks like SendGrid has been experiencing delays/timeouts over the last couple of days:
We're currently seeing lots of volume in our queues and emails may be delayed for a short period. Stay tuned for updates. #status
Fix for SMTP service timeouts/failures
Set up a local mail relay that holds mail and re-sends when failures like this occur. We use a local Postfix relay in production for exactly this problem (so ActionMailer uses sendmail to Postfix, which queues up the mail and delivers it via SMTP relay to SendGrid).

Related

Blockers for maximum number of HTTP requests from a pod

I have a Go app deployed to two 8-core pod instances on Kubernetes.
From it, I receive a list of ids that I later use to retrieve some data from another service by sending each id to a POST endpoint.
I am using a bounded concurrency pattern to cap the number of simultaneous goroutines (and therefore requests) to this external service.
I set the limit of concurrency as:
sem := make(chan struct{}, MAX_GO_ROUTINES)
With this setup I started playing around with the MAX_GO_ROUTINES number by increasing it. I usually receive around 20000 ids to check, so I have tried setting MAX_GO_ROUTINES to anywhere between 100 and 20000.
One thing I notice is that as I go higher, some requests start to fail with the message connection reset from this external service.
So my questions are:
What is the blocker in this case?
What is the limit of concurrent HTTP POST requests a server with 8 cores and 4 GB of RAM can send? Is it a memory limit? Or a file descriptor limit?
Is the error I am getting coming from my server or from the external one?
What is the blocker in this case?
As the comment mentioned: HTTP "connection reset" generally means:
the connection was unexpectedly closed by the peer. The server appears
to have dropped the connection on the unsuspecting HTTP client before
sending back a response. This is most likely due to the high load.
Most web servers (like nginx) have a queue where they stage connections while they wait to be serviced. When that queue exceeds some limit, connections may be shed and "reset". So it's most likely that your upstream service is being saturated (i.e. your app sends more requests than the upstream can service, overflowing its queue).
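For reference, here is a minimal runnable version of the semaphore pattern the question describes. The endpoint URL and JSON payload are placeholders (the real service isn't shown in the question), and maxInFlight plays the role of MAX_GO_ROUTINES:

package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sync"
)

const endpoint = "https://example.com/check" // placeholder for the external service

func main() {
    ids := make([]int, 20000) // stand-in for the ~20000 ids from the question
    maxInFlight := 100        // plays the role of MAX_GO_ROUTINES

    sem := make(chan struct{}, maxInFlight) // counting semaphore
    var wg sync.WaitGroup

    for _, id := range ids {
        wg.Add(1)
        sem <- struct{}{} // blocks once maxInFlight requests are in flight
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot when done

            body := bytes.NewBufferString(fmt.Sprintf(`{"id":%d}`, id))
            resp, err := http.Post(endpoint, "application/json", body)
            if err != nil {
                fmt.Println("request failed:", err) // e.g. "connection reset by peer"
                return
            }
            resp.Body.Close()
        }(id)
    }
    wg.Wait()
}

Lowering maxInFlight throttles how hard you hit the upstream; pushing it toward 20000 is what eventually overflows the upstream's queue and produces the resets.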
What is the limit of concurrent HTTP POST requests a server with 8 cores and 4GB of ram can send? Is it a memory limit? or file descriptors limit?
All :) At some point your particular workload will hit a logical limit (like file descriptors) or a "physical" limit (like memory). Unfortunately, the only way to truly understand which resource will be exhausted first (and which constraints you hit up against) is to test, profile, and benchmark your workload :(
Is the error I am getting coming from my server or from the external one?
A connection reset most likely comes from the external service: it indicates that the connection peer (the upstream service) reset the connection.
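If you want to confirm this from the Go side, errors.Is can walk the wrapped error chain that net/http returns (*url.Error wrapping *net.OpError and so on) and match it against ECONNRESET. A sketch, assuming Linux or macOS and a placeholder URL:

package main

import (
    "errors"
    "fmt"
    "net/http"
    "strings"
    "syscall"
)

func main() {
    resp, err := http.Post("https://example.com/check", "application/json",
        strings.NewReader(`{"id":1}`))
    if err != nil {
        if errors.Is(err, syscall.ECONNRESET) {
            // The peer sent a TCP RST: the external service dropped the connection.
            fmt.Println("reset by the external service:", err)
        } else {
            fmt.Println("some other failure:", err)
        }
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}

A plain timeout or local file-descriptor exhaustion takes the else branch, which helps separate "my side ran out of resources" from "the upstream shed my connection".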

How to deal with long-running server responses and a load balancer that treats them as stalled connections

Case/Assumption:
There is a server that is written by someone else.
This server has an endpoint GET /api/watch.
This endpoint is plain HTTP/1.1
This endpoint will write events like
{type:"foo", message:"bar"}
to the response stream once they appear (one event per line and then a flush).
Sometimes this server writes events every second to the output, sometimes every 15 minutes.
Between my client and this server there is a third-party Load Balancer which treats a connection as stale if there is no activity on it for more than 60 seconds, and then drops the connection without closing it.
The client is written in simple Golang and simply makes a GET request to this endpoint.
Once the connection is marked by the LB as stale, the client (the same happens with curl, too) is not notified that the connection was dropped by the LB and keeps waiting for data in the response to the GET request.
So: What are my possibilities to deal with this situation?
What is not possible:
Modify the server.
Use another server.
Use something else than this endpoint and how it is written.
Modify the Load Balancer.
Use another LB.
Leave the LB out of the connection.
15 minutes is an incredibly long quiet period for basic HTTP; this use case is possibly better suited to WebSockets. Short of a protocol change, you should be able to adjust the timeout period on the load balancer to better suit your use case (hard to say exactly how, since you didn't specify what the LB is), though not all load balancers will allow timeouts as high as 15 minutes.
If you can't change the protocol and can't turn the timeout up high enough, you would have to send keepalive messages from the server every so often: just short of the timeout period, so maybe 55s with your current config, or a little less than the highest timeout you are able to set on the LB. This would have to be something the client knows to discard, like {"type": "keepalive"} - something easily identifiable on the client side as a "fake" message that exists only for keepalive purposes.
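On the client side, the discard logic is cheap. A sketch in Go (the question says the client is Go), assuming each line is a JSON object and using the hypothetical {"type": "keepalive"} marker from above:

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "net/http"
)

type event struct {
    Type    string `json:"type"`
    Message string `json:"message"`
}

func main() {
    resp, err := http.Get("http://example.com/api/watch") // placeholder host
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body) // one event per line
    for scanner.Scan() {
        var ev event
        if err := json.Unmarshal(scanner.Bytes(), &ev); err != nil {
            continue // skip lines that aren't well-formed events
        }
        if ev.Type == "keepalive" {
            continue // "fake" message, only there to keep the LB happy
        }
        fmt.Printf("got event: type=%s message=%s\n", ev.Type, ev.Message)
    }
}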

JDBC pooling related to NTP sync?

We're having a connection timeout issue with an API pooling connections to an Informix connection manager, which forwards the queries to the appropriate Informix database server.
Recently I set up the mail service and noticed delays in receiving the mail it sends. After troubleshooting, I saw that the database server is not synchronized with the API at all (a 2+ minute difference).
I've read somewhere that time sync is important when using JDBC pooling, but I can't find much information about this online. The timeout kind of makes sense because of TCP keepalive.
Has anyone experienced this or know anything about it?
Thank you,
Mihai.
It is common to intermix database timestamps and local timestamps, which causes issues when the server clocks differ. If the mail server is looking for records dated before the current time, there could be a two-minute delay before mail is sent.
Email may also be delayed in transit between servers. Check the Received headers to see if there are any unexpected delays. (You will need to compensate for the time variances between the servers.)
Normally you would use NTP to ensure the time is the same on all servers. Within a data center it should be possible to synchronize clocks to within a millisecond or so.
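One quick way to see the skew from the application's point of view is to compare the database clock with the local clock. A sketch in Go using database/sql; the driver, connection string, and NOW() query are all placeholders (the question's stack is Informix over JDBC, so adapt both accordingly, e.g. Informix spells it CURRENT):

package main

import (
    "database/sql"
    "fmt"
    "time"

    _ "github.com/go-sql-driver/mysql" // placeholder driver
)

func main() {
    // Placeholder DSN; parseTime=true lets the driver scan into time.Time.
    db, err := sql.Open("mysql", "user:pass@tcp(dbhost:3306)/app?parseTime=true")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    var dbNow time.Time
    if err := db.QueryRow("SELECT NOW()").Scan(&dbNow); err != nil {
        panic(err)
    }
    fmt.Printf("clock skew between app and database: %v\n", time.Since(dbNow))
}

If this prints anything near the 2+ minutes you observed, fixing NTP on the hosts should be the first step.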

Node.js being bombarded with reconnections after restart

We have a Node instance with about 2500 client socket connections. Everything runs fine until occasionally something happens to the service (a restart, or a failover event in Azure). When the Node instance comes back up and all the socket connections try to reconnect, the service grinds to a halt and the log just shows repeated socket connects/disconnects. Even if we stop the service and start it again, the same thing happens. We currently send out a package to our on-premises servers to kill the users' Chrome sessions, and then everything works fine as users begin logging in again. The clients currently connect with 'forceNew' and are forced to use WebSockets only, not the default long-polling-then-upgrade. Has anyone ever seen this, or have any ideas?
In your socket.io client code, you can force the reconnects to be spread out in time more. The two configuration variables that appear to be most relevant here are:
reconnectionDelay
Determines how long socket.io will initially wait before attempting a reconnect (it should back off from there if the server is down for a while). You can increase this to make it less likely that all the clients try to reconnect at the same time.
randomizationFactor
This is a number between 0 and 1.0 and defaults to 0.5. It determines how much the above delay is randomly varied, so that client reconnects are spread out rather than all happening at the same time. You can increase this value to increase the randomness of the reconnect timing.
See client doc here for more details.
You may also want to explore your server configuration to see whether it is as scalable as possible for moderate numbers of incoming socket requests. While nobody expects a server to handle 2500 simultaneous connections all at once, it should be able to queue up these connection requests and serve them as it gets time, without immediately failing any incoming connection that can't be handled right away. There is a desirable middle ground: some number of connections held in a queue (usually controllable by server-side TCP configuration parameters), and when the queue gets too large, connections are failed immediately, at which point socket.io should back off and try again a little later. Adjusting the above variables tells it to wait longer before retrying.
Also, I'm curious why you are using forceNew; that does not seem like it would help you. Forcing WebSockets only (no initial polling) is a good thing.

How to fix "Sequel::DatabaseDisconnectError - Mysql::Error: MySQL server has gone away" on Heroku

I have a simple Sinatra application hosted on Heroku and using Sequel to connect to a MySql database through the ClearDB addon.
The application works fine, except when it sits idle for more than a minute. In that case, the first request I make gives a "500 Internal Server Error", which heroku logs reveals to be:
Sequel::DatabaseDisconnectError - Mysql::Error: MySQL server has gone away
If I refresh the page after this error, it works fine, and the error will not return until the application sits idle for another minute or so.
The application is running two dynos, so the problem is not being caused by the Heroku dyno idling you might see on a free account. I contacted ClearDB support, and they gave me this advice:
if you are using connection pooling, then you should set the idle
timeout at just below 60 seconds and/or set a keep-alive as I
mentioned below. If you are not using connection pooling, then you
must make sure that the app actually closes connections after queries
and doesn't rely on the network timeout to shut them down.
I understand that I could create a cron job to hit the server every 30s or so, but that seems like an inelegant solution to the problem. The other suggestion, about making sure the application closes connections, I don't understand. I'm just using Sequel to make queries; I assumed that Sequel manages the connections for me under the hood. Do I need to configure it to ensure that it closes connections? How would I do that?
Your connection times out, which is no big deal. Sequel can deal with that situation if you add the connection_validator extension to your DB:
DB.extension(:connection_validator)
As described in the documentation, this extension
"detects an invalid connection, […] removes it from the pool and tries the next available connection, creating a new connection if no available connection is valid"
