Frozen SSH connection to Amazon EC2 running Ubuntu - amazon-ec2

When I am connected to Amazon EC2 using the secure shell and don't type anything for a few minutes, everything freezes. I can't type anything or exit. After a few minutes I get a message from the server...
Last login: Fri Dec 6 23:21:28 2013 from pool-173-52-249-158.nycmny.east.verizon.net
ubuntu@ip-172-31-31-33:~$ Write failed: Broken pipe
Some of you must have had this problem before. I'd appreciate it if you could shed some light on the situation for a newbie using the cloud.

Try the options below:
Explore ServerAliveCountMax and ServerAliveInterval. These settings are configured in /etc/ssh/ssh_config on the SSH client side.
from man ssh_config:
ServerAliveCountMax
Sets the number of server alive messages (see below) which may be sent without ssh(1) receiving any messages back from the server. If this threshold is reached while server alive messages are being sent, ssh will disconnect from the server, terminating the session. It is important to note that the use of server alive messages is very different from TCPKeepAlive (below). The server alive messages are sent through the encrypted channel and therefore will not be spoofable. The TCP keepalive option enabled by TCPKeepAlive is spoofable. The server alive mechanism is valuable when the client or server depend on knowing when a connection has become inactive.
The default value is 3. If, for example, ServerAliveInterval (see below) is set to 15 and ServerAliveCountMax is left at the default, if the server becomes unresponsive, ssh will disconnect after approximately 45 seconds. This option applies to protocol version 2 only; in protocol version 1 there is no mechanism to request a response from the server to the server alive messages, so disconnection is the responsibility of the TCP stack.
And
ServerAliveInterval
Sets a timeout interval in seconds after which if no data has been received from the server, ssh(1) will send a message through the encrypted channel to request a response from the server. The default is 0, indicating that these messages will not be sent to the server, or 300 if the BatchMode option is set. This option applies to protocol version 2 only. ProtocolKeepAlives and SetupTimeOut are Debian-specific compatibility aliases for this option.
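For a quick test without editing any files, the same options can also be passed per invocation on the ssh command line (the hostname here is a placeholder for your own instance):
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 ubuntu@<your-ec2-host>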
Similar settings are also available on the server side: ClientAliveInterval and ClientAliveCountMax. These settings are placed in /etc/ssh/sshd_config on the server side.
from man sshd_config:
ClientAliveCountMax
Sets the number of client alive messages (see below) which may be sent without sshd(8) receiving any messages back from the client. If this threshold is reached while client alive messages are being sent, sshd will disconnect the client, terminating the session. It is important to note that the use of client alive messages is very different from TCPKeepAlive (below). The client alive messages are sent through the encrypted channel and therefore will not be spoofable. The TCP keepalive option enabled by TCPKeepAlive is spoofable. The client alive mechanism is valuable when the client or server depend on knowing when a connection has become inactive.
The default value is 3. If ClientAliveInterval (see below) is set to 15, and ClientAliveCountMax is left at the default, unresponsive SSH clients will be disconnected after approximately 45 seconds. This option applies to protocol version 2 only.
And
ClientAliveInterval
Sets a timeout interval in seconds after which if no data has been received from the client, sshd(8) will send a message through the encrypted channel to request a response from the client. The default is 0, indicating that these messages will not be sent to the client. This option applies to protocol version 2 only.
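As a minimal sketch of the server-side equivalent (the values are illustrative, not a recommendation), the corresponding entries in /etc/ssh/sshd_config on the instance would look like:
# /etc/ssh/sshd_config (on the EC2 instance)
ClientAliveInterval 60
ClientAliveCountMax 3
After changing them, restart the SSH daemon (for example, sudo service ssh restart on Ubuntu) so the new values take effect.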

It looks like your firewalls (at the different locations) are dropping the sessions due to inactivity.
I would try, as #slayedbylucifer stated, something like this in your ~/.ssh/config:
Host *
    ServerAliveInterval 60

Related

How to find Oracle server crash from OCI Client program

I have written an Oracle client program using the OCI library.
The client sent a request to the server and hung, because the server crashed without notifying the client.
How can I find the server status from the client side (using the OCI API)?
Thanks
I think the Oracle DB module for Asterisk had a nice DCD (dead connection detection) implementation. There are various approaches (server side, client side).
In your case the easiest way would be to use TCP keepalive. Use the enable=broken directive in tnsnames.ora.
Purpose
The keepalive feature on the supported TCP transports can be enabled
for a net service client by embedding (ENABLE=BROKEN) under the
DESCRIPTION parameter in the connect string. Keepalive allows the
caller to detect a dead remote server, although typically it will take
2 hours or more to notice. Operating system TCP configurables, which
vary by platform, define the actual keepalive timing details.
net_service_name=
  (DESCRIPTION=
    (enable=broken)
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-svr)(PORT=1521))
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-svr)(PORT=1521))
    (CONNECT_DATA=(SERVICE_NAME=sales.us.example.com)))
Just beware that you will also need root privileges. With the default settings, the Linux kernel starts sending keepalive packets only after 2 hours of inactivity, so you also have to change tcp_keepalive_time and tcp_keepalive_intvl in /etc/sysctl.conf. These are global server-side settings, and Oracle cannot yet set the keepalive interval for a single TCP connection.
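A minimal sketch of those kernel settings (the values are illustrative only, not a recommendation):
# /etc/sysctl.conf -- send the first keepalive probe after 60s of idle time,
# then probe every 10s, and declare the peer dead after 5 failed probes
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5
Apply them with sysctl -p (as root).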
One more comment: I recall there is a function called OCIPing.
That one can be used for testing too, but I'm not sure how to distinguish a long-running query from a dead-server situation.

HornetQ client-failure-check-period

Suppose that after 30s (the default client-failure-check-period) the client has not received any packets from the server as a result of network connection problems.
Will the client now be disconnected from the session/connection?
Suppose now I add this configuration:
<retry-interval>1000</retry-interval>
<retry-interval-multiplier>1.5</retry-interval-multiplier>
<max-retry-interval>60000</max-retry-interval>
<reconnect-attempts>1000</reconnect-attempts>
What will happen now?
Will the client still get disconnected from the session/connection, but only after trying to reconnect 1000 times (until the network is available again)? Or will it skip the disconnect entirely?
Regarding your first question, according to the HornetQ documentation, section 17.2, Detecting failure from the client side:
As long as the client is receiving data from the server it will consider the connection to be still alive.
If the client does not receive any packets for client-failure-check-period milliseconds then it will consider the connection failed and will either initiate failover, or call any FailureListener instances (or ExceptionListener instances if you are using JMS) depending on how it has been configured.
Therefore the client will assume that the connection was in fact lost and start its failure processing.
For your second question, also according to the HornetQ documentation, section 34.3, Configuring reconnection/reattachment attributes:
reconnect-attempts. This optional parameter determines the total number of reconnect attempts to make before giving up and shutting down. A value of -1 signifies an unlimited number of attempts. The default value is 0.
So, yes, the connection will be dropped after 1000 attempts.
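As a rough sketch of where those attributes usually live (assuming the standard hornetq-jms.xml layout; the connector and JNDI entry names below are placeholders for your own configuration):
<connection-factory name="ConnectionFactory">
   <connectors>
      <connector-ref connector-name="netty-connector"/>
   </connectors>
   <entries>
      <entry name="ConnectionFactory"/>
   </entries>
   <client-failure-check-period>30000</client-failure-check-period>
   <retry-interval>1000</retry-interval>
   <retry-interval-multiplier>1.5</retry-interval-multiplier>
   <max-retry-interval>60000</max-retry-interval>
   <reconnect-attempts>1000</reconnect-attempts>
</connection-factory>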

Open multiple keep-alives to a server in Go

I have an application that opens a keep-alive connection to a few remote servers (that I control). It sends a heartbeat packet to keep the connection alive before the timeout.
This is how I created my transport:
// Keep-alive connection to the servers
tr := &http.Transport{}
client := &http.Client{Transport: tr}
If I use &http.Transport{MaxIdleConnsPerHost: 2} and set that value to something > 2, then I'm able to maintain multiple keep-alives per remote server. However, these additional keep-alives are created by Go itself when concurrent requests have to be made, and they are terminated automatically after the timeout expires.
My question is: how can I create the additional keep-alives, say 5 keep-alives per remote server myself when I initialize my transport (when I start Go) and keep them all alive? This would speed up subsequent requests greatly and speed is very important.
Based on input from the go-nuts group: to manually open multiple keep-alives to one server, make that many simultaneous requests. Go then keeps the connections alive until the remote server times them out (the default is 5 seconds in Apache).
Note that these connections cannot exceed MaxIdleConnsPerHost, which is 2 by default.
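A minimal sketch of that warm-up, assuming a hypothetical heartbeat URL and 5 desired connections (both placeholders):
package main

import (
    "io"
    "io/ioutil"
    "net/http"
    "sync"
)

func main() {
    const n = 5 // desired number of keep-alive connections per server (placeholder)
    // Allow the transport to keep all n idle connections instead of the default 2.
    tr := &http.Transport{MaxIdleConnsPerHost: n}
    client := &http.Client{Transport: tr}

    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            resp, err := client.Get("http://my-remote-server/heartbeat") // placeholder URL
            if err != nil {
                return
            }
            // Drain and close the body so the connection goes back into the idle pool.
            io.Copy(ioutil.Discard, resp.Body)
            resp.Body.Close()
        }()
    }
    wg.Wait()
    // Up to n idle keep-alive connections now sit in the pool until the
    // remote server's keep-alive timeout closes them.
}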
You can verify this behaviour using netstat -p tcp

TCP and Apache KeepAliveTimeout

A few weeks ago I wrote a small program which created a socket to an Apache web server and made a request.
Back then I did not know that this web server had a KeepAliveTimeout of 5 seconds.
After my first request I waited 1 minute. I then wanted to reuse the first socket for another web server request, but got an error.
From Beej's Guide to Network Programming I learned that if recv returns 0, then the other side has closed its connection:
Wait! recv() can return 0. This can mean only one thing: the remote side has closed
the connection on you! A return value of 0 is recv()'s way of letting you know this
has occurred.
My questions are now:
What does Apache send when the KeepAliveTimeout is over - a FIN or a RST packet?
I know that using one TCP connection for 2 unrelated HTTP requests, as in this scenario, might not be the best thing. But in order to understand TCP better, the next question is:
After my first successful HTTP request, and before sending the next HTTP request over the same socket, is there any way to be informed about this keepalive-timeout TCP socket termination by the server, other than receiving 0 from the next recv() call?
It will send a FIN. If you write a request to the server after that, send() will return -1 with errno/WSAGetLastError() = ECONNRESET.
is there any way to be informed about this keepalive-timeout TCP socket termination by the server
Yes, by reading the proper response header parameter, namely Keep-Alive: timeout=delta-seconds:
'timeout' Parameter
A host sets the value of the timeout parameter to the time that the host will allow an idle connection to remain open before it is closed. A connection is idle if no data is sent or received by a host.
The value of the timeout parameter is a single integer in seconds.
A host MAY keep an idle connection open for longer than the time that it indicates, but it SHOULD attempt to retain a connection for at least as long as indicated.
As you can see, it's up to the host to decide. Since the host only SHOULD try to keep the connection open as long as promised, and is not required to do so in order to conform to the spec, the server might decide to close the connection and reuse it to serve another pending client.
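As a rough sketch of checking that header over a raw socket (written in Go for consistency with the earlier example; the host is a placeholder):
package main

import (
    "bufio"
    "fmt"
    "net"
    "net/http"
)

func main() {
    conn, err := net.Dial("tcp", "example.com:80") // placeholder host
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // A minimal HTTP/1.1 keep-alive request, written straight to the socket.
    fmt.Fprint(conn, "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: keep-alive\r\n\r\n")

    // Parse the response headers; passing nil means "assume a GET request".
    resp, err := http.ReadResponse(bufio.NewReader(conn), nil)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Apache typically advertises something like "timeout=5, max=100";
    // an empty string means no keep-alive timeout was announced.
    fmt.Println("Keep-Alive:", resp.Header.Get("Keep-Alive"))
}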

How can I set the timeout on OCILogon2?

When the Oracle 10 databases are up and running fine, OCILogon2() connects immediately. When the databases are turned off or inaccessible due to network issues, it fails immediately.
However, when our DBAs go into emergency maintenance and block incoming connections, it can take 5 to 10 minutes to time out.
This is problematic for me since I've found that OCILogon2 isn't thread safe and we can only use it serially, and I connect to quite a few Oracle DBs: 3 blocked servers x 5-10 minutes = 15 to 30 minutes of lockup time.
Does anyone know how to set the OCILogon2 connection timeout?
Thanks.
I'm currently playing with OCI and it seems to me that it's impossible.
The only way I can think of is to use non-blocking mode. You'll need OCIServerAttach() and OCISessionBegin() instead of OCILogon() in this case. But when I tried this, OCISessionBegin() constantly returned OCI_ERROR with the following error code:
ORA-03123 operation would block
Cause: The attempted operation cannot complete now.
Action: Retry the operation later.
It looks strange and I don't yet know how to deal with it.
A possible workaround is to run your logon in another process, which you can kill after a timeout...
We think we found the right file setting - but it's one of those problems where we have to wait until something rare and horrible occurs before we can verify it :-/
[sqlnet.ora]
SQLNET.OUTBOUND_CONNECT_TIMEOUT=60
From the Oracle docs:
http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/sqlnet.htm#BIIFGFHI
5.2.35 SQLNET.OUTBOUND_CONNECT_TIMEOUT
Purpose
Use the SQLNET.OUTBOUND_CONNECT_TIMEOUT parameter to specify the time, in seconds, for a client to establish an Oracle Net connection to the database instance.
If an Oracle Net connection is not established in the time specified, the connect attempt is terminated. The client receives an ORA-12170: TNS:Connect timeout occurred error.
The outbound connect timeout interval is a superset of the TCP connect timeout interval, which specifies a limit on the time taken to establish a TCP connection. Additionally, the outbound connect timeout interval includes the time taken to be connected to an Oracle instance providing the requested service.
Without this parameter, a client connection request to the database server may block for the default TCP connect timeout duration (approximately 8 minutes on Linux) when the database server host system is unreachable.
The outbound connect timeout interval is only applicable for TCP, TCP with SSL, and IPC transport connections.
Default
None
Example
SQLNET.OUTBOUND_CONNECT_TIMEOUT=10
