Intermittent "failed to respond" on outbound connections - amazon-ec2

I'm experiencing intermittent "failed to respond" errors when making outbound connections such as RPC calls. My Java application logs them like this:
org.apache.http.NoHttpResponseException: RPC_SERVER.com:443 failed to respond !
Outbound connection flow
Kubernetes Node -> ELB for internal NGINX -> internal NGINX ->[Upstream To]-> ELB RPC server -> RPC server instance
This problem does not occur on plain EC2 instances (outside Kubernetes).
I'm able to reproduce it on my localhost as follows (a minimal client sketch follows the steps):
Run the main application, which acts as the client, on port 9200
Run the RPC server on port 9205
The client makes its connection to the server through port 9202
Run $ socat TCP4-LISTEN:9202,reuseaddr TCP4:localhost:9205, which listens on port 9202 and forwards to port 9205 (the RPC server)
Add an iptables rule with $ sudo iptables -A INPUT -p tcp --dport 9202 -j DROP
Trigger an RPC call, and it returns the same error message as described above
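For completeness, here is a minimal sketch of the kind of client call these steps assume. It uses Apache HttpClient (the library that throws the NoHttpResponseException shown in the log above); the /ping path is a hypothetical endpoint:
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ReproClient {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Call the RPC server through the socat forwarder on port 9202;
            // once the iptables DROP rule is in place, this call fails with
            // the NoHttpResponseException described above.
            HttpGet get = new HttpGet("http://localhost:9202/ping");
            System.out.println(EntityUtils.toString(client.execute(get).getEntity()));
        }
    }
}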
Hypothesis
Possibly caused by NAT on Kubernetes. As far as I know, NAT relies on conntrack, and conntrack can drop a TCP connection's state after the connection has been idle for some period of time; the client then assumes the connection is still established although it isn't. (Correct me if I'm wrong.)
I have also tried scaling kube-dns to 10 replicas, and the problem still occurs.
Node Specification
Uses Calico as the network plugin
$ sysctl -a | grep conntrack
net.netfilter.nf_conntrack_acct = 0
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_checksum = 1
net.netfilter.nf_conntrack_count = 1585
net.netfilter.nf_conntrack_events = 1
net.netfilter.nf_conntrack_expect_max = 1024
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_helper = 1
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_log_invalid = 0
net.netfilter.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_tcp_be_liberal = 0
net.netfilter.nf_conntrack_tcp_loose = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 3600
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_timestamp = 0
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.nf_conntrack_max = 262144
Kubelet config
[Service]
Restart=always
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CLOUD_ARGS=--cloud-provider=aws"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_EXTRA_ARGS $KUBELET_CLOUD_ARGS
Kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.7", GitCommit:"8e1552342355496b62754e61ad5f802a0f3f1fa7", GitTreeState:"clean", BuildDate:"2017-09-28T23:56:03Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Kube-proxy Log
W1004 05:34:17.400700 8 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I1004 05:34:17.405871 8 server.go:478] Using iptables Proxier.
W1004 05:34:17.414111 8 server.go:787] Failed to retrieve node info: nodes "ip-172-30-1-20" not found
W1004 05:34:17.414174 8 proxier.go:483] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I1004 05:34:17.414288 8 server.go:513] Tearing down userspace rules.
I1004 05:34:17.443472 8 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 262144
I1004 05:34:17.443518 8 conntrack.go:52] Setting nf_conntrack_max to 262144
I1004 05:34:17.443555 8 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1004 05:34:17.443584 8 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1004 05:34:17.443851 8 config.go:102] Starting endpoints config controller
I1004 05:34:17.443888 8 config.go:202] Starting service config controller
I1004 05:34:17.443890 8 controller_utils.go:994] Waiting for caches to sync for endpoints config controller
I1004 05:34:17.443916 8 controller_utils.go:994] Waiting for caches to sync for service config controller
I1004 05:34:17.544155 8 controller_utils.go:1001] Caches are synced for service config controller
I1004 05:34:17.544155 8 controller_utils.go:1001] Caches are synced for endpoints config controller
$ lsb_release -s -d
Ubuntu 16.04.3 LTS

Check the value of sysctl net.netfilter.nf_conntrack_tcp_timeout_close_wait inside the pod that contains your program. It is possible that the value on the node that you listed (3600) isn't the same as the value inside the pod.
If the value in the pod is too small (e.g. 60), and your Java client half-closes the TCP connection with a FIN when it finishes transmitting, but the response takes longer than the close_wait timeout to arrive, nf_conntrack will lose the connection state and your client program will not receive the response.
You may need to change the behavior of the client program to not use a TCP half-close, OR modify the value of net.netfilter.nf_conntrack_tcp_timeout_close_wait to be larger. See https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/.
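If changing the conntrack sysctls isn't an option, one client-side mitigation is to keep pooled connections from sitting idle past the conntrack timeouts, and to retry requests that went out on a connection the other side has silently dropped. A minimal sketch, assuming Apache HttpClient 4.4+ (the library your NoHttpResponseException comes from); the one-minute eviction window is an arbitrary choice, and retrying already-sent requests is only safe if the RPCs are idempotent:
import java.util.concurrent.TimeUnit;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;

public class RpcClientFactory {
    public static CloseableHttpClient create() {
        return HttpClients.custom()
                // Close pooled connections idle for over a minute, before
                // conntrack can silently drop their state.
                .evictIdleConnections(1, TimeUnit.MINUTES)
                // Retry up to 3 times, including requests that were already
                // sent (true), which covers NoHttpResponseException on a
                // stale connection. Only enable this for idempotent RPCs.
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
                .build();
    }
}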

Related

When I started HAProxy under Windows Server, I got the following error

haproxy.exe -f haproxy.cfg -d
When I run HAProxy, I get an error:
'''
Available polling systems :
poll : pref=200, test result OK
select : pref=150, test result FAILED
Total: 2 (1 usable), will use poll.
Available filters :
[SPOE] spoe
[CACHE] cache
[FCGI] fcgi-app
[COMP] compression
[TRACE] trace
Using poll() as the polling mechanism.
[NOTICE] (1036) : haproxy version is 2.4.0-6cbbecf
[ALERT] (1036) : Starting proxy warelucent: cannot bind socket (Address already in use) [0.0.0.0:5672]
[ALERT] (1036) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
'''
Meanwhile, no other services are running, and I have the RabbitMQ service started.
My haproxy.cfg file is as follows:
'''
#logging options
global
log 127.0.0.1 local0 info
maxconn 1500
daemon
quiet
nbproc 20
defaults
log global
mode tcp
#if you set mode to tcp, then you must change tcplog into httplog
option tcplog
option dontlognull
retries 3
option redispatch
maxconn 2000
timeout connect 10s
timeout client 10s
timeout server 10s
#front-end IP for consumers and producters
listen warelucent
bind 0.0.0.0:5672
#configure TCP mode
mode tcp
#balance url_param userid
#balance url_param session_id check_post 64
#balance hdr(User-Agent)
#balance hdr(host)
#balance hdr(Host) use_domain_only
#balance rdp-cookie
#balance leastconn
#balance source //ip
#simple round-robin
balance roundrobin
server one 1.1.1.1:5672 check inter 5000 rise 2 fall 2
server two 2.2.2.2:5672 check inter 5000 rise 2 fall 2
server three 3.3.3.3:5672 check inter 5000 rise 2 fall 2
listen stats
bind 127.0.0.1:8100
mode http
option httplog
stats enable
stats uri /rabbitmq-stats
stats refresh 5s
'''
Most answers on the Internet attribute this to the HAProxy version, but I checked the official website and my version is the latest. I have also started the RabbitMQ service, so I don't know where the error is coming from.
(Address already in use) [0.0.0.0:5672]
It means that port 5672 (RabbitMQ's port) is already in use. Most likely you have a RabbitMQ node running on that machine.
So just change the HAProxy port.
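For example, pick a port RabbitMQ isn't using for the listener in your haproxy.cfg (5673 here is an arbitrary choice):
listen warelucent
bind 0.0.0.0:5673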

Aerospike Java client EOFException

I am trying to connect to an Aerospike single node I set up using Vagrant on macOS. My AMC is running on localhost:2200. I am unable to connect to it successfully.
import com.aerospike.client.AerospikeClient;

public class AerospikeDriver {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 2200);
        client.close();
    }
}
I am getting this error in the first line itself. I tried changing the port to 3000 as well. Same error. Sometimes, I get SocketException as well.
Exception in thread "main" com.aerospike.client.AerospikeException$Connection: Error Code -8: Failed to connect to host(s):
127.0.0.1 2200 Error Code -1: java.io.EOFException
at com.aerospike.client.cluster.Cluster.seedNodes(Cluster.java:532)
at com.aerospike.client.cluster.Cluster.tend(Cluster.java:425)
at com.aerospike.client.cluster.Cluster.waitTillStabilized(Cluster.java:380)
at com.aerospike.client.cluster.Cluster.initTendThread(Cluster.java:286)
at com.aerospike.client.cluster.Cluster.<init>(Cluster.java:243)
at com.aerospike.client.AerospikeClient.<init>(AerospikeClient.java:234)
at com.aerospike.client.AerospikeClient.<init>(AerospikeClient.java:175)
at AerospikeDriver.main(AerospikeDriver.java:5)
My maven dependency for aerospike client is this:
<dependency>
<groupId>com.aerospike</groupId>
<artifactId>aerospike-client</artifactId>
<version>4.1.11</version>
</dependency>
This is my aerospike conf:
# Aerospike database configuration file.
# This stanza must come first.
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
# service-threads 4
# transaction-queues 4
# transaction-threads-per-queue 4
proto-fd-max 15000
node-id-interface eth1
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
file /var/log/aerospike/udf.log {
context udf info
context aggr info
}
}
network {
service {
address eth1
port 3000
# access-address <Published IP>
# access-address <NAT IP>
}
heartbeat {
mode multicast
multicast-group 239.1.99.222
address eth1
port 9918
protocol v3
# To use unicast-mesh heartbeats, comment out the 3 lines above and
# use the following 4 lines instead.
# mode mesh
# port 3002
# mesh-address 10.1.1.1
# mesh-port 3002
interval 150
timeout 10
}
fabric {
port 3001
address eth1
}
info {
port 3003
}
}
#namespace test {
# replication-factor 2
# memory-size 4G
# default-ttl 30d # 30 days, use 0 to never expire/evict.
#
# storage-engine memory
#}
namespace test {
replication-factor 2
memory-size 2G
default-ttl 5d # 5 days, use 0 to never expire/evict.
# To use file storage backing, comment out the line above and use the
# following lines instead.
storage-engine device {
file /opt/aerospike/data/test.dat
filesize 5G
data-in-memory true # Store data in memory in addition to file.
}
}
What am I doing wrong? Need some help here. I am very new to aerospike. I tried searching everywhere, but couldn't find anything.
UPDATE
I am now using IP address 172.28.128.4 (got it from ifconfig command) and port 3000 to connect to aerospike. I am now getting Socket Timeout Exception.
If you set up a single node on Vagrant on a Mac, and you are running your application in an IDE on the Mac (say Eclipse), localhost on Vagrant is typically exposed to the Mac as 172.28.128.3; running ifconfig in your Vagrant shell will confirm that. If your application is running inside Vagrant itself, then 127.0.0.1 should work. In either case, your application should specify port 3000; that's where the Aerospike server is listening. AMC is a web server that talks to Aerospike on port 3000 and serves the dashboard on port 8081 by default, so it is a monitoring and management gateway to Aerospike via a web browser. Also, in your Aerospike config, I suggest you use the mesh config instead of multicast, though for a single node it does not matter since you are not making a cluster. If you are new and you download CE, you get complimentary access to the Aerospike Intro course in Aerospike Academy; take advantage of that, it's a few hours' investment. Otherwise there are some intro videos on YouTube (02-Intro to Aerospike and 03-handson).
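Applied to the code from the question, a sketch using the service port and the host-only IP reported in your update (confirm the exact IP with ifconfig in the Vagrant shell):
import com.aerospike.client.AerospikeClient;

public class AerospikeDriver {
    public static void main(String[] args) {
        // Use the Vagrant VM's host-only IP and the Aerospike service
        // port 3000, not the AMC port 2200 (AMC is only a dashboard).
        AerospikeClient client = new AerospikeClient("172.28.128.4", 3000);
        client.close();
    }
}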

How to open port 11211 for EC2 instance by security group

May I know how to open port 11211 in the security group for an EC2 instance running a memcached server?
I'm trying to connect from Rails server to memcached server. However something is wrong with my security group setting.
What I have done so far:
To launch 2 instances. One is Rails server, the other is memcached server.
To set up security groups
Rails server : Outbound => All traffic , All protocol, All port
memcached server : Inbound =>
ssh TCP, port 22, All source
Custom TCP Rule port, 11211, Rails server IP address
When I log in to the Rails server and execute the command below, it works for port 22:
$ telnet <memcached private IP address> 22
Trying <IP address>...
Connected to <IP address>.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.2p2 Ubuntu-4ubuntu2.4
But when I check port 11211, it doesn't work.
$ telnet <memcached private IP address> 11211
My first question is: why is only port 22 working, even though I've set up almost the same rules in the security group?
When I log in to the memcached server and check the status, it looks like it's working:
$ sudo /etc/init.d/memcached status
● memcached.service - memcached daemon
Loaded: loaded (/lib/systemd/system/memcached.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2018-02-14 14:23:40 UTC; 19h ago
Main PID: 7569 (memcached)
Tasks: 6
Memory: 628.0K
CPU: 2.093s
CGroup: /system.slice/memcached.service
└─7569 /usr/bin/memcached -m 64 -p 11211 -u memcache
$ sudo netstat -ltup4
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:11211 *:* LISTEN 6486/memcached
udp 0 0 *:11211 *:* 6486/memcached
Could you let me know what I should do?
You need to add the security group of your Rails server to the memcached server's inbound rules.
Add a new rule, select your protocol and port range, and for "Source", type or select the Rails server's security group.
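If you prefer the CLI, the same rule can be added with aws ec2 authorize-security-group-ingress; the two group IDs below are placeholders for your memcached server's and Rails server's actual security group IDs:
aws ec2 authorize-security-group-ingress \
    --group-id sg-MEMCACHED_GROUP_ID \
    --protocol tcp \
    --port 11211 \
    --source-group sg-RAILS_GROUP_ID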

docker + haproxy on mac 10.11.5 doesn't work

I am running an HAProxy configuration on a Mac that works perfectly on Linux, but I can't get the proxy to even respond. Here is my config:
defaults
mode http
timeout connect 5000ms
timeout client 5000ms
timeout server 5000ms
frontend http
bind *:80
acl oracle_content hdr(ContentType) -i application/vnd.api+json
acl oracle_accept hdr(Accept) -i application/vnd.api+json
use_backend oracle_be if oracle_content
use_backend oracle_be if oracle_accept
default_backend matrix_be
backend oracle_be
balance roundrobin
server oracle1 theoracle.stage.company.com:8080
backend matrix_be
balance roundrobin
server matrix1 192.168.1.6:3000
docker run -d --name cc -v /Users/cbongiorno/development/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro haproxy
docker -v
Docker version 1.12.0, build 8eab29e
The only machine-specific config is the IP address of the matrix_be entry, which has to be my local interface. It's not working on 2 Macs, and I have tried binding the proxy to multiple interfaces. I am not even getting a 504, which would indicate the proxy is fine but one of the backend services is misconfigured.
Ideas?
Due to current Docker on Mac limitations, the -p 80:80 flag must be passed even if the container declares port 80 open.
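That is, the docker run command from the question becomes (unchanged except for the added port-publish flag):
docker run -d --name cc -p 80:80 -v /Users/cbongiorno/development/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro haproxy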

(mac) docker: how to connect to a redis server on the host machine from a container

I am running a docker container on my Mac using boot2docker.
I want to connect to the redis-server running on my host machine from inside the container.
I have managed to connect from the container to a service running on the host machine using curl http://192.168.3.124:5000 (getting results).
I have also managed to connect to redis, but the data I pull from it does not match its state:
redisServer = redis.StrictRedis(host='192.168.3.124', port= "6379"); redisServer.get("2") (gets no results, although that key is set on the host machine)
details:
running the redis server :
[58781] 13 May 13:53:16.120 # Server started, Redis version 2.8.19
[58781] 13 May 13:53:16.120 * DB loaded from disk: 0.000 seconds
[58781] 13 May 13:53:16.120 * The server is now ready to accept connections on port 6379
ps aux |grep redis
partuck 58781 0.0 0.0 2469924 1652 s002 S+ 1:53PM 0:00.03 redis-server *:6379
partuck 58728 0.0 0.7 2583104 115260 ?? S 1:53PM 0:00.47 /usr/local/opt/redis/bin/redis-server 127.0.0.1:6379
The host's IP in the VirtualBox VM that boot2docker sets up is (typically) 10.0.2.2.
So you should try connecting to 10.0.2.2:6379.
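Applied to the snippet from the question (same call, with the boot2docker host address swapped in and the port given as an integer):
redisServer = redis.StrictRedis(host='10.0.2.2', port=6379); redisServer.get("2")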
