What would be the best way to create a liveness probe based on incoming TCP traffic on particular port?
tcpdump and bash are available inside so it could be achieved by some script checking if there is incoming traffic on that port, but I wonder if there are better (cleaner) ways?
The example desired behaviour:
if there is no incoming traffic on port 1234 for the last 10 seconds the container crashes
With the configuration below, the container will be restarted if there is no incoming traffic on port 1234 for the last 10 seconds. Also note that there is no probe that makes the container crash; a failing liveness probe results in a restart.
livenessProbe:
  tcpSocket:
    port: 1234
  periodSeconds: 10
  failureThreshold: 1
Here is the documentation
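A tcpSocket probe only checks that something accepts a connection on the port (the kubelet opens that connection itself), so if you literally need "no incoming traffic for 10 seconds means restart", an exec probe built on the tcpdump idea from the question is one option. This is only a sketch: it assumes tcpdump, bash and timeout are present in the image and that the container is allowed to capture packets (CAP_NET_RAW):
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      # exit 0 as soon as one packet arrives on port 1234; timeout kills tcpdump
      # with a non-zero status if nothing arrives within 10 seconds
      - "timeout 10 tcpdump -i any -c 1 'tcp dst port 1234' > /dev/null 2>&1"
  periodSeconds: 15
  timeoutSeconds: 12   # must be longer than the 10-second capture window
  failureThreshold: 1
Whether this is cleaner than the tcpSocket probe above is debatable, but it matches the stated requirement more literally.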
Related
I'm configuring startup/liveness/readiness probes for Kubernetes deployments serving Spring Boot services. According to the Spring Boot documentation, it is best practice to use the corresponding liveness & readiness actuator endpoints, as described here:
https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot
What do you use for your startup probe?
What are your recommendations for failureThreshold, delay, period and timeout values?
Did you encounter issues when deploying Istio sidecars to an existing setup?
I use the paths /actuator/health/readiness and /actuator/health/liveness:
readinessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/readiness
    port: 8080
livenessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/liveness
    port: 8080
For the recommendations, it depends on your needs and policies, actually (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
No Istio sidecar issues with this :)
Do not forget to activate the endpoints in your properties (cf. https://www.baeldung.com/spring-liveness-readiness-probes):
management.endpoint.health.probes.enabled=true
management.health.livenessState.enabled=true
management.health.readinessState.enabled=true
In addition to @Bguess's answer and the part:
for the recommendations, it depends on your needs and policies
Some of our microservices communicate with heavy/slow containers which need time to boot up, therefore we have to protect them with startup probes. The way recommended by https://kubernetes.io/ is to point the startup probe at the liveness check, which in combination with the Spring Boot actuator endpoints results in:
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: http
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  failureThreshold: 25
  periodSeconds: 10
The above setup makes sure that we only probe liveness and readiness once the application has fully started (it has 10*25=250 seconds to do so). As the doc says:
If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup
Please note that
management.endpoint.health.probes.enabled=true
is not needed for the applications running in Kubernetes (doc)
These health groups are automatically enabled only if the application runs in a Kubernetes environment. You can enable them in any environment by using the management.endpoint.health.probes.enabled configuration property.
Hence you only need it if you want to check the probes outside Kubernetes, for example locally.
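With that property enabled, a quick local check of the probe endpoints (assuming the application runs on port 8080 as in the snippets above) could be:
curl -i http://localhost:8080/actuator/health/liveness    # typically 200 with {"status":"UP"}
curl -i http://localhost:8080/actuator/health/readiness   # typically 200 with {"status":"UP"}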
The startup probe is optional.
Originally, there were two types of probes in Kubernetes: readiness and liveness. However, people have encountered issues with slow-start containers. When a container takes a long time to start, Kubernetes does the first check on the liveness probe after initialDelaySeconds. If the check fails, Kubernetes attempts failureThreshold times with an interval of periodSeconds. If the liveness probe still fails, Kubernetes assumes that the container is not alive and restarts it. Unfortunately, the container will likely fail again, resulting in an endless cycle of restarting.
You may want to increase failureThreshold and periodSeconds to avoid the endless restarting, but it can cause longer detection and recovery times in case of a thread deadlock.
You may want to make the initialDelaySeconds longer to allow sufficient time for the container to start. However, it can be challenging to determine the appropriate delay since your application can run on various hardware. For instance, increasing initialDelaySeconds to 60 seconds to avoid this problem in one environment may cause unnecessary slow startup when deploying the service to a more advanced hardware that only requires 20 seconds to start. In such a scenario, Kubernetes waits for 60 seconds for the first liveness check, causing the pod to be idle for 40 seconds, and it still takes 60 seconds to serve.
To address this issue, Kubernetes introduced the startup probe in 1.16, which defers all other probes until a pod completes its startup process. For slow-starting pods, the startup probe can poll at short intervals with a high failure threshold until it is satisfied, at which point the other probes can begin.
If a container’s components take a long time to become ready, except for the API component, the container can simply report 200 in the liveness probe, and the startup probe is not needed. Because the API component will be ready and report 200 very soon, Kubernetes will not restart the container endlessly; it will patiently wait until all the readiness probes indicate that the containers are “ready” and only then route traffic to the pod.
The startup probe can be implemented in the same way as the liveness probe. Once the startup probe confirms that the container is initialized, the liveness probe will immediately report that the container is alive, leaving no room for Kubernetes to mistakenly restart the container.
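A minimal sketch of that pattern (the path and numbers here are illustrative, not taken from the question): the startup probe polls the same endpoint with a generous failureThreshold, and the liveness probe only takes over once the startup probe has succeeded.
startupProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 30    # up to 5*30 = 150 seconds to finish starting
livenessProbe:
  httpGet:
    path: /healthz        # same check as the startup probe
    port: 8080
  periodSeconds: 10
  failureThreshold: 3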
Regarding initialDelaySeconds, periodSeconds, failureThreshold, and timeoutSeconds, it is really a balance between sensitivity and false positives. For example, if you use a high failureThreshold and a high periodSeconds for the readiness probe, Kubernetes cannot detect issues in the container in a timely manner and your pod continues to take traffic, so many requests will fail. If you use a low failureThreshold and a low periodSeconds for the readiness probe, a temporary problem can take the pod out of traffic, which is a false positive. I tend to keep the default failureThreshold of 3 and set periodSeconds to 5 and successThreshold to 1 or 2.
BTW, don't use the default health check from Spring Boot; you always need to customize it. More details here: https://danielw.cn/health-check-probes-in-k8s
Before anything else, I have read about 30+ StackOverflow answers and none of them seem to address my particular flavour of this problem. Below I list all the answers I have already tried before asking for more advice.
I am trying to access my ec2 instance via socket in PHP from a different machine via fsockopen, pointed at my ec2 public IP (I have an Elastic fixed IP address 54.68.166.28) and designated port.
Behaviour: from within the instance I can access the instance and the ChatScript application running inside it, via the public IP directly in the browser. But if I run the exact same webpage with the exact same socket call on an external machine targeting my instance's IP address (double-checked it is the correct one), I get a 500 Internal Server Error when connecting on port 1024 (my custom TCP port) and another 500 on port 443 (HTTPS). On port 80 (HTTP) it hangs for 20+ seconds, then gives me status 200 success, except it does not connect properly to the application and responds with nothing.
Troubleshooting:
I have set up my security group rules to accept incoming TCP from anywhere:
HTTP (80) TCP 80 0.0.0.0/0
HTTP (80) TCP 80 ::/0
HTTPS (443) TCP 443 0.0.0.0/0
HTTPS (443) TCP 443 ::/0
Custom (1024) TCP 1024 0.0.0.0/0
Custom (1024) TCP 1024 ::/0
Outbound rules span port range 0 - 65535 with destination 0.0.0.0/0, so should work.
I ssh every time without problems into the instance on port 22. SCP also works fine.
Checked $ sudo service httpd status: running, which is why my UI is served fine from there.
Checked $ sudo /sbin/iptables -L and all my policies are set to ACCEPT with no rules.
Checked $ netstat --listen -p and the app I am targeting is listening on 0.0.0.0:1024.
Checked Network Utility and ports 80 and 1024 are registered as open. Port 443 is not. Pinging did not work for any of them, with 100% packet loss.
Checked my instance is associated to the security group with all the permissions - it is. IP is clearly correct or I could neither ssh nor serve webpages... which I can.
I stopped and restarted the instance.
I replaced the instance.
I think this is due diligence before asking for help... now I need it!
I realised my configuration was correct: the problem was that the hosted domain I used for the GUI, like most hosted domains, does not open custom ports, so the TCP connection to port 1024 did not work.
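If you have shell access on the host that runs the PHP, this kind of restriction can be confirmed quickly by probing the port directly from there (IP and port taken from the question):
# connects and exits immediately on success; hangs or is refused if the host blocks outbound custom ports
nc -vz 54.68.166.28 1024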
We are using HTTP port 80 to run a SAP Portal responding to a URL.
We restarted the server and the operating system is now using port 80:
C:\Users>netstat -o -n -a | findstr 0.0.0.0:80
TCP 0.0.0.0:80 0.0.0.0:0 LISTENING 4
TCP 0.0.0.0:8081 0.0.0.0:0 LISTENING 1540
UDP 0.0.0.0:8082 *:* 1540
Process PID 4 is the operating system, and using the Process Explorer application we figured out that it is Http.sys that is now running on port 80.
We stopped and deactivated Http.sys, but it has dependencies, and one of them is the World Wide Web Publishing Service (IIS), which we need.
Can I bind Http.sys to another port so that the services that depend on it can run without problems?
Thanks
Sílvia
Http.sys does not open ports on its own; it does so at the request of an application. Http.sys can be accessed by any application.
Reconfigure the application. There is no way to configure Http.sys.
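To find out which application asked Http.sys to listen on port 80, the standard netsh http views can help (a sketch; run them in an elevated prompt):
rem lists Http.sys server sessions, request queues and registered URLs with the owning process IDs
netsh http show servicestate

rem lists URL reservations, i.e. which accounts may register which URL prefixes
netsh http show urlacl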
Generally, outgoing requests are bound to a random port, whereas the services on the server you make requests to are bound to a specific port.
Give your OS a second IP and bind http.sys to one IP and SAP Web Application Server to another.
netsh http add iplisten ipaddress=::1
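Expanding on that a little (the address below is a placeholder for your second IP): list the current IP listen list first, then add only the address Http.sys should keep, leaving the other one free for the SAP Web Application Server.
rem show which addresses Http.sys currently listens on (an empty list means all addresses)
netsh http show iplisten

rem restrict Http.sys to this address; the second IP stays free for SAP on port 80
netsh http add iplisten ipaddress=10.0.0.5
The HTTP service usually has to be restarted for a changed listen list to take effect.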
I have two Ubuntu instances in the EC2 and I want to cluster them.
One IP will be referred to as X (the "net addr" IP that ifconfig displays) and its public IP will be referred to as PX.
The other IP is Y and its public IP is PY.
So now I did the following on both machines.
installed the latest RabbitMQ.
installed the management plugin.
opened the ports 5672 (rabbit) and 15672 (management plugin).
connected to rabbit with my test app.
connected to the ui.
So now for the cluster.
I did the following commands
on X
rabbitmqctl cluster_status
got the node name, which was 'rabbit@ip-X' (where X is the inner IP)
on Y
rabbitmqctl stop_app
rabbitmqctl join_cluster --ram rabbit@ip-X
I got
"The nodes provided are either offline or not running"
Obviously this is the private IP, so the other instance can't connect.
How do I tell the second instance where the first is located?
EDIT
Firewall is completely off, I have a telnet connection from one remote to the other
(to ports 5672(rmq),15672 (ui), 4369 (cluster port)).
The cookie is the same on both servers (and the hash of the cookie in the logs is the same).
I recorded the TCP traffic while running the join_cluster command and watched it in Wireshark. I saw the following (no ACK):
http://i.imgur.com/PLezLvQ.png
so I disabled the firewall using
sudo ufw disable
(just for the tests) and I re-typed
sudo rabbitmqctl join_cluster --ram rabbit@ip-X
and the connection was created - but terminated by the remote rabbit
here :
http://i.imgur.com/dxJLNfH.png
and the message is still
"The nodes provided are either offline or not running"
(the remote rabbit app is definitely running)
You need to make sure the nodes can access each other. RabbitMQ uses distributed Erlang primitives for communication across the nodes, so you also have to open up a few ports in the firewall. See:
http://learnyousomeerlang.com/distribunomicon#firewalls
for details.
You should also use the same data center for your nodes in the cluster, since RabbitMQ can get really sad on network partitions. If your nodes are in different data centers, you should use the shovel or federation plugin instead of clustering for replication of data.
Edit: don't forget to use the same Erlang cookie on all nodes, see http://www.rabbitmq.com/clustering.html for details.
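To double-check the cookie point, comparing the file directly on both nodes works; the path below is the usual one for a package install on Ubuntu and may differ on other setups:
# run on both nodes; the hashes must be identical for clustering to work
sudo md5sum /var/lib/rabbitmq/.erlang.cookie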
The issue is probably TCP ports that need to be opened.
You should do the following:
1) Create a Security Group for the Rabbit Servers (both will use it)
we will call it: rabbit-sg
2) In the Security Group, Define the following ports:
All TCP TCP 0 - 65535 sg-xxxx (rabbit-sg)
SSH TCP 22 0.0.0.0/0
Custom TCP Rule TCP 4369 0.0.0.0/0
Custom TCP Rule TCP 5672 0.0.0.0/0
Custom TCP Rule TCP 15672 0.0.0.0/0
Custom TCP Rule TCP 25672 0.0.0.0/0
Custom TCP Rule TCP 35197 0.0.0.0/0
Custom TCP Rule TCP 55672 0.0.0.0/0
3) make sure both EC2 instances use this security group;
note that we opened all TCP traffic between the EC2 instances
4) make sure the rabbit cookie is the same and that you reboot the EC2 instance
after changing it on the slave EC2
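Once the security group and the cookie are sorted out, the join itself is the same sequence as in the question, plus starting the app again afterwards (node name taken from the question):
# on Y
sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster --ram rabbit@ip-X
sudo rabbitmqctl start_app
sudo rabbitmqctl cluster_status   # both nodes should now show up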
My Ubuntu Server 11.04 free-tier instance security group opens SSH, HTTP, HTTPS to the public web and nothing else (not even the inter-group TCP/UDP/ICMP ports enabled by the default sec group).
But when I Nmap my server's public dns, it shows HTTP & HTTPS closed, with ftp (21), rtsp (554), and realserver (7070) all open. This would, of course, explain why I can't view the website I'm running on that instance, so I need to fix it.
This is a cross-post from the AWS EC2 forum, but since I've got no replies yet, I'm hoping for better luck here.
my SecGroup (no other rules for UDP or ICMP):
TCP
Port (Service) Source Action
22 (SSH) 0.0.0.0/0 Delete
80 (HTTP) 0.0.0.0/0 Delete
443 (HTTPS) 0.0.0.0/0 Delete
Nmap:
kurtosis#kurtosis-laptop:~/bin/AWS$ nmap ec2-184-73-70-26.compute-1.amazonaws.com
Starting Nmap 5.00 ( http://nmap.org ) at 2011-06-14 23:27 PDT
Interesting ports on ec2-184-73-70-26.compute-1.amazonaws.com (184.73.70.26):
Not shown: 994 filtered ports
PORT STATE SERVICE
21/tcp open ftp
22/tcp open ssh
80/tcp closed http
443/tcp closed https
554/tcp open rtsp
7070/tcp open realserver
Nmap done: 1 IP address (1 host up) scanned in 8.52 seconds
Why are http and https closed when my security group specifies they should be open, and why are ftp, rtsp, and realserver open when my security group does not include them at all? Does anyone know the reason for the discrepancy?
Are you sure your instance is a member of the security group you're modifying? In the EC2 Console you can see this by clicking on the instance; it will list the security groups it is a member of as "sg-12345".
Alternatively it may be an issue with just that instance - try terminating that instance and starting a new one to see if the problem persists.
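If you prefer the command line over the console for that membership check, something along these lines works with the AWS CLI (which is much newer than this question; the instance id is a placeholder):
# prints the security groups actually attached to the instance
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups'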