Alerts for apps failing Marathon healthchecks - mesos

I've been configuring HTTP healthchecks for all my apps in Marathon, which are working nicely. The trouble is that Marathon will keep stepping in and restarting a container that is failing its healthcheck, and I won't know unless I happen to be looking at the Marathon UI.
Is there a way to retrieve all apps that have a failed healthcheck so I can send an email alert or similar?

Marathon exposes information about failing healthchecks on its event bus, so you can write a simple service that consumes Marathon's health check events ("eventType": "instance_health_changed_event") and translates them into a metric, an alert, you name it.
For reference, I can recommend allegro/appcop. It is a service that scales down unhealthy applications; its code could easily be altered to do what you want.
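The alerting decision itself can be a tiny function over each event from the bus. A minimal sketch: the "instance_health_changed_event" type comes from Marathon, but the exact payload fields used here ("runSpecId", "instanceId", "healthy") are assumptions you should check against your Marathon version's event documentation.

```python
import json

def alert_for_event(raw_event: str):
    """Return an alert message for an unhealthy-instance event, else None.

    Field names other than eventType are assumptions about the payload.
    """
    event = json.loads(raw_event)
    if event.get("eventType") != "instance_health_changed_event":
        return None                      # not a health event; ignore
    if event.get("healthy", True):
        return None                      # instance became healthy; no alert
    return "App {} instance {} failed its healthcheck".format(
        event.get("runSpecId", "?"), event.get("instanceId", "?"))

# Example payload as it might arrive from Marathon's event stream:
sample = json.dumps({
    "eventType": "instance_health_changed_event",
    "runSpecId": "/my-app",
    "instanceId": "my-app.abc123",
    "healthy": False,
})
print(alert_for_event(sample))
```

From there, the consuming service just needs to feed events from the bus through this function and hand any non-None result to your mailer or alerting system.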

Related

Resilience4j Circuit Breaker behaviour in Distributed system

I have integrated the Resilience4j circuit breaker in one of my Spring Boot applications, which has multiple pods on K8s.
There are a couple of things I need to know:
1. How do I track the circuit breaker status from the actuator on each pod? Is there a way I can build a utility/dashboard for this? Locally I get the health via the URL below.
http://localhost:9090/actuator/health
2. There is an API that will disable the circuit breaker, but the circuit breaker is activated on each pod individually. How should I divert my call to a particular pod if I need to disable it on that pod via an API?
3. If I need to disable it across all pods, what should be the strategy?
Circuit Breaker Library - https://resilience4j.readme.io/docs/getting-started-3
The circuit breaker is not responsible for reporting actuator health information, because R4J does not affect the health of the current microservice/pod; it handles problems with other pods, i.e. when you cannot reach another endpoint. Its main task is to prevent you from repeatedly receiving the same errors from another service. For example, should you request a 404 endpoint again? An exception is generated, which you can handle, e.g. by redirecting the call elsewhere. In a K8s environment you have to retry the request, and the K8s Service (if you're lucky) will route it to a working pod replica.
If all replicas are down, then you have another problem. :) One that has nothing to do with R4J.
You can obtain the R4J status via metrics. Have a look at this: https://resilience4j.readme.io/docs/micrometer
"If I need to disable it across all pods, what should be the strategy?" For example, set an environment flag on the Deployment and add an "if" in the code that skips the circuit breaker block. :)
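The environment-flag kill switch can be sketched language-agnostically. In a real Spring/Resilience4j app this flag would be a config property guarding the circuit-breaker-protected call; the flag name CB_DISABLED and the function names here are purely illustrative.

```python
import os

def call_with_breaker(protected_call, fallback):
    """Run protected_call, falling back on failure, unless the kill switch is set."""
    if os.environ.get("CB_DISABLED") == "true":
        return protected_call()      # kill switch: bypass breaker and fallback
    try:
        return protected_call()      # real circuit breaker logic would wrap this
    except Exception:
        return fallback()            # breaker open / downstream call failed

os.environ["CB_DISABLED"] = "false"
print(call_with_breaker(lambda: "remote result", lambda: "fallback"))  # remote result
```

Setting the flag on the Deployment and rolling it out restarts every pod with the breaker disabled, which covers the "across all pods" case without having to target individual pods.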

How to build Spring-based microservices with state syncing after a node failure / service crash

I have a few microservices; they accept data from customers and process the requests asynchronously, and the customer can come back later and check the status. To make my platform more robust, I am planning to provide HA by running the same set of services (at least 2 of each), registering them with Eureka, and putting all of them behind a load balancer.
Now I am stuck on providing a solution for when a node fails or a service goes down after accepting a request.
Let's say I have Service-A1 and Service-A2, both with the same capability.
Service-A1 accepts a request, returns an "accepted" response to the customer, starts processing the job, and updates its intermediate results in the DB; then, due to a node failure or service crash, it cannot complete the job.
In this case I want the other service to auto-detect this (get notified) and continue: it can read the job status and carry the request through to completion.
Is there any feature in Spring Eureka or ZooKeeper to watch for this and notify the others to continue?
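ZooKeeper can signal a crashed peer via ephemeral nodes and watches, but whatever the notification mechanism, the takeover itself is often a stale-heartbeat reclaim against the job table the services already write to: owners heartbeat their running jobs, and any peer may claim jobs whose heartbeat has gone stale. A minimal sketch with SQLite; the schema, threshold, and names are illustrative, not a Eureka/ZooKeeper feature.

```python
import sqlite3
import time

STALE_AFTER = 30  # seconds without a heartbeat before a job is reclaimable

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, owner TEXT, heartbeat REAL)")

def claim_stale_jobs(db, me, now):
    """Take ownership of running jobs whose owner stopped heartbeating."""
    cur = db.execute(
        "UPDATE jobs SET owner = ?, heartbeat = ? "
        "WHERE status = 'RUNNING' AND heartbeat < ?",
        (me, now, now - STALE_AFTER))
    return cur.rowcount  # number of jobs this worker took over

# Service-A1 accepted a job, then crashed 60s ago (no heartbeats since):
now = time.time()
db.execute("INSERT INTO jobs VALUES ('job-1', 'RUNNING', 'service-A1', ?)", (now - 60,))
print(claim_stale_jobs(db, "service-A2", now))  # 1: A2 resumes job-1
```

After the claim, Service-A2 reads the intermediate results already in the DB and continues the job from there; the same pattern can be implemented with ZooKeeper ephemeral nodes standing in for the heartbeat column.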

How can you tell when a Kafka Streams app is in the "running" state?

Given a recently started Kafka Streams app, how can one reliably determine that it has reached the "RUNNING" state? This is in the context of a test program that launches one or more streams apps and needs to wait until they are running before submitting test messages.
I know about the .setStateListener method, but I'm wondering if there is a way of detecting this state from outside the app process. I thought it might be exposed as a JMX metric, but I couldn't find one in VisualVM.
The state listener method is the way to go. There is no other out-of-the-box way to achieve what you want.
That said, you can do the following for example:
1. Expose a simple "health check" (or "running yes/no check") in your Kafka Streams application, e.g. via a REST endpoint (use whatever REST tooling you are familiar with).
2. The health check can be based on Kafka Streams' built-in state listener, which you already know about.
3. Your test program can then remotely query the health check endpoints of your various Kafka Streams applications to determine when all of them are up and running.
Of course, you can use other ways to communicate readiness of a Kafka Streams application. The REST endpoint idea in (1) is just one example.
You can also let the Kafka Streams application write its readiness status into a Kafka topic, and your test program will subscribe to that topic to determine when all apps are ready.
Another option would be to provide a custom JMX metric in your Kafka Streams apps that your test program can then access.
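The test-program side of the REST-endpoint idea can be sketched as a polling loop. The URLs and the plain-text "RUNNING" response body are assumptions about how you choose to build the health endpoint; `fetch` is injected so the sketch stays network-free (in real use it would be an HTTP GET, e.g. via urllib).

```python
import time

def wait_until_running(urls, fetch, timeout=60.0, interval=1.0):
    """Poll each app's health endpoint until all report RUNNING or we time out."""
    deadline = time.monotonic() + timeout
    pending = set(urls)
    while pending and time.monotonic() < deadline:
        for url in list(pending):
            try:
                if fetch(url).strip() == "RUNNING":
                    pending.discard(url)     # this app is ready
            except OSError:
                pass                         # app not up yet; keep retrying
        if pending:
            time.sleep(interval)
    return not pending  # True once every app reported RUNNING
```

Once this returns True, the test program can start submitting test messages knowing every Streams instance has left the rebalancing phase.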

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is: if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" is saved in a temporary database, and after the service is back up, the data is sent again.
Question:
1. Should I create a service that pings service "A" every 10 seconds to know whether it works or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this works: after your process reads from the queue and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
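The two steps above can be sketched as a loop. `receive`/`commit`/`rollback` are illustrative names for whatever your transactional queue client (JMS, SQS with a visibility timeout, etc.) actually provides, and `send_to_service_a` stands in for the REST call that may fail.

```python
import time

def drain(queue, send_to_service_a, retry_delay=5.0):
    """Deliver queued messages, removing each one only after a successful send."""
    while True:
        msg = queue.receive()            # message is read but not yet deleted
        if msg is None:
            break                        # queue is empty
        try:
            send_to_service_a(msg)       # the REST call that may fail
            queue.commit(msg)            # success: remove the message for good
        except Exception:
            queue.rollback(msg)          # failure: message becomes visible again
            time.sleep(retry_delay)      # back off before the next attempt
```

The key property is that a crash between `receive` and `commit` leaves the message in the queue, so no request to service "A" is ever lost.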
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when a service call fails or the service is inaccessible.
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing. Meaning: you deploy a cluster of containers, all containing your Service A, and put a load balancer or API gateway in front of the service to distribute calls/load. It will also periodically check the health of your service; when it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes, Docker Swarm, etc.
Now this does not protect you from losing requests. In the event that a container malfunctions, there will still be a short time between the failure and the next health check in which requests may not be served. In many applications this is acceptable, and the client side will simply re-request and hit another (healthy) container. If your application requires that absolutely no requests are lost, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase resiliency even further. As a side note: using an API gateway can also solve issues with central authentication/authorization, routing, and monitoring.
Another approach to adding resilience / decoupling is to use a fast streaming / message queue, such as Apache Kafka, to record all incoming messages, and have a message processor work through them whenever it is ready. The trick then is to mark messages as processed only when the request has been served fully. This also helps in scenarios where faults occur because the service receives more requests than it can handle in real time (asynchronous decoupling with a cache).
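The "mark as processed only when served fully" trick maps onto Kafka-style manual offset commits (auto-commit disabled, commit after the handler succeeds). A sketch of the processing side; `consumer` stands in for a real Kafka consumer, and any stub with the same poll()/commit() shape works, so the sketch stays broker-free.

```python
def process_batch(consumer, handle):
    """Handle each polled record, committing only after it was fully served."""
    for record in consumer.poll():
        handle(record)       # if this raises, the offset below is never
        consumer.commit()    # committed, so the record is redelivered later
```

A crash mid-handler therefore leaves the unfinished record uncommitted, and the next (or a replacement) processor picks it up again, at the cost of possible duplicate processing, so handlers should be idempotent.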
Service "A" should fire a "ready" event when it becomes available. Just listen for that and resend your request.

mesos marathon not sending http callbacks

I have successfully created and run tasks on Mesos using Marathon. However, Marathon is supposed to support HTTP callbacks when you start it with
--event_subscriber http_callback --http_endpoints http://myip:3000/endpoints
However, it does not seem to actually send any callbacks to my service. Is there anything else required in order to use the callbacks?
My issue stemmed from the fact that I had multiple instances of Marathon running. The first instance, which was considered the master, was not configured to use callbacks; the second, which was considered a slave, was.
As the documentation states, all requests to a slave are forwarded to the master.
