Wait until Kafka state store is queryable to start the service - spring-boot

Based on the suggestion from Confluent here, we have to wait for the store to become queryable before we query it, but in the meantime we are already receiving requests (HTTP, for example) even though the application cannot service them.
Wouldn't it be preferable to have this waiting time occurring during boot time?
Is there a way to wait until the state is "RUNNING" before the service is up and running?
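One way this could be done (a sketch, not something stated in the question) is to expose the Kafka Streams state through a Spring Boot Actuator health indicator, so the readiness probe only passes once the instance reaches RUNNING and the store is queryable. The bean wiring and names here are illustrative:

    // Sketch: report the service as healthy only once Kafka Streams is RUNNING.
    // Assumes a KafkaStreams instance is exposed as a Spring bean.
    import org.apache.kafka.streams.KafkaStreams;
    import org.springframework.boot.actuate.health.Health;
    import org.springframework.boot.actuate.health.HealthIndicator;
    import org.springframework.stereotype.Component;

    @Component
    public class StreamsReadyIndicator implements HealthIndicator {

        private final KafkaStreams streams;

        public StreamsReadyIndicator(KafkaStreams streams) {
            this.streams = streams;
        }

        @Override
        public Health health() {
            // State stores only become queryable once the instance reaches RUNNING.
            return streams.state() == KafkaStreams.State.RUNNING
                    ? Health.up().build()
                    : Health.down().withDetail("state", streams.state().toString()).build();
        }
    }

With the actuator health endpoint wired into the Kubernetes readiness probe (or the load balancer's health check), traffic would only be routed to the instance after the store is queryable.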

Related

Vert.x web and micro services - Health check being starved

We use a heavily customized framework built on Vert.x for our k8s micro services. This framework does a lot of the heavy lifting for teams, such as setting up all the endpoints and creating the health check endpoint, among other things.
One issue we see is that some of the micro services will start to "starve out" the health check when they get overloaded. So, as an app takes on heavy traffic, the health check, which runs every 20s, will time out. For these micro services I have checked that the teams are properly marking endpoints as blocking when they make blocking calls, such as DB reads/writes or downstream API calls.
The health check endpoint, which consists only of quick checks, does not. My understanding is that blocking handlers get pushed off to the worker queue while non-blocking handlers stay on the event loop, so my theory is that under heavy strain the event loop fills up, and by the time it gets to the queued health check it is already past the timeout. I say this because we see the timeout on the Kubernetes side, but our total processing time on the health check, which starts once the handler is called, is quick.
I attempted to alleviate this by pushing the health check into its own verticle, not quite understanding that you can't have multiple verticles on the same port (that was a misunderstanding on my part in reading the documentation).
So, my question is: What is the correct way to prioritize the health check? Is there a way to push these health checks to the front of the queue, or should we be looking more to some sort of "tuning"?
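One illustrative approach (a sketch, assuming the framework allows it; nothing here comes from the original question) is to serve the health check from its own HTTP server on a dedicated port, deployed as a separate verticle, so it has its own event loop and never queues behind application traffic:

    // Sketch: a dedicated health-check verticle listening on its own port (8081 here),
    // separate from the application verticles, so heavy traffic cannot starve it.
    import io.vertx.core.AbstractVerticle;
    import io.vertx.core.Vertx;

    public class HealthVerticle extends AbstractVerticle {

        @Override
        public void start() {
            vertx.createHttpServer()
                    .requestHandler(req -> req.response()
                            .putHeader("content-type", "text/plain")
                            .end("OK"))
                    .listen(8081); // dedicated health port, distinct from the app port
        }

        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();
            vertx.deployVerticle(new HealthVerticle());
            // the application verticles with the real endpoints are deployed separately
        }
    }

The Kubernetes probe would then point at the dedicated port, so it can be answered even while the main application port is saturated.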

Process a stream of sessions on aws

Is there a way to implement something like Flink's session window on AWS with Lambda and some way of managing messages?
We have a stream of small events with a session id. We cannot guarantee the order of the arriving events and we don't always have a session-finished event. We know that session ids are unique. We also know that when a session is finished it won't be restarted. We also know that when the session is active we will receive a message every minute or so. We need to process the entire session as a whole.
We want to wait for a silent time of X minutes, and if no messages arrive we will process the entire session as a whole.
This is exactly what Flink's session window does; is there a way to do the same thing purely using AWS Lambda and its triggers?
There can be tens of millions of sessions active at the same time.
It's not possible with an AWS Lambda.
Lambdas are stateless: they can process messages one by one, but they cannot maintain state across a sequence of messages, which is what the kind of windowing logic you describe requires.
Maybe an option for you would be Kinesis Data Analytics? Under the hood it is actually Flink, provided as a managed service by AWS, so you may get the "Lambda-like" experience you're looking for there.
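For reference, the Flink construct the question is after is a keyed session window with an inactivity gap, which is also what you would run on Kinesis Data Analytics for Apache Flink. A minimal sketch (the source, gap length and element types are illustrative placeholders):

    // Sketch: group events by session id and emit the whole session after a period of silence.
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SessionJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Illustrative source of (sessionId, payload) pairs; in practice this would be Kinesis/Kafka.
            DataStream<Tuple2<String, String>> events = env.fromElements(
                    Tuple2.of("session-1", "event-a"),
                    Tuple2.of("session-1", "event-b"));

            events
                .keyBy(value -> value.f0)                                       // key by session id
                .window(ProcessingTimeSessionWindows.withGap(Time.minutes(5)))  // close after 5 min of silence
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + "," + b.f1))           // fold the whole session together
                .print();

            env.execute("session-window-sketch");
        }
    }

Each session's events are only folded together once the gap (five minutes here, standing in for "X minutes") elapses with no new events for that session id.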

Service synchronization issue

I've created two services.
One of them (the scheduler) only sends requests to the other (the backoffice) to perform some "large" operations.
When the backoffice receives a request:
it first creates a mark (a key in Redis) to record that the process has started.
Each time a request arrives:
the backoffice checks whether the mark exists.
If it exists, it means the previous process has not yet finished, so the request is skipped.
Otherwise, it performs the large process.
When the process is finished, the key in Redis is removed.
It would be something like this:
if (key exists)
    return;
create key;
make long process... (1)
remove key;
The problem arises when the service is destroyed before the process has finished, so the mark in Redis is never removed. That means the process will never run again.
Is there any way to solve this kind of problem?
The way to solve this problem is to use an existing engine, as building a custom, scalable and robust solution for reliable service orchestration is really hard.
I recommend looking at Uber Cadence Workflow, which would allow you to convert your pseudocode into a real production application with minor changes.
You can fire a background job that updates a timestamp under the key, e.g. every minute.
When the service attempts to start the process, it must verify the key's existence (as it does now) plus the timestamp stored under it. If the timestamp is more than a minute old, the previous attempt is stale and you can start over.
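A minimal sketch of that heartbeat idea with Jedis (the key name, connection details and the one-minute staleness threshold are illustrative; a real implementation would also want an atomic take-over, e.g. SET NX):

    // Sketch: hold a Redis "mark" whose timestamp is refreshed while the process runs,
    // so a crashed run goes stale instead of blocking future runs forever.
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ProcessMark {
        private static final String KEY = "backoffice:large-process"; // illustrative key name
        private static final long STALE_MS = 60_000;                  // mark is stale after 1 minute

        private final JedisPool pool = new JedisPool("localhost", 6379);
        private final ScheduledExecutorService heartbeat = Executors.newSingleThreadScheduledExecutor();

        /** Returns true if the mark was free (or stale) and the large process may start. */
        public boolean tryStart() {
            try (Jedis redis = pool.getResource()) {
                String value = redis.get(KEY);
                if (value != null && System.currentTimeMillis() - Long.parseLong(value) < STALE_MS) {
                    return false; // a live process still holds the mark
                }
                redis.set(KEY, String.valueOf(System.currentTimeMillis()));
            }
            // refresh the timestamp periodically while the process runs
            heartbeat.scheduleAtFixedRate(() -> {
                try (Jedis redis = pool.getResource()) {
                    redis.set(KEY, String.valueOf(System.currentTimeMillis()));
                }
            }, 30, 30, TimeUnit.SECONDS);
            return true;
        }

        public void finish() {
            heartbeat.shutdownNow();
            try (Jedis redis = pool.getResource()) {
                redis.del(KEY);
            }
        }
    }

The backoffice would call tryStart() before launching the large process and finish() when it completes; if the service dies mid-process the timestamp stops being refreshed and the next attempt can take over once the staleness window has passed.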
Sounds like you should be using a messaging queue to schedule tasks for the backoffice service. Queuing solutions like RabbitMQ allow you to manually acknowledge (or "ack") that the process is complete. Whenever a subscriber crashes, the queue detects that the connection dropped without acknowledgement and re-enqueues the same task, which will be picked up by the next available subscriber. Here's another thread discussing this problem, specifically in the context of messaging queues:
What happens to fetched messages when RabbitMQ consumer crashes?
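A minimal sketch of that manual-acknowledgement pattern with the RabbitMQ Java client (the queue name and the work itself are illustrative):

    // Sketch: ack only after the large process finishes, so a crash causes redelivery.
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.DeliverCallback;

    public class BackofficeWorker {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();
            channel.queueDeclare("backoffice-tasks", true, false, false, null); // illustrative queue name

            DeliverCallback onMessage = (consumerTag, delivery) -> {
                runLargeProcess(new String(delivery.getBody(), "UTF-8"));
                // ack only once the work is done; if the worker dies first,
                // RabbitMQ re-queues the message for another consumer
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            };

            // autoAck = false so unacknowledged messages survive a consumer crash
            channel.basicConsume("backoffice-tasks", false, onMessage, consumerTag -> { });
        }

        private static void runLargeProcess(String task) {
            // the "large" operation goes here
        }
    }

Because the acknowledgement happens only after the work completes, the broker takes over the role the Redis mark was playing.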

How to manage microservice failure?

Let's say I have several microservices (REST APIs). The problem is, if one service is not accessible (let's call it service "A"), the data that was being sent to service "A" is saved in a temporary database, and once the service is working again, the data is sent again.
Question:
1. Should I create a service that pings service "A" every 10 seconds to know whether it works or not? Or is it possible to do this with a task queue? Any suggestions?
Polling is a waste of bandwidth. You want to use a transactional queue.
Throw all your outbound messages in the queue, and have some other process to handle the messages.
How this will work is: after your process reads from the queue and tries to send to the REST service:
If it works, commit the transaction (for the queue)
If it doesn't work, don't commit. Start a delay (minutes, seconds - you know best) until you read from the queue again.
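A minimal sketch of that commit-or-retry loop with a JMS transacted session (the queue name, back-off delay and the REST call are illustrative placeholders):

    // Sketch: a transacted consumer that only removes a message from the queue
    // once the call to service "A" succeeds; failures roll back and retry later.
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    public class OutboundDispatcher {

        private final ConnectionFactory factory; // e.g. an ActiveMQ connection factory

        public OutboundDispatcher(ConnectionFactory factory) {
            this.factory = factory;
        }

        public void run() throws Exception {
            Connection connection = factory.createConnection();
            connection.start();
            // transacted session: nothing leaves the queue until commit()
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("outbound")); // illustrative

            while (true) {
                Message message = consumer.receive();
                try {
                    callServiceA(((TextMessage) message).getText()); // the REST call to service "A"
                    session.commit();        // success: remove the message from the queue
                } catch (Exception e) {
                    session.rollback();      // failure: message stays on the queue
                    Thread.sleep(30_000);    // back off before reading again
                }
            }
        }

        private void callServiceA(String payload) { /* HTTP call goes here */ }
    }

Because the session is transacted, the message only leaves the queue on commit(); a rollback (or a crash) makes it available again on the next read.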
You can use the Circuit Breaker pattern, e.g. the Hystrix circuit breaker from Netflix.
It is possible to open the circuit breaker based on a timeout, or when the service call fails or the service is inaccessible.
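A minimal Hystrix sketch (the command name and what the fallback does are illustrative):

    // Sketch: wrap the call to service "A" in a HystrixCommand with a fallback.
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class ServiceACommand extends HystrixCommand<String> {

        public ServiceACommand() {
            super(HystrixCommandGroupKey.Factory.asKey("ServiceA"));
        }

        @Override
        protected String run() {
            // the actual call to service "A"; a timeout, exception or unreachable
            // service counts as a failure and can open the circuit
            return callServiceAOverHttp();
        }

        @Override
        protected String getFallback() {
            // invoked when run() fails or the circuit is open:
            // e.g. persist the payload to the temporary store for a later retry
            return "queued-for-retry";
        }

        private String callServiceAOverHttp() {
            throw new RuntimeException("service A unreachable"); // placeholder for the HTTP call
        }
    }

Calling new ServiceACommand().execute() runs the call through the circuit breaker; after enough failures Hystrix opens the circuit and routes calls straight to getFallback() until service "A" recovers.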
There are multiple dimensions to your question. First, you want to consider using an infrastructure that provides resilience and self-healing. Meaning you want to deploy a cluster of containers, all containing your service A. You then use a load balancer or API gateway in front of your service to distribute calls/load. It will also periodically check the health of your service. When it detects that a container does not respond correctly, it can kill the container and start another one. This can be provided by a container infrastructure such as Kubernetes, Docker Swarm, etc.
Now this does not protect you from losing any requests. In the event that a container malfunctions, there will still be a short window between the failure and the next health check where requests may not be served. In many applications this is acceptable and the client side will simply re-request and hit another (healthy) container. If your application requires absolutely no lost requests, you will have to cache the request in, for example, an API gateway and make sure it is kept until a service has completed it (also called Circuit Breaker). An example technology would be Netflix Zuul with Hystrix. Using such a gatekeeper with built-in fault tolerance can increase the resiliency even further. As a side note, using an API gateway can also solve issues with central authentication/authorization, routing and monitoring.
Another approach to add resilience / decoupling is to use a fast streaming / message queue, such as Apache Kafka, to record all incoming messages and have a message processor work through them whenever it is ready. The trick then is to only mark a message as processed once the request has been served fully. This can also help in scenarios where faults occur because the service cannot handle the volume of requests in real time (asynchronous decoupling with a cache).
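A minimal sketch of that "only mark it processed once the request is fully served" idea, using a Kafka consumer with auto-commit disabled (topic, group id and broker address are illustrative):

    // Sketch: commit offsets manually, and only after the requests have been handled.
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class RequestProcessor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "request-processor");   // illustrative group id
            props.put("enable.auto.commit", "false");     // do not mark messages as done automatically
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("incoming-requests"));   // illustrative topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        handleRequest(record.value());              // serve the request fully...
                    }
                    consumer.commitSync();                          // ...and only then mark the batch processed
                }
            }
        }

        private static void handleRequest(String payload) { /* actual processing */ }
    }

If the processor crashes before commitSync(), the offsets are not advanced and the unprocessed records are re-read on restart, so requests are not silently lost.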
Service "A" should fire a "ready" event when it becomes available. Just listen to that and resend your request.

How to suspend Mass Transit processing messages from the queue

I have a Mass Transit Service Bus that is listening to several queues and processing the messages. I would like to somehow pause the processing of new requests and wait for the current requests to complete so that I can run some housekeeping tasks.
A few of my own thoughts:
I have investigated the service bus BeforeConsumingMessage handler, and although this would allow me to check for a 'Pause processing' flag in my database, I am unsure how I would then actually pause the processing!
We are using RabbitMQ - could I use this to put the queues in a suspend state?
I have found so little on this subject that I wonder if it is an 'anti-pattern' and I should just stop my Mass Transit services when I want to run some housekeeping jobs, trusting that any partially complete sagas will be picked up when the service bus starts back up. (I'd rather not go for this option, though.)
So my question is: Is there a way to instruct the service bus to finish processing the current sagas but not take any more messages from the queue?
Cleanly shutting down a MassTransit service will wait for any in-process messages to finish completely, which is why it sometimes takes a little while for a service to shut down. Shutting down the service is the best way to handle this; you can be sure MT is not pulling any new messages.
If your sagas are serialized to a backing store, e.g. NHibernate, then the state will be saved until the service is restarted and the sagas will pick up in the state they were left after the last message was processed. You should be in good shape. We do this all the time for any maintenance periods.
If you REALLY must leave the service running, call Dispose on the IServiceBus instance. This will do the same thing, letting the current consumers finish and then releasing all your resources. Once you have done the maintenance, you can create a new IServiceBus instance as needed.

Resources