I have a scenario where two or more instances of the same verticle will be instantiated. I want to make sure that only one of the instances is consuming the key 'keyx' and in order to do that I check on a Service Discovery instance if a certain type of record is there and, if not, I can safely say that no one is consuming 'keyx'.
Therefore, I publish a record on the Service Discovery instance and I subscribe to 'keyx'. All the other instances will now check with the Service Discovery that some instance is already registered for 'keyx'.
If the machine running the verticle instance has a serious problem, the verticle gets killed and the record stays on the Service Discovery. Removing the record in the stop() method would not work in this case, because that method is never called. All the other instances will then believe that an instance is still consuming 'keyx' when this may no longer be the case.
Does anyone know a viable solution for this problem?
Thanks ;)
One way is to use a datastore that has automatic expiry for keys. Then clients must periodically re-add the key to the store to continue using it, and if they fail then the key is automatically removed. Redis offers this kind of feature (https://redis.io/commands/expire).
Alternatively, if you don't have this feature, you can simply store a timestamp when you set the key. If another client reads the key and finds that the timestamp is older than the agreed expiry period, it can safely take over the key.
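The expiring-key idea from both paragraphs above can be sketched as a small lease. This is only an illustration under stated assumptions: `LeaseStore`, `try_acquire` and the instance names are hypothetical, and the dictionary stands in for a shared datastore. In a real shared store the check and the write would have to happen atomically (Redis, for example, can do this with `SET` and its `NX` and `EX` options), and the expiry should be judged by the store's clock rather than each client's.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared store (hypothetical, for illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl_seconds):
        """Claim or renew `key` for `owner`; return True if `owner` holds it afterwards."""
        now = self._clock()
        current = self._leases.get(key)
        if current is not None:
            holder, expires_at = current
            # Another owner still holds a live lease: refuse.
            if holder != owner and expires_at > now:
                return False
        # Free, expired, or already ours: write the lease with a fresh deadline.
        self._leases[key] = (owner, now + ttl_seconds)
        return True


# Deterministic demo with a fake clock instead of real waiting.
fake_now = [0.0]
store = LeaseStore(clock=lambda: fake_now[0])

first = store.try_acquire("keyx", "instance-1", ttl_seconds=10)   # instance-1 claims the key
second = store.try_acquire("keyx", "instance-2", ttl_seconds=10)  # refused, lease is live
fake_now[0] = 11.0                                                # instance-1 died, never renewed
takeover = store.try_acquire("keyx", "instance-2", ttl_seconds=10)
```

The holder is expected to call `try_acquire` again well before the TTL runs out. If it crashes, it simply stops renewing and the key frees itself.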
I have a service which is running on many Cloud Run containers.
When a single container (A) receives a web request to do some work, I need all the other live containers to fetch some updated data from Elasticsearch.
I would have expected ES to offer a "listening" type of connection, as Firebase does, but this does not seem to be possible.
Right now I have to poll the database from each service.
Is there a better way to achieve this sort of cross-container sync when using Cloud Run? Would Pub/Sub be the best solution here?
It's unusual but not impossible to achieve.
First of all, you have to understand the instance life cycle: the CPU is allocated only while a request is being processed. Otherwise, the CPU is throttled (below 5%). That is also why you pay only while your instance is processing requests, and not while it is kept warm (and offloaded after a while).
That being said, it's useless and inefficient to try to update instances in the background when no request is being processed.
Therefore, the idea is to do the work when the instance receives a request. The downside is that this increases the request latency (the instance first syncs its cache and then processes the request).
So the solution is to store the date of the latest cache update somewhere central, and to keep the same information in each instance. When an instance receives a request, the first thing it does is compare its own cache date with the central data date.
If it's the same, no problem, continue the processing.
If the central data date is after the current instance date, update the instance data, and then process the request.
You can store the data, and the date of that data in Firestore for instance, or in MemoryStore, or in any other databases.
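The compare-then-refresh steps above could look roughly like this. It is a minimal sketch, not a Cloud Run or Firestore API: `CentralStore`, `Instance`, `handle_request` and the integer stamps are hypothetical names, and the in-memory object stands in for whichever database holds the data and its date.

```python
class CentralStore:
    """Stand-in for Firestore, Memorystore, etc. (hypothetical): data plus its update stamp."""

    def __init__(self):
        self.updated_at = 0
        self.data = {}

    def publish(self, data, updated_at):
        self.data = dict(data)
        self.updated_at = updated_at


class Instance:
    """One container instance holding a local copy of the data."""

    def __init__(self, central):
        self._central = central
        self._cache = {}
        self._cache_stamp = -1  # older than any real stamp, so the first request loads data
        self.refresh_count = 0

    def handle_request(self, key):
        # Compare the local stamp with the central one before doing any work.
        central_stamp = self._central.updated_at
        if central_stamp > self._cache_stamp:
            # Central data is newer: refresh the local copy first.
            self._cache = dict(self._central.data)
            self._cache_stamp = central_stamp
            self.refresh_count += 1
        # Then process the request from the (now current) local cache.
        return self._cache.get(key)


central = CentralStore()
central.publish({"price": 1}, updated_at=1)
instance = Instance(central)

a = instance.handle_request("price")         # first request loads the data
b = instance.handle_request("price")         # same stamp, no refresh
central.publish({"price": 2}, updated_at=2)  # container A wrote an update
c = instance.handle_request("price")         # newer stamp, refresh then serve
```

Only the small stamp is read on every request; the full data is fetched when the stamp has moved.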
Pub/Sub can also be a solution, but it is more complex to implement. Each instance, when it starts, has to create a pull subscription on a topic. When the instance is killed, you have to delete that subscription.
Then, when a request comes in, your instance has to pull from the subscription, get the messages, if any, and update its local cache.
This could be faster than the previous solution, but it is harder to implement.
I am trying to dynamically add and remove listeners at run time using the KafkaListenerEndpointRegistry. This class provides an option to register a new listener at run time, but has no option to destroy or unregister an already running listener. There is a stop API to stop a particular container, but we need to destroy the container with a particular id and re-register it with the same id but a different set of topics.
Could anyone please let us know if there is a feasible way to do this?
The registry has no API to completely remove containers.
Don't use the endpoint registry to create these containers, just use the container factory yourself (see the code in the registry), and keep track of them; then you can stop() and destroy() them as needed.
I have multiple instance of a service registered with eureka; when using FeignClient I am able to successfully contact those instances using the service name of the registered application.
But there is a "problem": if I shut down one of the instances (I have also verified that the instance has gone down correctly and it has immediately been unregistered) and then make some requests to the "gateway" that then calls the services via Feign, the load-balancer still tries for some time to contact the instance that is off, resulting in timeouts and obviously failure of the request.
How is it possible to avoid this behaviour? Is there any way to force an update of the list of online instances so as to avoid the request timeouts?
I have also tried to manually get all the online instances from the discovery-client at runtime during the application execution and the online instances list is correct (the discovery server notifies correctly about every shutdown/start of the instances almost immediately).
Why is it that the FeignClient doesn't get "updated" and still calls the dead instances even though the in-app discovery-client instance list has been updated?
Here you can find an example of the configuration I'm trying to use.
https://github.com/fearlessfara/feign-test
Problem:
Suppose there are two services A and B. Service A makes an API call to service B.
After a while, service A goes down or the call is lost due to network errors.
How can other services detect that an outbound call from service A was lost or never happened? I need some other concurrent app that will automatically react (run emergency code) if service A's outbound call is lost.
What state-of-the-art solutions exist?
My thoughts, for example:
service A registers a call event in some middleware (event info, "running" status, timestamp, etc).
If this call is not completed after N seconds, some "call timeout" event in the middleware automatically starts the emergency code.
If the call is completed at the proper time service A marks the call status as "completed" in the same middleware and the emergency code will not be run.
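The three steps above amount to a deadline watchdog. A minimal sketch, assuming hypothetical names (`CallWatchdog`, `register`, `complete`, `sweep`) and an in-memory table where a real setup would use shared middleware: for the idea to work, the watchdog has to run outside service A, so that it survives when A goes down.

```python
import time


class CallWatchdog:
    """Tracks running calls and fires a handler for calls not completed in time (hypothetical)."""

    def __init__(self, timeout_seconds, on_timeout, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._on_timeout = on_timeout
        self._clock = clock
        self._running = {}  # call_id -> start time

    def register(self, call_id):
        # Service A announces the call before making it.
        self._running[call_id] = self._clock()

    def complete(self, call_id):
        # Service A marks the call as completed; no emergency code will run for it.
        self._running.pop(call_id, None)

    def sweep(self):
        # Run periodically: fire the handler once for every overdue call.
        now = self._clock()
        expired = [cid for cid, started in self._running.items()
                   if now - started >= self._timeout]
        for cid in expired:
            del self._running[cid]
            self._on_timeout(cid)
        return expired


fake_now = [0.0]
emergencies = []
watchdog = CallWatchdog(timeout_seconds=5, on_timeout=emergencies.append,
                        clock=lambda: fake_now[0])

watchdog.register("call-1")
watchdog.register("call-2")
fake_now[0] = 3.0
watchdog.complete("call-1")      # finished in time
fake_now[0] = 6.0
expired = watchdog.sweep()       # call-2 was never completed
expired_again = watchdog.sweep() # already handled, nothing left
```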
P.S. I'm on Java stack.
Thanks!
I recommend looking into patterns such as Retry, Timeout, Circuit Breaker, Fallback and Healthcheck. You can also look into the Bulkhead pattern if concurrent calls and fault isolation are your concern.
There are many resources where these well-known patterns are explained, for instance:
https://www.infoworld.com/article/3310946/how-to-build-resilient-microservices.html
https://blog.codecentric.de/en/2019/06/resilience-design-patterns-retry-fallback-timeout-circuit-breaker/
I don't know which technology stack you are on, but usually there is already some functionality for these concerns that you can incorporate into your solution. There are libraries that take care of this resilience functionality, and you can, for instance, set them up so that your custom code is executed when events such as failed retries, timeouts or activated circuit breakers occur.
E.g. for the Java stack Hystrix is widely used; for .NET you can look into Polly to make use of retry, timeout, circuit breaker, bulkhead or fallback functionality.
Concerning health checks, you can look into Spring Boot Actuator for Java; .NET Core already provides a health check middleware that more or less offers that functionality out of the box.
But before using any libraries, I suggest first getting familiar with the purpose and concepts of the listed patterns, so that you can choose and integrate those that best fit your use cases and major concerns.
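To make two of the listed patterns concrete, here is a minimal Circuit Breaker with a Fallback. It is a teaching sketch, not how Hystrix or Polly are implemented: the class, its parameters and the `flaky` function are hypothetical, and a production breaker would also need thread safety and a proper half-open state.

```python
import time


class CircuitBreaker:
    """Fails fast to a fallback after repeated failures (hypothetical minimal version)."""

    def __init__(self, failure_threshold, reset_seconds, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset:
                # Open: do not touch the failing service at all.
                return fallback()
            # Reset period elapsed: let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


fake_now = [0.0]
attempts = []


def flaky():
    attempts.append(1)
    raise ConnectionError("service B is down")


breaker = CircuitBreaker(failure_threshold=2, reset_seconds=30,
                         clock=lambda: fake_now[0])
r1 = breaker.call(flaky, lambda: "fallback")  # first failure
r2 = breaker.call(flaky, lambda: "fallback")  # second failure opens the circuit
r3 = breaker.call(flaky, lambda: "fallback")  # open: flaky is not called
attempts_while_open = len(attempts)
fake_now[0] = 31.0
r4 = breaker.call(lambda: "ok", lambda: "fallback")  # trial call succeeds, circuit closes
```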
Update
We have to differentiate between two well-known problems here:
1.) How can service A robustly handle temporary outages of service B (or the network connection between service A and B which comes down to the same problem)?
To address the related problems the above mentioned patterns will help.
2.) How to make sure that the request that should be sent to service B will not get lost if service A itself goes down?
To address this kind of problem there are different options at hand.
2a.) The component that performed the request to service A (which then triggers service B) also applies the resilience patterns mentioned above and retries its request until service A successfully answers that it has performed its tasks (which includes the successful request to service B).
There can also be several instances of each service, with some kind of load balancer in front of them that distributes and directs the requests to an available instance of the specific service (based on regularly performed health checks). Or you can use a service registry (see https://microservices.io/patterns/service-registry.html).
You can of course chain several API calls after another but this can lead to cascading failures. So I would rather go with an asynchronous communication approach as described in the next option.
2b.) Let's consider that it is of utmost importance that some instance of service A will reliably perform the request to service B.
You can use message queues in this case as follows:
Let's say you have a queue where jobs to be performed by service A are collected.
Then you have several instances of service A running (see horizontal scaling) where each instance will consume the same queue.
You will use the message locking features of the message queue service, which make sure that as soon as one instance of service A reads a message from the queue, the other instances won't see it. If service A was able to complete its job (i.e. call service B, save some state in service A's persistence and whatever other tasks you need to include for successful processing), it deletes the message from the queue afterwards so that no other instance of service A processes the same message.
If service A goes down during processing, the queue service automatically unlocks the message for you, and another instance of service A (or the same instance after it has restarted) will read the message (i.e. the job) from the queue and try to perform all the tasks (call service B, etc.).
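The lock, delete and automatic unlock behaviour described above can be illustrated with a toy queue. This is a sketch with hypothetical names (`LockingQueue`, `visibility_timeout`), not the API of any particular queue service. Real services differ in details, and because a message can be delivered more than once, the job should be safe to repeat.

```python
import time


class LockingQueue:
    """Toy queue with per-message locks that expire automatically (hypothetical)."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self._timeout = visibility_timeout
        self._clock = clock
        self._messages = {}  # message id -> [body, locked_until]
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = [body, float("-inf")]  # not locked yet
        return self._next_id

    def receive(self):
        # Hand out the first unlocked message and lock it for the timeout period.
        now = self._clock()
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self._timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        # Called by the consumer only after the whole job succeeded.
        self._messages.pop(msg_id, None)


fake_now = [0.0]
queue = LockingQueue(visibility_timeout=30, clock=lambda: fake_now[0])
queue.send("call service B")

first = queue.receive()    # instance 1 takes the job and locks it
hidden = queue.receive()   # instance 2 sees nothing while the lock holds
fake_now[0] = 31.0         # instance 1 crashed and never deleted the message
retry = queue.receive()    # lock expired, another instance gets the job
queue.delete(retry[0])     # job done, remove it for good
fake_now[0] = 100.0
empty = queue.receive()
```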
You can combine several queues e.g. also to send a message to service B asynchronously instead of directly performing some kind of API call to it.
The point is that the queue service is a highly available and redundant service that already makes sure no message gets lost once it has been published to a queue.
Of course, you could also manage the jobs to be performed in service A's own database, but consider that when service A receives a request there is always a chance that it goes down before it can save the status of the job to its persistent storage for later processing. Queue services already address that problem for you if chosen thoughtfully and used correctly.
For instance, if you consider Kafka as the messaging service, you can look into this Stack Overflow answer, which covers the solution to this problem with that specific technology: https://stackoverflow.com/a/44589842/7730554
There are many ways to solve your problem.
I guess you are talking about two topics: design patterns in microservices and the Circuit Breaker.
https://dzone.com/articles/design-patterns-for-microservices
To solve your problem, I normally put a message queue between services and use Service Discovery to detect which services are alive. If a service dies or is overloaded, I use the Circuit Breaker pattern.
We use MassTransit with RabbitMQ. Is there a way to check whether endpoints are available before we publish any messages? I want to set up our IoC to use another strategy if the service bus isn't available, and I don't want to get to the point where I catch RabbitMQ.Client.Exceptions.BrokerUnreachableException when publishing messages.
If you're using a container, you could create a decorator that could monitor the outcome of the Publish method call, and if it starts throwing exceptions, you could switch the calls over to an alternative publisher.
Ideally such an implementation would include some type of progressive retry capability so that once the endpoint becomes available the calls resume back to the actual endpoint, as well as triggering some replay of the previously failed messages to the endpoint as well.
I figure you're already dealing with the need to have an alternative storage available, such as a local endpoint or some sort of local storage.
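The decorator idea described above is sketched below in a language-agnostic way. It does not use the MassTransit API: `FallbackPublisher`, the `send` function and the use of `ConnectionError` as the failure signal are all assumptions for illustration, and the progressive retry (backoff) mentioned above is left out for brevity.

```python
class FallbackPublisher:
    """Wraps a publish function; parks messages locally while the broker is down (hypothetical)."""

    def __init__(self, primary, pending):
        self._primary = primary  # callable that sends to the real endpoint
        self._pending = pending  # local storage for messages that could not be sent

    def publish(self, message):
        try:
            # Replay earlier failures first so ordering is preserved.
            self._replay()
            self._primary(message)
        except ConnectionError:
            # Endpoint unavailable: keep the message for a later replay.
            self._pending.append(message)

    def _replay(self):
        while self._pending:
            self._primary(self._pending[0])  # raises if the broker is still down
            self._pending.pop(0)             # remove only after a successful send


broker_up = [False]
delivered = []


def send(message):
    # Stand-in for the real publish call.
    if not broker_up[0]:
        raise ConnectionError("broker unreachable")
    delivered.append(message)


pending = []
publisher = FallbackPublisher(send, pending)
publisher.publish("a")               # broker down: parked
publisher.publish("b")               # broker down: parked
parked_while_down = list(pending)
broker_up[0] = True
publisher.publish("c")               # broker back: a and b replayed, then c sent
```

Callers only ever see `publish`, so the container can register the decorator in place of the real publisher.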
Not currently, you can submit an issue requesting that feature: https://github.com/MassTransit/MassTransit/issues. It's not trivial to implement, but maybe not impossible.
A couple of other options people have used include a remote cluster, or a local instance on each machine that forwards to, or is clustered across, all machines included in the bus.