Can we replace ZooKeeper with etcd in spring-statemachine?

What are the changes necessary to use etcd in place of zookeeper in spring-statemachine?

To be honest, I'm not sure, as I'm not that familiar with it. If etcd can notify clients of changes the same way ZooKeeper does, then in theory it's possible. Beyond that, it's just a matter of having leader election and a way to store data.

Related

Pause/Resume Flink job during migrations

I'm using Apache Flink to propagate updates from a given set of Kafka topics into an Elasticsearch cluster.
The problem I'm facing is that sometimes the Elasticsearch cluster evolves and I have to (1) modify the mappings, (2) copy over the data...and by the time I have to point the Flink jobs to the new alias/index, there are plenty of updates that made it to the old index.
So I wonder what's the best way to approach this. I can have downtime, but I would like to avoid it if possible. I was trying to make the Flink jobs slow down or pause the (Kafka) input sources until the migration finishes, but I didn't find any endpoint for this.
The Flink jobs run in application mode.
If anyone can shed some light on how to accomplish this: pause/resume the jobs via an API or something similar, I will really appreciate the input. The only constraint I have is around stopping the applications (as in stopping/killing pods): it's possible, but too troublesome due to access constraints to the Kubernetes clusters.
I'd probably look into stopping the job with a savepoint using the Flink REST API: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-stop
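As a rough illustration of that REST call, here is a minimal Python sketch that builds and sends a stop-with-savepoint request. The base URL, job id, and savepoint directory are placeholders you would supply, and the helper names are made up for this example; check the linked docs for the exact request and response fields in your Flink version.

```python
import json
import urllib.request


def build_stop_request(base_url, job_id, savepoint_dir, drain=False):
    """Build (but do not send) a stop-with-savepoint request for one job."""
    # "targetDirectory" and "drain" are the body fields described for
    # POST /jobs/:jobid/stop; verify them against your Flink version.
    body = {"targetDirectory": savepoint_dir, "drain": drain}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs/{job_id}/stop",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stop_job(base_url, job_id, savepoint_dir):
    """Send the request and return the parsed JSON response."""
    request = build_stop_request(base_url, job_id, savepoint_dir)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Once the migration is done, the job would be resubmitted from the savepoint that was written, so it resumes from the Kafka offsets stored there.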
If that Flink app is pretty big and has lots of state, and you don't want to stop it, you can also try to just stop sending data to the input Kafka topic (assuming the job can write correctly with the new mappings and indices after you've made the required ES cluster changes, without any change to the Flink job itself). It is a bit of overhead, but you could have different topics for your producers and Flink sources, and have another simple Flink job mirror data from one topic (where producers produce to) to the other (where Flink consumes from). When you want to stop writing to ES, just stop that job using the REST API. To avoid writing a new Flink job you could use MirrorMaker or similar, but to stop it you may have to kill its pod.
Another option is architecting the Elasticsearch indexes so they can support your cluster evolution without having to stop the Flink app. It is hard to know exactly what you'd need to change, but by writing into aliases and playing with the write index flag you may be able to achieve what you want. I've done this in the past, but it is true that if your mappings change a lot it may be hard to do.
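To make the alias idea a bit more concrete, here is a small Python sketch that builds the body for an Elasticsearch POST /_aliases call, moving the write index of an alias from the old index to the new one in a single request. The alias and index names are hypothetical, and the sketch assumes the Flink sink writes to the alias rather than to a concrete index.

```python
import json


def build_write_index_swap(alias, old_index, new_index):
    """Return the body for POST /_aliases that moves the write index."""
    return {
        "actions": [
            # Keep the old index readable through the alias, but stop writes to it.
            {"add": {"index": old_index, "alias": alias, "is_write_index": False}},
            # Route all new writes through the alias to the new index.
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(json.dumps(build_write_index_swap("events", "events-v1", "events-v2"), indent=2))
```

Because both actions travel in one request, writers should not observe a moment where the alias has no write index.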

Master Slave configuration for Spring Boot Microservices

I have a Spring Boot application (microservice) running on two nodes and registered with a Eureka naming server. My requirement is as follows:
An Autosys job will trigger one complex calculation in the microservice, which will take about 45 minutes to complete. The result of this calculation will be saved to a Gemfire cache and a database. I want these two nodes to act as master and slave, where only the master node takes up and executes the complex calculation request. Only if the master goes down should the slave become master and take over responsibility for executing the complex calculation.
Another catch: if an ad hoc request for the same calculation arrives while the calculation is already running, the latest request needs to be rejected with a message saying the calculation is already running.
I explored the possibility to use Apache ZooKeeper but it doesn't seem to satisfy my requirement of serving the request only using Master node.
Is there any way of achieving this?
What about Kafka? It uses ZooKeeper under the covers: https://kafka.apache.org/
You are probably looking for leader election: When does Kafka Leader Election happen?

which port is used to gossip in akka actor cluster by the cluster members?

I am trying to debug an issue where cluster islands are forming.
I am checking whether there is a network issue.
Which port do the cluster members use to gossip in an Akka actor cluster?
It depends on your configuration - it's the TCP port that you set up Akka to listen on. For more details refer to Cluster configuration docs (see the port and seed-node values).
Broader hints on cluster partitions:
If you ended up having a cluster partition in an Akka cluster, it most likely means you're using the auto-downing feature. It's not recommended for production use, as it's rather flaky: it relies on a simple timeout-based mechanism. For more advanced downing mechanisms you can look into using the Split Brain Resolver commercial tooling, or build a downing mechanism yourself which would hook into external monitoring infrastructure (we've seen a number of teams do this).
An interesting thought to keep in mind is that perhaps you do not need auto-downing at all, and when leaving the cluster with a node you can do so cleanly by issuing Cluster.leave(address) in the code.

How to use Consul in leader election?

How do I use Consul to make sure only one service is performing a task?
I've followed the examples in http://www.consul.io/ but I am not 100% sure which way to go. Should I use KV? Should I use services? Or should I register a service as a Health Check and make it callable by the cluster at a given interval?
For example, imagine there are several data centers. Within every data center there are many services running. Every one of these services can send emails. These services have to check if there are any emails to be sent. If there are, then send the emails. However, I don't want the same email to be sent more than once.
How would I make sure all emails are sent and none is sent more than once?
I could do this using other technologies, but I am trying to implement this using Consul.
This is exactly the use case for Consul Distributed Locks
For example, let's say you have three servers in different AWS availability zones for fail over. Each one is launched with:
consul lock -verbose lock-name ./run_server.sh
The Consul agent will only run the ./run_server.sh command on whichever server acquires the lock first. If ./run_server.sh fails on the server holding the lock, the Consul agent will release the lock, and whichever other node acquires it first will execute ./run_server.sh. This way you get failover with only one server running at a time. If you registered your Consul health checks properly, you'll be able to see that the server on the first node failed; you can then repair it and restart the consul lock ... on that node, where it will block until it can acquire the lock.
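If you would rather take the lock from application code than wrap the process in consul lock, the same idea can be sketched against Consul's HTTP API: create a session, then try to acquire a KV key with it. The snippet below is a minimal Python sketch, not a reproduction of what the consul lock command does internally; the key name, holder name, TTL, and helper names are all assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def session_create_request(consul_url, name, ttl="15s"):
    """Build the request that creates a session (PUT /v1/session/create)."""
    body = json.dumps({"Name": name, "TTL": ttl}).encode("utf-8")
    return urllib.request.Request(
        f"{consul_url}/v1/session/create", data=body, method="PUT"
    )


def acquire_request(consul_url, key, session_id, holder):
    """Build the request that tries to take the lock on a KV key."""
    query = urllib.parse.urlencode({"acquire": session_id})
    return urllib.request.Request(
        f"{consul_url}/v1/kv/{key}?{query}",
        data=holder.encode("utf-8"),  # stored as the key's value
        method="PUT",
    )


def try_become_leader(consul_url, key, holder):
    """Return True if this node acquired the lock, False otherwise."""
    with urllib.request.urlopen(session_create_request(consul_url, holder)) as resp:
        session_id = json.load(resp)["ID"]
    with urllib.request.urlopen(
        acquire_request(consul_url, key, session_id, holder)
    ) as resp:
        # Consul answers with a JSON boolean: true means the lock was taken.
        return json.load(resp) is True
```

Only the node for which try_become_leader returns True would go on to send emails; the others would retry later. A real implementation would also need to renew the session before its TTL expires and release the key on shutdown.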
Currently, distributed locking can only happen within a single Consul datacenter. But since it is up to you to decide which Consul servers make up a datacenter, you should be able to solve your issue. If you want locking across federated Consul datacenters you'll have to wait for it, since it's a roadmap item.
First Point:
The question is how to use Consul to solve a specific problem. However, Consul cannot solve that specific problem because of intrinsic limitations in the nature of a gossip protocol.
When one datacenter cannot talk to another you cannot safely determine if the problem is the network or the affected datacenter.
The usual solution is to define what happens when one DC cannot talk to another one. For example, if we have 3 datacenters (DC1, DC2, and DC3) we can determine that whenever one DC cannot talk to the other 2 DCs then it will stop updating the database.
If DC1 cannot talk to DC2 and DC3 then DC1 will stop updating the database, and the system will assume DC2 and DC3 are still online.
Let's imagine that DC2 and DC3 are still online and they can talk to each other, then we have quorum to continue running the system.
When DC1 comes online again it will play catch up with the database.
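The rule described above can be written down as a small majority check. This is only a sketch of the reasoning: the function name is made up, and generalizing "cannot talk to the other 2 DCs" to "cannot see a strict majority" is an assumption that happens to coincide with the three-datacenter example.

```python
def may_write(local_dc, reachable_dcs, all_dcs):
    """Return True if local_dc sees a strict majority of datacenters.

    A datacenter always counts itself; peers outside all_dcs are ignored.
    """
    visible = {local_dc} | (set(reachable_dcs) & set(all_dcs))
    return len(visible) > len(all_dcs) // 2


if __name__ == "__main__":
    dcs = ["DC1", "DC2", "DC3"]
    # DC1 is cut off from both peers, so it must stop updating the database.
    print(may_write("DC1", [], dcs))
    # DC2 can still reach DC3, so together they keep quorum.
    print(may_write("DC2", ["DC3"], dcs))
```

With three datacenters, an isolated one sees only itself (1 of 3) and stops, while the two that can still talk to each other (2 of 3) keep running.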
Where can Consul help here? It can communicate between DCs and check if they are online... but so can ICMP.
Take a look at the comments. Did this answer your question? Not really. But I don't think the question has an answer.
Second point: The question is "How to use Consul in leader election?" It would have been better to ask "How does Consul elect a new leader?" or "Given the documentation at Consul.io, can you give me an example of how to determine the leader using Consul?"
If that is what you really want, then the question was already answered: How does a Consul agent know it is the leader of a cluster?

Discovery service and Leader election algorithm

I've been doing some research on enhancing the in-house Discovery Service on my project. We have a number of nodes in a highly available cluster responsible for the discovery service. To get access to some service, each client app sends a multicast message to all these nodes in the cluster. All nodes respond to the client, and the very first response determines which node the client works with from then on. This is an overhead, and I'm thinking of using some kind of leader election algorithm where only a single leader responds to clients. Is it reasonable to use such an algorithm for this task?
I think what you are trying to do is load balance across multiple machines, where any machine can handle the requests. Leader election seems like overhead here. A load balancer can probably solve the issue.
