Spring Batch in clustered environment, high-availability - spring-boot

Right now I use an H2 in-memory database as the JobRepository for my single-node Spring Batch/Boot application.
Now I would like to run the Spring Batch application on two nodes in order to increase performance (distribute jobs between these two instances) and make the application more fault-tolerant.
Instead of H2 I am going to use PostgreSQL and configure both applications to use this shared database. Is that enough for Spring Batch to start working properly in the cluster and to distribute jobs between cluster nodes, or do I need to perform some additional actions?

Depending on how you will distribute your jobs across the nodes, you might need to set up a communication middleware (such as a JMS or AMQP provider) in addition to a shared job repository.
For example, if you use remote partitioning, your job will be partitioned and each worker can be run on one node. In this case, the job repository must be shared in order for:
the workers to report their progress to the job repository
the master to poll the job repository for the workers' statuses.
If your jobs are completely independent and you don't need features like restart, you can keep an in-memory database in each instance and launch multiple instances of the same job on different nodes. But even in this case, I would recommend using a production-grade job repository instead of an in-memory database. Things can go wrong very quickly in a clustered environment, and having a job repository to store the execution status, synchronize executions, restart failed executions, etc. is crucial in such an environment.
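A minimal sketch of what "shared job repository" means in Spring Boot terms: point both instances at the same PostgreSQL database. Host, database name, and credentials below are assumptions; the `spring.batch.jdbc.*` prefix is the one used as of Spring Boot 2.5.

```properties
# application.properties on BOTH nodes (hypothetical host/credentials):
# both instances share one PostgreSQL-backed job repository.
spring.datasource.url=jdbc:postgresql://db-host:5432/batch_db
spring.datasource.username=batch_user
spring.datasource.password=change-me

# Create the Spring Batch metadata tables on first start-up.
spring.batch.jdbc.initialize-schema=always
```

With this in place, both instances read and write the same BATCH_* metadata tables, which is what enables restartability and prevents two nodes from running the same job instance concurrently.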

Related

Kafka connect behaviour in Distribution mode

I'm running Kafka Connect in distributed mode with two different connectors, each with one task. Each connector runs on a different instance, which is exactly what I want.
Is this behaviour always guaranteed, i.e. does the Kafka Connect cluster always share the load properly?
Connectors in Kafka Connect run with one or more tasks. The number of tasks depends on how you have configured the connector and on whether the connector itself can run multiple tasks. An example is the JDBC Source connector which, if ingesting more than one table from a database, will run (if configured to do so) one task per table.
When you run Kafka Connect in distributed mode, tasks from all the connectors are executed across the available workers. Each task will only be executing on one worker at one time.
If a worker fails (or is shut down) then Kafka Connect will rebalance the tasks across the remaining worker(s).
Therefore, you may see one connector running across different workers (instances), but only if it has more than one task.
If you think you are seeing the same connector's task executing more than once then it suggests a misconfiguration of the Kafka Connect cluster, and I would suggest reviewing https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/.
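For reference, what makes several workers act as one cluster is that they share the same `group.id` and the same internal topics. A minimal worker config sketch (broker address and topic names are assumptions; converter settings omitted):

```properties
# connect-distributed.properties, identical on every worker in the cluster
bootstrap.servers=kafka:9092

# Workers with the same group.id form one Connect cluster and rebalance
# connector tasks among themselves.
group.id=connect-cluster

# Shared internal topics for connector configs, source offsets, and statuses.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-statuses
```

A common misconfiguration is giving two workers the same `group.id` but different internal topics (or vice versa), which produces exactly the kind of duplicate-task behaviour described above.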

Spring Batch running in Kubernetes

I have a Spring Batch job that partitions into "slave steps" and runs them in a thread pool; the configuration is shown in this question: Spring Batch - FlatFileItemWriter Error 14416: Stream is already closed
I'd like to run this Spring Batch job in Kubernetes. I checked this post by Mahmoud Ben Hassine: https://spring.io/blog/2021/01/27/spring-batch-on-kubernetes-efficient-batch-processing-at-scale
From the post, in this paragraph:
Choosing the Right Kubernetes Job Concurrency Policy
As I pointed out earlier, Spring Batch prevents concurrent job executions of the same job instance. So, if you follow the "Kubernetes job per Spring Batch job instance" deployment pattern, setting the job's spec.parallelism to a value higher than 1 does not make sense, as this starts two pods in parallel and one of them will certainly fail with a JobExecutionAlreadyRunningException. However, setting a spec.parallelism to a value higher than 1 makes perfect sense for a partitioned job. In this case, partitions can be executed in parallel pods. Correctly choosing the concurrency policy is tightly related to which job pattern is chosen (as explained in point 3).
Looking at my batch job: if I start 2 or more pods, it sounds like one or more of them will fail because they will try to start the same job instance. But on the other hand, it sounds like more pods should run in parallel, because I am using a partitioned job.
My Spring Batch job seems to be similar to https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
That said, what is the right approach? How many pods should I set on my deployment?
Will the partitions/threads run on separate pods, or will the threads run in just one pod?
Where do I define that? In the parallelism setting? And should the parallelism be the same as the number of threads?
Thank you! Markus.
A thread runs in a JVM, which runs inside a container, which in turn runs in a Pod. So it does not make sense to talk about having different threads running on different Pods.
The partitioning technique in Spring Batch can be either local (multiple threads within the same JVM where each thread processes a different partition) or remote (multiple JVMs processing different partitions). Local partitioning requires a single JVM, hence you only need one Pod for that. Remote partitioning requires multiple JVMs, so you need multiple Pods.
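The local flavor can be sketched with plain JDK threads. This is an illustration of the idea (one JVM, a fixed thread pool, each worker thread processing one partition of the input), not the Spring Batch API; all class and method names below are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of local partitioning: every partition is processed by a
// thread inside the SAME JVM, which is why one Pod suffices for this mode.
public class LocalPartitioningSketch {

    // Stand-in for a worker step: here it just sums the records in a partition.
    static int processPartition(List<Integer> partition) {
        int sum = 0;
        for (int record : partition) sum += record;
        return sum;
    }

    public static int runPartitioned(List<Integer> records, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            // Split the input into gridSize contiguous partitions.
            int chunkSize = (records.size() + gridSize - 1) / gridSize;
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunkSize) {
                List<Integer> partition =
                        records.subList(start, Math.min(start + chunkSize, records.size()));
                results.add(pool.submit(() -> processPartition(partition)));
            }
            // Aggregate the partial results, as a master step would.
            int total = 0;
            for (Future<Integer> result : results) total += result.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 100; i++) records.add(i);
        System.out.println(runPartitioned(records, 4)); // prints 5050
    }
}
```

Remote partitioning replaces the in-process thread pool with a messaging middleware, so each partition can land in a different JVM (and therefore a different Pod).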
I have a Spring Batch job that partitions into "slave steps" and runs them in a thread pool
Since you implemented local partitioning with a pool of worker threads, you only need one Pod to run your partitioned Job.
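In Kubernetes terms, that translates to a Job with spec.parallelism of 1. A hedged sketch (the image name and job name are placeholders):

```yaml
# Hypothetical Kubernetes Job for a LOCALLY partitioned Spring Batch job:
# the partitions run as threads inside the single JVM in this one pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: partitioned-batch-job
spec:
  parallelism: 1     # >1 would start a second pod that fails with
  completions: 1     # JobExecutionAlreadyRunningException (see the quote above)
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: batch-app
          image: my-registry/my-batch-app:latest
```

A parallelism higher than 1 only pays off with remote partitioning, where each pod runs its own worker JVM.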

Multiple instances of a partitioned spring batch job

I have a Spring Batch partitioned job. The job is always started with a unique set of parameters, so it is always a new job instance.
My remoting fabric is JMS with request/response queues configured for communication between the masters and slaves.
One instance of this partitioned job processes files in a given folder. Master step gets the file names from the folder and submits the file names to the slaves; each slave instance processes one of the files.
Job works fine.
Recently, I started executing multiple instances (completely separate JVMs) of this job to process files from multiple folders. So I essentially have multiple master steps running, but the same set of slaves.
Randomly, I sometimes notice the following behavior: the slaves finish their work, but the master keeps spinning, thinking the slaves are still doing something. The step status shows successful in the job repository, but at the job level the status is STARTING with an exit code of UNKNOWN.
All masters share the same set of request/response queues: one queue for requests and one for responses.
Is this a supported configuration? Can you have multiple master steps sharing the same set of queues running concurrently? Because of the behavior above, I'm thinking the responses from the workers are going to the incorrect master.

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I am reading data from flat files or Cassandra (depending upon the job) and writing the processed data back to Cassandra itself.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, will it give me additional performance advantages, such as faster execution time and better resource management?
Currently, when I am processing a huge chunk of data, there is a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied to one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
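The difference is just the --deploy-mode flag at submit time. A sketch (master host, class, and jar names are placeholders):

```shell
# With --deploy-mode cluster the driver runs on a cluster machine, so the
# machine issuing this command can go offline once the job is submitted.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar
```

With --deploy-mode client (the default) the driver stays on the submitting machine for the lifetime of the application.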
Having a Cassandra cluster would not change any of these behaviors either, except that it can save you network traffic if you can get the nearest contact point for the Spark executor (just like data locality).
Failed stages get rescheduled with either of the cluster managers.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, will it give me additional performance advantages, such as faster execution time and better resource management?
In standalone cluster mode, each application by default uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
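The per-application resource cap mentioned in the quoted documentation is a single property. A sketch of how it might be set (the value 8 is an arbitrary example):

```properties
# spark-defaults.conf (or pass via --conf at submit time): cap the number of
# cores one application may acquire, so concurrent applications can coexist
# on a standalone cluster despite its FIFO scheduler.
spark.cores.max=8
```

Without this cap, the first submitted application grabs all cores and subsequent applications queue behind it.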
Currently, when I am processing a huge chunk of data, there is a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Not sure, since your application logic is not known. But you can give YARN a try.
Have a look at this related SE question for the benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

Schedule a trigger for a job that is executed on every node in a cluster

I'm wondering if there is a simple workaround/hack in Quartz for triggering a job that is executed on every node in a cluster.
My situation:
My application caches some things and runs in a cluster with no distributed cache. Now I have situations where I want to refresh the caches on all nodes, triggered by a job.
As you have found out, Quartz always picks up a random instance to execute a scheduled job and this cannot be easily changed unless you want to hack its internals.
Probably the easiest way to achieve what you describe would be to implement some sort of coordinator (or master) job that is aware of all Quartz instances in the cluster and "manually" triggers execution of the cache-sync job on every single node. The master job can easily do this via the RMI or JMX APIs exposed by Quartz.
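For the RMI route, each node's scheduler first has to be exported. A sketch of the relevant Quartz settings (host and port are examples):

```properties
# quartz.properties on each node: expose this node's scheduler over RMI so a
# coordinator job can look it up and call triggerJob() on it remotely.
org.quartz.scheduler.rmi.export = true
org.quartz.scheduler.rmi.createRegistry = true
org.quartz.scheduler.rmi.registryHost = localhost
org.quartz.scheduler.rmi.registryPort = 1099
```

The coordinator then connects to each node's registry and triggers the cache-sync job on every scheduler in turn.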
You may want to check this somewhat similar question.
