Multiple instances of a partitioned spring batch job - spring-boot

I have a partitioned Spring Batch job. It is always started with a unique set of parameters, so every run is a new job instance.
My remoting fabric is JMS with request/response queues configured for communication between the masters and slaves.
One instance of this partitioned job processes files in a given folder. Master step gets the file names from the folder and submits the file names to the slaves; each slave instance processes one of the files.
Job works fine.
Recently, I started to execute multiple instances (completely separate JVMs) of this job to process files from multiple folders. So I essentially have multiple master steps running but the same set of slaves.
Occasionally, and seemingly at random, I see the following behavior: the slaves finish their work, but the master keeps spinning as if the slaves were still busy. The step status shows as successful in the job repository, but at the job level the status is STARTING with an exit code of UNKNOWN.
All masters share the set of request/response queues; one queue for requests and one for responses.
Is this a supported configuration? Can you have multiple master steps sharing the same set of queues running concurrently? Because of the behavior above I'm thinking the responses back from the workers are going to the incorrect master.
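To make the suspicion above concrete, here is a minimal plain-Java sketch of several masters sharing one response queue. It does not use JMS or any Spring Batch API; the Reply class, the takeRepliesFor method and the file names are hypothetical stand-ins, and the assumption is simply that each reply carries the id of the job execution that requested the work. If every master took whatever reply came next, it could consume a reply meant for another master; filtering on the id leaves the other masters' replies on the queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SharedReplyQueueSketch {

    // Hypothetical reply message, tagged with the job execution that requested the work.
    static final class Reply {
        final long jobExecutionId;
        final String fileName;

        Reply(long jobExecutionId, String fileName) {
            this.jobExecutionId = jobExecutionId;
            this.fileName = fileName;
        }
    }

    // Removes and returns only the replies tagged with the given job execution id,
    // leaving replies addressed to other masters on the shared queue.
    static List<Reply> takeRepliesFor(long jobExecutionId, List<Reply> sharedQueue) {
        List<Reply> mine = new ArrayList<>();
        Iterator<Reply> it = sharedQueue.iterator();
        while (it.hasNext()) {
            Reply reply = it.next();
            if (reply.jobExecutionId == jobExecutionId) {
                mine.add(reply);
                it.remove();
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // Two masters (job executions 1 and 2) share one reply queue.
        List<Reply> sharedQueue = new ArrayList<>(Arrays.asList(
                new Reply(1L, "folderA/file1.csv"),
                new Reply(2L, "folderB/file1.csv"),
                new Reply(1L, "folderA/file2.csv")));

        List<Reply> forMasterOne = takeRepliesFor(1L, sharedQueue);
        System.out.println("Master 1 received " + forMasterOne.size() + " replies");
        System.out.println("Left on the queue for other masters: " + sharedQueue.size());
    }
}
```

With a real broker, the equivalent of this filter would have to be applied when consuming from the shared response queue; whether the current configuration does that is exactly what the question is asking.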

Related

Spring Batch running in Kubernetes

I have a Spring Batch job that partitions into "Slave Steps" and runs them in a thread pool. Here is the configuration: Spring Batch - FlatFileItemWriter Error 14416: Stream is already closed
I'd like to run this Spring Batch job in Kubernetes. I checked this post: https://spring.io/blog/2021/01/27/spring-batch-on-kubernetes-efficient-batch-processing-at-scale by Mahmoud Ben Hassine.
From the post, in the section:
Choosing the Right Kubernetes Job Concurrency Policy
As I pointed out earlier, Spring Batch prevents concurrent job executions of the
same job instance. So, if you follow the “Kubernetes job per Spring
Batch job instance” deployment pattern, setting the job’s
spec.parallelism to a value higher than 1 does not make sense, as this
starts two pods in parallel and one of them will certainly fail with a
JobExecutionAlreadyRunningException. However, setting a
spec.parallelism to a value higher than 1 makes perfect sense for a
partitioned job. In this case, partitions can be executed in parallel
pods. Correctly choosing the concurrency policy is tightly related to
which job pattern is chosen (As explained in point 3).
Looking at my batch job, if I start 2 or more pods, it sounds like one or more pods will fail because they will try to start the same job. On the other hand, it sounds like multiple pods can run in parallel because I am using a partitioned job.
My Spring Batch job seems similar to https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
This said, what is the right approach to it? How many pods should I set on my deployment?
Will the partitions/threads run on separate pods, or will all the threads run in just one pod?
Where do I define that, in the parallelism setting? And should the parallelism be the same as the number of threads?
Thank you! Markus.
A thread runs in a JVM, which runs inside a container, which in turn runs in a Pod. So it does not make sense to talk about different threads running on different Pods.
The partitioning technique in Spring Batch can be either local (multiple threads within the same JVM where each thread processes a different partition) or remote (multiple JVMs processing different partitions). Local partitioning requires a single JVM, hence you only need one Pod for that. Remote partitioning requires multiple JVMs, so you need multiple Pods.
I have a Spring Batch job that partitions into "Slave Steps" and runs them in a thread pool
Since you implemented local partitioning with a pool of worker threads, you only need one Pod to run your partitioned Job.
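As a rough illustration of why one Pod is enough, the sketch below mimics local partitioning in plain Java: every partition runs on a thread from a pool inside a single JVM. It is not the Spring Batch API; the class name, processPartition and the partition names are hypothetical, and the pool size of 2 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LocalPartitioningSketch {

    // Hypothetical stand-in for a worker step: handles one partition and reports which thread ran it.
    static String processPartition(String partition) {
        return partition + " processed by " + Thread.currentThread().getName();
    }

    // Runs every partition on a fixed pool of threads inside this single JVM
    // and waits for all of them, returning results in partition order.
    static List<String> runPartitions(List<String> partitions, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String partition : partitions) {
                futures.add(CompletableFuture.supplyAsync(() -> processPartition(partition), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join()); // block until this partition has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("partition0", "partition1", "partition2");
        for (String line : runPartitions(partitions, 2)) {
            System.out.println(line);
        }
    }
}
```

All the threads share one process, so adding Pods would only start additional copies of the whole job rather than spread these threads out.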

Spring Batch in clustered environment, high-availability

Right now I use an H2 in-memory database as the JobRepository for my single-node Spring Batch/Boot application.
Now I would like to run the Spring Batch application on two nodes in order to increase performance (distribute jobs between these 2 instances) and make the application more fault tolerant.
Instead of H2 I'm going to use PostgreSQL and configure both applications to use this shared database. Is that enough for Spring Batch to start working properly in the cluster and start distributing jobs between cluster nodes, or do I need to perform some additional actions?
Depending on how you distribute your jobs across the nodes, you might need to set up communication middleware (such as a JMS or AMQP provider) in addition to a shared job repository.
For example, if you use remote partitioning, your job will be partitioned and each worker can be run on one node. In this case, the job repository must be shared in order for:
the workers to report their progress to the job repository
the master to poll the job repository for the workers' statuses.
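The two roles of the shared repository listed above can be sketched in plain Java. This is only an illustration under stated assumptions: a ConcurrentHashMap stands in for the shared job repository, threads stand in for workers on separate nodes, and reportStatus, waitForWorkers and the status string "COMPLETED" are hypothetical names, not Spring Batch API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RepositoryPollingSketch {

    // Worker side: record the status of one partition in the shared store.
    static void reportStatus(Map<String, String> repository, String partition, String status) {
        repository.put(partition, status);
    }

    // Master side: poll the shared store until every partition is COMPLETED
    // or the timeout expires. Returns true only if all partitions completed.
    static boolean waitForWorkers(Map<String, String> repository, List<String> partitions,
                                  long pollMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            boolean allDone = true;
            for (String partition : partitions) {
                if (!"COMPLETED".equals(repository.get(partition))) {
                    allDone = false;
                    break;
                }
            }
            if (allDone) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> repository = new ConcurrentHashMap<>();
        List<String> partitions = Arrays.asList("partition0", "partition1");

        // Each thread stands in for a worker running on a separate node.
        for (String partition : partitions) {
            new Thread(() -> reportStatus(repository, partition, "COMPLETED")).start();
        }

        System.out.println("All workers completed: "
                + waitForWorkers(repository, partitions, 50, 2000));
    }
}
```

If each node had its own private store, the master would never see the workers' updates and the poll would time out, which is why the repository has to be shared in this setup.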
If your jobs are completely independent and you don't need features like restart, you can continue using an in-memory database for each job and launch multiple instances of the same job on different nodes. But even in this case, I would recommend using a production-grade job repository instead of an in-memory database. Things can go wrong very quickly in a clustered environment, and having a job repository to store the execution status, synchronize executions, restart failed executions, etc. is crucial in such an environment.

Number of application masters in a mapreduce job?? And mapreduce processing steps in YARN

I know that there is only one Resource Manager in a Hadoop cluster.
From my understanding, there should be only one Application Master for a cluster as well. Is that right? Following is my understanding of how a mapreduce job is run in YARN. Please correct if my understanding is not right.
Application execution sequence of steps on YARN:
Client submits a job to the Resource Manager (RM). RM runs on Master Node. There is only one RM across the cluster to manage the resources. Resource Manager is a Daemon process.
RM will go to HDFS thru Name Node.
RM spins up an Application Master (AM). AM will reach HDFS thru Name Node. It will create a mapper matrix. This is the mapper phase. Like if Block 1 is available on Name Node 5 or 6.
Based on Mapper matrix information, AM sends requests to individual Node managers (NM) to run a particular task for each block. NM runs on slave node.
Each NM sends a request to RM to get a container. A container executes an application specific process with a constrained set of resources (memory, CPU etc).
Mapper task runs in the container and sends the heart beat to the Application master. AM also sends the heart beat to RM.
After all the processes are done, AM starts another matrix for Reducer tasks.
After all the reducer tasks are completed, the AM sends the results to RM.
RM lets the client know the results and kills the AM.
Application Master can get stuck. That is why it is sending heart beats to Resource Manager
Thanks much
Nath
Other steps look fine.
RM spins up an Application Master (AM). AM will reach HDFS thru Name Node. It will create a mapper matrix. This is the mapper phase. Like if Block 1 is available on Name Node 5 or 6.
Slight correction here. The AM can only execute inside a container. So the RM first asks a node manager on some node to start a container, and only then is the AM launched inside that container, not before. So there will be a container dedicated to the AM.

Yarn capacity-scheduler Parallelize

Does the capacity scheduler in YARN run apps in parallel on the same queue for the same user?
For example: if we have 2 Hive CLI sessions in 2 terminals with the same user, and the same query is started in both, do they execute on the default queue in parallel or sequentially?
Currently, the UI shows 1 running and 1 in pending state.
Is there a way to run it in parallel?
The YARN capacity scheduler runs jobs submitted to the same queue in FIFO order. For example, if both Hive CLI sessions submit to the default queue, whichever one secures resources first moves to the running state and the other waits (only if the queue does not have enough resources for both).
If you want parallel execution:
1) You can run the other job in a different queue. You can specify the queue name when launching the job on YARN.
2) You need to size the resources so that both jobs can get the resources they need.

Set build number of downstream jobs from master job in Jenkins

I have 2 Jenkins slaves with 1 master (3 machines). For example Slave1 and Slave2. I have two jobs and used labels to bind the jobs to the slaves. For example Job1 is bound to Slave1 and Job2 is bound to Slave2. Both are freestyle jobs. I created a freestyle job which only invokes Job1 and Job2 so they run on the slaves at the same time. I'd like the two jobs to always build with the same build number or inherit the build number from the upstream job. Is there a way I could send the build number from the main job to the two downstream jobs? I'd like to prevent Job1 and Job2's build numbers from getting out of sync, which would happen if one is run by itself.
There is a method in Jenkins Java API: Job::updateNextBuildNumber(int). So you can try the following: from a system Groovy script (that can be run via Groovy Plugin) locate the child job objects, set the build number on them via the method above; then trigger them.
You may still run into problems, however. For example, if one of those jobs is triggered manually you may not be able to set a number on it (build numbers have to increase).
