Spring Batch Parallel Job Scaling

I'm currently working on a Spring Batch POC and have a pretty good handle on most of the actual Spring Batch features. I currently have a program that uses Spring Integration to receive an HttpRequest and uses message channels to eventually send the job executions to the job launcher via a queue. What we'd really like to do is put some kind of "scheduler/load balancer" (not quite sure what to call it) in front of the job launcher that looks at the currently running worker nodes and the size of the input file and decides how many worker nodes the job should be allowed. We would probably also want to be able to change the number of worker nodes a job has while it is running, to allow more jobs to run.
The idea is that we'd have a server running that could accept many job requests at any time, and a large cluster of machines that jobs will be partitioned onto. We'd like to be able to scale horizontally, so whenever the server isn't busy it can make full use of the hardware, as well as being able to make sure that small jobs don't get constantly blocked by larger jobs.
From my research it seems like we'd have to bring in another framework to do this (do GridGain and Hadoop allow this?), but I figured I'd ask to see what people recommend for something like this, and whether there's a way to do it without pulling in another large framework.
Sorry if anything is unclear or confusing, I'm just a lowly intern who started learning Spring and Spring Batch last month and I'm far from completely understanding everything, especially this scaling stuff. Just ask and I'll try to clear things up.
Thanks for any help!

Take a look at the 'spring-batch-integration' project under the spring-batch-admin umbrella project https://github.com/SpringSource/spring-batch-admin
It has a number of examples of using Spring Integration to distribute work to other nodes. In particular, see the chunk and partition packages. Just swap out the Spring Integration channels for JMS channel adapters. By distributing work partitions via JMS, you can scale out the number of worker nodes as needed.
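To give a feel for what the partition package provides, here is a minimal, untested sketch of the master side in Java config, assuming a reasonably recent spring-batch-integration; the bean names, the 'workerStep' step name and the grid size are placeholders, and the requests/replies channels would be bridged to JMS request and reply queues with channel adapters, which is exactly the swap described above:

    import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.channel.DirectChannel;
    import org.springframework.integration.channel.QueueChannel;
    import org.springframework.integration.core.MessagingTemplate;
    import org.springframework.messaging.MessageChannel;
    import org.springframework.messaging.PollableChannel;

    @Configuration
    public class PartitionMasterConfig {

        // Outbound request channel; bridge this to a JMS request queue with an
        // outbound channel adapter so remote worker nodes can pick partitions up.
        @Bean
        public MessageChannel requests() {
            return new DirectChannel();
        }

        // Reply channel; bridge a JMS reply queue back onto this with an
        // inbound channel adapter so the master can aggregate the StepExecutions.
        @Bean
        public PollableChannel replies() {
            return new QueueChannel();
        }

        // Sends StepExecutionRequest messages out over the request channel.
        @Bean
        public MessagingTemplate messagingTemplate() {
            MessagingTemplate template = new MessagingTemplate(requests());
            template.setReceiveTimeout(600000);
            return template;
        }

        // PartitionHandler used by the master's partitioned step.
        @Bean
        public MessageChannelPartitionHandler partitionHandler() {
            MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
            handler.setMessagingOperations(messagingTemplate());
            handler.setReplyChannel(replies());
            handler.setStepName("workerStep"); // the step each remote worker executes
            handler.setGridSize(4);            // number of partitions; the "scheduler/load balancer"
                                               // you describe could compute this from the file size
                                               // and the number of free worker nodes
            return handler;
        }
    }

The worker side is the mirror image: a StepExecutionRequestHandler listening on the JMS request queue and posting results back to the reply queue, as shown in the partition package mentioned above.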
There are a number of threads on this subject in the spring integration forum; search for 'PartitionHandler'.
Hope that helps.

Related

Microservice Decomposition for Batch Job

I am reading different posts and books on Microservice Architecture in the hunt for an answer to my question, which is related to decomposition strategies. The question is: should we create a new microservice specifically to handle the batch job?
For context, the nature of the batch job is to read data from the database and make REST calls to an external system if the data is in a particular state. Additionally, the batch job is supposed to run only once a day.
My questions related to this are:
Is it an industry norm/practice that a batch job should get its own microservice, because a batch job consumes resources that can hinder the incoming traffic and increase latency?
Does running a batch job affect the latency of the APIs exposed to the client?
I would say yes, it makes sense. Usually batch jobs have a very different development lifecycle and deployment frequency.
I've done something similar by myself and I'm totally sure it's worth it.
It would then also be possible to spin up an instance just to run the job once a day, which can save money in cloud environments.
Latency: it depends on that other system. You might want to throttle your requests to the other system so you don't bring it down under heavy load.
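As a minimal illustration of that throttling, here is a sketch using Guava's RateLimiter (just one option; the Record type and callExternalSystem method are hypothetical stand-ins for your own code):

    import com.google.common.util.concurrent.RateLimiter;
    import java.util.List;

    public class ThrottledBatchCaller {

        // Allow at most 10 outbound REST calls per second to the external system.
        private final RateLimiter rateLimiter = RateLimiter.create(10.0);

        public void process(List<Record> records) {
            for (Record record : records) {
                if (record.isInTargetState()) {
                    rateLimiter.acquire();      // blocks until a permit is available
                    callExternalSystem(record); // the REST call to the external system
                }
            }
        }

        private void callExternalSystem(Record record) {
            // placeholder for the actual REST client call
        }

        // placeholder for whatever the batch job reads from the database
        public interface Record {
            boolean isInTargetState();
        }
    }

The rate itself is something you would negotiate with the owners of the external system.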

Alternative to Timeout notification node in IBM Integration Bus

I probably have found a similar question (and answer) but I wanted to know if any better alternative is available.
Link to similar question:
http://mqseries.net/phpBB/viewtopic.php?t=72601&sid=f62d9730d61ee2ee2a59986dd79defd1
I want to schedule a particular message flow every 5 seconds (or so). I'm using IIB 10 and it's not associated with MQ, so the Timer nodes are not functional.
I've read about scheduling it with a cron job, but that again makes it dependent on the OS, which is not my preference. Is there any alternative to the Timeout Notification node?
Can we use java.util.TimerTask or something similar to do it? Any helping hands please?
I don't know of any solution that does not require a cron job or other external scheduler.
Many organisations use a distributed scheduler like Control-M for a wide range of tasks, and adding a couple of jobs to support the integration layer is not seen as a problem.
You can also write your own timer flow using an infinite WHILE loop with the SLEEP and PROPAGATE TO TERMINAL functions and sending HTTP requests, or configure a "CallableFlow".
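On the java.util.TimerTask idea from the question: a small standalone JVM scheduler that fires the flow's HTTP input node every 5 seconds would also work, at the cost of one more external process to manage. A rough sketch (Java 11+, using a ScheduledExecutorService rather than TimerTask, and the URL below is only a placeholder for wherever your flow listens):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class FlowTrigger {

        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://iib-host:7800/trigger/myFlow")) // placeholder for the flow's HTTP input URL
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    System.out.println("Triggered flow, HTTP " + response.statusCode());
                } catch (Exception e) {
                    e.printStackTrace(); // keep the scheduler alive if one trigger fails
                }
            }, 0, 5, TimeUnit.SECONDS);  // fire every 5 seconds
        }
    }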

Testing transactionality of IBM Integration Bus

In order to improve my flows, I would like to test a few scenarios in which the flow/App or Integration Node is stopped while a message is still being processed (to test how transactional my flows actually are, depending on different settings). As IIB 9 is fast at processing simple requests, I don't have time to shut down the flow quickly enough.
I tried to use the debugger, but that doesn't seem to work; I cannot stop the flow or App while debugging, and shutting down the Integration Node doesn't seem to work well either.
Is there a (built-in) way to make the broker work really slowly so I have time to shut it down? Or should I just think of a really complicated Compute node to keep it occupied for a few seconds?
Any suggestions (also for the latter if that is the best option) are welcome.
A really complicated Compute node will take a lot of CPU. I would prefer making the flow wait for something.
E.g. a flow with an HTTP Request node or SOAP Request node making a call to an external service; make this external service take a while to respond, say 120 seconds.
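If you need a throwaway version of that slow external service, the JDK's built-in HTTP server is enough; a minimal sketch (the port, path and 120-second delay are arbitrary):

    import com.sun.net.httpserver.HttpServer;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class SlowStubService {

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(9099), 0);
            server.createContext("/slow", exchange -> {
                try {
                    Thread.sleep(120_000);   // hold the flow's request node for ~120 seconds
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "done".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            System.out.println("Slow stub listening on http://localhost:9099/slow");
        }
    }

Point the flow's request node at that URL and you have a comfortable window to stop the flow, the application or the Integration Node mid-transaction (just check that the request node's own timeout is longer than the stub's delay).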

CPU bound/stateful distributed system design

I'm working on a web application frontend to a legacy system which involves a lot of CPU-bound background processing. The application is also stateful on the server side, and the domain objects need to be held in memory across the entire session as the user operates on them via the web based interface. Think of it as something like a web UI front end to Photoshop, where each filter can take 20-30 seconds to execute on the server side, so the app still has to interact with the user in real time while they wait.
The main problem is that each instance of the server can only support around 4-8 instances of each "workspace" at once and I need to support a few hundreds of concurrent users at once. I'm going to be building this on Amazon EC2 to make use of the auto scaling functionality. So to summarize, the system is:
A web application frontend to a legacy backend system
Tasks performed are CPU bound
Stateful, most calls will be some sort of RPC, the user will make multiple actions that interact with the stateful objects held in server side memory
Most tasks are semi-realtime, where they have to execute for 20-30 seconds and return the results to the user in the same session
Uses Amazon AWS auto scaling
I'm wondering what is the best way to make a system like this distributed.
Obviously I will need a web server to interact with the browser and then send the CPU-bound tasks from the web server to a bunch of dedicated servers that do the background processing. The question is how to best hook the 2 tiers together for my specific needs.
I've been looking at message queue systems such as RabbitMQ, but these seem to be geared towards one-time tasks where any worker node can simply grab a job from a queue, execute it and forget the state. My needs are a little different, since there could be multiple 'tasks' that need to be 'sticky'; for example, if step 1 is started on node 1 then step 2 for the same workspace has to go to the same worker process.
Another problem I see is that most worker queue systems seem to be geared towards background tasks that can be processed anytime, rather than a system like mine that has to give the user feedback while they wait.
My question is, is there an off the shelf solution for something like this that will allow me to easily build a system that can scale? Would love to hear your thoughts.
RabbitMQ has an RPC tutorial. I haven't used this pattern in particular, but I am running RabbitMQ on a couple of nodes and it can handle hundreds of connections and millions of messages. With a little work on monitoring you can detect when there is more work to do than you have consumers for. Messages can also time out, so queues won't back up too badly. To scale out capacity you can create multiple RabbitMQ nodes/clusters. You could also have multiple rounds of RPC, so that after the first response you include the information required to get the second message to the correct destination.
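For reference, the heart of that RPC tutorial is just a reply queue plus a correlation id. A trimmed sketch with the RabbitMQ Java client (the host, queue name and payload are placeholders, and timeouts/error handling are omitted):

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class RpcClientSketch {

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // placeholder broker host

            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {

                // Exclusive, auto-delete reply queue for this client.
                String replyQueue = channel.queueDeclare().getQueue();
                String correlationId = UUID.randomUUID().toString();

                AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                        .correlationId(correlationId)
                        .replyTo(replyQueue)
                        .build();

                // Send the CPU-bound task to the workers' request queue.
                channel.basicPublish("", "task_queue", props,
                        "apply-filter:workspace-42".getBytes(StandardCharsets.UTF_8));

                // Block until the reply with the matching correlation id arrives.
                BlockingQueue<String> response = new ArrayBlockingQueue<>(1);
                channel.basicConsume(replyQueue, true, (consumerTag, delivery) -> {
                    if (correlationId.equals(delivery.getProperties().getCorrelationId())) {
                        response.offer(new String(delivery.getBody(), StandardCharsets.UTF_8));
                    }
                }, consumerTag -> { });

                System.out.println("Worker replied: " + response.take());
            }
        }
    }

The second round of RPC mentioned above would just mean the first reply carries the name of a queue (or routing key) owned by the worker that holds the workspace state, and subsequent requests are published there instead of to the shared task queue.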
0MQ has this as a basic pattern, which will fan out work as needed. I've only played with it, but it is simpler to code and possibly simpler to maintain (it doesn't need a broker, though devices can provide one). This may not handle stickiness by default, but it should be possible to write your own routing layer to handle it.
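As a minimal illustration of such a routing layer, stickiness can start as simple as hashing the workspace id to a fixed slot (the list of worker endpoints is hypothetical and would really come from your auto scaling group or a service registry):

    import java.util.List;

    public class StickyRouter {

        // Hypothetical worker endpoints, e.g. internal DNS names of the backend instances.
        private final List<String> workerEndpoints;

        public StickyRouter(List<String> workerEndpoints) {
            this.workerEndpoints = workerEndpoints;
        }

        // Every step for the same workspace lands on the same worker process.
        public String routeFor(String workspaceId) {
            int slot = Math.floorMod(workspaceId.hashCode(), workerEndpoints.size());
            return workerEndpoints.get(slot);
        }
    }

Plain modulo hashing reshuffles workspaces whenever the worker list changes size, so a real version would move to consistent hashing or an explicit workspace-to-worker registry before relying on auto scaling.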
Don't discount HTTP for this either. When you want request/reply, strict throughput per backend node, and something that scales well, HTTP is well supported. With AWS you can easily put their ELB in front of an auto scaling group to provide the routing from frontend to backend. ELB supports sticky sessions as well.
I'm a big fan of RabbitMQ but if this is the whole scope then HTTP would work nicely and have fewer moving parts in AWS than the other solutions.

Can hadoop be used as a distributed queue server?

I'm thinking of learning Hadoop but not sure if it'll solve my problem. Basically I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the results (if successful) or sends it back to the queue for further processing. My problem is scalability: it is limited by the bandwidth on the network (EC2), which will never keep up with multiple CPUs crunching the data.
I thought maybe I could run my jobs in Java on a Hadoop cluster and have Hadoop distribute the work via a queue. Would this be a better approach? Am I correct in assuming Hadoop can act as a queue and try to run jobs as locally as possible to minimize bandwidth usage and maximize CPU usage? My program is very CPU bound, but most of my recent performance problems are related to passing work over the network (I want to keep the work as local as possible). The difference between the Hadoop tutorials I've seen and my problem is that in the tutorials all the work is known in advance, while my program is constantly generating new work for itself (until it's finally done). Would this work, and would it help me reduce the impact of passing messages over a network?
Sorry I'm new to hadoop and wanted to know if it could solve my problem.
Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you synchronization to build a queue if you need it.
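To make the ZooKeeper suggestion concrete, the usual queue recipe is persistent-sequential znodes under a queue path; a rough sketch (the /queue path is a placeholder, and watches, retries and the race between competing workers are left out):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkQueueSketch {

        private final ZooKeeper zk;

        public ZkQueueSketch(ZooKeeper zk) {
            this.zk = zk;
        }

        // Producers append work as persistent-sequential children of /queue,
        // so ZooKeeper assigns each task an increasing sequence number.
        public void enqueue(byte[] payload) throws Exception {
            zk.create("/queue/task-", payload,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Workers take the lowest-numbered child; deleting it claims the task.
        // A real worker would handle the NoNodeException raised when another
        // worker claims it first, and would use watches instead of polling.
        public byte[] dequeue() throws Exception {
            List<String> children = zk.getChildren("/queue", false);
            if (children.isEmpty()) {
                return null;
            }
            Collections.sort(children);
            String path = "/queue/" + children.get(0);
            byte[] data = zk.getData(path, false, null);
            zk.delete(path, -1);
            return data;
        }
    }

Note that this gives you queue semantics but not Hadoop's data locality; ZooKeeper only coordinates, it does not move the work to where the data lives.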
There are plenty of tools for the kind of problem it looks like you are trying to solve. I suggest taking a look at RabbitMQ. If you use Python, Celery is quite fantastic.
