Crossing of messages in worker replies during concurrent Spring Batch jobs with remote partitioning

This has been asked here before, but I don't think it was answered. The only answer talks about how the aggregator uses the correlation ID, but the real issue is how the job status is updated without checking the JobExecutionId in the replies.
I don't have enough reputation to comment on the existing question, so I am asking here again.
According to the javadoc on MessageChannelPartitionHandler, it is supposed to be step or job scoped. In a remote partitioning scenario we use RemotePartitioningManagerStepBuilder to build the manager step, which does not allow setting a PartitionHandler. Given that every job uses the same reply queue on RabbitMQ, messages get crossed when the worker node replies are received. There is no simple way to reproduce this, but I can see the behavior using the manual steps below:
1. Launch the first job.
2. Kill the manager node before the worker can reply.
3. Let the worker node finish handling all partitions and send a reply on RabbitMQ.
4. Start the manager node again and launch a new job.
5. Have some mechanism to fail the second job, i.e. fail explicitly in the reader/writer.
6. Check the status of the two jobs.
Expected result: Job-1 is marked COMPLETED and Job-2 is marked FAILED.
Actual result: Job-1 remains in STARTED and Job-2 is marked COMPLETED even though its worker steps are marked as FAILED.
Below is sample code that shows how the manager and worker steps are configured:
@Bean
public Step importDataStep(RemotePartitioningManagerStepBuilderFactory managerStepBuilderFactory) {
    return managerStepBuilderFactory.get("importDataStep")
            .<String, String>partitioner("worker", partitioner())
            .gridSize(2)
            .outputChannel(outgoingRequestsToWorkers)
            .inputChannel(incomingRepliesFromWorkers)
            .listener(stepExecutionListener)
            .build();
}
@Bean
public Step worker(RemotePartitioningWorkerStepBuilderFactory workerStepBuilderFactory) {
    return workerStepBuilderFactory.get("worker")
            .listener(stepExecutionListener)
            .inputChannel(incomingRequestsFromManager())
            .outputChannel(outgoingRepliesToManager())
            .<String, String>chunk(10)
            .reader(itemReader())
            .processor(itemProcessor())
            .writer(itemWriter())
            .build();
}
Alternatively, I can think of using polling instead of replies, where crossing of messages does not occur. But polling cannot be restarted if the manager node crashed while the worker nodes were processing. If I follow the same steps above using polling:
Actual result: Job-1 remains in STARTED and Job-2 is marked FAILED, as expected.
This issue does not occur with polling because each poller uses the exact jobExecutionId to poll and update the corresponding manager step/job.
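For comparison, here is a minimal sketch of the polling-based manager step I experimented with. Treat the exact builder options as an assumption for your spring-batch-integration version: to my understanding, leaving out inputChannel and supplying jobExplorer/pollInterval makes MessageChannelPartitionHandler poll the job repository by jobExecutionId instead of aggregating reply messages.
@Bean
public Step importDataStepWithPolling(RemotePartitioningManagerStepBuilderFactory managerStepBuilderFactory,
                                      JobExplorer jobExplorer) {
    return managerStepBuilderFactory.get("importDataStep")
            .<String, String>partitioner("worker", partitioner())
            .gridSize(2)
            .outputChannel(outgoingRequestsToWorkers)
            // no inputChannel: worker results are polled from the job repository,
            // keyed by the jobExecutionId of this execution, so replies cannot cross
            .jobExplorer(jobExplorer)
            .pollInterval(5000L)
            .build();
}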
What am I doing wrong? Is there a better way to handle this scenario?

Related

How to Stop the Spring Batch Job Execution Immediately or forcefully?

I have a Spring Batch job with 3 steps; the 3rd step runs a tasklet with several tasks, as below. Now when we try to stop the job using
jobOperator.stop(id);
it sends the STOP signal while STEP 3 is in progress, but it interrupts only once all the tasks in STEP 3 are completed. Let's say it has 10 tasks: although we sent the stop signal while STEP 3 was on Task 1, it does not stop there. It finishes all 10 tasks and then marks STEP 3's status as COMPLETED. Is there any way we can stop step 3 while processing the first task? I looked at the Spring Batch documentation and did not find much. Below is the sample code.
Job:
@Bean(JobConstants.CONTENT_UPGRADE_JOB)
public Job upgradeContentJob() {
    Tasklet tasklet = context.getBean("incremental_deploy_tasklet", Tasklet.class);
    SimpleJobBuilder jobBuilder = jobBuilderFactory.get(JobConstants.CONTENT_UPGRADE_JOB)
            .listener(upgradeJobResultListener())
            .start(initContent())
            .next(stepBuilderFactory.get("create_snapshot_tasklet").tasklet(createSnapshotTasklet()).build())
            .next(stepBuilderFactory.get("incremental_deploy_tasklet").tasklet(tasklet).build());
    return jobBuilder.build();
}
Tasks:
packCompositionMap.put(incremental_content_deploy, Arrays.asList(
        create_security_actions, slice_content, upgrade_appmodule,
        application_refresh,
        install_reset_roles_bar, restore_reset_roles_bar,
        populate_roles, add_roles,
        replay_security_actions,
        create_adw_connection,
        apply_system_extensions,
        replay_system_steps,
        assign_service_admin
));
Here the STOP signal for the id is sent when "incremental_deploy_tasklet" has just started and the "create_security_actions" task has been picked from the list, but the problem is that it does not stop: it completes all the tasks in the list, then marks the status for the "incremental_deploy_tasklet" step as STOPPED, and the overall status for the job is also marked as STOPPED.
What I am looking for is help on how to STOP and interrupt at the "create_security_actions" task itself. Any help or input is appreciated, thank you.
After reading multiple docs and trying things out, I found that we cannot terminate the thread immediately; the control has to come back to the framework. From the reference documentation:
The shutdown is not immediate, since there is no way to force an immediate shutdown, especially if the execution is currently in developer code that the framework has no control over, such as a business service. However, as soon as control is returned back to the framework, it will set the status of the current StepExecution to BatchStatus.STOPPED, save it, then do the same for the JobExecution before finishing.
Thank you.
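A minimal sketch of one way to hand control back to the framework between tasks: implement StoppableTasklet, whose stop() method is invoked by JobOperator.stop(long), and check a flag between the individual tasks. The class name and task list here are illustrative, not from the original code.
// Illustrative sketch: a tasklet that runs a list of sub-tasks and checks a stop
// flag between them. JobOperator.stop() calls stop() on tasklets implementing
// StoppableTasklet, so the step can end after the current sub-task instead of
// running the whole list.
public class IncrementalDeployTasklet implements StoppableTasklet {

    private volatile boolean stopped = false;

    private final List<Runnable> tasks; // e.g. create_security_actions, slice_content, ...

    public IncrementalDeployTasklet(List<Runnable> tasks) {
        this.tasks = tasks;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        for (Runnable task : tasks) {
            if (stopped) {
                // Return control to the framework; the step is then marked STOPPED.
                return RepeatStatus.FINISHED;
            }
            task.run();
        }
        return RepeatStatus.FINISHED;
    }

    @Override
    public void stop() {
        this.stopped = true;
    }
}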

How to send an exception to Sentry from Laravel Job only on final fail?

Configuration
I'm using Laravel 8 with sentry/sentry-laravel plugin.
There is a Job that works just fine 99% of the time. It retries N times in case of any problems due to:
public $backoff = 120;

public function retryUntil()
{
    return now()->addHours(6);
}
And it simply calls some service:
public function handle()
{
    // Service calls some external API
    $service->doSomeWork(...);
}
The doSomeWork method sometimes throws an exception due to network problems, like Curl error: Operation timed out after 15001 milliseconds with 0 bytes received. This is fine because of the automatic retries; in most cases the next retry will succeed.
Problem
Every curl error is sent to Sentry. As an administrator I must check every alert, because this job is pretty important and I can't afford to miss an actually failed job. For example:
1. There is some network problem that is not resolved for an hour.
2. The application queues a Job.
3. Every 2 minutes the application sends a similar message to Sentry.
4. After the network problem is resolved the job succeeds, so no attention is required.
But we are seeing dozens of errors that could theoretically be ignored. What if there is an actual problem in that pile and I miss it?
Question
How can I make it so that only the "final" job failure sends a message to Sentry? I mean after 6 hours of failed retries: only then would I like to receive one alert.
What I tried
There is one workaround that kind of "works". We can replace Exception with SomeCustomException and add it to the \App\Exceptions\Handler::$dontReport array. In that case no "intermediate" messages are sent to Sentry.
But when the job finally fails, Laravel sends the standard ... job has been attempted too many times or run too long message without the details of the actual error.

Scatter Gather with parallel flow (Timeout in aggregator)

I've been trying to add a timeout to the gatherer so that it does not wait for every flow to finish, but when I add the timeout it doesn't work: the aggregator still waits for each flow to finish.
@Bean
public IntegrationFlow queueFlow(LogicService service) {
    return f -> f.scatterGather(scatterer -> scatterer
                    .applySequence(true)
                    .recipientFlow(aFlow(service))
                    .recipientFlow(bFlow(service)),
            aggregatorSpec -> aggregatorSpec.groupTimeout(2000L));
}
E.g., of my flows one has a 2 second delay and the other one 4 seconds:
public IntegrationFlow bFlow(LogicService service) {
    return IntegrationFlows.from(MessageChannels.executor(Executors.newCachedThreadPool()))
            .handle(service::callFakeServiceTimeout2)
            .transform(MessageDomain.class, message -> {
                message.setMessage(message.getMessage().toUpperCase());
                return message;
            })
            .get();
}
I use Executors.newCachedThreadPool() to run the flows in parallel.
I'd like the messages collected so far to be released once the timeout is reached.
Another approach I've been testing was to use a default gatherer and set gatherTimeout on the scatterGather, but I don't know if I'm missing something.
Approach gatherTimeout
UPDATE
All the approaches given in the comments were tested and work normally. The only problem is that each option is evaluated against the message group, and the message group is only created once the first reply arrives. The ideal approach would be an option that takes effect at the moment the scatterer distributes the request message.
My temporary solution was to use an ad hoc release strategy implementing GroupConditionProvider, which reads a custom header that I set when sending the message through the gateway. The only concern is that the release strategy is only executed when a new message arrives or when a group timeout is set.
The groupTimeout on the aggregator is not enough to release the group. If you don't get the whole group on that timeout, then it is going to be discarded. See sendPartialResultOnExpiry option: https://docs.spring.io/spring-integration/reference/html/message-routing.html#agg-and-group-to
If send-partial-result-on-expiry is true, existing messages in the (partial) MessageGroup are released as a normal aggregator reply message to the output-channel. Otherwise, it is discarded.
The gatherTimeout is good to have if you expect no replies from the gatherer at all. So, this way you won't block the scatter-gather thread forever: https://docs.spring.io/spring-integration/reference/html/message-routing.html#scatter-gather-error-handling
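A minimal sketch combining both options, assuming the Java DSL exposes sendPartialResultOnExpiry on the gatherer spec and gatherTimeout on the scatter-gather spec (the bean name and timeout values are illustrative):
@Bean
public IntegrationFlow queueFlowWithPartialRelease(LogicService service) {
    return f -> f.scatterGather(scatterer -> scatterer
                    .applySequence(true)
                    .recipientFlow(aFlow(service))
                    .recipientFlow(bFlow(service)),
            gatherer -> gatherer
                    // release whatever replies have arrived when the group times out
                    .groupTimeout(2000L)
                    .sendPartialResultOnExpiry(true),
            scatterGather -> scatterGather
                    // safety net so the scatter-gather thread is not blocked forever
                    .gatherTimeout(10000L));
}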

DefaultMessageListenerContainer stops processing messages

I'm hoping this is a simple configuration issue but I can't seem to figure out what it might be.
Set-up
Spring Boot 2.2.2.RELEASE
cloud-starter
cloud-starter-aws
spring-jms
spring-cloud-dependencies Hoxton.SR1
amazon-sqs-java-messaging-lib 1.0.8
Problem
My application starts up fine and begins to process messages from Amazon SQS. After some amount of time I see the following warning
2020-02-01 04:16:21.482 LogLevel=WARN 1 --- [ecutor-thread14] o.s.j.l.DefaultMessageListenerContainer : Number of scheduled consumers has dropped below concurrentConsumers limit, probably due to tasks having been rejected. Check your thread pool configuration! Automatic recovery to be triggered by remaining consumers.
The above warning gets printed multiple times and eventually I see the following two INFO messages
2020-02-01 04:17:51.552 LogLevel=INFO 1 --- [ecutor-thread40] c.a.s.javamessaging.SQSMessageConsumer : Shutting down ConsumerPrefetch executor
2020-02-01 04:18:06.640 LogLevel=INFO 1 --- [ecutor-thread40] com.amazon.sqs.javamessaging.SQSSession : Shutting down SessionCallBackScheduler executor
The above two messages display several times and at some point no more messages are consumed from SQS. I don't see any other messages in my log indicating an issue, and I get no indication from my handlers that they are processing messages (I have 2~), while I can see the AWS SQS queue growing in both the number of messages and their age.
~: This exact code was working fine when I had a single handler, this problem started when I added the second one.
Configuration/Code
I realize the first WARNing is caused by the concurrency settings of the ThreadPoolTaskExecutor, but I cannot get a configuration that works properly. Here is my current configuration for the JMS stuff. I have tried various levels of max pool size with no real effect other than the warnings starting sooner or later based on the pool size.
public ThreadPoolTaskExecutor asyncAppConsumerTaskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setThreadGroupName("asyncConsumerTaskExecutor");
    taskExecutor.setThreadNamePrefix("asyncConsumerTaskExecutor-thread");
    taskExecutor.setCorePoolSize(10);
    // Allow the thread pool to grow up to 4 times the core size, evidently not
    // having the pool be larger than the max concurrency causes the JMS queue
    // to barf on itself with messages like
    // "Number of scheduled consumers has dropped below concurrentConsumers limit, probably due to tasks having been rejected. Check your thread pool configuration! Automatic recovery to be triggered by remaining consumers"
    taskExecutor.setMaxPoolSize(10 * 4);
    taskExecutor.setQueueCapacity(0); // do not queue up messages
    taskExecutor.setWaitForTasksToCompleteOnShutdown(true);
    taskExecutor.setAwaitTerminationSeconds(60);
    return taskExecutor;
}
Here is the JMS Container Factory we create
public DefaultJmsListenerContainerFactory jmsListenerContainerFactory(SQSConnectionFactory sqsConnectionFactory, ThreadPoolTaskExecutor asyncConsumerTaskExecutor) {
    DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
    factory.setConnectionFactory(sqsConnectionFactory);
    factory.setDestinationResolver(new DynamicDestinationResolver());
    // The JMS processor will start 'concurrency' number of tasks
    // and supposedly will increase this to the max of '10 * 3'
    factory.setConcurrency(10 + "-" + (10 * 3));
    factory.setTaskExecutor(asyncConsumerTaskExecutor);
    // Let the task process 100 messages, default appears to be 10
    factory.setMaxMessagesPerTask(100);
    // Wait up to 5 seconds for a timeout, this keeps the task around a bit longer
    factory.setReceiveTimeout(5000L);
    factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
    return factory;
}
I added the setMaxMessagesPerTask and setReceiveTimeout calls based on things found on the internet; the problem persists without them and at various settings (50, 2500L, 25, 1000L, etc.).
We create a default SQS connection factory
public SQSConnectionFactory sqsConnectionFactory(AmazonSQS amazonSQS) {
    return new SQSConnectionFactory(new ProviderConfiguration(), amazonSQS);
}
Finally the handlers look like this
@JmsListener(destination = "consumer-event-queue")
public void receiveEvents(String message) throws IOException {
    MyEventDTO myEventDTO = jsonObj.readValue(message, MyEventDTO.class);
    //messageTask.process(myEventDTO);
}

@JmsListener(destination = "myalert-sqs")
public void receiveAlerts(String message) throws IOException, InterruptedException {
    final MyAlertDTO myAlert = jsonObj.readValue(message, MyAlertDTO.class);
    myProcessor.addAlertToQueue(myAlert);
}
You can see that in the first function (receiveEvents) we just take the message from the queue and exit; we have not implemented the processing code for it.
The second function (receiveAlerts) gets the message, and the myProcessor.addAlertToQueue function creates a runnable object and submits it to a thread pool to be processed at some point in the future.
The problem (the warnings, the INFO messages and the failure to consume messages) only started when we added the receiveAlerts function; previously the other function was the only one present and we did not see this behavior.
More
This is part of a larger project and I am working on breaking this code out into a smaller test case to see if I can duplicate this issue. I will post a follow-up with the results.
In the Mean Time
I'm hoping this is just a config issue and someone more familiar with this can tell me what I'm doing wrong, or that someone can provide some thoughts and comments on how to correct this to work properly.
Thank you!
After fighting this one for a bit I think I finally resolved it.
The issue appears to be due to the DefaultJmsListenerContainerFactory: this factory creates a new DefaultMessageListenerContainer for EACH method with a '@JmsListener' annotation. The person who originally wrote the code thought it was only called once for the application and that the created container would be re-used. So the issue was two-fold:
1. The ThreadPoolTaskExecutor attached to the factory had 40 threads. When the application had one '@JmsListener' method this worked fine, but when we added a second method, each method got 10 threads (20 in total) for listening. That by itself is fine; however, since we stated that each listener could grow up to 30 consumers, we quickly ran out of threads in the pool. This caused the "Number of scheduled consumers has dropped below concurrentConsumers limit" error.
2. This is probably obvious given the above, but I want to call it out explicitly: in the listener factory we set the concurrency to "10-30"; however, all of the listeners have to share that pool. As such, the max concurrency has to be set up so that each listener's maximum is small enough that, even if every listener reaches it, the total does not exceed the number of threads in the pool (e.g. with two '@JmsListener' annotated methods and a pool of 40 threads, the max value can be no more than 20).
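For example, a minimal sketch of the adjusted setting under that constraint (values illustrative; two listeners sharing the 40-thread pool from the executor above):
// Two @JmsListener methods share this factory and its 40-thread executor,
// so cap each listener at 20 consumers: 2 listeners * 20 max = 40 threads.
factory.setTaskExecutor(asyncConsumerTaskExecutor); // maxPoolSize = 40
factory.setConcurrency("10-20");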
Hopefully this might help someone else with a similar issue in the future....

spring integration message released twice from aggregator

I have a Spring Integration flow that starts with a file inbound channel adapter, picks up files and passes them through the system as messages.
After a few components, the messages are aggregated at an "Aggregator", from where they are released based on release strategies or by a group timeout of 30 seconds.
The downstream processing has another bunch of components up to the final one.
The problem I am facing is this:
When I send 33 files, which create 33 "groups/buckets" based on correlation IDs and are aggregated at the "Aggregator", some of the files or messages seem to be "released" twice. I conclude that because I have a channel interceptor which shows a few messages passing through the "released" channel (right after the aggregator) a second time, after having completed the downstream processing successfully the first time. Additionally, this behavior causes my application to not find a file and throw an exception, which I can see. This leads me to conclude that the message bucket/group/corrID is somehow being "released" twice.
I have tried to debug this in many ways, but essentially I want to know how a corrID/bucket, after being released and having successfully gone through all downstream components in a single thread, can be "released" again.
My question is: how can I debug this? I want to know what is making this message/bucket re-appear at the aggregator.
My aggregator is as follows,
<int:aggregator id="bufferedFiles" input-channel="inQueueForStage"
    output-channel="released" expire-groups-upon-completion="true"
    send-partial-result-on-expiry="true" release-strategy="releaseHandler"
    release-strategy-method="canRelease"
    group-timeout-expression="size() > 0 ? T(com.att.datalake.ifr.loader.utils.MessageUtils).getAggregatorTimeout(one, #sourceSnapshot) : -1">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}"
        task-executor="executor" />
</int:aggregator>
Explanation of the aggregator: the size() > 0 applies to EACH correlation bucket. Each of the 33 files I am sending will spawn/generate/create a new bucket because of the file name, so the aggregator will have 33 buckets/groups/corrIds, and each bucket will contain only one file.
So the aggregator SpEL expression simply says: if there are no release strategies, release the bucket/group after 30 seconds, provided the group indeed has at least some files.
My Channel inbound adapter is as follows:
<int-file:inbound-channel-adapter id="files"
    channel="dispatchFiles" directory="${source.dir}" scanner="directoryScanner">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}" />
</int-file:inbound-channel-adapter>
Logs
Here is the log of a message completing the flow the first time. The completion time suggests it reached the last component, a "completionHandler" SA.
Explanation of the log: "cor" is the bucket/corrID that is being released twice. The reason I get the final exception is that during the first pass the file is removed from its original location and processed, so the second time around, when this erroneous release happens, there is nothing there to process.
From the pictures it can be seen that the first batch/corrID/bucket is processed and finished around 11:09, and the second one is started around 11:10.
An important point I noticed is that this behavior only happens when I have a global channel interceptor in which I am doing somewhat long processing. When this interceptor is commented out, the errors go away.
Question:
Is it possible for the aggregator to double-release a batch/corrID under any circumstance? How can I make the aggregator emit any logs?
Thanks
Edit 10:15pm
My channel following the aggregator has an interceptor as follows,
public Message<?> preSend(Message<?> message, MessageChannel channel) {
    LOGGER.info("******** Releasing from aggregator(interceptor), corrID:{} at time:{} ********",
            MessageUtils.getCorrelationId(message), new Date());
    finalReporter.callback(channel.toString(), message);
    return message;
}
From the Aggregator down to the final completionHandler SA, I have single-threaded processing:
Aggregator -> releasedChannel -> some SA1 -> some channel -> ..... -> completionChannel -> completeSA
When I run with 33 partitions, let's follow corrId = "alh". The first time it is released, it looks like the following:
What it shows is that thread-5 released it and should process all the downstream components. But it leaves it mid-way, starts doing other things, and the message is picked up again by a different thread a little later, as follows:
That seems/seemed to be the problem.
Solution Update:
I did the following 3 things to work around it, for the moment:
1. For some reason, my interceptors were doing return super.preSend(message, channel) instead of simply return message. I changed it to the latter.
2. I had global channel interceptors; I removed the global ones and kept individual ones.
3. If the channel interceptors had any issues before returning, would that cause a new release?
Although I still see the scenario depicted in the pictures above, I am not getting double processing attempts, and as such the errors are avoided. I am still trying to make sense of this.
I understand it's too specific and difficult to explain; still, thanks for the time and comments...
However, yes, I think @GaryRussell is right: since you use expire-groups-upon-completion="true", some partial groups may be released by the group-timeout-expression, and new messages with the same correlationId will form a new group, which is then released by the next group timeout. Your size() > 0 isn't good either: it means a partial group is going to be released after that group timeout. Maybe size() > 1? The group can't be size() == 0 anyway, because it is created on the first message, so if the group exists it contains at least one message. Yes, a group can be empty, but in that case the aggregator should be marked with expire-groups-upon-completion="false"; the group is then marked as completed and doesn't accept new messages.
After struggling with debugging and various blind scenarios, I believe I at least have a workaround and a possible root cause. I will try to outline all the things that I modified.
Root Cause:
My interceptors were calling a common class with a common callback method. This method, based on the name of the channel the request was coming from, would decide the appropriate action to take. The actions were essentially collecting data, incrementing counters and persisting some information to the database.
It seems that some of them were having errors and, consequently, the thread was dying and the message was re-released. I am not entirely sure about this, so please correct me if that's not the case.
But after I fixed those errors, the re-release issue seems to have subsided or vanished altogether.
The reason it was hard to diagnose is that I could not see the errors thrown during the callback invocations; maybe I was catching them or maybe they were lost.
I also found that the issue only appeared with channel interceptors AFTER the aggregator. Interceptors before the aggregator did not present any issues, maybe because they were simpler...
To debug,
I removed the interceptors and made the callback directly from various components (SAs), removed the global interceptors, and tried adding individual interceptors for specific channels.
Thanks for all the help.
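As a hedged illustration of the root cause described above: an interceptor whose callback can fail should shield the thread that released the group, for example by catching and logging. The class name is illustrative; finalReporter, LOGGER and MessageUtils are the names from the interceptor snippet earlier in the question (field declarations omitted), and the defensive wrapping is a suggestion, not part of the original fix.
// Illustrative defensive interceptor: errors in the reporting callback are caught
// so they cannot propagate back into the thread that released the message group.
public class ReportingInterceptor implements ChannelInterceptor {

    @Override
    public Message<?> preSend(Message<?> message, MessageChannel channel) {
        try {
            finalReporter.callback(channel.toString(), message);
        }
        catch (Exception e) {
            LOGGER.error("Reporting callback failed for corrID:{}",
                    MessageUtils.getCorrelationId(message), e);
        }
        // Always return the original message so the downstream flow is unchanged.
        return message;
    }
}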
