I've been trying to Google and search Stack Overflow for the answer, but I have been unable to find one.
Using NiFi, is it possible to stop a process upon previous job failure?
We have user data we need to process but the data is sequentially constructed so that if a job fails, we need to stop further jobs from running.
I understand we can create scripts to fail a process upon a previous process's failure, but what if I need the entire group to halt upon failure? Is this possible? We don't want each job in the queue to follow the failure path; we want everything to halt until we can look at the data and analyze the failure.
TL;DR: can we STOP a process upon a failure, not just funnel all remaining jobs into the failure flow? We want data in the queues to wait until we fix the problem, hence stopping the process, not just failing again and again.
Thanks for any feedback, cheers!
Edit: typos
You can configure backpressure on the queues to stop upstream processors. If you set the backpressure object threshold to 1 on a failure queue, it will effectively stop the processor until you have had a chance to address the failure.
The screenshot shows the failure relationship routing back to the same processor, but this is not required. What is important is that the next processor does not remove the FlowFile from the queue, so the backpressure is maintained until you take action.
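For reference, the same settings are visible in an exported flow definition; a failure connection configured this way would look roughly like the following (field names as in NiFi's versioned flow JSON, the connection itself being hypothetical):

    {
      "name": "failure",
      "selectedRelationships": ["failure"],
      "backPressureObjectThreshold": 1,
      "backPressureDataSizeThreshold": "1 GB"
    }

In the UI, these appear in the connection configuration dialog as the Back Pressure Object Threshold and Size Threshold settings.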
I am using the Inter-Thread Communication plugin to share data between two thread groups.
TG-1: generates an ID -> stores it in the queue named Q1
TG-2: picks an ID from the queue -> does the processing
After some time, when the run duration of TG-1 is completed, it stops generating and storing IDs into Q1. TG-2 processes all the data in the queue and keeps waiting for new data in Q1; however, Q1 will not get any more data. My expectation was that when the run duration of TG-2 completed, TG-2 would finish its job and exit. Why does TG-2 keep waiting for data in Q1? This causes exhaustion of the heap space, and the test never stops. This is causing a serious issue.
To prevent this, I tried adding kg.apc.jmeter.functions.FifoTimeout=120 to the user.properties file, as suggested by Dmitri T in one of my previous questions about the same thing. However, this property is not taking effect. Has anybody else experienced the same thing with this plugin? What is the alternative?
We are not telepathic enough to guess your setup, which exact components of the Inter-Thread Communication plugin you're using, or how they're configured.
If you're using the Functions: the timeout works fine for the __fifoPop() function; just make sure to restart JMeter after amending the property. The __fifoGet() function will simply return an empty value if the queue is empty.
If you're using the jp@gc - Inter-Thread Communication PreProcessor: you can specify the timeout directly in the GUI.
It is also always possible to stop the test via the Flow Control Action sampler.
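For completeness, a minimal setup combining the property and the functions (queue name Q1 and variable name id as in the question; treat the exact argument order as an assumption to verify against the plugin documentation):

    # user.properties - restart JMeter afterwards for it to take effect
    kg.apc.jmeter.functions.FifoTimeout=120

Then, in the consuming thread group, a blocking read that honours the timeout:

    ${__fifoPop(Q1,id)}

whereas a non-blocking read returns immediately, with an empty value if Q1 is empty:

    ${__fifoGet(Q1,id)}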
I've been exploring how Spring Batch works in certain failure cases when remote partitioning is used.
Let's say I have 3 worker nodes and 1 manager node. The manager node creates 30 partitions that the workers can pick up. The messaging layer is Kafka.
The workers are up, waiting for work to arrive on the specific topic. The manager node creates the partitions, puts them into the DB and sends the messages on the Kafka topic which has 3 partitions.
All nodes have started the processing, but suddenly one node crashes. The crashed node will have its step execution states stuck at STARTED/STARTING for the partitions it initially picked up.
Another node will come to the rescue, since the Kafka partitions will get revoked and reassigned, so one of the two remaining nodes will read the partition the crashed node did.
In this case, nothing will happen, of course, because the original Kafka offsets were committed by the crashed node even though the processing hadn't finished. So let's say that when partitions get reassigned, I seek the consumer back to the beginning of the topic, for the partitions it manages.
Awesome, this way the consumer will start consuming messages from the partition of the crashed node.
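For concreteness, that rebalance hook with the plain Kafka consumer API could look like the sketch below (the generic types are illustrative, not taken from the actual setup):

    import java.util.Collection;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    // On reassignment, seek the newly assigned partitions back to the beginning
    // so that messages picked up by a crashed worker get consumed again.
    public class SeekToBeginningOnRebalance implements ConsumerRebalanceListener {

        private final Consumer<String, String> consumer;

        public SeekToBeginningOnRebalance(Consumer<String, String> consumer) {
            this.consumer = consumer;
        }

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // nothing to do for revoked partitions
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            consumer.seekToBeginning(partitions);
        }
    }

The listener is passed to consumer.subscribe(...) alongside the topic.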
And here's the catch. Even though some of the step executions that the crashed node processed ended up in the COMPLETED state, the node that took over will reprocess those step executions once more, even though they were finished by the crashed node.
This seems strange to me.
Maybe I'm trying to solve this the wrong way, I'm not sure, but I'd appreciate any suggestions on how to make the workers fault-tolerant against crashes.
Thanks!
If a StepExecution is marked as COMPLETED in the job repository, it will not be reprocessed; no data will be run again. A new StepExecution may be created (I don't have the code in front of me right now), but when Spring Batch evaluates what to do based on the previous run, it won't process the data again. That's a key feature of how Spring Batch's partitioning works. You could send the workers 100 messages per partition, and each partition would still only be processed once, thanks to the synchronization through the job repository. If you are seeing different behavior, we would need more information (details from your job repository and your configuration specifics).
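If you want to confirm what the job repository actually recorded, a minimal sketch using JobExplorer could look like this (the execution id is a placeholder):

    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.explore.JobExplorer;

    public class PartitionStateReport {

        private final JobExplorer jobExplorer;

        public PartitionStateReport(JobExplorer jobExplorer) {
            this.jobExplorer = jobExplorer;
        }

        // Print each worker step's status: COMPLETED partitions should be skipped
        // on a retry, while STARTED/STARTING ones belong to the crashed node.
        public void report(long jobExecutionId) {
            JobExecution execution = jobExplorer.getJobExecution(jobExecutionId);
            for (StepExecution step : execution.getStepExecutions()) {
                System.out.printf("%s -> %s%n", step.getStepName(), step.getStatus());
            }
        }
    }

Comparing these statuses with your worker logs should show whether the duplicate processing happened before or after the COMPLETED status was persisted.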
I have multiple batches that are using different 3rd-party APIs to get and store/update data. The connections are made via Laravel's HTTP client. All batches have about 6k jobs each. Because all jobs are important, I need to log the failed ones and notify the user.
Sometimes the response returns an error for all jobs; sometimes it's just a connection error, or an error because the server can't process those requests.
The batch automatically cancels on the first failure. But is there a way to cancel the batch on multiple failures (on the nth failure), not just the first?
First turn off the normal batch error handling, then implement your own:
Initialize a counter at zero.
Whenever an error occurs, increase that counter.
Whenever that counter reaches or exceeds the threshold (say 5), fail the batch.
The concrete implementation depends on the batch system you are working with; a sketch follows below.
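As a sketch of that counter in Java (the names and the threshold of 5 are illustrative, not tied to any particular framework):

    import java.util.concurrent.atomic.AtomicInteger;

    public class NthFailureCanceller {

        private static final int MAX_FAILURES = 5;
        private final AtomicInteger failures = new AtomicInteger();

        // Call this from the per-job failure hook; the batch is cancelled only
        // when the nth failure is reached, not on the first one.
        public void onJobFailed(Runnable cancelBatch) {
            if (failures.incrementAndGet() == MAX_FAILURES) {
                cancelBatch.run();
            }
        }
    }

Note that queue jobs usually run in separate worker processes, so in practice the counter has to live in shared storage (cache or database) rather than in process memory; in Laravel, for instance, that would mean allowing failures on the batch and checking the batch's failed-job count inside each job before cancelling.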
I have the following scenario:
1. Change a Flag = Start (in database)
2. Do some processing
3. Update the Flag back to Finished (in database)
Suppose the system crashes during step 2. Ideally I would want the Flag to be set back to Finished, but because of the system crash it isn't, and that task falls into a deadlock.
What are the standard solutions/approaches/algorithms used to address such a scenario?
Edit: how does the deadlock occur?
The task will only be picked up if the Flag = Finished. Flag = Start means it is in progress, in the middle of something. So when there is a crash, the task is not complete, but the Flag is also not set back to Finished the next time the system runs. So the task is never going to be picked up again.
I don't see any simple solution here.
If your tasks' execution time is predictable enough, you can store a timestamp of the task execution start in your DB and return the task to its "empty" state (not started yet) on timeout; see the sketch after this answer.
Or you can store a process ID in your DB and implement a supervisor process that launches your "executor" processes and checks their exit codes. If a process crashed, the supervisor would "reinitialise" all tasks marked with the crashed process's ID.
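A sketch of the timestamp variant, assuming a tasks table with flag and started_at columns (table and column names are made up):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Instant;

    public class TaskLease {

        // Atomically claim a task: either it is Finished (ready for a new run),
        // or it is marked Start but its lease has expired, meaning the previous
        // runner presumably crashed. A single UPDATE avoids check-then-set races.
        static boolean claim(Connection db, long taskId, long timeoutSeconds)
                throws SQLException {
            String sql = "UPDATE tasks SET flag = 'Start', started_at = ? "
                       + "WHERE id = ? AND (flag = 'Finished' "
                       + "OR (flag = 'Start' AND started_at < ?))";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                Instant now = Instant.now();
                ps.setTimestamp(1, Timestamp.from(now));
                ps.setLong(2, taskId);
                ps.setTimestamp(3, Timestamp.from(now.minusSeconds(timeoutSeconds)));
                return ps.executeUpdate() == 1; // 0 rows: a live lease exists
            }
        }
    }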
I'd like to find the best way to handle exceptions (failure of any step) from an Oracle Scheduler job chain (11gR2).
Say I have a chain that contains 20 steps. If at any point the chain exits with FAILURE, I'd like to perform a set of actions. These actions are specific to that chain, not to the individual steps (each step's procedure may be used outside the Scheduler or in other chains).
Thanks to 11gR2, I can now set up an email notification on FAILURE of the chain, but this is only one of several actions I need to perform, so it's only a partial solution for me.
The only thing I can think of is to have another polling job check the status of my chain every x minutes and launch the failure actions when it sees that the latest run of the chain exited with FAILURE status. But this is a hack at best, IMO.
What is the best way to handle exceptions for a given job chain?
thanks
The most flexible way to handle job exceptions in general is to use a job exception monitoring procedure and to define the jobs so that they generate events upon job status changes. The monitoring procedure should watch the Scheduler event queue in a loop and react to the events in whatever way you define; a sketch follows below.
Doing so takes away the burden of having to create failure steps for just about every step in a chain. This is a very powerful mechanism.
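As a sketch of that setup (job, agent, and procedure names are placeholders; verify the exact syntax against the 11gR2 documentation): have the job that runs the chain raise events on failure, subscribe an agent to the Scheduler event queue, and create one event-based job that performs all of your chain-specific failure actions:

    BEGIN
      -- have the job that runs the chain raise an event when it fails
      DBMS_SCHEDULER.SET_ATTRIBUTE(
        name      => 'MY_CHAIN_JOB',
        attribute => 'raise_events',
        value     => DBMS_SCHEDULER.JOB_FAILED);

      -- one-time: subscribe an agent to the Scheduler event queue
      DBMS_SCHEDULER.ADD_EVENT_QUEUE_SUBSCRIBER('MY_AGENT');

      -- event-based job that fires only for this chain's failures
      DBMS_SCHEDULER.CREATE_JOB(
        job_name        => 'MY_CHAIN_FAILURE_HANDLER',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'MY_FAILURE_ACTIONS',
        event_condition => 'tab.user_data.object_name = ''MY_CHAIN_JOB''' ||
                           ' and tab.user_data.event_type = ''JOB_FAILED''',
        queue_spec      => 'SYS.SCHEDULER$_EVENT_QUEUE, MY_AGENT',
        enabled         => TRUE);
    END;
    /

Because the handler reacts to the chain job's failure event, nothing has to poll, and the handler procedure can bundle the email plus every other action in one place.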
For lack of time: the book contains a complete scenario of event-based scheduling; I will dig one up later.