Batch processing flowfiles in Apache NiFi - apache-nifi

I have written a custom NiFi processor which tries to batch process input flow files.
However, it is not behaving as expected. Here is what is happening:
I copy some files onto the server. FetchFromServerProcessor fetches those files from the server and puts them in queue1. MyCustomProcessor reads files in batches from queue1. I have a batchSize property defined on MyCustomProcessor, and inside its onTrigger() method I get all flow files from queue1 in the current batch by doing the following:
session.get(context.getProperty(batchSize).asInteger())
The first line of onTrigger() creates a timestamp and adds it to all flow files, so all files in the batch should have the same timestamp. However, that is not happening: usually the first flow file gets one timestamp and the rest of the flow files get another.
It seems that when FetchFromServerProcessor fetches the first file from the server and puts it in queue1, MyCustomProcessor gets triggered and fetches everything from the queue. Since only a single file has arrived at that point, it is picked up as the only file in that batch. By the time MyCustomProcessor has processed it, FetchFromServerProcessor has fetched the remaining files from the server and put them in queue1, so after processing the first file, MyCustomProcessor takes everything now in queue1 and forms a second batch, whereas I want all files picked up in a single batch.
How can I avoid two batches being formed? I see that people discuss wait-notify in this context: 1, 2. But I am not able to make quick sense of those posts. Can someone give me minimal steps to achieve this using the Wait/Notify processors, or point me to a minimal tutorial with a step-by-step procedure for using them? Also, is the wait-notify pattern the standard approach to the batching problem I explained, or is there another standard approach to get this done?

It sounds as if this batch size is the required count of incoming flowfiles to CustomProcessor, so why not write your CustomProcessor#onTrigger() as follows:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    final ComponentLog logger = getLogger();

    // Try to get n flowfiles from incoming queue
    final Integer desiredFlowfileCount = context.getProperty(batchSize).asInteger();
    final int queuedFlowfileCount = session.getQueueSize().getObjectCount();
    if (queuedFlowfileCount < desiredFlowfileCount) {
        // There are not yet n flowfiles queued up, so don't try to run again immediately
        if (logger.isDebugEnabled()) {
            logger.debug("Only {} flowfiles queued; waiting for {}", new Object[]{queuedFlowfileCount, desiredFlowfileCount});
        }
        context.yield();
        return;
    }

    // If we're here, we do have at least n queued flowfiles
    List<FlowFile> flowfiles = session.get(desiredFlowfileCount);

    try {
        // TODO: Perform work on all flowfiles
        flowfiles = flowfiles.stream().map(f -> session.putAttribute(f, "timestamp", "my static timestamp value")).collect(Collectors.toList());
        session.transfer(flowfiles, REL_SUCCESS);

        // If extending AbstractProcessor, this is handled for you and you don't have to explicitly commit
        session.commit();
    } catch (Exception e) {
        logger.error("Helpful error message");
        if (logger.isDebugEnabled()) {
            logger.error("Further stacktrace: ", e);
        }
        // Penalize the flowfiles if appropriate (also done for you if extending AbstractProcessor and an exception is thrown from this method)
        session.rollback(true);
        // --- OR ---
        // Transfer to failure if they can't be retried
        session.transfer(flowfiles, REL_FAILURE);
    }
}
The Java 8 stream syntax can be replaced by this if it's unfamiliar:
for (int i = 0; i < flowfiles.size(); i++) {
    // Write the same timestamp value onto all flowfiles
    FlowFile f = flowfiles.get(i);
    flowfiles.set(i, session.putAttribute(f, "timestamp", "my timestamp value"));
}
The semantics between penalization (telling the processor to delay performing work on a specific flowfile) and yielding (telling the processor to wait some period of time to try performing any work again) are important.
You probably also want the @TriggerSerially annotation on your custom processor to ensure you do not have multiple threads running such that a race condition could arise.
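For completeness, here is a minimal sketch of how the batchSize property and the @TriggerSerially annotation could sit on the custom processor class; the property name, default value and validator below are illustrative assumptions, not taken from your code:
import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.util.StandardValidators;

@TriggerSerially // only one thread will ever be inside onTrigger() at a time
public class MyCustomProcessor extends AbstractProcessor {

    // Hypothetical property descriptor backing the batchSize property used above
    static final PropertyDescriptor BATCH_SIZE = new PropertyDescriptor.Builder()
            .name("Batch Size")
            .description("Number of flowfiles to pull from the incoming queue per batch")
            .required(true)
            .defaultValue("100")
            .addValidator(StandardValidators.POSITIVE_INTEGER_VALIDATOR)
            .build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // batching logic as shown in the answer above
    }
}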

Related

Why does the read method execute multiple times in Spring Batch?

I am currently using Spring Batch. I created a Reader, a Writer and a Processor. The Reader is a basic custom ListItemReader.
public class CustomListItemReader<T> implements ItemReader<T> {

    private List<T> list;

    public List<T> getList() {
        return list;
    }

    public void setList(List<T> list) {
        log.debug("Set list of size {}", list.size());
        if (AopUtils.isAopProxy(list)) {
            this.list = list;
        } else {
            this.list = new ArrayList<T>(list);
        }
    }

    @Override
    public synchronized T read() {
        log.info("Inside custom list item reader");
        if (list != null && !list.isEmpty()) {
            log.info("Inside read not empty");
            T remove = list.remove(0);
            while (remove == null && !list.isEmpty()) {
                remove = list.remove(0);
            }
            return remove;
        }
        return null;
    }
}
I tried testing Spring Batch with and without a taskExecutor. Without the taskExecutor, the
Inside custom list item reader
log gets printed twice. I get that: it is printed once for the actual read and once to check whether any more input exists. When the reader returns null, the job completes and stops. That's fine, but when I do the same with a taskExecutor configured as shown below
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setMaxPoolSize(1);
    taskExecutor.setCorePoolSize(1);
    taskExecutor.setQueueCapacity(1);
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}
and I even set the throttle-limit to 1. I assumed that the above taskExecutor mimics the single-thread scenario, and since there is only one active thread and throttle-limit = 1, the log would get printed twice, the same as in the previous configuration. But the message gets logged three times.
Why is there an extra log printed? How does the task count get increased by 1?
Also, just for the sake of experimenting, I kept the throttle-limit at 20 and the corePoolSize, maxPoolSize and queueCapacity at 1. The job doesn't end at all,
and I get an exception:
java.util.concurrent.RejectedExecutionException: Task com.esewa.settlementswitch.transaction.cooperative.BatchConfig$ClientSettlementTaskDecorator$$Lambda$1111/698696362@4d3e6424 rejected from org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor$1@60e29dbd[Running, pool size = 1, active threads = 1, queued tasks = 1, completed tasks = 0]
I know that the job was rejected because the pool size is 1 and the queue is also full, so no new tasks can be submitted. But the question is: why did so many tasks start?
The difference between the cases without and with TaskExecutor is that different RepeatOperations are used to control the execution of chunks.
In the sequential case without a user-defined TaskExecutor, one chunk will be executed in exactly the way you described: the read method of the reader is invoked once for the single item in its list, and a second time for the return value null, which signals that no more items are available. The RepeatOperations used in this case is the RepeatTemplate, which executes the chunks sequentially and will not execute further chunks once a chunk has read null from its reader.
In the multi-threaded case with a TaskExecutor, the TaskExecutorRepeatTemplate will be used to execute the chunks instead. It will submit chunk executions to the TaskExecutor until either the throttle limit is reached or a result has been placed in its result queue.
With a single thread and a throttle limit of 1, the following happens: the TaskExecutorRepeatTemplate submits one chunk execution to its task executor, and the chunk executes on the single thread of the executor as described for the sequential case. Meanwhile, the TaskExecutorRepeatTemplate continues to submit tasks. It blocks while submitting the second chunk execution because of the throttle limit, and it only unblocks when the first chunk execution has finished. But between unblocking and actually submitting a new chunk execution, no check is performed as to whether the additional chunk execution is still required. In that second execution, the read method is called only once, because it now returns null the first time.
When you increase the throttle limit to 2 you should see 4 logs being printed, because the TaskExecutorRepeatTemplate will only be blocked when submitting the third chunk execution. The number of logs is actually not guaranteed, because it depends on the order of events that happen in different threads, but this effect should be well reproducible.
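If you want to reproduce this, a minimal sketch of the step wiring could look like the following (bean names and chunk size are assumptions; the reader is the CustomListItemReader from the question). With throttleLimit(1) you should see the three logs described above, and with throttleLimit(2) typically four:
@Bean
public Step step(StepBuilderFactory stepBuilderFactory,
                 CustomListItemReader<String> reader,
                 ItemWriter<String> writer) {
    return stepBuilderFactory.get("step")
            .<String, String>chunk(1)
            .reader(reader)
            .writer(writer)
            .taskExecutor(taskExecutor())   // the ThreadPoolTaskExecutor from the question
            .throttleLimit(1)               // try 2 here to observe the extra chunk executions
            .build();
}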

How to design a Spring Batch job which fetches records from DB in batches and runs multiple processors and writers in parallel

Scenario: Read records from DB and create 4 different output files from it.
Tech Stack:
Spring Boot 2.x
Spring Batch 4.2.x
ArangoDB 3.6.x
Current Approach: SpringBatch job which has the below steps in sequence:
jobBuilderFactory.get("alljobs")
.start(step("readAllData")) //reads all records from db, stores it in Obj1 (R1)
.next(step("processData1")) //(P1)
.next(step("writer1")) // writes it to file1(W1)
.next(step("reader2")) // reads the same obj1(R2)
.next(step("processor2")) // processes it (P2)
.next(step("writer2")) // writes it to file1(W2)
.next(step("reader3")) // reads the same obj1 (R3)
.next(step("processor3")) // processes it (P3)
.next(step("writer3")) // writes it to file1(W3)
.next(step("reader4")) // reads the same obj1(R4)
.next(step("processor4")) // processes it (P4)
.next(step("writer4")) // writes it to file1 (W4)
.build()
Problem: Since the volume of data coming from the DB is huge (> 200,000 records), we are now fetching the records via a cursor in batches of 10,000 records.
Target state of the job: a reader which fetches the records from the DB via a cursor in batches of 1000 records:
For each batch of 1000 records, I have to run the processor and writer.
Also, since for the other 3 processors and writers the data set will be the same (Obj1, fetched from the cursor), I want to trigger them in parallel.
Reader1() {
while(cursor.hasNext()) {
Obj1 = cursor.next();
a) P1(Obj1); | c) R2(Obj1); | c) R3(Obj1); | c) R4(Obj1); ||
b) W1(Obj1); | d) P2(Obj1); | d) P3(Obj1); | d) P4(Obj1); || All these running in parallel.
| e) W2(Obj1); | e) W3(Obj1); | e) W4(Obj1); ||
}
}
Below are the approaches that came to my mind:
Invoke the job inside the cursor itself and execute all the steps P1....W4 inside the cursor, iteration by iteration.
Invoke a job whose first step is Reader1, and then inside the cursor invoke another sub-job which runs all of P1....W4 in parallel, since we cannot leave the cursor.
Kindly suggest the best way to implement this.
Thanks in Advance.
Update:
I was trying to run the steps (P1....W4) inside my Reader1 step in a loop, but I am stuck with the implementation, since everything here is written as a Step and I am not sure how to call multiple steps inside the R1 step in a loop. I tried using a Decider, putting P1...W4 in a Flow (flow):
flowbuilder.start(step("R1"))
.next(decider())
.on(COMPLETED).end()
.from(decider())
.on(CONTINUE)
.flow(flow)
job.start(flow)
.next(flow).on("CONTINUE").to(endJob()).on("FINISHED").end()
.end()
.build()
But I am not able to go back to the next cursor iteration, since the cursor iteration happens in the R1 step only.
I also tried to put all the steps R1...W4 (including Reader1) in the same flow, but the flow ended up throwing a cyclic-flow error.
Kindly suggest a better way to implement this. How can I call all the other steps in parallel inside the cursor iterating in the R1 step?
I believe using 4 parallel steps is a good option for you. Even if you have 4 threads reading the same data, you should benefit from parallel steps during the processing/writing phases. This should definitely perform better than 4 steps in sequence. BTW, 200k records is not that much (of course it depends on the record size and how it is mapped, but I think this should be ok; reading data is never the bottleneck).
It's always about trade-offs: here I'm trading a bit of read duplication for better overall throughput thanks to parallel steps. I would not kill myself making sure items are read only once and complicate things.
A good analogy for such a trade-off in the database world is accepting some data duplication in favour of faster queries (think of NoSQL design, where it is sometimes recommended to duplicate some data to avoid expensive joins).
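A minimal sketch of what those 4 parallel steps could look like with a split flow (bean and step names are placeholders; each stepN would be a chunk-oriented step with its own reader/processor/writer):
// uses org.springframework.batch.core.job.builder.FlowBuilder,
// org.springframework.batch.core.job.flow.Flow, org.springframework.batch.core.job.flow.support.SimpleFlow
// and org.springframework.core.task.SimpleAsyncTaskExecutor
@Bean
public Job parallelStepsJob(JobBuilderFactory jobs, Step step1, Step step2, Step step3, Step step4) {
    Flow f1 = new FlowBuilder<SimpleFlow>("flow1").start(step1).build();
    Flow f2 = new FlowBuilder<SimpleFlow>("flow2").start(step2).build();
    Flow f3 = new FlowBuilder<SimpleFlow>("flow3").start(step3).build();
    Flow f4 = new FlowBuilder<SimpleFlow>("flow4").start(step4).build();

    // Run the four flows on separate threads; each one reads the same source data
    Flow splitFlow = new FlowBuilder<SimpleFlow>("splitFlow")
            .split(new SimpleAsyncTaskExecutor())
            .add(f1, f2, f3, f4)
            .build();

    return jobs.get("parallelStepsJob")
            .start(splitFlow)
            .end()
            .build();
}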
This is how I finally designed the solution:
So, I re-framed the whole flow from a tasklet-based approach to an orchestrated chunk-based approach.
The job has a step called fetchProcessAndWriteData:
jobBuilderFactory.get("allChunkJob")
.start(step("fetchProcessAndWriteData"))
.next(step("updatePostJobRunDetails"))
.build()
fetchProcessAndWriteData: has a reader, a masterProcessor and a masterWriter, with a chunk size of 10,000.
steps
    .get("fetchProcessAndWriteData")
    .chunk(BATCHSIZE)
    .reader(chunkReader)
    .processor(masterProcessor)
    .writer(masterWriter)
    .listener(listener())
    .build()
chunkReader - reads data in chunks from the database cursor and passes it on to the masterProcessor.
masterProcessor - accepts records one by one and passes each record to all the other processors (P1, P2, P3, P4),
storing the processed data in a compositeResultBean.
CompositeResultBean consists of data holders for all 4 types of records:
List<Record> recordType1;
List<Record> recordType2;
List<Record> recordType3;
List<Record> recordType4;
This bean is then returned from the process method of the masterProcessor.
public Object process(Object item) {
    // ...
    bean.setRecordType1(P1.process(item));
    bean.setRecordType2(P2.process(item));
    bean.setRecordType3(P3.process(item));
    bean.setRecordType4(P4.process(item));
    return bean;
}
masterWriter - this step accepts a list of records, i.e. a list of compositeResultBean here. It iterates over the list of beans and calls the respective
writers' (W1, W2, W3, W4) write() methods with the data held in each compositeResultBean attribute.
public void write(List<? extends CompositeResultBean> list) {
    list.forEach(record -> {
        W1.write(isInitialBatch, record.getRecordType1());
        W2.write(isInitialBatch, record.getRecordType2());
        W3.write(isInitialBatch, record.getRecordType3());
        W4.write(isInitialBatch, record.getRecordType4());
    });
}
All of these steps are carried out in batches of 10k records, and the data is written to the files.
Another challenge I faced while writing the files was that I had to replace the already existing file the very first time records are written, but append for all later batches to the same file.
I solved this by implementing ChunkListener in the masterWriter: I pulled in the batch number (commit count) and set a static flag isInitialBatch, defaulting to TRUE.
This variable is set inside
beforeChunk()
to TRUE if chunkContext.getStepContext().getStepExecution().getCommitCount() == 0, and to FALSE otherwise.
The same boolean is passed into the FileWriter, which opens the file in append mode (TRUE or FALSE):
W1.write(isInitialBatch, record.getRecordType1());
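For reference, a minimal sketch of that ChunkListener part could look like this (class and flag names here are illustrative, not the exact original code); it is registered on the step via the .listener(...) call shown earlier:
import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;

public class AppendModeChunkListener implements ChunkListener {

    // Read by the file writers: true only before the step's first commit
    private static volatile boolean isInitialBatch = true;

    public static boolean isInitialBatch() {
        return isInitialBatch;
    }

    @Override
    public void beforeChunk(ChunkContext chunkContext) {
        // Commit count is 0 before the first chunk has been committed, so the
        // writers overwrite the files for the first batch and append afterwards
        isInitialBatch =
                chunkContext.getStepContext().getStepExecution().getCommitCount() == 0;
    }

    @Override
    public void afterChunk(ChunkContext chunkContext) { }

    @Override
    public void afterChunkError(ChunkContext chunkContext) { }
}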

RunnableGraph to wait for multiple responses from source

I am using Akka in a Play controller and performing an ask() to an actor named publish; internally, the publish actor performs ask() on multiple actors and passes along the sender reference. The controller actor needs to wait for the responses from the multiple actors and build a list of responses.
Please find the code below. This code only waits for one response and then terminates. Please suggest.
// Performs ask to publish actor
Source<Object, NotUsed> inAsk = Source.fromFuture(ask(publishActor, service.getOfferVerifyRequest(request).getPayloadData(), 1000));

final Sink<String, CompletionStage<String>> sink = Sink.head();

final Flow<Object, String, NotUsed> f3 = Flow.of(Object.class).map(elem -> {
    log.info("Data in Graph is " + elem.toString());
    return elem.toString();
});

RunnableGraph<CompletionStage<String>> result = RunnableGraph.fromGraph(
    GraphDSL.create(
        sink, (builder, out) -> {
            final Outlet<Object> source = builder.add(inAsk).out();
            builder
                .from(source)
                .via(builder.add(f3))
                .to(out); // to() expects a SinkShape
            return ClosedShape.getInstance();
        }
    ));

ActorMaterializer mat = ActorMaterializer.create(aSystem);
CompletionStage<String> fin = result.run(mat);
fin.toCompletableFuture().thenApply(a -> {
    log.info("Data is " + a);
    return true;
});
log.info("COMPLETED CONTROLLER ");
If you expect several responses, ask won't cut it; it is only for a single request-response pair where the response ends up in a Future/CompletionStage.
There are a few different strategies for waiting for all the answers:
One is to create an intermediate actor whose only job is to collect all the answers and, when all partial responses have arrived, respond to the original requestor. That way you can still use ask to get a single aggregate response back.
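A minimal sketch of such a collector actor in classic Akka (Java); the class name, expected count and reply type are illustrative assumptions. You would create one per request and pass it as the sender of the downstream messages so that all partial replies end up in it:
import java.util.ArrayList;
import java.util.List;

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;

public class ResponseAggregator extends AbstractActor {

    private final int expectedCount;
    private final ActorRef originalRequestor;
    private final List<Object> responses = new ArrayList<>();

    public ResponseAggregator(int expectedCount, ActorRef originalRequestor) {
        this.expectedCount = expectedCount;
        this.originalRequestor = originalRequestor;
    }

    public static Props props(int expectedCount, ActorRef originalRequestor) {
        return Props.create(ResponseAggregator.class,
                () -> new ResponseAggregator(expectedCount, originalRequestor));
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .matchAny(msg -> {
                    responses.add(msg);
                    if (responses.size() == expectedCount) {
                        // Send the aggregate back and stop; the original ask completes exactly once
                        originalRequestor.tell(new ArrayList<>(responses), getSelf());
                        getContext().stop(getSelf());
                    }
                })
                .build();
    }
}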
Another option would be to use Source.actorRef to get an ActorRef that you can use as the sender together with tell (and skip ask). Inside the stream you would then take elements until some criterion is met (time has passed or enough elements have been seen). You may have to add an operator to mimic the ask response timeout, to make sure the stream fails if the actors never respond.
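A rough sketch of that Source.actorRef approach with the classic Akka Streams Java DSL (the expected response count and message types here are assumptions):
// needs akka.stream.javadsl.Source/Sink/Keep, akka.stream.OverflowStrategy and akka.japi.Pair
int expectedResponses = 3;

Pair<ActorRef, CompletionStage<List<Object>>> pair =
        Source.<Object>actorRef(100, OverflowStrategy.dropHead())
              .take(expectedResponses)        // complete once all expected answers have arrived
              // consider adding completionTimeout(...) so the stream fails if the actors never respond
              .toMat(Sink.seq(), Keep.both())
              .run(mat);                      // mat should be the injected materializer

ActorRef replyTo = pair.first();
publishActor.tell(service.getOfferVerifyRequest(request).getPayloadData(), replyTo); // actors reply to the stream

pair.second().thenAccept(responses ->
        log.info("Received " + responses.size() + " responses"));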
There are some other issues with the code you shared. One is creating a materializer on each request: materializers have a lifecycle, and creating one per request will fill up your heap over time; you should instead have the materializer injected by Play.
With the given logic there is no need whatsoever to use the GraphDSL; it is only needed for complex streams with multiple inputs and outputs or cycles. You should be able to compose the operators using the Flow API alone (see for example https://doc.akka.io/docs/akka/current/stream/stream-flows-and-basics.html#defining-and-running-streams ).

How can I prove fields-grouping functionality (tuples with the same field value go to the same task)?

I'm new to Storm and am getting started using the storm-starter project. In this project there is a topology called WordCountTopology; the key code for building the topology is:
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
and in the implementation of the WordCount bolt, the key method execute is:
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
My Question is:
The functionality of fields grouping is that tuples with the same field word go to the same task for post-processing. Here "task" means thread; how can I prove this functionality? In addition, in my opinion the logic in the execute method is a little awkward: in a single task, the word in the tuple is always the same, but the execute method does not reflect this; in other words, the logic does not take advantage of that.
Am I clear? My point is that the code in execute does not take the fields-grouping feature into account; the same code could also be applied to a shuffle grouping.
I would like to cite a few points; they might help clear your doubts.
Here "task" means thread
In Storm's terminology, tasks are NOT threads, but they are responsible for executing the actual logic. Each spout or bolt that you implement in your code executes as a number of tasks across the cluster, so you can think of them as running instances of the components, i.e. spouts or bolts.
There is another entity, called an executor, which is the thread responsible for running these tasks. An executor can run one or multiple tasks of the same component; an executor having multiple tasks means the same component is executed multiple times by that executor.
Now coming back to your question:
the code here in execute is not taking the feature of fields grouping into account, the code here can also be applied to the situation of shuffle grouping
Very briefly, a fields grouping lets you group a stream by a subset of its fields. For a word count, if we partition the stream by using fieldsGrouping on a field named first_name, then all tuples whose first_name field has a given value, say (Foo), are expected to go to the same task, while the same field with a different value (Bar) goes to another task.
So the execute method is expected to receive the same field value each time and can therefore simply update its counter; it does not need to do anything special. The whole logic is written with the assumption that the bolt will be wired up with the proper data, and that is why using the proper grouping is so important. If you used shuffleGrouping, the same code would run but would produce incorrect counts.
Well Pinky (or anyone else who finds this useful), to prove it, you just have to keep track of the bolt or spout task ID:
@Override
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
    this.boltId = tc.getThisTaskId();
}
Now in the execute() of the same fieldsGrouped bolt that receives the tuples, you just print the id and the tuple:
@Override
public void execute(Tuple tuple) {
    String myWord = (String) tuple.getValue(0);
    System.out.println("word: " + myWord + " boltID: " + boltId);
}
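If the fields grouping is working, every occurrence of a given word will print the same boltID; if you rebuild the topology with shuffleGrouping on the count bolt instead, the same word will show up under different task IDs (and the counts will be wrong).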

Incremental write of Protocol Buffer object

I have a Protocol Buffer schema for logging data.
message Message {
    required double val1 = 1;
    optional int32 val2 = 2;
}

message BigObject {
    repeated Message message = 1;
}
I receive messages at a rate of one per second. They are stored in memory in my BigObject and used for some tasks. At the same time, I want to store those messages in a file as a backup in case the application crashes. Simply writing the whole BigObject every time would be a waste of time, so I am trying to find a way to write only the messages added since the last write to the file. Is there a way to do that?
Protobuf is an appendable format, and your layout is ideal for this. Just open your file positioned at the end, and start with a new (empty) BigObject. Add/serialize just the new Message instance, and write to the file (from the end onwards).
Now, if you parse your file from the beginning you will get a single BigObject with all the Message instances (old and new).
You could actually do this by logging each individual Message as it arrives, as long as you wrap it in a BigObject each time, i.e. in pseudo-code:
loop {
    msg = await NextMessage();
    wrapper = new BigObject();
    wrapper.Messages.Add(msg);
    file = OpenFileAtEnd();
    wrapper.WriteTo(file);
    file.Close();
}
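In Java, a minimal version of that pseudo-code could look like this, assuming classes generated from the .proto above with an outer class named LogProtos (the outer class name is an assumption):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MessageBackup {

    // Append a single message to the backup file. Serialized protobuf messages
    // concatenate, so the file stays parseable as one BigObject.
    public static void append(String path, LogProtos.Message msg) throws IOException {
        LogProtos.BigObject wrapper = LogProtos.BigObject.newBuilder()
                .addMessage(msg)
                .build();
        try (FileOutputStream out = new FileOutputStream(path, true)) { // true = append
            wrapper.writeTo(out);
        }
    }

    // After a crash, parse the whole file back into a single BigObject
    // containing every Message appended so far.
    public static LogProtos.BigObject readAll(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            return LogProtos.BigObject.parseFrom(in);
        }
    }
}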
