I have a KStream created from a an input topic.
1. On this KStream, I am doing a groupByKey and then a windowed aggregation operation
2. After that, in my code, I again use the same KStream as above, do a map operation to make some changes to the key & value, and then do another windowed aggregation operation.
All my windowed operation are on a tumbling window of 30 secs. My observation is that the second aggregation is starting about 30 secs after the first aggregation. Is there a way to parallelize them ?
Not sure what you mean by "is starting about 30 secs after the first aggregation"? Do you mean:
A delay according to processing-time/wall-clock time (the time you happen to process an input event, regardless of when this event actually occurred in the real world), or
A delay according to event-time (the time when an input event actually occurred in the real world)?
It's expected that the second aggregation is delayed (according to wall-clock time) because the data must be repartitioned after the map() to compute the correct window aggregation, which takes a bit of time.
The structure of your program will be something like this:
KStream -+-> group() -> agg()
|
+-> map() -> to() -> REPARTITION-TOPIC -> KStream -> group() -> agg()
There is nothing you can do about this, but it should also not be a problem, as it will not affect the correctness of your result.
Related
Consider this flow:
It's a simple flow to authenticate to an HTTP API and handle success/failure. In the failure state, you can see I added a ControlRate processor and that there are 2 FlowFiles in the queue for it. I have it set to only pass one FlowFile every 30 seconds (Time Duration = 30sec Maximum Rate = 1). So the queue will continue to fill during this, if the authentication process continues to fail.
What I want is to essentially drop all but the first FlowFile in this queue, because I don't want it to continue re-triggering the authentication processor after we get a successful authentication.
I believe I can accomplish this by setting the FlowFile Expiration (on the highlighted queue) to be just longer than the 30 second Time Duration of the ControlRate processor. But this seems a bit arbitrary and not quite correct in my mind.
Is there a way to say "take first, drop rest" for the highlighted queue?
In NiFi, there exist a data flow to consume from MQTT (ConsumeMQTT) and publish into HDFS path (PutHDFS). I got a requirement to introduce 60 min delay before pushing the consumed data into HDFS path. Found ControlRate and MergeContent processor to be possible solution but not sure.
What is the ideal solution to introduce time delay?
Example: A flow file consumed at 9:00 AM should be published into HDFS at 10:00 AM
You can use an ExecuteScript processor to run a sleep(60*60*1000) loop, but this would unnecessarily use system resources.
I would instead introduce a RouteOnAttribute processor which has an output relationship of one_hour_elapsed going to PutHDFS, and unmatched looped back to itself. The RouteOnAttribute processor should have Routing Strategy set to Route to Property Name and a dynamic property (click the + button on the top right of the Properties tab) named one_hour_elapsed. The Expression Language value should be ${now():toNumber():gt(${entryDate:toNumber():plus(3600000)})}.
This expression:
Gets the current time and converts it to milliseconds since the epoch (now():toNumber())
Gets the entryDate attribute of the flowfile (when it entered NiFi) and converts it to milliseconds and adds one hour (entryDate:toNumber():plus(3600000) [3600000 == 60*60*1000])
Compares the two numbers (a:gt(${b}))
If this is not actually the start of your flow, you can use an UpdateAttribute processor to insert an arbitrary timestamp at any point of your flow and calculate from there.
I would also recommend setting the Yield Duration and Run Schedule of the RouteOnAttribute processor to be substantially higher than usual, as you do not want this processor to run constantly as it will do no work. I'd suggest setting this to 1 or 5 minutes to start, as you are introducing a one hour delay already.
Starting from nifi 1.10 this can be done even easier with the RetryFlowfile processor.
Use penalty duration for setting the delay time:
I have a following flow,
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify, if
Total flowfiles (from fetch files) is say 200; OR
Time elapsed (from last signal) is say 3 hours.
I think the 1st condition is easy to achieve. I can have a groovy script which can read number of flowfiles, if 200 go to SUCCESS or else ROLLBACK the session.
But I want to know how to also check the time elapsed for n (number can be less than 200) flowfiles in queue is more than 3 hours or so?
Update
Here is the problem: We have a batch processing (~200 files and can increase based on business in future) currently. We have a NiFi pipeline, i.e. List, Fetch, Basic validation on checksum, etc and process (call the SQL) which is working fine.
As per the business, throughout the day we can have the correction to data so that we can get all or some of the files to "re-process". That is also fine and working.
Now, as per new requirements, we need to build the process after this "batch" is completed. So in the best case, I can have the MergeContent processor with max bin of n and give the signal or notify to my new processor.
However, as explained above, throughout that day we can get few or all files processed again. So now my "n" may not match the new "number" of files re-processed. Hence, even in this case if we have elapsed say 3 hours, then irrespective of "n" not equal to new number of files reprocessed, I should notify the new process to run again.
Hence, I am looking for n files OR m hours elapsed check.
I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum bin count of 200
You can use an ExecuteScript processor as you noted
You can write a value (the current timestamp) to a DistributedCacheMapServer when the Notify processor executes, and check that value with a FetchDistributedCacheMap processor against the current timestamp and use a simple Expression Language statement to compare the timestamp values
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj
I have noticed execution time for simple Condition and ForEach actions in my Logic Apps has an peculiarly long run-time, even considering the scope of these actions taking into account all downstream actions.
In a simple case, my deepest scope action is a For Each:
Notice the For Each is a short as possible in number of iterations: only one record being processed. Each action inside is very brief. The two HTTP actions are 52 ms total:
The Set Variable action is 29 ms:
How am I to understand this sums up to 1.63 seconds in the For Each?
This seems very odd to me, even with the "scope" context of considering ForEach reporting on all downstream actions. It still feels like well over 1.6 seconds that should not be needed.
There will always be some overhead for scopes (delta for the foreach job to wake up + read and aggregate action results + persist state).
There are changes you can make to optimize this, for example, if an array data is coming in from the trigger, "split-on" the array and have separate run process each item will have a better performance than "not-split on" and loop over the array using for-each.
I needed to implement a stream solution using AWS Kinesis streams & Lambda.
Lambda function 1 -
It adds data to stream and is invoked every 10 seconds. I added 100 data request ( each one of 1kb) to stream. I am running two instances of the script which invokes the lambda function.
Lambda function 2 -
This lambda uses above stream as trigger. On small volume of data / interval second lambda get data on same time. But on above metrics, data reaches slower than usual ( 10 minutes slower after +1 hour streaming ).
I checked the logic of both lambda functions and verified that, first lambda does not add latency before pushing data to stream. I also verified this by stream packet in second lambda where approximateArrivalTimestamp & current time clearly have the time difference increasing..
Kinesis itself did not have any issues / throttling shown in analytics ( I am using 1 shard ).
Are their any architectural changes I need to make to have it go smoother as I need to scale up at least 10 times like 20 invocations of first lambda with 200 packets, timeout 1 - 10 seconds as later benchmarks.
I am using 100 as the batch size. Can increasing/decreasing it have advantage?
UPDATE : As I explored more online, I found ideas to implement some async / front facing lambda with kinesis which in-turn invoke actual lambda asynchronously, So lambda processing time will not become bottleneck. However, this approach also failed as I have the same latency issue. I have checked the execution time. Front facing lambda ended in 1 second. But still I get big gap between approximateArrivalTimestamp & current time in both lambdas.
Please help!
For one shard, there will one be one instance of 2nd lambda.
So it works like this for 2nd lambda. The lambda reads configured record size from stream and processes it. It won't read other records until the previous records have been successfully processed.
Adding a second shard, you would have 2 lambdas processing the records. Thus the way I see to scale the architecture is by increasing the number of shards, however make sure data is evenly distributed across shards.