How to make a NiFi processor event-driven - apache-nifi

For example, suppose there is a pipeline made of three processors: P1, P2, and P3. When P2 produces an output flowfile, I want processor P3 to run exactly 5 minutes later.
I can't use a fixed CRON job because the P2 processor can run at any time.
NiFi version - 1.9.1

Look at RetryFlowFile with Maximum Retries = 1, placed between P2 and P3.
It can penalize a flow file when the retries are exceeded. With Maximum Retries = 1 it should do so immediately.
Then set the penalty duration to 5 min.
All set. P3 should not take the flow file from the queue for 5 minutes.
Option 2:
You could use ExecuteGroovyScript in place of RetryFlowFile, with the following script, to penalize everything that goes through it.
def ff = session.get()
if( !ff ) return
ff = session.penalize(ff)
REL_SUCCESS << ff
P.S. Don't forget to set the penalty duration for this processor.

Related

NiFi Group Content by Given Attributes

I am trying to run a script or a custom processor to group data by given attributes every hour. Queue size is up to 30-40k on a single run and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: Consume all flow files, group them by attributes, create a new flow file per group, and push the new ones. Not ideal, but I gave it a try.
While running this with 33k flow files waiting in the queue, I called:
session.getQueueSize().getObjectCount()
This always returns 10k, even though I increased the queue threshold values on the output flows.
Solution 2: A better approach is to consume one flow file and then filter the flow files matching the provided attributes:
final List<FlowFile> flowFiles = session.get(file -> {
    if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
        return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
    return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k waiting in the queue I was expecting around 200 new grouped flow files, but 320 were created. It looks like a similar issue to the one above: the filter query does not scan all waiting flow files.
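The correlation key used in the filter is the selected attribute values joined with ":". As a minimal sketch of the grouping itself, in plain Python and outside NiFi, flow files can be modelled as attribute dictionaries. The function names and the sample attributes (region, type, id) are made up for illustration, and a missing attribute is assumed to map to an empty string.

```python
from collections import defaultdict


def correlation_key(attributes, keys):
    """Join the values of the chosen attributes with ':'.

    Assumption for this sketch: a missing attribute becomes an empty string.
    """
    return ":".join(attributes.get(k, "") for k in keys)


def group_by_attributes(flow_files, keys):
    """Group attribute dictionaries by their correlation key."""
    groups = defaultdict(list)
    for attrs in flow_files:
        groups[correlation_key(attrs, keys)].append(attrs)
    return dict(groups)


# Hypothetical sample data standing in for queued flow files.
flow_files = [
    {"region": "eu", "type": "a", "id": "1"},
    {"region": "eu", "type": "b", "id": "2"},
    {"region": "eu", "type": "a", "id": "3"},
    {"region": "us", "type": "a", "id": "4"},
]

groups = group_by_attributes(flow_files, ["region", "type"])
for key, members in groups.items():
    print(key, len(members))
```

This yields one group per distinct key only when every queued flow file is seen in a single pass. If a filter pass sees only part of the queue, the same key can end up in more than one output file, which would be one possible explanation for getting more groups than expected.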
Questions:
Is there a parameter I can change so that getObjectCount can return up to 300k?
Is there a way to make the filter cover all waiting flow files, either by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k in nifi.properties, but it didn't help.
In nifi.properties there is a parameter that affects batching behavior:
nifi.queue.swap.threshold=20000
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with the script below
3. LogAttribute (disabled) - just to have a queue after the Groovy script
Groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, 4 files are generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains the result: of the 50K incoming files, NiFi keeps 20K in memory and swaps the rest to disk in batches of 10K.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system's performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it works well with many small files; the first batch grew to 100K as a result.
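The batch sizes observed in this test fit a simple model: the first large session.get() returns whatever is held in memory (up to the swap threshold), and each later call returns one swap file of 10,000. The sketch below is only a rough model of that observed behaviour, not NiFi's actual implementation; the function and parameter names are made up for illustration.

```python
def expected_batches(total, swap_threshold=20000, swap_file_size=10000):
    """Model the batch sizes seen by successive large session.get(n) calls.

    Assumption: up to `swap_threshold` flow files stay in memory and come out
    as the first batch; the rest come back one swap file at a time.
    """
    batches = []
    in_memory = min(total, swap_threshold)
    if in_memory:
        batches.append(in_memory)
    remaining = total - in_memory
    while remaining > 0:
        chunk = min(remaining, swap_file_size)
        batches.append(chunk)
        remaining -= chunk
    return batches


print(expected_batches(50000))                         # [20000, 10000, 10000, 10000]
print(expected_batches(50000, swap_threshold=100000))  # [50000]
```

With the default threshold the model reproduces the four output files from the test flow. With the threshold raised above the queue size, everything arrives in one batch.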

Lambda SQS integration: Batch Size vs MaxBatchingWindow

I'm integrating a lambda function with a standard queue in SQS.
I came across these two parameters, batchSize and maxBatchingWindow. My original thinking was that the lambda is triggered either when the number of messages in the queue reaches batchSize or when maxBatchingWindow seconds have passed since the first message came in. In other words, whichever condition is satisfied first will invoke the lambda. I couldn't find enough clarification about these two parameters in this documentation.
As a result, I ran an experiment, setting batchSize = 3 and maxBatchingWindow = 300 seconds, with reservedConcurrency = 1 for the lambda. Then I manually created 3 messages in the queue quickly (<< 5 min). However, the lambda was not invoked until about 5 minutes (300 s) had passed. In particular, the SQS metric Number Of Messages Sent shows a new data point at xx:54:15, while the log group for the lambda updates around xx:59:53. (The lambda does nothing intensive; it just prints the value of event, so I'm sure that is the right execution.)
Does that mean that once maxBatchingWindow is set greater than 0, it becomes the only condition for invoking the lambda, even if batchSize has been met?

Generate exactly 1 Flowfile

I'm using the GenerateFlowFile processor in Apache NiFi. When I activate it, I want the processor to create exactly 1 FlowFile.
Right now I use the REST API via Python to change the state to RUNNING, wait 0.5 seconds and change the state to STOPPED. This results in 1 FlowFile being added to the queue to the next processor.
I tested a bit and waiting for 1.5 seconds gives me 2 FlowFiles, 2.5 seconds gives me 3 FlowFiles - I'm guessing the processor generates one Flowfile each second it is running.
How can I ensure that exactly 1 Flowfile is being generated? The above method obviously is dependent on the network connection and roundtrip times. Worst case: the connection drops while I wait and I cannot stop the processor anymore and x Flowfiles are being generated.
My current configs are:
Settings:
Yield duration: 1 sec
Penalty Duration: 30sec
Bulletin Level: WARN
Scheduling:
Scheduling Strategy: CRON driven
Concurrent Tasks: 1
Run Schedule: * * * * * ?
Execution: All nodes
Run duration: 0ms
Properties:
File Size: 0B
Batch Size: 1
Data Format: Text
Unique FlowFiles: false
Custom Text: No value set
Character Set: UTF-8
Mime Type: No value set
You'll want to flag the GenerateFlowFile as Primary node only (assuming you have more than 1 node) to ensure each node is not generating its own FlowFile.
Set the Scheduling Strategy to Timer driven and whack the Run Schedule up to something like 604800 sec (1 week). This means that even if you leave the processor running, it's only going to run once a week, which should give you plenty of time to fix a connectivity issue if your script can't connect to tell the processor to stop.
Keep Concurrent Tasks at 1.

Spark job just hangs with large data

I am trying to query 15 days of data from S3. When I query each day separately, it works fine. It works fine for 14 days as well. But when I query 15 days, the job keeps running forever (hangs) and the task count does not update.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
  .map( start.plusDays )
  .map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same with 14 days, I get a count of 197337380, and when I ran the 15th day separately I got 27676788. But when I query all 15 days together, the job hangs.
Update :
The job works fine with :
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  // Read one day at a time and append it to the accumulated DataFrame
  val tempDF = sqlSession.read.schema( schema ).json(n)
  df = df.union(tempDF)
}
df.count
But can someone explain why it works now but not before?
UPDATE : After setting mapreduce.input.fileinputformat.split.minsize to 256 GB it works fine now.
Dynamic allocation and maximize resource allocation are two different settings; one is disabled when the other is active. With maximize resource allocation in EMR, 1 executor per node is launched, and it is given all the cores and memory of that node.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I'm not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now assuming 51 nodes is what you require, try below:
r3.4xlarge has 16 vCPUs - so you can put 15 of them to use, leaving one for the OS and other processes.
Set your number of executors to 150 - this will allocate roughly 3 executors per node.
Set the number of cores per executor to 5 (3 executors per node x 5 cores = 15 cores).
Set your executor memory to roughly total host memory / 3 = 35G.
You have to control the parallelism (default partitions); set this to the total number of cores you have, ~800.
Adjust shuffle partitions - make this twice the number of cores, 1600.
Above configurations have been working like a charm for me. You can monitor the resource utilization on Spark UI.
Also, in your yarn config /etc/hadoop/conf/capacity-scheduler.xml file, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator - which will allow Spark to really go full throttle with those CPUs. Restart yarn service after change.
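The sizing rules in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one vCPU per node is reserved and 5 cores go to each executor, as the answer suggests; the function and its parameter names are made up for illustration.

```python
def executor_layout(nodes, vcpus_per_node, cores_per_executor=5, reserved_cores=1):
    """Derive executor counts and partition settings from the cluster shape."""
    usable = vcpus_per_node - reserved_cores            # cores left after the OS share
    executors_per_node = usable // cores_per_executor
    num_executors = nodes * executors_per_node
    total_cores = num_executors * cores_per_executor
    return {
        "executors_per_node": executors_per_node,
        "num_executors": num_executors,
        "total_cores": total_cores,
        "shuffle_partitions": 2 * total_cores,           # twice the core count
    }


layout = executor_layout(nodes=51, vcpus_per_node=16)
print(layout)
```

For 51 nodes this gives 3 executors per node, 153 executors, 765 cores and 1530 shuffle partitions, which the answer rounds to 150, about 800 and 1600. In Spark these values would typically go into spark.executor.instances, spark.executor.cores, spark.default.parallelism and spark.sql.shuffle.partitions.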
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to turn off dynamic resource allocation, let the job run, and see whether it still hangs. (Please note that the Spark job can consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running.) If it doesn't hang, that means you should tune the resource allocation: start hardcoding the resources and keep increasing them until you find the best allocation you can use.
The links below can help you understand resource allocation and how to optimize it.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Spark Streaming:how to sum up all result for several DStreams?

I am using Spark Streaming + Kafka to build my message processing system, but I have a small technical problem, which I describe below.
For example, I want to do a word count every 10 minutes, so in my earliest code I set the batch interval to 10 minutes. The code looks like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(10))
But I don't think this is a very good solution, because 10 minutes is a long time and the amount of data is more than my memory can hold. So I want to reduce the batch interval to 1 minute, like this:
val sparkConf = new SparkConf().setAppName(args(0)).setMaster(args(1))
val ssc = new StreamingContext(sparkConf, Minutes(1))
Then the problem is: how can I sum up the results of ten 1-minute batches to get the 10-minute result? I think this work can only be done in the driver rather than in the worker program. What can I do?
I am new to Spark Streaming. Can anyone give me a hand?
Maybe I have an idea. In this case I should use a stateful function like updateStateByKey(), because what I want is a global 10-minute result but what I get is only the intermediate result of each 1-minute batch. So, until each 10-minute period ends, I have to keep the state of each minute, such as the word count result of each 1-minute batch, and add them up minute by minute.
Posting here as I had a similar issue and came across the Window Operations section of Spark Streaming. In the poster's original case, they want a count for the past 10 minutes, done every 10 minutes although their program calculates counts each 1 minute. Assuming we have counts defined and calculated as the standard word count (i.e. at a 1-minute batch duration, with tuples (word, count)), we could follow the linked guide and define something along the lines of
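// Reduce/count the last 10 minutes' worth of data, every 10 minutes
val windowedWordCounts = counts.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Minutes(10))
where (a: Int, b: Int) => a + b is the sum function. Both the window length and the slide interval must be multiples of the batch interval (1 minute here).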
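The windowing idea can be sketched in plain Python, outside Spark: each 1-minute batch yields per-word counts, and a window of ten batches that slides by ten batches sums them into one 10-minute result. The function name and the sample data are made up for illustration; this models the arithmetic only, not Spark's execution.

```python
from collections import Counter


def windowed_counts(batches, window, slide):
    """Sum per-batch word counts over windows of `window` batches,
    advancing by `slide` batches each time."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        total = Counter()
        for batch in batches[end - window:end]:
            total.update(batch)  # adds the counts key by key
        results.append(dict(total))
    return results


# Twenty hypothetical 1-minute batches of (word -> count).
batches = [{"spark": 1, "kafka": 2}] * 10 + [{"spark": 3}] * 10

result = windowed_counts(batches, window=10, slide=10)
print(result)  # [{'spark': 10, 'kafka': 20}, {'spark': 30}]
```

Because the slide equals the window length, the windows do not overlap, so each 1-minute batch contributes to exactly one 10-minute total.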

Resources