I have a standalone NiFi flow like:
executestreamprocessor (hive script) -> executestreamprocessor (hadoop script)
For each incoming flow file, the hive script runs an INSERT..INTO..SELECT..FROM command and the hadoop script deletes the corresponding files from the storage area.
Sometimes the hive script fails when the hadoop script runs its command at the same time.
I will be getting a maximum of 4 files per hour, so I planned to use a ControlRate processor between the hive and hadoop processors, with the condition that the hadoop script runs only once the queue has reached 4 flow files. But ControlRate only has a property for a maximum rate; there is no minimum rate.
Is there any possible way to achieve this, or any other solution?
You should be able to use ExecuteScript for this; try this Groovy script:
def flowFiles = session.get(4)  // try to pull up to 4 flow files from the incoming queue
if (!flowFiles || flowFiles.size() < 4) {
    // fewer than 4 available yet: roll back so they stay in the queue for the next run
    session.rollback()
} else {
    session.transfer(flowFiles, REL_SUCCESS)
}
If you only want to trigger the downstream flow once, then you can create a child flow file from the parents (and report a JOIN provenance event):
def flowFiles = session.get(4)  // try to pull up to 4 flow files from the incoming queue
if (!flowFiles || flowFiles.size() < 4) {
    session.rollback()
} else {
    def flowFile = session.create(flowFiles)              // create a single child flow file from the 4 parents
    session.provenanceReporter.join(flowFiles, flowFile)  // report a JOIN provenance event
    session.remove(flowFiles)                             // drop the parent flow files
    session.transfer(flowFile, REL_SUCCESS)               // send the single trigger flow file downstream
}
Having said that, if you don't care about the flow file contents (i.e. you are using a flow file as a trigger), you could use MergeContent with Minimum Number of Entries and Maximum Number of Entries both set to 4.
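For reference, a rough sketch of the MergeContent settings for that trigger pattern (these values are an assumption based on the description above, not a tested configuration):
Merge Strategy = Bin-Packing Algorithm
Minimum Number of Entries = 4
Maximum Number of Entries = 4
Max Bin Age = optionally set, so a bin is not held forever if fewer than 4 files arrive in an hour
With both entry counts set to 4, MergeContent releases a merged flow file only once 4 flow files have accumulated, and that single flow file can then trigger the hadoop script.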
I am trying to run a script or a custom processor to group data by given attributes every hour. The queue size is up to 30-40k on a single run, and it might go up to 200k depending on the case.
MergeContent does not fit, since there is no fixed min/max count I can set.
RouteOnAttribute does not fit, since there are too many combinations.
Solution 1: Consume all flow files, group them by attributes, create a new flow file per group, and push that one. Not ideal, but I gave it a try.
While running this with 33k flow files waiting in the queue,
session.getQueueSize().getObjectCount()
kept returning 10k, even though I increased the queue thresholds on the outgoing connections.
Solution 2: A better approach is to consume one flow file and then filter the queued flow files matching the provided attributes:
final List<FlowFile> flowFiles = session.get(file -> {
    // accept only the queued flow files whose key attributes join to the same correlation id
    if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
        return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
    return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k flow files waiting in the queue, I was expecting around 200 new grouped flow files, but 320 were created. It looks like the same issue as above: the filter does not scan all of the waiting flow files.
Problems/Questions:
Is there a parameter I can change so that getObjectCount can see up to 300k flow files?
Is there a way to have the filter consider all waiting flow files, either by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k in nifi.properties, but it didn't help.
In nifi.properties there is a parameter that affects this batching behavior:
nifi.queue.swap.threshold=20000
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with the script below
3. LogAttribute (disabled) - just to have a queue after the Groovy script
Groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, 4 files are generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains the output: of the 50K incoming files, 20K are kept in memory and the remaining 30K are swapped to disk in batches of 10K (20,000 + 3 × 10,000 = 50,000), which matches the observed batch sizes.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system's performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it looks good with many small files; the first batch increased to 100K as a result.
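For reference, the change amounts to a single line in nifi.properties (NiFi has to be restarted to pick it up); 100000 is the value used in the test above:
nifi.queue.swap.threshold=100000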
I am using NiFi 1.9.2
I am reading a text file which happens to be a CSV file. I have the contents of the file in the content of a flow file.
The contents are:
a,b,c
d,e,f
g,h,i
I want to prepend a line number to all records in the flowfile and get
1,a,b,c
2,d,e,f
3,g,h,i
each time I feed a file through this processor
I can achieve something close by using the ReplaceText processor with Properties as follows:
Search Value : (?m)(^.*$)
Replacement Value : ${nextInt()},$1
But because nextInt() persists its value over the lifetime of the running NiFi instance, I get
0,a,b,c
1,d,e,f
2,g,h,i
for the 1st execution, and
3,a,b,c
4,d,e,f
5,g,h,i
for the next execution, and so on.
Additionally, from the NiFi Expression Language Guide, the "counter is shared across all NiFi components, so calling this function multiple times from one Processor will not guarantee sequential values within the context of a particular Processor."
Is there a way to ensure the line numbers always start at 0 for each execution of this processor for the lifetime of the NiFi instance, and are always sequential?
What is the range of the counter?
Can I get the counter to start at 1?
You can split the content into individual lines, then use fragment.index to prepend the counter to each line. After that you can merge them again.
The flow: GenerateFlowFile -> SplitText -> ReplaceText -> MergeContent (the processor configurations and the resulting output were shown as screenshots).
Don't forget to add a new line (Shift+Enter) to the Demarcator property of MergeContent.
You can use ${fragment.index:minus(1)} if you want to count from zero.
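A rough sketch of the key settings for this flow (the exact values are assumptions, since the original screenshots are not reproduced here):
SplitText: Line Split Count = 1
ReplaceText: Evaluation Mode = Entire text, Replacement Strategy = Regex Replace, Search Value = (?s)(^.*$), Replacement Value = ${fragment.index},$1
MergeContent: Merge Strategy = Defragment, Delimiter Strategy = Text, Demarcator = a single newline
SplitText assigns each split a one-based fragment.index and Defragment reassembles the splits in their original order, so every flow file comes out numbered 1, 2, 3, and so on.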
I need to remove duplicates from a flow I've developed; it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase Cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed them to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);                          // 1) read the current value (null if the key is absent)
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer); // 2) store the value only if the key is still absent
if (!wasAbsent) return got;  // a value was already there, so return what we read
else return null;            // we stored the value, so report the key as absent
So, because these are two separate calls, there is a possible race condition...
Imagine node 1 executes the get and receives null, but before it calls putIfAbsent, node 2 performs both its get and its putIfAbsent. When node 1 then calls putIfAbsent it gets false, because node 2 has just populated the cache, so node 1 returns the null value from its original get. Both flow files therefore look like non-duplicates to DetectDuplicate.
The DistributedMapCacheServer, by contrast, locks the entire cache per operation, so it can provide an atomic getAndPutIfAbsent.
I need to develop a streaming application which reads some session logs from several sources.
The batch interval would be on the scale of about 5 minutes.
The problem is that the size of the files I get in each batch varies enormously. In one batch I may get a file of about 10 megabytes, and in another batch files of around 20 GB.
I want to know if there is any approach to handle this. Is there any limit on the size of the RDDs a file stream can generate for each batch?
Can I limit Spark Streaming to read only a fixed amount of data into the RDD in each batch?
As far as I know there is no direct way to limit that. Which files are considered is controlled by the private isNewFile function in FileStream. Based on that code I can think of one workaround.
Use a filter function to limit the number of files to be read: for any file beyond the first 10, return false, and use the touch command to update the file's timestamp so that it is considered in the next window.
import org.apache.hadoop.fs.Path

// counter shared across calls; only the first 10 files are accepted
var globalCounter = 10

val filterF: Path => Boolean = (file: Path) => {
  globalCounter -= 1
  if (globalCounter >= 0) {
    true  // consider only 10 files
  } else {
    // touch the file here so that its timestamp is updated and it is picked up in a later window
    false
  }
}
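For context, here is a minimal sketch of how such a filter could be wired into fileStream. The directory path, input types, and 5-minute batch interval are assumptions based on the question, and sc is taken to be an existing SparkContext:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(300))  // 5-minute batches, as described in the question

// fileStream accepts a Path => Boolean filter, so filterF from above plugs in directly
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("/data/session-logs", filterF, newFilesOnly = true)
  .map { case (_, text) => text.toString }

lines.print()
ssc.start()
ssc.awaitTermination()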
I was creating a processing pipeline using Spark SQL 1.6.0. The pipeline consists of steps/transformations, and the output of one step is forwarded to the next one. After the last step the resulting DataFrame is saved to HDFS. I also need to save the result at some intermediate steps. The code which does this is:
saveDataFrame(flushPath, flushFormat, isCoalesce, flushMode, previousDataFrame, sqlContext)
previousDataFrame
Here, previousDataFrame is the result of the previous step and saveDataFrame simply saves the DataFrame at the given location; previousDataFrame is then used by the next step/transformation and, finally, after the last step, it is saved to HDFS. The code for saveDataFrame is:
implicit def saveDataFrame(path: String, format: String, isCoalesce: Boolean, saveMode: SaveMode, dataFrame: DataFrame, sqlContext: SQLContext): Unit = {
  val source = if (isCoalesce) dataFrame.coalesce(1) else dataFrame
  if (format.equalsIgnoreCase("csv")) {
    source
      .write
      .mode(saveMode)
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .save(path)
  } else if (format.equalsIgnoreCase("parquet") || format.equalsIgnoreCase("json")) {
    source
      .write
      .mode(SaveMode.Overwrite)
      .format(format)
      .save(path)
  } else {
    throw new Exception("%s input format is not supported".format(format))
  }
}
This works, but the Spark application takes much longer than usual. Without this intermediate saving the application runs in about 20 minutes; with this code it took 1 hour. The jobs and tasks complete in 20 minutes according to the Spark UI, but the spark-submit process keeps running for up to an hour.
Please help me figure out the cause. I have also tried the following two possible workarounds:
Using a Future to call saveDataFrame on a separate thread.
Caching previousDataFrame before saving it and reusing it in the next step.
The issue was the AWS S3 path, which was causing the delay in execution. Once I started saving the output to HDFS instead, the execution time went down.