NiFi: MergeRecord is generating duplicates - apache-nifi

I'm having trouble with the MergeRecord processor in NiFi. You can see the whole NiFi flow below: I get a JSON array from an API, split it, apply some filters, and then want to build the JSON array again.
Nifi workflow
I'm able to build the correct JSON array from all the chunks, but the problem is that the processor keeps generating data indefinitely. When I run the flow step by step (starting and stopping each processor one by one) everything is fine, but while MergeRecord is running it keeps generating the same data even after I stop the beginning of the flow (so there is no more input).
The screenshot below shows the data stacking up in the "merged" box:
data stacked
I scheduled this processor to run every 10 sec, and after 30 sec you can see that it executed 3 times and generated the same file 3 times, although no more data is coming from upstream. It's weird, because the "original" box of the processor shows the right original amount of data (18.43 KB), but the merged part keeps increasing.
Here is the configuration of the MergeRecord:
configuration
I suppose that I'm missing something, but I don't know what!
Thank you for your help,
Regards,
Thomas

Related

Understanding Apache Spark Web UI performance metrics

I'm new to Spark and I'm trying to understand the metrics in the Web UI that relate to my Spark application (developed with the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them gave a general overview of the Web UI: the definition of stage/job/task, how to tell when something is not working properly (e.g. work not balanced between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the items in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood correctly, the default max partition size is set to 128 MB. Since my dataset size is 1378 MB, I get 11 tasks that each work on up to 128 MB, right? And since I did some filtering in the first stage (before applying groupBy), the tasks write in memory, so Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy. Is the number of tasks related to the contents of my dataset or to its size?
UPDATE: I found the question "spark.sql.shuffle.partitions of 200 default partitions conundrum" and some other questions, but now I'm wondering whether there is a specific reason for the default to be 200.
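The arithmetic behind the two task counts can be sketched as follows. This is a simplified model, not Spark's actual split planner: it assumes the input splits into chunks of at most 128 MB (the default of spark.sql.files.maxPartitionBytes) and that the post-shuffle stage uses the default spark.sql.shuffle.partitions of 200. The function name input_partitions is made up for illustration.

```python
import math

# Default of spark.sql.shuffle.partitions; it is a fixed setting,
# independent of the size or contents of the dataset.
DEFAULT_SHUFFLE_PARTITIONS = 200


def input_partitions(total_mb, max_partition_mb=128):
    """Rough number of read tasks: one per chunk of at most max_partition_mb."""
    return math.ceil(total_mb / max_partition_mb)


# 1378 MB is the dataset size quoted in the question.
print("stage 0 tasks:", input_partitions(1378))
print("stage 1 tasks:", DEFAULT_SHUFFLE_PARTITIONS)
```

Under this model the first count follows the input size, while the second stays at the configured value unless spark.sql.shuffle.partitions is changed.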
Stage 0
Why do some tasks have result serialization here? If I understood correctly, serialization is related to the output, so to any show(), count(), collect(), etc. But those actions are not present in this stage (before the groupBy).
Stage 1
Is it normal that result serialization time takes up such a large part? I called show() (which takes 20 rows by default, and there is an orderBy), so did all tasks run in parallel and that one task serialize all its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount of Shuffle Read Time. Again, is this something related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it there, since this is stage 1 and it was already present in stage 0.
Job 1- caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9MB / 765 [input size/records], while the first query shows 1378.8 MB / 7647431 [input size/records].
I guess that it has 11 tasks because the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really related to records/rows, right?
Thanks for reading.

NiFi how to release flow file until a process downstream is finished

I am designing a data ingestion pattern using NiFi. One process needs to stop releasing flow files until a downstream process has finished processing. I tried to use Wait and Notify without success. I am wondering whether the queue size and back pressure can be set across several processors.
Similarly, is there a way to implement this logic: don't let a new flow file in while one is still being processed anywhere between several processors?
Any help is appreciated
You need a combination of MonitorActivity and ExecuteStreamCommand (with a Python "nipyapi" script).
I have a similar requirement in one of my working flows.
You will need to install the Python library nipyapi first and create this script on the NiFi box.
from time import sleep
import nipyapi
nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)
## Get PG ID using the PG Name
mypg = nipyapi.canvas.get_process_group('start')
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=True) ## Start
sleep(1)
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=False) ## Stop
I will put the template at the link below. See the configuration of the MonitorActivity processor: it generates a flow file if no activity happens for 10 sec (you can play with the times, though).
Download template
Note: this is not a very good approach if you have strict latency requirements.
Another idea would be to monitor the aggregate queue of the entire flow and restart the start flow once the queue is zero. (This would be very intensive if you have a lot of connections.)
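A minimal sketch of that queue-monitoring idea, assuming a nipyapi version in which nipyapi.canvas.get_process_group_status(pg_id, detail='all') returns the process group entity and its status.aggregate_snapshot.flow_files_queued field holds the queued count. The helper names queued_flowfiles and wait_until_drained are made up for illustration, and the polling logic is kept separate so it can be tried without a NiFi instance.

```python
from time import sleep


def queued_flowfiles(pg_id="root"):
    """Return the number of flow files queued in a process group.

    Assumes nipyapi.utils.set_endpoint(...) has already been called.
    """
    import nipyapi  # imported lazily so the polling logic below runs without NiFi

    entity = nipyapi.canvas.get_process_group_status(pg_id, detail="all")
    return entity.status.aggregate_snapshot.flow_files_queued


def wait_until_drained(get_queued, poll_seconds=5.0, max_polls=None):
    """Poll get_queued() until it reports 0; return False if max_polls is exhausted."""
    polls = 0
    while get_queued() > 0:
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(poll_seconds)
    return True
```

A script could call wait_until_drained(queued_flowfiles) and only then schedule the 'start' process group as in the script above. Note that a queue count of zero does not cover flow files that a processor is holding while it works on them.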
I was able to design a solution within NiFi. Essentially, I use GenerateFlowFile as a signal (it runs only once, ever). The trick is to have the newly generated flow file merge with the original input flow file through defragmentation. Every time the flow finishes, the success output can then merge with the next input flow file.
Solution Flow

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming back again and again, even though the query only returns 2 documents. The query is {"qty":{$gt:10}}
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Always read first n lines on spring batch job restart

I am using the Spring Batch module to read a complex file with multi-line records. The first 3 lines in the file always contain a header with a few common fields.
These common fields will be used in the processing of subsequent records in the file. The job is restartable.
Suppose the input file has 10 records (please note number of records may not be same as number of lines since records can span over multiple lines).
Suppose the job runs for the first time, starts reading the file from line 1, processes the first 5 records, and fails while processing the 6th record.
During this first run, since the job has also parsed the header part (the first 3 lines in the file), the application can successfully process the first 5 records.
Now, when the failed job is restarted, it will start from the 6th record and hence will not read the header part this time. Since the application requires certain values contained in the header record, the job fails. I would like to know possible suggestions so that the restarted job always reads the header part and then continues from where it left off (the 6th record in the above scenario).
Thanks in advance.
I guess the file in question does not change between runs? Then it's not necessary to re-read it; my solution builds on this assumption.
If you use one step, you can:
implement a LineCallbackHandler
give it access to the step ExecutionContext (it's easy with annotations, but it can be done with interfaces too; just extend StepExecutionListenerSupport)
save the header values into the ExecutionContext
extract them from the context and use them where you want to
It should work for a restart as well, because Spring Batch saves the values from the first run and will provide the complete ExecutionContext for subsequent runs.
You can make a 2-step job where:
The first step reads the first 3 lines as header information and puts everything you need into the job context (which is therefore saved in the DB for future executions if the job fails). If this step fails, the header info will be read again, and if it passes you can be sure the header info will always be in the job context.
The second step can use the same file for input, but this time you can tell it to skip the first 3 lines and read the rest as is. This way you get restartability on that step, and each time the job fails it will resume where it left off.
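The restart behaviour that both answers rely on can be illustrated with a language-neutral sketch. This is plain Python, not the Spring Batch API: the dict named context stands in for the persisted ExecutionContext, every function name is made up for illustration, and each record is simplified to a single line.

```python
HEADER_LINES = 3  # the question states that the header is the first 3 lines


def read_header_step(lines, context):
    """Step 1: store the header in the persisted context unless already stored."""
    if "header" not in context:
        context["header"] = list(lines[:HEADER_LINES])


def process_records_step(lines, context, process):
    """Step 2: skip the header lines, then resume after the last committed record."""
    records = lines[HEADER_LINES:]
    start = context.get("records_done", 0)
    for index in range(start, len(records)):
        process(context["header"], records[index])
        context["records_done"] = index + 1  # commit progress after each record


def run_job(lines, context, process):
    """Run both steps; calling it again with the same context acts as a restart."""
    read_header_step(lines, context)
    process_records_step(lines, context, process)
```

If process raises on the 6th record, context keeps the header and records_done == 5, so a second call to run_job with the same context continues at the 6th record with the header values still available.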

Hadoop DistributedCache failed to report status

In a Hadoop job I am mapping several XML files and filtering an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, every line containing just an integer as an ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader, and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
Before any map-job is executed.
The cluster consists of 40 nodes, and since the files of a DistributedCache are copied to the slave nodes before any tasks for the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000s. Of course I could raise the timeout even more, but this period should really suffice, shouldn't it?
Since DistributedCache is meant to be a way to "distribute large, read-only files efficiently", I wonder what causes the failure here and whether there is another way to pass the relevant IDs to every map task?
Can you add some debug printlns to your setup method to check that the timeout really happens in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs. You can probably store these values in a 50MB bloom filter with a good false positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.
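Before settling on a filter size, it may be worth running the numbers with the standard Bloom filter sizing formulas. The sketch below assumes that all 250 million lines are distinct IDs and that the optimal number of hash functions is used; the function names are made up for illustration.

```python
import math


def bloom_bits(n_items, fp_rate):
    """Bits needed for n_items at the target false positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hashes(n_bits, n_items):
    """Number of hash functions that minimises the false positive rate."""
    return max(1, round(n_bits / n_items * math.log(2)))


def false_positive_rate(n_bits, n_items, n_hashes):
    """Approximate false positive rate once n_items have been inserted."""
    return (1 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes


N_IDS = 250_000_000  # line count from the question, assumed to be distinct IDs

needed_mb = bloom_bits(N_IDS, 0.005) / 8 / 1e6
bits_50mb = 50 * 10**6 * 8
rate_50mb = false_positive_rate(bits_50mb, N_IDS, optimal_hashes(bits_50mb, N_IDS))

print(f"size for 0.5% false positives: {needed_mb:.0f} MB")
print(f"false positive rate at 50 MB: {rate_50mb:.0%}")
```

Under these assumptions, a 0.5% false positive rate needs roughly 345 MB, and a 50 MB filter would report false positives for over 40% of lookups. The 50 MB estimate would therefore only hold if the file contains far fewer distinct IDs than lines.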
