Nifi record counts - apache-nifi

I am getting files from remote server using Nifi: my files are as follow:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13,6
12355,y,12.0
I am now just get and fetch and split lines and send them to Kafka, but before hand, I need to apply a checksum approach on my records and aggregate them based on time stamp, what I need to do to add an additional column to my content and count the records based on aggregated time stamps, for example aggregation based on each 10 milliseconds and nodeID..
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13,6,1
12355,y,12.0,1
How to do above process in NiFi. I am totally new to Nifi but need to add above functinality to my Nifi process. I am currently using below nifi process

This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would like change to a single ConvertRecord processor that converts CSV to JSON. You would first need to defined a simple Avro schema for your data.
Once you have the record processing setup, you might be able to use PartitionRecord to partition the records by the node id, and then from there the missing piece would be how to count by the timestamps.
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Related

NiFi How can I merge exactly 3 records in a single one?

I'm working with NiFi, receiving several json in the same file. Pretending to modify propertly those jsons, I split it into several single flowfiles and here is where the problems start.
For each id and datetime, there are 3 flowfiles that I would like to merge into a single flowfile. I've tried to use MergeRecord NiFi's Processor, but it works when it wants. When I have a few records, it seems that it works ok. But when I have "a lot of" records, e.g., more than 70 records, it breaks. Sometimes it merges 2 records in a single record, sometimes it lets pass a single record directly.
merge_key is a string attribute based on id and datetime.
I need to take exactly 3 records and merge them in a single one.
If there is a way to order the flowfile and take the first n elements of it each 5 seconds, I think it can help. But I prefer to be sure it works properly without any "help"...
For ordering the flowfile we can use the EnforceOrder processor as per the default documentation.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EnforceOrder/#:~:text=Order%20Attribute,will%20be%20routed%20to%20failure.
Also NiFi uses below prioritizers among the incoming flow files.
FirstInFirstOutPrioritizer
NewestFlowFileFirstPrioritizer
OldestFlowFileFirstPrioritizer
PriorityAttributePrioritizer
Refer below link for more details
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html

How to wait for GenerateTableFetch queries to finish

My use case is like this. I have some X tables to be pulled from MySQL. I am splitting them using SplitText to put each table in a individual flow file and pull using GenerateTableFetch and ExecuteSQL.
And I want to be notified or put some other action when import is done for all the tables. At SplitText text processor I have routed original relationship to Wait on ${filename} with target count ${fragment.count}. This will track how many tables are done.
But now I am not able to figure out how to know when a particular table is done. GenerateTableFetch forks flow file into multiple based on Partition Size. But it does not write attributes like fragment.count which I can use to wait on for each table.
Is there a way I can achieve this? Or maybe is there a way to know at the end of the entire flow if all flow files in the flow have been processed and nothing is in queue or being processed?
If you have a standalone instance of NiFi (or are not distributing the flow files among a cluster to ExecuteSQL nodes), then you could use QueryDatabaseTable instead, it (by default) will only issue all flow files when the entire result set is processed. If you have all the rows go into a single flow file, then the fact that the flow file has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to flow files generated by GTF.
Till NiFi add's support for this, I managed to make it work using MergeContent. Use table_name as Correlation attribute name and then use merged relation to Wait processor using ${merge.count} as target. Refer screenshots if someone is looking to do the same.

I need to get latest data in data ingest template nifi

Hi sir,
In data ingest template i need to get this property
for ex i have data with date field
date data
12-07-2018 a
13-07-2018 b
14-07-2018 c
15-07-2018 d
In that , i would like to take latest one i.e, 15-07-2018
if date field got new data
16-07-2018 e
then i have to get 16-07-2018 by checking last updated date 15-07-2018 rather than checking from first one 12-07-2018
like that, if i got 17-08-2108 f then have to get 17-08-2018 by checking with last new date 16-07-2018 ..
how to achieve this , in which processor i have to do modifications or have to add new properties
When the feed runs again, how does it take the latest watermark and work from there
Two possible approach comes to my mind:
Write your own Spark app which would be used (ExecuteSparkJob) to read through the file which is getting ingested. In this case, you keep track of the max date and when you are done through the ingestion, persist it somewhere. If you're in HDP world, easy thing would be to insert the max date to a Hive (transactional) table. You can also leverage ZooKeeper znode to persist or even the PutDistributedMapCache processor that NiFi offers.
Write a custom NiFi processor which would basically do the same thing as the above one, except that you have to enable it yourself to work with data of different format (CSV, JSON). Spark, in this regard, comes packed with many thing built in.

Nifi Fetching Data From Oracle Issue

I am having a requirement to fetch data from oracle and upload into google cloud storage.
I am using executeSql proecssor but it is failing for large table and even for table with 1million records of approx 45mb size it is taking 2hrs to pull.
The table name are getting passed using restapi to listenHttp which passes them to executeSql. I cant use QueryDatabase because the number of table are dynamic and calls to start the fetch is also dynamic using a UI and Nifi RestUi.
Please suggest any tuning parameter in ExecuteSql Processor.
I believe you are talking about having the capability to have smaller flow files and possibly sending them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836) and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of Nifi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in Nifi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to stick a MergeContent processor before PutHDFS, and merge together flow files based on size so that you can wait til you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.

Resources