Data aggregation in Apache NiFi

I am using Apache NiFi to process data from different sources, and I have independent pipelines created for each data flow. I want to combine this data for further processing. Is there any way I can aggregate the data and write it to a single file? The data is present in the form of FlowFile attributes in NiFi.

You should use the MergeContent processor, which accepts configuration values for min/max batch size, etc., and combines a number of FlowFiles into a single FlowFile according to the configured merge strategy.
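As a rough illustration, a bin-packing configuration along the following lines would batch incoming FlowFiles into larger outputs (the property names are MergeContent's; the values are placeholders to tune for your own volume and latency):
Merge Strategy = Bin-Packing Algorithm
Merge Format = Binary Concatenation
Minimum Number of Entries = 1000
Maximum Number of Entries = 10000
Maximum Group Size = 64 MB
Max Bin Age = 5 min
Note that MergeContent merges FlowFile content; if the data you want to combine currently lives in attributes, you would first need to write those attributes into the content (for example with AttributesToJSON) before merging.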

Related

Apache NiFi for data masking

We are using NiFi as our main data ingestion engine. NiFi is used to ingest data from multiple sources like databases, blob storage, etc., and all of the data is pushed to Kafka (with Avro as the serialization format). Now, one of the requirements is to mask specific fields (PII) in the input data.
Is NiFi a good tool to do that?
Does it have any processor to support data masking/obfuscation?
NiFi comes with the EncryptContent, CryptographicHashContent, and CryptographicHashAttribute processors, which can be used to encrypt and hash data, respectively.
I would look into these first.
In addition, ReplaceText can do simple masking. An ExecuteScript processor could perform custom masking, or a combination of UpdateRecord with a ScriptedRecordSetWriter could easily mask certain fields in a record.
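As an illustration of the ExecuteScript route, here is a minimal Jython sketch that masks one field of a JSON FlowFile; the "ssn" field name is purely hypothetical, and it assumes each FlowFile holds a single JSON object:

import json
import re
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# Rewrites the FlowFile content, replacing every digit of the hypothetical "ssn" field with '*'
class MaskCallback(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        record = json.loads(text)
        if 'ssn' in record:
            # assumption: the value is a string (or can be safely stringified)
            record['ssn'] = re.sub(r'\d', '*', str(record['ssn']))
        outputStream.write(bytearray(json.dumps(record).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, MaskCallback())
    session.transfer(flowFile, REL_SUCCESS)

For record-oriented data, the UpdateRecord approach mentioned above avoids scripting entirely and is generally easier to maintain.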

Appending to an existing Avro file in HDFS with NiFi

I have a NiFi flow that grabs JSON events from an MQTT broker, groups them according to some criteria, transforms them into Avro rows, and should output them as files in a Hadoop cluster.
I chose Avro as the storage format since it's able to append new data to an existing file.
These events are grouped by source, and ideally I should have one separate Avro file in HDFS for each event source, so NiFi accumulates new events in each file as they appear (with proper write batching, of course, since issuing a write per new event wouldn't be very good; I've already worked this out with a MergeContent processor).
I have the flow worked out, but I found that the last step, a PutHDFS processor, is file-format agnostic, that is, it doesn't understand how to append to an existing Avro file.
I've found this pull request that implements exactly that, but it was never merged into NiFi due to various concerns.
Is there a way to do this with existing NiFi processors? Or do I have to roll my own custom PutHDFS processor that understands how to append to existing Avro files?

How can ExtractGrok use multiple regular expressions?

I have a Kafka topic which includes different types of messages sent from different sources.
I would like to use the ExtractGrok processor to extract the messages based on a regular expression/grok pattern.
How do I configure or run the processor with multiple regular expressions?
For example, the Kafka topic contains INFO, WARNING, and ERROR log entries from different applications.
I would like to separate the messages by log level and place them into HDFS.
Instead of using the ExtractGrok processor, use the PartitionRecord processor in NiFi, as this processor evaluates one or more RecordPaths against each record in the incoming FlowFile; each record is then grouped with other "like" records.
Configure/enable the controller services:
RecordReader as GrokReader
RecordWriter in your desired format
Then use a PutHDFS processor to store the FlowFiles based on the log-level attribute (a rough configuration sketch is shown after the links below).
Flow:
1. ConsumeKafka processor
2. PartitionRecord processor
3. PutHDFS processor
Refer to this link, which describes all the steps for configuring the PartitionRecord processor.
Refer to this link, which describes how to store partitions dynamically in HDFS directories using the PutHDFS processor.
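As a rough sketch (the grok pattern, attribute name, and directory are illustrative, not taken from the question): the GrokReader captures the log level into a record field, PartitionRecord turns that field into a FlowFile attribute via a user-defined RecordPath property, and PutHDFS references the attribute in its directory:
GrokReader -> Grok Expression = %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
PartitionRecord -> log.level = /level (user-defined property; the property name becomes the attribute name)
PutHDFS -> Directory = /data/logs/${log.level}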

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, split the lines, and send them to Kafka, but beforehand I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need to do is add an additional column to my content and count the records based on aggregated timestamps, for example aggregating over each 10 milliseconds and nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How do I do the above processing in NiFi? I am totally new to NiFi, but I need to add the above functionality to my NiFi process. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the four processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data (a sketch is shown below).
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by node id, and then from there the missing piece would be how to count by the timestamps.
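For example, a minimal Avro schema for the sample CSV above might look like the following (field names are taken from the sample header; treat it as a sketch to adapt):
{
  "type": "record",
  "name": "NodeReading",
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "nodeID", "type": "string" },
    { "name": "value", "type": "double" }
  ]
}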
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created in HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may also want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the Scheduling tab) from 0 sec, which means run as fast as possible, to something like 1 second, or whatever makes sense for you, so that each run grabs more data.
Even with the above, you would likely still want to put a MergeContent processor before PutHDFS and merge flow files together based on size, so that you can wait until you have an appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging. If you have Avro, there is a specific merge format for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
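For example, a hedged size-based MergeContent configuration for newline-delimited messages might look like this (the sizes are placeholders to tune against your HDFS block size):
Merge Strategy = Bin-Packing Algorithm
Merge Format = Binary Concatenation
Delimiter Strategy = Text
Demarcator = <new line>
Minimum Group Size = 128 MB
Max Bin Age = 10 min
Max Bin Age acts as the time bound, similar in spirit to Flume's batchDurationMillis: a bin is merged once it reaches the size threshold or once it has been open that long. For the JSON-array variant described above, you would instead set Header = [, Footer = ], and Demarcator = , (this assumes each flow file holds a single JSON object).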
