How can I transfer data from HDFS to Oracle using Apache NiFi?

I am new to NiFi and looking for information on using NiFi processors to reach a throughput of up to 100 MB/s.

First, you should use the GetHDFS processor to retrieve the HDFS file as a flowfile:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.11.4/org.apache.nifi.processors.hadoop.GetHDFS/index.html
To put data into Oracle, you can use the PutDatabaseRecord processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.4/org.apache.nifi.processors.standard.PutDatabaseRecord/
Between them, depending on your requirements, you can for example use ExecuteGroovyScript to transform your flowfile into a query.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-groovyx-nar/1.11.4/org.apache.nifi.processors.groovyx.ExecuteGroovyScript/index.html
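If you take the scripting route, here is a minimal ExecuteGroovyScript sketch. It assumes each incoming flowfile holds a single CSV line of the form id,name (for example after a SplitText step); the table my_table and its columns are illustrative placeholders, and PutDatabaseRecord is usually the more robust choice for real workloads:

```groovy
// Hedged sketch: turn one CSV line (id,name) into a single INSERT statement
// that a downstream PutSQL processor can execute. Table and column names
// are illustrative assumptions.
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def cols = inputStream.getText('UTF-8').trim().split(',')
    // naive quoting, for illustration only; prefer PutDatabaseRecord or
    // PutSQL with sql.args.N.* parameter attributes in production flows
    def stmt = "INSERT INTO my_table (id, name) VALUES (${cols[0]}, '${cols[1].replace("'", "''")}')"
    outputStream.write(stmt.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)
```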
All available processors: https://nifi.apache.org/docs.html

Related

Apache NiFi for data masking

We are using NiFi as our main data ingestion engine. NiFi is used to ingest data from multiple sources like databases, blob storage, etc., and all of the data is pushed to Kafka (with Avro as the serialization format). Now, one of the requirements is to mask specific fields (PII) in the input data.
Is NiFi a good tool to do that?
Does it have any processor to support data masking/obfuscation?
NiFi comes with the EncryptContent, CryptographicHashContent, and CryptographicHashAttribute processors, which can be used to encrypt or hash data, respectively.
I would look into these first.
In addition, ReplaceText can do simple masking. An ExecuteScript processor can perform custom masking, or a combination of UpdateRecord with a ScriptedRecordSetWriter can easily mask certain fields in a record.
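For the scripted option, here is a minimal ExecuteScript (Groovy) sketch. It assumes each flowfile has already been converted to a single JSON object (e.g. with ConvertRecord or ConvertAvroToJSON); the field names ssn and email are hypothetical placeholders for your PII fields:

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def record = new JsonSlurper().parse(inputStream)
    // hypothetical PII fields; swap in the ones you actually need to mask
    ['ssn', 'email'].each { field ->
        if (record[field] != null) {
            record[field] = '****'  // or hash the value instead of overwriting it
        }
    }
    outputStream.write(JsonOutput.toJson(record).getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)
```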

I want to ingest data using NiFi in two directions: one to HDFS and one to an Oracle database. Is it possible?

We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS (Oracle, e.g.) such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or HDFS (PutHDFS, e.g.). So you can get your data from multiple sources and send it to multiple targets with NiFi.
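In practice you do not need two subscriber programs: in NiFi you can route the same outgoing relationship to two different connections, and the framework clones the flowfile for each one. A rough sketch of such a fan-out (the processor choices here are illustrative):

```
[source / transform processors]
        |
        +-- success --> PutHDFS
        |
        +-- success --> PutDatabaseRecord  (Oracle, via a DBCPConnectionPool controller service)
```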

How to keep a Hive table in NiFi DistributedMapCache

I want to keep my Hive/MySQL table in the NiFi DistributedMapCache. Can someone please help me with an example?
Or please correct me if we cannot cache a Hive table in the NiFi cache at all.
Thanks
You can use the following chain of processors (see the configuration sketch after this list):

1. A SelectHiveQL processor to pull data from the Hive table, with the output format set to CSV and the header excluded.
2. A SplitText processor to split each line into an individual flowfile.
Note: if your flowfile is large, you have to use several SplitText processors in series to split the flowfile down to individual lines.
3. An ExtractText processor to extract the key attribute from the flowfile content.
4. A PutDistributedMapCache processor:
- Configure and enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
- Set the Cache Entry Identifier property to the attribute extracted by the ExtractText processor.
- You may need to increase the Max Cache Entry Size depending on the flowfile size.

To fetch the cached data, use the FetchDistributedMapCache processor with exactly the same identifier value that was cached by PutDistributedMapCache.

In the same way, if you want to load data from external sources and the data arrives in Avro format, use a ConvertRecord processor to convert Avro to CSV, then load the data into the distributed cache.

However, loading all the data into the DistributedMapCache is not a best practice for huge datasets; you can use the LookupRecord processor instead.
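A rough configuration sketch of the flow above; the regex, the attribute name "key", and the exact property values are illustrative assumptions, taking the cache key to be the first CSV column:

```
SelectHiveQL              Output Format = CSV, Header = false
      |
SplitText                 Line Split Count = 1
      |
ExtractText               key = ^([^,]*)      (captures the first CSV column
      |                                        into an attribute named "key")
PutDistributedMapCache    Cache Entry Identifier = ${key}
                          Distributed Cache Service = DistributedMapCacheClientService
```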

How to pass values dynamically from one processor to another using Apache NiFi

I want to pass one processor's result as input to another processor using Apache NiFi.
I am getting values from MySQL using the ExecuteSQL processor. I want to pass this result dynamically to the SelectHiveQL processor in Apache NiFi.
ExecuteSQL outputs a result set as Avro. If you would like to process each row individually, you can use SplitAvro then ConvertAvroToJSON, or ConvertAvroToJSON then SplitJson. At that point you can use EvaluateJsonPath to extract values into attributes (for use with NiFi Expression Language), and at some point you will likely want ReplaceText, where you set the content of the flowfile to a HiveQL statement (for use by SelectHiveQL).
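As a hedged sketch of the tail end of that flow (the JSON path, the attribute name "id", and the table name are hypothetical):

```
EvaluateJsonPath    Destination = flowfile-attribute
                    id = $.id                  (pulls a column value out of the row's JSON)
ReplaceText         Replacement Strategy = Always Replace
                    Evaluation Mode = Entire text
                    Replacement Value = SELECT * FROM my_hive_table WHERE id = ${id}
SelectHiveQL        HiveQL Select Query left empty, so the incoming flowfile
                    content is executed as the query
```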

Data aggregation in Apache NiFi

I am using Apache NiFi to process data from different sources, and I have independent pipelines created for each data flow. I want to combine this data for further processing. Is there any way I can aggregate the data and write it to a single file? The data is present in the form of flowfile attributes in NiFi.
You should use the MergeContent processor, which accepts configuration values for minimum/maximum batch size, etc., and combines a number of flowfiles into a single flowfile according to the configured merge strategy.
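A hedged starting configuration (the numbers are illustrative and should be tuned to your data volume). Note that MergeContent combines flowfile content, so if your data currently lives only in attributes, you may first need something like AttributesToJSON (Destination = flowfile-content) to write it into the content:

```
MergeContent    Merge Strategy             = Bin-Packing Algorithm
                Merge Format               = Binary Concatenation
                Minimum Number of Entries  = 1000
                Maximum Number of Entries  = 10000
                Max Bin Age                = 5 min
                Correlation Attribute Name = (optional attribute to group related flowfiles)
```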
