I want to keep my Hive/MySQL table in the NiFi DistributedMapCache. Can someone please help me with an example?
Or please correct me if there is no way to cache a Hive table in the NiFi cache.
Thanks
You can use the SelectHiveQL processor to pull data from the Hive table, with Output Format set to CSV and Include Header set to false.
Use the SplitText processor to split each line into an individual flowfile.
Note: if your flowfile is large, you will have to chain several SplitText processors in series to split the flowfile down to individual lines.
Use the ExtractText processor to extract the key attribute from the flowfile content.
Use the PutDistributedMapCache processor:
Configure and enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
Set the Cache Entry Identifier property to the attribute extracted by ExtractText.
You may need to increase the Max cache entry size depending on the flowfile size.
To fetch the cached data, use the FetchDistributedMapCache processor with exactly the same identifier value that was cached by PutDistributedMapCache.
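A minimal configuration sketch (the attribute name cache.key and the regex are illustrative, assuming the first CSV column is the key):

    ExtractText
      cache.key : ^([^,]+)                          # dynamic property: capture the first CSV column into cache.key

    PutDistributedMapCache
      Cache Entry Identifier       : ${cache.key}
      Distributed Cache Service    : DistributedMapCacheClientService
      Max cache entry size         : 1 MB           # raise this for larger rows

    FetchDistributedMapCache
      Cache Entry Identifier       : ${cache.key}   # must be exactly the value used when caching
      Distributed Cache Service    : DistributedMapCacheClientService
      Put Cache Value In Attribute : cached.row     # optional; leave empty to overwrite the flowfile content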
Similarly, if you want to load data from external sources and that data arrives in Avro format, use the ConvertRecord processor to convert Avro --> CSV, then load the data into the distributed cache.
However, loading all of the data into DistributedMapCache is not a best practice for huge datasets; you can also use the LookupRecord processor instead.
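For the Avro case, a minimal ConvertRecord configuration could look like this (controller-service names are illustrative):

    ConvertRecord
      Record Reader : AvroReader            # reads the schema embedded in the Avro data
      Record Writer : CSVRecordSetWriter    # writes the records out as CSV lines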
We are using NiFi as our main data ingestion engine. NiFi is used to ingest data from multiple sources such as databases, blob storage, etc., and all of the data is pushed to Kafka (with Avro as the serialization format). Now, one of the requirements is to mask specific fields (PII) in the input data.
Is NiFi a good tool for that?
Does it have any processors that support data masking/obfuscation?
NiFi comes with the EncryptContent, CryptographicHashContent, and CryptographicHashAttribute processors, which can be used to encrypt content, hash content, and hash attributes, respectively.
I would look into these first.
In addition, ReplaceText can do simple masking. An ExecuteScript processor can perform custom masking, or a combination of UpdateRecord with a ScriptedRecordSetWriter can easily mask certain fields in a record.
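If you go the ExecuteScript route, here is a minimal sketch using the Jython ("python") script engine. It assumes the flowfile content is a single JSON object and masks a hypothetical ssn field, keeping only the last four characters:

    # ExecuteScript (Script Engine: python / Jython)
    import json
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets
    from org.apache.nifi.processor.io import StreamCallback

    class MaskSsn(StreamCallback):
        def process(self, inputStream, outputStream):
            # read the incoming JSON content
            record = json.loads(IOUtils.toString(inputStream, StandardCharsets.UTF_8))
            if 'ssn' in record:
                # mask everything except the last four characters
                record['ssn'] = '*' * (len(record['ssn']) - 4) + record['ssn'][-4:]
            # write the masked JSON back out as the new flowfile content
            outputStream.write(bytearray(json.dumps(record).encode('utf-8')))

    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, MaskSsn())
        session.transfer(flowFile, REL_SUCCESS)

For record-oriented (Avro) data coming from Kafka, the UpdateRecord approach is usually preferable because it avoids splitting and re-parsing each message.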
I am new to NiFi and looking for information on using NiFi processors to achieve throughput of up to 100 MB/s.
First, you should use the GetHDFS processor to retrieve the HDFS file as a flowfile:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.11.4/org.apache.nifi.processors.hadoop.GetHDFS/index.html
To put data into Oracle, you can use the PutDatabaseRecord processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.4/org.apache.nifi.processors.standard.PutDatabaseRecord/
In between, it depends on your requirements; for example, you can use ExecuteGroovyScript to transform your flowfile into a query:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-groovyx-nar/1.11.4/org.apache.nifi.processors.groovyx.ExecuteGroovyScript/index.html
All available processors: https://nifi.apache.org/docs.html
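As a rough sketch of the two ends of the flow, assuming the HDFS files are already in a record-oriented format such as CSV (paths, table and service names are illustrative):

    GetHDFS
      Directory        : /data/incoming               # HDFS source directory
      Keep Source File : false

    PutDatabaseRecord
      Record Reader                       : CSVReader             # or AvroReader, depending on the file format
      Statement Type                      : INSERT
      Database Connection Pooling Service : OracleConnectionPool  # DBCPConnectionPool with the Oracle JDBC driver
      Table Name                          : TARGET_TABLE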
I need to ingest multiple CSV files, based on table names, into their respective Hive tables using Apache NiFi.
The data for table_address present in the source JSON file should go to table_address in Hive, and similarly for the other tables.
In short, records from the source JSON file need to be segregated into multiple CSV files, named tablename.csv, and loaded into their respective Hive tables.
Processors I am using:
ConsumeKafka ---> SplitJson ---> EvaluateJsonPath ---> UpdateAttribute ---> ReplaceText ---> PutFile
Records from the source JSON file, consumed from Kafka (GoldenGate trails), need to be segregated into multiple CSV files named tablename.csv and loaded into their respective Hive tables using an Apache NiFi flow.
You can use the PartitionRecord processor in NiFi.
Configure the Record Reader (JSON) and Record Writer (CSV) controller services.
The output flowfile will be in CSV format, and based on the partition column value you can store the data into Hive tables dynamically.
Flow:
ConsumeKafka -->
PartitionRecord (specify the partition field) -->
PutFile (or) PutHiveStreaming (or) PutHDFS (based on the value of the partition field)
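A minimal configuration sketch, assuming the table name is in a top-level JSON field called table_name (field and directory names are illustrative):

    PartitionRecord
      Record Reader : JsonTreeReader
      Record Writer : CSVRecordSetWriter
      table_name    : /table_name          # dynamic property: RecordPath to the partition field

    UpdateAttribute
      filename      : ${table_name}.csv    # PartitionRecord puts the partition value in an attribute of the same name

    PutFile
      Directory     : /data/csv_out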
I want to pass one processor's result as input to another processor using Apache NiFi.
I am getting values from MySQL using the ExecuteSQL processor. I want to pass this result dynamically to the SelectHiveQL processor in Apache NiFi.
ExecuteSQL outputs a result set as Avro. If you would like to process each row individually, you can use SplitAvro then ConvertAvroToJSON, or ConvertAvroToJSON then SplitJson. At that point you can use EvaluateJsonPath to extract values into attributes (for use with NiFi Expression Language), and at some point you will likely want ReplaceText, where you set the content of the flow file to a HiveQL statement (for use by SelectHiveQL).
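A minimal sketch of the last two steps, with a hypothetical customer_id column and Hive table name:

    EvaluateJsonPath
      Destination : flowfile-attribute
      customer_id : $.customer_id          # dynamic property: JsonPath into each row

    ReplaceText
      Replacement Strategy : Always Replace
      Replacement Value    : SELECT * FROM hive_table WHERE customer_id = ${customer_id}

    SelectHiveQL
      HiveQL Select Query  : (leave unset so the flowfile content is used as the query)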
I am using Apache NiFi to process data from different sources, and I have independent pipelines created for each data flow. I want to combine this data for further processing. Is there any way I can aggregate the data and write it to a single file? The data is present in the form of flowfile attributes in NiFi.
You should use the MergeContent processor, which accepts configuration values for min/max batch size, etc., and combines a number of flowfiles into a single flowfile according to the provided merge strategy.
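A minimal configuration sketch (values are illustrative). Since your data lives in flowfile attributes, you would typically run AttributesToJSON first so the attributes become content that MergeContent can concatenate:

    AttributesToJSON
      Destination               : flowfile-content

    MergeContent
      Merge Strategy            : Bin-Packing Algorithm
      Merge Format              : Binary Concatenation
      Minimum Number of Entries : 1000          # illustrative batch size
      Maximum Number of Entries : 10000
      Max Bin Age               : 5 min         # flush a partially filled bin after this long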