How can I guarantee the data sequence every time when fetching a delta table with the NiFi QueryDatabaseTable processor? The table has an incremental field called "SEQNUM", and "Maximum-value Columns" is set to "SEQNUM" in the QueryDatabaseTable processor. Is there any way to order the data fetched from the delta table?
Once you get the result flowfile from the QueryDatabaseTable processor,
use the QueryRecord processor and add a new SQL query with an ORDER BY clause in it.
By using the QueryRecord processor we make sure the SEQNUM values in each flowfile are ordered either ascending or descending.
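As a minimal sketch (assuming the column is named SEQNUM, as in the question), add a dynamic property to QueryRecord, for example one named "ordered", with a query like:

    SELECT * FROM FLOWFILE
    ORDER BY SEQNUM ASC

The flowfile routed to that "ordered" relationship will then contain the records sorted by SEQNUM.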
If you get more than one flowfile as the result of QueryDatabaseTable, use the MergeRecord processor to merge the flowfiles into one, then connect the merged relationship to the QueryRecord processor to order the data in the flowfile (but this is not the optimal way; instead of NiFi, consider Hive for these kinds of heavy lifts).
See the QueryRecord processor documentation for more details.
For incremental load we use the QueryDatabaseTable processor, which extracts data incrementally from one table. For writing a SQL query that extracts data from multiple tables we use the ExecuteSQL processor.
How can we do an incremental load for a join query?
If I understand what you're trying to do, in NiFi that's a lookup pattern, so you'd likely use LookupRecord with a DatabaseRecordLookupService. Each one of those would "join" the incremental load records with the rows from the table specified by the DatabaseRecordLookupService. For multiple joins you'd have a LookupRecord with a corresponding DatabaseRecordLookupService for each of them.
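As a rough illustration (the table and column names here are made up), the result is equivalent to a per-record join like the following, where the left side comes from the incremental QueryDatabaseTable fetch and the right side is the table configured in the DatabaseRecordLookupService:

    -- conceptual equivalent of the lookup pattern, not a query NiFi runs as-is
    SELECT o.*, c.customer_name
    FROM incremental_orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id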
I am new to NiFi and looking for information on using NiFi processors to get speeds up to 100 MB/s.
First, you should use the GetHDFS processor to retrieve the HDFS file as a flowfile.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hadoop-nar/1.11.4/org.apache.nifi.processors.hadoop.GetHDFS/index.html
To put data into Oracle, you can use the PutDatabaseRecord processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.11.4/org.apache.nifi.processors.standard.PutDatabaseRecord/
Between them, it depends on your requirements; you can use ExecuteGroovyScript, for example, to transform your flowfile into a query.
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-groovyx-nar/1.11.4/org.apache.nifi.processors.groovyx.ExecuteGroovyScript/index.html
All available processors: https://nifi.apache.org/docs.html
The problem is that I am not able to process a batch of JSON records that came as output of the QueryCassandra processor. I am able to process record by record using the SplitJson processor before PutDatabaseRecord.
I am trying to use a JsonPathReader in PutDatabaseRecord. How can I configure the PutDatabaseRecord processor or the JsonPathReader in order to process all records of the JSON at once?
I am trying to join multiple tables using NiFi. The data source may be MySQL or Redshift, or maybe something else in the future. Currently I am using the ExecuteSQL processor for this, but the output is in a single flowfile, so for terabytes of data this may not be suitable. I have also tried using GenerateTableFetch, but it doesn't have a join option.
Here are my questions:
Is there any alternative to the ExecuteSQL processor?
Is there a way to make the ExecuteSQL processor output multiple flowfiles? Currently I can split the output of ExecuteSQL using the SplitAvro processor, but I want ExecuteSQL itself to split the output.
GenerateTableFetch generates SQL queries based on offset. Will this slow down the process when the dataset becomes larger?
Please share your thoughts. Thanks in advance.
1. Is there any alternative to the ExecuteSQL processor?
If you are joining multiple tables, then you need to use the ExecuteSQL processor.
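For instance, a join query like the one below (table and column names are just placeholders) has to go into ExecuteSQL, since QueryDatabaseTable and GenerateTableFetch work against a single table:

    SELECT o.order_id, o.amount, c.customer_name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id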
2. Is there a way to make the ExecuteSQL processor output multiple flowfiles? Currently I can split the output of ExecuteSQL using the SplitAvro processor, but I want ExecuteSQL itself to split the output.
Starting from NiFi 1.8 we can configure the Max Rows Per Flow File property, so that the ExecuteSQL processor splits its output into multiple flowfiles.
NIFI-1251 addresses this issue.
3. GenerateTableFetch generates SQL queries based on offset. Will this slow down the process when the dataset becomes larger?
If your source table has indexes on the Maximum-value Columns, then it won't slow down the process even as your dataset grows larger.
If there are no indexes on the source table, then a full table scan will be done every time, which slows down the process.
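To see why the index matters: GenerateTableFetch emits a series of paged queries roughly like the sketch below (the exact syntax depends on the database adapter, and the table/column names are placeholders). Every page filters and orders on the maximum-value column, so an index on it lets the database avoid repeated full scans:

    SELECT * FROM my_table
    WHERE seqnum <= 200000
    ORDER BY seqnum
    LIMIT 10000 OFFSET 20000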
I want to keep my Hive/MySQL table in the NiFi DistributedMapCache. Can someone please help me with an example?
Or please correct me if we cannot cache a Hive table in the NiFi cache at all.
Thanks
You can use the SelectHiveQL processor to pull data from the Hive table, with the output format set to CSV and the header option set to false.
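For example (the table and column names here are assumptions for illustration), the HiveQL select query can pull just the key and value columns you intend to cache:

    SELECT cache_key_column, cache_value_column
    FROM my_hive_table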
Use the SplitText processor to split each line into an individual flowfile.
Note: if your flowfile is big, you will have to chain several SplitText processors in series to split the flowfile down to individual lines.
Use the ExtractText processor to extract the key attribute from the flowfile content.
Use the PutDistributedMapCache processor: configure and enable the DistributedMapCacheClientService and DistributedMapCacheServer controller services.
Set the Cache Entry Identifier property to the attribute you extracted with the ExtractText processor.
You may need to change the Max cache entry size depending on the flowfile size.
To fetch the cached data you can use the FetchDistributedMapCache processor; you need to use the exact same identifier value that was cached by PutDistributedMapCache.
In the same way, if you want to load data from external sources where the data is in Avro format, use the ConvertRecord processor to convert Avro to CSV and then load the data into the distributed cache.
However, loading all the data into the DistributedMapCache is not a best practice for huge datasets; you can use the LookupRecord processor instead.