Looking for an Equivalent of GenerateTableFetch - apache-nifi

I use ExecuteSQLRecord to run a query and write to CSV format. The table has 10M rows. Although I can split the output into multiple flow files, the query is executed by only a single thread and is very slow.
Is there a way to partition the query into multiple queries so that the next processor can run multiple concurrent tasks, each processing one partition? It would be like:
GenerateTableFetch -> ExecuteSQLRecord (with concurrent tasks)
The problem is that GenerateTableFetch only accepts table name as input. It does not accept customized queries.
Please advise if you have solutions. Thank you in advance.

You can increase the concurrency of NiFi processors by raising the number in the Concurrent Tasks setting. You can also increase the throughput; sometimes that helps.
Also, if you are working on a cluster, you can apply load balancing to the queue in front of the processor, so that the workload is distributed among the nodes of your cluster (set the load balance strategy to round robin).
Check the NiFi Notes YouTube channel for NiFi anti-patterns (there is a video on concurrency).
Please clarify your question if I didn't answer it.

I figured out an alternative way. I developed an Oracle PL/SQL function which takes a table name as an argument and produces a series of queries like "SELECT * FROM T1 OFFSET x ROWS FETCH NEXT 10000 ROWS ONLY". The number of queries is based on the number of rows in the table, which is a statistic stored in the catalog table. If the table has 1M rows and I want 100k rows in each batch, it produces 10 queries. I use ExecuteSQLRecord to call this function, which effectively does the job of the NiFi processor GenerateTableFetch. My next processor (e.g. ExecuteSQLRecord again) can now have 10 concurrent tasks working in parallel.
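The query generation described above can be sketched outside the database as follows. This is a minimal Python illustration, not the PL/SQL function itself: the function name `build_partition_queries` and the optional `order_by` parameter are assumptions added for this example. Note that OFFSET/FETCH paging without an ORDER BY does not guarantee a stable row order, so pages may overlap or miss rows; the `order_by` argument is one way to address that.

```python
import math


def build_partition_queries(table, total_rows, batch_size, order_by=None):
    """Return one OFFSET/FETCH query per batch of rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Optional ORDER BY keeps the paging deterministic across queries.
    order_clause = f" ORDER BY {order_by}" if order_by else ""
    num_batches = math.ceil(total_rows / batch_size)
    return [
        f"SELECT * FROM {table}{order_clause} "
        f"OFFSET {i * batch_size} ROWS FETCH NEXT {batch_size} ROWS ONLY"
        for i in range(num_batches)
    ]


# 1M rows in batches of 100k rows gives 10 queries, as in the example above.
queries = build_partition_queries("T1", 1_000_000, 100_000)
for query in queries:
    print(query)
```

Each generated query can then be emitted as its own flow file, so the downstream processor can work on them concurrently.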


How to find the optimum Spark-Athena file size

I have a Spark job that writes to an S3 bucket, and an Athena table on top of this location.
The table is partitioned. Spark was writing a single 1GB file per partition. We experimented with the maxRecordsPerFile option so that only 500MB of data is written per file. In the above case we ended up with 2 files of 500MB each.
This saved 15 minutes of run time on EMR.
However, there was a problem with Athena. Athena query CPU time got worse with the new file size limit.
I tried comparing the same data with the same query before and after execution and this is what I found:
Partition columns = source_system, execution_date, year_month_day
Query we tried:
select *
from dw.table
where source_system = 'SS1'
and year_month_day = '2022-09-14'
and product_vendor = 'PV1'
and execution_date = '2022-09-14'
and product_vendor_commission_amount is null
and order_confirmed_date is not null
and filter = 1
order by product_id
limit 100;
Execution time:
Before: 6.79s
After: 11.102s
Explain analyze showed that the new structure had to scan more data.
Before: CPU: 13.38s, Input: 2619584 rows (75.06MB), Data Scanned: 355.04MB; per task: std.dev.: 77434.54, Output: 18 rows (67.88kB)
After: CPU: 20.23s, Input: 2619586 rows (74.87MB), Data Scanned: 631.62MB; per task: std.dev.: 193849.09, Output: 18 rows (67.76kB)
Can you please explain why this takes almost twice as long? What are the things to look out for? Is there a sweet spot for file size that would be optimal for the Spark and Athena combination?
One hypothesis is that pushdown filters are more effective with the single file strategy.
From AWS Big Data Blog's post titled Top 10 Performance Tuning Tips for Amazon Athena:
Parquet and ORC file formats both support predicate pushdown (also
called predicate filtering). Both formats have blocks of data that
represent column values. Each block holds statistics for the block,
such as max/min values. When a query is being run, these statistics
determine whether the block should be read or skipped depending on the
filter value used in the query. This helps reduce data scanned and
improves the query runtime. To use this capability, add more filters
in the query (for example, using a WHERE clause).
One way to optimize the number of blocks to be skipped is to identify
and sort by a commonly filtered column before writing your ORC or
Parquet files. This ensures that the range between the min and max of
values within the block are as small as possible within each block.
This gives it a better chance to be pruned and also reduces data
scanned further.
To test this, I suggest running another experiment if possible. Change the Spark job to sort the data before persisting it into the two files. Use the following order:
source_system, execution_date, year_month_day, product_vendor, product_vendor_commission_amount, order_confirmed_date, filter and product_id. Then check the query statistics.
At least the dataset would then be optimised for the presented use case. Otherwise, change the order according to your heaviest queries.
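To see why sorting can help min/max pruning, here is a toy simulation in plain Python. It is not Parquet, ORC, or Athena code: the block size, the value range, and the helper names `block_stats` and `blocks_to_read` are all assumptions made up for this illustration. It only shows that sorted data yields narrow per-block ranges, so an equality filter has to touch fewer blocks.

```python
import random


def block_stats(values, block_size):
    """Split values into consecutive blocks and return (min, max) per block."""
    stats = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        stats.append((min(block), max(block)))
    return stats


def blocks_to_read(stats, target):
    """Count blocks whose [min, max] range could contain the target value."""
    return sum(1 for low, high in stats if low <= target <= high)


random.seed(0)
values = [random.randrange(1000) for _ in range(10_000)]

# Unsorted: nearly every block spans the whole value range, so few are skipped.
unsorted_reads = blocks_to_read(block_stats(values, 1000), 250)
# Sorted: each block covers a narrow range, so most blocks can be skipped.
sorted_reads = blocks_to_read(block_stats(sorted(values), 1000), 250)

print("blocks read, unsorted:", unsorted_reads)
print("blocks read, sorted:", sorted_reads)
```

The same reasoning applies per file: splitting unsorted data into more files does not narrow the ranges, while sorting first does.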
The post also discusses optimal file sizes and gives a general rule of thumb. In my experience, Spark works well with file sizes between 128MB and 2GB. That should also be fine for other query engines, such as Presto, which Athena uses.
My suggestion would be to break year_month_day/execution_date (as these are the columns mostly used in the queries) into Year, Month and Day partitions, which would reduce the amount of data scanned and make filtering more efficient.

How do we know when a flow is complete when multiple flowfiles are running in parallel?

I have a requirement where a template uses SQL as the source and SQL as the destination, and each table holds more than 100GB of data. The template is instantiated multiple times, once per table to be migrated, and each table is partitioned into multiple flowfiles. How do we know when the process is complete? Because there are multiple flowfiles, we cannot conclude that the process is finished when one of them reaches the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it provides counts per connection, and it is difficult to fetch the connection ID for each connection and then combine them, as we have a large number of templates. The reporting task has another problem: it provides data on all process groups on the NiFi canvas, which is a huge amount of data if all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas and help me to achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple flowfiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify process group level FlowFile concurrency (recommended, but only available from NiFi 1.12).

NiFi record counts

I am getting files from a remote server using NiFi. My files look as follows:
timestamp (ms), nodeID,value
Right now I just get and fetch the files, split the lines and send them to Kafka. But beforehand, I need to apply a checksum approach to my records and aggregate them based on timestamp. What do I need to do to add an additional column to my content and count the records based on aggregated timestamps, for example aggregating on each 10 milliseconds and nodeID?
timestamp (ms), nodeID,value, counts
How can I do the above in NiFi? I am totally new to NiFi but need to add the above functionality to my NiFi flow. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node ID, and from there the missing piece would be how to count by the timestamps.
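The missing counting step can be illustrated in plain Python. This is only a sketch of the aggregation logic under stated assumptions, not a NiFi processor: the function name `count_by_window`, the sample data, and the choice to floor each timestamp to the start of its 10 ms window are all assumptions made for this example. How the logic is wired into the flow is left open.

```python
import csv
import io
from collections import Counter


def count_by_window(csv_text, window_ms=10):
    """Count records per (window start, nodeID).

    Each input row is expected to be: timestamp (ms), nodeID, value.
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        timestamp_ms = int(row[0])
        node_id = row[1].strip()
        # Floor the timestamp to the start of its window, e.g. 1003 -> 1000.
        window_start = timestamp_ms - timestamp_ms % window_ms
        counts[(window_start, node_id)] += 1
    return counts


sample = "1000,n1,5\n1003,n1,7\n1012,n1,2\n1004,n2,9\n"
result = count_by_window(sample)
for (window_start, node_id), count in sorted(result.items()):
    print(window_start, node_id, count)
```

The counts could then be joined back onto the records as the extra column, or emitted as one summary record per window and nodeID, depending on what the downstream Kafka consumers expect.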
Some additional resources...

Pull Data from Hive to SQL Server without duplicates using Apache Nifi

Sorry, I'm new to Apache NiFi. I made a data flow that pulls data from Hive and stores it in SQL Server. There is no error in my data flow; the only problem is that it pulls the data repeatedly.
My data flow consists of the following:
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so that it pulls only 20 rows, or just the exact number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, so that it executes only once per very long time interval (days, years, etc.).

How does PutHiveQL work with batches?

I am trying to send multiple insert statements to PutHiveQL via a ReplaceText processor. Each insert statement is a flowfile coming out of ReplaceText. I set the batch size in PutHiveQL to 100. However, it seems it still sends 1 flowfile at a time. How can I best implement this batching?
I don't think the PutHiveQL processor batches statements at the JDBC layer as you expect, not in the way that processors like PutSQL do. From the code, it looks like the Batch Size property is used to control how many flowfiles the processor works on before yielding, but the statements for each flowfile are still executed individually.
That might be a good topic for a NiFi feature request.
The version of Hive supported by NiFi doesn't allow for batching/transactions. The Batch Size parameter is meant to try to move multiple incoming flow files a bit faster than having the processor invoked every so often. So if you schedule the PutHiveQL processor for every 5 seconds with a Batch Size of 100, then every 5 seconds (if there are 100 flow files queued), the processor will attempt to process those during one "session".
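The behaviour described above can be pictured with a small Python model. This is not NiFi or PutHiveQL code: `on_trigger`, the in-memory queue, and the `execute` callback are hypothetical names used only to show that a batch size limits how many flow files are picked up per invocation, while each statement is still executed on its own.

```python
from collections import deque


def on_trigger(queue, batch_size, execute):
    """Take up to batch_size statements from the queue and run each separately."""
    processed = 0
    while queue and processed < batch_size:
        # One statement per call; nothing is grouped into a single batch.
        execute(queue.popleft())
        processed += 1
    return processed


executed = []
queue = deque(f"INSERT INTO t VALUES ({i})" for i in range(250))

# One scheduled run with a batch size of 100.
handled = on_trigger(queue, 100, executed.append)
print(handled, "statements executed individually;", len(queue), "still queued")
```

In this model, a larger batch size only reduces how often the processor has to be scheduled; the number of individual statement executions stays the same.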
Alternatively you can specify a Batch Size of 0 or 1 and schedule it as fast as you like; unfortunately this will have no effect on the Hive side of things, as it auto-commits each HiveQL statement; the version of Hive doesn't support transactions or batching.
Another (possibly more performant) alternative is to put the entire set of rows as a CSV file into HDFS and use the HiveQL "LOAD DATA" DML statement to create a table on top of the data: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations
