Apache NiFi - Generic Execution Flows?

I'm trying to figure out whether the following scenarios are possible:
I have hundreds of tables that need to use the same flow, but with different intervals, different source hostnames, and different destinations.
How do I build such a flow? I also can't figure out how to use dynamic hosts/schemas/table names...
We maintain a table with all of this info, but how do we drive the execution from it with NiFi?
If I need to load a file onto multiple clusters (each table targeting different clusters) in parallel - how can this be achieved?
Thanks!

The solution I found is to use an external scheduler (like Airflow)
together with a ListenHTTP processor.
You can then send that listener any data you wish, parse it, and use it as parameters/attributes in the rest of the flow.
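For illustration, here is a minimal sketch of that idea (my own, not from the original answer): a small Python script, which could run as an Airflow task, reads rows from the control table mentioned in the question and POSTs each one to a ListenHTTP endpoint. The URL, port, base path, header names, and column names are assumptions; adapt them to your own ListenHTTP configuration.

```python
# Hedged sketch: push per-table parameters from an external scheduler to NiFi's
# ListenHTTP processor. The endpoint, port, base path ("contentListener" is the
# ListenHTTP default), and header names are assumptions for illustration.
import json
import requests

# Hypothetical rows from the control table described in the question; in practice
# these would come from a SQL query inside an Airflow task.
TABLES = [
    {"source_host": "db1.example.com", "schema": "sales", "table": "orders"},
    {"source_host": "db2.example.com", "schema": "crm", "table": "contacts"},
]

NIFI_LISTEN_URL = "http://nifi-host:8081/contentListener"  # ListenHTTP port + base path

for row in TABLES:
    # Send the parameters both as the flowfile content (JSON body) and as HTTP
    # headers; ListenHTTP can copy headers matching its "HTTP Headers to receive
    # as Attributes (Regex)" property onto the flowfile as attributes.
    response = requests.post(
        NIFI_LISTEN_URL,
        data=json.dumps(row),
        headers={
            "Content-Type": "application/json",
            "X-Source-Host": row["source_host"],
            "X-Schema": row["schema"],
            "X-Table": row["table"],
        },
        timeout=30,
    )
    response.raise_for_status()
```

Downstream you can parse the JSON body (e.g. with EvaluateJsonPath) or use the header-derived attributes directly in Expression Language for the rest of the flow.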

Related

How do we know when a flow is completed when we have multiple flowfiles running in parallel?

I have a requirement where a template uses SQL as the source and SQL as the destination, and the data is more than 100 GB per table. The template is instantiated multiple times based on the tables to be migrated, and each table is also partitioned into multiple flowfiles. How do we know when the process is completed? Because there are multiple flowfiles, we cannot conclude the migration is done just because one of them hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check queue counts, but it provides counts per connection, and it is difficult to fetch the connection ID for each connection and concatenate them, since we have a large number of templates. The reporting task has another problem: it reports on all process groups on the NiFi canvas, which is a huge amount of data when all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas to help me achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify pair of processors.
If you don't want multiple flowfiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify process-group-level flowfile concurrency (recommended, but only available in NiFi 1.12+).

How to wait for GenerateTableFetch queries to finish

My use case is like this: I have some X tables to be pulled from MySQL. I split them using SplitText to put each table in an individual flowfile, then pull them using GenerateTableFetch and ExecuteSQL.
I want to be notified, or trigger some other action, when the import is done for all the tables. At the SplitText processor I have routed the original relationship to Wait on ${filename} with a target count of ${fragment.count}. This tracks how many tables are done.
But I cannot figure out how to know when a particular table is done. GenerateTableFetch forks the flowfile into multiple flowfiles based on Partition Size, but it does not write attributes like fragment.count that I could wait on for each table.
Is there a way to achieve this? Or is there a way to know, at the end of the entire flow, that all flowfiles have been processed and nothing is queued or being processed?
If you have a standalone instance of NiFi (or are not distributing the flowfiles among a cluster of ExecuteSQL nodes), then you could use QueryDatabaseTable instead; by default it only transfers its flowfiles once the entire result set has been processed. If all the rows go into a single flowfile, then the fact that the flowfile has been transferred downstream is an indication that the fetch is complete.
I have written NIFI-5601 to cover the improvement of adding fragment.* attributes to the flowfiles generated by GenerateTableFetch.
Until NiFi adds support for this, I managed to make it work using MergeContent: use table_name as the Correlation Attribute Name, and then route the merged relationship to the Wait processor using ${merge.count} as the target count.

Flink metrics as CSV?

I'm trying to test different metrics on some Flink jobs. However, I find it pretty cumbersome to set up a metrics reporter. Currently I'm using Graphite, and for each job I need to manually select all the metrics I want and put them into different graphs to get an overview of whether the test is meaningful; if it is, I have to export all metrics individually into a CSV file and finally merge them again in another diagram (a custom diagram is mandatory, unfortunately).
Then I redeploy the job with different settings or parallelism, which changes the job ID, which means I need to put all the graphs together again, export those metrics, and so on...
Is there a more comfortable way to get the metrics of Flink jobs as a CSV file? It would be nice to get the metrics of certain operators (those IDs stay the same) as CSV, independent of the job ID or task manager ID.
With the InfluxDB reporter it's pretty easy to set up scope formats that make the metric naming really clean, and then you can run regular queries on the database to dump out groups of metrics. See https://github.com/jgrier/flink-stuff/tree/master/flink-influx-reporter.
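As a rough illustration of the "query the database and dump CSV" step (my own sketch, not taken from the linked reporter), the snippet below pulls one operator metric out of an InfluxDB 1.x database as CSV. The database name, measurement name, and tag names are assumptions; they depend on how the reporter's scope formats are configured in your setup.

```python
# Hedged sketch: dump a Flink operator metric from InfluxDB 1.x as CSV.
# Assumptions: the reporter writes into a database called "flink", the metric is
# stored in a measurement named "taskmanager_job_task_operator_numRecordsIn",
# and operators are tagged with "operator_name". Verify these against your own
# InfluxDB contents (e.g. with SHOW MEASUREMENTS / SHOW TAG KEYS).
import requests

INFLUX_URL = "http://influxdb-host:8086/query"

# SELECT * keeps the example independent of whether the metric's field is
# "value" (gauges) or "count" (counters).
QUERY = (
    "SELECT * "
    'FROM "taskmanager_job_task_operator_numRecordsIn" '
    "WHERE \"operator_name\" = 'MyOperator' AND time > now() - 1h"
)

# InfluxDB 1.x returns CSV directly when the Accept header asks for it.
resp = requests.get(
    INFLUX_URL,
    params={"db": "flink", "q": QUERY},
    headers={"Accept": "application/csv"},
    timeout=30,
)
resp.raise_for_status()

with open("numRecordsIn.csv", "w", encoding="utf-8") as f:
    f.write(resp.text)
```

Because the query keys on the operator name tag rather than the job ID, the same query keeps working after the job is redeployed with different settings or parallelism.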

Data aggregation in Apache NiFi

I am using Apache NiFi to process data from different sources, and I have independent pipelines created for each data flow. I want to combine this data for further processing. Is there any way I can aggregate the data and write it to a single file? The data is present in the form of flowfile attributes in NiFi.
You should use the MergeContent processor, which accepts configuration values for min/max batch size, etc. and combines a number of flowfiles into a single flowfile according to the provided merge strategy.

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
Andrew
