What is a reporting task in NiFi and how is it used? - hortonworks-data-platform

I have gone through the documentation link below:
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
But I still need a sample or workflow showing how a reporting task is used in NiFi.

A ReportingTask is a way to push information such as metrics and statistics out of NiFi to an external system. It is a global component that you define in the controller settings, similar to controller services, so it does not appear on the canvas.
The available reporting tasks are listed in the documentation, below the processors:
https://nifi.apache.org/docs.html
You can also look at the examples in the NiFi code:
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-reporting-tasks
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task
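To complement the docs, here is a minimal sketch (Python + requests) of creating and starting a reporting task through the NiFi REST API, since that is often the quickest way to see one in action. It assumes an unsecured NiFi on localhost:8080; the endpoint paths, the ControllerStatusReportingTask class name, and its "Show Deltas" property are written from memory, so verify them against what GET /flow/reporting-task-types and the docs return on your version:

```python
import requests

NIFI = "http://localhost:8080/nifi-api"

# Discover which reporting task implementations this NiFi install knows about.
types = requests.get(f"{NIFI}/flow/reporting-task-types").json()
for t in types["reportingTaskTypes"]:
    print(t["type"])

# Create a reporting task; ControllerStatusReportingTask just logs flow status,
# which makes it an easy one to experiment with.
payload = {
    "revision": {"version": 0},
    "component": {
        "type": "org.apache.nifi.controller.ControllerStatusReportingTask",
        "properties": {"Show Deltas": "true"},
    },
}
task = requests.post(f"{NIFI}/controller/reporting-tasks", json=payload).json()

# Start it; it now runs on its own schedule, independent of anything on the canvas.
requests.put(
    f"{NIFI}/reporting-tasks/{task['id']}/run-status",
    json={"revision": task["revision"], "state": "RUNNING"},
)
```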

Related

How To Run A NiFi Template As Part Of A DH Ingestion Flow

We have a NiFi template that we use to extract data from a relational SQL Server DB and ingest every row of the tables as documents in MarkLogic.
Now we want to add DH into the mix and run the NiFi template as part of the ingestion flows to populate the staging DB.
What is the recommended approach for calling the NiFi template as part of an ingestion flow?
Is there any other recommended approach for extracting the data from the relational DB and ingesting it into MarkLogic during an ingestion flow?
Thanks for any help.
The easiest integration here is to run your NiFi flow and use the PutMarkLogic processor to write data to ML; that processor can be found in the MarkLogic-NiFi connector at https://github.com/marklogic/nifi/releases. You can configure PutMarkLogic with a REST transform. You'll want to use the mlRunIngest transform documented at https://docs.marklogic.com/datahub/5.4/tools/rest/rest-extensions.html so that you can reference your ingestion step configuration while writing data to ML.
The NiFi community generally recommends that you replace Templates with NiFi Registry instead, which has versioning and deployment controls built in that are far more accessible than Templates.
Both Templates and Versioned Flows can be automated using the REST endpoints in NiFi & Registry, with variables recommended to be set using Parameter Contexts.
You can roll your own client for this, or you may wish to use my Python client NiPyAPI.
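If you go the NiPyAPI route, a minimal sketch looks roughly like this; the hostnames, ports, and process-group name are placeholders, and the exact call signatures should be checked against the NiPyAPI docs for your version:

```python
import nipyapi

# Point the client at your NiFi and NiFi Registry instances.
nipyapi.config.nifi_config.host = "http://localhost:8080/nifi-api"
nipyapi.config.registry_config.host = "http://localhost:18080/nifi-registry-api"

# Find a process group on the canvas by name and start everything in it.
pg = nipyapi.canvas.get_process_group("my-ingest-flow", identifier_type="name")
nipyapi.canvas.schedule_process_group(pg.id, scheduled=True)
```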

Apache NiFi deployment in production

I am new to Apache NiFi. From the documentation, I understand that NiFi is a framework with a drag-and-drop UI that helps to build data pipelines. A NiFi flow can be exported into a template, which is then saved in Git.
Are we supposed to import this template into the production NiFi server? Once imported, are we supposed to manually start all the processors via the UI? Please help.
Templates are just example flows to share with people and are not really meant for deployment. Please take a look at NiFi Registry and the concept of versioned flows.
https://nifi.apache.org/registry.html
https://www.youtube.com/watch?v=X_qhRVChjZY&feature=youtu.be
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
A template is an XML representation of your flow structure (processors, process groups, controllers, relationships, etc.). You can upload it to another NiFi server to deploy it, and you can also start all the processors via the nifi-api, as sketched below.
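For completeness, here is a rough Python sketch of those nifi-api calls against an unsecured NiFi on localhost:8080; the endpoint paths and response shapes are from memory of the NiFi REST API docs, and the file/template names are placeholders:

```python
import requests

NIFI = "http://localhost:8080/nifi-api"

# The root process group id is available from the flow resource.
root = requests.get(f"{NIFI}/flow/process-groups/root").json()
root_id = root["processGroupFlow"]["id"]

# 1. Upload the template XML exported from the other environment
#    (this endpoint answers with XML, so look the id up afterwards via /flow/templates).
with open("my_flow_template.xml", "rb") as f:
    requests.post(
        f"{NIFI}/process-groups/{root_id}/templates/upload",
        files={"template": ("my_flow_template.xml", f, "application/xml")},
    )
templates = requests.get(f"{NIFI}/flow/templates").json()["templates"]
template_id = next(t["id"] for t in templates
                   if t["template"]["name"] == "my_flow_template")  # name given at export time

# 2. Instantiate the template onto the canvas.
requests.post(
    f"{NIFI}/process-groups/{root_id}/template-instance",
    json={"templateId": template_id, "originX": 0.0, "originY": 0.0},
)

# 3. Start every processor in the group.
requests.put(
    f"{NIFI}/flow/process-groups/{root_id}",
    json={"id": root_id, "state": "RUNNING"},
)
```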

Kylo and NiFi usage for ETL

We have started to explore and use NiFi for data flow as a basic ETL tool.
We got to know about Kylo as a data-lake-specific tool which works on top of NiFi.
Are there any industry usages or patterns where Kylo is being used, or any article giving its use cases and when to prefer it over custom Hadoop components like NiFi/Spark?
Please take a look at the following two resources:
1) Kylo's website: The home page lists domains where Kylo is being used.
2) Kylo FAQs: Useful information that can help you understand Kylo's architecture and how it compares with other tools.
Kylo is designed to work with NiFi and Spark and does not replace them. You can build custom Spark jobs and execute them via the ExecuteSparkJob NiFi processor provided by Kylo; a small example of such a job is sketched below.
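To make that concrete, here is an illustrative PySpark job of the sort ExecuteSparkJob can launch; the paths and the trivial standardization logic are placeholders, and the script is submitted like any other Spark application:

```python
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("kylo-style-standardization").getOrCreate()

    # Read raw landed data, apply a trivial standardization step, write it back out.
    raw = spark.read.option("header", "true").csv("hdfs:///data/landing/orders/")
    cleaned = raw.dropDuplicates().na.fill("UNKNOWN")
    cleaned.write.mode("overwrite").parquet("hdfs:///data/standardized/orders/")

    spark.stop()

if __name__ == "__main__":
    main()
```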

Hadoop job statistics using YARN Resource Manager REST API + Elasticsearch + Kibana

My goal is to provide a Hadoop job statistics web UI for administrative users.
I use a Hortonworks Hadoop 2 cluster, and jobs run on YARN.
From an architecture perspective, I am planning to collect job-related information (such as start time, end time, mappers, etc.) from the YARN Resource Manager REST API as a scheduled cron job >> index it into Elasticsearch >> show it in Kibana.
I wonder if there is a better way to do this.
Have you looked into Ambari? It provides metrics, dashboards, and alerting without having to create the framework from scratch.
Apache Ambari
Ambari provides statistics at the infrastructure level, not at the job level. So you need to write custom code against the YARN REST API, which gives you a JSON response; you can then parse the JSON and pull out the exact details you need. I have written one in Python; you can refer to this link: https://dzone.com/articles/customized-alerts-for-hadoop-jobs-using-yarn-rest
http://thelearnguru.com/customized-alerts-for-hadoop-jobs-using-yarn-rest-api
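Building on both answers, here is a rough Python sketch of the cron-driven collector described in the question: pull finished applications from the ResourceManager REST API and index each one into Elasticsearch for Kibana. The hostnames, index name, and field selection are assumptions, and per-job mapper counts would come from the MapReduce History Server rather than the RM:

```python
import requests

RM = "http://resourcemanager.example.com:8088"
ES = "http://elasticsearch.example.com:9200"

# The ResourceManager exposes applications as JSON at /ws/v1/cluster/apps.
apps = requests.get(
    f"{RM}/ws/v1/cluster/apps",
    params={"states": "FINISHED,FAILED,KILLED"},
).json()["apps"]["app"]

for app in apps:
    doc = {
        "id": app["id"],
        "name": app["name"],
        "user": app["user"],
        "queue": app["queue"],
        "state": app["state"],
        "finalStatus": app["finalStatus"],
        "startedTime": app["startedTime"],
        "finishedTime": app["finishedTime"],
        "elapsedMillis": app["elapsedTime"],
    }
    # Use the application id as the document id so re-runs of the cron job
    # update existing documents instead of duplicating them.
    requests.put(f"{ES}/yarn-jobs/_doc/{app['id']}", json=doc)
```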

Sensu AWS plugin to get EC2 metrics for instances under a load balancer

I have been trying to write an AWS Sensu plugin which will get the instance IDs of all the healthy instances under a load balancer, then get the stats for each of the instances, such as CPU utilization, network in, and network out, and generate graphs using Graphite and Grafana.
I searched the open-source plugins in the Sensu community but could not find any. Is it possible to write a script or plugin for this? Or has anyone done it before?
Kindly help me out.
I don't believe a Sensu-specific plugin exists for this. However, since Sensu can run any Nagios plugin, you could use one of those: this one looks like it would get basic information on how many hosts are healthy. You could also write your own plugin in your language of choice (check out the available SDKs) to get more detailed metrics for each of the instances; a rough sketch is below.
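As a starting point, here is a rough Python/boto3 sketch of such a check (Sensu can execute any script): list the InService instances behind a classic ELB, pull CPUUtilization for each from CloudWatch, and print Graphite plaintext that a Graphite handler can forward. The load balancer name is a placeholder, and NetworkIn/NetworkOut can be fetched the same way:

```python
from datetime import datetime, timedelta

import boto3

ELB_NAME = "my-load-balancer"  # placeholder

elb = boto3.client("elb")
cloudwatch = boto3.client("cloudwatch")

# Healthy (InService) instances currently registered with the load balancer.
states = elb.describe_instance_health(LoadBalancerName=ELB_NAME)["InstanceStates"]
healthy = [s["InstanceId"] for s in states if s["State"] == "InService"]

now = datetime.utcnow()
for instance_id in healthy:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",  # repeat for NetworkIn / NetworkOut as needed
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    for point in stats["Datapoints"]:
        ts = int(point["Timestamp"].timestamp())
        # Graphite plaintext protocol: "metric.path value timestamp"
        print(f"aws.{ELB_NAME}.{instance_id}.cpu_util {point['Average']} {ts}")
```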
I wrote a plugin to do the same. It used to work fine back then; I haven't tested it on the newer version of the API. Let me know if you face any problems and I will help fix it.
