We have started to explore and use NiFi for data flow as a basic ETL tool.
We have also come across Kylo, a data lake-specific tool that works on top of NiFi.
Are there any industry usage patterns where Kylo is being used, or any articles describing its use cases and when to prefer it over custom Hadoop components like NiFi/Spark?
Please take a look at the following two resources:
1) Kylo's website: The home page lists domains where Kylo is being used.
2) Kylo FAQs: Useful information that can help you understand Kylo's architecture and comparison with other tools.
Kylo is designed to work with NiFi and Spark, and does not replace them. You can build custom Spark jobs and execute them via the ExecuteSparkJob NiFi processor provided by Kylo.
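For a sense of what that looks like in practice, here is a minimal sketch of a PySpark job that an ExecuteSparkJob processor could submit; the paths and column names are made-up placeholders, not anything Kylo ships with:

```python
# spark_cleanse.py -- minimal PySpark job sketch that an ExecuteSparkJob
# processor could submit; HDFS paths and the "event_id" column are
# hypothetical placeholders.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("kylo-example-cleanse").getOrCreate()

    # Read raw ingested data (hypothetical path).
    raw = spark.read.parquet("hdfs:///data/raw/events")

    # A trivial transformation standing in for real cleansing logic.
    cleansed = raw.dropDuplicates().na.drop(subset=["event_id"])

    # Write the result for downstream steps (hypothetical path).
    cleansed.write.mode("overwrite").parquet("hdfs:///data/cleansed/events")
    spark.stop()

if __name__ == "__main__":
    main()
```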
We have an existing ETL pipeline and are exploring the possibility of migrating it to NiFi.
The pipeline contains jobs written in Python/Scala that do a lot of ingestion and transformation.
Which NiFi processor allows me to put Python/Scala code in it?
Thank you very much.
Migrating your entire codebase to NiFi is pretty much missing the point of NiFi. It sounds like what you're looking for is a workflow engine like Apache Oozie or Apache Airflow.
If you really want to migrate to NiFi, try replacing your code logic with NiFi processors. It will look better and will be easier to maintain.
I do have to warn you, it's not easy if you don't know NiFi well. You should learn how to use NiFi first. Once you do, you won't regret it :)
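That said, if you do need to keep a small piece of Python logic inside a flow during the migration, NiFi's ExecuteScript processor can run Jython. A minimal sketch of a script body (the attribute name is just an illustrative placeholder):

```python
# ExecuteScript (Jython) body sketch -- NiFi binds `session` and the
# relationships (REL_SUCCESS, REL_FAILURE) for you; the attribute below
# is only an illustrative placeholder.
flowFile = session.get()
if flowFile is not None:
    # Tag the FlowFile so downstream processors can route on it.
    flowFile = session.putAttribute(flowFile, "migrated.step", "legacy-python-logic")
    session.transfer(flowFile, REL_SUCCESS)
```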
I am new to Apache NiFi. From the documentation, I understand that NiFi is a framework with a drag-and-drop UI that helps build data pipelines. A NiFi flow can be exported as a template, which is then saved in Git.
Are we supposed to import this template into the production NiFi server? Once imported, are we supposed to manually start all the processors via the UI? Please help.
Templates are just example flows to share with people and are not really meant for deployment. Please take a look at NiFi Registry and the concept of versioned flows.
https://nifi.apache.org/registry.html
https://www.youtube.com/watch?v=X_qhRVChjZY&feature=youtu.be
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
A template is an XML representation of your flow structure (processors, process groups, controller services, relationships, etc.). You can upload it to another NiFi server to deploy it. You can also start all the processors through the nifi-api, as in the sketch below.
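A rough sketch of starting a process group through the REST API with Python's requests library; the host, port, and process group id are assumptions you would adjust, and a secured NiFi needs authentication on top of this:

```python
# Rough sketch: schedule every component in a process group to RUNNING
# via the NiFi REST API. Assumes an unsecured NiFi at localhost:8080;
# "root" may need to be replaced with the actual UUID of your process group.
import requests

NIFI_API = "http://localhost:8080/nifi-api"
process_group_id = "root"

resp = requests.put(
    f"{NIFI_API}/flow/process-groups/{process_group_id}",
    json={"id": process_group_id, "state": "RUNNING"},
)
resp.raise_for_status()
print("Process group scheduled to RUNNING")
```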
I have gone through the documentation link below:
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
But I still need a sample or a workflow showing how it is used in NiFi.
A ReportingTask is a way to push information like metrics and statistics out of NiFi to an external system. It is a global component that you define in the controller settings, similar to controller services, so it is not on the canvas.
The available reporting tasks are in the documentation below the processors:
https://nifi.apache.org/docs.html
You can also look at the examples in the NiFi code:
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-reporting-tasks
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task
My goal is to provide a web UI with Hadoop job statistics for administrative users.
I use a Hortonworks Hadoop 2 cluster, and jobs run on YARN.
From an architecture perspective, I am planning to collect job-related information (such as start time, end time, mappers, etc.) from the YARN ResourceManager REST API via a scheduled cron job, index it into Elasticsearch, and show it in Kibana.
I wonder if there is a better way to do this.
Have you looked into Ambari? It provides metrics, dashboards, and alerting without having to create the framework from scratch.
Ambari provides statistics at the infrastructure level, not at the job level. So you need to write custom code against the YARN REST API, which returns a JSON response; you can then parse the JSON and get the exact details. I have written one in Python; you can refer to these links:
https://dzone.com/articles/customized-alerts-for-hadoop-jobs-using-yarn-rest
http://thelearnguru.com/customized-alerts-for-hadoop-jobs-using-yarn-rest-api
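To make the approach concrete, here is a rough, cron-able sketch along those lines: it pulls finished applications from the ResourceManager REST API and indexes them into Elasticsearch over its HTTP API. Host names, ports, and the index name are placeholders for your cluster:

```python
# Sketch: pull finished application stats from the YARN ResourceManager
# REST API and index them into Elasticsearch for Kibana. Hosts, ports,
# and the "yarn-apps" index name are placeholders.
import requests

RM_URL = "http://resourcemanager:8088/ws/v1/cluster/apps"
ES_URL = "http://elasticsearch:9200/yarn-apps/_doc"

apps = requests.get(RM_URL, params={"states": "FINISHED,FAILED,KILLED"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    doc = {
        "id": app["id"],
        "name": app["name"],
        "user": app["user"],
        "state": app["state"],
        "finalStatus": app["finalStatus"],
        "startedTime": app["startedTime"],
        "finishedTime": app["finishedTime"],
        "memorySeconds": app.get("memorySeconds"),
        "vcoreSeconds": app.get("vcoreSeconds"),
    }
    # Use the application id as the document id so re-runs are idempotent.
    requests.put(f"{ES_URL}/{app['id']}", json=doc).raise_for_status()
```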
I have a few Hive jobs and MapReduce programs running in my cluster. I am able to check general resource utilization in Ambari, but I want to see the resources utilized by individual applications. Is that possible through the Ambari API? Can you provide some clues?
To my knowledge, the metrics provided by Ambari are for the whole cluster.
But you can check the MapReduce2 Job History UI; it seems like that is what you are looking for. Check this link out; there is a more detailed description there:
http://hortonworks.com/blog/elephants-can-remember-mapreduce-job-history-in-hdp-2-0/
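If you want to pull those per-job numbers programmatically rather than through the UI, the Job History Server also exposes a REST API. A sketch along those lines (the host, port, and the exact JSON field and counter names are assumptions to verify against your HDP version):

```python
# Sketch: read per-job counters (CPU time, physical memory) from the
# MapReduce Job History Server REST API. Adjust the host/port; the JSON
# field and counter names here are from memory and should be verified.
import requests

HISTORY_URL = "http://historyserver:19888/ws/v1/history/mapreduce"

jobs = (requests.get(f"{HISTORY_URL}/jobs").json().get("jobs") or {}).get("job", [])
for job in jobs:
    counters = requests.get(f"{HISTORY_URL}/jobs/{job['id']}/counters").json()
    for group in counters["jobCounters"]["counterGroup"]:
        if group["counterGroupName"].endswith("TaskCounter"):
            for counter in group["counter"]:
                if counter["name"] in ("CPU_MILLISECONDS", "PHYSICAL_MEMORY_BYTES"):
                    print(job["id"], counter["name"], counter["totalCounterValue"])
```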