We have started to explore and use NiFi for data flow as a basic ETL tool.
We have also come across Kylo, a data lake-specific tool that works on top of NiFi.
Are there any industry usage patterns where Kylo is being used, or any articles describing its use cases and when to prefer it over custom Hadoop components like NiFi/Spark?
Please take a look at the following two resources:
1) Kylo's website: The home page lists domains where Kylo is being used.
2) Kylo FAQs: Useful information that can help you understand Kylo's architecture and comparison with other tools.
Kylo is designed to work with NiFi and Spark, and does not replace them. You can build custom Spark jobs and execute them via the ExecuteSparkJob NiFi processor provided by Kylo.
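For a sense of what that looks like in practice, here is a minimal sketch of a PySpark job that an ExecuteSparkJob processor could submit; the paths and column names are made-up placeholders, not anything Kylo ships with:

```python
# spark_cleanse.py -- minimal PySpark job sketch that an ExecuteSparkJob
# processor could submit; HDFS paths and the "event_id" column are
# hypothetical placeholders.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("kylo-example-cleanse").getOrCreate()

    # Read raw ingested data (hypothetical path).
    raw = spark.read.parquet("hdfs:///data/raw/events")

    # A trivial transformation standing in for real cleansing logic.
    cleansed = raw.dropDuplicates().na.drop(subset=["event_id"])

    # Write the result for downstream steps (hypothetical path).
    cleansed.write.mode("overwrite").parquet("hdfs:///data/cleansed/events")
    spark.stop()

if __name__ == "__main__":
    main()
```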
We have an existing ETL pipeline and are exploring the possibility of migrating it to NiFi.
The pipeline contains jobs written in Python/Scala that do a lot of ingestion and transformation.
Which NiFi processor allows me to put Python/Scala code in it?
Thank you very much.
Migrating your entire codebase to NiFi is pretty much missing the point of NiFi. It sounds like what you're looking for is a workflow engine like Apache Oozie or Apache Airflow.
If you really want to migrate to NiFi, try replacing your code logic with NiFi processors. It will look better and will be easier to maintain.
I do have to warn you, it's not easy if you don't know NiFi well. You should learn how to use NiFi first. Once you do, you won't regret it :)
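That said, if you do need to keep a small piece of Python logic inside a flow during the migration, NiFi's ExecuteScript processor can run Jython. A minimal sketch of a script body (the attribute name is just an illustrative placeholder):

```python
# ExecuteScript (Jython) body sketch -- NiFi binds `session` and the
# relationships (REL_SUCCESS, REL_FAILURE) for you; the attribute below
# is only an illustrative placeholder.
flowFile = session.get()
if flowFile is not None:
    # Tag the FlowFile so downstream processors can route on it.
    flowFile = session.putAttribute(flowFile, "migrated.step", "legacy-python-logic")
    session.transfer(flowFile, REL_SUCCESS)
```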
I am new to Apache NiFi. From the documentation, I understand that NiFi is a framework with a drag-and-drop UI that helps build data pipelines. A NiFi flow can be exported as a template, which is then saved in Git.
Are we supposed to import this template into the production NiFi server? Once imported, are we supposed to manually start all the processors via the UI? Please help.
Templates are just example flows to share with people and are not really meant for deployment. Please take a look at NiFi Registry and the concept of versioned flows.
https://nifi.apache.org/registry.html
https://www.youtube.com/watch?v=X_qhRVChjZY&feature=youtu.be
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
A template is an XML representation of your flow structure (processors, process groups, controller services, relationships, etc.). You can upload it to another NiFi server to deploy it. You can also start all the processors through the nifi-api, as in the sketch below.
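A rough sketch of starting a process group through the REST API with Python's requests library; the host, port, and process group id are assumptions you would adjust, and a secured NiFi needs authentication on top of this:

```python
# Rough sketch: schedule every component in a process group to RUNNING
# via the NiFi REST API. Assumes an unsecured NiFi at localhost:8080;
# "root" may need to be replaced with the actual UUID of your process group.
import requests

NIFI_API = "http://localhost:8080/nifi-api"
process_group_id = "root"

resp = requests.put(
    f"{NIFI_API}/flow/process-groups/{process_group_id}",
    json={"id": process_group_id, "state": "RUNNING"},
)
resp.raise_for_status()
print("Process group scheduled to RUNNING")
```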
I have gone through the documentation link below:
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
But I still need a sample or a workflow showing how it is used in NiFi.
A ReportingTask is a way to push information like metrics and statistics out of NiFi to an external system. It is a global component that you define in the controller settings, similar to controller services, so it is not on the canvas.
The available reporting tasks are in the documentation below the processors:
https://nifi.apache.org/docs.html
You can also look at the examples in the NiFi code:
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-reporting-tasks
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task
My goal is to provide a web UI with Hadoop job statistics for administrative users.
I use a Hortonworks Hadoop 2 cluster, and jobs run on YARN.
From an architecture perspective, I am planning to collect job-related information (such as start time, end time, mappers, etc.) from the YARN ResourceManager REST API via a scheduled cron job, index it into Elasticsearch, and show it in Kibana.
I wonder if there is a better way to do this.
Have you looked into Ambari? It provides metrics, dashboards, and alerting without having to create the framework from scratch.
Ambari provides statistics at the infrastructure level, not at the job level. So you need to write custom code against the YARN REST API, which returns a JSON response; you can then parse the JSON and get the exact details. I have written one in Python; you can refer to these links:
https://dzone.com/articles/customized-alerts-for-hadoop-jobs-using-yarn-rest
http://thelearnguru.com/customized-alerts-for-hadoop-jobs-using-yarn-rest-api
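To make the approach concrete, here is a rough, cron-able sketch along those lines: it pulls finished applications from the ResourceManager REST API and indexes them into Elasticsearch over its HTTP API. Host names, ports, and the index name are placeholders for your cluster:

```python
# Sketch: pull finished application stats from the YARN ResourceManager
# REST API and index them into Elasticsearch for Kibana. Hosts, ports,
# and the "yarn-apps" index name are placeholders.
import requests

RM_URL = "http://resourcemanager:8088/ws/v1/cluster/apps"
ES_URL = "http://elasticsearch:9200/yarn-apps/_doc"

apps = requests.get(RM_URL, params={"states": "FINISHED,FAILED,KILLED"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    doc = {
        "id": app["id"],
        "name": app["name"],
        "user": app["user"],
        "state": app["state"],
        "finalStatus": app["finalStatus"],
        "startedTime": app["startedTime"],
        "finishedTime": app["finishedTime"],
        "memorySeconds": app.get("memorySeconds"),
        "vcoreSeconds": app.get("vcoreSeconds"),
    }
    # Use the application id as the document id so re-runs are idempotent.
    requests.put(f"{ES_URL}/{app['id']}", json=doc).raise_for_status()
```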
I have a few Hive jobs and MapReduce programs running in my cluster. I am able to check general resource utilization in Ambari, but I want to see the resources utilized by individual applications. Is that possible through the Ambari API? Can you provide some clues?
To my knowledge, the metrics provided by Ambari are for the whole cluster.
But you can check the MapReduce2 Job History UI; it seems like that is what you are looking for. Check this link out; there is a more detailed description there:
http://hortonworks.com/blog/elephants-can-remember-mapreduce-job-history-in-hdp-2-0/
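If you want to pull those per-job numbers programmatically rather than through the UI, the Job History Server also exposes a REST API. A sketch along those lines (the host, port, and the exact JSON field and counter names are assumptions to verify against your HDP version):

```python
# Sketch: read per-job counters (CPU time, physical memory) from the
# MapReduce Job History Server REST API. Adjust the host/port; the JSON
# field and counter names here are from memory and should be verified.
import requests

HISTORY_URL = "http://historyserver:19888/ws/v1/history/mapreduce"

jobs = (requests.get(f"{HISTORY_URL}/jobs").json().get("jobs") or {}).get("job", [])
for job in jobs:
    counters = requests.get(f"{HISTORY_URL}/jobs/{job['id']}/counters").json()
    for group in counters["jobCounters"]["counterGroup"]:
        if group["counterGroupName"].endswith("TaskCounter"):
            for counter in group["counter"]:
                if counter["name"] in ("CPU_MILLISECONDS", "PHYSICAL_MEMORY_BYTES"):
                    print(job["id"], counter["name"], counter["totalCounterValue"])
```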