Migrating existing ETL to NiFi: what processor should I choose? - apache-nifi

We have an existing ETL pipeline and are exploring the possibility of migrating it to NiFi.
Our pipeline contains jobs written in Python/Scala that do a lot of ingestion and transformation.
Which NiFi processor allows me to put Python/Scala code in it?
Thank you very much.

Migrating your entire codebase to NiFi pretty much misses the point of NiFi. It sounds like what you're looking for is a workflow engine such as Apache Oozie or Apache Airflow.
If you really want to migrate to NiFi, try replacing your code logic with NiFi processors. It will look better and will be easier to maintain.
I do have to warn you, it's not easy if you don't know NiFi well. You should learn how to use NiFi first. After you do, you won't regret it :)
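That said, to answer the literal question: NiFi's ExecuteScript processor accepts inline scripts (Groovy, Jython, JavaScript, JRuby, Lua, Clojure), and ExecuteStreamCommand can invoke any external program, including your existing Python or Scala jobs. As a minimal sketch, here is an ExecuteScript body using the Jython engine; the uppercasing transform is just a placeholder for your own logic:

    # Jython script body for NiFi's ExecuteScript processor.
    # NiFi injects `session`, `REL_SUCCESS`, and `REL_FAILURE` into scope.
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets
    from org.apache.nifi.processor.io import StreamCallback

    class UppercaseContent(StreamCallback):
        # Placeholder transform: read the FlowFile content and uppercase it.
        def process(self, inputStream, outputStream):
            text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            outputStream.write(text.upper().encode('utf-8'))

    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, UppercaseContent())
        session.transfer(flowFile, REL_SUCCESS)

Keep in mind that ExecuteScript runs Jython, not CPython, so C extensions like numpy won't work there; for those, call your existing script through ExecuteStreamCommand instead.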

Related

Kylo and NiFi usage for ETL

We have started to explore and use NiFi for data flow as a basic ETL tool.
We learned about Kylo as a data-lake-specific tool that works on top of NiFi.
Are there any industry usages or patterns where Kylo is being applied, or any article giving its use cases and when to prefer it over custom Hadoop components like NiFi/Spark?
Please take a look at the following two resources:
1) Kylo's website: The home page lists domains where Kylo is being used.
2) Kylo FAQs: Useful information that can help you understand Kylo's architecture and how it compares with other tools.
Kylo is designed to work with NiFi and Spark and does not replace them. You can build custom Spark jobs and execute them via the ExecuteSparkJob NiFi processor provided by Kylo.
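For illustration, here is a minimal PySpark job of the kind you might submit through ExecuteSparkJob; the application name, HDFS paths, and the status column are made up for the example:

    # Minimal PySpark job; ExecuteSparkJob hands a script like this to spark-submit.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-events").getOrCreate()

    # Hypothetical paths: read raw JSON events, keep valid rows, write Parquet.
    events = spark.read.json("hdfs:///landing/events")
    events.filter(events.status == "OK") \
          .write.mode("overwrite").parquet("hdfs:///clean/events")

    spark.stop()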

Running a non-MapReduce program in Hadoop

I have a program written in NetBeans. The program reads data from Cassandra and writes the result back into it. My program is not MapReduce at all. I built a .jar file from it. Now, I want to know if I can execute it in Hadoop?
Actually, I want to know: can I run a non-MapReduce program in Hadoop?
You could architect this program to run on Hadoop v2 as a YARN application. This would require re-architecting your application to fit the YARN paradigm. An example of how to do this is given here: Writing App Framework on Yarn
This is not a simple exercise. Also, if you are interested in using Hadoop, I would consider simply rewriting your application to use HBase (another NoSQL columnar database and a competitor to Cassandra), which is built specifically for Hadoop: it stores its data in HDFS and plugs directly into MapReduce when you need batch processing.
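To make the HBase suggestion concrete, here is a minimal non-MapReduce client sketch, assuming the happybase Python library and a running HBase Thrift server; the host, table, and column names are hypothetical:

    # Plain HBase client access, no MapReduce involved.
    import happybase

    conn = happybase.Connection('hbase-thrift-host')  # hypothetical host
    table = conn.table('results')                     # hypothetical table

    # Write one cell, then read the row back.
    table.put(b'row-1', {b'cf:value': b'42'})
    print(table.row(b'row-1'))

    conn.close()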
This question is ages old but has never been answered. Anyhow, two projects are looking into this issue:
Apache Slider (incubating): http://slider.incubator.apache.org/
and
Apache Myriad (incubating): http://myriad.incubator.apache.org/
Slider is mainly sponsored by Hortonworks, while Myriad is a MapR/Mesosphere project with substantial assistance from PayPal.

Workflow tool comparison: Oozie vs. Cascading

I am looking for a workflow tool to run complex MapReduce jobs. I have Oozie in mind but also want to explore Cascading. Is there any sample code or example that chains existing M/R jobs using the Cascading API? Also, can you provide a comparison of Oozie vs. Cascading?
Cascading and Oozie are not in the same category.
Oozie is a workflow scheduler.
Cascading is an API for creating workflows. It is agnostic about schedulers, i.e., it should run with whatever scheduler system you use.
There is perhaps some confusion because the Oozie docs mention a "DAG", and both run atop Hadoop.
Also, Cascading has a notion of "data availability" in its checkpoint support; Oozie supports something similar, albeit differently.
Personally, I have played around with both to some extent. What I found interesting about Cascading is:
1) concise and expressive in terms of simple keywords like flow, tap, pipe, etc.
2) an amazing TDD-based approach for local development and research
3) a nice planner view (.dot file) that becomes useful once the project has grown, so maintenance is easy
4) a DSL-based approach using Groovy, Scala, or Clojure, so there is no need to worry about learning a new language (or, rather, Hadoop)
5) simple cloud deployment (e.g., Amazon support as a raw jar deployment)
6) you can call anything, like existing Pig or Hive or other pure MR jars, as long as they expose a Java API
7) amazing for ML- and NLP-related work
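Since the question also asked for sample code that chains existing M/R jobs: Cascading's native API is Java, but purely to illustrate the chaining idea in a compact way, here is a sketch using mrjob, a Python MapReduce library (not Cascading), where two steps are declared and run back to back:

    # Two chained MapReduce steps (an illustrative stand-in, not Cascading):
    # step 1 counts words, step 2 picks the most frequent one.
    from mrjob.job import MRJob
    from mrjob.step import MRStep

    class MostFrequentWord(MRJob):

        def steps(self):
            return [
                MRStep(mapper=self.mapper_get_words, reducer=self.reducer_count),
                MRStep(reducer=self.reducer_find_max),
            ]

        def mapper_get_words(self, _, line):
            for word in line.split():
                yield word.lower(), 1

        def reducer_count(self, word, counts):
            # Funnel every (count, word) pair to one key so the next step can take a max.
            yield None, (sum(counts), word)

        def reducer_find_max(self, _, count_word_pairs):
            yield max(count_word_pairs)

    if __name__ == '__main__':
        MostFrequentWord.run()

In Cascading the same shape would be expressed as pipes connected between taps, with the planner deciding how many M/R jobs to generate.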

Best practices for using Oozie for Hadoop

I have been using Hadoop for quite a while now. After some time I realized I need to chain Hadoop jobs and have some type of workflow. I decided to use Oozie, but couldn't find much information about best practices. I would like to hear it from more experienced folks.
Best regards
The best way to learn Oozie is to download the examples tar file that comes with the distribution and run each of them. It has examples for MapReduce, Pig, and streaming workflows, as well as sample coordinator XMLs.
First run the normal workflows, and once you have debugged those, move on to running the workflows with a coordinator so that you can take it step by step. Lastly, one best practice is to make most of the variables in your workflow and coordinator configurable and supplied through a component.properties file, so that you don't have to touch the XML often; a sketch of such a file follows below.
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
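As a sketch of that last point, a job.properties file might look like this (all values below are illustrative); the workflow XML then references them as ${nameNode}, ${inputDir}, and so on, and you submit with oozie job -config job.properties -run:

    # Illustrative job.properties: everything the workflow/coordinator XML
    # references as ${...} lives here, so the XML rarely needs editing.
    nameNode=hdfs://namenode:8020
    jobTracker=jobtracker:8032
    queueName=default
    oozie.wf.application.path=${nameNode}/user/${user.name}/workflows/etl
    inputDir=${nameNode}/data/raw
    outputDir=${nameNode}/data/clean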
There are documents about Oozie on GitHub and Apache:
https://github.com/yahoo/oozie/wiki
http://yahoo.github.com/oozie/releases/3.1.0/DG_Examples.html
http://incubator.apache.org/oozie/index.html
The Apache documentation is being updated and should be live soon.

Do you know batch log processing tools for Hadoop (zohmg alternatives)?

Since the zohmg project seems to be dead (no new commits since Nov 2009), I would like to know whether any of you have used it (with successful results), or whether you know anything about the future of this project.
And if not, is there any alternative to this project? I'm looking for a tool that will help extract data from (Apache) logs (using Hadoop as a batch processing system), store it in HBase, and help with querying this data.
Cascading is very often used for this. It also provides adapters for HBase.
Examples can be found here: http://github.com/cwensel/cascading.samples
HBase integration: http://www.cascading.org/modules.html
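If you end up rolling your own instead, the extract-and-aggregate step itself is simple; here is a minimal single-machine sketch of the idea (the log path is hypothetical, and a real deployment would run this shape as a Hadoop job and write the counts to HBase via the adapters above):

    # Minimal zohmg-style aggregation: count Apache access-log hits per status code.
    import re
    from collections import Counter

    # Common Log Format: ip ident user [timestamp] "request" status size
    LOG_LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) \S+'
    )

    counts = Counter()
    with open('access.log') as f:  # hypothetical input file
        for line in f:
            m = LOG_LINE.match(line)
            if m:
                counts[m.group('status')] += 1

    for status, n in sorted(counts.items()):
        print(status, n)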
