Difference between HDF and Apache NiFi - hortonworks-data-platform

I am trying to understand the difference between Apache NiFi and Hortonworks DataFlow (HDF).
How do they differ from each other in terms of capability and overall design? What are the possible use cases for NiFi and for HDF?

Hortonworks Data Flow (HDF) is a platform for data collection, curation, analysis, and delivery. It is made up of Apache NiFi, Apache Kafka, Apache Storm, and Apache Ranger. You can read more about it here: https://hortonworks.com/products/data-center/hdf/
Apache NiFi is an open-source data flow tool, and is one of the tools included in HDF.

Related

Ingest Avro from Apache Flume sink to Apache NiFi

Is it possible to get Avro data as a flow file from an Apache Flume sink? I have no idea which processors I should use. I tried site-to-site and ListenTCPRecord, but neither seems to work. Note that Flume and NiFi are hosted on different servers.

Can I use Apache NiFi as an ESB or a request mediator?

I've seen Apache NiFi compared to similar ETL tools like Apache Flume, Airflow and Kafka. These are ETL tools more than ESBs or request mediators.
ESBs/request mediators can be used to orchestrate web services and expose a single service (a proxy service) which is expected to serve concurrent HTTP requests efficiently.
My question is, can I use Apache NiFi for the same purpose? That is, can I provide service orchestration and serve proxy service endpoints using NiFi's processors such as HandleHttpRequest? Is it designed to handle real-time concurrent requests efficiently?
You brought up a few technologies that are quite different.
Apache NiFi is a dataflow management tool. Unlike Kafka Streams, Airflow, or Apache Flume, it does not require you to write your own code: you can do almost anything you need using the existing processors provided by the Apache NiFi project.
Airflow, by contrast, is a workflow management tool and is better compared with Oozie.
NiFi is made for real-time performance, but not for serving as a REST API. It can, however, start a flow based on an HTTP request, as you said.
Hope it helps.

Can Hadoop do streaming?

Someone suggested that Hadoop does streaming, and cited Flume and Kafka as examples.
While I understand they might have streaming features, I wonder if they can be considered in the same league as stream processing technologies like Storm/Spark/Flink. Kafka is a 'publish-subscribe model messaging system' and Flume is a data ingestion tool. And even though they interact/integrate with Hadoop, are they technically part of 'Hadoop' itself?
PS: I understand there is a Hadoop Streaming which is an entirely different thing.
Hadoop is only YARN, HDFS, and MapReduce. As a project, it does not accommodate (near) real time ingestion or processing.
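Hadoop Streaming is a tool for running MapReduce jobs whose mapper and reducer are arbitrary executables that exchange data over standard streams (standard input/output). It is not a stream processing engine.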
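To make the standard input/output point concrete, here is a minimal word-count sketch of the kind of script Hadoop Streaming can run. The script name (wordcount.py) and the function names map_lines and reduce_records are hypothetical, chosen only for illustration. It assumes the usual Streaming convention of tab-separated key/value lines, with mapper output sorted by key before it reaches the reducer.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_records(records: Iterable[str]) -> Iterator[str]:
    """Reducer: sum counts per word; assumes records arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical CLI: "map" or "reduce" selects the role; with no argument
    # the script does nothing, so it never blocks waiting on stdin.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        for out in map_lines(sys.stdin):
            print(out)
    elif role == "reduce":
        for out in reduce_records(sys.stdin):
            print(out)
```

Locally, the same pipeline can be imitated with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. Under Hadoop Streaming, a script like this would be supplied through the streaming jar's -mapper and -reducer options. Either way it is batch processing over files, not real-time streaming.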
Kafka is not only a publish/subscribe message queue.
Kafka Connect is essentially a Kafka channel, in Flume terms. Various plug-ins exist for reading from different "sources" and producing to Kafka, and "sinks" exist to consume from Kafka into databases or filesystems. From a consumer perspective, this is more scalable than individual Flume agents deployed across your infrastructure. If all you're looking for is log ingestion into Kafka, personally I find Filebeat or Fluentd to be better than Flume (no Java dependencies).
Kafka Streams is a comparable product to Storm, Flink, and Samza, except the dependency upon YARN or any cluster scheduler doesn't exist, and it's possible to embed a Kafka Streams processor within any JVM compatible application (for example, a Java web application). You'd have difficulties trying to do that with Spark or Flink without introducing a dependency on some external system(s).
The only benefits I find in Flume, NiFi, Storm, Spark, etc. are that they complement Kafka and that they have Hadoop-compatible integrations, along with integrations for other systems used in the big data space, like Cassandra (see the SMACK stack).
So, to answer the question, you need to use other tools to allow streaming data to be processed and stored by Hadoop.

Data stream between Kerberized Kafka cluster and Hadoop cluster using Spring Boot

I have a streaming use case: a Spring Boot application that should read data from a Kafka topic and write it to an HDFS path. I have two distinct clusters, one for Kafka and one for Hadoop.
The application worked fine when the Kafka cluster had no Kerberos authentication and only Hadoop was Kerberized.
Issues started when both clusters were Kerberized: I could authenticate to only one cluster at a time.
I did some analysis and googling, but could not find much help.
My theory is that we cannot log in/authenticate to two Kerberized clusters from the same JVM instance, because the REALM and KDC details we need to set in code are JVM-specific rather than client-specific.
It might be that I did not use the proper APIs; I am very new to Spring Boot.
I know we can do this by setting up cross-realm trust between the clusters, but I am looking for an application-level solution if possible.
I have a few questions:
Is it possible to log in/authenticate to two separate Kerberized clusters from the same JVM instance? If so, please help me; use of Spring Boot is preferred.
What would be the best solution to stream data from the Kafka cluster to the Hadoop cluster?
On the best solution for streaming data from the Kafka cluster to the Hadoop cluster: Kafka's Connect API is for streaming integration of sources and targets with Kafka, using just configuration files - no coding! The HDFS connector is what you want, and it supports Kerberos authentication. It is open source and available standalone or as part of Confluent Platform.

Hadoop Zookeeper understanding

I'm having difficulty understanding ZooKeeper in the Hadoop framework. The main aspects of ZooKeeper I find confusing are how it handles consistency across its nodes, and how it makes use of its distributed in-memory file system to handle coordination. Any help with these points would be great.
As #Yann said, ZooKeeper is not tied to Hadoop. ZooKeeper is a coordination engine for services. With it, you can implement distributed configuration, service discovery, and leader election, among other features.
I suggest reading this post to see how it works for load balancing, and this other one for service discovery. Another good resource is Apache Curator, a framework that makes it easy to use ZooKeeper from Java or any other JVM language.

Resources