Ingest Avro from an Apache Flume sink into Apache NiFi

Is it possible to get Avro data as a flow file from an Apache Flume sink? I have no idea which processors I should use. I tried Site-to-Site and ListenTCPRecord, but neither seems to work. Note that Flume and NiFi are hosted on different servers.
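One configuration sketch that may help: Flume's Avro sink speaks Avro RPC, so a plain listener like ListenTCPRecord will not understand it, and Site-to-Site only works between NiFi instances. If your NiFi build includes the Flume bundle, the ExecuteFlumeSource processor can host a Flume Avro source inside NiFi and emit the received events as flow files. A minimal sketch, assuming placeholder hostnames, ports, and agent/channel names:

    # Flume agent side: point the Avro sink at the NiFi server.
    # Agent, sink, and channel names here are placeholders.
    agent1.sinks = avroSink
    agent1.sinks.avroSink.type = avro
    # Host/port where NiFi's embedded Flume Avro source listens
    agent1.sinks.avroSink.hostname = nifi-host
    agent1.sinks.avroSink.port = 4545
    agent1.sinks.avroSink.channel = memChannel

On the NiFi side, ExecuteFlumeSource would be configured with an avro source type bound to 0.0.0.0:4545 via its Flume-style configuration properties. Since Flume and NiFi run on different servers, that port must be reachable through any firewall between them.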

Related

Need to move small JSON messages from Kafka to HDFS with Kafka Connect, but without using Confluent libs unless they are completely free

I'd like to use Kafka Connect to move JSON messages from Kafka to HDFS, and then into Impala, using only open-source libs.
I was trying to understand whether I can use the Confluent sink library for Kafka Connect without needing the entire Confluent distribution.
Are there other and/or better options to achieve this?
The Kafka Connect HDFS 2 Sink is available under the Confluent Community License. It is a plugin for Apache Kafka; you do not have to run Confluent Platform to use it.
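For reference, a hedged sketch of a standalone-mode connector configuration (the topic, NameNode address, and metastore URI are placeholders):

    name=hdfs-json-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # Placeholder topic and NameNode address
    topics=json-events
    hdfs.url=hdfs://namenode:8020
    # Records accumulated per output file
    flush.size=1000
    # Write the JSON messages out as JSON files
    format.class=io.confluent.connect.hdfs.json.JsonFormat
    # Optional: auto-register a Hive table over the files,
    # which can then also be queried from Impala
    hive.integration=true
    hive.metastore.uris=thrift://metastore:9083

You can run this with the stock Apache Kafka worker, e.g. connect-standalone.sh worker.properties hdfs-sink.properties, after dropping the connector jars onto the worker's plugin path.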

Can Hadoop do streaming?

Someone suggested that Hadoop does streaming, and quoted Flume and Kafka as examples.
While I understand they might have streaming features, I wonder if they can be considered in the same league as stream-processing technologies like Storm/Spark/Flink. Kafka is a 'publish-subscribe messaging system' and Flume is a data ingestion tool. And even though they interact/integrate with Hadoop, are they technically part of 'Hadoop' itself?
PS: I understand there is a Hadoop Streaming which is an entirely different thing.
Hadoop is only YARN, HDFS, and MapReduce. As a project, it does not accommodate (near) real-time ingestion or processing.
Hadoop Streaming is a tool used to move data between external processes via filesystem streams (standard input/output).
Kafka is not only a publish/subscribe message queue.
Kafka Connect is essentially a Kafka channel, in Flume terms. Various plugins exist for reading from different "sources" and producing to Kafka, and "sinks" exist to consume from Kafka into databases or filesystems. From a consumer perspective, this is more scalable than singular Flume agents deployed across your infrastructure. If all you're looking for is log ingestion into Kafka, I personally find Filebeat or Fluentd to be better than Flume (no Java dependencies).
Kafka Streams is a product comparable to Storm, Flink, and Samza, except that it has no dependency on YARN or any cluster scheduler, and it's possible to embed a Kafka Streams processor within any JVM-compatible application (for example, a Java web application). You'd have difficulty doing that with Spark or Flink without introducing a dependency on some external system(s).
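To make the embedding point concrete, here is a minimal Kafka Streams sketch in plain Java (topic names and the broker address are placeholders); the same few lines could live inside any JVM application instead of a dedicated cluster job:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EmbeddedStreamsExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Uppercase every record from "input" and forward it to "output"
            builder.<String, String>stream("input")
                   .mapValues(v -> v.toUpperCase())
                   .to("output");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // No YARN or cluster scheduler: the processor lives and dies with the JVM
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }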
The only benefits of Flume, NiFi, Storm, Spark, etc. that I find are that they complement Kafka and offer Hadoop-compatible integrations, along with other systems used in the big-data space like Cassandra (see the SMACK stack).
So, to answer the question: you need to use other tools to allow streaming data to be processed and stored by Hadoop.

Put data from Hive tables to kafka topic via nifi

I have a few tables in Hive and my goal is to create a view over them and then publish it to a topic in Kafka through Apache NiFi.
What are the options to get this done?
I am planning to do it through NiFi.
I'm sure NiFi would work, see the PutHiveStreaming processor, but it sounds like a lot of effort.
Kafka Connect HDFS is able to consume Kafka data and automatically register a Hive table for you.
And if I misunderstood that, and you're trying to query Hive and publish the results to a Kafka topic, then sure, NiFi is perfectly capable of that:
use SelectHiveQL and PublishKafka. However, the Kafka Connect JDBC source should be able to query Hive and write to Kafka as well.
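For the Connect route, a hypothetical JDBC source configuration might look like this (the connection URL, query, and topic prefix are placeholders; the Hive JDBC driver would have to be on the worker classpath, and Hive is not an officially supported dialect, so treat this strictly as a sketch):

    name=hive-to-kafka
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    tasks.max=1
    # Placeholder HiveServer2 URL
    connection.url=jdbc:hive2://hive-server:10000/default
    # Re-read the whole view on each poll (no incrementing column needed)
    mode=bulk
    query=SELECT * FROM my_view
    # Query-based sources publish to the topic named by this prefix
    topic.prefix=hive-my-view
    # Poll hourly
    poll.interval.ms=3600000

In NiFi terms, the equivalent flow is just two processors: SelectHiveQL issuing the same SELECT against the view, connected to PublishKafka.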

Difference between HDF and Apache NiFi

I am trying to understand the difference between Apache NiFi and Hortonworks DataFlow (HDF).
How do they differ from each other in terms of capability and overall design? What would be possible use cases for NiFi and HDF?
Hortonworks DataFlow (HDF) is a platform for data collection, curation, analysis, and delivery. It is made up of Apache NiFi, Apache Kafka, Apache Storm, and Apache Ranger. You can read more about it here: https://hortonworks.com/products/data-center/hdf/
Apache NiFi is an open-source data flow tool, and is one of the tools included in HDF.

Logstash or Elasticsearch integration with Apache Spark streaming

Is it possible to receive live input streams of logs from Logstash or Elasticsearch into Spark Streaming?
I see there's a built-in Flume receiver, but are there any existing custom receivers for Logstash or Elasticsearch?
Possibly the best solution currently is to use the Logstash output plugin for Kafka and then read the Kafka topic using the Spark Kafka integration:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
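A minimal sketch of the Spark side, assuming the spark-streaming-kafka-0-10 integration and that the Logstash kafka output plugin is configured to write to a placeholder topic named logstash (broker address and group id are placeholders too):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class LogstashKafkaStream {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("logstash-kafka").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "kafka:9092"); // placeholder broker
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "spark-logstash-demo");
            kafkaParams.put("auto.offset.reset", "latest");

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("logstash"), kafkaParams));

            // Print the raw log lines shipped by Logstash
            stream.map(ConsumerRecord::value).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }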
