Need to move small JSON messages from Kafka to HDFS with Kakfa Connect but without using Confluent libs, if not completely free - hadoop

I'd like to use Kakfa Connect to move JSON messages from Kafka to HDFS and then Impala, only using OpenSource libs.
I was trying to understand if I can use the Confluent Sink library for Kakfa Connect, without the need to use the entire Confluent distribution.
Are there are other and/or better options to achieve this?

The Kafka Connect HDFS 2 Sink is available under the Confluent Community Licence. It is a plugin for Apache Kafka; you do not have to run Confluent Platform to use it.

Related

Sending data from elasticsearch to kafka and finally to influxdb?

I would like to know how can I send data from elasticsearch to kafka and then to influxdb?
I've already tried using confluent platform with sources connector from elasticsearch and sink connector from influxdb, but the problem is that I'm stuck on sending data from elasticsearch to kafka
moreover once my computer is off I no longer have the backup of the connectors and I have to start from scratch
that's why my questions:
How to send data from elasticsearch to kafka? using confluent platform?
Do I really have to use confluent platform if I want to use kafka connect?
Kafka Connect is Apache 2.0 Licensed and is included with Apache Kafka download.
Confluent (among other companies) write plugins for it, such as Sinks to Elasticsearch or Influx.
It appears the Elasticsearch source on Confluent Hub is not built by Confluent, for example.
Related - Use Confluent Hub without Confluent Platform installation
once my computer is off I no longer have the backup of the connectors and I have to start from scratch
Kafka Connect distributed mode stores its config data in Kafka topics... Kafka defaults to store topic data in /tmp... Which is deleted when you shutdown your computer
Similarly, if you are using Docker for any of these systems without mounted volumes, Docker also is not persistent by default

How to write to multiple distinct Elasticsearch clusters using the Kafka Elasticsearch Sink Connector

Is is possible to use a single Kafka instance with the Elasticsearch Sink Connector to write to separate Elasticsearch clusters with the same index? Documentation. The source data may be a backend database or an application. An example use-case is that one cluster may be used for real-time search and the other may be used for analytics.
If this is possible, how do I configure the sink connector? If not, I can think of a couple of options:
Use 2 Kafka instances, each pointing to a different Elasticsearch cluster. Either write to both, or write to one and copy from it to the other.
Use a single Kafka instance and write a stream processor which will write to both clusters.
Are there any others?
Yes you can do this. You can use a single Kafka cluster and single Kafka Connect worker.
One connector can write to one Elasticsearch instance, and so if you have multiple destination Elasticsearch you need multiple connectors configured.
The usual way to run Kafka Connect is in "distributed" mode (even on a single instance), and then you submit one—or more—connector configurations via the REST API.
You don't need a Java client to use Kafka Connect - it's configuration only. The configuration, per connector, says where to get the data from (which Kafka topic(s)) and where to write it (which Elasticsearch instance).
To learn more about Kafka Connect see this talk, this short video, and this specific tutorial on Kafka Connect and Elasticsearch

Export data from Kafka to Oracle

I am trying to export data from Kafka to Oracle db. I've searched related questions and web but could not understand that we need a platform (confluent etc.. ) or not. I'd been read the link below but it's not clear enough.
https://docs.confluent.io/3.2.2/connect/connect-jdbc/docs/sink_connector.html
So, what we actually need to export data without 3rd party platform? Thanks in advance.
It's not clear what you mean by "third-party" here
What you linked to is Kafka Connect, which is Apache 2.0 Licensed and open source.
Kafka Connect is a plugin ecosystem, you install connectors individually, written by anyone, or write your own, just like any other Java dependency (i.e. a third-party)
The JDBC connector just happens to be maintained by Confluent. and you can configure the Confluent Hub CLI
to install within any Kafka Connect distribution (or use Kafka Connect Docker images from Confluent)
Alternatively, you use Apache Spark, Flink, Nifi, and many other Kafka Consumer libraries to read data and then start an Oracle transaction per record batch
Or you can explore non-JVM kafka libraries as well and use a language you're more familiar with doing Oracle operations with

How to implement kafka-connect using apache-kaka instead of confluent

I would like to use an open source version of kafka-connect instead of the confluent one as it appears that confluent cli is not for production and only for dev. I would like to be able to listen to changes on mysql database on aws ec2. Can someone point me in the right direction.
Kafka Connect is part of Apache Kafka. Period. If you want to use Kafka Connect you can do so with any modern distribution of Apache Kafka.
You then need a connector plugin to use with Kafka Connect, specific to your source technology. For integrating with a database there are various considerations, and available for MySQL you specifically have:
Kafka Connect JDBC - see it in action here
Debezium - see it in action here
The Confluent CLI is just a tool for helping manage and deploy Confluent Platform on developer machines. Confluent Platform itself is widely used in production.

Kafka-Connect vs Filebeat & Logstash

I'm looking to consume from Kafka and save data into Hadoop and Elasticsearch.
I've seen 2 ways of doing this currently: using Filebeat to consume from Kafka and send it to ES and using Kafka-Connect framework. There is a Kafka-Connect-HDFS and Kafka-Connect-Elasticsearch module.
I'm not sure which one to use to send streaming data. Though I think that if I want at some point to take data from Kafka and place it into Cassandra I can use a Kafka-Connect module for that but no such feature exists for Filebeat.
Kafka Connect can handle streaming data and is a bit more flexible. If you are just going to elastic, Filebeat is a clean integration for log sources. However, if you are going from Kafka to a number of different sinks, Kafka Connect is probably what you want. I'd recommend checking out the connector hub to see some examples of open source connectors at your disposal currently http://www.confluent.io/product/connectors/

Resources