Put data from Hive tables to kafka topic via nifi - hadoop

I have few tables in Hive and my goal is to create a view over them and then publish it over a topic in Kafka through Apache NiFi.
What are the options to get it done?
I am planning to do it through Nifi .

I'm sure Nifi would work,
see PutHiveStreaming processor, but sounds like a lot of effort.
Kafka Connect HDFS is able to consume Kafka data and automatically register a Hive table for you.
And if I misunderstood that, and you're trying to query Hive and publish it into a Kafka topic, then sure - Nifi is perfectly capable of that
Use SelectHiveQL and PublishKafka, however Kafka Connect JDBC Source should be able to query Hive and write to Kafka as well

Related

How to write to multiple distinct Elasticsearch clusters using the Kafka Elasticsearch Sink Connector

Is is possible to use a single Kafka instance with the Elasticsearch Sink Connector to write to separate Elasticsearch clusters with the same index? Documentation. The source data may be a backend database or an application. An example use-case is that one cluster may be used for real-time search and the other may be used for analytics.
If this is possible, how do I configure the sink connector? If not, I can think of a couple of options:
Use 2 Kafka instances, each pointing to a different Elasticsearch cluster. Either write to both, or write to one and copy from it to the other.
Use a single Kafka instance and write a stream processor which will write to both clusters.
Are there any others?
Yes you can do this. You can use a single Kafka cluster and single Kafka Connect worker.
One connector can write to one Elasticsearch instance, and so if you have multiple destination Elasticsearch you need multiple connectors configured.
The usual way to run Kafka Connect is in "distributed" mode (even on a single instance), and then you submit one—or more—connector configurations via the REST API.
You don't need a Java client to use Kafka Connect - it's configuration only. The configuration, per connector, says where to get the data from (which Kafka topic(s)) and where to write it (which Elasticsearch instance).
To learn more about Kafka Connect see this talk, this short video, and this specific tutorial on Kafka Connect and Elasticsearch

Incremental fetch from Oracle

Is there any way to fetch incremental data from an Oracle database using user-defined query using JDBC?
We are ok to use Spark, Kafka or plain JDBC.
The only thing it should be able to support heavy load.
You've not specified the destination. If it's a Kafka topic then using Apache Kafka makes sense to do the extract too, using Kafka Connect.
In which case, you can use the Kafka Connect JDBC connector to do this. See here for the specifics on using incremental mode with a custom query.
++ EDIT ++
If your final target is BigQuery then you can use Kafka Connect for that too with the appropriate BigQuery connector. You can see an example of it in action here.

Kafka Topic to Oracle database using Kafka Connect API JDBC Sink Connector Example

I know to write a Kafka consumer and insert/update each record into Oracle database but I want to leverage Kafka Connect API and JDBC Sink Connector for this purpose. Except the property file, in my search I couldn't find a complete executable example with detailed steps to configure and write relevant code in Java to consume a Kafka topic with json message and insert/update (merge) a table in Oracle database using Kafka connect API with JDBC Sink Connector. Can someone point demonstrate an example including configuration and dependencies? Are there any disadvantages with this approach? Do we anticipate any potential issues when table data increases to millions?
Thanks in advance.
There won't be an example for your specific use-case becuase the JDBC connector is meant to be generic.
Here is one configuration example with an Oracle database
All you need is
A topic of some format
key.converter and value.converter to be set to deserialize that topic
Your JDBC string and database schema (tables, projection fields, etc)
Any other JDBC Sink Specific Options
All this goes in a Java properties / JSON file, not Java source code
If you have a specific issue creating this configuration, please comment.
Do we anticipate any potential issues when table data increases to millions?
Well, those issues would be database server related, not with Kafka Connect. For example, disk filling up or increased load while accepting continuous writes.
Are there any disadvantages with this approach?
You'd have to handle de-deduplication or record expiration (e.g. GDPR) separately, if you did want that.

Kafka to Elasticsearch, HDFS with Logstash or Kafka Streams/Connect

I use Kafka for message queue/processing. My question is about performance/best practice. I will do my own performance tests but maybe someone has results/experience already.
The data is raw in a Kafka (0.10) topic and I want to transfer it structured to ES and HDFS.
Now I see 2 possibilities:
Logstash (Kafka input plugin, grok filter (parsing), ES/webhdfs output plugin)
Kafka Streams (parsing), Kafka Connect (ES sink, HDFS sink)
Without any tests I would say that the second option is better/cleaner and more reliable?
Logstash "best practice" for getting data into Elasticsearch. WebHDFS won't have the raw performance of the Java API that is part of the Kafka Connect plugin, however.
Grok could be done in a Kafka Streams process, so your parsing could be done in either location.
If you are on an Elastic subscription, then they would like to sell Logstash. Confluent would like to sell Kafka Streams + Kafka Connect.
Avro seems to be the best medium for data transfer, and the Schema Registry is a popular way to do that. IIUC, Logstash doesn't work well with a Schema Registry or Avro, and prefers JSON.
In the Hadoop landscape, I would offer the intermediate options of Apache Nifi or Streamsets.
In the end, it really depends on your priorities, and how well you (and your team) can support these tools.

Kafka-Connect vs Filebeat & Logstash

I'm looking to consume from Kafka and save data into Hadoop and Elasticsearch.
I've seen 2 ways of doing this currently: using Filebeat to consume from Kafka and send it to ES and using Kafka-Connect framework. There is a Kafka-Connect-HDFS and Kafka-Connect-Elasticsearch module.
I'm not sure which one to use to send streaming data. Though I think that if I want at some point to take data from Kafka and place it into Cassandra I can use a Kafka-Connect module for that but no such feature exists for Filebeat.
Kafka Connect can handle streaming data and is a bit more flexible. If you are just going to elastic, Filebeat is a clean integration for log sources. However, if you are going from Kafka to a number of different sinks, Kafka Connect is probably what you want. I'd recommend checking out the connector hub to see some examples of open source connectors at your disposal currently http://www.confluent.io/product/connectors/

Resources