I am currently using a Hadoop cluster of 10 nodes (1 NameNode and 9 DataNodes) on which HBase, Hive, Kafka, ZooKeeper, and other components of the Hadoop ecosystem are running. Now I want to fetch data from an RDBMS and store it in HDFS in real time. Can we do that by using a Confluent source connector and the HDFS 2 Sink Connector within the same cluster, or do I need a separate cluster for Kafka Connect?
Yes. Kafka Connect is a standalone Java process, just like each of the other components you mentioned.
do I need to have a separate cluster for Kafka Connect
That would be preferred, but is optional
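For illustration, a minimal sketch of the two connector configurations in standalone .properties style; the connection URL, credentials, table, and topic names here are placeholders, not taken from the question:

# jdbc-source.properties (hypothetical values)
name=rdbms-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://rdbms-host:3306/mydb
connection.user=connect
connection.password=connect-secret
table.whitelist=mytable
mode=incrementing
incrementing.column.name=id
topic.prefix=rdbms-

# hdfs-sink.properties (hypothetical values)
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=rdbms-mytable
hdfs.url=hdfs://namenode:8020
hadoop.conf.dir=/etc/hadoop/conf
flush.size=1000

The Connect worker running these can live on the same machines as the rest of the cluster or on its own nodes; the configuration does not change.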
Related
Is it possible to use a single Kafka instance with the Elasticsearch Sink Connector to write to separate Elasticsearch clusters with the same index? The source data may be a backend database or an application. An example use case is that one cluster may be used for real-time search and the other for analytics.
If this is possible, how do I configure the sink connector? If not, I can think of a couple of options:
Use 2 Kafka instances, each pointing to a different Elasticsearch cluster. Either write to both, or write to one and copy from it to the other.
Use a single Kafka instance and write a stream processor which will write to both clusters.
Are there any others?
Yes, you can do this. You can use a single Kafka cluster and a single Kafka Connect worker.
One connector can write to one Elasticsearch instance, so if you have multiple destination Elasticsearch clusters, you need multiple connectors configured.
The usual way to run Kafka Connect is in "distributed" mode (even on a single instance), and then you submit one—or more—connector configurations via the REST API.
You don't need a Java client to use Kafka Connect - it's configuration only. The configuration, per connector, says where to get the data from (which Kafka topic(s)) and where to write it (which Elasticsearch instance).
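As a sketch (the worker host, topic, connector names, and Elasticsearch URLs below are made up), the two configurations submitted to the REST API would be identical except for the connection.url:

POST http://connect-worker:8083/connectors
{
  "name": "es-sink-search",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "my-topic",
    "connection.url": "http://es-search:9200",
    "type.name": "_doc",
    "key.ignore": "true"
  }
}

POST http://connect-worker:8083/connectors
{
  "name": "es-sink-analytics",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "my-topic",
    "connection.url": "http://es-analytics:9200",
    "type.name": "_doc",
    "key.ignore": "true"
  }
}

Both connectors read the same topic, so the same index name ends up in both clusters.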
To learn more about Kafka Connect see this talk, this short video, and this specific tutorial on Kafka Connect and Elasticsearch
I am a newbie to Kafka technology.
I have setup a basic single node cluster using Ambari.
I want to understand what the recommended configuration is for a production setup. Let's say in production I will have 5 topics, each getting traffic in the range of 500,000 to 50,000,000 messages per day.
I am thinking of setting up a 3-4 node Kafka cluster using EC2 r5.xlarge instances.
I am mostly confused about the ZooKeeper part. I understand that ZooKeeper needs an odd number of nodes and that ZooKeeper is installed on all Kafka nodes; if so, how do I run Kafka with an even number of nodes? If this were true, it would limit Kafka to an odd number of nodes as well.
Is it really necessary to install ZooKeeper on all Kafka nodes? Can I install ZooKeeper on separate nodes and the Kafka brokers on separate nodes, and if so, how?
What if I want to run multiple Kafka clusters? Is it possible to manage multiple Kafka clusters through a single ZooKeeper cluster, and if so, how?
I have only started learning Kafka recently, so any help would be appreciated.
Thanks,
I am mostly confused about the ZooKeeper part. I understand that ZooKeeper needs an odd number of nodes and that ZooKeeper is installed on all Kafka nodes; if so, how do I run Kafka with an even number of nodes? If this were true, it would limit Kafka to an odd number of nodes as well.
ZooKeeper can be, but does not have to be, installed on the same servers as Kafka. Running ZooKeeper on an odd number of nodes is not a requirement, just a very good recommendation.
Is it really necessary to install ZooKeeper on all Kafka nodes? Can I install ZooKeeper on separate nodes and the Kafka brokers on separate nodes, and if so, how?
It is not required, and it is actually better not to have ZooKeeper and Kafka on the same server. Installing ZooKeeper on other servers works much the same as when they share a server: every Kafka broker needs its zookeeper.connect setting pointing to all ZooKeeper nodes.
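For example (host names are placeholders), every broker's server.properties would contain something like:

zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181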
What if I want to run multiple Kafka clusters? Is it possible to manage multiple Kafka clusters through a single ZooKeeper cluster, and if so, how?
It is possible. In that case it is recommended to have servers dedicated to the ZooKeeper ensemble, and in each cluster's zookeeper.connect setting you should use hostname:port/path instead of just hostname:port.
Can I install ZooKeeper on separate nodes and the Kafka brokers on separate nodes, and if so, how?
You can, and you should if you have the available resources.
Run zookeeper-server-start zookeeper.properties on an odd number of servers (at most 5 or 7, even for larger Kafka clusters).
On every other machine that is a Kafka broker, not the same servers as ZooKeeper, edit server.properties so that the zookeeper.connect property points to that set of ZooKeeper addresses.
Then run kafka-server-start server.properties on every new Kafka broker.
From there, you can scale Kafka independently of ZooKeeper.
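A condensed sketch of that sequence (paths and host names are illustrative):

# On each of the 3 (or 5) dedicated ZooKeeper servers
zookeeper-server-start /etc/kafka/zookeeper.properties

# On each Kafka broker, after setting in /etc/kafka/server.properties:
#   zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
kafka-server-start /etc/kafka/server.properties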
Is it possible to manage multiple Kafka clusters through single Zookeeper cluster
Look up Zookeeper chroots
One Kafka cluster would be defined as
zoo1:2181/kafka1
And a second
zoo1:2181/kafka2
Be careful not to mix those up on machines that shouldn't be in the same Kafka cluster.
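Concretely (the ensemble host names are placeholders), the brokers of the two clusters would differ only in the chroot suffix of zookeeper.connect:

# server.properties on brokers of cluster 1
zookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181/kafka1

# server.properties on brokers of cluster 2
zookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181/kafka2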
You can find various CloudFormation, Terraform, or Ansible repos on GitHub for setting up Kafka in a distributed way in the cloud, or go for Kubernetes if you are familiar with it.
Question
I am developing a Flink streaming job in Scala on a standalone Flink cluster running on a server. The job consumes data from more than one Kafka topic, does some formatting, and writes the results to HDFS.
One of the Kafka topics and HDFS each require their own Kerberos authentication (because they belong to completely different clusters).
My questions are:
Is it possible (and if yes, how?) to use two Kerberos keytabs (one for Kafka, the other for HDFS) from a Flink job on a Flink cluster running on a server, so that the Flink job can consume from the Kafka topic and write to HDFS at the same time?
If not possible, what is a reasonable workaround for Kafka-Flink-HDFS data streaming when Kafka and HDFS are both Kerberos-protected?
Note
I am quite new to most of the technologies mentioned here.
The Flink job can write to HDFS if it doesn't need to consume the Kerberos-protected topic. In this case, I specified the HDFS credentials via security.kerberos.login.keytab and security.kerberos.login.principal in flink-conf.yaml (see the sketch after these notes).
I am using the HDFS connector provided by Flink to write to HDFS.
Manually switching the Kerberos authentication between the two principals was possible. In the [realms] section of the krb5.conf file, I specified two realms, one for Kafka and one for HDFS.
kinit -kt path/to/hdfs.keytab [principal: xxx@XXX.XXX...]
kinit -kt path/to/kafka.keytab [principal: yyy@YYY.YYY...]
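Regarding the first note, a sketch of that single-credential setup in flink-conf.yaml (the keytab path and principal are placeholders):

# flink-conf.yaml (placeholder values)
security.kerberos.login.keytab: /path/to/hdfs.keytab
security.kerberos.login.principal: xxx@HDFS_REALM.EXAMPLE.COM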
Environment
Flink (v1.4.2) https://ci.apache.org/projects/flink/flink-docs-stable/
Kafka client (v0.10.X)
HDFS (Hadoop cluster HDP 2.6.X)
Thanks for your attention and feedback!
Based on the answer and comments on this very similar question, it appears there is no clear way to use two credentials in a single Flink job.
Promising approaches or workarounds:
Creating a trust
Co-installing Kafka and HDFS on the same platform
Using something else to bridge the gap
An example of the last point:
You can use something like NiFi or Streams Replication Manager to bring the data from the source Kafka into the Kafka in your cluster. NiFi is more modular, and it is possible to configure the Kerberos credentials for each step. Afterwards you are in a single Kerberos context, which Flink can handle.
Full disclosure: I am an employee of Cloudera, a driving force behind NiFi, Kafka, HDFS, Streams Replication Manager and, recently, Flink.
Three years after my initial post, our architecture has moved from a standalone bare-metal server to Docker containers on Mesos, but let me summarize the workaround (for Flink 1.8):
Place krb5.conf with all realm definitions and domain-realm mappings (for example under /etc/ of the container)
Place Hadoop krb5.keytab (for example under /kerberos/HADOOP_CLUSTER.ORG.EXAMPLE.COM/)
Configure Flink's security.kerberos.login.* properties in flink-conf.yaml
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.principal: username@HADOOP_CLUSTER.ORG.EXAMPLE.COM
security.kerberos.login.contexts should not be configured. This ensures that Flink does not use Hadoop’s credentials for Kafka and Zookeeper.
Copy keytabs for Kafka into separate directories inside the container (for example under /kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/)
Periodically run a custom script to renew the ticket cache
KINIT_COMMAND_1='kinit -kt /kerberos/HADOOP_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab username@HADOOP_CLUSTER.ORG.EXAMPLE.COM'
KINIT_COMMAND_2='kinit -kt /kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab username@KAFKA_CLUSTER.ORG.EXAMPLE.COM -c /tmp/krb5cc_kafka'
...
Set the sasl.jaas.config property to the actual JAAS configuration string when instantiating each FlinkKafkaConsumer.
This bypasses the global JAAS configuration; if we set it globally, we can't use different Kafka instances with different credentials, or unsecured Kafka together with secured Kafka.
props.setProperty("sasl.jaas.config",
"com.sun.security.auth.module.Krb5LoginModule required " +
"refreshKrb5Config=true " +
"useKeyTab=true " +
"storeKey=true " +
"debug=true " +
"keyTab=\"/kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab\" " +
"principal=\"username#KAFKA_CLUSTER.ORG.EXAMPLE.COM\";")
Someone suggested that Hadoop does streaming, and quoted Flume and Kafka as examples.
While I understand they might have streaming features, I wonder if they can be considered to be in the same league as stream-processing technologies like Storm/Spark/Flink. Kafka is a 'publish-subscribe model messaging system' and Flume is a data ingestion tool. And even though they interact/integrate with Hadoop, are they technically part of 'Hadoop' itself?
PS: I understand there is Hadoop Streaming, which is an entirely different thing.
Hadoop is only YARN, HDFS, and MapReduce. As a project, it does not accommodate (near) real-time ingestion or processing.
Hadoop Streaming is a utility for running MapReduce jobs with arbitrary executables that communicate via standard input/output streams.
Kafka is not only a publish/subscribe message queue.
Kafka Connect is essentially a Kafka channel, in Flume terms. Various plugins exist for reading from different "sources" and producing to Kafka, and "sinks" exist to consume from Kafka into databases or filesystems. From a consumer perspective, this is more scalable than singular Flume agents deployed across your infrastructure. If all you're looking for is log ingestion into Kafka, personally I find Filebeat or Fluentd better than Flume (no Java dependencies).
Kafka Streams is a comparable product to Storm, Flink, and Samza, except that it has no dependency on YARN or any cluster scheduler, and it's possible to embed a Kafka Streams processor within any JVM-compatible application (for example, a Java web application). You'd have difficulty doing that with Spark or Flink without introducing a dependency on some external system(s).
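As a rough illustration of that embedding (the topic names and the toy transformation are invented for the example), a Kafka Streams processor can run inside a plain Java main method, or equally inside a web application's startup code:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class EmbeddedStreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-streams-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Trivial topology: read one topic, upper-case the values, write to another
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        // Runs inside this ordinary JVM process; no YARN or cluster scheduler involved
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}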
The main benefit I find in Flume, NiFi, Storm, Spark, etc. is that they complement Kafka and have Hadoop-compatible integrations, along with other systems used in the big-data space such as Cassandra (see the SMACK stack).
So, to answer the question, you need to use other tools to allow streaming data to be processed and stored by Hadoop.
I have a streaming use case where I am developing a Spring Boot application that should read data from a Kafka topic and put it into an HDFS path. I have two distinct clusters, one for Kafka and one for Hadoop.
The application worked fine when the Kafka cluster did not have Kerberos authentication and only the Hadoop cluster was Kerberized.
Issues started when both clusters were Kerberized; at any given time I could authenticate to only one cluster.
I did some analysis and googling but could not find much help.
My theory is that we cannot log in/authenticate to two Kerberized clusters from the same JVM instance, because we need to set the realm and KDC details in code, and those are not client-specific but JVM-specific.
It might be that I did not use the proper APIs; I am very new to Spring Boot.
I know we can do this by setting up a cross-realm trust between the clusters, but I am looking for an application-level solution if possible.
I have a few questions:
Is it possible to log in/authenticate to two separate Kerberized clusters from the same JVM instance? If so, how? Please help me; use of Spring Boot is preferred.
What would be the best solution to stream data from the Kafka cluster to the Hadoop cluster?
What would be the best solution to stream data from the Kafka cluster to the Hadoop cluster?
Kafka's Connect API is for streaming integration of sources and targets with Kafka, using just configuration files - no coding! The HDFS connector is what you want, and supports Kerberos authentication. It is open source and available standalone or as part of Confluent Platform.
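As a hedged sketch (the URLs, principal names, and paths are placeholders), a Kerberos-enabled HDFS sink connector configuration might look like this:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my-topic
hdfs.url=hdfs://namenode.hadoop-cluster.example.com:8020
hadoop.conf.dir=/etc/hadoop/conf
flush.size=1000
hdfs.authentication.kerberos=true
connect.hdfs.principal=connect-user@HADOOP_REALM.EXAMPLE.COM
connect.hdfs.keytab=/path/to/connect-user.keytab
hdfs.namenode.principal=nn/_HOST@HADOOP_REALM.EXAMPLE.COM

The Connect worker's own connection to the Kerberized Kafka cluster is configured separately, through the worker's security.protocol and sasl.* settings, which is what keeps the two sets of credentials independent.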