Spring XD batch job to ingest data from Kafka to HDFS

How can I ingest data from Kafka into HDFS using a Spring XD batch job? I would like a batch job that is scheduled to run once a day. How can I track offsets in Kafka?

I assume the stream definition kafka | hdfs doesn't help because you want to run and orchestrate this as a batch job.
In that case, there is no out-of-the-box XD batch job module for kafka -> hdfs yet, so you would need to implement a custom batch job module.
To read the Kafka messages, you would need an ItemReader implementation that reads messages from the Kafka broker. See the similar approach in AmqpItemReader:
https://github.com/spring-projects/spring-batch/blob/master/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/amqp/AmqpItemReader.java
For the Kafka-specific implementation, spring-integration-kafka is a useful reference: https://github.com/spring-projects/spring-integration-kafka
To write the data into HDFS, XD already has org.springframework.xd.batch.item.hadoop.HdfsTextItemWriter.
Any of the existing XD batch job modules that write to HDFS can serve as a reference for implementing this. Feel free to open a JIRA issue; contributions are welcome.
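As a rough illustration of the reader described above, the sketch below adapts a poll-style source to the item-at-a-time contract that Spring Batch readers follow (read() returns null when the input is exhausted) and remembers the offset to resume from on the next daily run. It uses only the JDK: SketchItemReader, PollingItemReader, and the fetch function are hypothetical stand-ins for Spring Batch's ItemReader and a real Kafka consumer, and a single numeric offset stands in for Kafka's per-partition offsets.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongFunction;

/** Hypothetical stand-in for Spring Batch's ItemReader: read() returns null at end of input. */
interface SketchItemReader<T> {
    T read();
}

/**
 * Adapts a poll-style source (as a Kafka consumer would be) to item-at-a-time reads.
 * fetchFromOffset is a placeholder for "return the next records starting at this offset".
 */
class PollingItemReader implements SketchItemReader<String> {
    private final LongFunction<List<String>> fetchFromOffset;
    private final Deque<String> buffer = new ArrayDeque<>();
    private long nextOffset;

    PollingItemReader(LongFunction<List<String>> fetchFromOffset, long startOffset) {
        this.fetchFromOffset = fetchFromOffset;
        this.nextOffset = startOffset;
    }

    @Override
    public String read() {
        if (buffer.isEmpty()) {
            // Refill from the source; an empty result means nothing is left for this run.
            buffer.addAll(fetchFromOffset.apply(nextOffset));
        }
        String item = buffer.poll(); // null when the refill returned nothing
        if (item != null) {
            nextOffset++;
        }
        return item;
    }

    /** Offset to persist after the run so that the next run resumes here. */
    long nextOffset() {
        return nextOffset;
    }
}

class PollingItemReaderDemo {
    /** Drains a reader over an in-memory "topic" and returns the offset to resume from. */
    static long drain(List<String> topic, long startOffset, List<String> sink) {
        PollingItemReader reader = new PollingItemReader(
                // Serve at most two records per fetch, clamped to the end of the list.
                offset -> topic.subList((int) Math.min(offset, topic.size()),
                                        (int) Math.min(offset + 2, topic.size())),
                startOffset);
        for (String item = reader.read(); item != null; item = reader.read()) {
            sink.add(item); // a real job would hand items to an HDFS writer here
        }
        return reader.nextOffset();
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        long resumeAt = drain(List.of("a", "b", "c"), 1, sink);
        System.out.println(sink + " resume at " + resumeAt);
    }
}
```

In a real module the value from nextOffset() would be saved somewhere durable (for example the step's execution context) only after the items were written successfully, so that a rerun starts where the previous one stopped.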

Related

How to use two Kerberos keytabs (for Kafka and Hadoop HDFS) from a Flink job on a Flink standalone cluster?

Question
On a Flink standalone cluster running on a server, I am developing a Flink streaming job in Scala. The job consumes data from more than one Kafka topic, does some formatting, and writes the results to HDFS.
One of the Kafka topics and HDFS each require separate Kerberos authentication (because they belong to completely different clusters).
My questions are:
Is it possible (if yes, how?) to use two Kerberos keytabs (one for Kafka, the other for HDFS) from a Flink job on a Flink cluster, running on a server? (so the Flink job can consume from Kafka topic and write to HDFS at the same time)
If not possible, what is a reasonable workaround, for the Kafka-Flink-HDFS data streaming when Kafka and HDFS are both Kerberos protected?
Note
I am quite new to most of the technologies mentioned here.
The Flink job can write to HDFS if it doesn't need to consume the Kerberos-requiring topic. In this case, I specified the HDFS information in security.kerberos.login.keytab and security.kerberos.login.principal in flink-conf.yaml
I am using the HDFS connector provided by Flink to write to HDFS.
Manually switching the Kerberos authentication between the two principals was possible. In the [realms] section of the krb5.conf file, I specified two realms, one for Kafka, the other for HDFS.
kinit -kt path/to/hdfs.keytab [principal: xxx@XXX.XXX...]
kinit -kt path/to/kafka.keytab [principal: yyy@YYY.YYY...]
Environment
Flink (v1.4.2) https://ci.apache.org/projects/flink/flink-docs-stable/
Kafka client (v0.10.X)
HDFS (Hadoop cluster HDP 2.6.X)
Thanks for your attention and feedback!
Based on the answer and comment on this very similar question, it appears there is no clear way to use two credentials in a single Flink job.
Promising approaches or workarounds:
Creating a trust
Co-installing Kafka and HDFS on the same platform
Using something else to bridge the gap
An example of the last point:
You can use something like NiFi or Streams Replication Manager to bring the data from the source Kafka, to the Kafka in your cluster. NiFi is more modular, and it is possible to configure the kerberos credentials for each step. Afterward you are in a single context which Flink can handle.
Full disclosure: I am an employee of Cloudera, a driving force behind NiFi, Kafka, HDFS, Streams Replication Manager and since recently Flink
Three years after my initial post, our architecture has moved from a standalone bare-metal server to Docker containers on Mesos, but let me summarize the workaround (for Flink 1.8):
Place krb5.conf with all realm definitions and domain-realm mappings (for example under /etc/ of the container)
Place Hadoop krb5.keytab (for example under /kerberos/HADOOP_CLUSTER.ORG.EXAMPLE.COM/)
Configure Flink's security.kerberos.login.* properties in flink-conf.yaml
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.principal: username@HADOOP_CLUSTER.ORG.EXAMPLE.COM
security.kerberos.login.contexts should not be configured. This ensures that Flink does not use Hadoop’s credentials for Kafka and Zookeeper.
Copy keytabs for Kafka into separate directories inside the container (for example under /kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/)
Periodically run a custom script to renew the ticket cache
KINIT_COMMAND_1='kinit -kt /kerberos/HADOOP_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab username@HADOOP_CLUSTER.ORG.EXAMPLE.COM'
KINIT_COMMAND_2='kinit -kt /kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab username@KAFKA_CLUSTER.ORG.EXAMPLE.COM -c /tmp/krb5cc_kafka'
...
Set the property sasl.jaas.config when instantiating each FlinkKafkaConsumer to the actual JAAS configuration string.
This bypasses the global JAAS configuration. If we set it globally, we can’t use different Kafka instances with different credentials, or unsecured Kafka together with secured Kafka.
props.setProperty("sasl.jaas.config",
"com.sun.security.auth.module.Krb5LoginModule required " +
"refreshKrb5Config=true " +
"useKeyTab=true " +
"storeKey=true " +
"debug=true " +
"keyTab=\"/kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab\" " +
"principal=\"username@KAFKA_CLUSTER.ORG.EXAMPLE.COM\";")

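The per-consumer JAAS setting in the snippet above could be wrapped in a small helper so that each Kafka cluster gets its own Properties object. This is only a sketch using the JDK: the class and method names are hypothetical, and the values chosen for security.protocol and sasl.kerberos.service.name are assumptions that depend on how the brokers are configured.

```java
import java.util.Properties;

class KafkaKerberosProps {
    /** Builds an inline JAAS string in the same shape as the snippet above (without debug). */
    static String jaasConfig(String keytabPath, String principal) {
        return "com.sun.security.auth.module.Krb5LoginModule required "
                + "refreshKrb5Config=true "
                + "useKeyTab=true "
                + "storeKey=true "
                + "keyTab=\"" + keytabPath + "\" "
                + "principal=\"" + principal + "\";";
    }

    /** Consumer properties for one Kerberos-secured cluster; pass a fresh object to each consumer. */
    static Properties forCluster(String bootstrapServers, String keytabPath, String principal) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: SASL without TLS; a TLS-enabled listener would need SASL_SSL instead.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        // Assumption: the brokers run under the "kafka" service principal.
        props.setProperty("sasl.kerberos.service.name", "kafka");
        props.setProperty("sasl.jaas.config", jaasConfig(keytabPath, principal));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address; keytab path and principal follow the example above.
        Properties props = forCluster(
                "kafka-broker.example.com:9092",
                "/kerberos/KAFKA_CLUSTER.ORG.EXAMPLE.COM/krb5.keytab",
                "username@KAFKA_CLUSTER.ORG.EXAMPLE.COM");
        System.out.println(props.getProperty("sasl.jaas.config"));
    }
}
```

Because the credentials live in the consumer's own properties rather than in a JVM-wide JAAS file, a second consumer for an unsecured cluster can simply omit these keys.
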
Can Hadoop do streaming?

Someone suggested that Hadoop does streaming, and quoted Flume and Kafka as examples.
While I understand they might have streaming features, I wonder if they can be considered in the same league as stream processing technologies like Storm/Spark/Flink. Kafka is a 'publish-subscribe model messaging system' and Flume is a data ingestion tool. And even though they interact/integrate with Hadoop, are they technically part of 'Hadoop' itself?
PS: I understand there is a Hadoop Streaming which is an entirely different thing.
Hadoop is only YARN, HDFS, and MapReduce. As a project, it does not accommodate (near) real time ingestion or processing.
Hadoop Streaming is a utility for running MapReduce jobs whose mapper and reducer are arbitrary executables that exchange data over standard input/output.
Kafka is not only a publish/subscribe message queue.
Kafka Connect is essentially a Kafka channel, in Flume terms. Various plug-ins exist for reading from different "sources" and producing to Kafka, and "sinks" exist to consume from Kafka into databases or filesystems. From a consumer perspective, this is more scalable than singular Flume agents deployed across your infrastructure. If all you're looking for is log ingestion into Kafka, personally I find Filebeat or Fluentd to be better than Flume (no Java dependencies).
Kafka Streams is a comparable product to Storm, Flink, and Samza, except the dependency upon YARN or any cluster scheduler doesn't exist, and it's possible to embed a Kafka Streams processor within any JVM compatible application (for example, a Java web application). You'd have difficulties trying to do that with Spark or Flink without introducing a dependency on some external system(s).
The only benefit I find in Flume, NiFi, Storm, Spark, etc. is that they complement Kafka and have Hadoop-compatible integrations, along with integrations for other systems used in the big data space, like Cassandra (see the SMACK stack).
So, to answer the question, you need to use other tools to allow streaming data to be processed and stored by Hadoop.

Put data from Hive tables to kafka topic via nifi

I have a few tables in Hive, and my goal is to create a view over them and then publish it to a Kafka topic through Apache NiFi.
What are the options to get it done?
I am planning to do it through NiFi.
I'm sure NiFi would work (see the PutHiveStreaming processor), but that sounds like a lot of effort.
Kafka Connect HDFS is able to consume Kafka data and automatically register a Hive table for you.
If I misunderstood, and you're trying to query Hive and publish the results to a Kafka topic, then NiFi is perfectly capable of that: use SelectHiveQL and PublishKafka. The Kafka Connect JDBC Source should also be able to query Hive and write to Kafka.

Kafka topic merging with Kafka Connect to HDFS

Is it possible to configure Kafka Connect’s HDFS connector to write/combine several separate topics into one file?
The topics will contain messages with the same avro schema and I want KafkaConnect to act as an intermediary between those Kafka topics and HDFS. Worst case scenario the topic contents could be combined after being written to HDFS, but I feel like a cleaner and quicker way should be possible with the HDFS connector.
Right now the HDFS connector will write each topic to its own directory. You can combine directories in HDFS after writing, or combine topics in Kafka before writing to HDFS, but the connector itself will not do it.

How can I bring data from static websites to HDFS?

What other frameworks, like Spring XD or Flume, are available for that? Which one is best? Please advise on the steps to bring in the data.
Using Nutch
Using Kafka/Flume
Using Spring XD
A scraper such as import.io
A Java program using the producer-consumer pattern
