Using Kafka to import data to Hadoop

First I was deciding what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and the event-processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka
Flume - Contains Kafka Source (consumer) and Sink (producer)
I also found, on the same page and in the Kafka documentation, something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in which would be the better (and easier, better-documented) solution to do that. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks

You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink, so it's just a matter of changing a properties file. An example is given below.
Steps:
Create a kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the topic created above using the Kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the flume agent
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
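To sanity-check the pipeline, you can list and read the sink output once a few events have gone through; a minimal sketch, assuming the topic and defaults from the steps above (the date subdirectory is a placeholder):
# list the per-date directories written by the HDFS sink
hdfs dfs -ls /tmp/kafka/testkafka/
# print the rolled files for a given day
hdfs dfs -cat /tmp/kafka/testkafka/<yy-mm-dd>/test-events*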

Most of the time, I see people using Camus with Azkaban.
You can look at the GitHub repo of Mate1 for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus

Related

Restart kafka connect sink and source connectors to read from beginning

I have searched quite a lot on this but there doesn't seem to be a good guide around it.
From what I have searched there are a few things to consider:
Resetting Sink Connector internal topics (status, config and offset).
Source connector offset handling is implementation specific.
Question: Is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option), but it doesn't seem to be the right thing to do.
Resetting the consumer group with --reset-offsets --to-earliest.
Using the REST API (does it provide the functionality to reset and read from the beginning?).
What would be the best way to restart both a sink and a source connector to read from beginning?
Source Connector:
Standalone mode: remove the offset file (/tmp/connect.offsets) or change the connector name (see the sketch after these options).
Distributed mode: change the name of the connector.
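For standalone mode, a minimal sketch of that reset; the worker and connector property file names are placeholders for your existing configuration:
# stop the worker, remove the default offset file, then restart with the same configuration
rm /tmp/connect.offsets
connect-standalone worker.properties connector.properties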
Sink Connector (both modes), one of the following methods:
Change the name.
Reset the offset for the consumer group. The name of the group is the same as the connector name.
To reset the offset you first have to delete the connector, reset the offset (./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName), then add the same configuration once more.
You can check the following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
The source connector in distributed mode has another option, which is producing a new message to the offset topic.
For example, I use the JDBC source connector.
When looking at the offset topic I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
Now, in order to reset this, I just produce another message with incrementing:0.
For example, here is how to produce from the shell with a key:
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following (a sketch using the Connect REST API follows these steps):
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
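In distributed mode the delete and re-create steps can also be done through the Connect REST API; a rough sketch, assuming the default worker port and the connector name used above (the JSON config file is a placeholder for your existing connector configuration):
# delete the connector
curl -X DELETE http://localhost:8083/connectors/referrer-family-jdbc-source
# produce the offset message as described above, then re-create the connector
curl -X POST -H "Content-Type: application/json" --data @connector-config.json http://localhost:8083/connectors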
A bit late, but I found another way. Just set offset.storage.file.filename in standalone mode to /dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

Write events with Flume to S3 via HDFS sink and ensure transactions

We are using flume and S3 to store our events.
I noticed that events are only transferred to S3 when the HDFS sink rolls to the next file or Flume is shut down gracefully.
In my mind this can lead to potential data loss. The Flume documentation says:
...Flume uses a transactional approach to guarantee the reliable
delivery of the Events...
Here is my configuration:
agent.sinks.defaultSink.type = HDFSEventSink
agent.sinks.defaultSink.hdfs.fileType = DataStream
agent.sinks.defaultSink.channel = fileChannel
agent.sinks.defaultSink.serializer = avro_event
agent.sinks.defaultSink.serializer.compressionCodec = snappy
agent.sinks.defaultSink.hdfs.path = s3n://testS3Bucket/%Y/%m/%d
agent.sinks.defaultSink.hdfs.filePrefix = events
agent.sinks.defaultSink.hdfs.rollInterval = 3600
agent.sinks.defaultSink.hdfs.rollCount = 0
agent.sinks.defaultSink.hdfs.rollSize = 262144000
agent.sinks.defaultSink.hdfs.batchSize = 10000
agent.sinks.defaultSink.hdfs.useLocalTimeStamp = true
#### CHANNELS ####
agent.channels.fileChannel.type = file
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000
I assume that I'm just doing something wrong, any ideas?
After some investigation I found one of the main problems of using S3 with Flume and the HDFS sink.
One of the main differences between plain HDFS and the S3 implementation is that S3 does not directly support rename. When a file is renamed in S3, the file is copied to the new name and the old file is deleted. (See: How to rename files and folder in Amazon S3?)
By default, Flume extends files with .tmp while they are not yet full. After rotation the file is renamed to its final filename. In HDFS this is no problem, but in S3 it can cause problems according to this issue:
https://issues.apache.org/jira/browse/FLUME-2445
Because S3 with the HDFS sink seems not 100% trustworthy, I prefer the safer approach of saving all files locally and syncing/deleting the finished files with the AWS tool s3 sync (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html).
In the worst case the files are not synced or the local disk is full, but both problems can easily be caught by a monitoring system that should be in place anyway.
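For reference, a minimal sketch of that sync step; the local directory and bucket name are placeholders, and excluding the in-progress .tmp files keeps partially written files out of S3:
# run periodically, e.g. from cron
aws s3 sync /var/flume/events s3://my-events-bucket/events --exclude "*.tmp"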

Using flume to read IBM MQ data

I want to read data from IBM MQ and put it into HDFS.
I looked into Flume's JMS source; it seems it can connect to IBM MQ, but I don't understand what "destinationType" and "destinationName" mean in the list of required properties. Can someone please explain?
Also, how should I be configuring my Flume agents:
flumeAgent1 (runs on the same machine as MQ) reads MQ data ---- flumeAgent2 (runs on the Hadoop cluster) writes into HDFS,
or is only one agent on the Hadoop cluster enough?
Can someone help me understand how MQ can be integrated with Flume?
Reference
https://flume.apache.org/FlumeUserGuide.html
Thanks,
Chhaya
Regarding the Flume agent architecture: in its minimal form an agent is composed of a source in charge of receiving or polling for events and converting them into Flume events, which are put into a channel. Then a sink takes those events in order to persist the data somewhere, or to send them to another agent. All these components (source, channel, sink, i.e. an agent) run on the same machine; different agents may be distributed across machines instead.
That said, your scenario seems to require a single agent based on a JMS source, a channel (typically a memory channel), and an HDFS sink.
The JMS source, as stated in the documentation, has only been tested with ActiveMQ, but should work with any other queuing system. The documentation also provides an example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
a1 is the name of the single agent. c1 is the name of the channel, and its configuration still has to be completed; a sink configuration is missing entirely. It can easily be completed by adding:
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = ...
a1.sinks.k1...
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1...
r1 is the JMS source, and as you can see, destinationName simply asks for a string name, while destinationType can only take two values: queue or topic. I think the important parameters are providerURL, initialContextFactory and connectionFactory, which must be adapted for IBM MQ.
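As a rough, untested sketch of that adaptation: one common approach is to point the JMS source at a file-based JNDI context generated with IBM MQ's JMSAdmin tool (a .bindings file), with the MQ JMS client jars on the Flume classpath. The factory class, paths and object names below are assumptions and depend on your MQ installation:
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
# file-based JNDI created by JMSAdmin; assumes the fscontext/providerutil jars are on the classpath
a1.sources.r1.initialContextFactory = com.sun.jndi.fscontext.RefFSContextFactory
a1.sources.r1.providerURL = file:///opt/mq/jndi
# names of the connection factory and queue objects defined in the .bindings file (hypothetical)
a1.sources.r1.connectionFactory = MyMQConnectionFactory
a1.sources.r1.destinationType = QUEUE
a1.sources.r1.destinationName = MY.EVENTS.QUEUE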

In Storm Spout, Naming the Consumer Group

I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka consumer group name. By looking at the storm-kafka code, I followed the setting, id, to find that it is never used when dealing with a consumer configuration, but is used in creating the ZooKeeper path at which offset information is stored. Here in this link is an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and the format in which it stores consumer offsets in ZooKeeper is its own (JSON format).
If you write the spout config like below,
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
        "topic name",         // Kafka topic to read from
        "/kafka/consumers",   // just an example: ZooKeeper root path under which consumer offsets are stored
        "yourTopic");         // id, appended to the root path when storing offsets
It will write consumer offsets in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same ZooKeeper that your Storm cluster uses.
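A minimal sketch for inspecting those offsets with the ZooKeeper CLI, assuming ZooKeeper on localhost:2181; the per-partition node names (e.g. partition_0) are an assumption and may differ by storm-kafka version:
./bin/zkCli.sh -server localhost:2181
ls /kafka/consumers/yourTopic
# e.g. [partition_0, partition_1]
get /kafka/consumers/yourTopic/partition_0
# prints a JSON payload containing the committed offset, topology and broker info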

Hadoop MapReduce log4j - log messages to a custom file in userlogs/job_ dir?

It's not clear to me how one should configure Hadoop MapReduce log4j at a job level. Can someone help me answer these questions?
1) How to add log4j logging support from a client machine, i.e. I want to use a log4j properties file at the client machine, and hence don't want to disturb the Hadoop log4j setup in the cluster. I would think having the properties file in the project/jar should suffice, and Hadoop's distributed cache should do the rest when transferring the map-reduce jar.
2) How to log messages to a custom file in $HADOOP_HOME/logs/userlogs/job_/ dir.
3) Will the map-reduce task use both log4j properties files, the one supplied by the client job and the one present in the Hadoop cluster? If yes, would log4j.rootLogger combine both property values?
Thanks
Srivatsan Nallazhagappan
You can configure log4j directly in your code. For example, you can call PropertyConfigurator.configure(properties); e.g. in the mapper/reducer setup method.
This is an example with properties stored on HDFS:
InputStream is = fs.open(log4jPropertiesPath);
Properties properties = new Properties();
properties.load(is);
PropertyConfigurator.configure(properties);
where fs is a FileSystem object and log4jPropertiesPath is a path on HDFS.
With this you can also output logs to a directory named after the job_id. For example, you can modify the properties before calling PropertyConfigurator.configure(properties) (a hypothetical properties fragment follows the loop below):
// JOB_ID_PATTERN is a placeholder token (defined elsewhere in your code) that appears inside property values
Enumeration<?> propertiesNames = properties.propertyNames();
while (propertiesNames.hasMoreElements()) {
    String propertyKey = (String) propertiesNames.nextElement();
    String propertyValue = properties.getProperty(propertyKey);
    if (propertyValue.indexOf(JOB_ID_PATTERN) != -1) {
        properties.setProperty(propertyKey,
                propertyValue.replace(JOB_ID_PATTERN, context.getJobID().toString()));
    }
}
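To illustrate how such a placeholder could be used, here is a hypothetical fragment of the client-side log4j.properties; the @job_id@ token and the log file name are made up for this sketch and would be replaced at runtime by the loop above:
# hypothetical appender writing to a per-job directory
log4j.rootLogger=INFO, jobfile
log4j.appender.jobfile=org.apache.log4j.FileAppender
log4j.appender.jobfile.File=${hadoop.log.dir}/userlogs/@job_id@/custom.log
log4j.appender.jobfile.layout=org.apache.log4j.PatternLayout
log4j.appender.jobfile.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n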
There is no straightforward way to override the log4j properties at the job level.
The MapReduce job itself doesn't store the logs in Hadoop; it writes logs to the local file system (${hadoop.log.dir}/userlogs) of the DataNodes. There is a separate YARN process called log aggregation which collects those logs and combines them.
Use yarn logs -applicationId <appId> to fetch the full log, then use unix commands to parse and extract the part of the log you need, for example:
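A minimal sketch, where the application id and the grep pattern are placeholders:
# dump the aggregated log, then search for the messages from your custom logger
yarn logs -applicationId <appId> > app.log
grep -C 5 "MyCustomLogger" app.log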
