spark.kafka.poll.time in MapR stream consumers - spark-streaming

I am working with MapR Streams and setting the parameter "spark.kafka.poll.time" in my direct Kafka API consumer; however, I don't know exactly what this parameter means.

According to the MapR documentation (http://maprdocs.mapr.com/home/Spark/Spark_IntegrateMapRStreams_Consume.html), it is the query interval time for a consumer on MapR Streams. You mostly have to specify it only when using Spark Streaming to connect to Kafka. In the standard Java Kafka consumer, the poll method takes a timeout in milliseconds that you have to specify, so there is an analogy between the two.
For Java:
ConsumerRecords<String, String> records = kafkaConsumer.poll(consumerPollTime);
For Spark Streaming as Map params:
"spark.kafka.poll.time" -> "300",
// other params
KafkaUtils.createDirectStream[String, String](ssc, kafkaParams, topics)
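As a rough Scala sketch, the parameter is just another entry in the kafkaParams map passed to createDirectStream (the group id and offset setting below are illustrative placeholders, not values from the question):

// Consumer parameters for the direct stream; values are illustrative only.
val kafkaParams = Map[String, String](
  "spark.kafka.poll.time" -> "300", // how long (in ms) each poll waits for records from the stream
  "group.id" -> "my-consumer-group", // hypothetical consumer group
  "auto.offset.reset" -> "latest" // hypothetical offset-reset policy
)
// kafkaParams is then passed to KafkaUtils.createDirectStream as in the snippet above.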

Related

How to increase number of threads processing records in topology?

I have a Topology:
Topology builder = new Topology();
builder.addSource("source", stringDeserializer, stringDeserializer, "TOPIC-DEV-ACH")
       .addProcessor("process1", ProcessorOne::new, "source")
       .addProcessor("process2", ProcessorTwo::new, "source")
       .addProcessor("process3", ProcessorThree::new, "source")
       .addSink("sink", "asink", stringSerializer, stringSerializer, "process1", "process2", "process3");
If I log:
Thread.currentThread().getName() in process(K var1, V var2)
result :
processor1 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
processor2 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
processor3 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
I want each Processor to run in its own thread and then merge all the results. Is that possible with the Kafka Streams library?
A KafkaStreams instance uses StreamThreads for stream processing.
The number of StreamThreads is controlled by the StreamsConfig.NUM_STREAM_THREADS_CONFIG (num.stream.threads) configuration property, which defaults to 1, hence what you see.
Please note that although your Kafka Streams application can use multiple threads, a single topology (with all its processors) is executed by a single thread.
A thread executing a topology is essentially a Kafka consumer of the source topics, so the number of threads that can usefully process a single topology is capped at the number of partitions of those topics (spread across the KafkaStreams instances of the application).
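As a minimal sketch (Scala here, with a placeholder application id and broker list), the thread count is just another property passed when the KafkaStreams instance is created:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app") // hypothetical application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // hypothetical broker list
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3") // run 3 StreamThreads instead of the default 1

// 'builder' is the Topology constructed in the question above
val streams = new KafkaStreams(builder, props)
streams.start()

Note that more threads parallelize work across partitions of TOPIC-DEV-ACH; for any given record, the three processors still run sequentially on one thread, as explained above.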

Strange 'emitted' numbers behavior / zero stat numbers in Topology stats (Storm 1.0.3)

This is what my storm UI stat looks like.
The problem is that I have no idea where those numbers (of emitted tuples) are coming from.
My topology is pretty simple: kafka spout -> bolt (persisting data into HBase).
The topology works: when I put data into the Kafka topic, it gets processed by the bolt and persisted in HBase, which I then verify with a scan in the HBase shell (so new records are being inserted).
However, each time I submit a new message into Kafka and it is persisted by the bolt, my topology does not increase the 'emitted' count by 1.
Periodically all the numbers increase by 20, without any new messages going into Kafka. I.e. my Kafka topic gets no messages for hours, but the number of tuples emitted keeps growing in chunks of 20 over time, while I still get the same number of records in HBase.
I get no exceptions/errors anywhere in the Apache Storm logs.
I am not calling ack() or fail() on any of my tuples in my bolt implementation (it is a BasicBolt, which acks automatically).
The capacity and latency in the bolt metrics always stay at zero, even when I load a lot of messages into Kafka.
My Kafka offset log ($KAFKA/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker) shows that all messages are processed and the Kafka lag for the given topic/group is 0.
So my questions:
What are those 'stealth' tuples that increase 'emitted' in both the spout and the bolt in chunks of 20 over time?
Is it possible to enable 'debugging' in the Storm UI to see what those tuples are?
Why are capacity/latency in the bolt metrics always zero while the bolt is confirmed to persist data?
Environment details
I’m using Java 8 + Apache Storm 1.0.3
[devops#storm-wk1-prod]~/storm/supervisor/stormdist% storm version
Running: /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.0.3 -Dstorm.log.dir=/opt/apache-storm-1.0.3/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.0.3/lib/storm-core-1.0.3.jar:/opt/apache-storm-1.0.3/lib/kryo-3.0.3.jar:/opt/apache-storm-1.0.3/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.0.3/lib/asm-5.0.3.jar:/opt/apache-storm-1.0.3/lib/minlog-1.3.0.jar:/opt/apache-storm-1.0.3/lib/objenesis-2.1.jar:/opt/apache-storm-1.0.3/lib/clojure-1.7.0.jar:/opt/apache-storm-1.0.3/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.0.3/lib/log4j-api-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-core-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-slf4j-impl-2.1.jar:/opt/apache-storm-1.0.3/lib/slf4j-api-1.7.7.jar:/opt/apache-storm-1.0.3/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.0.3/lib/servlet-api-2.5.jar:/opt/apache-storm-1.0.3/lib/storm-rename-hack-1.0.3.jar:/opt/storm/conf org.apache.storm.utils.VersionInfo
Storm 1.0.3
URL https://git-wip-us.apache.org/repos/asf/storm.git -r eac433b0beb3798c4723deb39b3c4fad446378f4
Branch (no branch)
Compiled by ptgoetz on 2017-02-07T20:22Z
From source with checksum c78e52de4b8a22d99551d45dfe9c1a4b
My storm.yaml:
I'm running 2 instances of the Storm supervisor, each with the following config:
storm.zookeeper.servers:
- "10.138.0.8"
- "10.138.0.9"
- "10.138.0.16"
storm.zookeeper.port: 2181
nimbus.seeds: ["10.138.0.10"]
storm.local.dir: "/var/log/storm"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
topology.yaml
nimbus.host: "10.138.0.10"
# In Storm 0.7.x, this is necessary in order to give workers time to
# initialize. In Storm 0.8.0 and later, it may not be necessary because Storm
# has added a separate, longer timeout for the initial launch of a worker.
supervisor.worker.timeout.secs: 60
topology.workers: 1
topology
import tbolts
import tspouts

def create(builder):
    """Create topology through the Petrel library."""
    # spout getting data from the kafka instance
    # we run 2 tasks of the kafka spout
    builder.setSpout("kafka", tspouts.KafkaSpout(), 2)
    # persistence bolt
    # we run 4 tasks of the persistence bolt
    builder.setBolt("persistence", tbolts.PersistenceBolt(), 4).shuffleGrouping("kafka")
The reason your emit count jumps up by 20 is that, by default, Storm only samples every 20th tuple to update its metrics. This sampling rate is controlled by the topology.stats.sample.rate config variable and can be changed per topology. You could set it to 1.0 (it is 0.05 by default) to get an accurate emit count; however, this introduces significant processing overhead and may cause your acker and/or metrics consumer instances to become overloaded. Use with caution.
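As a minimal sketch of changing it (this assumes a topology submitted through Storm's Java/Scala API rather than Petrel; the topology name and the builder variable are placeholders):

import org.apache.storm.{Config, StormSubmitter}

val conf = new Config()
// Sample every tuple instead of every 20th; expect noticeably more metrics overhead.
conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, Double.box(1.0))

// 'builder' is assumed to be the TopologyBuilder for your topology.
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology())

Since topology.stats.sample.rate is an ordinary config key, it should also be possible to set it in the topology.yaml shown above, next to topology.workers.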

Spark streaming jobs duration in program

How do I get, in my program (which runs the Spark Streaming job), the time taken by each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, for each micro-batch RDD a job is run for the map and save operations.
I want to get the time taken by each streaming job. I can see the jobs in the UI on port 4040, but I want to get this in the Spark code itself.
Pardon if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method onBatchCompleted that can give you the total time taken by a batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is the implementation class that you write by extending StreamingListener.
There are other methods available on the listener as well; it is worth exploring them too.
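A minimal Scala sketch of such a listener (the class name follows the answer; the println is just for illustration):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StatusListenerImpl extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // processingDelay is the time spent running the batch's jobs;
    // totalDelay additionally includes the time the batch waited to be scheduled.
    println(s"Batch ${info.batchTime}: processing = ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"total = ${info.totalDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new StatusListenerImpl()) // ssc is the StreamingContext from the question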

Using flume to read IBM MQ data

I want to read data from IBM MQ and put it into HDFS.
I looked into Flume's JMS source; it seems it can connect to IBM MQ, but I don't understand what "destinationType" and "destinationName" mean in the list of required properties. Can someone please explain?
Also, how should I be configuring my Flume agents:
flumeAgent1 (runs on the same machine as MQ) reads MQ data ---- flumeAgent2 (runs on the Hadoop cluster) writes into HDFS
or is a single agent on the Hadoop cluster enough?
Can someone help me understand how MQ can be integrated with Flume?
Reference
https://flume.apache.org/FlumeUserGuide.html
Thanks,
Chhaya
Regarding the Flume agent architecture: in its minimal form an agent is composed of a source, in charge of receiving or polling for events and converting them into Flume events that are put into a channel, and a sink, which takes those events in order to persist the data somewhere or send it to another agent. All these components (source, channel, sink, i.e. an agent) run on the same machine; different agents may, however, be distributed across machines.
That said, your scenario seems to require a single agent based on a JMS source, a channel (typically a Memory Channel), and an HDFS sink.
The JMS source, as stated in the documentation, has only been tested with ActiveMQ, but should work with any other queueing system. The documentation also provides an example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE
a1 is the name of the single agent and c1 is the name of the channel; the channel configuration still has to be completed, and a sink configuration is missing entirely. It can easily be completed by adding:
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = ...
a1.sinks.k1...
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1...
r1 is the JMS source and, as can be seen, destinationName simply asks for a string name, while destinationType can only take two values: queue or topic. I think the important parameters are providerURL, initialContextFactory and connectionFactory, which must be adapted for IBM MQ.
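As a rough, untested sketch of that adaptation for IBM MQ over file-based JNDI (the bindings path, connection factory name and queue name below are purely hypothetical and depend on how your MQ administrator defined the JMS objects):

a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = com.sun.jndi.fscontext.RefFSContextFactory
a1.sources.r1.providerURL = file:///path/to/jndi/bindings
a1.sources.r1.connectionFactory = MyMQConnectionFactory
a1.sources.r1.destinationName = MY.BUSINESS.QUEUE
a1.sources.r1.destinationType = QUEUE

The IBM MQ JMS client jars (plus the fscontext jar that provides that context factory) also have to be available on the Flume agent's classpath.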

Using Kafka to import data to Hadoop

First I was considering what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so this way both the batch and the event-processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka:
Flume - Contains Kafka Source (consumer) and Sink (producer)
I also found, on the same page and in the Kafka documentation, that there is something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in which would be the better (and easier, better-documented) solution for this. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink; it's just a matter of changing a property file. An example is given below.
Steps:
Create a kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the topic created above using the Kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a flume agent with the following properties
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic = testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the flume agent
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to HDFS under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
Most of the time, I see people using Camus with Azkaban.
You can look at the GitHub repo of Mate1 for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus
