I have a Kafka Streams application that reads from one topic with 100 partitions, does some stateless transformations on individual messages, and produces them to another topic. There are 2 instances of the application running on 2 machines with 4 cores and 8 GB RAM each, and num.stream.threads is set to 50. The throughput on the output topic is very low. What tuning parameters should I look at to increase the processing speed of the Kafka Streams application?
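For reference, a minimal sketch of the setup described above, with placeholder topic names, application id, and broker address. The producer-side settings shown (linger.ms, batch.size, compression.type) are common throughput knobs with illustrative values, not specific recommendations for this workload.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StatelessForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-forwarder");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");        // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // 100 input partitions -> 100 stream tasks; 2 instances x 50 threads = one task per thread.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 50);
        // Producer knobs that often dominate output-topic throughput (illustrative values):
        props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 100);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 262144);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.COMPRESSION_TYPE_CONFIG), "lz4");

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")     // placeholder for the 100-partition topic
               .mapValues(value -> value.toUpperCase())   // stand-in for the stateless transformation
               .to("output-topic");                       // placeholder output topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```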
Related
We have a 3-node Kafka cluster with around 32 topics and 400+ partitions spread across these servers. The load is evenly distributed amongst the partitions, yet we observe that 2 broker servers are running at over 60% CPU whereas the third one is running at just about 10%. How do we ensure that all servers are loaded evenly? Do I need to reassign the partitions (kafka-reassign-partitions cmd)?
PS: The partitions are evenly distributed across all the broker servers.
In some cases, this is a result of the way that individual consumer groups determine which partition to use within the __consumer_offsets topic.
On a high level, each consumer group updates only one partition within this topic. This often results in a __consumer_offsets topic with a highly uneven distribution of message rates.
It may be the case that:
You have a couple of very heavy consumer groups, meaning they need to update the __consumer_offsets topic frequently. One of these groups uses a partition that has the 2nd broker as its leader; the other uses a partition that has the 3rd broker as its leader.
This would result in a significant amount of CPU being used to update this topic, and it would only occur on the 2nd and 3rd brokers (as seen in your screenshot).
A detailed blog post is found here
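The "one partition per group" mapping mentioned above can be approximated in a few lines of Java. This is a sketch that mirrors Kafka's internal Utils.abs(groupId.hashCode()) % offsets.topic.num.partitions calculation, assuming the default of 50 offsets partitions; the group ids are made up.

```java
public class OffsetsPartitionSketch {
    // Mirrors Kafka's Utils.abs(groupId.hashCode()) % numPartitions mapping.
    static int offsetsPartitionFor(String groupId, int numOffsetsPartitions) {
        return (groupId.hashCode() & 0x7fffffff) % numOffsetsPartitions;
    }

    public static void main(String[] args) {
        int numOffsetsPartitions = 50; // default offsets.topic.num.partitions
        for (String groupId : new String[]{"heavy-group-a", "heavy-group-b", "light-group"}) {
            System.out.println(groupId + " commits to __consumer_offsets-"
                    + offsetsPartitionFor(groupId, numOffsetsPartitions));
        }
    }
}
```

Two heavy groups whose ids happen to hash to partitions led by the same one or two brokers will concentrate the commit traffic, and the CPU cost, on those brokers.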
I am interested in using Kafka to stream data (100K records per second) that has to be consumed by multiple consumers (NoSQL, Lucene), and I want to know whether Kafka is a good fit for my requirements or whether there is a better alternative. The consumers are:
Consumer 1 - consumes data as soon as it arrives on the topic.
Consumer 2 - consumes data in batches from the topic.
Yes, Kafka is a perfect fit for your requirements. Read about Kafka Streams here.
If you want to read the data in batches, use the plain Kafka consumer.
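A minimal sketch of the batch-style consumer (Consumer 2 in the question), assuming a placeholder topic name and broker address; max.poll.records is one common way to control the size of each batch returned by poll().

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "batch-consumer");          // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("max.poll.records", "5000");            // upper bound on records per poll()

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                // Each poll() hands back up to max.poll.records messages as one batch.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : batch) {
                    // process record ...
                }
            }
        }
    }
}
```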
I am running a Spark-Kafka streaming job with 4 executors (1 core each), and the Kafka source topic has 50 partitions.
In the foreachPartition of the streaming Java program, I am connecting to Oracle and doing some work. Apache DBCP2 is used for the connection pool.
The Spark Streaming program is making 4 connections to the database - maybe 1 for each executor. But my expectation is that, since there are 50 partitions, there should be 50 threads running and 50 database connections.
How do I increase the parallelism without increasing the number of cores?
Your expectations are wrong. In Spark nomenclature, one core is one available thread and one partition that can be processed at a time.
4 "cores" -> 4 threads -> 4 partitions processed concurrently.
In a Spark executor, each core processes partitions one by one (one at a time). As you have 4 executors and each has only 1 core, you can only process 4 partitions concurrently. So if your Kafka topic has 50 partitions, your Spark cluster needs to run 13 rounds (4 partitions per round, 50 / 4 = 12.5 rounded up) to finish a batch. That is also why you only see 4 connections to the database.
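To make the arithmetic explicit, a small sketch using the numbers from the question (4 executors, 1 core each, 50 Kafka partitions); the values are taken from the question, not from a real cluster.

```java
public class SparkConcurrencySketch {
    public static void main(String[] args) {
        int executors = 4;          // executors in the job
        int coresPerExecutor = 1;   // spark.executor.cores
        int kafkaPartitions = 50;   // partitions of the source topic

        // Partitions processed at the same time, which is also the number of
        // foreachPartition blocks (and pooled DB connections) active at once.
        int concurrentTasks = executors * coresPerExecutor;
        int waves = (int) Math.ceil((double) kafkaPartitions / concurrentTasks);

        System.out.println("Concurrent partitions / DB connections: " + concurrentTasks); // 4
        System.out.println("Waves of tasks per batch: " + waves);                         // 13
    }
}
```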
I am facing a problem using Spark Streaming with Apache Kafka, where Spark is deployed on YARN. I am using the Direct Approach (no receivers) to read data from Kafka, with 1 topic and 48 partitions. With this setup on a 5-node (4 worker) Spark cluster (24 GB memory available on each machine) and Spark configuration (spark.executor.memory=2gb, spark.executor.cores=1), there should be 48 executors on the Spark cluster (12 executors on each machine).
The Spark Streaming documentation also confirms that there is a one-to-one mapping between Kafka partitions and RDD partitions. So for 48 Kafka partitions there should be 48 RDD partitions, each executed by 1 executor. But while running this, only 12 executors are created, the Spark cluster capacity remains unused, and we are not able to get the desired throughput.
It seems that this Direct Approach to reading data from Kafka in Spark Streaming is not behaving according to the Spark Streaming documentation. Can anyone suggest what I am doing wrong here, as I am not able to scale horizontally to increase the throughput?
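For reference, a minimal sketch of the Direct Approach using the spark-streaming-kafka-0-10 Java API; the broker address, topic, and group id are placeholders, and the integration version may differ from the one used in the question. Note that the one-to-one mapping the documentation describes is between Kafka partitions and RDD partitions (tasks), not executors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class DirectStreamSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("direct-stream-sketch"); // placeholder app name
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker:9092"); // placeholder
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("group.id", "direct-stream-sketch"); // placeholder

        // One RDD partition per Kafka partition (48 here); how many run at once
        // depends on the executors and cores granted by the cluster manager.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("events"), kafkaParams));

        stream.foreachRDD(rdd ->
            System.out.println("RDD partitions this batch: " + rdd.getNumPartitions()));

        jssc.start();
        jssc.awaitTermination();
    }
}
```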
We have Storm configured on a single-node development server, with most of the configuration set to defaults (not local mode).
Storm Nimbus, the supervisor, and the workers all run on that single node, and the UI is also configured.
AFAIK, parallelism and configuration differ from topology to topology.
I think finding the right parallelism and configuration is only possible by trial and error.
So, to find the best parallelism, we started testing our Storm topology with various configurations on that single node.
Strangely, the results are unexpected:
Our topology processes a stream of XML files from an HDFS directory.
It has a single spout (parallelism always 1) and four bolts.
Single worker
Whatever the topology parallelism, we get almost the same performance (rate of data processed).
Multiple workers
Whatever the topology parallelism, we get performance similar to the single worker for a while (in most cases about 10 minutes).
But after that, the complete topology gets restarted without any error traces.
We observed that data processed in 20 minutes with a single worker took 90 minutes with 5 workers at the same parallelism.
The topology also restarted 7 times with 5 workers.
And CPU usage is relatively high.
(Someone else also faced this topology restart issue, http://search-hadoop.com/m/LrAq5ZWeaU, but there is no answer.)
After testing many configurations we found that a single worker with low parallelism (each bolt with 2 or 3 instances) works better than high parallelism or more workers.
Ideally, the performance of a Storm topology should improve with more workers / higher parallelism.
Apparently that rule does not hold here.
Why can't we run more than a single worker on a single node?
What is the maximum number of workers that can be run on a single node?
What Storm configuration changes are needed to scale performance? (I have tried nimbus.childopts and worker.childopts; see the sketch below for where these settings are applied.)
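For reference, a minimal sketch of where workers, per-component parallelism, and the worker JVM options are set for a Storm topology. The spout and bolt below are empty placeholders, not the actual XML-processing components, and only one of the four bolts is shown.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Tuple;

public class ParallelismSketch {
    // Placeholder spout: stands in for the HDFS/XML spout from the question.
    public static class StubSpout extends BaseRichSpout {
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {}
        public void nextTuple() {}
        public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }

    // Placeholder bolt: stands in for one of the four processing bolts.
    public static class StubBolt extends BaseRichBolt {
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {}
        public void execute(Tuple tuple) {}
        public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("xml-spout", new StubSpout(), 1);   // spout parallelism fixed at 1
        builder.setBolt("bolt-1", new StubBolt(), 2)         // 2 executors for this bolt
               .shuffleGrouping("xml-spout");

        Config conf = new Config();
        conf.setNumWorkers(1);                               // worker processes for this topology
        // Per-topology override of the worker JVM options; the cluster-wide
        // worker.childopts setting lives in storm.yaml on the supervisor.
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx1g");

        StormSubmitter.submitTopology("xml-topology", conf, builder.createTopology());
    }
}
```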
If your CPU usage is already high on the one node, then you're not going to get any better performance as you increase parallelism. If you do increase parallelism, there will just be greater contention for a fixed number of CPU cycles. Not knowing any more about your specific topology, I can only suggest that you look for ways to reduce the CPU usage across your bolts and spouts. Only then would it make sense to add more bolt and spout instances.