We are facing a performance issue while integrating Spark and Kafka streams.
Project setup:
We are using a Kafka topic with 3 partitions, producing 3000 messages into each partition, and processing them with Spark direct streaming.
Problem we are facing:
On the processing end we use the Spark direct stream approach, as per the documentation below. Spark should create as many parallel direct streams as there are partitions in the topic (which is 3 in this case). But while reading, we can see that all the messages from partition 1 get processed first, then partition 2, then partition 3. Can anyone explain why it is not processing in parallel? As per my understanding, if it were reading from all the partitions at the same time, the message output should be interleaved across partitions.
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
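For reference, a minimal sketch of the 0-8 direct approach under discussion (the broker address, topic name and batch interval are placeholder assumptions). Each batch RDD gets one Spark partition, and therefore one task, per Kafka partition; printing the offset ranges shows which Kafka partition each task read, and whether the three tasks actually run at the same time depends on having at least three executor cores available.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val conf = new SparkConf().setAppName("direct-stream-check")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("test-topic"))                           // placeholder topic

    stream.foreachRDD { rdd =>
      // One Spark partition (hence one task) per Kafka partition in each batch.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"partition=${r.partition} offsets=${r.fromOffset}..${r.untilOffset}"))
    }
    ssc.start()
    ssc.awaitTermination()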
Did you try setting the spark.streaming.concurrentJobs parameter?
Maybe in your case it can be set to three:
sparkConf.set("spark.streaming.concurrentJobs", "3")
Thanks.
The scenario is:
We have small messages coming from hundreds of thousands of IoT devices, sending information every second to the main Gateway of our system (infrastructure monitoring)
We are streaming these messages into Kafka (into 20 different topics), keyed by the ID of the IoT sensor, with a configurable partitioning formula that decides how the data is partitioned (a sketch of such a partitioner follows this list)
We have some Kafka Connect processes reading the messages from these 20 topics and writing them into Hadoop HDFS, aggregating the messages every minute and partitioning the data into different HDFS staging directories (basically by groups of devices)
We would like to efficiently import all of this data into Impala, while also optimizing the Parquet file size for faster queries.
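A minimal sketch of what such a configurable partitioner could look like, to make the setup concrete (the class name and the hash-modulo formula are purely illustrative assumptions; it would be registered on the producer via the partitioner.class property):

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    // Illustrative only: route each record by a hash of the device ID used as the record key.
    class DeviceIdPartitioner extends Partitioner {
      override def configure(configs: JMap[String, _]): Unit = ()

      override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                             value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
        val numPartitions = cluster.partitionCountForTopic(topic)
        math.abs(key.asInstanceOf[String].hashCode % numPartitions)
      }

      override def close(): Unit = ()
    }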
For now, we have two processes:
First process: every 20 minutes, run some code that further compacts all files into a CURRENT_DAY repository and then loads the data into Impala
Second process: every day, run some Impala SQL code to compact the data from CURRENT_DAY and then truncate CURRENT_DAY to free space before new data comes in (a sketch of this step is given below)
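A rough sketch of what the second (daily) process could look like if it runs over JDBC against Impala and moves CURRENT_DAY into a partitioned history table; the JDBC URL, table names and column list are placeholder assumptions:

    import java.sql.DriverManager

    object DailyCompaction {
      def main(args: Array[String]): Unit = {
        // Placeholder Impala daemon address and database.
        val conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/monitoring")
        val stmt = conn.createStatement()
        try {
          // Rewrite the day's many small files into the partitioned history table in one pass.
          stmt.execute(
            """INSERT INTO history PARTITION (day)
              |SELECT device_id, ts, metric_value, to_date(ts) AS day FROM current_day""".stripMargin)
          // Free CURRENT_DAY for the next day's data.
          stmt.execute("TRUNCATE TABLE current_day")
          // Make the new files and fresh statistics visible to the query planner.
          stmt.execute("REFRESH history")
          stmt.execute("COMPUTE INCREMENTAL STATS history")
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }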
Issues:
The first problem is that we can see data in Impala only 20 minutes after it is generated, i.e. after the first process loads it into Impala
The second problem is that as the day approaches its end, the Impala queries become slower
I have found many related questions on StackOverflow, but I didn't find a general approach to this problem.
How to efficiently update Impala tables whose files are modified very frequently
how to efficiently move data from Kafka to an Impala table?
How can I achieve streaming data aggregation per batch using Spark Structured Streaming?
Question: is there any general approach for this scenario, which seems quite common: small data from a large number of devices, plus optimization of Impala queries?
Versions:
Hadoop = 3.1.4 (TBV)
Impala = 3.4.0
Kafka = 2.7.0 (Scala 2.13)
Kafka Connect plugin for Hadoop = kafka-connect-hdfs3:1.1.1
Now there is a problem that puzzles me: how should I measure the execution times of bolts and spouts in Storm? I have tried to use a ConcurrentHashMap (considering multithreading), but that does not work across multiple machines. Can you help me solve this problem?
Considering your question, I think you are trying to keep track of the number of tuples executed, not the amount of time a bolt or spout takes to execute one tuple.
You can use Storm's metrics with Graphite for visualisation; it gives you time-series data.
A database can also be used for the same purpose.
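If the goal actually is per-tuple execution time, Storm's built-in metrics API can report it from every worker, and a registered metrics consumer (for example org.apache.storm.metric.LoggingMetricsConsumer, or a Graphite-backed one) collects the numbers from all machines in one place. A minimal sketch against the Storm 2.x API; the bolt class and metric names are illustrative:

    import java.util.{Map => JMap}
    import org.apache.storm.metric.api.{CountMetric, MeanReducer, ReducedMetric}
    import org.apache.storm.task.{OutputCollector, TopologyContext}
    import org.apache.storm.topology.OutputFieldsDeclarer
    import org.apache.storm.topology.base.BaseRichBolt
    import org.apache.storm.tuple.Tuple

    class TimedBolt extends BaseRichBolt {
      private var collector: OutputCollector = _
      private var executedCount: CountMetric = _
      private var executeLatencyMs: ReducedMetric = _

      override def prepare(conf: JMap[String, AnyRef], context: TopologyContext,
                           collector: OutputCollector): Unit = {
        this.collector = collector
        // Reported every 60 s per worker; the metrics consumer aggregates across machines.
        executedCount = context.registerMetric("executed-count", new CountMetric(), 60)
        executeLatencyMs = context.registerMetric("execute-latency-ms",
          new ReducedMetric(new MeanReducer()), 60)
      }

      override def execute(tuple: Tuple): Unit = {
        val start = System.nanoTime()
        // ... actual tuple processing would go here ...
        collector.ack(tuple)
        executedCount.incr()
        executeLatencyMs.update((System.nanoTime() - start) / 1e6)   // mean latency in ms
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = ()
    }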
I am comparing the throughput of Spark Streaming and Kafka Streams. My results show that Kafka Streams has a higher throughput than Spark Streaming. Is this correct? Shouldn't it be the other way around?
Thanks
No one streaming platform is universally faster than all others for every use case. Don't get fooled by benchmarketing results that compare apples to oranges (like Kafka Streams reading from a disk based source vs Spark Streaming reading from an in-memory source). You haven't posted your test but it is entirely possible that it represents a use case (and test environment) in which Kafka Streams is indeed faster.
I'm working on a research project where I installed a complete data analysis pipeline on Google Cloud Platform. We estimate unique visitors per URL in real time using HyperLogLog on Spark. I used Dataproc to set up the Spark cluster. One goal of this work is to measure the throughput of the architecture depending on the cluster size. The Spark cluster has three nodes (the minimal configuration).
The data stream is simulated with our own data generators written in Java, using the Kafka producer API. The architecture looks as follows:
Data generators -> Kafka -> Spark Streaming -> Elasticsearch.
The problem is: as soon as the rate of produced events on my data generators exceeds ~1000 events/s, the input rate of my Spark job suddenly collapses and begins to vary a lot.
As you can see in the screenshot from the Spark Web UI, the processing times and scheduling delays stay constantly short while the input rate goes down.
Screenshot from Spark Web UI
I tested it with a completely trivial Spark job that only does a simple mapping, to exclude causes like slow Elasticsearch writes or problems with the job itself. Kafka also seems to receive and send all the events correctly.
Furthermore, I experimented with the Spark configuration parameters spark.streaming.kafka.maxRatePerPartition and spark.streaming.receiver.maxRate, with the same result.
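For reference, a minimal sketch of how those caps are set, together with backpressure; the numbers are arbitrary placeholders:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("throughput-test")
      // Upper bound on records read per Kafka partition per second (direct approach).
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")
      // Upper bound per receiver; only relevant for the receiver-based approach.
      .set("spark.streaming.receiver.maxRate", "2000")
      // Let Spark adapt the ingestion rate to the observed processing speed.
      .set("spark.streaming.backpressure.enabled", "true")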
Does anybody have any idea what is going wrong here? It really seems to be related to the Spark job or Dataproc... but I'm not sure. All CPU and memory utilizations seem to be okay.
EDIT: Currently I have two Kafka partitions on that topic (placed on one machine), but I think Kafka should handle more than 1500 events/s even with only one partition. The problem also occurred with one partition at the beginning of my experiments. I use the direct approach with no receivers, so Spark reads concurrently from the topic with two worker nodes.
EDIT 2: I found out what causes this bad throughput. I forgot to mention one component in my architecture: I use one central Flume agent that receives the log events from all my simulator instances via log4j over netcat. This Flume agent is the cause of the performance problem! I changed the log4j configuration to use asynchronous loggers (https://logging.apache.org/log4j/2.x/manual/async.html) via the Disruptor. I also scaled the Flume agent up to more CPU cores and RAM and changed the channel to a file channel, but it still performs badly. No effect... any other ideas on how to tune Flume performance?
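For what it's worth, the all-async mode from that page is usually enabled by selecting the asynchronous context selector before any logger is created (it needs the LMAX Disruptor on the classpath); the same setting can be passed as a -D JVM flag instead:

    // Must run before the first Logger is obtained.
    System.setProperty("Log4jContextSelector",
      "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector")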
Hard to say given the sparse amount of information. I would suspect a memory issue - at some point, the servers may even start swapping. So, check the JVM memory utilization and swapping activity on all servers. Elasticsearch should be capable of handling ~15,000 records/second with little tweaking. Check the free and committed RAM on the servers.
As I mentioned before, CPU and RAM utilizations are totally fine. I found a "magic limit": it seems to be exactly 1500 events per second. As soon as I exceed this limit, the input rate immediately begins to wobble.
The mysterious thing is that processing times and scheduling delays stay constant, so one can exclude backpressure effects, right?
The only thing I can guess is a technical limit with GCP/Dataproc... I didn't find any hints in the Google documentation.
Some other ideas?
I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following ways is preferred?
1) Create multiple threads within a bolt
2) Create multiple bolts and a merging bolt to aggregate the data
I would like to create multiple threads within a bolt, because merging the data in another bolt is not a simple process. But I found some concerns about that approach on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't get a clear reason why not to create multiple threads. Any pointers will help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; it is like spawning multiple instances of the same bolt. Also increase the number of executors (threads) so the tasks are handled evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.
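A small sketch of how that looks when wiring the topology (RequestSpout, FetchBolt and AggregateBolt are hypothetical component classes, and the parallelism numbers are just an example):

    import org.apache.storm.topology.TopologyBuilder
    import org.apache.storm.tuple.Fields

    // RequestSpout, FetchBolt and AggregateBolt stand in for your own components.
    val builder = new TopologyBuilder()
    builder.setSpout("request-spout", new RequestSpout(), 2)
    builder.setBolt("fetch-bolt", new FetchBolt(), 4)   // 4 executors (threads)
      .setNumTasks(8)                                   // 8 task instances, 2 per executor
      .shuffleGrouping("request-spout")
    builder.setBolt("aggregate-bolt", new AggregateBolt(), 1)
      .fieldsGrouping("fetch-bolt", new Fields("requestId"))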