Currently we are using Spark with a 1-minute batch interval to process the data. The data flow is: HTTP endpoint -> Spring XD -> Kafka -> Spark Streaming -> HBase. The format is JSON. We are running the Spark jobs in an environment with 6 NodeManagers, each with 16 CPU cores and 110 GB of RAM. For caching metadata, Scala's TrieMap is used, so the cache is per executor. Results on Spark with the settings below:
Kafka partitions - 45
Spark executors - 3
Cores per executor - 15
Number of JSON records - 455 000, received over 10 minutes
Spark processed the records in 12 minutes, and each executor is able to process about 350-400 records per second. JSON parsing, validation, and other work are done before loading into HBase.
With almost the same code (with modifications for Flink), I ran the job as a Flink streaming application deployed on a YARN cluster. Results on Flink with the settings below:
Kafka partitions - 45
Number of Task Managers for running the job - 3
Slots per TM - 15
Parallelism - 45
Number of JSON records - 455 000, received over 10 minutes
But Flink takes almost 50 minutes to process the records, with each TM processing 30-40 records per second.
What am I missing here? Are there any other parameters/configurations, apart from those mentioned, that impact performance? The job flow is DataStream -> Map -> custom functions. How could I improve Flink's performance here?
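One thing worth double-checking is whether the YARN session actually received the intended TaskManagers, slots, and parallelism. A minimal sketch of the legacy Flink-on-YARN submission, where the TaskManager memory size and the jar name are assumptions; the TM/slot/parallelism numbers mirror the question:

```
# -yn: number of TaskManagers, -ys: slots per TM, -ytm: TM memory (MB),
# -p: job parallelism (flags from the legacy Flink YARN CLI;
# memory size and jar name are placeholders)
flink run -m yarn-cluster \
  -yn 3 -ys 15 -ytm 8192 \
  -p 45 \
  your-flink-job.jar
```

Beyond submission flags, per-record synchronous writes in the custom functions (e.g. one HBase put per record) are a common cause of low per-slot throughput; batching the writes in the sink usually helps far more than adding slots.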
Related
I am using a Kinesis data stream as a source and Elasticsearch as a sink.
I am using a Flink job to process this data a little and then sink it to Elasticsearch.
In the production environment, the Kinesis data stream can generate 50,000 events per second.
Processing is taking a long time: to process 500,000 events takes nearly 50 minutes.
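To put those numbers in perspective, a small calculation (pure arithmetic over the figures in the question) shows the sink keeps up with only a tiny fraction of the peak input rate:

```java
public class ThroughputGap {
    // Effective processing rate in events per second.
    static double rate(long events, long minutes) {
        return events / (minutes * 60.0);
    }

    public static void main(String[] args) {
        double sinkRate = rate(500_000, 50); // ~167 events/s actually processed
        double sourceRate = 50_000;          // peak Kinesis input, events/s
        // The source can outpace the sink by a factor of ~300, so the
        // backlog grows almost as fast as data arrives.
        System.out.printf("sink: %.1f ev/s, ratio: %.0fx%n",
                sinkRate, sourceRate / sinkRate);
    }
}
```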
Elasticsearch version 7.7 running on SSD-based storage.
Elasticsearch nodes: 2
Shards: 5
Replicas: 1 per shard
Refresh interval: 1 sec (default)
We are using AWS OpenSearch (Elasticsearch).
Can someone please suggest what causes this delay?
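Given the 1-second refresh interval listed above, one likely contributor is constant segment refreshes during heavy ingest; Elasticsearch allows raising the interval per index. A hedged sketch, where the endpoint and the index name `events` are placeholders:

```
# Publish new segments every 30s instead of the default 1s during heavy
# ingest; "events" and the endpoint are placeholder names.
curl -X PUT "https://<opensearch-endpoint>/events/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```

With only ~167 events/s reaching the sink (500,000 events / 50 minutes), it is also worth checking the sink's bulk-flush settings (batch size and flush interval), since per-document requests cannot keep up with 50,000 events/s.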
I have a cluster with 1 Master (m4.large), 6 Core (m4.large), and 4 Task (m4.large) nodes. The 15 GB of CloudFront log data splits into 35 mappers and 64 reducers. Currently, it's taking more than 30 minutes to process fully -- too long for my purposes, so I stopped the job to reconfigure.
How long would I expect the processing to take with this setup? What would be a reasonable resizing to get the job to run in under 15 minutes?
I am running a Spark-Kafka Streaming job with 4 executors (1 core each), and the Kafka source topic has 50 partitions.
In the foreachPartition of the streaming Java program, I am connecting to Oracle and doing some work. Apache DBCP2 is used for the connection pool.
The Spark Streaming program is making 4 connections to the database -- maybe 1 per executor. But my expectation is that, since there are 50 partitions, there should be 50 threads running and 50 database connections.
How do I increase the parallelism without increasing the number of cores?
Your expectations are wrong. In Spark nomenclature, one core is one available thread and one partition that can be processed at a time.
4 "cores" -> 4 threads -> 4 partitions processed concurrently.
In a Spark executor, each core processes partitions one at a time. As you have 4 executors, each with only 1 core, you can only process 4 partitions concurrently. So if your Kafka topic has 50 partitions, your Spark cluster needs 13 rounds (4 partitions per round; 50 / 4 = 12.5, rounded up) to finish a batch. That is also why you only see 4 connections to the database.
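The round count in the answer above follows from simple ceiling division (all numbers are from the question):

```java
public class PartitionRounds {
    // Number of scheduling rounds needed when `partitions` Kafka partitions
    // are processed `slots` at a time (one partition per core).
    static int rounds(int partitions, int slots) {
        return (partitions + slots - 1) / slots; // integer ceiling division
    }

    public static void main(String[] args) {
        int partitions = 50; // Kafka partitions from the question
        int slots = 4;       // 4 executors x 1 core each
        System.out.println(rounds(partitions, slots)); // 50 / 4 rounded up = 13
    }
}
```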
I am facing a problem using Spark Streaming with Apache Kafka, where Spark is deployed on YARN. I am using the Direct Approach (no receivers) to read data from Kafka, with 1 topic and 48 partitions. With this setup on a 5-node (4-worker) Spark cluster (24 GB of memory available on each machine) and Spark configuration (spark.executor.memory=2g, spark.executor.cores=1), there should be 48 executors on the Spark cluster (12 executors on each machine).
The Spark Streaming documentation also confirms that there is a one-to-one mapping between Kafka partitions and RDD partitions. So for 48 Kafka partitions, there should be 48 RDD partitions, each executed by 1 executor. But when running this, only 12 executors are created, the Spark cluster's capacity remains unused, and we are not able to get the desired throughput.
It seems this Direct Approach to reading data from Kafka in Spark Streaming is not behaving according to the documentation. Can anyone suggest what I am doing wrong here, as I am not able to scale horizontally to increase the throughput?
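One possible explanation: the memory/core settings alone do not tell YARN how many executors to start, so the count falls back to a default (or to whatever dynamic allocation decides). A sketch of an explicit submission, reusing the settings from the question; the jar name is a placeholder:

```
# spark.executor.cores=1 and spark.executor.memory=2g alone do not set
# how many executors YARN starts; --num-executors (equivalently
# spark.executor.instances) must be requested explicitly.
spark-submit \
  --master yarn \
  --num-executors 48 \
  --executor-cores 1 \
  --executor-memory 2g \
  your-streaming-job.jar
```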
We have a very simple Spark Streaming job (implemented in Java), which is:
reading JSONs from Kafka via DirectStream (acks on Kafka messages are turned off)
parsing the JSONs into POJOs (using GSON -- our messages are only ~300 bytes)
mapping the POJOs to key-value tuples (value = object)
reduceByKey (a custom reduce function -- always comparing one field, quality, of the objects and keeping the instance with the higher quality)
storing the result in state (via mapWithState, which keeps the highest-quality object per key)
storing the result to HDFS
The JSONs are generated from a set of 1000 IDs (keys), and all events are randomly distributed across the Kafka topic partitions. This also means the resulting set of objects is at most 1000, as the job stores only the highest-quality object for each ID.
We ran the performance tests on AWS EMR (m4.xlarge = 4 cores, 16 GB memory) with the following parameters:
number of executors = number of nodes (i.e. 1 executor per node)
number of Kafka partitions = number of nodes (i.e. in our case also executors)
batch size = 10 (s)
sliding window = 20 (s)
window size = 600 (s)
block size = 2000 (ms)
default parallelism - tried different settings; the best results came when default parallelism = number of nodes/executors
The Kafka cluster contains just 1 broker, which is utilized to a max of ~30-40% during peak load (we pre-fill the topic with data and then independently execute the test). We tried increasing num.io.threads and num.network.threads, but without significant improvement.
The results of the performance tests (about 10 minutes of continuous load) were (the YARN master and Driver nodes are on top of the node counts below):
2 nodes - able to process max. 150 000 events/s without any processing delay
5 nodes - 280 000 events/s => 25 % penalty if compared to expected "almost linear scalability"
10 nodes - 380 000 events/s => 50 % penalty if compared to expected "almost linear scalability"
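The quoted penalties can be reproduced from the 2-node baseline, taking "linear scalability" to mean throughput grows proportionally with node count:

```java
public class ScalingPenalty {
    // Percentage shortfall of observed throughput versus linear scaling
    // from a measured baseline (baseRate events/s at baseNodes nodes).
    static double penaltyPercent(double baseRate, int baseNodes,
                                 double observed, int nodes) {
        double expected = baseRate * nodes / baseNodes; // linear expectation
        return 100.0 * (1.0 - observed / expected);
    }

    public static void main(String[] args) {
        // 5 nodes: expected 375 000 ev/s, observed 280 000 -> ~25 % penalty
        System.out.printf("%.1f%%%n", penaltyPercent(150_000, 2, 280_000, 5));
        // 10 nodes: expected 750 000 ev/s, observed 380 000 -> ~49 % penalty
        System.out.printf("%.1f%%%n", penaltyPercent(150_000, 2, 380_000, 10));
    }
}
```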
The CPU utilization in case of 2 nodes was ~
We also played around with other settings, including:
- testing a low/high number of partitions
- testing low/high/default values of defaultParallelism
- testing a higher number of executors (i.e. dividing the resources into e.g. 30 executors instead of 10)
but the settings above gave us the best results.
So - the question - is Kafka + Spark (almost) linearly scalable? If it should scale much better than our tests showed, how can it be improved? Our goal is to support hundreds/thousands of Spark executors (i.e. scalability is crucial for us).
We resolved this by:
increasing the capacity of the Kafka cluster:
more CPU power - we increased the number of Kafka nodes (1 Kafka node per 2 Spark executor nodes seemed to be fine)
more brokers - basically 1 broker per executor gave us the best results
setting a proper default parallelism (number of cores in the cluster * 2)
ensuring all the nodes have approximately the same amount of work
batch size/blockSize should be ~equal to or a multiple of the number of executors
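Applying the parallelism rule of thumb above to the 10-node m4.xlarge setup (4 cores per node), a sketch of the resulting setting (the value is derived from the numbers in this answer):

```
# spark-defaults.conf -- number of cores in cluster * 2
# (10 executor nodes x 4 cores x 2 = 80)
spark.default.parallelism  80
```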
In the end, we were able to achieve 1 100 000 events/s processed by the Spark cluster with 10 executor nodes. The tuning also increased performance on configurations with fewer nodes -> we achieved practically linear scalability when scaling from 2 to 10 Spark executor nodes (m4.xlarge on AWS).
At the beginning, the CPU on the Kafka node wasn't approaching its limits, but it was not able to keep up with the demands of the Spark executors.
Thanks for all the suggestions, particularly to #ArturBiesiadowski, who suggested the Kafka cluster was incorrectly sized.