Kafka + Spark scalability - performance

We have very simple Spark Streaming job (implemented in Java), which is:
reading JSONs from Kafka via DirectStream (acks on Kafka messages are turned off)
parsing the JSONs into POJO (using GSON - our messages are only ~300 bytes)
map the POJO to tuple of key-value (value = object)
reduceByKey (custom reduce function - always comparing 1 field - quality - from the objects and leaves the object instance with higher quality)
store the result in the state (via mapWithState stores the object with highest quality per key)
store the result to HDFS
The JSONs are generated with set of 1000 IDs (keys) and all the events are randomly distributed to Kafka topic partitions. This also means, that resulting set of objects is max 1000, as the job is storing only the object with highest quality for each ID.
We were running the performance tests on AWS EMR (m4.xlarge = 4 cores, 16 GB memory) with following parameters:
number of executors = number of nodes (i.e. 1 executor per node)
number of Kafka partitions = number of nodes (i.e. in our case also executors)
batch size = 10 (s)
sliding window = 20 (s)
window size = 600 (s)
block size = 2000 (ms)
default Parallelism - tried different settings, however best results getting when the default parallelism is = number of nodes/executors
Kafka cluster contains just 1 broker, which is utilized to max ~30-40% during the peak load (we're pre-filling the data to topic and then independently executing the test). We have tried to increase the num.io.threads and num.network.threads, but without significant improvement.
The he results of performance tests (about 10 minutes of continuous load) were (YARN master and Driver nodes are on top of the node counts bellow):
2 nodes - able to process max. 150 000 events/s without any processing delay
5 nodes - 280 000 events/s => 25 % penalty if compared to expected "almost linear scalability"
10 nodes - 380 000 events/s => 50 % penalty if compared to expected "almost linear scalability"
The CPU utilization in case of 2 nodes was ~
We also played around other settings including:
- testing low/high number of partitions
- testing low/high/default value of defaultParallelism
- testing with higher number of executors (i.e. divide the resources to e.g. 30 executors instead of 10)
but the settings above were giving us the best results.
So - the question - is Kafka + Spark (almost) linearly scalable? If it should be scalable much better, than our tests shown - how it can be improved. Our goal is to support hundreds/thousands of Spark executors (i.e. scalability is crucial for us).

We have resolved this by:
increasing the capacity of Kafka cluster
more CPU power - increased the number of nodes for Kafka (1 Kafka node per 2 spark exectur nodes seemed to be fine)
more brokers - basically 1 broker per executor gave us the best results
setting proper default parallelism (number of cores in cluster * 2)
ensuring all the nodes will have approx. the same amount of work
batch size/blockSize should be ~equal or a multiple of number of executors
At the end, we've been able to achieve 1 100 000 events/s processed by the spark cluster with 10 executor nodes. The tuning made also increased the performance on configurations with less nodes -> we have achieved practically linear scalability when scaling from 2 to 10 spark executor nodes (m4.xlarge on AWS).
At the beginning, CPU on Kafka node wasn't approaching to limits, however it was not able to respond to the demands of Spark executors.
Thnx for all suggestions, particulary for #ArturBiesiadowski, who suggested the Kafka cluster incorrect sizing.

Related

Spark partition on nodes foreachpartition

I have a spark cluster (DataProc) with a master and 4 workers (2 preemtible), in my code I have some thing like this:
JavaRDD<Signal> rdd_data = javaSparkContext.parallelize(myArray);
rdd_data.foreachPartition(partitionOfRecords -> {
while (partitionOfRecords.hasNext()) {
MyData d = partitionOfRecords.next();
LOG.info("my data: " + d.getId().toString());
}
})
myArray is composed by 1200 MyData objects.
I don't understand why spark uses only 2 cores, divide my array into 2 partitions, and doesn't use 16 cores.
I need to set the number of partition?
Thanks in advance for any help.
Generally it's always a good idea to specific the number of partitions as the second argument to parallelize since the optimal slicing of your dataset should really be independent from the particular shape of the cluster you're using, and Spark can at best use current sizes of executors as a "hint".
What you're seeing here is that Spark will default to asking taskScheduler for current number of executor cores to use as the defaultParallelism, combined with the fact that in Dataproc Spark dynamic allocation is enabled. Dynamic allocation is important because otherwise a single job submitted to a cluster might just specify max executors even if it sits idle and then it will prevent other jobs from being able to use those idle resources.
So on Dataproc, if you're using default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your default job, upon startup without doing any work, would sit on 1 executor with 2 cores. Then taskScheduler will report that 2 cores are currently reserved in total, and therefore defaultParallelism will be 2.
If you had a large cluster and you were already running a job for awhile (say, you have a map phase that runs for longer than 60 seconds) you'd expect dynamic allocation to have taken all available resources, so the next step of the job that uses defaultParallelism would then presumably be 16, which is the total cores on your cluster (or possibly 14, if 2 are consumed by an appmaster).
In practice, you probably want to parallelize into a larger number of partitions than total cores available anyways. Then if there's any skew in how long each element takes to process, you can have nice balancing where fast tasks finish and then those executors can start taking on new partitions while the slow ones are still running, instead of always having to wait for a single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to something 100x or more.
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

Why some worker nodes cost more CPU for system during running Spark application?

I have 1 master node and 4 worker nodes. I set up the cluster using Ambari and all monitoring metrics are collected from its dashboard. Spark on the top of Hadoop, so I have got YARN and HDFS. I run a very simple word count script and found that one of the worker nodes did the most job. The word count job is divided into 149 tasks. 98 tasks are done by one node.
Here is my code for counting words
val file = sc.textFile("/data/2gdata.txt") //read file from HDFS
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect*
This picture illustrates the events timeline and CPU usage for each worder nodes
Aggregated Metrics by Executor are shown here
Each task has same size of input file. I assume they would spend similar time such as around 30 seconds to count word in the piece of input file. Some tasks spent more than 10 minutes.
I realized those nodes doing less job cost more CPU for system operation as shown in blue area in the first graph. The worker did more tasks and cost more CPU for user (application).
I am wondering what kinds of system operations required for a Spark application. Why three of worker nodes cost more CPU for system? I also enabled spark.speculation, but those stragglers will be killed after 10 minutes and performance didn't get better. Moreover, those stragglers are node_local, so I assume this issue is not related to HDFS replication. (There are 3 replications under the rack.)
Thank you very much.
Even the input file size is same for each task, during the shuffle and reduce phase, some task might process more data than other tasks, data skewing may cause more CPU costs.
You can repartitioning the data in between may improve the performance.

Optimizing Bulk Indexing in elasticsearch

We have an elastic search cluster of 3 nodes of the following configurations
#Cpu Cores Memory(GB) Disk(GB) IO Performance
36 244.0 48000 very high
The machines are in 3 different zones namely eu-west-1c,eu-west-1a,eu-west-1b.
Each elastic search instance is being allocated 30GB of heap space.
we are using the above cluster for running aggregations only. The cluster has replication factor of 1 and all the string fields are not analyzed , doc_values is true for all the fields.
We are pumping data into this cluster running 6 instances of logstash in parallel ( having a batch size of 1000)
When more instances of logstash are started one by one the nodes of the ElasticSearch cluster starts throwing out of memory error.
What could be the possible optimizations to speed up bulk indexing rate on the cluster?= Will presence of nodes of cluster in the same zone increase bulk indexing? Will adding more nodes in the cluster help ?
Couple of steps taken so far
Increase the bulk queue size from 50 to 1000
Increase refresh interval from 1 seconds to 2 minutes
Changed segments merge throttling to none (
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing- performance.html)
We cannot set the replication factor to 0 due to inconsistency involved if one of the nodes goes down.

Apache Spark - more partitions, slower performance

I have been running an Apache Spark program on EC2 and my results are showing that by partitioning the data into the recommended 2-3 partitions per core is far slower than using the (seemingly) default partitions of 2.
Nanoseconds to complete without partitioning:
u1 - 650472112996
u2 - 654525970891
u3 - 498530412012
u4 - 568162162934
u5 - 659015135256
The number of total partitions was 2, I checked using sparkContext.partitions.size. *
Nanoseconds to complete with 2 partitions per core (total of 32 partitions):
u1 - 1562831187137
u2 - 1690008041723
u3 - 1652170780785
u4 - 3381857826333
u5 - 3969656930184
This particular sets of experiments was run across a 16 node cluster.
While without partitioning takes approximately 10 minutes every time, with partitioning took anywhere from 25-60 minutes! The exact same program and parameters were used. Checking the Spark UI showed that the datafile was too big for the default partitioned nodes and often data was spilled to disk, this never occurred when using 32 partitions. Additionally, only ever 2 tasks could be run on the default partitioned run (as clearly, there's only 2 partitions); meanwhile, the 32 partition cluster has 16 tasks running concurrently. I would've thought the partitioned run would've been quicker, not slower!
What could cause such a decrease in speed? Shuffling of data across the nodes perhaps? Also, why would one run, i.e. u1, take 25 minutes and another run, i.e. u5, take 1 hour on the 32 partitioned cluster, but when ran with default partitioning they took about the same time?
Apache Spark docs say that the number of partitions will be equal to the number of cores in the cluster; however, I have tested this across cluster sizes ranging from 2 - 20, using different EC2 instances, and every time it defaults to 2 partitions.

Hadoop Terasort unstable benchmark results

I have a Cloudera Hadoop cluster and I'm doing some benchmarks running Terasort but I'm getting very unstable results from 105 - 150 minutes. Some times I've seen it was replicating more than usual or doing a lot of garbage collections but some other times they were pretty much the same.
I don't know the reason of the unstable results, any hint or recommendation will be very welcome :)
I run the benchmarks as follows:
I've chosen the number of maps and reduces tasks following this guide http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Speculative maps and reduce execution is off.
Generating dataset:
10,000,000,000 rows of 100 bytes ~= 953674 M
Block size = 128 MB
Number of maps tasks = 3725 (number-of-rows * row-size) / (block-size*2) I do times 2 because the maps tasks time was too low, like 7 seconds.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Ddfs.replication=3 -Dmapred.map.tasks=3725 10000000000 /terasort-in
Running terasort:
num-of-worker-nodes = 4
num-of-cores-per-node = 8
Reduce tasks = 56 ( 1.75 * num-of-worker-nodes * num-of-cores-per-node )
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -Ddfs.replication=1 -Dmapred.reduce.tasks=56 /terasort-in /terasort-out
The service and role distribution among nodes is as follows:
6 Nodes - 8 cores, 16 GB RAM and 2 HD each - running just HDFS and MapReduce:
1st node, just master roles:
Namenode.
Cloudera management services.
2nd node, just master roles:
JobTracker.
SecondaryNamenode.
3rd to 6th nodes, just worker roles:
TaskTracker.
Datanode.
I use the 2nd node as client because is the one with the lowest load.
Please tell me if you need any configuration property value or detail.
Update: After Chris White's answer I've tried to reduce the number of pollings between the jobtracker and tasktrackers by having just 1 worker and very few maps and reduces, now the benchmarks are pretty stable :)
There are many factors that you need to take into consideration when looking at performance:
This could be a polling problem combined with the small number of processing slots you have available.
The Task Trackers poll the running tasks periodically to determine if they have finished, and the Job Tracker also polls the Task Trackers. With your ~3700 map tasks (if i've read your question correctly), if there was say a ~1 second difference in polling times, then this could account for the ~hour you are seeing in timing differences.
If you have a larger cluster with more processing slots, i imagine this number would become more stable, but no MR job will every have a constant running time, there are too many polling and other external timings (JVM start up time for example) that can adjust the overall runtime.
What was the data locality counters say for both jobs? If one job had considerably more data lock tasks than another then i would expect it to run fast too.

Resources