I have been running an Apache Spark program on EC2, and my results show that partitioning the data into the recommended 2-3 partitions per core is far slower than using the (seemingly) default 2 partitions.
Nanoseconds to complete without partitioning:
u1 - 650472112996
u2 - 654525970891
u3 - 498530412012
u4 - 568162162934
u5 - 659015135256
The total number of partitions was 2; I checked using sparkContext.partitions.size.
Nanoseconds to complete with 2 partitions per core (total of 32 partitions):
u1 - 1562831187137
u2 - 1690008041723
u3 - 1652170780785
u4 - 3381857826333
u5 - 3969656930184
This particular set of experiments was run across a 16-node cluster.
While the run without partitioning takes approximately 10 minutes every time, the partitioned run took anywhere from 25 to 60 minutes! The exact same program and parameters were used. The Spark UI showed that the data file was too big for the default partitions and data was often spilled to disk; this never occurred when using 32 partitions. Additionally, only 2 tasks could ever run concurrently in the default-partitioned run (since there are only 2 partitions), whereas the 32-partition run had 16 tasks running concurrently. I would have thought the partitioned run would be quicker, not slower!
What could cause such a decrease in speed? Shuffling of data across the nodes, perhaps? Also, why would one run (e.g. u1) take 25 minutes and another (e.g. u5) take an hour with 32 partitions, when with default partitioning they all took about the same time?
The Apache Spark docs say that the number of partitions will be equal to the number of cores in the cluster; however, I have tested this across cluster sizes ranging from 2 to 20 nodes, using different EC2 instance types, and every time it defaults to 2 partitions.
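For reference, here is a minimal sketch (not the actual benchmark program; the input path and the multiplier are placeholders) of how the partition count can be inspected and overridden explicitly:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-check"))

        // What Spark would use when no partition count is given.
        println(s"defaultParallelism = ${sc.defaultParallelism}")

        // textFile accepts a minimum partition count; without it, inputs can
        // easily end up with just a couple of partitions.
        val numPartitions = sc.defaultParallelism * 3          // e.g. 2-3 partitions per core
        val rdd = sc.textFile("hdfs:///path/to/input", numPartitions)
        println(s"input partitions = ${rdd.partitions.length}")

        // An existing RDD can also be reshaped explicitly.
        println(s"after repartition = ${rdd.repartition(numPartitions).partitions.length}")

        sc.stop()
      }
    }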
Related
I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible); in my code I have something like this:
    JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray);
    rdd_data.foreachPartition(partitionOfRecords -> {
        while (partitionOfRecords.hasNext()) {
            MyData d = partitionOfRecords.next();
            LOG.info("my data: " + d.getId().toString());
        }
    });
myArray is composed of 1200 MyData objects.
I don't understand why Spark uses only 2 cores and divides my array into 2 partitions, instead of using all 16 cores.
Do I need to set the number of partitions?
Thanks in advance for any help.
Generally it's always a good idea to specify the number of partitions as the second argument to parallelize, since the optimal slicing of your dataset should really be independent of the particular shape of the cluster you're using, and Spark can at best use the current sizes of executors as a "hint".
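A minimal sketch of what that looks like, assuming an existing SparkContext named sc (Scala shown for brevity; the Java API's javaSparkContext.parallelize takes the same optional second argument, and the element type and partition count below are placeholders):

    val myData = (1 to 1200).toList        // stand-in for the 1200 MyData objects
    val desiredPartitions = 48             // chosen from the data size, not the current cluster shape
    val rdd = sc.parallelize(myData, desiredPartitions)
    println(rdd.getNumPartitions)          // 48, regardless of how many executors have started yet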
What you're seeing here is that Spark defaults to asking the taskScheduler for the current number of executor cores to use as defaultParallelism, combined with the fact that dynamic allocation is enabled in Dataproc. Dynamic allocation is important because otherwise a single job submitted to a cluster might request the maximum number of executors even while it sits idle, preventing other jobs from using those idle resources.
So on Dataproc, if you're using the default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your job, on startup and before doing any work, sits on 1 executor with 2 cores. The taskScheduler then reports that 2 cores are reserved in total, and therefore defaultParallelism is 2.
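If you want to verify this on the cluster, something along these lines in spark-shell should show it (a hedged sketch; the printed values are what the description above predicts, not something verified here):

    println(sc.defaultParallelism)                                         // 2 right after startup
    println(sc.getConf.getOption("spark.dynamicAllocation.enabled"))       // Some(true) on Dataproc
    println(sc.getConf.getOption("spark.dynamicAllocation.minExecutors"))  // typically Some(1)
    println(sc.getConf.getOption("spark.executor.cores"))                  // Some(2) with n1-standard-4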
If you had a large cluster and had already been running a job for a while (say, a map phase that runs for longer than 60 seconds), you'd expect dynamic allocation to have taken all available resources, so the next step of the job that uses defaultParallelism would then presumably see 16, the total number of cores on your cluster (or possibly 14, if 2 are consumed by an app master).
In practice, you probably want to parallelize into a larger number of partitions than the total cores available anyway. Then, if there is any skew in how long each element takes to process, you get nice balancing: fast tasks finish and their executors start taking on new partitions while the slow ones are still running, instead of everything waiting for a single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to 100x or more.
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
I am trying to run a simple query. I assume that running queries with spark.sql("query") has no performance difference compared to DataFrames, as I am using Spark 2.1.0 and have the Catalyst optimizer to take care of the optimization, with Tungsten enabled.
Here I am joining 2 tables with a left outer join. My first table is 200 GB and is the driving table (being on the left side); the second table is 2 GB, and there can be no filters, as per our business requirement.
Configuration of my cluster: as this is a shared cluster, I have been assigned a specific queue which allows me to use 3 TB of memory (yes, 3 terabytes), but the number of VCORES is 480. That means I can only run 480 parallel tasks. On top of that, at the YARN level there is a constraint of a maximum of 8 cores per node and a maximum of 16 GB of container memory. Because of this I cannot give my executor memory (which is per node) more than 12 GB, since I am giving 3 GB as executor memory overhead to be on the safe side, which comes to 15 GB of per-node memory utilization.
So, with 480 total allowed vcores and the 8-cores-per-node limit, I get 480 / 8 = 60 nodes for my computation, which comes to 60 * 15 = 900 GB of usable memory (I don't know why the total queue memory is 3 TB). And this is at peak, i.e. if I am the only one using the queue, which is not always the case.
Now the doubt is how Spark uses this whole 900 GB of memory. From the numbers and stats I would say my job should run without any issues, as the data size I am trying to process is only 210-250 GB at most and I have 900 GB of available memory.
But I keep getting "container killed" error messages, and I cannot increase the YARN container size because that is set at the YARN level and the whole cluster would get the increased container size, which is not the right thing to do. I have also tried disabling the vmem-check.enabled property (setting it to false) in my code via sparkSession.config(property), but that doesn't help either; maybe I am not allowed to change anything at the YARN level, so it might be ignoring that.
Now, on what basis does Spark split the data initially? Is it based on the block size defined at the cluster level (assuming 128 MB)? I am thinking this because when my job starts I see that my big table, which is around 200 GB, gets 2000 tasks; so on what basis does Spark calculate these 2000 tasks (partitions)? Looking at the Input Size/Records and Shuffle Write Size/Records under the Stages tab of the Spark UI, I thought the default partition size when Spark loads my table might be quite big, and that this is why I am getting the container-killed error and the suggestion to increase executor memory overhead, which did not help either.
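For what it's worth, a 128 MB split size is consistent with the ~2000 tasks (this is only a back-of-the-envelope check; the real split count depends on the input format and on settings such as spark.sql.files.maxPartitionBytes):

    val inputBytes = 250L * 1024 * 1024 * 1024   // ~250 GB, the upper end of the size quoted above
    val splitBytes = 128L * 1024 * 1024          // assumed 128 MB block/split size
    println(inputBytes / splitBytes)             // 2000, matching the ~2000 tasks observed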
I tried to repartition the data from 10k to 100k partitions and tried persisting with MEMORY_ONLY, MEMORY_AND_DISK and DISK_ONLY, but nothing helped. Many of my tasks were failing and in the end the job would fail, sometimes with container killed, direct buffer, and other errors.
Now, what is the use of persist/caching here and how does it behave? I am doing:
    val result = spark.sql("query big_table").repartition(10000, $"<column name>").persist()
The column in the repartition is the joining key, so it gets distributed. To make this work before the JOIN I am doing result.show(1), so the action is performed and the data gets persisted on disk, and Spark will read the data persisted on disk for the JOIN, so there will be no load on memory as it is stored in small chunks on disk (am I correct here?).
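For comparison, here is the same pattern with an explicit storage level and a full materialisation step (a hedged sketch; the query, table and column names are placeholders). Note that for Datasets, persist() without arguments defaults to MEMORY_AND_DISK, so partitions that do not fit in memory spill to disk rather than failing:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    val result = spark.sql("SELECT * FROM big_table")   // placeholder query
      .repartition(10000, $"join_key")                  // join_key stands in for the real joining column
      .persist(StorageLevel.DISK_ONLY)                  // or MEMORY_AND_DISK to use memory when available

    result.count()                                      // forces every partition to be computed and cached

Also note that result.show(1) only needs a single row, so it generally will not populate the whole cache; an action such as count() touches every partition.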
Why does this same job in Hive, with the same big table plus some additional tables with a left join, complete successfully? It takes time, but it completes; in Spark, however, it fails. Why? Is Spark not a complete replacement for Hive? Doesn't Spark work like Hive when it comes to spilling to disk and writing data to disk when persisting with DISK?
Does the YARN container size play a role if we have a smaller container size but a good number of nodes?
Does Spark combine the memory of all the available nodes (15 GB per node, as per the container size) to load a large partition?
We have a very simple Spark Streaming job (implemented in Java), which:
reads JSONs from Kafka via DirectStream (acks on Kafka messages are turned off)
parses the JSONs into POJOs (using GSON - our messages are only ~300 bytes)
maps each POJO to a key-value tuple (value = object)
calls reduceByKey (custom reduce function - it always compares one field, quality, of the objects and keeps the instance with the higher quality; sketched after this list)
stores the result in the state (via mapWithState, keeping the object with the highest quality per key)
stores the result to HDFS
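A hedged Scala sketch of the reduce and state steps above (the real job is in Java; the Signal class and its quality field are stand-ins for the actual POJO, and checkpointing must be enabled on the StreamingContext for mapWithState to work):

    import org.apache.spark.streaming.{State, StateSpec}
    import org.apache.spark.streaming.dstream.DStream

    case class Signal(id: String, quality: Double)   // stand-in for the parsed POJO

    // Keep whichever of the two objects has the higher quality.
    def keepBest(a: Signal, b: Signal): Signal =
      if (a.quality >= b.quality) a else b

    // Merge the incoming best-of-batch value with whatever is already in the state.
    def updateState(id: String, incoming: Option[Signal], state: State[Signal]): Option[Signal] = {
      val best = (incoming.toSeq ++ state.getOption().toSeq).reduceOption(keepBest)
      best.foreach(state.update)
      best
    }

    def pipeline(parsed: DStream[Signal]): DStream[(String, Signal)] =
      parsed
        .map(s => (s.id, s))                              // key-value tuple, value = object
        .reduceByKey(keepBest _)                          // per-batch reduce, higher quality wins
        .mapWithState(StateSpec.function(updateState _))  // highest quality per key across batches
        .stateSnapshots()                                 // (id, best Signal) pairs, ready to write to HDFS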
The JSONs are generated from a set of 1000 IDs (keys), and all the events are randomly distributed across the Kafka topic partitions. This also means that the resulting set of objects is at most 1000, as the job stores only the object with the highest quality for each ID.
We were running the performance tests on AWS EMR (m4.xlarge = 4 cores, 16 GB memory) with following parameters:
number of executors = number of nodes (i.e. 1 executor per node)
number of Kafka partitions = number of nodes (i.e. in our case also executors)
batch size = 10 (s)
sliding window = 20 (s)
window size = 600 (s)
block size = 2000 (ms)
default parallelism - tried different settings, but we got the best results when the default parallelism equals the number of nodes/executors
The Kafka cluster contains just 1 broker, which is utilized at most ~30-40% during peak load (we pre-fill the data into the topic and then execute the test independently). We have tried to increase num.io.threads and num.network.threads, but without significant improvement.
The results of the performance tests (about 10 minutes of continuous load) were as follows (YARN master and driver nodes are on top of the node counts below):
2 nodes - able to process max. 150 000 events/s without any processing delay
5 nodes - 280 000 events/s => 25 % penalty if compared to expected "almost linear scalability"
10 nodes - 380 000 events/s => 50 % penalty if compared to expected "almost linear scalability"
The CPU utilization in case of 2 nodes was ~
We also played around with other settings, including:
- testing low/high number of partitions
- testing low/high/default value of defaultParallelism
- testing with a higher number of executors (i.e. dividing the resources into e.g. 30 executors instead of 10)
but the settings above were giving us the best results.
So, the question: is Kafka + Spark (almost) linearly scalable? If it should scale much better than our tests showed, how can it be improved? Our goal is to support hundreds/thousands of Spark executors (i.e. scalability is crucial for us).
We have resolved this by:
increasing the capacity of Kafka cluster
more CPU power - increased the number of nodes for Kafka (1 Kafka node per 2 Spark executor nodes seemed to be fine)
more brokers - basically 1 broker per executor gave us the best results
setting proper default parallelism (number of cores in cluster * 2) - see the sketch after this list
ensuring all the nodes will have approx. the same amount of work
batch size/blockSize should be ~equal to or a multiple of the number of executors
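A hedged sketch of the parallelism-related settings above, assuming the 10-node m4.xlarge setup (4 cores per executor node, so 40 cores and a default parallelism of 80); the property names are standard Spark settings, the values come from the list above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-streaming-benchmark")          // placeholder app name
      .set("spark.default.parallelism", "80")           // number of cores in the cluster * 2
      .set("spark.streaming.blockInterval", "2000ms")   // block size of 2000 ms

    val ssc = new StreamingContext(conf, Seconds(10))   // 10 s batch size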
In the end, we were able to achieve 1,100,000 events/s processed by the Spark cluster with 10 executor nodes. The tuning also increased the performance of configurations with fewer nodes: we achieved practically linear scalability when scaling from 2 to 10 Spark executor nodes (m4.xlarge on AWS).
At the beginning, the CPU on the Kafka node wasn't approaching its limits, but it still could not keep up with the demands of the Spark executors.
Thanks for all the suggestions, particularly to @ArturBiesiadowski, who suggested that the Kafka cluster was incorrectly sized.
I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
Data Node Server Configuration: 2-2.5 Ghz hexa core CPU, 48 GB RAM, 1 TB -7200 RPM (X 8)
Number of data nodes: 5
Name Node Server: Enterprise class server configuration (X 2) (1 additional for a secondary name node)
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.
Sizing
There is a formula given by Hortonworks to calculate your sizing
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default vars
repl_count == 3 (default)
comp_ratio = 3-4 (default)
intermediate data size = 30%-50% of raw data size
1.2 factor - temp space
So for your first year, you will need 16.9 TB. You have 8 TB * 5 == 40 TB, so space is not the issue.
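One way to reproduce that 16.9 TB figure from the formula (assuming intermediate data at 30% of raw and a compression ratio of about 3.6, so the replication and temp-space factors roughly cancel out; these inputs are assumptions, not stated in the question):

    val initial      = 7.0                                 // TB already on hand
    val yoyGrowth    = 0.5 * 12                            // 500 GB/month for one year = 6 TB
    val intermediate = 0.30 * (initial + yoyGrowth)        // ~30% of raw data
    val raw          = initial + yoyGrowth + intermediate  // ~16.9 TB
    val sized        = raw * 3 * 1.2 / 3.6                 // repl = 3, 1.2 temp factor, comp ratio ~3.6
    println(f"$raw%.1f TB raw-equivalent, $sized%.1f TB after replication and compression")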
Performance
5 datanodes: reading 1 TB takes on average 2.5 hours on a single drive (source: Hadoop - The Definitive Guide). 600 GB with one drive would therefore be about 1.5 hours. Assuming the data is replicated so that you can use all 5 nodes in parallel, reading the whole data set with 5 nodes can take as little as about 18 minutes.
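The arithmetic behind that estimate, spelled out (it assumes a single-drive scan rate of roughly 1 TB per 2.5 hours, as above):

    val hoursPerTB       = 2.5
    val dataTB           = 0.6                  // ~600 GB
    val singleDriveHours = dataTB * hoursPerTB  // 1.5 h on one drive
    println(singleDriveHours / 5 * 60)          // ~18 minutes with 5 nodes reading in parallel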
You may have to add some more time depending on what your queries do and how you have configured your data processing.
Memory consumption
48 GB is not much. The default RAM for many data nodes starts at 128 GB. If you use the cluster only for processing, it might work out. It also depends a bit on how you configure the cluster and which technologies you use for processing. If you have concurrent access, it is likely that you will run into heap errors.
To sum it up:
It depends a lot on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If 18 minutes of processing time for 600 GB of data (as a baseline - real values depend on many factors unknown when answering this question) is enough and you do not have concurrent access, go for it.
I would recommend transforming the data on arrival. Hive can give a tremendous speed boost by switching to a columnar compressed format like ORC or Parquet. We're talking about potential 30-40x improvements in query performance. With the latest Hive you can leverage streaming data ingest into ORC files.
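As a hedged illustration of the "transform on arrival" idea, using Spark rather than Hive DDL (the paths and table names are placeholders, and a Hive-enabled SparkSession is assumed):

    // Read the raw monthly drop and rewrite it as a compressed columnar table.
    val raw = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/monthly/")            // placeholder path for the incoming data

    raw.write
      .mode("append")
      .format("orc")                              // or "parquet"
      .saveAsTable("analytics.events_columnar")   // placeholder table name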
You can leave things as you planned (HBase + Hive) and just rely on brute force with 5 x (6 cores, 48 GB, 7200 RPM), but you don't have to. A bit of work can get you into interactive ad-hoc query territory, which will open up data analysis.
I have a Cloudera Hadoop cluster and I'm doing some benchmarks running TeraSort, but I'm getting very unstable results, from 105 to 150 minutes. Sometimes I've seen it replicating more than usual or doing a lot of garbage collection, but other times the runs looked pretty much the same.
I don't know the reason for the unstable results; any hint or recommendation will be very welcome :)
I run the benchmarks as follows:
I've chosen the number of maps and reduces tasks following this guide http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Speculative map and reduce execution is off.
Generating dataset:
10,000,000,000 rows of 100 bytes ~= 953674 MB
Block size = 128 MB
Number of map tasks = 3725, i.e. (number-of-rows * row-size) / (block-size * 2). I multiply the block size by 2 because otherwise the map task times were too low, around 7 seconds.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Ddfs.replication=3 -Dmapred.map.tasks=3725 10000000000 /terasort-in
Running terasort:
num-of-worker-nodes = 4
num-of-cores-per-node = 8
Reduce tasks = 56 (1.75 * num-of-worker-nodes * num-of-cores-per-node); both task counts are sanity-checked in the sketch after the command below.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -Ddfs.replication=1 -Dmapred.reduce.tasks=56 /terasort-in /terasort-out
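A quick arithmetic check of those task counts (constants copied from the question, written out purely for the calculation):

    val rows        = 10000000000L
    val rowBytes    = 100L
    val blockBytes  = 128L * 1024 * 1024
    val mapTasks    = rows * rowBytes / (blockBytes * 2)   // 3725, using the doubled block size
    val reduceTasks = (1.75 * 4 * 8).toInt                 // 1.75 * worker nodes * cores per node = 56
    println(s"$mapTasks map tasks, $reduceTasks reduce tasks")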
The service and role distribution among nodes is as follows:
6 Nodes - 8 cores, 16 GB RAM and 2 HD each - running just HDFS and MapReduce:
1st node, just master roles:
Namenode.
Cloudera management services.
2nd node, just master roles:
JobTracker.
SecondaryNamenode.
3rd to 6th nodes, just worker roles:
TaskTracker.
Datanode.
I use the 2nd node as the client because it is the one with the lowest load.
Please tell me if you need any configuration property value or detail.
Update: After Chris White's answer, I tried to reduce the number of polls between the JobTracker and TaskTrackers by using just 1 worker and very few maps and reduces; now the benchmarks are pretty stable :)
There are many factors that you need to take into consideration when looking at performance:
This could be a polling problem combined with the small number of processing slots you have available.
The TaskTrackers poll their running tasks periodically to determine whether they have finished, and the JobTracker also polls the TaskTrackers. With your ~3700 map tasks (if I've read your question correctly), a difference of roughly 1 second in polling time per task adds up to about 3700 seconds, which could account for the roughly one hour of variation you are seeing.
If you had a larger cluster with more processing slots, I imagine this number would become more stable, but no MR job will ever have a constant running time; there are too many polling and other external timings (JVM start-up time, for example) that can affect the overall runtime.
What do the data locality counters say for both jobs? If one job had considerably more data-local tasks than the other, I would expect it to run faster, too.