Improving compute performance of Spark ML ALS


I have a Spark job that performs Alternating Least Squares (ALS) on an implicit feedback ratings matrix. I create the ALS object as follows.
import org.apache.spark.ml.recommendation.ALS
val als = new ALS()
  .setCheckpointInterval(5)
  .setRank(150)
  .setAlpha(30.0)
  .setMaxIter(25)
  .setRegParam(0.001)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setImplicitPrefs(true)
  .setIntermediateStorageLevel("MEMORY_ONLY")
  .setFinalStorageLevel("MEMORY_ONLY")
The ratings matrix is created and used to fit the ALS model as follows.
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
val ratingsSchema = StructType(Array(
  StructField("userId", IntegerType, nullable = true),
  StructField("itemId", IntegerType, nullable = true),
  StructField("rating", DoubleType, nullable = true)))
val ratings = spark
  .read
  .format("parquet")
  .schema(ratingsSchema)
  .load("/ratings")
  .cache()
val model = als.fit(ratings)
There are roughly 150 million unique users and 1 million items in the ratings DataFrame, which has around 850 million rows.
Based on the numbers above, the ratings DataFrame should occupy about 20 GB in memory when fully loaded. The userFactors DataFrame would be roughly 150 million x 150 doubles = 180 GB. The itemFactors DataFrame should be only about 1.2 GB.
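The estimates above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it counts raw payload bytes (8-byte doubles, 4-byte ints) and ignores JVM object and caching overhead, which is an assumption on my part and is why the raw ratings figure comes out below the ~20 GB quoted. The object and function names are made up for illustration.

```scala
object AlsMemoryEstimate {
  val BytesPerDouble = 8L
  val BytesPerInt = 4L

  // Raw size in bytes of a dense (rows x rank) factor matrix stored as doubles.
  def factorBytes(rows: Long, rank: Int): Long = rows * rank * BytesPerDouble

  // Raw size in bytes of the ratings rows: two Int ids plus one Double rating each.
  def ratingsBytes(rows: Long): Long = rows * (2 * BytesPerInt + BytesPerDouble)

  // Decimal gigabytes (1 GB = 1e9 bytes).
  def toGB(bytes: Long): Double = bytes / 1e9

  def main(args: Array[String]): Unit = {
    val rank = 150
    println(f"ratings:     ${toGB(ratingsBytes(850000000L))}%.1f GB")
    println(f"userFactors: ${toGB(factorBytes(150000000L, rank))}%.1f GB")
    println(f"itemFactors: ${toGB(factorBytes(1000000L, rank))}%.1f GB")
  }
}
```

The factor sizes scale linearly with the rank, so halving the rank halves the userFactors footprint under the same assumptions.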
It is taking a really long time for the job to complete (15+ hours). My cluster specs are as follows.
Provider: AWS EMR version 5.14.0
Spark version: 2.3.0
Cluster:
1 MASTER node: m4.xlarge (8 cores, 16GB mem, 32GB storage)
2 CORE nodes: i3.xlarge (4 cores, 30GB mem, 950 GB storage)
20 TASK nodes: r4.4xlarge (16 cores, 122GB mem, 32 GB storage)
Total TASK cores = 320
Total TASK memory: 2440 GB
Based on the numbers above, all DFs should easily fit in memory (there is also TB+ HDFS available, if needed).
Job configuration:
--executor-memory 102g
--num-executors 20
--executor-cores 15
I can see that there are 20 executors running (and a driver too). I have tried caching the ratings DF and also without caching.
How do I tune the system to make it run faster, if possible?
Does anyone have insights into the ALS job? Does it do a lot of shuffling? How can we minimize the shuffle?
The ratings matrix is in parquet format with 200 files stored in a bucket in S3.
Would it work better if I have lots of small instances (say 50) or should I get a few (say, 5) very big instances like r4.16xlarge (64 cores, 488GB mem)?

Related

How to increase hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.
Details
I am using a fairly large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
Hive creates hundreds of map tasks, but only 4 run simultaneously.
This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing to the nodes in use, and they are not fully utilizing CPU or memory. Our cluster is backed by InfiniBand networking and Isilon file storage, neither of which seems heavily loaded.
We are using MapReduce as the engine. I have tried removing every resource limit I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there a way to get Hive/MapReduce to use more of my cluster?
How would I go about figuring out the bottleneck?
Could it be that YARN is not assigning tasks fast enough?
I guess that using Tez would improve performance (we do not have it installed at the moment), but I am still interested in why resource utilization is so limited.

The number of tasks that can run in parallel depends on your YARN memory settings.
For example, suppose you have 4 data nodes and your YARN memory properties are defined as follows:
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
With these settings, each of the 4 data nodes offers 1 GB (yarn.nodemanager.resource.memory-mb), so 4 GB in total is available for launching containers.
Since each container takes 1 GB, at most 4 containers can run at any given time. One of them is used by the application master, so at most 3 mapper or reducer tasks can run at once, because the application master, mappers, and reducers each use 1 GB.
So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of concurrent map/reduce tasks.
P.S. This is the maximum number of tasks that can be launched; the actual number may be lower.
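The reasoning above can be written down as a small calculation. This is a sketch under two assumptions of mine: memory is the only limiting resource (vcores are ignored), and the application master uses a container of the same size as a map/reduce task, which holds for both sets of settings in this thread. The names are hypothetical.

```scala
object YarnContainerEstimate {
  // Containers a single NodeManager can host when memory is the only limit.
  def containersPerNode(nodeMemoryMb: Long, containerMemoryMb: Long): Long =
    nodeMemoryMb / containerMemoryMb

  // Upper bound on concurrent map/reduce tasks: all containers in the
  // cluster minus the ones taken by application masters.
  def maxConcurrentTasks(nodes: Int,
                         nodeMemoryMb: Long,
                         containerMemoryMb: Long,
                         amContainers: Int = 1): Long =
    math.max(0L, nodes * containersPerNode(nodeMemoryMb, containerMemoryMb) - amContainers)

  def main(args: Array[String]): Unit = {
    // The 4-node, 1 GB example from this answer.
    println(maxConcurrentTasks(4, 1024L, 1024L))      // 3
    // The settings from the question: 41 nodes, 188928 MB each, 20992 MB containers.
    println(maxConcurrentTasks(41, 188928L, 20992L))  // 368
  }
}
```

If the observed concurrency is far below this bound, as with the 4 mappers in the question, memory per node is probably not the limiting factor and something else (for example scheduler queue limits) is worth checking.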

need to improve performance of spark-sql joins

I have been working on a project where we use Spark SQL as the analytics platform, and I am currently facing issues while joining two DataFrames, df1 and df2.
df1 has 25000 records
df2 has 127000 records
When I join these two tables as Spark DataFrames, the join takes a long time.
val df_join = df1.join(df2, df2("col1") === df1("col1")).drop(df1("col2"))
I checked the Spark UI for the job status, and it shows some astonishing numbers: the input size/records keep increasing unexpectedly.
Please let me know why the input size is increasing so much and how I should tune my Spark job.
Screenshots of the cluster are attached.
3-node cluster running on YARN
6 GB for the driver
5 GB per executor, with 2 cores per executor
Status of the job after more than 30 minutes: the input size has increased to almost 1000 GB.

What about the files with smaller size than the hadoop block size: spark + machine learning

My Hadoop block size is 128 MB and my file is 30 MB.
My Spark cluster has 4 nodes with a total of 64 cores.
My task is to run a random forest or gradient boosting algorithm with a parameter grid and 3-fold cross-validation on top of this.
A few lines of the code:
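import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// target_col_name, index_transformers, assembler and df_train are defined elsewhere
val gbt_model = new GBTRegressor().setLabelCol(target_col_name).setFeaturesCol("features").setMaxIter(2).setMaxDepth(2).setMaxBins(1700)
val stages: Array[org.apache.spark.ml.PipelineStage] = index_transformers :+ assembler :+ gbt_model
// pipeline was not shown in the original snippet; assumed to wrap the stages above
val pipeline = new Pipeline().setStages(stages)
// the maxIter and maxDepth values set above are overridden by the grid below
val paramGrid = new ParamGridBuilder().addGrid(gbt_model.maxIter, Array(100, 200)).addGrid(gbt_model.maxDepth, Array(2, 5, 10)).build()
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)
val cvModel = cv.fit(df_train)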
My file has around
Input: 10 discrete/ string/ char features + 2 integer feature
Output: An integer response/output variable
and this takes more than 4 hours to run on my cluster. What I observed is that my code runs on just 1 node with just 3 containers.
Questions:
What can I do to make sure my code runs on all four nodes, or uses as many cores as possible, for fast computation?
What can I do in terms of partitioning my data (a DataFrame in Scala, and a CSV file on the Hadoop cluster) to improve speed?

When you submit your job, you can pass the number of executors you want via the parameter --num-executors. You can also specify the number of cores and the amount of memory each executor will use, via --executor-cores and --executor-memory.
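Executor settings are one side of the run time; the amount of work is the other. As a rough sketch (the object and function names are made up), this counts how many model fits the CrossValidator setup in the question triggers, assuming one fit per (parameter combination, fold) pair plus one final refit of the best combination on the full training set:

```scala
object CrossValidationCost {
  // gridSizes holds the number of candidate values per tuned parameter.
  // Total fits = (product of grid sizes) * numFolds + 1 final refit.
  def totalFits(gridSizes: Seq[Int], numFolds: Int): Int =
    gridSizes.product * numFolds + 1

  def main(args: Array[String]): Unit = {
    // maxIter in {100, 200}, maxDepth in {2, 5, 10}, setNumFolds(5) as in the code.
    println(totalFits(Seq(2, 3), 5))  // 31
    // The same grid with the 3-fold cross-validation mentioned in the text.
    println(totalFits(Seq(2, 3), 3))  // 19
  }
}
```

Each of those fits trains up to 200 boosting iterations, so trimming the grid or the number of folds may help as much as adding executors.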

Hadoop machine configuration

I want to analyze 7TB of data and store the output in a database, say HBase.
My monthly increment is 500GB, but to analyze 500GB data I don't need to go through 7TB of data again.
Currently I am thinking of using Hadoop with Hive for analyzing the data, and
Hadoop with MapReduce and HBase to process and store the data.
At the moment I have 5 machines of following configuration:
Data Node Server Configuration: 2-2.5 GHz hexa-core CPU, 48 GB RAM, 1 TB 7200 RPM disks (x 8)
Number of data nodes: 5
Name Node Server: Enterprise class server configuration (X 2) (1 additional for secondary
I want to know if the above process is sufficient given the requirements, and if anyone has any suggestions.

Sizing
Hortonworks gives a formula for calculating your sizing:
((Initial Size + YOY Growth + Intermediate Data Size) * Repl Count * 1.2) / Comp Ratio
Assuming default values:
repl_count = 3 (default)
comp_ratio = 3-4 (default)
Intermediate data size = 30%-50% of raw data size
1.2 factor = temp space
So for your first year, you will need 16.9 TB. You have 8 TB * 5 = 40 TB, so disk space is not the issue.
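To make the formula easy to replay with other inputs, here is a small sketch. The inputs are my reading of the question and are assumptions: 7 TB initial data, 12 x 500 GB = 6 TB growth in the first year, intermediate data at 30% of raw data, and a compression ratio of 3. With those inputs the bracketed sum (raw plus intermediate data) is 16.9 TB, and applying replication, the 1.2 temp-space factor and compression gives about 20.3 TB. Either figure is well below the 40 TB available.

```scala
object ClusterSizing {
  // ((initial + yearly growth + intermediate) * replication * tempFactor) / compressionRatio
  // All sizes are in TB; intermediate data is taken as a fraction of the raw data.
  def requiredTb(initialTb: Double,
                 yearlyGrowthTb: Double,
                 intermediateFraction: Double,
                 replication: Int = 3,
                 tempFactor: Double = 1.2,
                 compressionRatio: Double = 3.0): Double = {
    val raw = initialTb + yearlyGrowthTb
    val intermediate = raw * intermediateFraction
    (raw + intermediate) * replication * tempFactor / compressionRatio
  }

  def main(args: Array[String]): Unit = {
    // Assumed inputs: 7 TB initial, 6 TB growth in year one, 30% intermediate data.
    println(f"${requiredTb(7.0, 6.0, 0.30)}%.1f TB")
    // Same inputs at the optimistic end of the compression range (ratio 4).
    println(f"${requiredTb(7.0, 6.0, 0.30, compressionRatio = 4.0)}%.1f TB")
  }
}
```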
Performance
You have 5 data nodes. Reading 1 TB takes on average 2.5 hours on a single drive (source: Hadoop: The Definitive Guide), so 600 GB on one drive would take 1.5 hours. Assuming the data is replicated so that all 5 nodes can read in parallel, reading the whole data set with 5 nodes can get down to about 18 minutes.
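The same estimate as a one-line function, in case you want to plug in other data sizes or node counts. It is only a sketch: it assumes perfectly parallel reads and no processing time on top, and the names are invented for illustration.

```scala
object ScanTimeEstimate {
  // Minutes needed to read dataGb, given a per-drive rate expressed as
  // hours per TB and a number of readers working fully in parallel.
  def scanMinutes(dataGb: Double, hoursPerTb: Double, parallelReaders: Int): Double =
    dataGb / 1000.0 * hoursPerTb * 60.0 / parallelReaders

  def main(args: Array[String]): Unit = {
    println(f"${scanMinutes(600.0, 2.5, 1)}%.0f min on one drive")
    println(f"${scanMinutes(600.0, 2.5, 5)}%.0f min on five nodes")
  }
}
```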
You may have to add some more time depending on what you do with your queries and how you have configured your data processing.
Memory consumption
48 GB is not much. The default RAM for many data nodes starts at 128 GB. If you use the cluster only for processing, it might work out, depending somewhat on how you configure the cluster and which technologies you use for processing. If you have concurrent access, you are likely to run into heap errors.
To sum it up:
It depends very much on what you want to do with your cluster and how complex your queries are. Also keep in mind that concurrent access could create problems.
If an 18-minute processing time for 600 GB of data is enough (as a baseline; real values depend on many factors that are unknown when answering this question) and you do not have concurrent access, go for it.

I would recommend transforming the data on arrival. Hive can give a tremendous speed boost if you switch to a columnar compressed format, like ORC or Parquet. We're talking about potential 30-40x improvements in query performance. With the latest Hive you can leverage streaming data ingest on ORC files.
You can leave things as you planned (HBase + Hive) and just rely on brute force 5 x (6 Core, 48GB, 7200 RPM) but you don't have to. A bit of work can get you into interactive ad-hoc query time territory, which will open up data analysis.

Apache Spark: The number of cores vs. the number of executors

I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN.
The test environment is as follows:
Number of data nodes: 3
Data node machine spec:
CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
Network: 1Gb
Spark version: 1.0.0
Hadoop version: 2.4.0 (Hortonworks HDP 2.1)
Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile
Input data
Type: single text file
Size: 165GB
Number of lines: 454,568,833
Output
Number of lines after second filter: 310,640,717
Number of lines of the result file: 99,848,268
Size of the result file: 41GB
The job was run with following configurations:
(1) --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (one executor per data node, using as many cores as possible)
(2) --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (number of cores reduced)
(3) --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (fewer cores, more executors)
Elapsed times:
(1) 50 min 15 sec
(2) 55 min 48 sec
(3) 31 min 23 sec
To my surprise, (3) was much faster.
I thought that (1) would be faster, since there would be less inter-executor communication when shuffling.
Although (1) has fewer cores than (3), the number of cores does not seem to be the key factor, since (2) performed well too.
(The following was added after pwilmot's answer.)
For the information, the performance monitor screen capture is as follows:
Ganglia data node summary for (1) - job started at 04:37.
Ganglia data node summary for (3) - job started at 19:47. Please ignore the graph before that time.
The graph roughly divides into 2 sections:
First: from start to reduceByKey: CPU intensive, no network activity
Second: after reduceByKey: CPU lowers, network I/O is done.
As the graph shows, (1) can use as much CPU power as it was given. So, it might not be the problem of the number of the threads.
How to explain this result?

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as
possible: Imagine a cluster with six nodes running NodeManagers, each
equipped with 16 cores and 64GB of memory. The NodeManager capacities,
yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 *
1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100%
of the resources to YARN containers because the node needs some
resources to run the OS and Hadoop daemons. In this case, we leave a
gigabyte and a core for these system processes. Cloudera Manager helps
by accounting for these and configuring these YARN properties
automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors.
--executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
The explanation was given in an article in Cloudera's blog, How-to: Tune Your Apache Spark Jobs (Part 2).
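The derivation in the quoted example can be sketched as a small function. The 5-cores-per-executor cap and the 7% memory overhead come from the quote; the object and field names, and the choice to reserve exactly one executor slot for the application master, are my own assumptions.

```scala
object ExecutorSizing {
  final case class Layout(numExecutors: Int, executorCores: Int, executorMemoryGb: Int)

  // usableCoresPerNode / usableMemGbPerNode are what is left for YARN
  // after reserving resources for the OS and Hadoop daemons.
  def plan(nodes: Int,
           usableCoresPerNode: Int,
           usableMemGbPerNode: Int,
           coresPerExecutor: Int = 5,
           overheadFraction: Double = 0.07): Layout = {
    val executorsPerNode = usableCoresPerNode / coresPerExecutor
    // One executor slot in the cluster is given up to the application master.
    val numExecutors = nodes * executorsPerNode - 1
    val memPerSlotGb = usableMemGbPerNode.toDouble / executorsPerNode
    // Subtract the memory overhead and round down to whole gigabytes.
    val executorMemoryGb = math.floor(memPerSlotGb * (1 - overheadFraction)).toInt
    Layout(numExecutors, coresPerExecutor, executorMemoryGb)
  }

  def main(args: Array[String]): Unit = {
    // Six nodes with 15 usable cores and 63 GB usable memory each.
    println(plan(6, 15, 63))  // Layout(17,5,19)
  }
}
```

For the cluster in the quote this reproduces --num-executors 17 --executor-cores 5 --executor-memory 19G.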

Short answer: I think tgbaggio is right. You hit HDFS throughput limits on your executors.
I think the answer here may be a little simpler than some of the recommendations here.
The clue for me is in the cluster network graph. For run 1 the utilization is steady at ~50 M bytes/s. For run 3 the steady utilization is doubled, around 100 M bytes/s.
From the cloudera blog post shared by DzOrd, you can see this important quote:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.
So, let's do a few calculations to see what performance we would expect if that is true.
Run 1: 19 GB, 7 cores, 3 executors
3 executors x 7 threads = 21 threads
with 7 cores per executor, we expect limited IO to HDFS (maxes out at ~5 cores)
effective throughput ~= 3 executors x 5 threads = 15 threads
Run 3: 4 GB, 2 cores, 12 executors
12 executors x 2 threads = 24 threads
2 cores per executor, so hdfs throughput is ok
effective throughput ~= 12 executors x 2 threads = 24 threads
If the job is 100% limited by concurrency (the number of threads), we would expect the runtime to be perfectly inversely correlated with the number of threads.
ratio_num_threads = nthread_job1 / nthread_job3 = 15/24 = 0.625
inv_ratio_runtime = 1/(duration_job1 / duration_job3) = 1/(50/31) = 31/50 = 0.62
So ratio_num_threads ~= inv_ratio_runtime, and it looks like we are network limited.
This same effect explains the difference between Run 1 and Run 2.
Run 2: 19 GB, 4 cores, 3 executors
3 executors x 4 threads = 12 threads
with 4 cores per executor, ok IO to HDFS
effective throughput ~= 3 executors x 4 threads = 12 threads
Comparing the number of effective threads and the runtime:
ratio_num_threads = nthread_job2 / nthread_job1 = 12/15 = 0.8
inv_ratio_runtime = 1/(duration_job2 / duration_job1) = 1/(55/50) = 50/55 = 0.91
It's not as perfect as the last comparison, but we still see a similar drop in performance when we lose threads.
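For anyone who wants to redo these comparisons, here is a rough sketch that uses the exact run times from the question (in seconds) instead of rounded minutes. The cap of 5 usable threads per executor is the rule of thumb from the quote above, not a measured value, and the names are made up.

```scala
object EffectiveThreads {
  final case class Run(name: String, executors: Int, coresPerExecutor: Int, seconds: Int) {
    // Threads that are assumed to do useful HDFS I/O: capped per executor.
    def effectiveThreads(cap: Int = 5): Int = executors * math.min(coresPerExecutor, cap)
  }

  val run1 = Run("run 1", 3, 7, 50 * 60 + 15)
  val run2 = Run("run 2", 3, 4, 55 * 60 + 48)
  val run3 = Run("run 3", 12, 2, 31 * 60 + 23)

  // Returns (effective-thread ratio a/b, inverse runtime ratio b/a).
  // If threads were the only limit, the two numbers would be equal.
  def compare(a: Run, b: Run): (Double, Double) =
    (a.effectiveThreads().toDouble / b.effectiveThreads(), b.seconds.toDouble / a.seconds)

  def main(args: Array[String]): Unit = {
    val (t13, r13) = compare(run1, run3)
    println(f"run 1 vs run 3: threads $t13%.3f, inverse runtime $r13%.3f")
    val (t21, r21) = compare(run2, run1)
    println(f"run 2 vs run 1: threads $t21%.3f, inverse runtime $r21%.3f")
  }
}
```

With the exact timings the first pair still agrees closely, while the second pair agrees only roughly, which matches the discussion above.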
Now for the last bit: why is it the case that we get better performance with more threads, esp. more threads than the number of CPUs?
A good explanation of the difference between parallelism (what we get by dividing up data onto multiple CPUs) and concurrency (what we get when we use multiple threads to do work on a single CPU) is provided in this great post by Rob Pike: Concurrency is not parallelism.
The short explanation is that if a Spark job is interacting with a file system or network the CPU spends a lot of time waiting on communication with those interfaces and not spending a lot of time actually "doing work". By giving those CPUs more than 1 task to work on at a time, they are spending less time waiting and more time working, and you see better performance.

Since you run your Spark app on top of HDFS, this observation from Sandy Ryza applies:
I’ve noticed that the HDFS client has trouble with tons of concurrent
threads. A rough guess is that at most five tasks per executor can
achieve full write throughput, so it’s good to keep the number of
cores per executor below that number.
So I believe that your first configuration is slower than the third one because of bad HDFS I/O throughput.

From the excellent resources available at RStudio's Sparklyr package page:
SPARK DEFINITIONS:
It may be useful to provide some simple definitions
for the Spark nomenclature:
Node: A server
Worker Node: A server that is part of the cluster and is available to
run Spark jobs
Master Node: The server that coordinates the Worker nodes.
Executor: A sort of virtual machine inside a node. One Node can have
multiple Executors.
Driver Node: The Node that initiates the Spark session. Typically,
this will be the server where sparklyr is located.
Driver (Executor): The Driver Node will also show up in the Executor
list.

I haven't played with these settings myself, so this is just speculation, but if we think about this issue in terms of normal cores and threads in a distributed system, then in your cluster you can use up to 12 cores (4 * 3 machines) and 24 threads (8 * 3 machines). In your first two examples you are giving your job a fair number of cores (potential computation space), but the number of threads (jobs) to run on those cores is so limited that you aren't able to use much of the processing power allocated, and thus the job is slower even though more computation resources are allocated.
You mention that your concern was the shuffle step. While it is nice to limit the overhead in the shuffle step, it is generally much more important to utilize the parallelism of the cluster. Think about the extreme case: a single-threaded program with zero shuffle.

I think one of the major reasons is locality. Your input file is 165 GB, so its blocks are certainly distributed over multiple DataNodes, and more executors can avoid network copies.
Try setting the number of executors equal to the block count; I think it could be faster.

Spark dynamic allocation gives flexibility by allocating resources dynamically. You can specify the minimum and maximum number of executors, as well as the number of executors to launch when the application starts.
Read more about it here:
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
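As a minimal sketch of what such a setup could look like, the snippet below collects the relevant configuration keys and renders them as spark-submit --conf flags. The keys are standard Spark settings; the values (2, 20, 4) are hypothetical placeholders to be tuned for the cluster at hand, and the helper names are made up.

```scala
object DynamicAllocationConf {
  // Hypothetical bounds; adjust them to the cluster and workload.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "20",
    "spark.dynamicAllocation.initialExecutors" -> "4",
    // On YARN, dynamic allocation is typically used with the external shuffle service.
    "spark.shuffle.service.enabled" -> "true"
  )

  // Renders the settings as a string of "--conf key=value" flags.
  def toSubmitArgs(conf: Seq[(String, String)]): String =
    conf.map { case (k, v) => s"--conf $k=$v" }.mkString(" ")

  def main(args: Array[String]): Unit =
    println(toSubmitArgs(settings))
}
```

With dynamic allocation enabled, --num-executors is usually left out, since the executor count then varies between the configured minimum and maximum.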

I think there is a small issue in the first two configurations. The idea behind threading is that if a core is idle, that core can be used to process data, so the memory is not fully utilized in the first two cases. If you want to benchmark this example, choose machines that have more than 10 cores each, and then run the benchmark.
But don't give more than 5 cores per executor, or there will be a bottleneck in I/O performance.
So the best machines for this benchmark might be data nodes with 10 cores each.
Data node machine spec:
CPU: Core i7-4790 (# of cores: 10, # of threads: 20)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)

In configuration (2) you're reducing the number of parallel tasks, so I believe your comparison isn't fair.
Set --num-executors to at least 5.
That way you will have 20 tasks running, compared to 21 tasks in configuration (1).
Then the comparison will be fair, in my view.
Also, please calculate the executor memory accordingly.
