Spark Program running very slow on cluster - hadoop

I am trying to run my PySpark job on a cluster with 2 worker nodes and 1 master (all with 16 GB RAM). I submit the job with the command below.
spark-submit --master yarn --deploy-mode cluster --name "Pyspark"
--num-executors 40 --executor-memory 2g CD.py
However, the job runs very slowly: it takes almost 1 hour to parse 8.2 GB of data.
Then I tried changing my YARN configuration. I changed the following properties:
yarn.scheduler.increment-allocation-mb = 2 GiB
yarn.scheduler.minimum-allocation-mb = 2 GiB
yarn.scheduler.maximum-allocation-mb = 2 GiB
After these changes my Spark job still runs very slowly, taking more than 1 hour to parse the 8.2 GB of files.

Could you please try with the configuration below:
spark.executor.memory 5g
spark.executor.cores 5
spark.executor.instances 3
spark.driver.cores 2
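For reference, a sketch of the equivalent spark-submit call, reusing CD.py and YARN cluster mode from the question (the flags simply mirror the properties above):
spark-submit --master yarn --deploy-mode cluster --name "Pyspark" --num-executors 3 --executor-cores 5 --executor-memory 5g --driver-cores 2 CD.py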

Related

How to utilize all vcores in an r3.2xlarge 8-node cluster

My Spark job is only using 32 vcores out of the 127 available in the cluster.
My spark-submit command is:
spark-submit --executor-memory 12G --num-executors 32 --executor-cores 3 --conf spark.executor.memoryOverhead=1.5g
How do I tweak the spark-submit parameters to utilize all the resources available in the cluster?
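For context, the resources those flags request work out to:
32 executors x 3 cores = 96 vcores
32 executors x (12 GB + 1.5 GB overhead) = 432 GB of memory
Whichever of memory or vcores the queue exhausts first caps how many executors actually start.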

Hadoop multinode cluster too slow. How do I increase the speed of data processing?

I have a 6-node cluster: 5 DNs and 1 NN, all with 32 GB RAM. All slaves have 8.7 TB HDDs; the DN has a 1.1 TB HDD. Here are links to my core-site.xml, hdfs-site.xml and yarn-site.xml.
After running an MR job, I checked the RAM usage, shown below:
Namenode
free -g
total used free shared buff/cache available
Mem: 31 7 15 0 8 22
Swap: 31 0 31
Datanodes:
Slave1:
free -g
total used free shared buff/cache available
Mem: 31 6 6 0 18 24
Swap: 31 3 28
Slave2:
total used free shared buff/cache available
Mem: 31 2 4 0 24 28
Swap: 31 1 30
Likewise, other slaves have similar RAM usage. Even if a single job is submitted, the other submitted jobs enter into ACCEPTED state and wait for the first job to finish and then they start.
Here is the output of ps command of the JAR that I submnitted to execute the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data, and RAM usage is still around 8 GB, with 22 GB of RAM free. Here is my job submission command:
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information:
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there additional memory parameters we can pass when submitting the job to get more efficient memory usage?
I believe you can edit mapred-site.xml (the site file that overrides mapred-default.xml). The parameters you are looking for are:
mapreduce.job.running.map.limit
mapreduce.job.running.reduce.limit
A value of 0 (probably what it is set to at the moment) means unlimited.
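A minimal sketch of the corresponding entries, set as properties in mapred-site.xml (the limit values here are purely illustrative, not recommendations):
mapreduce.job.running.map.limit = 20
mapreduce.job.running.reduce.limit = 10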
Looking at your memory, 32 GB per machine seems too small. What CPUs/cores do you have? I would expect at least a quad CPU / 16 cores per machine.
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have at best 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB it is far too large. Drop it to 1 or 2 GB.
Remember that an HDFS block is what each mapper typically consumes, so 1-2 GB of memory for 128 MB of data sounds much more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs roughly 10x faster than 18 vcores.
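The arithmetic behind those figures, assuming roughly 30 GB of each node's 32 GB is handed to YARN across the 6 nodes: at a 10,240 MB minimum allocation that is about 3 containers per node, i.e. ~18 in total; at a 1 GB minimum allocation it is about 30 per node, i.e. ~180.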
To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up: for Tez, divide RAM by cores to get the maximum Tez container size. In my case that is 128/32 = 4 GB, and the Tez and YARN memory settings follow from that figure.
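As an illustration only (the property names are standard YARN and Tez/Hive ones; the values simply follow the 4 GB figure above and are not the original poster's actual settings):
yarn.nodemanager.resource.memory-mb = 114688 (112 GB, leaving headroom for the OS and daemons)
yarn.scheduler.maximum-allocation-mb = 114688
yarn.scheduler.minimum-allocation-mb = 4096
tez.am.resource.memory.mb = 4096
hive.tez.container.size = 4096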

spark-submit in local mode - configuration

I am doing a spark-submit using --master local on my laptop (Spark 1.6.1) to load data into Hive tables. The laptop has 8 GB RAM and 4 cores. I have not set any properties manually; I am just using the defaults.
When I load 50k records, the job finishes successfully. But when I try to load 200k records, I get a "GC overhead limit exceeded" error.
In --master local mode, are there properties for job memory or heap memory that could be set manually?
Try increasing --driver-memory and --executor-memory; the default value is 1g for both.
The command should look like this:
spark-submit --master local --driver-memory 2g --executor-memory 2g --class classpath jarfile
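One note on local mode: the executor runs inside the driver JVM, so --driver-memory is the value that actually sizes the heap. If you would rather not pass it on every run, the same setting can live in conf/spark-defaults.conf, for example (value illustrative):
spark.driver.memory 4g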

Spark on YARN: execute driver without worker

Running Spark on YARN, cluster mode.
3 data nodes with YARN
YARN => 32 vCores, 32 GB RAM
I am submitting the Spark program like this:
spark-submit \
--class com.blablacar.insights.etl.SparkETL \
--name ${JOB_NAME} \
--master yarn \
--num-executors 1 \
--deploy-mode cluster \
--driver-memory 512m \
--driver-cores 1 \
--executor-memory 2g \
--executor-cores 20 \
toto.jar json
I can see 2 jobs running fine on 2 nodes, but I can also see 2 other jobs with just a driver container!
Is it possible to not run the driver if there are no resources for the workers?
Actually, there is a setting that limits the resources given to Application Masters (in the case of Spark, that is the driver):
yarn.scheduler.capacity.maximum-am-resource-percent
From http://maprdocs.mapr.com/home/AdministratorGuide/Hadoop2.xCapacityScheduler-RunningPendingApps.html:
Maximum percent of resources in the cluster that can be used to run
application masters - controls the number of concurrent active
applications.
This way, YARN will not let Spark drivers take all the resources, and will keep resources for the workers. Hooray!
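For the CapacityScheduler this property lives in capacity-scheduler.xml; its default is 0.1, i.e. at most 10% of the queue's resources may go to Application Masters. The value below is only an illustration of lowering that share further:
yarn.scheduler.capacity.maximum-am-resource-percent = 0.05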

h2o starting on YARN not working

When I start H2O on a CDH cluster I get the following error. I downloaded everything from the website and followed the tutorial. The command I ran was:
hadoop jar h2odriver.jar -nodes 2 -mapperXmx 1g -output hdfsOutputDirName
The output shows that the containers are not being used. It is not clear which Hadoop settings control this; I have given memory in all the settings. It is the 0.0 for memory that doesn't make sense, and why are the containers not using memory? Is the cluster even running now?
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://data-node-3:8042 Rack: /default, RUNNING, 1 containers used, 1.0 / 6.0 GB used, 1 / 4 vcores used
Node: http://data-node-1:8042 Rack: /default, RUNNING, 0 containers used, 0.0 / 6.0 GB used, 0 / 4 vcores used
Node: http://data-node-2:8042 Rack: /default, RUNNING, 0 containers used, 0.0 / 6.0 GB used, 0 / 4 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (2.2 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (2) exceeds queue available virtual cores capacity (0)
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1462681033282_0008'
You should set up your default queue to have enough resources available to run a 2-node cluster.
See the warnings:
WARNING: Job memory request (2.2 GB) exceeds queue available memory capacity (0.0 GB)
You ask for 1 GB per node (plus overhead), but there are no resources available in the YARN queue.
WARNING: Job virtual cores request (2) exceeds queue available virtual cores capacity (0)
You ask for 2 virtual cores, but no cores are available in your default queue.
Please check the YARN documentation, for example the Capacity Scheduler setup and maximum available resources:
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
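As a sketch of the kind of change that documentation describes (standard CapacityScheduler properties in capacity-scheduler.xml; the values are illustrative, not taken from this cluster), giving the default queue the full cluster would look like:
yarn.scheduler.capacity.root.default.capacity = 100
yarn.scheduler.capacity.root.default.maximum-capacity = 100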
I made the following changes in the Cloudera Manager YARN configuration:
Setting Value
yarn.scheduler.maximum-allocation-vcores 8
yarn.nodemanager.resource.cpu-vcores 4
yarn.scheduler.maximum-allocation-mb 16 GB
