I am working on Hadoop performance modeling. Hadoop has 200+ parameters, so setting them all manually is not practical. As a result, we often run Hadoop jobs with default parameter values (for example, the defaults for io.sort.mb, io.sort.record.percent, mapred.output.compress, etc.), but the defaults give suboptimal performance. Herodotos Herodotou has done some work in this area (http://www.cs.duke.edu/starfish/files/vldb11-job-optimization.pdf) to improve performance, but I have the following doubts about that work:
They fix the parameter values at job start time (according to a proportionality assumption about the data) for all phases (read, map, collect, etc.) of the MapReduce job. Can we instead set different values of these parameters for each phase at run time, according to the run-time environment (cluster configuration, underlying file system, etc.), by changing the Hadoop configuration files of a particular node, to get optimal performance from that node?
They use white-box models for the Hadoop core (http://arxiv.org/pdf/1106.0940.pdf). Are these still applicable to current Hadoop?

No, you cannot dynamically change MapReduce parameters per job per node.
Configuring a set of nodes:
What you can do instead is change the configuration parameters per node statically in the configuration files (generally located in /etc/hadoop/conf), so that you get the most out of a cluster with mixed hardware configurations.
Example: Assume you have 20 worker nodes with different hardware configurations like:
10 with configuration of 128GB RAM, 24 Cores
10 with configuration of 64GB RAM, 12 Cores
In that case you would want to configure each group of identical servers to get the most out of its hardware. For example, you would run more child tasks (mappers and reducers) on the worker nodes with more RAM and cores:
Nodes with 128GB RAM, 24 cores => 36 worker tasks (mappers + reducers), with a JVM heap of around 3GB per worker task.
Nodes with 64GB RAM, 12 cores => 18 worker tasks (mappers + reducers), with a JVM heap of around 3GB per worker task.
So you would configure each set of nodes with the parameters appropriate to it.
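The sizing above can be checked with a little arithmetic. This is only a sketch: the function name worker_slots is hypothetical, and the ratio of 1.5 tasks per core is an assumption inferred from the 36-tasks-on-24-cores figure in the example, not a Hadoop rule.

```python
def worker_slots(cores, ram_gb, tasks_per_core=1.5, heap_gb=3):
    """Return (task slots, total task heap in GB) for one worker node.

    tasks_per_core=1.5 is an assumption taken from the example
    (36 tasks on 24 cores); heap_gb=3 is the per-task JVM heap.
    """
    slots = int(cores * tasks_per_core)
    total_heap_gb = slots * heap_gb
    if total_heap_gb > ram_gb:
        raise ValueError("task heaps would exceed physical RAM")
    return slots, total_heap_gb


big = worker_slots(cores=24, ram_gb=128)   # larger node group
small = worker_slots(cores=12, ram_gb=64)  # smaller node group
print(big, small)
```

In both groups the task heaps stay below physical RAM, which leaves some memory for the OS and the Hadoop daemons.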
Using ToolRunner to pass configuration parameters dynamically to a Job:
You can also change MapReduce job parameters dynamically per job, but these parameters apply to the job across the entire cluster, not just to a set of nodes. This works provided your MapReduce job driver implements the Tool interface and is launched through ToolRunner.
ToolRunner parses the generic Hadoop command-line arguments for you, so you can pass MapReduce configuration parameters using -D property.name=property.value.
You can pass almost all Hadoop parameters dynamically to a job this way.
Here is an example terasort job that passes many commonly tuned parameters dynamically per job:
hadoop jar hadoop-mapreduce-examples.jar terasort \
-Ddfs.replication=1 -Dmapreduce.task.io.sort.mb=500 \
-Dmapreduce.map.sort.spill.percent=0.9 \
-Dmapreduce.reduce.shuffle.parallelcopies=10 \
-Dmapreduce.reduce.shuffle.memory.limit.percent=0.1 \
-Dmapreduce.reduce.shuffle.input.buffer.percent=0.95 \
-Dmapreduce.reduce.input.buffer.percent=0.95 \
-Dmapreduce.reduce.shuffle.merge.percent=0.95 \
-Dmapreduce.reduce.merge.inmem.threshold=0 \
-Dmapreduce.job.speculative.speculativecap=0.05 \
-Dmapreduce.map.speculative=false \
-Dmapreduce.reduce.speculative=false \
-Dmapreduce.job.jvm.numtasks=-1 \
-Dmapreduce.job.reduces=84 \
-Dmapreduce.task.io.sort.factor=100 \
-Dmapreduce.map.output.compress=true \
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapreduce.job.reduce.slowstart.completedmaps=0.4 \
-Dmapreduce.reduce.merge.memtomem.enabled=false \
-Dmapreduce.reduce.memory.totalbytes=12348030976 \
-Dmapreduce.reduce.memory.mb=12288 \
-Dmapreduce.reduce.java.opts="-Xms11776m -Xmx11776m \
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
-XX:+CMSIncrementalPacing -XX:ParallelGCThreads=4" \
-Dmapreduce.map.memory.mb=4096 \
-Dmapreduce.map.java.opts="-Xmx1356m" \
/terasort-input /terasort-output
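If you generate such command lines from a script, something like the following may help keep them well-formed. It is a minimal sketch: hadoop_jar_command is a hypothetical helper, not part of Hadoop, and the property values are just a subset of the ones above. The one rule it encodes is that generic options such as -D go before the program's own arguments.

```python
import shlex


def hadoop_jar_command(jar, program, props, args):
    """Build a `hadoop jar` argument list with -D generic options."""
    cmd = ["hadoop", "jar", jar, program]
    # Generic options (-D...) must precede the program's own arguments.
    cmd += [f"-D{key}={value}" for key, value in props.items()]
    cmd += list(args)
    return cmd


props = {
    "mapreduce.job.reduces": 84,
    "mapreduce.map.output.compress": "true",
    "mapreduce.reduce.java.opts": "-Xms11776m -Xmx11776m",
}
cmd = hadoop_jar_command(
    "hadoop-mapreduce-examples.jar", "terasort", props,
    ["/terasort-input", "/terasort-output"],
)
# shlex.join quotes values containing spaces, such as the JVM options.
print(shlex.join(cmd))
```

Keeping the arguments as a list (and only joining for display) avoids the quoting mistakes that creep in when JVM options contain spaces.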


Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and querying 14 days works fine as well. But when I query 15 days, the job keeps running forever (hangs) and the task count stops updating.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximum resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same code with 14 days I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema(schema).json(n)
  df = df.union(tempDF)
}
But can someone explain why this works when the earlier version did not?
UPDATE : After setting mapreduce.input.fileinputformat.split.minsize to 256 GB it works fine now.
Dynamic allocation and maximize resource allocation are two different settings; one is disabled when the other is active. With maximize resource allocation on EMR, one executor is launched per node, and all of the node's cores and memory are allocated to that executor.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of tuning these configurations.
Cluster memory: a minimum of 2X the data you are dealing with.
Now, assuming 51 nodes is what you require, try the following:
r3.4xlarge has 16 CPUs, so you can put almost all of them to use, leaving one for the OS and other processes.
Set the number of executors to 150; this allocates about 3 executors per node.
Set the number of cores per executor to 5 (3 executors per node).
Set the executor memory to roughly total host memory / 3, about 35G.
Control the parallelism (default partitions): set this to the total number of cores you have, ~800.
Adjust the shuffle partitions: make this twice the number of cores, 1600.
These configurations have been working like a charm for me. You can monitor the resource utilization in the Spark UI.
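The numbers in that list come from simple arithmetic, sketched below. The function executor_layout is hypothetical, and the 122 GiB of memory per r3.4xlarge is my assumption from the instance specification, not something stated above. The exact results (153 executors, 765 cores, 1530 shuffle partitions) are what the rounded figures of 150, ~800 and 1600 approximate.

```python
def executor_layout(nodes, vcpus_per_node, mem_gib_per_node,
                    cores_per_executor=5, reserved_vcpus=1):
    """Derive executor counts and partition settings for a cluster."""
    usable_vcpus = vcpus_per_node - reserved_vcpus   # leave one for the OS
    per_node = usable_vcpus // cores_per_executor    # executors per node
    executors = nodes * per_node
    total_cores = executors * cores_per_executor
    return {
        "executors_per_node": per_node,
        "num_executors": executors,
        "total_cores": total_cores,
        # Upper bound per executor, before YARN overhead and OS headroom.
        "memory_share_gib": mem_gib_per_node / per_node,
        "default_parallelism": total_cores,
        "shuffle_partitions": 2 * total_cores,
    }


layout = executor_layout(nodes=51, vcpus_per_node=16, mem_gib_per_node=122)
for key, value in layout.items():
    print(key, value)
```

The raw memory share per executor is about 40 GiB; setting the executor memory to roughly 35G leaves room for the per-executor memory overhead and for the operating system.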
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to turn off dynamic resource allocation, let the job run, and see whether it still hangs. (Please note that the Spark job can then consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running.) If it doesn't hang, the problem lies in the resource allocation: start hardcoding the resources and keep increasing them until you find the best allocation you can use.

How to make Hadoop/EMR use more containers per node

I'm in the process of moving our application from Hadoop 1.0.3 to 2.7, on EMR v5.1.0. I got it running, but I'm still having problems getting my head around the resource-allocation system in Yarn. With the default settings provided by EMR, Hadoop only allocates one container per node, even if I select a larger instance type for the nodes. This is a problem, since we'll now be using twice as many nodes to do the same amount of work.
I want to squeeze more containers into one node, and ensure that we're using all the available resources. I assume that I shouldn't touch yarn.nodemanager.resource.memory-mb or yarn.nodemanager.resource.cpu-vcores, since those are set by EMR to reflect the actual available resources. Which settings do I have to change?
Your container sizes are defined by setting the memory (the default criterion for a container) and the vcores.
All of the following criteria must be satisfied (they are per container, except for yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb, which are per NodeManager and hence per DataNode):
1 <= yarn.scheduler.minimum-allocation-vcores <= yarn.scheduler.maximum-allocation-vcores
yarn.scheduler.maximum-allocation-vcores <= yarn.nodemanager.resource.cpu-vcores
yarn.scheduler.increment-allocation-vcores = 1
1024 <= yarn.scheduler.minimum-allocation-mb <= yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb
yarn.scheduler.increment-allocation-mb = 512
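A quick way to sanity-check a configuration against the inequalities above, and to see how many containers of a given size fit on one node, is sketched here. The helper names and all numeric values are hypothetical; only the property names are real YARN settings. The two increment settings are fixed values and are not checked.

```python
def check_yarn_limits(cfg):
    """Return the violated constraints (empty list if cfg is consistent)."""
    rules = [
        ("1 <= min vcores <= max vcores",
         1 <= cfg["yarn.scheduler.minimum-allocation-vcores"]
         <= cfg["yarn.scheduler.maximum-allocation-vcores"]),
        ("max vcores <= node vcores",
         cfg["yarn.scheduler.maximum-allocation-vcores"]
         <= cfg["yarn.nodemanager.resource.cpu-vcores"]),
        ("1024 <= min mb <= max mb",
         1024 <= cfg["yarn.scheduler.minimum-allocation-mb"]
         <= cfg["yarn.scheduler.maximum-allocation-mb"]),
        ("max mb <= node mb",
         cfg["yarn.scheduler.maximum-allocation-mb"]
         <= cfg["yarn.nodemanager.resource.memory-mb"]),
    ]
    return [name for name, ok in rules if not ok]


def containers_per_node(cfg, container_mb, container_vcores=1):
    """Return how many containers fit by memory and by vcores."""
    by_memory = cfg["yarn.nodemanager.resource.memory-mb"] // container_mb
    by_vcores = cfg["yarn.nodemanager.resource.cpu-vcores"] // container_vcores
    return by_memory, by_vcores


# Hypothetical node: 24 GB and 8 vcores offered to YARN.
cfg = {
    "yarn.nodemanager.resource.memory-mb": 24576,
    "yarn.nodemanager.resource.cpu-vcores": 8,
    "yarn.scheduler.minimum-allocation-mb": 1024,
    "yarn.scheduler.maximum-allocation-mb": 8192,
    "yarn.scheduler.minimum-allocation-vcores": 1,
    "yarn.scheduler.maximum-allocation-vcores": 8,
}
violations = check_yarn_limits(cfg)
fit = containers_per_node(cfg, container_mb=2048)
print(violations, fit)
```

The takeaway for the question is that the per-node totals can stay as EMR set them; what determines the number of containers per node is how large each requested container is relative to those totals.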
You can also see this helpful link https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_yarn_tuning.html

Why are locality levels all ANY in a Spark wordcount application running on HDFS?

I ran a Spark cluster of 12 nodes (8G memory and 8 cores for each) for some tests.
I'm trying to figure out why data localities of a simple wordcount app in "map" stage are all "Any". The 14GB dataset is stored in HDFS.
I ran into the same problem, and in my case it was a configuration problem: I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should be something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same names should appear in the executors' addresses in the UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using the public hostname for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
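To spot such a mismatch without eyeballing the two lists, you could compare them with a small script. This is a sketch only: the function names, the sample topology text and the executor hostnames are all made up for illustration, and the parsing assumes the printTopology line format shown above.

```python
import re

# Matches lines such as "172.31.0.10:50010 (some-host.internal)".
TOPOLOGY_LINE = re.compile(r"^\s*\S+:\d+\s+\(([^)]+)\)\s*$")


def hdfs_hostnames(topology_text):
    """Extract the hostnames in parentheses from printTopology output."""
    hosts = set()
    for line in topology_text.splitlines():
        match = TOPOLOGY_LINE.match(line)
        if match:
            hosts.add(match.group(1))
    return hosts


def unknown_to_hdfs(executor_hosts, topology_text):
    """Return executor hostnames that HDFS does not list as DataNodes."""
    known = hdfs_hostnames(topology_text)
    return sorted(host for host in executor_hosts if host not in known)


# Hypothetical sample data: one executor registered under a public name.
sample_topology = """Rack: /default-rack
   172.31.0.10:50010 (ip-172-31-0-10.eu-central-1.compute.internal)
   172.31.0.11:50010 (ip-172-31-0-11.eu-central-1.compute.internal)
"""
sample_executors = [
    "ip-172-31-0-10.eu-central-1.compute.internal",
    "ec2-3-120-0-11.eu-central-1.compute.amazonaws.com",
]
mismatched = unknown_to_hdfs(sample_executors, sample_topology)
print(mismatched)
```

Any hostname this reports is an executor whose name does not match the name HDFS reports for a DataNode, which is the kind of mismatch described above.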
I encountered the same problem today. This is my situation:
My cluster has 9 workers (each runs one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set it below 9, such as --total-executor-cores 7, the locality level becomes ANY and the total time is 10X that of the NODE_LOCAL run. You can give it a try.
I'm running my cluster on EC2s, and I fixed my problem by adding the following to spark-env.sh on the name node
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes
SPARK_LOCAL_HOSTNAME=<data node hostname>
Don't start the slaves with start-all.sh. You should start every slave individually:
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>

How to tune Spark application with hadoop custom input format

My Spark application processes files (average size 20 MB) with a custom Hadoop input format and stores the results in HDFS.
The following is the code snippet:
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
    .newAPIHadoopFile(input, CustomInputFormat.class, Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
    .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
        // my logic goes here
        // few more transformations
This application creates one task/partition per file, processes it, and stores the corresponding part file in HDFS.
I.e., for 10,000 input files, 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both mapPartitions and map operations on baseRDD create one task per file.
The SO question "How to set the number of partitions for newAPIHadoopFile?" suggests setting
conf.setInt("mapred.max.split.size", 4); to configure the number of partitions.
But when this parameter is set, the CPU is utilized at maximum and none of the stages start, even after a long time.
If I don't set this parameter, the application completes successfully as described above.
How do I set the number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with mapred.max.split.size option?
In my use case the files are small, so changing the split-size options is irrelevant here.
For more information, see this SO question: Behavior of the parameter "mapred.min.split.size" in HDFS
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the resulting operation to fewer partitions, especially if your files are small.
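Some rough arithmetic may help explain both observations. It is a sketch under stated assumptions: that the custom input format honors the maximum split size the way FileInputFormat does (the value is in bytes), that files are splittable, and that the small slop factor Hadoop applies can be ignored. The 128 MB target partition size and both function names are my own choices for illustration.

```python
import math

MB = 1024 * 1024


def splits_per_file(file_bytes, max_split_bytes):
    """Approximate number of input splits for one splittable file."""
    return math.ceil(file_bytes / max_split_bytes)


def sane_partitions(total_bytes, target_partition_bytes=128 * MB):
    """Partition count that gives roughly target-sized partitions."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


files = 10_000
file_bytes = 20 * MB

# A max split size of 4 means 4 bytes, not 4 splits.
tiny_splits = files * splits_per_file(file_bytes, 4)

# A candidate argument for baseRDD.repartition(...).
partitions = sane_partitions(files * file_bytes)
print(tiny_splits, partitions)
```

Under these assumptions, a 4-byte maximum split size asks for tens of billions of splits, which would be consistent with the CPU being busy while no stage starts. A partition count in the low thousands is one possible reading of "a sane amount" for this data size.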

How concurrent # mappers and # reducers are calculated in Hadoop 2 + YARN?

I've searched for some time and found that a MapReduce cluster using Hadoop 2 + YARN has the following number of concurrent maps and reduces per node:
Concurrent Maps # = yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Concurrent Reduces # = yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb
However, I've set up a cluster with 10 machines, with these configurations:
'yarn_site' => {
  'yarn.nodemanager.resource.cpu-vcores' => '32',
  'yarn.nodemanager.resource.memory-mb' => '16793',
  'yarn.scheduler.minimum-allocation-mb' => '532',
  'yarn.nodemanager.vmem-pmem-ratio' => '5',
  'yarn.nodemanager.pmem-check-enabled' => 'false'
},
'mapred_site' => {
  'mapreduce.map.memory.mb' => '4669',
  'mapreduce.reduce.memory.mb' => '4915',
  'mapreduce.map.java.opts' => '-Xmx4669m',
  'mapreduce.reduce.java.opts' => '-Xmx4915m'
}
But after the cluster is set up, Hadoop allows 6 containers for the entire cluster. What am I forgetting? What am I doing wrong?
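For reference, here is what the two formulas predict for this configuration. It is a sketch: concurrent_tasks is a hypothetical helper, and the optional rounding of each request up to a multiple of yarn.scheduler.minimum-allocation-mb is an assumption about how the scheduler normalizes requests, included only to show that it does not change the result here.

```python
import math


def concurrent_tasks(node_mb, task_mb, min_alloc_mb=None):
    """Tasks that fit on one node by memory, per the formulas above."""
    if min_alloc_mb:
        # Assumption: requests are rounded up to a multiple of the minimum.
        task_mb = math.ceil(task_mb / min_alloc_mb) * min_alloc_mb
    return node_mb // task_mb


NODE_MB, MIN_ALLOC_MB, NODES = 16793, 532, 10

maps_per_node = concurrent_tasks(NODE_MB, 4669)
reduces_per_node = concurrent_tasks(NODE_MB, 4915)
maps_rounded = concurrent_tasks(NODE_MB, 4669, MIN_ALLOC_MB)
reduces_rounded = concurrent_tasks(NODE_MB, 4915, MIN_ALLOC_MB)
cluster_maps = NODES * maps_per_node
print(maps_per_node, reduces_per_node, cluster_maps)
```

By memory alone, the formulas predict about 3 concurrent maps (or reduces) per node, roughly 30 across the 10 machines, so the 6 containers observed are well below what these settings would suggest.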
Not sure if this is the same issue you're having, but I had a similar issue, where I launched an EMR cluster of 20 nodes of c3.8xlarge in the core instance group and similarly found the cluster to be severely underutilized when running a job (only 30 mappers were running concurrently across the entire cluster, even though the memory/vcore configs in YARN and MapReduce for my particular cluster show that over 500 concurrent containers can run). I was using Hadoop 2.4.0 on AMI 3.5.0.
It turns out that the instance group matters for some reason. When I relaunched the cluster with 20 nodes in the task instance group and only 1 core node, that made a HUGE difference. I got over 500 mappers running concurrently (in my case, the mappers were mostly downloading files from S3 and as such didn't need HDFS).
I'm not sure why the different instance group type makes a difference, given that both can equally run tasks, but clearly they are being treated differently.
I thought I'd mention it here, given that I ran into this issue myself and using a different group type helped.
