Dividing tasks among Spark workers - hadoop

I am running my program on a Spark cluster, but when I look at the UI while the job is running, I see that only one worker does most of the tasks. My cluster has one master and 4 workers, where the master is also a worker.
I want the job to complete as quickly as possible, and I believe that if the tasks were divided equally among the workers, it would finish faster.
Is there any way I can customize this?
System.setProperty("spark.default.parallelism", "20")
val sc = new SparkContext("spark://10.100.15.2:7077", "SimpleApp", "/home/madhura/spark",
  List("hdfs://master:54310/simple-project_2.10-1.0.jar"))
val dRDD = sc.textFile("hdfs://master:54310/in*", 10)
val keyval = dRDD.coalesce(100, true).mapPartitionsWithIndex { (ind, iter) =>
  iter.map(x => process(ind, x.trim().split(' ').map(_.toDouble), q, m, r))
}
I tried this but it did not help.
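For illustration, a minimal sketch of the same setup with the parallelism set on a SparkConf instead of via System.setProperty, and with repartition (equivalent to coalesce(n, shuffle = true)) forcing an even spread of partitions. process, q, m and r are the definitions from the question and are assumed to be in scope; this is a sketch, not a confirmed fix for the imbalance:
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: configure parallelism on the SparkConf and repartition the input
// so the 20 partitions can be scheduled across all four workers.
val conf = new SparkConf()
  .setAppName("SimpleApp")
  .setMaster("spark://10.100.15.2:7077")
  .set("spark.default.parallelism", "20")
val sc = new SparkContext(conf)
sc.addJar("hdfs://master:54310/simple-project_2.10-1.0.jar")

val dRDD = sc.textFile("hdfs://master:54310/in*", 20)   // request at least as many partitions as total cores
val keyval = dRDD.repartition(20).mapPartitionsWithIndex { (ind, iter) =>
  iter.map(x => process(ind, x.trim().split(' ').map(_.toDouble), q, m, r))
}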

Related

How to hold Jenkins multiJob execution until chosen nodes are free?

I have a question about Jenkins MultiJob possibilities:
Current state:
I have 8 Jenkins nodes for job execution, 2 Linux and 6 Windows.
I have a MultiJob set up, consisting of 3 subJobs.
MultiJob setting: it is restricted to run only on Linux nodes.
SubJob settings: n1 can run only on Win node1, n2 only on Win node2, n3 only on Win node3.
Desired state:
When I build the MultiJob, I need it to check and wait until Win nodes 1, 2 and 3 are free.
I need to execute subJobs 1, 2 and 3 at the same time.
This wouldn't be a problem if all nodes were free, but if at least one of those three nodes is running some other job, one subJob will start later than the other two.
Is there any way to set up a pre-build script, or some other mechanism, that runs the subJobs only when all three chosen nodes are free, or waits for them to become free?
Thanks a lot for all ideas :)
You can check the status of the build executors on a particular node as a pre-build action.
If a build executor is idle, no job is running on it; if it is busy, something is running on it.
A simple Groovy script can be used for this purpose.
import hudson.model.Node
import hudson.model.Slave
import jenkins.model.Jenkins

Jenkins jenkins = Jenkins.instance
def jenkinsNodes = jenkins.nodes

for (Node node in jenkinsNodes) {
    // Make sure the node is online
    if (!node.getComputer().isOffline()) {
        // Make sure the node has no busy executors
        if (node.getComputer().countBusy() == 0) {
            // ...put your logic here...
        }
    }
}
Thanks,
Subhadeep

Is preemption supported with Tez and the YARN FairShare scheduler?

We've recently switched our 10-node cluster from MapReduce to Tez and we have been experiencing resource-management issues since then. It seems that preemption does not work as expected:
a very resource-hungry job (job1) arrives and gets all free resources
a second job (job2) arrives and waits for resources to be freed by job1
job2 gets very little resource (around 5%) for a long time; its share keeps increasing very slowly but most of the time never reaches the fair share.
I'm assuming the preemption mechanism used by the YARN FairShare scheduler is not working as it should, and resources only get assigned to job2 when some job1 containers finish.
I've looked into the Tez docs and got the impression that Tez was developed with the Capacity Scheduler as its de facto scheduler, but I can't find any help for the FairShare scheduler.
Some configuration values that may help:
hive.server2.tez.default.queues=default
hive.server2.tez.initialize.default.sessions=false
hive.server2.tez.session.lifetime=162h
hive.server2.tez.session.lifetime.jitter=3h
hive.server2.tez.sessions.init.threads=16
hive.server2.tez.sessions.per.default.queue=10
hive.tez.auto.reducer.parallelism=false
hive.tez.bucket.pruning=false
hive.tez.bucket.pruning.compat=true
hive.tez.container.max.java.heap.fraction=0.8
hive.tez.container.size=-1
hive.tez.cpu.vcores=-1
hive.tez.dynamic.partition.pruning=true
hive.tez.dynamic.partition.pruning.max.data.size=104857600
hive.tez.dynamic.partition.pruning.max.event.size=1048576
hive.tez.enable.memory.manager=true
hive.tez.exec.inplace.progress=true
hive.tez.exec.print.summary=false
hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
hive.tez.input.generate.consistent.splits=true
hive.tez.log.level=INFO
hive.tez.max.partition.factor=2.0
hive.tez.min.partition.factor=0.25
hive.tez.smb.number.waves=0.5
hive.tez.task.scale.memory.reserve-fraction.min=0.3
hive.tez.task.scale.memory.reserve.fraction=-1.0
hive.tez.task.scale.memory.reserve.fraction.max=0.5
yarn.scheduler.fair.preemption=true
yarn.scheduler.fair.preemption.cluster-utilization-threshold=0.7
yarn.scheduler.maximum-allocation-mb=32768
yarn.scheduler.maximum-allocation-vcores=4
yarn.scheduler.minimum-allocation-mb=2048
yarn.scheduler.minimum-allocation-vcores=1
yarn.resourcemanager.scheduler.address=${yarn.resourcemanager.hostname}:8030
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.client.thread-count=50
yarn.resourcemanager.scheduler.monitor.enable=false
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy

Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and it also works for 14 days together. But when I query all 15 days, the job keeps running forever (hangs) and the task count does not update.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
  .map( start.plusDays )
  .map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same query for 14 days, I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  val tempDF = sqlSession.read.schema( schema ).json(n)
  df = df.union(tempDF)   // append each day's DataFrame to the running result
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it works fine now.
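For reference, a minimal sketch (my own illustration, not code from the post) of how that property can be set from the Spark side; the value is given in bytes, so 256 GB is spelled out as 256L * 1024 * 1024 * 1024:
// Set the minimum input split size on the underlying Hadoop configuration
// before reading; the property value is in bytes.
sqlSession.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024 * 1024)
sqlSession.sparkContext.textFile(files.mkString(",")).count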
Dynamic allocation and maximize resource allocation are two different settings; one is disabled when the other is active. With maximize resource allocation on EMR, one executor per node is launched, and it is given all of that node's cores and memory.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, start with the following rule of thumb and you will get the hang of tuning these configurations.
Cluster memory: at a minimum, 2x the data you are dealing with.
Now, assuming 51 nodes is what you require, try the following:
r3.4xlarge has 16 vCPUs, so you can put 15 of them to use, leaving one for the OS and other processes.
Set the number of executors to 150; this allocates 3 executors per node.
Set the number of cores per executor to 5 (3 executors per node x 5 cores = 15 cores).
Set the executor memory to roughly total host memory / 3 = 35G.
You have to control the parallelism (default partitions): set it to the total number of cores you have, ~800.
Adjust the shuffle partitions: make this twice the number of cores, i.e. 1600.
The above configuration has been working like a charm for me; a minimal sketch of it as Spark properties is shown below. You can monitor the resource utilization on the Spark UI.
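A sketch of those settings expressed as standard Spark configuration properties (the numbers are the rule-of-thumb values above, not something tuned against the poster's actual data; the application name is a placeholder):
import org.apache.spark.sql.SparkSession

// Rule-of-thumb sizing: 3 executors per node, 5 cores and ~35 GB each,
// parallelism ~ total cores, shuffle partitions ~ 2x total cores.
val spark = SparkSession.builder()
  .appName("S3QueryJob")                               // placeholder name
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "150")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "35g")
  .config("spark.default.parallelism", "800")
  .config("spark.sql.shuffle.partitions", "1600")
  .getOrCreate()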
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to not use dynamic resource allocation: let the job run and see whether it still hangs. (Note that a Spark job can consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running.) If it doesn't hang, that means you should play with the resource allocation: start hardcoding the resources and keep increasing them until you find the best allocation you can use.
The links below can help you understand resource allocation and how to optimize it.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Getting Spark streaming job durations in the program

How do I get, in my program (which runs the Spark streaming job), the time taken by each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, for each micro-batch RDD a job is run for the map and save operations.
I want to get the time taken by each streaming job. I can see the jobs in the UI on port 4040, but I want to get it in the Spark code itself.
Pardon me if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method, onBatchCompleted, that can give you the total time taken by a batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is the implementation class that you have to write by extending StreamingListener.
There are other methods available on the listener as well; you should explore them too.
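For illustration, a minimal sketch of such a listener (my own sketch; it assumes the ssc streaming context from the question is in scope and simply prints the timing information that BatchInfo exposes):
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class StatusListenerImpl extends StreamingListener {
  // Called once per completed micro-batch; BatchInfo carries the timing data.
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: " +
      s"scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time = ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"total delay = ${info.totalDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new StatusListenerImpl())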

How are the numbers of concurrent mappers and reducers calculated in Hadoop 2 + YARN?

I've searched for some time and found that a MapReduce cluster using Hadoop 2 + YARN has the following numbers of concurrent maps and reduces per node:
Concurrent Maps # = yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Concurrent Reduces # = yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb
However, I've set up a cluster with 10 machines, with these configurations:
'yarn_site' => {
'yarn.nodemanager.resource.cpu-vcores' => '32',
'yarn.nodemanager.resource.memory-mb' => '16793',
'yarn.scheduler.minimum-allocation-mb' => '532',
'yarn.nodemanager.vmem-pmem-ratio' => '5',
'yarn.nodemanager.pmem-check-enabled' => 'false'
},
'mapred_site' => {
'mapreduce.map.memory.mb' => '4669',
'mapreduce.reduce.memory.mb' => '4915',
'mapreduce.map.java.opts' => '-Xmx4669m',
'mapreduce.reduce.java.opts' => '-Xmx4915m'
}
By those formulas I expected roughly 16793 / 4669 ≈ 3 concurrent maps and 16793 / 4915 ≈ 3 concurrent reduces per node, i.e. around 30 concurrent containers across the 10 machines. But after the cluster is set up, Hadoop allows only 6 containers for the entire cluster. What am I forgetting? What am I doing wrong?
Not sure if this is the same issue you're having, but I had a similar one: I launched an EMR cluster of 20 c3.8xlarge nodes in the core instance group and similarly found the cluster to be severely underutilized when running a job (only 30 mappers were running concurrently across the entire cluster, even though the memory/vcore configuration in YARN and MapReduce for my particular cluster showed that over 500 concurrent containers could run). I was using Hadoop 2.4.0 on AMI 3.5.0.
It turns out that the instance group matters for some reason. When I relaunched the cluster with the 20 nodes in the task instance group and only 1 core node, it made a HUGE difference: I got 500+ mappers running concurrently (in my case the mappers were mostly downloading files from S3 and as such don't need HDFS).
I'm not sure why the instance group type makes a difference, given that both can equally run tasks, but clearly they are being treated differently.
I thought I'd mention it here, given that I ran into this issue myself and that using a different group type helped.
