Understand Nomad scheduler node selection - nomad

I have a 3-node test cluster and several jobs (simple config, no constraints, Java services). My problem is that every time I start a job, it is placed on the first node. If I increase the count to 2 and add a distinct hosts constraint, there are also allocations on the other nodes. But if I start 50 jobs with count = 1, there are 50 allocations on the first node and none on node2 or node3.
job "test" {
datacenters = ["dc1"]
type = "service"
group "test" {
count = 1
task "test" {
driver = "java"
config {
jar_path = "/usr/share/java/test.jar"
jvm_options = [
"-Xmx256m",
"-Xms256m"]
}
resources {
cpu = 300
memory = 256
}
}
}
Now I want to understand/see how Nomad selects the node for the allocations. All 3 nodes have the same resources, so shouldn't the jobs be distributed equally?
EDIT: Suddenly the jobs are being distributed. So my new question is: is there a verbose output or something where I can see how and why Nomad chooses a specific node when starting a new job?

As given in the official documentation:
The second phase is ranking, where the scheduler scores feasible nodes to find the best fit. Scoring is primarily based on bin packing, which is used to optimize the resource utilization and density of applications, but is also augmented by affinity and anti-affinity rules. Nomad automatically applies a job anti-affinity rule which discourages colocating multiple instances of a task group. The combination of this anti-affinity and bin packing optimizes for density while reducing the probability of correlated failures.
This means that Nomad will try to "fill up" a particular node first. Let me take an example:
Suppose you have three jobs with requirements:
j1(200M), j2(300M), j3(500M)
and have three nodes with free resources
n1(1G), n2(2G), n3(3G).
In this case, Nomad will choose the node that gets filled up first. So when you try to schedule j1, n1 will be selected. The state of the nodes with their remaining resources will then be:
n1(800M), n2(2G), n3(3G)
Now suppose you want to schedule j2. Again, n1 will be selected, because placing j2 there fills that node faster than placing it on n2 or n3 would fill those.
Hence, after scheduling j1, j2 and j3, your final allocation with remaining free resources will look like:
n1(j1,j2,j3)(0M), n2(2G), n3(3G)
(treating 1G as 1000M). Now, if a job j4 comes in with a 200M requirement, it no longer fits on n1, so n2 will be selected. This brings your cluster state to:
n1(j1,j2,j3)(0M), n2(j4)(1800M), n3(3G)
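To make the selection rule concrete, here is a minimal toy simulation of that bin-packing idea in Java (16+, for records). It is not Nomad's actual scoring code, just an illustration of "pick the feasible node that ends up with the least free memory", and it reproduces the placements in the example above:

import java.util.*;

// Toy model of bin-packing placement, mirroring the example above.
// This is a simplified illustration, not Nomad's real scoring algorithm.
public class BinPackDemo {
    record Node(String name, int freeMb) {}

    // Pick the feasible node that would be left with the least free memory.
    static Optional<Node> pick(List<Node> nodes, int jobMb) {
        return nodes.stream()
                .filter(n -> n.freeMb() >= jobMb)
                .min(Comparator.comparingInt(n -> n.freeMb() - jobMb));
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>(List.of(
                new Node("n1", 1000), new Node("n2", 2000), new Node("n3", 3000)));
        int[] jobs = {200, 300, 500, 200}; // j1..j4 from the example

        for (int i = 0; i < jobs.length; i++) {
            int need = jobs[i];
            Node chosen = pick(nodes, need).orElseThrow();
            nodes.set(nodes.indexOf(chosen), new Node(chosen.name(), chosen.freeMb() - need));
            System.out.println("j" + (i + 1) + " -> " + chosen.name() + ", state: " + nodes);
        }
    }
}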
If you would like to understand more about how bin packing works in Nomad, you can check the scheduling internals documentation at https://www.nomadproject.io/docs/internals/scheduling.html
Also, the calculation of the weights used for assigning allocations is exposed in the result of the evaluation API. This means that if you issue the following command:
$ nomad plan <job file>
and then note down the eval ID and make an HTTP request to the evaluation API:
$ curl localhost:4646/v1/evaluation/<eval_id>
you will get the result of the scheduler's calculations and the conditions for the Nomad job to be scheduled.
The plan command is also very useful for understanding TaskGroup allocations; it will tell you whether you have enough resources to run the job in your datacenter.
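If you prefer to hit the evaluation endpoint from code instead of curl, a minimal sketch with Java's built-in HTTP client (Java 11+) could look like the following. The address and port are the Nomad defaults, the eval ID is the one printed by nomad plan, and the class name is just illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: fetch the evaluation result from a local Nomad agent.
// Assumes the default HTTP address (localhost:4646) and a known eval ID.
public class EvalLookup {
    public static void main(String[] args) throws Exception {
        String evalId = args[0]; // eval ID printed by `nomad plan`
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:4646/v1/evaluation/" + evalId))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // raw JSON describing the evaluation
    }
}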

Related

Spark partition on nodes foreachpartition

I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible). In my code I have something like this:
JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray);
rdd_data.foreachPartition(partitionOfRecords -> {
    while (partitionOfRecords.hasNext()) {
        MyData d = partitionOfRecords.next();
        LOG.info("my data: " + d.getId().toString());
    }
});
myArray is composed of 1200 MyData objects.
I don't understand why Spark uses only 2 cores, divides my array into 2 partitions, and doesn't use the 16 cores.
Do I need to set the number of partitions?
Thanks in advance for any help.
Generally it's a good idea to specify the number of partitions as the second argument to parallelize, since the optimal slicing of your dataset should really be independent of the particular shape of the cluster you're using, and Spark can at best use the current sizes of executors as a "hint".
What you're seeing here is that Spark defaults to asking the taskScheduler for the current number of executor cores to use as the defaultParallelism, combined with the fact that dynamic allocation is enabled in Dataproc Spark. Dynamic allocation is important because otherwise a single job submitted to a cluster might request the maximum number of executors even if it sits idle, which would prevent other jobs from using those idle resources.
So on Dataproc, if you're using default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your default job, upon startup without doing any work, would sit on 1 executor with 2 cores. Then taskScheduler will report that 2 cores are currently reserved in total, and therefore defaultParallelism will be 2.
If you had a large cluster and you had already been running a job for a while (say, a map phase that runs for longer than 60 seconds), you'd expect dynamic allocation to have taken all available resources, so the next step of the job that uses defaultParallelism would then presumably be 16, which is the total number of cores on your cluster (or possibly 14, if 2 are consumed by an appmaster).
In practice, you probably want to parallelize into a larger number of partitions than the total cores available anyway. Then if there's any skew in how long each element takes to process, you get nice balancing: fast tasks finish and those executors can start taking on new partitions while the slow ones are still running, instead of always having to wait for a single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to 100x or more.
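As a sketch of what that looks like in practice (the 1200 elements mirror the question; 64 partitions is just an illustrative choice of a few times the 16 cores, not a recommendation):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: parallelize with an explicit number of partitions instead of
// relying on defaultParallelism (which may be as low as 2 at startup).
public class ExplicitPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("explicit-partitions");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> myArray = new ArrayList<>();
            for (int i = 0; i < 1200; i++) {
                myArray.add(i); // stand-in for the MyData objects in the question
            }

            System.out.println("defaultParallelism = " + sc.defaultParallelism());

            // Second argument = number of partitions; 64 is an illustrative choice.
            JavaRDD<Integer> rdd = sc.parallelize(myArray, 64);
            System.out.println("partitions = " + rdd.getNumPartitions());
        }
    }
}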
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

About nodes number on Flink

I'm developing a Flink toy application on my local machine before deploying the real one on a real cluster.
Now I have to determine how many nodes the cluster needs.
But I'm still a bit confused about how many nodes I have to plan for in order to execute my application.
For example if I have the following code (from the doc):
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>(...));
DataStream<Event> events = lines.map((line) -> parse(line));
DataStream<Statistics> stats = events
    .keyBy("id")
    .timeWindow(Time.seconds(10))
    .apply(new MyWindowAggregationFunction());
stats.addSink(new RollingSink(path));
Does this mean that operations "on the same line" are executed on the same node? (It sounds a bit strange to me.)
Some things to confirm:
If the answer to the previous question is yes, and I set parallelism to 1, can I establish how many nodes I need by counting how many operations I have to perform?
If I set parallelism to N but have fewer than N nodes available, does Flink automatically scale the processing across the available nodes?
I don't think my throughput and data load are relevant; the load is not heavy.
If you haven't already, I recommend reading https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html, which explains how the Flink runtime is organized.
Each task manager (worker node) has some number of task slots (at least one), and a Flink cluster needs exactly as many task slots as the highest parallelism used in the job. So if the entire job has a parallelism of one, then a single node is sufficient. If the parallelism is N and fewer than N task slots are available, the job can't be executed.
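As a small, hedged sketch of how this plays out in code (the operator and values are illustrative, not taken from your job): the highest parallelism set anywhere in the job is what determines the number of task slots the cluster must provide.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: a cluster needs as many task slots as the highest parallelism used in
// the job. All names and values here are illustrative.
public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-wide default parallelism: with 1, a single task slot (and hence a
        // single worker node with one slot) is sufficient.
        env.setParallelism(1);

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .setParallelism(1) // per-operator parallelism can differ; the maximum
                              // across all operators determines the slots required
           .print();

        env.execute("parallelism-sketch");
    }
}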
The Flink community is working on dynamic rescaling, but as of version 1.3, it's not yet available.

How to submit a job to any [subset] of nodes from nodelist in SLURM?

I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. These jobs should run only on a subset of 7 of the available nodes. Some of the tasks are parallelized and use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follow:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates, and hence leaves 3 nodes completely unused; depending on the task (multi- or single-threaded), the currently active node might also be under low load in terms of CPU capacity.
What are the best parameters of sbatch that force slurm to run multiple jobs at the same time on the specified nodes?
You can work the other way around: rather than specifying which nodes to use (which has the effect of allocating all 7 nodes to each job), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n, where n is the number of threads of each job.
Some of the tasks are parallelized, hence use all the CPU power of a single node while others are single threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf)
Then, to constrain parallel jobs to a whole node, you can either use the --exclusive option (but note that partition configuration takes precedence: you can't have shared nodes if the partition is configured for exclusive access), or use -N 1 --ntasks-per-node="number_of_cores_in_a_node" (e.g., -N 1 --ntasks-per-node=8).
Note that the latter will only work if all nodes have the same number of cores.
None of the tasks should spawn over multiple nodes.
This should be guaranteed by -N 1.
Actually, I think the way to go is setting up a 'reservation' first, according to this presentation: http://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf (last slide).
Scenario: Reserve ten nodes in the default SLURM partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users alan and brenda.
scontrol create reservation user=alan,brenda starttime=noon duration=60 flags=daily nodecnt=10
Reservation created: alan_6
scontrol show res
ReservationName=alan_6 StartTime=2009-02-05T12:00:00
EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017] NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null)
Users=alan,brenda Accounts=(null)
# submit job with:
sbatch --reservation=alan_6 myScript.sh
Unfortunately, I couldn't test this procedure, probably due to a lack of privileges.

Confusion of how hadoop splits work

We are Hadoop newbies. We realize that Hadoop is meant for processing big data, and that a Cartesian product is extremely expensive. However, we are running some experiments with a Cartesian product job similar to the one in the MapReduce Design Patterns book, except with a reducer calculating the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2).
Our setting: a 3-node cluster, block size = 64MB. We tested different data set sizes ranging from 5000 points (130KB) to 10000 points (260KB).
Observations:
1- All map tasks run on one node, sometimes on the master machine, other times on one of the slaves, but never on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among machines? Based on what factors does Hadoop decide which machine is going to process the map tasks (in our case it once chose the master, another time a slave)?
2- In all cases, testing the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we get 4 splits and not 1?
3- Is there a way to see more information about the exact splits for a running job?
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks among your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, then it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or not enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that is two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you are still getting four splits.
3) You can see the running job in the UI of your ResourceManager. http://resourcemanager-host:8088 will open the main page (jobtracker-host:50030 for MRv1), and you can navigate to your running job from there, which will let you see the individual tasks that are running. If you want more specifics on what the input format is doing, add some log statements to the CartesianInputFormat's getSplits method and re-run your code to see what is going on.
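For question 3, you can also inspect the splits from driver code before submitting the job. Here is a hedged sketch using the newer MapReduce API; the class name is illustrative, and you would wire in the same input format and paths as your Cartesian product driver:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: print the splits your configured InputFormat would produce, before
// submitting the job, including their lengths and preferred locations.
public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-inspector");
        // ... configure job.setInputFormatClass(...) and the input paths
        //     exactly as in your Cartesian product driver ...

        InputFormat<?, ?> inputFormat =
                ReflectionUtils.newInstance(job.getInputFormatClass(), job.getConfiguration());
        for (InputSplit split : inputFormat.getSplits(job)) {
            System.out.println(split
                    + " length=" + split.getLength()
                    + " locations=" + Arrays.toString(split.getLocations()));
        }
    }
}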

In Hadoop can we control the number of nodes per job programatically?

I am running a job timing analysis. I have a pre-configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes, 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way I can do this programmatically, i.e., by using appropriate settings in the job configuration in Java code?
There are a couple of ways; I would prefer them in the order listed.
Exclude files can be used to prevent some of the task trackers/data nodes from connecting to the job tracker/name node. Check this FAQ. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note that once the files have been changed, the name node and the job tracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option, and it might take some time for the cluster to settle because data blocks have to be moved off the excluded nodes.
Another way is to stop the task tracker on the nodes. Then the map/reduce tasks will not be scheduled on that node. But, the data will still be fetched from all the data nodes. So, the data nodes also need to be stopped. Make sure that the name node gets out of safe mode and the replication factor is also set properly (with 2 data nodes, the replication factor can't be 3).
The Capacity Scheduler can also be used to limit the usage of the cluster by a particular job. But when resources are free/idle, the scheduler will allocate resources beyond the configured capacity for better utilization of the cluster. I am not sure whether this can be disabled.
Are you good with scripting? If so, play around with the start scripts of the daemons. Since this is an experimental setup, I think restarting Hadoop for each experiment should be fine.
