How can I see the CPU usage in a Slurm job? - cluster-computing

Is there a way to monitor the % of CPU usage in a cluster using Slurm?
For example, imagine I have 200 nodes and I submit an MPI job that uses all 200 of them; it could be that only one node is actually being used (really calculating stuff) while the others are not.
Is there an option that tells me the average CPU load across the 200 nodes, or the current CPU load on every one of the CPUs?
EDIT: on a BlueGene machine
Thanks.

The Slurm command:
sstat "jobid"
Replace "jobid" with your integer job ID.
It will return several columns, including 'AveCPU' and 'AveDiskRead'.
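For example, a minimal sketch assuming a running job with ID 12345 (sstat only reports on steps of running jobs; sacct gives similar figures for completed ones):
sstat -j 12345 --format=JobID,AveCPU,AveRSS,AveDiskRead
sacct -j 12345 --format=JobID,Elapsed,TotalCPU,MaxRSS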

Related

Does SLURM support running multiple jobs on one node at the same time?

Our computer cluster runs Slurm version 15.08.13 and MPICH version 3.2.1. My question is: can Slurm run multiple jobs on one node at the same time? Our cluster has 16 CPU cores per node, and we want to run two jobs at the same time on one node, each using 8 cores.
We have found that if a job uses all of the CPU cores of a node, the state of the node becomes "allocated". If a job uses only part of the CPU cores of a node, the state of the node becomes "mixed", but subsequent jobs can only be queued and their state is "pending".
Our command for submitting a job is as follows:
srun -N1 -n8 testProgram
So, does Slurm support running multiple jobs on one node at the same time? Thanks.
Yes, provided it was configured with SelectType=select/cons_res, which does not seem to be the case on your system. You can check with scontrol show config | grep Select. See the Slurm documentation for more information.
Yes, you need to set SelectType=select/cons_res or SelectType=select/cons_tres
and SelectTypeParameters=CR_CPU_Memory
The difference between cons_res and cons_tres is that cons_tres adds GPU support.
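A minimal slurm.conf sketch for this (note that cons_tres requires a fairly recent Slurm release, so on an older one such as 15.08 cons_res is the choice, and the Slurm daemons generally need to be restarted for a SelectType change to take effect):
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory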

Why do some worker nodes use more system CPU while running a Spark application?

I have 1 master node and 4 worker nodes. I set up the cluster using Ambari, and all monitoring metrics are collected from its dashboard. Spark runs on top of Hadoop, so I have YARN and HDFS. I ran a very simple word count script and found that one of the worker nodes did most of the work. The word count job is divided into 149 tasks, and 98 of them were done by one node.
Here is my code for counting words
val file = sc.textFile("/data/2gdata.txt") //read file from HDFS
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect()
This picture illustrates the events timeline and CPU usage for each worker node.
Aggregated Metrics by Executor are shown here.
Each task has the same size of input. I assumed each would take a similar amount of time, around 30 seconds, to count the words in its piece of the input file, but some tasks took more than 10 minutes.
I realized that the nodes doing less work spent more CPU on system operations, as shown in the blue area of the first graph, while the worker that did more tasks spent more CPU on user (application) time.
I am wondering what kinds of system operations a Spark application requires, and why three of the worker nodes spend more CPU on system time. I also enabled spark.speculation, but the stragglers were only killed after 10 minutes and performance didn't improve. Moreover, those stragglers are node_local, so I assume this issue is not related to HDFS replication. (There are 3 replicas within the rack.)
Thank you very much.
Even if the input size is the same for each task, some tasks may process more data than others during the shuffle and reduce phase; such data skew can cause higher CPU costs.
Repartitioning the data in between may improve the performance, as sketched below.
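A minimal sketch with the code above (the partition count of 16 is an assumption to tune for your cluster):
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 16) // pass an explicit partition count to the shuffle, or call .repartition(16) beforehand
counts.collect()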

How to submit a job to any [subset] of nodes from nodelist in SLURM?

I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. These jobs should run only on a subset of 7 of the available nodes. Some of the tasks are parallelized and hence use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follow:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates before starting the next one, and hence leaves 3 nodes completely unused; moreover, depending on the task (multi- or single-threaded), the currently active node might also be under low load in terms of CPU capacity.
What are the best parameters of sbatch that force slurm to run multiple jobs at the same time on the specified nodes?
You can work the other way around: rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than those 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n with n the number of threads of each job; a sketch of such a script follows.
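A minimal sketch of myScript.sh for a single-threaded job (the program name and time limit are placeholders):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
srun ./my_single_threaded_program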
Some of the tasks are parallelized, hence use all the CPU power of a single node while others are single threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf)
Then, to constrain parallel jobs to a whole node, you can either use the --exclusive option (but note that partition configuration takes precedence: you can't have shared nodes if the partition is configured for exclusive access), or use -N 1 --ntasks-per-node="number_of_cores_in_a_node" (e.g., -N 1 --ntasks-per-node=8).
Note that the latter will only work if all nodes have the same number of cores.
None of the tasks should span multiple nodes.
This should be guaranteed by -N 1.
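Putting this together, the two submission patterns might look like this (the script names are placeholders):
# single-threaded job, can share a node
sbatch --exclude=myCluster[01-09] -N 1 -n 1 --cpus-per-task=1 serial.sh
# parallel job, takes a whole node
sbatch --exclude=myCluster[01-09] -N 1 --exclusive parallel.sh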
Actually, I think the way to go is setting up a 'reservation' first, according to this presentation: http://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf (last slide).
Scenario: Reserve ten nodes in the default SLURM partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users alan and brenda.
scontrol create reservation user=alan,brenda starttime=noon duration=60 flags=daily nodecnt=10
Reservation created: alan_6
scontrol show res
ReservationName=alan_6 StartTime=2009-02-05T12:00:00
EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017] NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null)
Users=alan,brenda Accounts=(null)
# submit job with:
sbatch --reservation=alan_6 myScript.sh
Unfortunately I couldn't test this procedure, probably due to a lack of privileges.

Hadoop workload

I am currently using the word count application in Hadoop as a benchmark. I find that the CPU usage is fairly constant, around 80-90%. I would like to have a fluctuating CPU usage. Is there any Hadoop application that can give me this capability? Thanks a lot.
I don't think there's a way to throttle or specify a CPU range for Hadoop to use; Hadoop will use whatever CPU is available to it. When I'm running a lot of jobs, I'm constantly in the 90%+ range.
One way you can control the CPU usage is to change the maximum number of mappers/reducers each tasktracker can run simultaneously. This is done through the
mapred.tasktracker.{map|reduce}.tasks.maximum setting in $HADOOP_HOME/conf/mapred-site.xml.
That tasktracker will use less CPU when the number of mappers/reducers is limited.
Another way is to set the configuration value for mapred.{map|reduce}.tasks when setting up the job. This will force that job to use that many mappers/reducers. This number will be split across the available tasktrackers, so if you have 4 nodes and want each node to have 1 mapper you'd set mapred.map.tasks to 4. It's also possible that if a node can run 4 mappers, it will run all 4; I don't know exactly how Hadoop will split out the tasks, but forcing a number per job is an option.
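As a sketch, capping each tasktracker at 2 concurrent map slots would look like this in mapred-site.xml (the value of 2 is an arbitrary example):
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>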
I hope that helps get you to where you're going. I still don't quite understand what you are looking for. :)

PBS: Requesting only a single core per node without requesting the entire node

I've got processes that need to be farmed out over a cluster that supports PBS; however, due to limitations with the process, I can only run one process per node at a time. Each node has two processors, so the ghetto approach would be to simply request two processors per job, but that wastes a core per job. Is it possible to request a single core per job while making sure that only a single process from all of my jobs is running at a time on a given node?
qsub -l place=free:excl should do the trick
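For example, on PBS Pro the full request could look like this (myJob.sh is a placeholder; Torque/older PBS versions use a different resource syntax):
qsub -l select=1:ncpus=1 -l place=free:excl myJob.sh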
