I am trying to submit a parallel job on a CentOS 6.8 machine with 3 nodes. Is there a queueing system, and how do I use it (I want to specify the number of nodes used, the number of threads per node, etc.)?
Any good resource/link to start learning from?
Thanks a lot.
GNU Parallel is a very simple solution: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-queue-system-batch-manager
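If GNU Parallel fits your case, the manual's queue example boils down to something like the sketch below (node1..node3, ./my_program, and the -j value are placeholders for your own hostnames, command, and threads per node):

true > jobqueue                                   # create an empty job queue
tail -n+0 -f jobqueue | parallel -j 4 --sshlogin node1,node2,node3

# From another shell, append commands to the queue as they come in;
# parallel runs at most 4 of them per node:
echo "./my_program input1" >> jobqueue
echo "./my_program input2" >> jobqueue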
Our computer cluster runs Slurm version 15.08.13 and MPICH version 3.2.1. My question is: can Slurm support multiple jobs running on one node at the same time? Our cluster has 16 CPU cores per node, and we want to run two jobs on one node at the same time, each job using 8 cores.
We have found that if a job uses all of the CPU cores of a node, the node's state becomes "allocated". If a job uses only part of the CPU cores of a node, the node's state becomes "mixed", but subsequent jobs are only queued and their state is "pending".
The command we use to submit a job is as follows:
srun -N1 -n8 testProgram
So, does Slurm support running multiple jobs on one node at the same time? Thanks.
Yes, provided it was configured with SelectType=select/cons_res, which does not seem to be the case on your system. You can check with scontrol show config | grep Select. See more information here.
Yes, you need to set SelectType=select/cons_res or SelectType=select/cons_tres and SelectTypeParameters=CR_CPU_Memory.
The difference between cons_res and cons_tres is that cons_tres adds GPU support.
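As a rough sketch of what that looks like in practice (the slurm.conf path and the restart step may differ on your installation):

# Check which selection plugin is currently active, as suggested above:
scontrol show config | grep Select

# Relevant lines in slurm.conf (commonly /etc/slurm/slurm.conf); restart
# slurmctld and slurmd after changing them:
#   SelectType=select/cons_res
#   SelectTypeParameters=CR_CPU_Memory

# With CPUs treated as consumable resources, two 8-core jobs can then
# share one 16-core node:
srun -N1 -n8 testProgram &
srun -N1 -n8 testProgram &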
I have a Hadoop system running. It has a total of 8 map slots in parallel. The DFS block size is 128M.
Now suppose I have two jobs, both with large input files, say a hundred GB each. I want them to run in parallel on the Hadoop system (because the users do not want to wait; they want to see some progress). I want the first one to take 5 map slots and the second one to run on the remaining 3 map slots. Is it possible to specify the number of map slots? Currently I start a job from the command line as hadoop jar jarfile classname input output. Can I specify it on the command line?
Thank you very much for the help.
Resource allocation can be done using a scheduler. Classic Hadoop uses the JobQueueTaskScheduler, while YARN uses the CapacityScheduler by default. According to the Hadoop documentation:
This document describes the CapacityScheduler, a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
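A hedged sketch of how that could look on a YARN cluster (the queue names and capacities below are made up, and the -D option is only picked up if the job's driver uses ToolRunner/GenericOptionsParser):

# Suppose capacity-scheduler.xml defines two queues, "big" and "small":
#   yarn.scheduler.capacity.root.queues         = big,small
#   yarn.scheduler.capacity.root.big.capacity   = 62
#   yarn.scheduler.capacity.root.small.capacity = 38
# Each job can then be sent to its own queue from the command line:
hadoop jar jarfile classname -D mapreduce.job.queuename=big input1 output1
hadoop jar jarfile classname -D mapreduce.job.queuename=small input2 output2

With that, the two jobs share the cluster in roughly the proportion you asked for instead of one waiting for the other to finish.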
I want to run map-reduce tasks on a single machine and I want to use all the cores of my machine. What is the best approach? If I install Hadoop in pseudo-distributed mode, is it possible to use all the cores?
You can use the properties mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase the number of Mappers/Reducers spawned simultaneously on a TaskTracker, according to your hardware specs. By default, both are set to 2, so at most 2 maps and 2 reduces will run at any given instant. One thing to keep in mind, though, is that if your input is very small, the framework will decide it is not worth parallelizing the execution. In that case you need to handle it by tweaking the default split size through mapred.max.split.size.
Having said that, based on my personal experience I have noticed that MR jobs are normally I/O bound (and sometimes memory bound). So the CPU does not really become a bottleneck under normal circumstances, and as a result you might find it difficult to fully utilize all the cores on one machine at a time for a job.
I would suggest devising a strategy to decide the proper number of Mappers/Reducers to carry out the processing efficiently and make sure you are properly utilizing the CPU, since Mappers/Reducers take up slots on each node. One approach is to take the number of cores, multiply it by 0.75, and then set the number of Mappers and Reducers as per your needs. For example, if you have 12 physical cores (24 virtual cores), you could have 24 * 0.75 = 18 slots. Based on your needs you can then decide whether to use 9 Mappers + 9 Reducers, 12 Mappers + 6 Reducers, or something else.
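For the 18-slot example above, a 12-map/6-reduce split could look like this (the mapred-site.xml location shown is the usual MR1 default and may differ on your install):

# Slot limits live in $HADOOP_HOME/conf/mapred-site.xml, e.g.:
#   mapred.tasktracker.map.tasks.maximum    = 12
#   mapred.tasktracker.reduce.tasks.maximum = 6
# Restart the TaskTracker so the new slot counts take effect:
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker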
I'm reposting my answer from this question: Hadoop and map-reduce on multicore machines
For Apache Hadoop 2.7.3, my experience has been that enabling YARN will also enable multi-core support. Here is a simple guide for enabling YARN on a single node:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node
The default configuration seems to work pretty well. If you want to tune your core usage, then perhaps look into setting 'yarn.scheduler.minimum-allocation-vcores' and 'yarn.scheduler.maximum-allocation-vcores' within yarn-site.xml (https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml)
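For instance, a yarn-site.xml tuned for an 8-core box might carry values along these lines (the numbers are purely illustrative, not recommendations):

# In yarn-site.xml (example values for an 8-core machine):
#   yarn.scheduler.minimum-allocation-vcores = 1
#   yarn.scheduler.maximum-allocation-vcores = 8
#   yarn.nodemanager.resource.cpu-vcores     = 8
# After restarting YARN, confirm the NodeManager registered its resources:
yarn node -list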
Also, see here for instructions on how to configure a simple Hadoop sandbox with multicore support: https://bitbucket.org/aperezrathke/hadoop-aee
I am currently using the wordcount application in Hadoop as a benchmark. I find that the CPU usage is fairly constant at around 80-90%. I would like to have a fluctuating CPU usage. Is there any Hadoop application that can give me this capability? Thanks a lot.
I don't think there's a way to throttle or specify a range for hadoop to use. Hadoop will use the CPU available to it. When I'm running a lot of jobs, I'm constantly in the 90%+ range.
One way you can control the CPU usage is to change the maximum number of mappers/reducers each tasktracker can run simultaneously. This is done through the mapred.tasktracker.{map|reduce}.tasks.maximum setting in $HADOOP_HOME/conf/mapred-site.xml. The tasktracker will use less CPU when the number of mappers/reducers is limited.
Another way is to set the configuration values mapred.{map|reduce}.tasks when setting up the job. This will force that job to use that many mappers/reducers. This number will be split across the available tasktrackers, so if you have 4 nodes and want each node to have 1 mapper you'd set mapred.map.tasks to 4. It's also possible that if a node can run 4 mappers, it will run all 4; I don't know exactly how Hadoop will split out the tasks, but forcing a number, per job, is an option.
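As a sketch of that per-job route, assuming the job's driver uses ToolRunner so generic -D options are honored:

# Ask for 4 mappers and 2 reducers for this particular job. (The map count
# is treated as a hint by the framework, while the reduce count is honored.)
hadoop jar jarfile classname -D mapred.map.tasks=4 -D mapred.reduce.tasks=2 input output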
I hope that helps get you to where you're going. I still don't quite understand what you are looking for. :)
I'm currently working on a cluster using the ClusterVisionOS 3.1. This will be my first time working with a cluster, so I probably haven't tried the "obvious".
I can submit a single job to the cluster with the "qsub" command (this I got working properly).
But the problem starts when submitting multiple jobs at once. I could write a script sending them all at once, but then all the nodes would be occupied with my jobs, and there are more people here wanting to submit their jobs.
So here's the deal:
32 nodes (4 processors/slots each)
The best thing would be to tell the cluster to use 3 nodes (12 processors) and queue all my jobs on these nodes/processors, if this is even possible. If I could let the nodes use 1 processor for each job, then that would be perfect.
OK, so I guess I found out there is no solution to this problem. My personal solution is to write a script that connects to the cluster through ssh and then checks how many jobs are already running under my user name. The script checks whether that number exceeds, let's say, 20 jobs at the same time; as long as that number is not reached, it keeps submitting jobs.
Maybe it's an ugly solution, but a working one!
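For what it's worth, a rough sketch of such a throttling loop (the limit of 20, the jobs/*.sh pattern, and the sleep interval are arbitrary, and qstat output differs between batch systems, so the counting line may need adjusting):

#!/bin/bash
# Keep at most MAX_JOBS of my jobs in the queue; submit the rest gradually.
MAX_JOBS=20

for jobscript in jobs/*.sh; do
    # Wait while I already have MAX_JOBS or more jobs queued or running.
    while [ "$(qstat -u "$USER" | grep -c "$USER")" -ge "$MAX_JOBS" ]; do
        sleep 60
    done
    qsub "$jobscript"
done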
About the processor thing: the jobs were already being submitted to individual processors, so the nodes were being used to their full extent.