How can I get information about the status of the computational nodes on the torque-managed cluster (I am interested in the number of nodes that are allocated for jobs vs. idle ones)? Under SLURM I would use sinfo.
You would normally use "pbsnodes -a" and then parse the output for what you'd like.
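For a quick summary by state, you can count the state lines in the pbsnodes output; a rough sketch (the exact state names, e.g. free, job-exclusive, offline, depend on your Torque version):
# count nodes per state:
pbsnodes -a | grep " state = " | sort | uniq -c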
Does a stateless node just mean being independent of the others? Can you explain this concept w.r.t. Hadoop?
The explanation can be as follows: each mapper/reducer has no idea about all the other mappers/reducers (i.e. about their current states, their particular outputs if any, etc.). Such statelessness is not great for certain data processing workloads (e.g. graph data) but allows easy parallelization: a particular map/reduce task can be run on any node, so a failed mapper/reducer is not an issue; just start a new one on the same input split / mappers' outputs.
I would say that statefulness of nodes in computing infrastructures has a slightly different meaning from what you have defined. Remember there is always a coordination process running somewhere, so there is no complete independence between the nodes.
What it actually means in computing infrastructures is that the nodes do not store anything about the computation they are performing on persistent storage. Consider the following: you have a master running on some machine delegating tasks to the workers; the workers maintain the information in RAM and retrieve it from RAM when necessary for the computation, and they also write results into RAM. You can consider such worker nodes stateless, since whenever a worker node fails (from a power cut, for example) it has no mechanism that would allow it to recover the execution from the point at which it stopped. But the master will still know that the node has failed and will delegate the task to another machine in the cluster.
Regarding Hadoop, the architecture is stateful: first of all, because whenever a job starts its execution, all the metadata (the jar file, split locations, etc.) is transferred to the worker node. Secondly, when a task is scheduled on a node that does not contain its input data, the data is transferred there. Additionally, the intermediate data is stored on disk, precisely for failure-recovery reasons, so the recovery mechanisms can resume the job from the point where execution stopped.
I have a couple of thousand jobs to run on a SLURM cluster with 16 nodes. These jobs should run only on a subset of 7 of the available nodes. Some of the tasks are parallelized, hence use all the CPU power of a single node, while others are single-threaded. Therefore, multiple jobs should run at the same time on a single node. None of the tasks should span multiple nodes.
Currently I submit each of the jobs as follow:
sbatch --nodelist=myCluster[10-16] myScript.sh
However, this parameter makes Slurm wait until the submitted job terminates, and hence leaves 3 nodes completely unused; depending on the task (multi- or single-threaded), the currently active node might also be under low load in terms of CPU capability.
What are the best parameters of sbatch that force Slurm to run multiple jobs at the same time on the specified nodes?
You can work the other way around; rather than specifying which nodes to use (with the effect that each job is allocated all 7 nodes), specify which nodes not to use:
sbatch --exclude=myCluster[01-09] myScript.sh
and Slurm will never allocate more than 7 nodes to your jobs. Make sure, though, that the cluster configuration allows node sharing, and that your myScript.sh contains #SBATCH --ntasks=1 --cpus-per-task=n, with n the number of threads of each job.
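For illustration, the top of myScript.sh could then look roughly like this (a sketch; the thread count of 4 and the program name are placeholders):
#!/bin/bash
# request a single task with 4 CPUs (threads) on one node; 4 is a placeholder
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
srun ./my_program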
Some of the tasks are parallelized, hence use all the CPU power of a single node, while others are single-threaded.
I understand that you want the single-threaded jobs to share a node, whereas the parallel ones should be assigned a whole node exclusively?
multiple jobs should run at the same time on a single node.
As far as my understanding of SLURM goes, this implies that you must define CPU cores as consumable resources (i.e., SelectType=select/cons_res and SelectTypeParameters=CR_Core in slurm.conf)
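That corresponds to roughly the following lines in slurm.conf; you can also inspect the live values without admin rights:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# check the current setting:
scontrol show config | grep -i Select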
Then, to constrain parallel jobs to a whole node you can either use the --exclusive option (but note that partition configuration takes precedence: you can't have shared nodes if the partition is configured for exclusive access), or use -N 1 --ntasks-per-node="number_of_cores_in_a_node" (e.g., -N 1 --ntasks-per-node=8).
Note that the latter will only work if all nodes have the same number of cores.
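You can quickly check the core count per node with something like the following (a sketch; the format string may need tweaking for your Slurm version):
# node-oriented listing of hostnames and CPU counts:
sinfo -N -o "%N %c"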
None of the tasks should span multiple nodes.
This should be guaranteed by -N 1.
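Putting it together, the submissions could then look roughly like this (a sketch reusing the 8-core example above; myScript.sh and myParallelScript.sh are placeholders):
# single-threaded job, shares a node with other jobs:
sbatch --exclude=myCluster[01-09] --ntasks=1 --cpus-per-task=1 myScript.sh
# parallel job, fills a whole node (assuming 8 cores per node):
sbatch --exclude=myCluster[01-09] -N 1 --ntasks-per-node=8 myParallelScript.sh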
Actually, I think the way to go is to set up a 'reservation' first. According to this presentation http://slurm.schedmd.com/slurm_ug_2011/Advanced_Usage_Tutorial.pdf (last slide):
Scenario: Reserve ten nodes in the default SLURM partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users alan and brenda.
scontrol create reservation user=alan,brenda starttime=noon duration=60 flags=daily nodecnt=10
Reservation created: alan_6
scontrol show res
ReservationName=alan_6 StartTime=2009-02-05T12:00:00
EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017] NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null)
Users=alan,brenda Accounts=(null)
# submit job with:
sbatch --reservation=alan_6 myScript.sh
Unfortunately I couldn't test this procedure, probably due to a lack of privileges.
I've noticed that all map and reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block which resides on node2. When running a MapReduce task whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?
I have a 3-node cluster running on KVMs, created by following the Cloudera CDH4 distributed installation guide.
I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data.
Well, there might be an exceptional case:
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to process a replica of that block locally (if the replication factor is > 1).
HTH
I don't understand what you mean when you say "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a Hadoop cluster to store some block on a specific node.
Hadoop will decide the number of mappers based on the input's size. If the input size is less than the HDFS block size (the default, I think, is 64 MB), it will spawn just one mapper.
You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force spawning multiple mappers (the default should suffice in most cases).
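If your job's driver goes through GenericOptionsParser (i.e. implements Tool), you can pass it on the command line; a rough sketch (the jar, class, paths, and the 32 MB value are placeholders, and the value is in bytes):
# cap split size at 32 MB to force more map tasks:
hadoop jar myjob.jar MyJob -D mapred.max.split.size=33554432 /input /output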
Can anyone help me understand the following observation, which is the opposite of my understanding of Hadoop data locality?
A Hadoop cluster with 3 nodes:
master: 10.28.75.146
slave1: 10.157.6.202
slave2: 10.31.130.224
I ran a task successfully. From the job console:
Task Attempts:attempt_201304030122_0003_m_000000_0
Machine: /default-rack/10.31.130.224
Task log: INFO: consuming hdfs://10.28.75.146:9000/input/22.seq
We know the 224 node is processing the /input/22.seq data. By running the command:
$hadoop fsck /input -files -blocks -locations |grep -A 1 "22.seq"
/input/22.seq 61731242 bytes, 1 block(s): OK
0. blk_-8703092405392537739_1175 len=61731242 repl=1 [10.157.6.202:9200]
22.seq fits in one block, which is smaller than the default HDFS block size (64 MB), and it is not replicated to another node.
Question: since 22.seq is not local to the 224 node, why does Hadoop assign the 224 node to process this data remotely from 202?
Note: this is not an exception. I notice many data files are fetched remotely, and I observe huge network traffic on eth0 at both nodes. I am expecting near-zero traffic between the two nodes, since all my data files are <64MB and the data should be processed locally.
FYI: This is observed on Amazon's AWS EMR.
I am not sure if this will answer your question fully, but I will attempt to shine some light.
The network traffic you encountered above may have been influenced by the process by which the MapReduce framework submits a job, part of which, by default, transfers 10 copies of your job jar and all the libraries contained therein across the cluster (in cases like yours where there are not 10 nodes, I am not sure how it behaves). There are also heartbeats, fetching of input split info, and progress reporting, which seem like small-bandwidth operations, although I am ignorant about the specifics of their network resource consumption.
Regarding the job you are running: if it is a map-only job, then Hadoop tries (tries, because there may be resource-limiting factors on the data-local node) to apply the data locality optimization and run the task where the input split is located. It sounds like in your case the file is less than the default 64 MB, so one split should equal your data, which in turn should result in one map, since the number of maps is directly proportional to the number of splits you have. But if your job is a map and reduce job, then the network traffic may be picking up some of the HTTP traffic from the reduce copy and sort phase, which can end up on separate nodes.
N Input Splits = N Maps --output--> M partitions = M Reducers
Of course, the network traffic and data locality optimizations depend on the availability of the nodes' resources, so your test assumptions should take this into consideration.
Hope I was a tiny bit helpful.
Short answer: because the Hadoop scheduler sucks. It has no up-front global plan for which file split should go where. As nodes ask for work, it looks at the available splits and gives out the best match. There are parameters that tune how aggressive Hadoop is in finding the best match (i.e., when a request for work arrives, does it give the best match available at that time, or does it wait a while to see if other, better-matching nodes also send requests?).
By default (and I am pretty sure this is the case with EMR), the scheduler will always give some work back to a requesting node if there is any work available. You can see that if your input is small (spans only a few blocks/nodes) but the number of nodes is comparatively large, then you will get very poor locality. On the other hand, if the size of the input is large, then your odds of getting good locality go up a lot.
The FairScheduler has parameters to delay scheduling so as to get better locality. However, I don't think that is the default scheduler on EMR.
I am running a job timing analysis. I have a pre-configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes, 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way I can do this programmatically, i.e. by using appropriate settings in the job configuration in Java code?
There are a couple of ways; I would prefer them in the order given below.
Exclude files can be used to prevent some of the task trackers/data nodes from connecting to the job tracker/name node. Check this FAQ. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note that once the files have been changed, the name node and the job tracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option, and it might take some time for the cluster to settle, because data blocks have to be moved off the excluded nodes.
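A rough sketch of that flow (hostnames and the exclude-file path are placeholders; this assumes an MR1-style cluster with a JobTracker and NameNode):
# add the hosts to drop to the file referenced by dfs.hosts.exclude and
# mapreduce.jobtracker.hosts.exclude.filename, one hostname per line:
echo "slave7.example.com" >> /etc/hadoop/conf/excludes
echo "slave8.example.com" >> /etc/hadoop/conf/excludes
# then have the name node and job tracker re-read the exclude files:
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes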
Another way is to stop the task tracker on the nodes. Then map/reduce tasks will not be scheduled on those nodes. But the data will still be fetched from all the data nodes, so the data nodes also need to be stopped. Make sure that the name node gets out of safe mode and that the replication factor is also set properly (with 2 data nodes, the replication factor can't be 3).
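For example, on each node you want to take out of the cluster (assuming the standard Hadoop 1.x daemon scripts):
# stop the compute daemon, then the storage daemon, on that node:
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode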
The Capacity Scheduler can also be used to limit the usage of the cluster by a particular job. But when resources are free/idle, the scheduler will allocate resources beyond the configured capacity for better utilization of the cluster; I am not sure if this can be stopped.
Well, are you good with scripting? If so, play around with the start scripts of the daemons. Since this is an experimental setup, I think restarting Hadoop for each experiment should be fine.
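A rough sketch of such a script (assuming an MR1 layout, password-less ssh to the workers, and pre-made host lists slaves-8.txt ... slaves-2.txt; the job jar and paths are placeholders):
for n in 8 6 4 2; do
  cp slaves-$n.txt $HADOOP_HOME/conf/slaves     # worker hostnames for this run
  $HADOOP_HOME/bin/stop-all.sh
  $HADOOP_HOME/bin/start-all.sh
  # run the job and note the wall-clock time for this node count
  time hadoop jar myjob.jar MyJob /input /output-$n
done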