Arrays and user job limitations on SLURM?

I'm a new SLURM user and I'm trying to figure out the best way to submit a job that requires the same command to run 400,000 times with different input files (approximately 200MB memory per CPU, 4 minutes for one instance, each instance runs independently).
I read through the documentation, and so far it seems that arrays are the way to go.
I can use up to 3 nodes on my HPC with 20 cores each, which means that I could run 60 instances of my command at the same time. However, the per-user limit is 10 jobs running concurrently, with at most 20 jobs in the queue.
So far, everything I've tried runs each instance of the command as a separate job, thus limiting it to 10 instances in parallel.
How can I fully utilize all available cores in light of the job limits?
Thanks in advance for your help!

You can have a look at tools like GREASY that will allow you to run a single Slurm job and spawn multiple subtasks.
Its documentation explains how to install and use it.
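As a rough sketch of how GREASY is typically used (file and command names here are hypothetical), you build a plain task file with one independent command per line, and GREASY dispatches the lines across the cores of a single Slurm job:

```shell
# Build a task file for GREASY: one independent command per line.
# Command and file names are hypothetical.
: > tasks.txt                                  # start from an empty file
for i in $(seq 1 400); do
    echo "./my_command input_${i}.dat" >> tasks.txt
done
# Inside the sbatch job script you would then run:  greasy tasks.txt
```

One job in the queue then works through the whole task list, so the 10-job limit no longer caps your parallelism.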

You don't even need a job array to achieve this. First, submit a job via the sbatch job_script command; inside job_script you can customise the submission. Use srun with appropriate parameters and & inside a for loop to run as many tasks in parallel as your allocation allows.
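A minimal sketch of that approach (script, command, and input-file names are hypothetical, and exact step-throttling behaviour varies between Slurm versions): one sbatch job requests all 60 cores and packs many one-task srun steps into it, so a single queued job keeps every core busy.

```shell
# Write a job script that packs many srun steps into one Slurm job.
# Names are hypothetical; adjust resources to your cluster.
cat > job_script.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=200M
for infile in input_*.dat; do
    # --ntasks=1 --exclusive runs each step on one free core; steps beyond
    # the 60 allocated tasks wait until a core frees up
    srun --ntasks=1 --exclusive ./my_command "$infile" &
done
wait   # keep the job alive until every step has finished
EOF
# submit it as a single job:  sbatch job_script.sh
```

Since this counts as one job against the 10-job limit, you could even split the 400,000 inputs over several such scripts and still stay well under the queue limits.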

Related

Scheduling a job multiple times

I wrote a background job that accepts some parameters, and I scheduled it to run periodically. I now want to schedule it with a different parameter set, but the parse.com console says I have no background jobs.
I've worked around the problem by adding the same job multiple times with slightly different names, but this solution is far from ideal. There should be a way to schedule a job with multiple parameter sets and different schedules.
Is there a way to schedule the same job multiple times?

How to allocate specific number of mappers to multiple job in Hadoop?

I am executing multiple Pig scripts, say script1, script2, script3, and script4. script1 executes independently, and script2, script3, and script4 are supposed to execute in parallel after it finishes.
I am giving an input file of size 7-8 GB. After script1 executes, I observe that instead of scripts 2, 3, and 4 executing in parallel, only script2 executes, as it consumes 33-35 mappers. The others remain queued (script3 and script4 get no mapper allocation). Because of this, executing all the scripts takes too much time.
So I am thinking that if I could set a mapper limit for each script, the total execution time might be lower, as every script would get an allocation of mappers.
So is there any way to allocate a specific number of mappers to multiple scripts?
If your map number is correctly set (according to your cores/node and disks/node values), then having 1 job consuming all your maps or having N jobs each consuming MapNumber / N maps will give the same result. But if you really want to distribute your maps over a number of jobs, you can set the per-job map number (mapreduce.job.maps in mapred-site.xml, I think).
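For reference, such a mapred-site.xml fragment might look like the following (the property name comes from the answer above and is worth verifying against your Hadoop version; Hadoop treats it as a hint and may override it based on input splits):

```xml
<!-- mapred-site.xml fragment: suggested default number of maps per job.
     This is a hint only; the actual count depends on input splits. -->
<property>
  <name>mapreduce.job.maps</name>
  <value>8</value>
</property>
```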
Assuming you still have free map slots, there are some configuration options that enable parallel job execution, as discussed here: Running jobs parallely in hadoop
You can also set a map number for each job (even if I am not sure it really works) by providing a job.xml, in which you set your map number, to your hadoop command.
You can add the following line at the beginning of your script:
set mapred.map.tasks 8
and this will let all of your scripts run concurrently.
Please note that if your machine is saturated, this will not change how long all the scripts take in total.

How to check the overall progress of PIG job

A pig script can be translated into multiple MR jobs and I am wondering if there is an interface or a way to see the progress of the overall PIG script like how many jobs are scheduled, executed and so on.
We had the same problem at Twitter, as some of our Pig scripts spin up dozens of Map-Reduce jobs and it's sometimes hard to tell which of them is doing what, reason about efficiency of the plan, understand how many will run in parallel, etc.
So we created Twitter Ambrose: https://github.com/twitter/ambrose
It spins up a little Jetty server that gives you a nice web UI showing the job DAG, coloring the nodes as the jobs complete, giving you stats about the jobs, and telling you which relations each job is trying to calculate.
There is an illustrate command, but it throws an exception on my deployment, so I use another approach.
You can get the number of scheduled MR jobs by using the explain command and looking at the Physical Plan section at the end of the explain report. To get the number of MR jobs for the script, I do the following:
./pig -e 'explain -script ./script_name.pig' > ./explain.txt
grep MapReduce ./explain.txt | wc -l
Now we have the number of planned MR jobs. To monitor script execution, before you run it, open Hadoop's JobTracker page (at "http://(IP_or_node_name):50030/jobtracker.jsp") and write down the name of the last job in the Completed Jobs section. Submit the script. Refresh the JobTracker page and count how many jobs are running and how many have completed after the one you noted. That gives you an idea of how many jobs are left to execute.
Click on each job and see its statistics and progress.
A much simpler approach is to run the script on a small dataset and note down the number of jobs; it is displayed in the console output after the script finishes. As Pig does not change its execution plan, it will be the same for the big dataset. By looking at the stats of each job on Hadoop's JobTracker page (at "http://(IP_or_node_name):50030/jobtracker.jsp"), you can get an idea of the proportion of time each MR job takes. Then you can use that to roughly extrapolate the execution time on the large dataset. If you have skewed data or Cartesian products, predicting the execution time may become tricky.

hadoop FIFO scheduling does not make the submitted jobs run in parallel?

I have configured the map capacity to 4000 maps and configured each job with 500 maps, based on my understanding of FIFO mode and the link below.
Running jobs parallely in hadoop
If I submit 8 jobs, these 8 jobs should run in parallel, right? However, I still see the 8 jobs I submitted running sequentially, which strikes me as strange.
Another option is to try the fair scheduler, but it has some other bugs for me...
How can I make these jobs run in parallel?
I am the only user now.
Question: what does the job tracker web UI show for total running jobs?
Actually I have submitted about 80 jobs, and all of them were submitted successfully, since I can see all 80
under the "Running Jobs" section; but they just run sequentially.
Question: how many input files are you currently processing? How does this relate to the number of mappers for the job?
For each job I configure 500 maps through the mapred-site.xml setting map.task.num=500.
below is the information
Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      1.40%        500         402       91        7          0        0 / 0
reduce   0.00%        1           1         0         0          0        0 / 0
Question: You can configure your InputFormat to only run 500 maps, but there are occasions where Hadoop ignores this value: if you have more than 500 input files, for example.
I am sure this will not happen, since I customized the InputFormat so that the number of mappers to run is exactly the number I configure in mapred-site.xml.
Question: When you start your job, how many files are you running over, which InputFormat are you using, and what file compression, if any, are you using on the input files?
OK, I actually run over only one file, but this file needs to be fully loaded by every map task, so I use the DistributedCache mechanism to let each map task load it in full. I do not use compression currently.
Question: What does the job tracker show for the total number of configured mapper and reducer slots? Does this match up with your expected value of 4000?
Below is the information
Maps   Reduces   TotalSubmissions   Nodes   Map Task Capacity   Reduce Task Capacity   Avg. Tasks/Node   Blacklisted Nodes
83     0         80                 8       4000                80                     510.00            0
Whether you run the FairScheduler or the CapacityScheduler, you should still be able to run jobs in parallel, but there are some reasons why you may see your jobs run sequentially:
Are you the only person using the cluster? If not, how many other people are using it?
Question: what does the job tracker web UI show for total running jobs?
If you are indeed the only job(s) running on the cluster at a particular point in time, then check the Job Tracker web UI for your currently running job: how many input files are you currently processing? How does this relate to the number of mappers for the job?
You can configure your InputFormat to only run 500 maps, but there are occasions where Hadoop ignores this value: if you have more than 500 input files, for example.
Question: When you start your job, how many files are you running over, which InputFormat are you using, and what file compression, if any, are you using on the input files?
Question: What does the job tracker show for the total number of configured mapper and reducer slots? Does this match up with your expected value of 4000?

how to automatically run a bash script when my qsub jobs are finished on a server?

I would like to run a script when all of the jobs that I have sent to a server are done.
for example, I send
ssh server 'for i in config*; do qsub ./run 1 $i; done'
And I get back a list of the jobs that were started. I would like to automatically start another script on the server to process the output from these jobs once all are completed.
I would appreciate any advice that would help me avoid the following inelegant solution:
If I save each of the 1000 job ids from the above call in a separate file, I could check the contents of each file against the current list of running jobs, i.e. the output of a call to:
ssh server qstat
I would only need to check every half hour, but I imagine there is a better way.
It depends a bit on which job scheduler you are using and which version, but there's another approach that can be taken if your results-processing can also be done on the same queue as the jobs.
One very handy way of managing lots of related jobs in more recent versions of Torque (and with Grid Engine, and others) is to launch the individual jobs as a job array (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#-t). This requires mapping the individual runs to numbers somehow, which may or may not be convenient; but if you can do it for your jobs, it greatly simplifies managing them: you can qsub them all in one line, and you can qdel or qhold them all at once (while still being able to deal with jobs individually).
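As a sketch of the job-array idea (assuming the config files can be numbered config1 … config1000, and using Torque's -t range syntax; script names are hypothetical), the loop of qsub calls from the question collapses into a single submission:

```shell
# Hypothetical Torque job-array script: PBS_ARRAYID selects the input,
# so one qsub launches all the runs at once.
cat > run_array.sh <<'EOF'
#!/bin/bash
#PBS -t 1-1000
./run 1 "config${PBS_ARRAYID}"
EOF
# submit the whole array in one line:  qsub run_array.sh
```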
If you do this, then you can submit an analysis job with a dependency on the job array, which will only run once all of the jobs in the array are complete (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencyExamples). Submitting the job would look like:
qsub analyze.sh -W depend=afterokarray:427[]
where analyze.sh contains the script that does the analysis, and 427 is the job id of the array you launched. (The [] means run only after all tasks in the array are completed.) The syntax differs for other schedulers (e.g., SGE/OGE), but the idea is the same.
Getting this right can take some doing, and Tristan's approach certainly has the advantage of being simple and working with any scheduler; but if you'll be doing a lot of this, learning to use job arrays may be worth your time.
Something you might consider is having each job script touch a file in a dedicated folder, named something like $i.jobdone; in your master script, you can then simply use ls *.jobdone | wc -l to test for the right number of jobs done.
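A minimal sketch of that sentinel-file idea (names are hypothetical; here the finished jobs are simulated by touching the flag files up front rather than by real qsub jobs):

```shell
# Each job script would end with:  touch done_flags/$i.jobdone
# The master script then polls until the expected count is reached.
NJOBS=3
mkdir -p done_flags
for i in 1 2 3; do touch "done_flags/$i.jobdone"; done   # simulate finished jobs
while [ "$(ls done_flags/*.jobdone 2>/dev/null | wc -l)" -lt "$NJOBS" ]; do
    sleep 30   # check twice a minute
done
echo "all jobs done, starting post-processing"
```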
You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time it took, count of jobs done at the time, whatever) if you cycle around waiting for specific ids.
I'd write a small C program to do the waiting and collecting (if you have permissions to upload and run executables), but you can easily use the bash wait built-in for roughly the same purpose, albeit with less flexibility.
Edit: small example.
#!/bin/bash
...
waitfor=''
for task in ./task_*.sh; do    # launch each task in the background
    "$task" &
    waitfor="$waitfor $!"      # remember its PID
done
wait $waitfor                  # block until every recorded PID has exited
...
If you run this script in the background, it won't bother you, and whatever comes after the wait line will run when your jobs are over.
