Efficiency in limiting the number of parallel jobs in Slurm

My question is based on THIS question.
I should consider using --array=0-60000%200 to limit the number of jobs running in parallel to 200 in Slurm. It seems to me that it takes up to a minute to launch a new job every time an old job finishes. Given the number of jobs that I am planning to run, I might be wasting a lot of time this way.
I wrote a (most probably very inefficient) alternative: a script that launches the jobs one by one, checking the number of jobs in the queue and adding a job only if I am still below the maximum number of allowed jobs; while the maximum number of parallel jobs is reached, it sleeps for 5 seconds, as follows:
#!/bin/bash
# iterate the procedure $1 times ($1 = 60000)
for ((i = 0; i <= $1; i++))
do
    # wait until the number of queued/running jobs drops below the limit
    q=$(squeue -u myuserName | wc -l)   # I don't care about the +/-1 lines (e.g. the header)
    while [ "$q" -gt 200 ]              # max number of parallel jobs set to 200
    do
        sleep 5
        q=$(squeue -u myuserName | wc -l)
    done
    # submit the job with sbatch
    sbatch ...
done
It seems to do a better job than my previous method; nevertheless,
I would like to know how inefficient this implementation really is, and why.
Could I be harming the scheduling efficiency of other users on the same cluster?
Thank you.

SLURM needs some time to process the job list and decide which job should be the next to run, especially if the backfill scheduler is in place and there are lots of jobs in the queue. You are not losing a minute to schedule a job because you are using a job array; it is SLURM that needs a minute to decide, and it will need the same minute for any other job of any other user, with or without job arrays.
With your approach your jobs are also losing priority: every time one of your jobs finishes, you launch a new one, and that new job will be the last in the queue. Also, SLURM will have to manage some hundreds of independent jobs instead of only one that accounts for the 60000 that you need.
If you are alone on the cluster, maybe there's no big difference between the two approaches, but if your cluster is full, your manual approach will put a slightly higher load on SLURM and your jobs will finish quite a lot later compared to the job array approach (simply because with the job array, once the array gets to be first in line, all 60000 are first in line, compared to being last in line every time one of your jobs finishes).

Related

slurm limit number of spawned processes

I am a newbie trying to install/administer Slurm. I want to limit the amount of forking a Slurm job can do. I used the stress command to see the CPU utilization under Slurm.
When I run this batch script
#SBATCH -p Test -c 1
stress -c 1
The job runs fine with one core used at 100 percent. But this script
#SBATCH -p Test -c 1
stress -c 20
also runs, but the top command shows 20 forked PIDs with a CPU utilization of 5 percent each. This makes sense, as the total utilization is one CPU core at 100 percent. It also makes the load averages go crazy, which, as I learned by googling, are not a correct view of system load. I have two questions:
Is it possible in Slurm to limit such behavior from the admin config, e.g. by killing the second kind of run? My various attempts have so far yielded nothing. Slurm is configured with cgroups and kills over-memory jobs fine. No MPI is used or configured.
Does this behavior cause inefficiency because of process waiting times?
I tried setting these drastic parameters to check whether anything would happen:
MaxStepCount=1
MaxTasksPerNode=2
But surprisingly nothing happens, and I can still submit many more jobs after this.
Slurm's job is to allocate computational resources to user jobs. The lowest manageable unit of computation is referred to in the documentation as the CPU. This refers to processing threads / execution cores, not physical cores. Slurm does not oversee how those resources are managed by the job. So no, nothing in Slurm can kill a job with too many userland threads.
Running that many threads would probably affect efficiency, yes. All those threads will cause increased context switching unless the job has enough CPU threads to handle them.
MaxStepCount and MaxTasksPerNode are for jobs. "Tasks" in this context are not userland threads but separate processes launched by a job step.
I hope that helps.

How to resume failed execution in MapReduce?

I have a requirement saying that:
a. Let's say I have 100 GB of file/data.
b. I have written a MapReduce job to process this data with certain logic.
c. I fired off the MapReduce job, but it failed after reading 50 GB.
So my question is:
Can I resume the MapReduce job from the 51st GB?
Please let me know if anybody has an idea of how to do it; I don't want to reprocess the data that I processed before the point of failure.
Thanks in advance
Brief answer: no.
And that's why working with large batch-processing systems such as Hadoop or MPI is hard. Restarts of large jobs are not only inefficient from a resource-consumption point of view, they are also psychologically depressing. That's why your primary goal should be to reduce the running time of a single job to no more than a couple of hours. Maybe some day it will be possible to "pause" jobs and "hot fix" code, but currently that is not supported, to my knowledge.
Solution #1. Split your job into an error-prone, parallelizable job and a final, error-free, non-parallelizable job. Consider the following example: you have hundreds of gigabytes of textual access logs from a web server and you want to write a job that reports how popular different browsers are. If you combine parsing and aggregating (summing) into a single huge job, its running time will be on the order of days, and the chances that it fails are very high, because textual logs are usually hard to parse due to ambiguity. A much better idea is to split this into two separate jobs:
The first job is solely responsible for parsing the log files. It prints only the browser string as its output and doesn't even need any reducers. This job is where 99% of all errors happen, because this is where the "wild" data gets parsed. It is parallelizable in the sense that you can split your input into chunks and process each chunk separately, so that each chunk is processed in 10-30 minutes. If the job fails for some chunk, you fix it and restart; 30 minutes is not a big loss. (A minimal sketch of such a parse-only mapper is shown after the diagram below.)
The second job is the grand job that takes the outputs of the first-job instances and performs the aggregation. Because the aggregation code is very simple, this job is not likely to fail.
              chunk(20G)->parse-job(20G)->browsers(0.5G)
              chunk(20G)->parse-job(20G)->browsers(0.5G)
input(1T) ->  chunk(20G)->parse-job(20G)->browsers(0.5G)  ->  aggregate-job->output
              ...         ...             ...
              chunk(20G)->parse-job(20G)->browsers(0.5G)
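As a rough illustration (not taken from the original answer), the first, parse-only job could be a map-only Hadoop Streaming step with a Python mapper like the sketch below; the browser-detection logic is a hypothetical placeholder for whatever parsing your logs actually need:
#!/usr/bin/env python
# Hypothetical parse-only mapper: reads raw access-log lines from stdin
# and emits only the browser name, one per line (map-only job, no reducers).
import sys

def extract_browser(line):
    # Placeholder parsing; real logs would need a proper User-Agent parser.
    if "Firefox" in line:
        return "Firefox"
    if "Chrome" in line:
        return "Chrome"
    if "MSIE" in line or "Trident" in line:
        return "IE"
    return "Other"

for raw in sys.stdin:
    raw = raw.strip()
    if not raw:
        continue              # skip blank lines instead of failing on them
    print(extract_browser(raw))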
Solution #2. Sometimes you may be satisfied with the result even if parts of the input data are dropped. In this case you can set the options mapred.max.map.failures.percent and/or mapred.max.reduce.failures.percent to non-zero values.
If your entire job fails, the output gets cleared, so you lose whatever you have processed. However, Hadoop retries failed tasks of a job, so as long as your failure is recoverable within the preconfigured number of attempts, the job will not fail and you are not going to lose the output from already completed tasks.
If your failure is not recoverable, then in most cases it is your fault, and you might need to do one or more of the following:
Fix your code; even a simple bug may cause all your tasks to fail consistently.
Use fewer resources (e.g., pay attention to available memory).
Partition the problem better (check whether some tasks are fed more data than others, or make sure task input is being split into smaller chunks).
Increase the cluster capacity.

What is a good way to design and build a task scheduling system with lots of recurring tasks?

Imagine you're building something like a monitoring service, which has thousands of tasks that need to be executed at given intervals, independently of each other. These could be individual servers that need to be checked, or backups that need to be verified, or just about anything that can be scheduled to run at a given interval.
You can't just schedule the tasks via cron though, because when a task is run it needs to determine when it's supposed to run the next time. For example:
schedule server uptime check every 1 minute
first time it's checked the server is down, schedule next check in 5 seconds
5 seconds later the server is available again, check again in 5 seconds
5 seconds later the server is still available, so resume checking at the 1-minute interval
A naive solution that came to mind is to simply have a worker that runs every second or so, checks all the pending jobs, and executes the ones that are due. But how would this work if the number of jobs is something like 100 000? It might take longer to check them all than the worker's tick interval, and the more tasks there are, the worse this gets.
Is there a better way to design a system like this? Are there any hidden challenges in implementing this, or any algorithms that deal with this sort of a problem?
Use a priority queue (with the priority based on the next execution time) to hold the tasks to execute. When you're done executing a task, you sleep until the time of the task at the front of the queue. When a task comes due, you remove and execute it, then (if it's recurring) compute the next time it needs to run and insert it back into the priority queue based on its next run time.
This way you have one sleep active at any given time. Insertions and removals have logarithmic complexity, so it remains efficient even if you have millions of tasks (e.g., inserting into a priority queue that has a million tasks should take about 20 comparisons in the worst case).
There is one point that can be a little tricky: if the execution thread is waiting until a particular time to execute the item at the head of the queue, and you insert a new item that goes at the head of the queue, ahead of the item that was previously there, you need to wake up the thread so it can re-adjust its sleep time for the item that's now at the head of the queue.
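A minimal sketch of this design in Python, assuming a hypothetical Task object whose run() method returns its next due time (or None for a one-off task); a real system would likely hand the work to a pool of worker threads instead of running it on the scheduler thread:
import heapq
import itertools
import threading
import time

class Scheduler:
    def __init__(self):
        self._heap = []                       # entries: (next_run_time, tie_breaker, task)
        self._counter = itertools.count()     # tie-breaker so equal times never compare tasks
        self._cond = threading.Condition()

    def add(self, next_run_time, task):
        with self._cond:
            heapq.heappush(self._heap, (next_run_time, next(self._counter), task))
            self._cond.notify()               # wake the worker in case this became the new head

    def run_forever(self):
        while True:
            with self._cond:
                while not self._heap:
                    self._cond.wait()          # nothing scheduled yet
                due, _, task = self._heap[0]
                delay = due - time.time()
                if delay > 0:
                    self._cond.wait(delay)     # sleep, but wake early if an earlier task is added
                    continue                   # re-check the head of the queue after waking
                heapq.heappop(self._heap)
            next_time = task.run()             # hypothetical Task.run(): returns next due time or None
            if next_time is not None:
                self.add(next_time, task)      # recurring task: push it back with its new time
The condition variable handles the tricky point from above: adding a task that becomes the new head interrupts the sleep so the worker can re-adjust its wait time.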
We encountered this same issue while designing Revalee, an open source project for scheduling triggered callbacks. We ended up writing our own priority queue class (we called ours a ScheduledDictionary) to handle the use case you outlined in your question. As a free, open source project, the complete source code (C#, in this case) is available on GitHub. I'd recommend that you check it out.

How to estimate the number of instances in Amazon EMR?

I have a MapReduce job to be run on Amazon EMR. I would like to have up to 400 mappers and reducers, and I would like to use either Medium or Large instances. How can I estimate the number of instances I need?
Besides, if one job ends within, let's say, 2 minutes, and I run another job which takes 4 minutes, will I be charged for 2 hours, or is that considered 1 hour?
I know that if you use the CLI tool to create your Job Flow and add the steps, then you can run both steps one after another on the same job flow and they will be counted within the same hour.
I believe that if you use the GUI you cannot reuse the job flow, so you may get charged one hour for each job. I haven't tried this though, so I may be wrong there.
Check this article which is where I got the information:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce

Actual processing time of a Hadoop job

My cluster is currently occupied by a job A that takes a long time and has VERY_LOW priority.
I started another job B yesterday while A was already running, and I think it should have run quite fast.
However, I saw it took 47 minutes at the job details.
I don't think this is the actual processing time.
I'm trying to find out when the job really started.
Where can I look?
I can't seem to find anywhere that states exactly what you're after, but you could look at the job in the job tracker on port 50030 and inspect the individual mapper and reducer details. There you can see, from their start and end times, how long each individual mapper and reducer took to complete its tasks.
If there weren't any mapper or reducer slots free when you started the second job, the second job wouldn't have been able to make any progress until the first job released them, which might explain why it claimed to take so long: the two jobs might not actually have been running simultaneously. The gap between the job being started and the first actual mapper starting should give you an indication of whether it was just waiting around for resources, which means you can subtract the period between the job's and the first mapper's start times from the overall 47 minutes.
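For example, once you have read those timestamps off the job tracker pages, splitting the 47 minutes into queue wait and actual processing is simple arithmetic (the timestamps below are made up):
from datetime import datetime

# Hypothetical timestamps taken from the job tracker pages.
job_submitted   = datetime(2013, 5, 14, 10, 0)
first_map_start = datetime(2013, 5, 14, 10, 39)
job_finished    = datetime(2013, 5, 14, 10, 47)

queue_wait  = first_map_start - job_submitted    # time spent waiting behind job A
actual_work = job_finished - first_map_start     # rough actual processing time
print(queue_wait, actual_work)                   # 0:39:00 0:08:00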
