I am submitting a job (via qsub) that is not in any way the worse off for being killed part way through, and the longer it runs the better. Results are output as it goes along.
It will be submitted to a large cluster that is well managed. It is safe to assume that whoever has control over the cluster has set it up in a reasonable and sensible way.
From my point of view it is more useful to tell it to loop more times than will fit inside the walltime and let it be killed than to tell it to loop fewer times and have it finish before the walltime. If it finishes before the walltime, it did not do as many loops as it possibly could have.
Is there any problem or annoyance caused by this approach? It's working well but I'm worried I could be upsetting someone.
So I put this question to the administrators of my cluster. The answer I got was:
Thanks for your query. I can see no reason that this would be an issue - you are simply ensuring you get as much run time from your job as possible in the requested time.
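For anyone doing the same thing, the pattern looks roughly like the sketch below. It assumes a PBS-style scheduler (the directive syntax differs on SGE) and a hypothetical ./simulate program; the key point is that results are appended to disk every iteration, so nothing is lost when the scheduler kills the job at walltime.

#!/bin/bash
#PBS -l walltime=24:00:00    # illustrative walltime request
#PBS -N endless_loop

cd "$PBS_O_WORKDIR"

# Loop until the scheduler terminates the job at walltime.
# Output is appended after every iteration, so a kill loses nothing.
i=0
while true; do
    ./simulate --iteration "$i" >> results.txt    # placeholder program and flag
    i=$((i + 1))
done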
We have a batch process that executes every day. This week, a job that usually does not take more than 18 minutes of execution time (real time, as you can see) is now taking more than 45 minutes to finish.
The FULLSTIMER option is already active, but we don't know why only the real time has increased.
In old documentation there are FULLSTIMER stats that could help identify the problem, but they do not appear in the batch log. (The stats are the ones below: Page Faults, Context Switches, Block Operations and so on, as you can see.)
It might be an I/O issue. Does anyone know how we can identify if it is really an I/O problem or if it could be some other issue (network, for example)?
To be more specific, this is one of the queries whose run time has increased dramatically. As you can see, it reads from a database (SQL Server, VAULT schema) and from WORK, and writes to the WORK directory.
The number of observations is almost the same:
We asked the customer about any change in network traffic, and they said it is still the same.
Thanks in advance.
For a process to complete, much more needs to be done than the actual calculations on the CPU.
Your data has to be read and your results have to be written.
You might have to wait for other processes to finish first, and if your process includes multiple steps, writing to and reading from disk each time, you will have to wait for the CPU each time too.
In our situation, if real time is much larger than CPU time, we usually see a lot of traffic to our Network File System (NFS).
As a programmer, you might notice that storing intermediate results in WORK is more efficient than storing them in remote libraries.
You might save a lot of time by creating intermediate results as views instead of tables, if you only use them once. That is not only possible in SQL, but also in DATA steps like this:
data MY_RESULT / view=MY_RESULT;
set MY_DATA;
where transaction_date between '1jan2022'd and '30jun2022'd;
run;
I've been running some tests on an HPC. I have a code and if it's executed in serial, the run times are completely consistent. This wasn't always the case, but I included commands in my batch files so that it reserves an entire node and all its memory. Doing this allowed for almost perfectly consistent code execution times.
However, now that I am doing small scale parallel tests, the code execution times seem random. I would expect there to be some variation now that parallelization has been introduced, but the scale of randomness seems quite bizarre.
No other jobs are performed on the node so it should be fine - when in serial it is very consistent, so it must be something to do with the parallelization.
Does anyone know what could cause this? I've included a graph showing the execution times - there is a pretty clear average, but also major outliers. All results produced are identical and correct.
I'm under an NDA so cannot include much info about my code. Please feel free to ask questions and I'll see if I can help. Apologies if I'm not allowed to answer!
I'm using Fortran 90 as the main code language, and the HPC uses Slurm. NTASKS = 8 for these tests, however the randomness is there if NTASKS > 1. Number of tasks and randomness don't seem particularly linked, except if it is in parallel then the randomness occurs. I'm using Intel's autoparallelization feature, rather than OpenMP/MPI.
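For context, the kind of Slurm header that reserves an entire node and all its memory looks roughly like this (a sketch only, not my actual batch file; the executable name is a placeholder):

#!/bin/bash
#SBATCH --ntasks=8       # NTASKS used in these tests
#SBATCH --exclusive      # do not share the node with other jobs
#SBATCH --mem=0          # request all of the node's memory

./my_program             # placeholder for the auto-parallelized executable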
Thanks in advance
SOLVED!!!! Thanks for your help everyone!
I did small scale tests 100 times to get to the root of the problem.
As the execution times were rather small, I noticed that the larger outliers (longer run times) often occurred when a lot of new jobs from other users were submitted to the HPC. This made a lot of sense and wasn't particularly surprising.
The main reason these results really confused me was the smaller outliers (much quicker run times). It made sense that the code sometimes took longer to run when the cluster was busy, but I just couldn't figure out how it sometimes ran much quicker while still giving the same results!
Probably a bit of a rookie error, but it turns out not all nodes are equal on our HPC! About 80% of the nodes are identical, giving roughly the same run times (or longer if busy). BUT, the newest 20% (i.e. the highest node numbers with Node0XX) must be higher performance.
I checked the 'sacct' Slurm data and job run times, and can confirm all faster execution time outliers occurred on these newer nodes.
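For anyone wanting to do the same check, a query along these lines shows which node(s) a job ran on and how long it took (the job ID is a placeholder):

sacct -j 1234567 --format=JobID,NodeList,Elapsed,State

Once you know which nodes behave differently, sbatch's --nodelist or --exclude options can be used to keep timing runs on a consistent set of nodes.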
Very odd situation and something I hadn't been made aware of. Hopefully this might be able to help someone out if they're in a similar situation. I spent so long checking source codes/batchfiles/code timings that I hadn't even considered the HPC hardware itself. Something to keep in mind.
I did much longer tests (about an hour each) and the longer execution time outliers didn't really exist (because the queuing penalty was now small in comparison to the total execution time). But the much quicker execution time outliers still occurred. Again, I checked the accounting data and these outliers always occurred on the high node numbers.
Hopefully this can help at least one person with a similar headache.
Cheers!
Let's say I've got 2 or more executors in a Spark Streaming application.
I've set the batch time to 10 seconds, so a job is started every 10 seconds, reading input from my HDFS.
If every job lasts for more than 10 seconds, the new job that is started is assigned to a free executor, right?
Even if the previous one didn't finish?
I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.
If you know of any links where all of this is explained, I would really appreciate seeing them.
Thank you.
Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.
This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is by default set to 1. It's not currently documented (maybe I should add it).
The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if the batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
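If you do want to experiment with it, the property can be passed at submit time like any other Spark configuration (the application class and jar name below are placeholders):

spark-submit \
  --conf spark.streaming.concurrentJobs=2 \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar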
There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.
From the point of view of the system administration of an SGE node, is it possible to force users to run long-running programs through qsub instead of running it stand-alone?
The problem is that the same machine is acting as the control node and the computation node. So, I can't distinguish a long-running program from a user who is compiling with "gcc". Ideally, I would like to force users to submit long-running jobs (i.e., more than an hour) through qsub. I don't even mind being a bit mean and killing jobs that have run longer than an hour but weren't submitted through qsub.
Until now, all that I can do is send e-mails out asking users to "Please use qsub!"...
I've looked through the SGE configuration and nothing seems relevant. But maybe I've just missed something...any help would be appreciated! Thanks!
I'm a little confused about your setup, but I'm assuming users are submitting jobs by logging into what is also a computation node. Here are some ideas, best to worst:
Obviously, the best thing is to have a separate control node for users.
Barring that, run a resource-limited VM as the control node.
Configure user-level resource limits (e.g. ulimit) on the nodes. You can restrict CPU, memory, and process usage, which are probably what you care about rather than clock time.
It sounds like the last one may be best for you. It's not hard, either.
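As a sketch of that last option: a hard per-process CPU-time limit can be set in a system-wide profile script. The path, the one-hour value, and whether shells spawned by the scheduler also source it are all assumptions to verify on your setup.

# e.g. in /etc/profile.d/cpu_limit.sh (path is an assumption)
# Hard-limit each process to 3600 CPU-seconds; non-root users cannot raise a hard limit.
ulimit -H -t 3600

This limits CPU time per process rather than wall-clock time, so quick interactive work like compiling is unaffected while long-running computations started outside qsub eventually get killed.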
Following reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.
The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?
In order to do that you should basically look at 4 things: CPU, RAM, disk and network. If your setup is crossing the threshold of these metrics you can deduce that you are pushing the limits.
For example, if you have set "mapred.reduce.parallel.copies" to a value much higher than the number of cores available, you'll end up with too many threads in the waiting state, since threads are created to fetch the map output based on this property. In addition, the network might get overwhelmed. Or, if there is too much intermediate output to be shuffled, your job will become slow, as you will need a disk-based shuffle in that case, which is slower than a RAM-based shuffle. Choose a wise value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally good).
So, these are the kinds of things that will tell you whether you are over-parallelizing or not. There are a lot of other things you should consider as well. I would recommend you go through Chapter 6 of "Hadoop: The Definitive Guide".
Some of the measures you could take to make your jobs more efficient include using a combiner to limit the data transfer, enabling intermediate compression, etc.
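As a sketch, if your job's driver goes through ToolRunner/GenericOptionsParser, these properties can be passed per job at submit time (the jar, driver class, paths and values are placeholders, not recommendations):

hadoop jar my-job.jar com.example.MyDriver \
  -D mapred.reduce.parallel.copies=10 \
  -D mapred.job.shuffle.input.buffer.percent=0.70 \
  /input /output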
HTH
P.S.: The answer is not specific to just "mapred.reduce.parallel.copies". It tells you about tuning your job in general. Strictly speaking, setting only this property is not going to help you much. You should consider other important properties as well.
Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider the most important, usually overall job time. To figure out what parameters are working I would suggest using the following profiling tools that Hadoop comes with, MrBench, TestDFSIO, and NNBench. These are found in the hadoop-mapreduce-client-jobclient-*.jar.
By running this command you will see a long list of benchmark programs that you can use besides the ones I mentioned above.
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
I would suggest first running with the default parameters to establish baseline benchmarks, then changing one parameter at a time and rerunning. It is a bit time-consuming but worth it, especially if you use a script to change the parameters and run the benchmarks.
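For example, a baseline I/O benchmark with TestDFSIO might look like this (file count and size are illustrative; adjust them to your cluster):

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1GB
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1GB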