Random/Inconsistent Code Run Times - Parallel HPC

Random/Inconsistent Code Run Times - Parallel HPC - debugging

I've been running some tests on an HPC. I have a code and if it's executed in serial, the run times are completely consistent. This wasn't always the case, but I included commands in my batch files so that it reserves an entire node and all its memory. Doing this allowed for almost perfectly consistent code execution times.
However, now that I am doing small scale parallel tests, the code execution times seem random. I would expect there to be some variation now that parallelization has been introduced, but the scale of randomness seems quite bizarre.
No other jobs are performed on the node so it should be fine - when in serial it is very consistent, so it must be something to do with the parallelization.
Does anyone know what could cause this? I've included a graph showing the execution times - there is a pretty clear average, but also major outliers. All results produced are identical and correct.
I'm under an NDA so cannot include much info about my code. Please feel free to ask questions and I'll see if I can help. Apologies if I'm not allowed to answer!
I'm using Fortran 90 as the main code language, and the HPC uses Slurm. NTASKS = 8 for these tests, however the randomness is there if NTASKS > 1. Number of tasks and randomness don't seem particularly linked, except if it is in parallel then the randomness occurs. I'm using Intel's autoparallelization feature, rather than OpenMP/MPI.
Thanks in advance

SOLVED!!!! Thanks for your help everyone!
I did small scale tests 100 times to get to the root of the problem.
As the execution times were rather small, I noticed that the larger outliers (longer run times) often occurred when a lot of new jobs from other users were submitted to the HPC. This made a lot of sense and wasn't particularly surprising.
The main reason these results really confused me was because of the smaller outliers (much quicker run times). It made sense that sometimes it would take longer to run if it was busy, but I just couldn't figure out how sometimes it ran much quicker, but still giving the same results!
Probably a bit of a rookie error, but it turns out not all nodes are equal on our HPC! About 80% of the nodes are identical, giving roughly the same run times (or longer if busy). BUT, the newest 20% (i.e. the highest node numbers with Node0XX) must be higher performance.
I checked the 'sacct' Slurm data and job run times, and can confirm all faster execution time outliers occurred on these newer nodes.
Very odd situation and something I hadn't been made aware of. Hopefully this might be able to help someone out if they're in a similar situation. I spent so long checking source codes/batchfiles/code timings that I hadn't even considered the HPC hardware itself. Something to keep in mind.
I did much longer (about an hour) tests and the longer execution time outliers didn't really exist (because the small queuing penalty was now relatively small in comparison to total execution time). But, the much quicker execution time outliers still occurred. Again, I checked the account data and these outliers always occurred on the high node numbers.
Hopefully this can help at least one person with a similar headache.
Cheers!

Related

Why real time is much higher than "user" and "system" CPU TIME combined?

We have a batch process that executes every day. This week, a job that usually does not past 18 minutes of execution time (real time, as you can see), now is taking more than 45 minutes to finish.
Fullstimmer option is already active, but we don't know why only the real time was increased.
In old documentation there are Fullstimmer stats that could help identify the problem but they do not appear in batch log. (The stats are those down below: Page Faults, Context Switches, Block Operation and so on, as you can see)
It might be an I/O issue. Does anyone know how we can identify if it is really an I/O problem or if it could be some other issue (network, for example)?
To be more specific, this is one of the queries that have increased in time dramatically. As you can see, it is reading from a data base (SQL Server, VAULT schema) and work and writing in work directory.
Number of observations its almost the same:
We asked customer about any change in network traffic, and they said still the same.
Thanks in advance.

For a process to complete, much more needs to be done than the actual calculations on the CPU.
Your data has te be read and your results have to be written.
You might have to wait for other processes to finish first, and if your process includes multiple steps, writing to and reading from disk each time, you will have to wait for the CPU each time too.
In our situation, if real time is much larger than cpu time, we usually see much trafic to our Network File System (nfs).
As a programmer, you might notice that storing intermediate results in WORK is more efficient then on remote libraries.
You might safe much time by creating intermediate results as views instead of tables, IF you only use them once. That is not only possible in SQL, but also in data steps like this
data MY_RESULT / view=MY_RESULT;
set MY_DATA;
where transaction_date between '1jan2022'd and 30jun2022'd;
run;

Parallelization with highly unreliable workers?

Suppose I've got a pool of workers (computers), around 1000 of them, but they're highly unreliable. I expect each to go down multiple times a day, sometimes for extended periods of time. Fyi these are volunteer computers running BOINC (not my botnet, I swear!)
Are there any tools that exist to facilitate using them to do parallel computations, (mostly trivially parallelizable)? I'm thinking something like an IPython parallel where maybe when a node dies the calculation is restarted elsewhere, and maybe where when a new node joins its brought up to speed to the current working environment.

Hybrid : OpenMPI + OpenMP on a cluster

I solve numerically some Ordinary Differential Equations.
I have a very simple (conceptually), but very long computations. There is a very long array (~2M cells) and for each cell I need to perform numerical integration. This procedure should be repeated 1000 times. By using OpenMP parallelism and one 24-core machine, it takes around a week to do this (which is not acceptable).
I have a cluster of 20 such (24-core) machines and think about Hybrid implementation. I want to use MPI to pass over these 20 nodes and at each node use regular OpenMP parallelism.
Basically, I need to split my very long array to 20(nodes)X24(proccs) working units.
Are there any suggestion of better implementation or better ideas? I've read a lot on this subject and I've got impression, that sometimes such hybrid implementation does not necessarily bring a real speed up.
May be I should create a "pool of workers" and "feed" them with my array or something else.
Any suggestion and useful links are welcome!

If your computation is as embarrassingly parallel as you indicate you should expect good speedup by spreading the load across all 20 of your machines. By good I mean close to 20 and by close to 20 I mean any number which you actually get which leaves you thinking that the effort has been worthwhile.
Your proposed hybrid solution is certainly feasible and you should get good speedup if you implement it.
One alternative to a hybrid MPI+OpenMP program would be a job script (written in your favourite scripting language) which simply splits your large array into 20 pieces and starts 20 jobs, one on each machine running an instance of your program. When they've all finished have another script ready to recombine the results. This would avoid having to write any MPI code at all.
If your computer has an installation of Grid Engine you can probably write a job submission script to submit your work as an array job and let Grid Engine take care of parcelling the work out to the individual machines/tasks. I expect that other job management systems have similar facilities but I'm not familiar with them.
Another alternative would be an all-MPI code, that is drop the OpenMP altogether and modify your code to use whatever processors it finds available when you run it. Again, if your program requires little or no inter-process communication you should get good speedup.
Using MPI on a shared memory computer is sometimes a better (in performance terms) approach than OpenMP, sometimes worse. Trouble is, it's difficult to be certain about which approach is better for a particular program on a particular architecture with RAM and cache and interconnects and buses and all the other variables to consider.
One factor I've ignored, largely because you've provided no data to consider, is the load-balancing of your program. If you split your very large dataset into 20 equal-sized pieces do you end up with 20 equal-duration jobs ? If not, and if you have an idea how job time varies with inputs, you might do something more sophisticated in splitting the job up than simply chopping your into those 20 equal pieces. You might, for instance, chop it into 2000 equal pieces and serve them one at a time to the machinery for execution. In this case what you gain in load-balancing might be at risk of being lost to the time costs of job management. You pays yer money and you takes yer choice.
From your problem statement I wouldn't be making a decision about which solution to go for on the basis of expected performance, because I'd expect any of the approaches to get into the same ballpark performance-wise, but on the time to develop a working solution.

Optimum number of threads for a highly parallelizable problem

I parallelized a simulation engine in 12 threads to run it on a cluster of 12 nodes(each node running one thread). Since chances of availability of 12 systems is generally less, I also tweaked it for 6 threads(to run on 6 nodes), 4 threads(to run on 4 nodes), 3 threads(to run on 3 nodes), and 2 threads(to run on 2 nodes). I have noticed that more the number of nodes/threads, more is the speedup. But obviously, the more nodes I use, the more expensive(in terms of cost and power) the execution becomes.
I want to publish these results in a journal so I want to know if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?
Thanks,
Akshey

How have you parallelised your program and what is inside each of your nodes ?
For instance, on one of my clusters I have several hundred nodes each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As #Blank has already pointed out the most efficient way to run a program, if by efficiency one means 'minimising total cpu-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.

"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There's no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different number of nodes, knowing how much computational work has to be done, and how much communications has to be done, and how long each takes. (The communications times can be estimated by the amount of commuincations, and typical latency/bandwidth numbers for your nodes' type of interconnect). This can guide you as to good choices.
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run on for your code for some given problem size, there's really no substitute for running a scaling test - running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that; and you only got (say) a 20% decrease in run time over running at 16 processors - that is, for 2x the amount of computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run at fewer processors - particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, in this case you could get them done in 1.25 time units insetad of 2 time units by running the two simulations each on 16 processors simultaneously rather than running them one at a time on 32 processors).
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of any parallel programming tutorials; here's our example, https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance

Increasing the number of nodes leads to diminishing returns. Two nodes is not twice as fast as one node; four nodes even less so than two. As such, the optimal number of nodes is always one; it is with a single node that you get most work done per node.

How to create a system with 1500 servers that deliver results instantaneously?

I want to create a system that delivers user interface response within 100ms, but which requires minutes of computation. Fortunately, I can divide it up into very small pieces, so that I could distribute this to a lot of servers, let's say 1500 servers. The query would be delivered to one of them, which then redistributes to 10-100 other servers, which then redistribute etc., and after doing the math, results propagate back again and are returned by a single server. In other words, something similar to Google Search.
The problem is, what technology should I use? Cloud computing sounds obvious, but the 1500 servers need to be prepared for their task by having task-specific data available. Can this be done using any of the existing cloud computing platforms? Or should I create 1500 different cloud computing applications and upload them all?
Edit: Dedicated physical servers does not make sense, because the average load will be very, very small. Therefore, it also does not make sense, that we run the servers ourselves - it needs to be some kind of shared servers at an external provider.
Edit2: I basically want to buy 30 CPU minutes in total, and I'm willing to spend up to $3000 on it, equivalent to $144,000 per CPU-day. The only criteria is, that those 30 CPU minutes are spread across 1500 responsive servers.
Edit3: I expect the solution to be something like "Use Google Apps, create 1500 apps and deploy them" or "Contact XYZ and write an asp.net script which their service can deploy, and you pay them based on the amount of CPU time you use" or something like that.
Edit4: A low-end webservice provider, offering asp.net at $1/month would actually solve the problem (!) - I could create 1500 accounts, and the latency is ok (I checked), and everything would be ok - except that I need the 1500 accounts to be on different servers, and I don't know any provider that has enough servers that is able to distribute my accounts on different servers. I am fully aware that the latency will differ from server to server, and that some may be unreliable - but that can be solved in software by retrying on different servers.
Edit5: I just tried it and benchmarked a low-end webservice provider at $1/month. They can do the node calculations and deliver results to my laptop in 15ms, if preloaded. Preloading can be done by making a request shortly before the actual performance is needed. If a node does not respond within 15ms, that node's part of the task can be distributed to a number of other servers, of which one will most likely respond within 15ms. Unfortunately, they don't have 1500 servers, and that's why I'm asking here.

[in advance, apologies to the group for using part of the response space for meta-like matters]
From the OP, Lars D:
I do not consider [this] answer to be an answer to the question, because it does not bring me closer to a solution. I know what cloud computing is, and I know that the algorithm can be perfectly split into more than 300,000 servers if needed, although the extra costs wouldn't give much extra performance because of network latency.
Lars,
I sincerely apologize for reading and responding to your question at a naive and generic level. I hope you can see how both the lack of specifity in the question itself, particularly in its original form, and also the somewhat unusual nature of the problem (1) would prompt me respond to the question in like fashion. This, and the fact that such questions on SO typically emanate from hypotheticals by folks who have put but little thought and research into the process, are my excuses for believing that I, a non-practionner [of massively distributed systems], could help your quest. The many similar responses (some of which had the benefits of the extra insight you provided) and also the many remarks and additional questions addressed to you show that I was not alone with this mindset.
(1) Unsual problem: An [apparently] mostly computational process (no mention of distributed/replicated storage structures), very highly paralellizable (1,500 servers), into fifty-millisecondish-sized tasks which collectively provide a sub-second response (? for human consumption?). And yet, a process that would only be required a few times [daily..?].
Enough looking back!
In practical terms, you may consider some of the following to help improve this SO question (or move it to other/alternate questions), and hence foster the help from experts in the domain.
re-posting as a distinct (more specific) question. In fact, probably several questions: eg. on the [likely] poor latency and/or overhead of mapreduce processes, on the current prices (for specific TOS and volume details), on the rack-awareness of distributed processes at various vendors etc.
Change the title
Add details about the process you have at hand (see many questions in the notes of both the question and of many of the responses)
in some of the questions, add tags specific to a give vendor or technique (EC2, Azure...) as this my bring in the possibly not quite unbuyist but helpful all the same, commentary from agents at these companies
Show that you understand that your quest is somewhat of a tall order
Explicitly state that you wish responses from effective practionners of the underlying technologies (maybe also include folks that are "getting their feet wet" with these technologies as well, since with the exception of the physics/high-energy folks and such, who BTW traditionnaly worked with clusters rather than clouds, many of the technologies and practices are relatively new)
Also, I'll be pleased to take the hint from you (with the implicit non-veto from other folks on this page), to delete my response, if you find that doing so will help foster better responses.
-- original response--
Warning: Not all processes or mathematical calculations can readily be split in individual pieces that can then be run in parallel...
Maybe you can check Wikipedia's entry from Cloud Computing, understanding that cloud computing is however not the only architecture which allows parallel computing.
If your process/calculation can efficitively be chunked in parallelizable pieces, maybe you can look into Hadoop, or other implementations of MapReduce, for an general understanding about these parallel processes. Also, (and I believe utilizing the same or similar algorithms), there also exist commercially available frameworks such as EC2 from amazon.
Beware however that the above systems are not particularly well suited for very quick response time. They fare better with hour long (and then some) data/number crunching and similar jobs, rather than minute long calculations such as the one you wish to parallelize so it provides results in 1/10 second.
The above frameworks are generic, in a sense that they could run processes of most any nature (again, the ones that can at least in part be chunked), but there also exist various offerings for specific applications such as searching or DNA matching etc. The search applications in particular can have very short response times (cf Google for example) and BTW this is in part tied to fact that such jobs can very easily and quickly be chunked for parallel processing.

Sorry, but you are expecting too much.
The problem is that you are expecting to pay for processing power only. Yet your primary constraint is latency, and you expect that to come for free. That doesn't work out. You need to figure out what your latency budgets are.
The mere aggregating of data from multiple compute servers will take several milliseconds per level. There will be a gaussian distribution here, so with 1500 servers the slowest server will respond after 3σ. Since there's going to be a need for a hierarchy, the second level with 40 servers , where again you'll be waiting for the slowest server.
Internet roundtrips also add up quickly; that too should take 20 to 30 ms of your latency budget.
Another consideration is that these hypothethical servers will spend much of their time idle. That means they're powered on, drawing electricity yet not generating revenue. Any party with that many idle servers would turn them off, or at the very least in sleep mode just to conserve electricity.

MapReduce is not the solution! Map Reduce is used in Google, Yahoo and Microsoft for creating the indexes out of the huge data (the whole Web!) they have on their disk. This task is enormous and Map Reduce was built to make it happens in hours instead of years, but starting a Master controller of Map Reduce is already 2 seconds, so for your 100ms this is not an option.
Now, from Hadoop you may get advantages out of the distributed file system. It may allow you to distribute the tasks close to where the data is physically, but that's it. BTW: Setting up and managing an Hadoop Distributed File System means controlling your 1500 servers!
Frankly in your budget I don't see any "cloud" service that will allow you to rent 1500 servers. The only viable solution, is renting time on a Grid Computing solution like Sun and IBM are offering, but they want you to commit to hours of CPU from what I know.
BTW: On Amazon EC2 you have a new server up in a couple of minutes that you need to keep for an hour minimum!
Hope you'll find a solution!

I don't get why you would want to do that, only because "Our user interfaces generally aim to do all actions in less than 100ms, and that criteria should also apply to this".
First, 'aim to' != 'have to', its a guideline, why would u introduce these massive process just because of that. Consider 1500 ms x 100 = 150 secs = 2.5 mins. Reducing the 2.5 mins to a few seconds its a much more healthy goal. There is a place for 'we are processing your request' along with an animation.
So my answer to this is - post a modified version of the question with reasonable goals: a few secs, 30-50 servers. I don't have the answer for that one, but the question as posted here feels wrong. Could even be 6-8 multi-processor servers.

Google does it by having a gigantic farm of small Linux servers, networked together. They use a flavor of Linux that they have custom modified for their search algorithms. Costs are software development and cheap PC's.

It would seem that you are indeed expecting at least 1000-fold speedup from distributing your job to a number of computers. That may be ok. Your latency requirement seems tricky, though.
Have you considered the latencies inherent in distributing the job? Essentially the computers would have to be fairly close together in order to not run into speed of light issues. Also, the data center in which the machines would be would again have to be fairly close to your client so that you can get your request to them and back in less than 100 ms. On the same continent, at least.
Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of available computing time to latency or anything else that doesn't parallelize requires you to double the computing capacity of the parallel portions just to keep up.
I doubt a cloud computing system would be the best fit for a problem like this. My impression at least is that the proponents of cloud computing would prefer to not even tell you where your machines are. Certainly I haven't seen any latency terms in the SLAs that are available.

You have conflicting requirements. You're requirement for 100ms latency is directly at odds with your desire to only run your program sporadically.
One of the characteristics of the Google-search type approach you mentioned in your question is that the latency of the cluster is dependent on the slowest node. So you could have 1499 machines respond in under 100ms, but if one machine took longer, say 1s - whether due to a retry, or because it needed to page you application in, or bad connectivity - your whole cluster would take 1s to produce an answer. It's inescapable with this approach.
The only way to achieve the kinds of latencies you're seeking would be to have all of the machines in your cluster keep your program loaded in RAM - along with all the data it needs - all of the time. Having to load your program from disk, or even having to page it in from disk, is going to take well over 100ms. As soon as one of your servers has to hit the disk, it is game over for your 100ms latency requirement.
In a shared server environment, which is what we're talking about here given your cost constraints, it is a near certainty that at least one of your 1500 servers is going to need to hit the disk in order to activate your app.
So you are either going to have to pay enough to convince someone to keep you program active and in memory at all times, or you're going to have to loosen your latency requirements.

Two trains of thought:
a) if those restraints are really, absolutely, truly founded in common sense, and doable in the way you propose in the nth edit, it seems the presupplied data is not huge. So how about trading storage for precomputation to time. How big would the table(s) be? Terabytes are cheap!
b) This sounds a lot like a employer / customer request that is not well founded in common sense. (from my experience)
Let´s assume the 15 minutes of computation time on one core. I guess thats what you say.
For a reasonable amount of money, you can buy a system with 16 proper, 32 hyperthreading cores and 48 GB RAM.
This should bring us in the 30 second range.
Add a dozen Terabytes of storage, and some precomputation.
Maybe a 10x increase is reachable there.
3 secs.
Are 3 secs too slow? If yes, why?

Sounds like you need to utilise an algorithm like MapReduce: Simplified Data Processing on Large Clusters
Wiki.

Check out Parallel computing and related articles in this WikiPedia-article - "Concurrent programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers." ... http://en.wikipedia.org/wiki/Parallel_computing

Although Cloud Computing is the cool new kid in town, your scenario sounds more like you need a cluster, i.e. how can I use parallelism to solve a problem in a shorter time.
My solution would be:
Understand that if you got a problem that can be solved in n time steps on one cpu, does not guarantee that it can be solved in n/m on m cpus. Actually n/m is the theoretical lower limit. Parallelism is usually forcing you to communicate more and therefore you'll hardly ever achieve this limit.
Parallelize your sequential algorithm, make sure it is still correct and you don't get any race conditions
Find a provider, see what he can offer you in terms of programming languages / APIs (no experience with that)

What you're asking for doesn't exist, for the simple reason that doing this would require having 1500 instances of your application (likely with substantial in-memory data) idle on 1500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on such a basis. Platforms like App Engine and Azure don't give you direct control over how your application is distributed, while platforms like Amazon's EC2 charge by the instance-hour, at a rate that would cost you over $2000 a day.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio