MPI Large Data all to all transfer - parallel-processing

My MPI application has a number of processes that generate large data. Say we have N+1 processes (one master for control, the others are workers); each worker process generates large data, which for now is simply written to a normal file, named file1, file2, ..., fileN. The size of each file may be quite different. Now I need to send each fileM to the rank-M process to do the next job, so it's essentially an all-to-all data transfer.
My problem is how I should use the MPI API to send these files efficiently. I used to transfer them via a Windows shared folder, but I don't think that is a good approach.
I have thought about MPI_File and MPI_Alltoall, but those functions don't seem to be a good fit for my case. Plain MPI_Send and MPI_Recv seem hard to use because every process needs to transfer a large amount of data, and I don't want to use a distributed file system for now.

It's not possible to answer your question precisely without a lot more data, data that only you have right now. So here are some generalities; you'll have to think about them and see if and how to apply them in your situation.
If your processes are generating large data sets they are unlikely to be doing so instantaneously. Instead of thinking about waiting until the whole data set is created, you might want to think about transferring it chunk by chunk.
I don't think that MPI_Send and _Recv (or the variations on them) are hard to use for large amounts of data. But you need to give some thought to finding the right amount to transfer in each communication between processes. With MPI it is not a simple case of there being a message startup time plus a message transfer rate which apply to all messages sent. Some IBM implementations, for example, on some of their hardware had different latencies and bandwidths for small and large messages. However, you have to figure out for yourself what the tradeoffs between bandwidth and latency are for your platform. The only general advice I would give here is to parameterise the message sizes and experiment until you maximise the ratio of computation to communication.
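To make that parameterisation concrete, here is a minimal sketch of a chunked point-to-point transfer. The chunk size, the message tag and the way the buffer is produced are assumptions to be tuned and adapted, not a recommendation:

    /* Minimal sketch: send a large buffer in fixed-size chunks so the chunk
       size can be tuned experimentally.  CHUNK_SIZE and TAG_DATA are
       placeholders to parameterise and experiment with. */
    #include <mpi.h>
    #include <stdlib.h>

    #define CHUNK_SIZE (1 << 20)   /* 1 MiB per message; vary and measure */
    #define TAG_DATA   42

    /* Sender: first tell the receiver how many bytes to expect,
       then push the data chunk by chunk. */
    static void send_large(const char *buf, long long total, int dest)
    {
        MPI_Send(&total, 1, MPI_LONG_LONG, dest, TAG_DATA, MPI_COMM_WORLD);
        for (long long off = 0; off < total; off += CHUNK_SIZE) {
            int n = (int)((total - off < CHUNK_SIZE) ? (total - off) : CHUNK_SIZE);
            MPI_Send(buf + off, n, MPI_BYTE, dest, TAG_DATA, MPI_COMM_WORLD);
        }
    }

    /* Receiver: learn the size, allocate, then pull the chunks. */
    static char *recv_large(int src, long long *total)
    {
        MPI_Recv(total, 1, MPI_LONG_LONG, src, TAG_DATA, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        char *buf = malloc((size_t)*total);
        for (long long off = 0; off < *total; off += CHUNK_SIZE) {
            int n = (int)((*total - off < CHUNK_SIZE) ? (*total - off) : CHUNK_SIZE);
            MPI_Recv(buf + off, n, MPI_BYTE, src, TAG_DATA, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        return buf;
    }

You would then vary CHUNK_SIZE over a wide range and keep whatever maximises the ratio of computation to communication on your platform.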
As an aside, one of the tests you should already have done is measured message transfer rates for a wide range of sizes and communications patterns on your platform. That's kind of a basic shake-down test when you start work on a new system. If you don't have anything more suitable, the STREAMS benchmark will help you get started.
I think that all-to-all transfers of large amounts of data are an unusual scenario in the kinds of programs for which MPI is typically used. You may want to give some serious thought to redesigning your application to avoid such transfers. Of course, only you know whether that is feasible or worthwhile. From what little information you provide, it seems as if you might be implementing some kind of pipeline; in such cases the usual pattern of communication is from process 0 to process 1, process 1 to process 2, 2 to 3, and so on.
Finally, if you happen to be working on a computer with shared memory (such as a multicore PC) you might think about using a shared memory approach, such as OpenMP, to avoid passing large amounts of data around.

Related

Putting memory limits on .NET Core

I am building an ML application for binary classification using ML.NET. It will have multiple ML models of varying sizes (built using different training data) which will be stored in a SQL Server database as blobs. Clients will send items for classification to this app in random order, and based on the client ID the corresponding model is to be used for classification. To classify an item, the model needs to be read from the database and then loaded into memory. Loading a model into memory takes considerable time depending on its size, and I don't see any way to optimize that, hence I am planning to cache models in memory. But if I cache many heavy models, it may put pressure on memory and hamper the performance of other processes running on the server, and there is no straightforward way to limit the caching. So I am looking for suggestions on how to handle this.
Spawn a new process
In my opinion this is the only viable option to accomplish what you're trying to do. Spawn a completely new process that communicates (via IPC?) with your "main application". You could set a memory limit using this property https://learn.microsoft.com/en-us/dotnet/api/system.gcmemoryinfo.totalavailablememorybytes?view=net-5.0 or maybe even use a third-party library (e.g. https://github.com/lowleveldesign/process-governor) that kills your process if it reaches a specific amount of RAM. Both of these approaches are quite rough and will basically kill your process.
If you have control over your sidecar application while it is running, it might make sense to actually monitor its RAM usage with something like "Getting a process's RAM usage" and gracefully stop the process.
Do it yourself solution (not recommended)
Basically there is no built-in way of limiting memory usage per thread or anything similar.
What counts towards the memory limit?
Shared resources
Since you have a running process, you need to define what exactly counts towards the memory limit. For example, if you have some static Dictionary that is manipulated by the running thread - what does it count as occupying? Only the diff between the old value and the new value? The whole new value? The key and the value?
There are many more cases like this you'll have to take into consideration.
The actual measuring
You need some kind of way to count the actual memory usage. This will probably be hard/near impossible to "implement":
Reference counting needed?
If you have a hostile thread, it might create an unbounded number of references to one object without ever using the new keyword. For each reference you'd have to count 32/64 bits.
What about built in types?
It might be "easy" to measure a byte[] included in your own type definition, but what about built in classes? If someone initializes a string with 100MB this might be an amount you need to keep track of.
... and many more ...
As you may have noticed from the previous examples, there is no easy definition of "RAM used by a thread". That is also the reason there is no easy way to get its value.
In my opinion it's insanely complex to do such a thing and needs a lot of definition work on your side. It might be feasible with lots of effort, but I'm not sure that is really what you want. Even if you manage it - what will you do about it? Merely killing the thread might not clean up the resources.
Therefore I'd really think about having an OS-managed, independent process that you can kill whenever you feel like it.
How big are your models? Even large models (100 MB+) load pretty quickly off fast/SSD storage. I would consider caching them on fast drives/SSDs, because pulling them out of SQL Server is going to be much slower than reading from raw disk. See if this helps your performance.

Upper bound on speedup

My MPI experience has shown that speedup does not increase linearly with the number of nodes we use (because of the cost of communication); my own measurements follow that kind of curve.
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But on some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of that? Or maybe he was thinking about adding multithreading to the application (he ran out of time and had to leave right away, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
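In symbols (just restating that definition), with T_1 the run time on one execution unit and T_p the run time on p of them:

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}

Super-linear speed-up is simply the case E(p) > 1, i.e. S(p) > p.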
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end up with sufficiently small chunks of data for each individual process that they fit in the processor's cache. This cache effect can have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communication... This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is when you have an algorithm where you distribute the task of finding a specific element in a large collection: by distributing your work, you can end up with one of the processes/threads finding the result almost immediately, just because it happened to be given a range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance, you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high node counts. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively re-calculate the data when required. At high node counts these games are no longer required and the program can store all its data in memory. Thus super-linear speed-up is a possibility because at higher node counts the code is just doing less work, using the extra memory to avoid I/O or recalculation.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick - more nodes doesn't just mean more cores, it also means more of all your resources. So, since speed-up really measures your use of the cores, if you can also put those other extra resources to good effect you can achieve super-linear speed-up.

Reading in parallel from multiple hard drives

I am writing an application that deals with lots of data (gigabytes). I am considering splitting the data onto multiple hard drives and reading it in parallel. I am wondering what kind of limitations I will run into--for example, is it possible to read from 4 or 8 hard drives in parallel, and will I get approximately 4 or 8 times the performance if disk I/O is the limiting factor? What should I look out for? Pointers to relevant docs are also appreciated--Google didn't turn up much.
EDIT: I should point out that I've looked at RAID, but the performance wasn't as good as I was hoping for. I am planning on writing this myself in C/C++.
Well, splitting the data and reading from 4 to 8 drives in parallel will not simply multiply the throughput by 4 to 8. There are other factors you need to consider.
If you are reading the data in the application, you will probably need separate threads to read from the different hard disks (there is a rough sketch after this answer).
Windows provides overlapped and non-overlapped methods of reading and writing data to disk. See if using those increases the throughput. *nix systems likewise have their own read/write methods.
On a single core/processor, threads appear to run in parallel but actually run sequentially underneath. With multiple cores, multiple threads can read in parallel, but in general the OS decides what to run and when. So having too many threads doing the reads may decrease performance rather than increase it.
If you check the specs of any hard disk, you will see it lists both a random access time and a sequential access time. Depending on your data access pattern, you may want to check these parameters.
When you split data across different drives, keep in mind that your application will have to synchronize how the pieces are assembled back into meaningful information. If you are using threads, the threads additionally need to be synchronized.
You may get a state-of-the-art hard disk with high read/write speeds, but your other hardware may be the weak link. For example, a low-end motherboard or RAM may not let you get the best out of those speeds.
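As a rough sketch of the thread-per-drive idea mentioned above (POSIX threads and plain read(); the file paths and buffer size are made-up placeholders and error handling is minimal):

    /* Rough sketch: one reader thread per drive, each streaming its own file.
       File paths and buffer size are placeholders. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    #define NUM_DRIVES 4
    #define BUF_SIZE   (1 << 16)                  /* 64 KiB read buffer per thread */

    static const char *paths[NUM_DRIVES] = {      /* hypothetical: one file per physical drive */
        "/mnt/disk0/data.bin", "/mnt/disk1/data.bin",
        "/mnt/disk2/data.bin", "/mnt/disk3/data.bin"
    };

    static void *reader(void *arg)
    {
        const char *path = arg;
        char buf[BUF_SIZE];
        long long total = 0;
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return NULL; }
        for (ssize_t n; (n = read(fd, buf, sizeof buf)) > 0; )
            total += n;                            /* real code would hand the data off here */
        close(fd);
        printf("%s: %lld bytes read\n", path, total);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_DRIVES];
        for (int i = 0; i < NUM_DRIVES; i++)
            pthread_create(&tid[i], NULL, reader, (void *)paths[i]);
        for (int i = 0; i < NUM_DRIVES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Timing this against a single-threaded version reading the same files would tell you how close you actually get to the hoped-for 4x.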
If you're not going to use real RAID, you better at least use multiple hard drive controllers, otherwise you won't see much performance gain at all. One controller can't do lots of concurrent IO so it will quickly become the bottleneck.
It sounds like you are talking about data striping, which is commonly used in RAID implementations. You may want to look into one of the software RAID solutions available for most operating systems. A further benefit is that you can add parity (the ability to lose a drive and not your data).
This would give you the benefits of RAID without having to try to deal with it yourself. You could do it on a database level as well with data files spread across the drives, but this adds complexity.
You will stream data faster. Each drive is only so fast, so if your I/O channel can handle more, go for it. There are also seek times to take into account... probably not a big deal based on your app description.
As you seem to be OK with looking at reconfiguring the drives, how about SSDs?
They run rings around any mechanical drive (up around 200+ MB/s read, 150+ MB/s write).
Are you sequentially reading the data, or randomly?
How many GB are you expecting?

Why does multithreaded file transfer improve performance?

RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of its main features, highlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that it speeds up the copying process.
My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...
The tool is making use of improvements in hardware that can optimise multiple read and write requests much better.
When copying one file at a time, the hardware isn't going to know that the block of data currently passing under the read head (or nearby) will be needed for a subsequent read, since the software hasn't queued that request yet.
A single file copy these days is not a very taxing task for a modern disk sub-system. By giving these hardware systems more work to do at once, the tool is leveraging their improved optimising features.
A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.
This means that an individual file cannot be copied in less time than the network latency, even if it is empty (0 bytes). Because the copy probably involves several file-server calls (open, write, close), the total may be several times the latency.
To efficiently copy files, you want to have a server and client which use a sane protocol which has pipelining; that's to say - the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.
Of course, doing that would require a custom server, not an SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single-threaded.
So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
A single-threaded implementation which used a sensible protocol would be best in my opinion.
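For illustration only, the multithreaded work-around described above boils down to something like this sketch: each file copy is a plain open/read/write loop, but several run at once in their own threads. The file names are placeholders and error handling is minimal:

    /* Sketch of the work-around: each copy runs in its own thread, so several
       transfers are in flight at once even though each is purely sequential. */
    #include <pthread.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    struct job { const char *src, *dst; };

    static void *copy_file(void *arg)
    {
        struct job *j = arg;
        char buf[1 << 16];
        int in  = open(j->src, O_RDONLY);
        int out = open(j->dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return NULL; }
        for (ssize_t n; (n = read(in, buf, sizeof buf)) > 0; )
            write(out, buf, n);                    /* real code would check for short writes */
        close(in); close(out);
        return NULL;
    }

    int main(void)
    {
        struct job jobs[] = {                      /* hypothetical source/destination pairs */
            { "a.dat", "/backup/a.dat" },
            { "b.dat", "/backup/b.dat" },
            { "c.dat", "/backup/c.dat" },
        };
        enum { N = sizeof jobs / sizeof jobs[0] };
        pthread_t tid[N];
        for (int i = 0; i < N; i++)
            pthread_create(&tid[i], NULL, copy_file, &jobs[i]);
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }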
It's a network tool, so the bottleneck is the network, not the HDD. Up to a (low) point you can get more throughput out of a TCP link by using a few connections in parallel. This (a) parallelizes the TCP handshakes; (b) can make better use of the bandwidth-delay product if that is high; and (c) doesn't make one arbitrarily slow connection the critical path if for some reason it encounters a high RTT or failure rate.
Another way to do (b) is to use an enormous TCP socket receive buffer but that's not always convenient.
Several of the other answers about HDD are incorrect. Practically any HDD will do some read-ahead on the assumption of sequential access, and any intelligent OS cache will also do that.
My guess is that the HDD read/write heads spend most of their time idle, waiting for the correct block of the disk to appear under them. Copying more data at once means less idle time, and most modern disk schedulers should take care of the jumping around (for a low number of files/fragments).
As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.
I think those assumptions are overly simplistic.
First, while LANs run at 100 Mbit/s or 1 Gbit/s, long-haul networks have a maximum data rate that is limited by the slowest link along the path.
Second, the effective throughput of a TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have an 8+ Mbit link, but my download rate is rarely above 1-2 Mbit/s when I'm downloading from the USA. So if you can run multiple streams in parallel, one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off and lower overall transfer rates.)
Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.
Over long distances, networks can write much faster than they can read. With multithreading, having additional "readers" means the data can be transmitted more efficiently and not get bogged down in buffers.

How to determine the optimum number of worker threads

I wrote a C program which reads a dataset from a file and then applies a data-mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with PThreads. I am new to parallel programming, and I have a question about the number of worker threads that has been puzzling me:
What is the best practice for finding the number of worker threads when you do parallel programming, and how do you determine it? Do you try different numbers of threads, look at the results and then decide, or is there a procedure for finding the optimum number of threads? Of course I'm investigating this question from the performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
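The "much easier job" mentioned above might look roughly like this on a POSIX system; sysconf reports the number of online cores, and worker() is a placeholder for your real per-thread work:

    /* Minimal sketch: ask the OS how many cores are online and spawn that many
       worker threads.  worker() is a placeholder for the real job. */
    #include <pthread.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);        /* real work goes here */
        return NULL;
    }

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncores < 1) ncores = 1;                /* fall back if the query fails */
        pthread_t *tid = malloc(ncores * sizeof *tid);
        for (long i = 0; i < ncores; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < ncores; i++)
            pthread_join(tid[i], NULL);
        free(tid);
        return 0;
    }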
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!
