How does MPI communication work? - parallel-processing

I am trying to play with parallel computing, and I've started to study the MPI standard.
I tried to find information about the low-level implementation, but I am still searching for it.
I understand the high-level concepts such as ranks and communicators. They are not hard, but when I learn something I always look for the low-level details to understand how it works under the hood.
So could someone explain to me which low-level protocols are used for communication? Is it done over the LAN, shared memory, domain sockets, or some other means of communication?
I would be grateful for any details, especially low-level ones.

Related

Interview question: How do you scale or optimize a microservice which receives millions of requests?

I was asked a question during an interview:
How do you optimize a microservice which receives millions of requests?
How do you optimize the latency/frequency of a service response which is accessed multiple times?
My answer was:
I would check the DB query which makes the response slow and then configure the cache.
Can anyone let me know what other ways there are to optimize a service? Is there anything on the cloud side?
It is a vast and complex question which can have a lot of different (and very long) answers depending on the context and the structure of your environment.
There are a lot of patterns and concepts which fit different scenarios and architectures.
I would suggest you start here: https://microservices.io/patterns/index.html
The guy behind the site (Chris Richardson) has advocated microservices for a long time. You can find numerous talks of his on YouTube. It is a great way to start your journey in the microservice world.
And of course: https://martinfowler.com/articles/microservices.html
Here are my ideas on this question:
Caching, as you mentioned, is a good optimization point around the data access layer. I would check whether there is any opportunity to add a cache without breaking consistency or any other hard requirements the application may have.
I would analyze the CPU and memory usage and adjust them progressively while monitoring the latency closely. The objective here is to find the point at which more resources no longer decrease the latency significantly.
The above point has to take into account the number of threads in the application, and also make sure that the synchronization scheme is optimized.
If the microservice is built in a JVM-based language, the GC is one component that may introduce latency, mostly when it kicks in, so if the latency spikes correlate with GC cycles, then I would search for optimizations there.
Making sure the application reuses connections to external services efficiently is another optimization point that may be considered.
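As a rough illustration of the caching idea above, here is a minimal sketch of an in-process cache with a time-to-live, written in Python. The names `TTLCache`, `get_or_load` and `slow_query` are made up for this example, and `slow_query` only stands in for a slow database call. A real service would more likely use a shared cache, and the right TTL depends on how stale the data is allowed to be.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be shown without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: skip the slow call
        value = loader()             # miss or stale entry: do the slow call
        self._store[key] = (now + self._ttl, value)
        return value


calls = []  # records how often the "database" is really hit


def slow_query(user_id: int) -> dict:
    """Hypothetical stand-in for a slow database query."""
    calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


fake_now = [0.0]  # fake clock, assumption made only for this demo
cache = TTLCache(ttl_seconds=30.0, clock=lambda: fake_now[0])

first = cache.get_or_load(42, lambda: slow_query(42))
second = cache.get_or_load(42, lambda: slow_query(42))  # served from the cache
fake_now[0] = 31.0                                      # pretend 31 s have passed
third = cache.get_or_load(42, lambda: slow_query(42))   # expired, loaded again
print(len(calls))  # how many times the slow query actually ran
```

The TTL is the knob that trades latency against consistency: the longer it is, the fewer slow calls are made, but the staler a response can be.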

Real-time capability comparison of single board computers

In my thesis, I plan on writing a section comparing the real-time capabilities of single board computers:
the factors (whether they really have a real-time clock, and if they don't have one, whether real-time frameworks or an RTOS can be used to give them real-time properties, and how)
what scheduling is used in their out-of-the-box kernel? (for example, if round-robin is used, then AFAIK real-time scheduling cannot be achieved)
Comparison between Pandaboard, Beagleboard, Beaglebone, and especially Raspberry Pi
If you have a resource or idea regarding this, I would really appreciate it. In case I have missed any information, please do say and I'd be happy to provide it.
Thanks in advance.
EDIT:
I found a good answer here, but I can always appreciate any better guidance.
What makes a kernel/OS real-time?
First an observation. Scheduling is an OS concept. Why would it matter which scheduler is used in the out-of-the-box kernel? If indeed there is such a thing as an out-of-the-box kernel. Having said that, realtimeness is affected by both scheduler and hardware. But when comparing boards, I would keep the scheduler constant (or maybe pick a few) and then compare the boards. Choosing scheduler(s) is a separate topic on its own. A couple of things to take into account are that it should be pre-emptive and be able to deal with issues like priority inversion.
Note that all these boards have an MMU, which will bring in latency. That shouldn't really matter though, as long as that latency is bounded. I'd also compare the accuracy of the crystals on which the clocks are based. Note also that SoCs have low power modes, and they tend to switch clocks. Whenever they come out of LP mode, they switch from some internal oscillator to a more accurate clock source like an external crystal. That requires time for the crystal to stabilise before normal operation can continue. A comparison of the latency involved in switching between power modes will also be a useful determinant.

Using ZMQ for bidirectional inter-thread communication

I am new to ZeroMQ. I have spent the last couple of months reading the documentation and experimenting with the library. I am currently developing a multi-threaded C++ application and want to use ZeroMQ instead of mutexes to exchange data between my main thread and one of its child threads.
The child thread handles the communication with an external application. Therefore, I will need two queues/sockets between the main thread and its child: one for outgoing messages and one for incoming messages.
Which zmq socket type should I use in order to achieve this?
Thanks in advance
By moving from using shared memory and mutexes to using ZeroMQ, you are entering the realm of Actor model programming.
This, in my opinion, is a fairly good thing. However, there are some things to be aware of.
The only reason mutexes are no longer needed is that you are copying data, not sharing it. The 'cost' is that copying a lot of data takes a lot longer than locking a mutex that guards shared data. So you can end up with a nice looking Actor model program that runs like a dog in comparison to an equivalent program that uses shared memory / mutexes.
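To make the "copy, don't share" point concrete, here is a minimal sketch in Python of a main thread and a child thread that talk only through two queues, one per direction. It deliberately uses the standard library's `queue.Queue` as a stand-in and is not ZeroMQ code. The explicit `copy.deepcopy` is an assumption added to mimic the fact that a message sent through a socket arrives as a copy, and the names `worker`, `to_child` and `from_child` are made up for the example.

```python
import copy
import queue
import threading

STOP = object()  # sentinel telling the worker to shut down


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Child thread: owns its data and only talks through the two queues."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        # Work on a private copy so nothing is shared with the sender.
        data = copy.deepcopy(msg)
        data["values"] = [v * 2 for v in data["values"]]
        outbox.put(data)


to_child = queue.Queue()    # main -> child
from_child = queue.Queue()  # child -> main

t = threading.Thread(target=worker, args=(to_child, from_child))
t.start()

request = {"id": 1, "values": [1, 2, 3]}
to_child.put(request)
reply = from_child.get(timeout=5)  # blocks until the child has answered

to_child.put(STOP)
t.join(timeout=5)
print(reply, request)
```

Neither thread takes a lock in its own code, but every message pays for a copy, which is exactly the trade-off described above.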
A caveat is that on complicated architectures like Intel Xeons with multiple CPUs, accessing shared memory can, conceivably, take just as long as copying it. This is because this may (depending on how lucky you've been) mean transactions across the QPI bus. Actor model programming is ideal for NUMA hardware architectures. Modern Intel and AMD architectures are, partially/fundamentally, NUMA, but the protocols they run over QPI / Hypertransport "fake" an SMP environment.
I would avoid ZMQ_PAIR sockets wherever practicable. They are designed for inter-thread communication over the inproc transport and lack features such as automatic reconnection, so they are a poor fit for network connections. This means that if, for any reason, your application needs to scale across multiple computers you have to re-write your code. However, if you use different socket types from the very beginning, a scale-up of your application is nothing more than a matter of redeploying your code, not changing it. FYI nanomsg PAIRs do not have this restriction.
Don't for one moment assume that Actor model programming is going to solve all your problems. It brings in a whole suite of problems all of its own. You can still deadlock, livelock, spinlock, etc. The problem with Actor model programs is that these problems can be lurking in your code for years and never happen, until one day the network is just a little bit busier and -bam- your program stops running...
However, there is a related model called "Communicating Sequential Processes" (CSP). This doesn't solve those problems, but because sends and receives are synchronous, if you've written your program with these problems they are guaranteed to happen every single time. So you discover the problem during development and testing, not five years later. There's also a process calculus for it, i.e. you can algebraically prove that your design is problem free before you ever write a single line of code. ZeroMQ is not CSP. Interestingly CSP is making something of a comeback - the Rust and Go languages both offer CSP-style channels. However, they do not do CSP across network connections - it's all in-process stuff. Erlang takes the closely related Actor model approach and, AFAIK, does it across network connections.
Assuming you've read all that about CSP and are still going to use ZeroMQ, think carefully about what it is you are planning on sending across the ZeroMQ sockets. If it's all within one program on the same machine, then sending copies of, for example, arrays of integers is fine. They'll still be interpretable as integers at the receiving end. However, if you have aspirations to send data through ZMQ sockets to another computer it's well worth considering some sort of serialisation technology. ZeroMQ delivers messages. Why not make those messages the byte stream from an object serialiser? Then you can guarantee that the received message will, after de-serialisation, mean something appropriate at the receiving end, instead of having to solve problems with endianness, etc.
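To show what the endianness problem looks like in practice, here is a minimal sketch using Python's standard `struct` module. The wire format (a 4-byte count followed by 32-bit signed integers, all big-endian) and the function names `encode_ints` and `decode_ints` are assumptions made up for this example. A real system would more likely use one of the serialisers discussed here than a hand-rolled format.

```python
import struct


def encode_ints(values):
    """Pack a list of ints as: 4-byte count, then 32-bit signed ints.

    The "!" prefix forces big-endian ("network") byte order with no padding,
    so the bytes mean the same thing on any machine.
    """
    return struct.pack(f"!I{len(values)}i", len(values), *values)


def decode_ints(payload):
    """Inverse of encode_ints."""
    (count,) = struct.unpack_from("!I", payload, 0)
    return list(struct.unpack_from(f"!{count}i", payload, 4))


message = encode_ints([1, -2, 300])  # these bytes would be the message body
print(message.hex())
print(decode_ints(message))

# The same integer has a different byte layout depending on byte order,
# which is why sending raw in-memory arrays between machines is risky.
little = struct.pack("<i", 1)
big = struct.pack(">i", 1)
print(little.hex(), big.hex())
```

Fixing the byte order in the format makes the message portable, but the format itself still has to be agreed on and versioned by hand, which is the work a proper serialiser takes over.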
Favourite serialisers for me include Google Protocol Buffers. It is language / operating system agnostic, giving lots of options for a heterogeneous system. ASN.1 is another really good option, it can be got for most of the important languages, and it has a rich set of wire formats (including XML and, now/soon, JSON, which gives some interesting inter-op options), and does Constraints (something Google PBufs don't do), but does tend to cost money if one wants really good tools for it. XML can be understood by almost anything, but is bloated. Basically it's worth picking one that doesn't tie you down to using, say, C#, or Python everywhere.
Good luck!

Boost.MPI vs Boost.Asio

Good day!
What is the difference between these libraries?
I have read MPI's docs and have a little experience with Asio. To me they are just
different implementations of network communication and nothing more.
But each of them introduces different abstractions (I'm not sure whether these
abstractions are at the same level), which leads to different application designs.
When should I use one library or the other? What do I need to know to make the right
decision in each situation?
Yes, Asio is good for several nodes (and is a very generic framework in general), but why is MPI worse for such tasks? I don't think that the dependency on the MPI C library is restrictive or that MPI is hard to understand, and what about scalability? With Asio we can implement things like broadcasting, and on the other hand MPI doesn't forbid writing simple network applications. Is it conceptually difficult to rewrite Asio-specific logic with MPI if needed?
What about socket-like communication: if it's mandatory, we can encapsulate it in a module built on Asio or any other framework and still use MPI for the other communication.
To me, sockets and the MPI standard are different network services, and it's not clear which is fundamental in the real world, where the distance from a simple client-server pair to some medium-sized computation is one step. Also, I don't think that MPI has notable overhead in comparison with Asio.
Maybe it's a bad question and all we need is something like ICE (Internet Communications Engine)? It supports different languages and, again (as ZeroC assures), has great performance.
And, of course, I have never seen a topic like 'don't use this library for this!' in any documentation.
I simply can't accept such disunity: in one case it's sockets, in another it's asynchronous messages, and finally a heavy middleware platform. Where is the clarity in the development lifecycle? Maybe it's not a fair question, but to start reducing this zoo we need some starting point.
Each library solves different problems, they don't really overlap. It also depends what you are trying to solve, and the communication patterns of your application. Use Boost.MPI for scalability, such as scaling to thousands, or tens of thousands of nodes. Depending on the underlying network architecture, MPI also excels at collective operations: gather, scatter, broadcast, etc.
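For readers who have not met the collective operations mentioned above, here is a small pure-Python simulation of what broadcast, scatter and gather mean. It is not MPI code and does no communication: the "ranks" are just list indices, and the functions `broadcast`, `scatter` and `gather` are made up to mirror the semantics of the MPI operations of the same names.

```python
from typing import Any, List, Sequence


def broadcast(value: Any, n_ranks: int) -> List[Any]:
    """Every rank ends up with the root's value."""
    return [value] * n_ranks


def scatter(chunks: Sequence[Any], n_ranks: int) -> List[Any]:
    """Rank i receives chunks[i]; the root must supply one chunk per rank."""
    if len(chunks) != n_ranks:
        raise ValueError("scatter needs exactly one chunk per rank")
    return list(chunks)


def gather(per_rank_values: Sequence[Any]) -> List[Any]:
    """The root receives one value from every rank, ordered by rank."""
    return list(per_rank_values)


N_RANKS = 4
data = list(range(1, 9))  # held by the root only
chunk = len(data) // N_RANKS
chunks = [data[i * chunk:(i + 1) * chunk] for i in range(N_RANKS)]

scale = broadcast(10, N_RANKS)    # the same factor on every rank
local = scatter(chunks, N_RANKS)  # a different slice on every rank
partial = [scale[r] * sum(local[r]) for r in range(N_RANKS)]  # local work
result = gather(partial)          # collected back at the root
print(result, sum(result))
```

In a real MPI program each rank is a separate process, possibly on another machine, and the library chooses how to move the data between them.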
Use Boost.Asio for a socket abstraction layer if you only need a handful of nodes, such as a single server and some clients. I'd suggest using Boost.Asio if you aren't already using an MPI distribution in some fashion.
I haven't used either of them, but Boost.Asio is more of an abstraction layer for low-level networking, whereas Boost.MPI implements the MPI standard, which lets you create distributed computing systems.
So if you need some, say, socket-like communication, I'd go with ASIO. If you want to do distributed computing and maybe even interoperate with MPI programs written in other languages/for other platforms, go with Boost.MPI.

Where are memory management algorithms used?

There is a set of memory management algorithms used in operating system construction, like paging, segmentation, paged segmentation, segmented paging and others.
Do you know if they are used outside that area, in higher-level software? Are they used in business applications?
These algorithms are for translating program memory addresses into physical memory addresses. You will very rarely have to think about this in an application. In some extreme cases of applications working on very large datasets you may have to create a driver-like module to tune memory translation, but all the rest is still up to the operating system.
You might never write an OS yourself, but if you ever find yourself having to write a device driver, it will be imperative that you understand these issues. So it is still quite useful to understand how these algorithms work.
Now you might be in school thinking, "Yuck, I'll just avoid that stuff". But you really have no idea where a 40-year career in the industry might take you.
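As a rough picture of what such a translation does, here is a toy sketch of single-level paging in Python. The 4096-byte page size, the dictionary used as a page table and the names `translate` and `PageFault` are assumptions for illustration only. Real hardware does this in the MMU with multi-level tables.

```python
PAGE_SIZE = 4096  # bytes per page; a common size, assumed for this example


class PageFault(Exception):
    """Raised when a virtual page has no physical frame mapped."""


def translate(vaddr: int, page_table: dict) -> int:
    """Translate a virtual address into a physical address.

    The virtual address is split into a page number and an offset within
    the page; the page table maps the page number to a physical frame.
    """
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(f"no mapping for virtual page {page}")
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical mapping: virtual page -> physical frame
page_table = {0: 5, 1: 2}

print(translate(10, page_table))    # page 0, offset 10 -> frame 5
print(translate(4100, page_table))  # page 1, offset 4  -> frame 2

try:
    translate(3 * PAGE_SIZE, page_table)  # page 3 is not mapped
except PageFault as exc:
    print("page fault:", exc)
```

An application never sees this step, which is why it rarely matters outside of operating systems and drivers.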
