MPI client server connection with Singleton MPI_INIT - parallel-processing

I want to implement (in C++) a feature, using MPI, in an existing (non-MPI) application. I am thinking of using mpich-3.4.1 for this.
I am planning to create a .so file for that feature, which the original application can link to. I initially thought of having a function in the .so file that starts with MPI_Init() and ends with MPI_Finalize() and, in between, calls all the required MPI APIs to do the parallel job. As part of the MPI job, the new feature makes the current application an MPI server by calling APIs like 'MPI_Open_port' and 'MPI_Comm_accept'. Other worker processes (possibly running on different machines) connect to this server, send/receive messages, and complete a heavy computation in parallel. The application then resumes its other non-MPI work.
It seems to me that the Singleton MPI_INIT mechanism will be useful for this. I found the following page on Singleton Init:
https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node254.htm
This page says, "A high-quality implementation will allow any process (including those not started with a ``parallel application'' mechanism) to become an MPI process by calling MPI_INIT. Such a process can then connect to other MPI processes...".
However, the comments in mpich-3.4.1/src/mpi/init/init.c say, "The MPI standard does not say what a program can do before an 'MPI_INIT' or after an 'MPI_FINALIZE'. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output."
Based on the above comments, it seems we should not have MPI_Init(NULL, NULL) and MPI_Finalize() as part of any implementation in a library. In that case, I am thinking of having the init and finalize calls in the original application's main function, and having the rest of the API calls made from the .so file. My original application is a large, working piece of software, and in some situations it may not need to execute my MPI feature at all.
My questions are:
(1) Does it make sense to have MPI_Init(NULL, NULL) and MPI_Finalize() called in the main function of this application, and the rest of the MPI functionality in a .so file?
(2) Once MPI_Init(NULL, NULL) is called in the main, would it interfere with the normal execution of the software in any way? Would there be any performance impact on the existing application?
(3) Is there an MPI implementation that handles this better?
(4) Is MPI a good approach to handle this requirement, or are other mechanisms like ZeroMQ better? In a comment at the following link, Wesley Bland says that "MPI may not be right for you if you're looking for a client/server model. Yes, it's possible, but it's not really optimized for that use case and you might have better luck using a different communication mechanism". Is that true in 2022?
client relationship within MPI server

Related

EXE size bloats while using Websocketpp

I've built a very basic EXE which uses the Websocketpp client. It just connects to a WebSocket server, and sends and receives a message.
I've used VS 2013.
I'm noticing that the size of the EXE is mammoth. It's like 2.3 MB for Release and 6 MB for Debug.
Any ideas as to how I can reduce the size of EXE??
WebSocket++ author here. The sizes you quote seem about the right ballpark. Keep in mind that a "very basic sample" like the echo_server (which produces a ~1MB executable on linux) does a lot more than you might think based on the ~50 lines in the program source.
Out of the box any WebSocket++/Asio based program is a high performance event based client/server system and includes code for DNS resolution, IPv4 and IPv6, timers, SHA1/MD5 hashing, base64 encoding, UTF8 validation, logging, thread safety, and parsers for URIs, HTTP, and multiple WebSocket protocol versions. Just because you only use these capabilities to echo back messages doesn't make this a trivial program.
Some observations/notes on the topic:
Due to the way templates work, the code for WebSocket++, ASIO, and the STL is compiled into your program rather than sitting in an externally linked library. This may make a WebSocket++ or Asio program look artificially larger than a program that links to an external library.
The situation described in the previous point can sometimes end up more efficient than an external library, because the program will only include the parts of the library that your code actually uses, rather than all parts. I.e., if you don't instantiate a client endpoint, no client code will be included. If your config disables TLS encryption, logging, or the thread safety features, they will also not be included. Again, due to the way templates work, this can go both ways. For example, a program that includes both a client and a server will have some potentially unnecessary duplication.
The size of WebSocket++'s code is largely correlated to the number of different endpoint configs that you use and the options enabled in each of those configs. These represent a fixed size no matter what else your program does. If your program does little, they will make up a large proportion of the code. If your program does a lot that proportion will shrink.
WebSocket++ is fairly modular (though this is less well documented right now). If you are really concerned about code size (small embedded systems perhaps?) and don't actually need all the features that Asio and WebSocket++ bring out of the box, you can set up a custom config that either removes many features or replaces them with your own space optimized implementations.
Say you only ever need to service one non-TLS connection with no DNS lookup and no security timeouts in a guaranteed single threaded program with no logging. You can implement your own network transport policy based on your native OS socket library that doesn't include all the stuff that Asio does. You can also stub out the locking/concurrency and logger policies you don't need.

Avoid Application[process] switching for shared resource in linux

A shared resource is used by two application processes, A and B. To avoid a race condition, I decided to disable context switching while executing the portion of code that deals with the shared resource, and to re-enable it after leaving that portion.
But I don't know how to prevent switching to another process while executing the shared-resource part, and how to re-enable process switching after leaving it.
Or is there a better method to avoid the race condition?
Regards,
Learner
"But I don't know how to prevent switching to another process while executing the shared-resource part, and how to re-enable process switching after leaving it."
You can't do this directly. You can do what you want with kernel help. For example, waiting on a Mutex, or one of the other ways to do IPC (interprocess communication).
If that's not "good enough", you could even make your own kernel driver that has the semantics you want. The kernel can move processes between "sleeping" and "running". But you should have good reasons why existing methods don't work before thinking about writing your own kernel driver.
"Or is there a better method to avoid the race condition?"
Avoiding race conditions is all about trade-offs. The kernel has many different IPC methods, each with different characteristics. Get a good book on IPC, and look into how things like Postgres scale to many processors.
For all user-space applications, and the vast majority of kernel code, it holds that you can't disable context switching. The reason is that context switching is not the responsibility of the application, but of the operating system.
In the scenario that you mentioned, you should use a mutex. All processes must follow the convention that they acquire the mutex before accessing the shared resource, and release it once they are done with the shared resource.
Let's say an application accessing the shared resource has acquired the mutex and is doing some processing on the shared resource, and the operating system performs a context switch, thus stopping the application from processing the shared resource. The OS can schedule other processes that want to access the shared resource, but they will be in a waiting state, waiting for the mutex to be released, and none of those processes will do anything with the shared resource. After a certain number of context switches, the OS will again schedule the original application, which will continue processing the shared resource. This will continue until the original application finally releases the mutex. Then some other process will start accessing the shared resource in an orderly fashion, as designed.
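The acquire/release convention described above can be illustrated with a minimal sketch. It is written in Python rather than C only to keep it short, it assumes Linux (it uses the "fork" start method), and the names `worker` and `run_workers` are made up for this illustration. The shared resource is a plain integer in shared memory, and every process takes the same lock before touching it:

```python
import multiprocessing as mp


def worker(lock, counter, n_iters):
    """Increment the shared counter n_iters times, holding the lock each time."""
    for _ in range(n_iters):
        with lock:  # acquire before touching the shared resource, release after
            counter.value += 1


def run_workers(n_procs=4, n_iters=1000):
    # Assumption: Linux, so the "fork" start method is available.
    ctx = mp.get_context("fork")
    lock = ctx.Lock()               # the mutex every process agrees to use
    counter = ctx.RawValue("i", 0)  # shared memory with no built-in locking
    procs = [
        ctx.Process(target=worker, args=(lock, counter, n_iters))
        for _ in range(n_procs)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())
```

The OS is free to switch between the processes at any point. A process that is preempted while holding the lock simply keeps it, and the others block in `with lock` until it is released, so no update is lost.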
If you want more authoritative and detailed explanations of whats and whys of similar scenarios, you can watch this MIT lesson, for example.
Hope this helps.
I would suggest looking into named semaphores; see sem_overview(7). These will allow you to ensure mutual exclusion in your critical sections.

Long running task in WebAPI

Here's my problem: I need to call multiple 3rd party methods inside an ApiController. The signature for those methods is Task DoSomethingAsync(SomeClass someData, SomeOtherClass moreData). I want those calls to continue running in the background, after the ApiController has sent the data back to the client. When DoSomethingAsync completes I want to do some logging and maybe save some data to the file system. How can I do that? I'd prefer to use the async/await syntax.
Great news, there is a new solution in .NET 4.5.2 called the QueueBackgroundWorkItem API. It's really simple to use:
HostingEnvironment.QueueBackgroundWorkItem(ct => DoSomething(a, b, c));
Here's an article that describes it in detail.
https://blogs.msdn.microsoft.com/webdev/2014/06/04/queuebackgroundworkitem-to-reliably-schedule-and-run-background-processes-in-asp-net/
And here's another article that covers a few other approaches not mentioned in this thread.
http://www.hanselman.com/blog/HowToRunBackgroundTasksInASPNET.aspx
You almost never want to do this. It is almost always a big mistake.
ASP.NET (and most other servers) work on the assumption that it's safe to tear down your service once all requests have completed. So you have no guarantee that your logging will be done, or that your data will be written to disk. Particularly with the disk writes, it's entirely possible that your writes will be corrupted.
That said, if you are absolutely sure that you want to implement this extremely dangerous design, you can use the BackgroundTaskManager from my blog.
Update: I've written a blog series that goes into detail on a proper solution for request-extrinsic code. In summary, what you really want to do is move the request-extrinsic code out of ASP.NET. Introduce a durable queue and an independent processor; the ASP.NET controller action will place a request onto the queue, and the independent processor will read requests and execute them. This "processor" can be an Azure Function/WebJob, Win32 Service, etc.
Stephen described why starting essentially long running fire-and-forget tasks inside an ApiController is a bad idea.
Perhaps you should create a separate service to execute those fire-and-forget tasks. That service could be a different ApiController, a worker behind a queue, anything that can be hosted on its own and have an independent lifetime.
This would make management of the different task lifetimes much easier and separate the concerns of the long-running tasks from the ApiController's core responsibilities.
As pointed out by others, it is not recommended. However, whenever there is a need there is a way, so take a look at IRegisteredObject
See also
http://haacked.com/archive/2011/10/16/the-dangers-of-implementing-recurring-background-tasks-in-asp-net.aspx/
Though the question is several years old, the best possible solution now is to use SignalR in this case.
https://github.com/Myrmex/signalr-notify-progress

Multi-threaded Windows Service - Erlang

I am going to describe the problem that I have to solve, and I need some suggestions on whether I am on the right path.
The problem is:
I need to create a Windows Service application that receives a request and performs some action (socket communication). This action is to execute a script (maybe in Lua or Perl). The script models the business rules of the client: it queries databases, makes requests to websites, and then sends a response to the client.
There are 3 mandatory requirements:
The service will receive a lot of requests at the same time, so I am thinking of using a worker-thread model.
The service must have high throughput. I will have many requests in the same second.
Low latency: I must respond to these requests very quickly.
Every request will generate log entries. I can't write these log entries to the physical disk while the scripts execute, because of the long I/O time. I will probably make a queue in memory and have other threads consume this queue and write to disk.
In the future, it is possible that two worker threads will have to exchange messages.
I have to design a protocol for this service. I was thinking of using Thrift, but I don't know the overhead involved. Maybe I will make my own protocol.
To write the Windows service, I was thinking of Erlang. Is it a good idea?
Does anyone have suggestions/hints to solve this problem? Which language is better for writing this service?
Yes, Erlang is a good choice if you know it or are ready to learn it. With Erlang you don't need any worker threads; just implement your server in the Erlang style and you'll get a multithreaded solution automatically.
I'm not sure how to convert an Erlang program to a Windows service, but it's probably doable.
Writing to the same log file from many threads is suboptimal because it requires locking. It's better to have a log-entries queue (lock-free?) and a separate thread (Erlang process?) that writes them to the file. BTW, are you sure that executing an external script in another language is much faster than writing a log record to the file?
It's doubtful that you'll get much better performance with your own serialization library than Thrift provides for free. Another option is Google Protocol Buffers; somebody claimed that it's faster.
Theoretically (!) it's possible that an Erlang solution won't give you the required performance. In that case consider a compiled language, e.g. C++, and asynchronous networking, e.g. Boost.Asio. But be prepared for it to be much more complicated than the Erlang way.
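The log-queue idea from this answer (workers only enqueue entries, and one dedicated writer owns the file) can be sketched roughly as follows. The sketch is in Python rather than Erlang purely for illustration, all names (`log_writer`, `handle_requests`, `_STOP`) are hypothetical, and Python's `queue.Queue` is thread-safe but uses locks internally, so it is not the lock-free queue the answer speculates about:

```python
import io
import queue
import threading

_STOP = object()  # sentinel telling the writer thread to finish


def log_writer(entries, sink):
    """Drain the queue and write each entry; the only thread that touches sink."""
    while True:
        entry = entries.get()
        if entry is _STOP:
            break
        sink.write(entry + "\n")


def handle_requests(n_workers, sink):
    entries = queue.Queue()
    writer = threading.Thread(target=log_writer, args=(entries, sink))
    writer.start()

    def worker(worker_id):
        # Stand-in for running the business script; it only enqueues a log line
        # and never waits for disk I/O itself.
        entries.put(f"request handled by worker {worker_id}")

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    entries.put(_STOP)  # all entries are queued; let the writer drain and exit
    writer.join()


if __name__ == "__main__":
    sink = io.StringIO()  # a real service would open a log file here
    handle_requests(3, sink)
    print(sink.getvalue(), end="")
```

Because only the writer thread touches the sink, the request-handling threads never contend for the file. The order of the lines depends on scheduling.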

Using gevent and multiprocessing together to communicate with a subprocess

Question:
Can I use the multiprocessing module together with gevent on Windows in an efficient way?
Scenario:
I have a gevent based Python application doing asynchronous I/O on Windows. The application is mostly I/O bound, but there are spikes of higher CPU load as well. This application would need to control a console application via its stdin and stdout. I cannot modify this console application and the user will be able to use his own custom one, only the text (line) based communication protocol is fixed.
I have a working implementation using subprocess and threads, but I would rather move the whole subprocess based communication code together with those threads into a separate process to turn the main application back to single-threaded. I plan to use the multiprocessing module for this.
Prior reading:
I have been searching the Web a lot and read some source code, so I know that the multiprocessing module is using a Pipe implementation based on named pipes on Windows. A pair of multiprocessing.Queue objects would be used to communicate with the second Python process. These queues are based on that Pipe implementation, i.e. the IPC would be done via named pipes.
The key question is, whether calling the incoming Queue's get method would block gevent's main loop or not. There's a timeout for that method, so I could make it into a loop with a small timeout, but that's not a good solution, since it would still block gevent for small time periods hurting its low I/O latency.
I'm also open to suggestions on how to circumvent the whole problem of using pipes on Windows, which is known to be hard and sometimes fragile. I'm not sure whether shared memory based IPC is possible on Windows or not. Maybe I could wrap the console application in a way which would allow communicating with the child process using network sockets, which is known to work well with gevent.
Please don't question my primary use case, if possible. Thanks.
The Queue's get method really is blocking. Using it with a timeout could potentially solve your problem, but it definitely won't be the cleanest solution and, most importantly, it will introduce extra latency for no good reason. Even if it weren't blocking, it still wouldn't be a good solution, because being non-blocking is not enough by itself: a good asynchronous call/API should integrate smoothly into the I/O framework in use, be that gevent for Python, libevent for C, or Boost ASIO for C++.
The easiest solution would be to use simple I/O by spawning your console application and attaching to its console input and output descriptors. There are two major factors to consider:
It will be extremely easy for your clients to write client applications. They will not have to work with any kind of IPC, socket, or other code, which could be a very hard thing for many. With this approach, the application will just read from stdin and write to stdout.
It will be extremely easy to test console applications using this approach as you can manually start them, enter text into console and see results.
Gevent is a perfect fit for async read/write here.
However, the downside is that you will have to start this application, there will be no support for concurrent communication with it, and there will be no support for communication over network. There is even a good example for starters.
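The stdin/stdout approach described above can be sketched as follows. This is only an illustration under stated assumptions: the child is a hypothetical stand-in for the user's console application (a tiny Python script that upper-cases each line), the helper names are made up, and the calls use the plain blocking `subprocess` module, so a gevent application would still need a cooperative way to wait on the pipes:

```python
import subprocess
import sys

# Hypothetical stand-in for the user's console application: it upper-cases
# every line it reads from stdin and writes the result to stdout.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


def start_console_app():
    """Spawn the console application with pipes attached to its stdin/stdout."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered on the parent side
    )


def request(proc, line):
    """Send one line and block until one reply line arrives."""
    proc.stdin.write(line + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


def stop_console_app(proc):
    """Close stdin so the child sees EOF, then wait for it to exit."""
    proc.stdin.close()
    returncode = proc.wait()
    proc.stdout.close()
    return returncode


if __name__ == "__main__":
    app = start_console_app()
    print(request(app, "hello"))
    stop_console_app(app)
```

The protocol is just text lines, so the same console application can also be started by hand and tested by typing into it, which is the second factor mentioned above.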
To keep it simple but more flexible, you can use TCP/IP sockets. If both client and server are running on the same machine, a good operating system will use IPC as the underlying implementation, so it will be fast. And if you are worried about performance in this case, you probably should not use Python at all and should look at other technologies.
An even fancier solution: use ZeroC ICE. It is a very modern technology allowing almost seamless inter-process communication. It is a CORBA killer and very easy to use. It is heavily used by many, proven to be the fastest in its class, and rock stable. The beauty of this solution is that you can seamlessly integrate programs in many different languages, like Python, Java, C++, etc. But it will require some of your time to get familiar with the concept. If you decide to go this way, just spend a day reading through the documentation.
Hope it helps. Good luck!
Your question is already quite old. Nevertheless, I would like to recommend http://gehrcke.de/gipc which -- I believe -- would tackle the outlined challenge in a very straight-forward fashion. Basically, it allows you to integrate multiprocessing-based child processes anywhere in your application (also on Windows). Interaction with Process objects (such as calling join()) is gevent-cooperative. Via its pipe management, it allows for cooperatively blocking inter-process communication. However, on Windows, IPC currently is much less efficient than on POSIX-compliant systems (since non-blocking I/O is imitated through a thread pool). Depending on the IPC messaging volume of your application, this might or might not be of significance.
