MPI: master-slave with the master also doing work - parallel-processing

I'm implementing a standard MPI master/slave system: there is a master that distributes work, and there are slaves who ask for chunks and process data.
However... if implemented in a naive way (rank==0 is master, the rest are slaves), the master ends up doing no real work, but still takes one core for what needs practically no real computing power. So I tried to implement a separate "scheduler" thread in the master, but that involved sending MPI messages to itself, and didn't really work...
Do you have any ideas how to solve this?

As I realized after some googling: you can send messages to yourself using tags. Tags are a kind of filter: if you do a recv for only tag==1, then you'll receive only messages with that tag, and a later message with a matching tag can overtake earlier ones that carry a different tag.
So, as for the solution:
tag the "scheduler to worker" and "worker to scheduler" messages with a different id
if rank==0: start a scheduler thread
afterwards, regardless of the rank, request work.
This way, the rank 0 worker won't receive its own "let's give me work" messages, because they will have a "to be received by the scheduler only" tag.
Edit: this thing doesn't really seem to be thread-safe though... (= it sometimes crashes in "free()" even though it's written in Python...) so I'd be still interested in the real & proven solution :)
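To make the tag idea concrete, here is a minimal sketch that imitates it with plain Python threads instead of real MPI. Everything in it (Mailbox, TAG_REQUEST, TAG_WORK, run) is a hypothetical name chosen for illustration, and "ranks" are just threads. It only shows how rank 0 can host both a scheduler and a worker without one stealing the other's messages. In real MPI, calling the library from two threads at once generally requires initializing with the MPI_THREAD_MULTIPLE support level, which may be related to the crashes mentioned above.

```python
import threading
from collections import deque

TAG_REQUEST = 1   # worker -> scheduler: "give me work"
TAG_WORK = 2      # scheduler -> worker: a chunk, or None when nothing is left


class Mailbox:
    """Per-rank inbox whose recv() matches a single tag (stand-in for MPI tags)."""

    def __init__(self):
        self._messages = deque()
        self._cond = threading.Condition()

    def send(self, tag, source, payload):
        with self._cond:
            self._messages.append((tag, source, payload))
            self._cond.notify_all()

    def recv(self, tag):
        with self._cond:
            while True:
                for i, msg in enumerate(self._messages):
                    if msg[0] == tag:          # skip messages with other tags
                        del self._messages[i]
                        return msg[1], msg[2]
                self._cond.wait()


def run(num_ranks, chunks):
    boxes = [Mailbox() for _ in range(num_ranks)]
    results = {rank: [] for rank in range(num_ranks)}

    def scheduler():                           # extra thread living on rank 0
        pending = list(chunks)
        finished = 0
        while finished < num_ranks:
            source, _ = boxes[0].recv(TAG_REQUEST)
            chunk = pending.pop() if pending else None
            if chunk is None:
                finished += 1
            boxes[source].send(TAG_WORK, 0, chunk)

    def worker(rank):                          # runs on every rank, including 0
        while True:
            boxes[0].send(TAG_REQUEST, rank, None)
            _, chunk = boxes[rank].recv(TAG_WORK)
            if chunk is None:
                return
            results[rank].append(chunk * chunk)

    threads = [threading.Thread(target=scheduler)]
    threads += [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Rank 0's mailbox receives both kinds of message, but the scheduler only ever asks for TAG_REQUEST and the worker only for TAG_WORK, so neither sees the other's traffic.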

Related

Cancel last sent message ZeroMQ (python) (dealer/router and push/pull)

How would one cancel the last sent message ?
I have this set up
The idea is that the client can ask for different types of large data.
The server reads the request from the client and answers an acknowledgement.
Once its data is ready, it pushes it through the other socket.
This enables queueing tasks on the server side when multiple clients are connected.
However, if the client decides that it does not need the data anymore, it can send a cancel message to the server.
I'm using asyncio.Queue for queueing messages, so I can easily empty the queue, however, I don't know how to drop a message that is in the push/pull pipe to free up the channel?
The kill switch example (Figure 19 - Parallel Pipeline with Kill Signaling) in https://zguide.zeromq.org/docs/chapter2/ is used to end the process. I just want to cancel it.
My idea was to close the socket on the server side and reopen it, but even with linger set to 0, the messages are not dropped.
EDIT: The messages are indeed dropped, but I feel the solution is wrong.
It doesn't really make any sense for ZeroMQ itself to have such a feature.
Suppose that it did have a cancel message feature. For it to operate as expected, you would be critically dependent on the speed of the network. You might develop on a slow network, and there you have the time available to decide to cancel, submit the request and for that to take effect before anything has moved anywhere. But on a fast network you won't.
ZeroMQ is a bit like the post office. Once you have posted a letter, they are going to deliver it.
Other issues for a library developer would include how messages are identified, who can cancel a message, etc? It would get very complex for the library to do it and cater for all possible use cases, so it's not unreasonable that they've left such things as an exercise for the application developers.
Chop the Responses Up
You could divide the responses up into smaller messages, send them at some likely rate (proportionate to the network throughput) and check to see if a cancellation has been received before sending each chunk.
It's a bit fiddly: you'd need to know what rate to send the smaller messages at so that you don't starve the network, but don't overdo it either.
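A rough sketch of the chunking idea, using asyncio since the question already does. The send callable, the cancelled event and the pause parameter are assumptions made for illustration; in a real server send would wrap the push socket and cancelled would be set by whatever handles the client's cancel message.

```python
import asyncio


async def send_in_chunks(data, chunk_size, send, cancelled, pause=0.0):
    """Send data piece by piece, stopping as soon as `cancelled` is set.

    `send` is any awaitable callable (hypothetical, e.g. a socket wrapper).
    Returns the number of chunks actually sent.
    """
    sent = 0
    for start in range(0, len(data), chunk_size):
        if cancelled.is_set():                 # check before every chunk
            break
        await send(data[start:start + chunk_size])
        sent += 1
        await asyncio.sleep(pause)             # yield so a cancel can be processed
    return sent


async def demo():
    received = []
    cancelled = asyncio.Event()

    async def send(chunk):
        received.append(chunk)
        if len(received) == 3:                 # pretend the client cancels here
            cancelled.set()

    sent = await send_in_chunks(b"x" * 100, 10, send, cancelled)
    return sent, received
```

Only the chunk currently in flight is beyond recall; everything after it is never sent.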
Or, Convert to CSP
The problem lies in ZeroMQ implementing the Actor Model, where the transport buffers messages. What you need is Communicating Sequential Processes (CSP), which does not buffer messages. You can implement this quite easily on top of ZeroMQ; all you need is a two-way message exchange along these lines:
Peer1->Peer2: I'd like to send you a message
time passes
Peer2->Peer1: Okay send a message
Peer1->Peer2: Here is the message
time passes
Peer2->Peer1: I have received the message
end
And in doing this the peers would block, i.e. peer 1 does nothing else until it gets peer 2's final response.
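The exchange above can be sketched with two threads and ordinary queues standing in for the ZeroMQ sockets. RendezvousChannel and transfer are hypothetical names, and the sketch assumes exactly one sender and one receiver per channel.

```python
import queue
import threading


class RendezvousChannel:
    """Unbuffered channel: send() returns only after the receiver has the message."""

    def __init__(self):
        self._requests = queue.Queue()   # "I'd like to send you a message"
        self._grants = queue.Queue()     # "Okay send a message"
        self._data = queue.Queue()       # "Here is the message"
        self._acks = queue.Queue()       # "I have received the message"

    def send(self, message):
        self._requests.put("request")
        self._grants.get()               # block until the receiver is ready
        self._data.put(message)
        self._acks.get()                 # block until receipt is confirmed

    def recv(self):
        self._requests.get()             # wait for a sender to announce itself
        self._grants.put("ok")
        message = self._data.get()
        self._acks.put("received")
        return message


def transfer(items):
    """Push items through the channel from one thread and collect them in another."""
    channel = RendezvousChannel()
    sender = threading.Thread(target=lambda: [channel.send(i) for i in items])
    sender.start()
    received = [channel.recv() for _ in items]
    sender.join()
    return received
```

Because send() does not return until the acknowledgement arrives, the sender always knows whether a given message has been delivered or has not yet left.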
This feels clunky, but it's what you have to do to rein in an Actor Model system and control where your messages are at any point in time. It's slower because there's more to-ing and fro-ing going on between the peers (in systems like Transputers, this was all done down at the electronic level, so it wasn't an encumbrance on software).
The blocking can be a blessing, if throughput matters. Basically, if you find the sender is being blocked too much, that just means you haven't got enough receivers for the tasks they're performing. Actor Model can deceive, because buffering in the network / actor model implementation can temporarily soak up an excess of messages, adding a bit of latency that goes unnoticed.
Anyway, this way you can have a mechanism whereby the flow of messages is fully managed within the application, and not within the ZeroMQ library. If a client does send a "cancel my last request" message (using the above mechanism to send it), that either arrives before the response has started to be sent, or after the response has already been delivered to the client (using the mechanism above to send it). There is no intermediate state where a response is already on the way, but out of control of the applications.
CSP is a mode that I'd dearly like ZeroMQ to implement natively. It nearly does, in that you can control the socket high water marks. Unfortunately, a high water mark of 0 means "infinite", not zero.
CSP itself is a 1970s idea that saw some popularity and indeed silicon in the 1980s and early 1990s (Inmos, Transputers, Occam, etc.) but has recently made something of a comeback in languages like Rust and Go. There's even a MS-supplied library for .NET that does it too (not that they call it CSP).
The really big benefit of CSP is that it is algebraically analysable - a design can be analysed and proven to be free of deadlock, without having to do any testing. However, with Actor model systems you cannot do that, and testing will not confirm a lack of problems either. Complex, circular message flows in Actor model can easily lead to deadlock, but that might not occur until the network between computers becomes just a tiny bit busier. Deadlock can happen in CSP too, but it's basically guaranteed to happen every time, if the system has accidentally been architected to deadlock. This shows up in testing quite readily (so at least you know early on!).
As I alluded to earlier, CSP also doesn't deceive you into thinking there are enough compute resources in a system. If a sender has a strict schedule to keep, and the recipient(s) aren't keeping up, the sender ends up being blocked trying to send instead of waiting for fresh input. It's easy to detect that the real time requirement has not been met. Whereas with Actor model, the send launches messages off into some buffer, and so long as the receiver(s) on average keep up, all appears to be OK. However, you have no visibility of whether messages are building up inside (in this case) ZeroMQ's own buffers, so there is little notice of a trending problem in the overall system.

How to get data without polling?

This is more of a theoretical question.
Imagine that I have two programs that run simultaneously. The main one only does something when it receives a flag set to true from a secondary program. So the main program has a function that keeps asking the secondary for the value of the flag, and when it gets true, it does something.
What I learned at college is that polling is the simplest way of doing that. But when I started working as a developer, coworkers told me that this method generates some overhead or is a waste of computation, because it asks for a value at regular intervals.
I tried to come up with ideas for doing this differently and searched the internet, but didn't find a useful way to do it.
I read about interrupts and passive approaches in which the main program gets the data only when it is informed by the secondary program. But how does this happen? The main program will need a function to check for the interrupt, right? So won't it end up the same way as before?
What could I do differently?
There is no magic...
No program can guess when there is new information to read. What you can do is choose between two approaches:
A -> asks -> B
A <- is informed <- B
When to use each? It depends on many other factors, such as:
1- How fast must the data be delivered from the moment it is generated? As fast as possible, or can it be held for a while and accumulated?
2- How fast is the data generated?
3- How many simultaneous clients are requesting data from the same server?
4- What type of data are you dealing with? Persistent? Fast-changing?
If you are building something like a stock analyzer where you need to ask for stock prices every second (and they also change every second), the approach you mentioned may be the best.
If you are writing a chat app like WhatsApp, where you need to check whether there is a new message for the client and most of the time there won't be, publish/subscribe may be the best.
But all of this is a very superficial look at a high-impact architecture decision; it is not possible to find the best option by looking at just one factor.
What I want to show is that
"coworkers told me that this method generate some overhead or it's waste of computation"
is not a correct statement in general. It may be true in some particular scenario, but overhead will always exist in distributed systems.
The typical way to prevent polling is by using the Publish/Subscribe pattern.
Your client program will subscribe to the server program and when an event occurs, the server program will publish to all its subscribers for them to handle however they need to.
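As a minimal in-process sketch of that pattern (FlagPublisher, on_flag and demo are hypothetical names, and a real setup would put a socket or message broker between the two programs):

```python
class FlagPublisher:
    """The secondary program's side: keeps a list of callbacks to notify."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, value):
        # Push the new value to every subscriber; nobody has to ask for it.
        for callback in list(self._subscribers):
            callback(value)


def demo():
    actions = []

    def on_flag(value):              # the main program's reaction
        if value:
            actions.append("did something")

    publisher = FlagPublisher()
    publisher.subscribe(on_flag)
    publisher.publish(False)         # ignored by the handler
    publisher.publish(True)          # triggers the action exactly once
    return actions
```

The main program never asks for the flag; its handler runs only when the secondary program has something to report.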
If you flip the order of the requests you end up with something more similar to a standard web API. Your main program (left in your example) would be a server listening for requests. The secondary program would be a client hitting an endpoint on the server to trigger an event.
There are many ways to accomplish this in every language, and it doesn't have to be tied to TCP/IP requests.
I'll add a few links for you shortly.
Well, in most languages you won't implement this at such a low level. But theoretically speaking, there are different waiting strategies, and what you are describing is active waiting. Done carelessly, this can easily eat all of your CPU time.
Most languages provide libraries that let you start a process as a service that waits passively and is triggered when a request comes in.
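To show what passive waiting can look like, here is a small sketch with two threads in one process standing in for the two programs. wait_for_flag and the delay parameter are made up for the example. The waiting side is put to sleep by the runtime and woken when the flag is set, so no loop in the main code keeps checking the value.

```python
import threading
import time


def wait_for_flag(delay=0.05):
    flag = threading.Event()

    def secondary():
        time.sleep(delay)            # pretend to do some work first
        flag.set()                   # inform the waiting side

    worker = threading.Thread(target=secondary)
    worker.start()
    fired = flag.wait(timeout=5.0)   # blocks without spinning in a loop
    worker.join()
    return fired
```

Between separate programs the same role is usually played by a blocking read on a socket or pipe.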

How to design and structure a program that uses Actors

From Joe Armstrong's dissertation, he specified that an Actor-based program should be designed by following three steps. The thing is, I don't understand how the steps map to a real world problem or how to apply them. Here's Joe's original suggestion.
We identify all the truly concurrent activities in our real world activity.
We identify all message channels between the concurrent activities.
We write down all the messages which can flow on the different message channels.
Now we write the program. The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem.
It is extremely important that the mapping is exactly 1:1. The reason for this is that it minimizes the conceptual gap between the problem and the solution. If this mapping is not 1:1 the program will quickly degenerate, and become difficult to understand. This degeneration is often observed when non-CO languages are used to solve concurrent problems. Often the only way to get the program to work is to force several independent activities to be controlled by the same language thread or process. This leads to an inevitable loss of clarity, and makes the programs subject to complex and irreproducible interference errors.
I think #1 is fairly easy to figure out. It's #2 (and 3) where I get lost. To illustrate my frustration I stubbed out a small service available in this gist (Ruby service with callbacks).
Looking at that example service I can see how to answer #1. We have 5 concurrent services.
Start
LoginGateway
LogoutGateway
Stop
Subscribe
Some of those services don't work (or shouldn't) depending on the state the service is in. If the service hasn't been Started, then Login/Logout/Subscribe make no sense. Does this kind of state information have any relevance to Joe's 3 steps?
Anyway, given the example/mock service in that gist, I'm wondering how someone would go about designing a program to wrap this service up in an Actory fashion. I would just like to see a list of guidelines on how to apply Joe's 3 steps. Bonus points for writing some code (any language).
Generally, when structuring an application to use actors you have to identify the concurrent features of your application, which can be tricky to get the hang of. You identify 5 concurrent "services":
Start
LoginGateway
LogoutGateway
Stop
Subscribe
1, 4 and 5 seem to be types of messages that can flow through the system, 2 and 3 I'm not sure how to describe. Your gist is rather large and not super clear to me, but it looks like you've got some kind of message queue system. The actions a User can take are:
Log in to the system
Log out of the system
Subscribe to a Queue of messages
I'll assume logging in and out requires some auth step. I'll assume further that if the user fails the auth step their connection is broken but that creating a connection is not sufficient authentication.
The actions the System takes are:
Handling User actions
Routing messages to subscribers of a Queue
If that's not broadly true, let me know and I'll change this answer. (I'll assume that the messages that get sent to users are not generated by users but are an intrinsic part of the System; maybe we're discussing a monitoring service.) Anyhow, what is concurrent here? A few things:
Users act independently of one another
Queues have separate states
An actor based architecture represents each concurrent entity as its own process. The User is a finite state machine which authenticates, subscribes to a queue, alternately receives messages and subscribes to more queues and eventually disconnects. In Erlang/OTP we'd represent this by a gen_fsm. The User process carries all the state needed to interact with the client which, if we're exposing a service over a network, would be a socket.
Authentication implies that the System is itself a 'process', though, more likely than not it's really a collection of processes which in Erlang/OTP we call an application. I digress. For simplification we'll assume that System is itself a single process which has some well-defined protocol and a state that keeps user credentials. User logins are, then, a well-defined message from a User process to the System process and the response therefrom. If there were no authentication we'd have no need for a System process as the only state related to a User would be a socket.
The careful reader will ask where do we accept socket connections for each User? Ah, good question. There's another concurrent entity not yet mentioned, which we'll call here the Listener. It's another process that only listens for connections, creates a User for each newly established socket and hands over ownership to the new User process, then loops back to listen.
The Queue is also a finite state machine. From its start state it accepts User subscription requests via a well-defined protocol, broadcasts messages to subscribers or accepts unsubscribe requests from User processes. This implies that the Queue has an internal store of User processes, the details of which are very dependent on language and need. In Erlang/OTP, for example, each Queue process would be a gen_server which stored User process ids--or PIDs--in a list and for each message to transmit simply did a multi-send to each User process in the list.
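Outside Erlang the same shape can be approximated by hand. The following is only a sketch in Python, with threads and mailboxes standing in for Erlang processes. Actor, UserActor, QueueActor and the message tuples are all invented for illustration, and authentication, the Listener and supervision are left out.

```python
import queue
import threading


class Actor:
    """A thread that owns its state and handles one mailbox message at a time."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._loop)
        self._thread.start()

    def tell(self, message):
        self.mailbox.put(message)

    def stop(self):
        self.mailbox.put(("stop",))      # handled after everything already queued
        self._thread.join()

    def _loop(self):
        while True:
            message = self.mailbox.get()
            if message[0] == "stop":
                return
            self.handle(message)

    def handle(self, message):
        raise NotImplementedError


class UserActor(Actor):
    def __init__(self, name):
        self.name = name
        self.inbox = []                  # messages delivered to this user
        super().__init__()

    def handle(self, message):
        if message[0] == "deliver":
            self.inbox.append(message[1])


class QueueActor(Actor):
    def __init__(self):
        self.subscribers = []            # touched only by this actor's own thread
        super().__init__()

    def handle(self, message):
        kind = message[0]
        if kind == "subscribe":
            self.subscribers.append(message[1])
        elif kind == "unsubscribe" and message[1] in self.subscribers:
            self.subscribers.remove(message[1])
        elif kind == "broadcast":
            for user in self.subscribers:
                user.tell(("deliver", message[1]))
```

The subscriber list is never shared: other actors can only change it by sending the Queue a message, which is the "protocol hides state modification" point.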
(In Erlang/OTP we'd use supervisors to ensure that processes stay alive and are restarted on death, which greatly reduces the amount of work an Erlang developer has to do to ensure reliability in an actor-based architecture.)
Basically, to restate what Joe wrote, actor based architecture boils down to these points:
identify concurrent entities in the system and represent them in the implementation by processes,
decide how your processes will send messages (a primitive operation in Erlang/OTP, but something that has to be implemented explicitly in C or Ruby) and
create well-defined protocols between entities in the system which hide state modification.
It's been said that the Internet is the world's most successful actor based architecture and, really, that's not far off.

How can I monitor/manage queue in ZeroMQ?

First of all, I'm new to ZeroMQ and message queue systems, so what I'm trying to do may be solved through a different approach. I'm designing a messaging system that does the following:
Multiple clients connect to a broker and send the id of an item that needs to be processed. The client disconnects immediately and does not wait for a response.
The broker sends items to workers, one item per worker, to perform some processing. Each worker returns a signal that the processing was completed.
I have a rudimentary system setup which is processing requests/replies correctly, but I'd also like to be able to do the following:
Query the broker to see how many processes are actually running on the workers and how many are simply waiting to be run.
Have the broker ensure that only one process per id is running - if a duplicate id arrives and that item is not currently being processed by a worker, do not add it to the queue.
I'm using a poll setup with broker/dealer sockets. The code I'm using is very similar to this example from Ian Barber.
My first inclination (although I'm not sure how to implement it in zmq) is to have the broker keep track of the ids that have been received, and those that are actively being processed by workers. It seems that the broker forwards requests to workers immediately, regardless of whether or not they are available to actually run the processing. The workers then queue up the ids and process them in order. This isn't ideal since I'm looking to be able to monitor and control what is going on in the system centrally to achieve reliability.
Anyways, any hints, tips or examples of this type of setup would be greatly appreciated.
ZeroMQ is, in my opinion, best used in broker-less designs, for which the library is designed. If you want to monitor the number of items in a queue, or throughput, or whatever, you're going to have to build that into the application/device/producer yourself. Since you're new to messaging, that could get out of hand real quick. Given this, I'd suggest looking into RabbitMQ (or a similar broker), which would provide these services for you out of the box. If you do adopt RabbitMQ (or rather, AMQP), I'd suggest using a fanout exchange for the scenario you describe above.
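If you do build it into the broker yourself, the bookkeeping part is independent of ZeroMQ and could look roughly like the sketch below. JobTracker and its methods are hypothetical names, and the duplicate policy (reject an id that is already waiting or running) is one possible reading of the requirement, not the only one. The broker would call next_job only when a worker reports itself idle.

```python
from collections import deque


class JobTracker:
    """Tracks which item ids are waiting and which are being processed."""

    def __init__(self):
        self._waiting = deque()
        self._waiting_ids = set()
        self._running = set()

    def submit(self, item_id):
        """Queue an id; return False if it is already waiting or running."""
        if item_id in self._waiting_ids or item_id in self._running:
            return False
        self._waiting.append(item_id)
        self._waiting_ids.add(item_id)
        return True

    def next_job(self):
        """Hand the oldest waiting id to a free worker, or None if idle."""
        if not self._waiting:
            return None
        item_id = self._waiting.popleft()
        self._waiting_ids.discard(item_id)
        self._running.add(item_id)
        return item_id

    def complete(self, item_id):
        """Record the worker's completion signal for this id."""
        self._running.discard(item_id)

    def stats(self):
        return {"waiting": len(self._waiting), "running": len(self._running)}
```

A monitoring query then only needs to return stats().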
The Python library for ZeroMQ seems to come with a pattern for dealing with this: http://zeromq.github.com/pyzmq/devices.html#monitoredqueue

Usage of non-blocking send and blocking receive in MPI?

I am trying to implement a master-worker program.
My master has jobs that the workers are going to do. Every time a worker completes a job, he asks for a new job from the master, and the master sends it to him. The workers are calculating minimal paths. When a worker finds a minimum that is better than the global minimum he got, he sends it to everyone including the master.
I plan for the workers and masters to send data using MPI_ISEND. Also, I think that the receive should be blocking. The master has nothing to do when no one has asked for work or has updated the best result, so he should block waiting for a receive. Also, each worker should, after he has done his work, wait on a receive to get a new one.
Nevertheless, I'm not sure of the impact of using non-blocking asynchronous send, and blocking synchronous receive.
An alternative I think is using MPI_IPROBE, but I'm not sure that this will give me any optimization.
Please help me understand whether what I'm doing is right. Is this the right solution?
You can match blocking sends with nonblocking receives and vice versa, that won't cause any problems. However, if the master really has nothing to do while the workers work, and the workers should block after completing their work unit, then there's no reason for non-blocking communication on that front. The master can post a blocking receive with MPI_ANY_SOURCE, and the workers can just use a blocking send to post back their results, since the matching receive at the master will already have been posted.
So, I'd have Send-Recv for exchanging work units between master and worker, and Isend-Irecv for broadcasting the new global minima.
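The control flow of that split can be sketched without MPI, using threads and queues as stand-ins. This is an analogy only: find_global_min and all the queue names are invented, a blocking get plays the role of a blocking receive, a put on an unbounded queue plays the role of Isend, and get_nowait plays the role of testing a posted Irecv. Each job is assumed to be a non-empty list of path lengths.

```python
import queue
import threading


def find_global_min(jobs, num_workers):
    """jobs: list of non-empty lists of path lengths. Returns the smallest one."""
    master_inbox = queue.Queue()                      # master blocks here
    job_boxes = [queue.Queue() for _ in range(num_workers)]
    min_boxes = [queue.Queue() for _ in range(num_workers)]
    announced = queue.Queue()                         # minima sent to the master

    def worker(rank):
        best = float("inf")
        while True:
            master_inbox.put(rank)                    # ask for work
            job = job_boxes[rank].get()               # block until a job arrives
            if job is None:
                return
            while True:                               # pick up minima from others
                try:
                    best = min(best, min_boxes[rank].get_nowait())
                except queue.Empty:
                    break
            local = min(job)
            if local < best:                          # better than anything seen
                best = local
                for other in range(num_workers):
                    if other != rank:
                        min_boxes[other].put(best)    # fire and forget
                announced.put(best)

    def master():
        pending = list(jobs)
        finished = 0
        while finished < num_workers:
            rank = master_inbox.get()                 # receive from any worker
            if pending:
                job_boxes[rank].put(pending.pop())
            else:
                job_boxes[rank].put(None)             # no work left: tell it to stop
                finished += 1

    threads = [threading.Thread(target=master)]
    threads += [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    best = float("inf")
    while not announced.empty():
        best = min(best, announced.get())
    return best
```

The master never polls: it sleeps in a single blocking receive, and only the minimum updates travel asynchronously.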

Resources