Is MPI_Gather compulsory after MPI_Scatter? - parallel-processing

Is MPI_Gather compulsory after MPI_Scatter, or can we just scatter and leave the data on the nodes?
I scattered a 2D array and counted the evens and odds. The program works fine without a gather.
Since gather only collects the scattered items back, I think it is fine to skip it in my case.

No, it is not compulsory; MPI does not even care. There is no inherent relation between MPI_Scatter and MPI_Gather: they are separate, independent collective operations. Because they behave in opposite ways (scatter distributes data, gather collects it), they are often used one after the other to hand out data and then collect results, but nothing requires that.
Only point-to-point communication needs a matching receive for every send, because point-to-point communication exchanges messages between two specific MPI processes.
Collectives, on the other hand, involve all processes in the communicator, so they do not need a matching counterpart.
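For illustration, a minimal sketch of that pattern using mpi4py (the array shape and the even/odd counting are stand-ins for whatever your program actually does): the rows are scattered, each rank computes locally, and no gather ever happens.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = np.arange(size * 4, dtype='i').reshape(size, 4)  # one row per process
else:
    data = None

row = np.empty(4, dtype='i')
comm.Scatter(data, row, root=0)          # the MPI_Scatter call

evens = int(np.sum(row % 2 == 0))
odds = row.size - evens
print(f"rank {rank}: {evens} evens, {odds} odds")
# No gather: the scattered rows simply stay on the ranks.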


validation_epoch_end with DDP Pytorch Lightning

What is your question?
I am trying to implement a metric which needs access to whole data. So instead of updating the metric in *_step() methods, I am trying to collect the outputs in the *_epoch_end() methods. However, the outputs contain only the output of the partition of the data each device gets. Basically if there are n devices, then each device is getting 1/n of the total outputs.
What's your environment?
OS: Ubuntu
Packaging: conda
Version: 1.0.4
PyTorch: 1.6.0
See the pytorch-lightning manual. I think you are looking for training_step_end/validation_step_end (assuming you are using DP/DDP2).
...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
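As a rough sketch of that pattern (assuming DP/DDP2; the model call, the logits/labels naming and the accuracy metric are made up for illustration):

def validation_step(self, batch, batch_idx):
    x, y = batch
    return {"logits": self(x), "y": y}

def validation_step_end(self, outputs):
    # Under DP/DDP2, `outputs` combines the parts from all sub-batches on this
    # node, so the metric below sees the whole batch rather than a slice of it.
    preds = outputs["logits"].argmax(dim=-1)
    acc = (preds == outputs["y"]).float().mean()
    self.log("val_acc", acc)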
When using the DDP backend, there's a separate process running for every GPU. They don't have access to each other's data, but there are a few special operations (reduce, all_reduce, gather, all_gather) that make the processes synchronize. When you use such operations on a tensor, the processes will wait for each other to reach the same point and combine their values in some way, for example take the sum from every process.
In theory it's possible to gather all data from all processes and then calculate the metric in one process, but this is slow and prone to problems, so you want to minimize the data that you transfer. The easiest approach is to calculate the metric in pieces and then for example take the average. self.log() calls will do this automatically when you use sync_dist=True.
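For example (a minimal sketch; compute_loss is a hypothetical helper on your LightningModule):

def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)             # hypothetical helper
    self.log("val_loss", loss, sync_dist=True)  # averaged across processes
    return loss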
If you don't want to take the average over the GPU processes, it's also possible to update some state variables at each step, and after the epoch synchronize the state variables and calculate your metric from those values. The recommended way is to create a class that uses the Metrics API, which recently moved from PyTorch Lightning to the TorchMetrics project.
If it's not enough to store a set of state variables, you can try to make your metric gather all data from all the processes. Derive your own metric from the Metric base class, overriding the update() and compute() methods. Use add_state("data", default=[], dist_reduce_fx="cat") to create a list where you collect the data that you need for calculating the metric. dist_reduce_fx="cat" will cause the data from different processes to be combined with torch.cat(). Internally it uses torch.distributed.all_gather. The tricky part here is that it assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.
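A minimal sketch of such a metric, assuming TorchMetrics is installed (the class name and the final mean() are placeholders for your actual computation):

import torch
from torchmetrics import Metric

class FullDataMetric(Metric):                      # hypothetical metric
    def __init__(self):
        super().__init__()
        # "cat" makes the lists from all processes get concatenated
        # (internally via torch.distributed.all_gather) when syncing.
        self.add_state("data", default=[], dist_reduce_fx="cat")

    def update(self, values: torch.Tensor) -> None:
        # Tensors appended here must have matching shapes across processes,
        # otherwise the all_gather will hang.
        self.data.append(values)

    def compute(self) -> torch.Tensor:
        # Before sync the state is still a list; after sync it may already be
        # a single concatenated tensor.
        data = torch.cat(self.data) if isinstance(self.data, list) else self.data
        return data.mean()                         # placeholder for the real metric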

Using data structures and algorithms, other than queue and FIFO, for handling and prioritizing requests

Using a queue with a first-in-first-out policy seems like the standard way of dealing with requests to servers/databases/services. But could other data structures and algorithms be used for handling large quantities of requests?
There is an article, "Scaling Synchronization in Multicore Programs" (http://queue.acm.org/detail.cfm?id=2991130), which goes into detail about reducing locking costs when multiple cores try to update the same FIFO queue.
If different users or different requests have different priorities then a FIFO queue won't do, and you need something that takes account of the priority. I have often seen ordered maps used for this, such as TreeMap and std::map or std::multimap. It's not the use case these are designed for, but people know how to use them, and they are typically highly optimised, so they can be faster than most people's attempt at implementing a more specialised data structure.
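As a rough illustration of a priority-aware alternative to a plain FIFO, here is a sketch using a binary heap (Python's heapq) rather than an ordered map; the request values and priority scheme are made up:

import heapq
import itertools

class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: FIFO within equal priorities

    def push(self, request, priority):
        # Lower number = higher priority; the counter avoids comparing requests.
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def pop(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = PriorityRequestQueue()
q.push("batch report", priority=5)
q.push("interactive user request", priority=1)
print(q.pop())   # -> "interactive user request"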

What are the differences between sequential consistency and quiescent consistency?

Can anyone explain the definitions of, and differences between, sequential consistency and quiescent consistency? In the simplest terms possible :|
I did read this: Example of execution which is sequentially consistent but not quiescently consistent
But I am still not able to understand sequential and quiescent consistency themselves :(
Sequential consistency requires that operations appear to take effect in the order they are specified in each program. Basically, it enforces program order within each individual process and requires that all processes observe a single, common order of operations. Let's say we have 2 processes enqueuing and dequeuing items on a queue q:
P1 -- q.enq(x) -----------------------------
P2 -------------- q.enq(y) ---- q.deq():y --
This is not the expected behaviour from a FIFO queue. We'd expect to dequeue x, because P1 enqueues x before P2 enqueues y. However, this scenario is allowed under sequential consistency, because sequential consistency does not require the common order to respect real-time order. There is at least one sequential execution that can explain these results, for example:
P2:q.enq(y) P1:q.enq(x) P2:q.deq():y
In this execution each process executes its operations in program order, i.e. in the order they are specified in that process's program.
Quiescent consistency requires non-overlapping operations to appear to take effect in their real-time order, while overlapping operations may be reordered. Therefore the same scenario is not allowed under quiescent consistency, because we expect q.enq(x) to take effect before q.enq(y) and q.deq() to return x instead of y. Quiescent consistency also does not necessarily preserve program order: if q.enq(x) and q.enq(y) were concurrent (overlapping) operations, they could be reordered, and q.deq():y would be quiescently consistent.
Basically some executions are sequentially consistent but not quiescently consistent, and vice versa.
First you should understand what program order is: it is simply the order in which the instructions appear in your program, i.e. the order you expect them to run in.
But program order is only defined for a single thread. With multiple threads, problems arise because a single global order may not hold, or may not even exist, since you sometimes cannot tell which thread's method call happens first.
Quiescent consistency constrains method calls that are separated by a period of quiescence (a period with no pending calls): such non-overlapping calls must appear to take effect in their real-time order, while overlapping calls may be reordered.
Sequential consistency allows overlaps, but requires that all method calls can be arranged into some single sequential order that is consistent with each thread's program order and in which every call returns a correct value.

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
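To make the structure concrete, here is a rough Python sketch of that controller state (names are illustrative; the ZeroMQ plumbing, the ancillary-table details and error handling are omitted, and it is assumed the worker reports each new job together with its category):

from collections import defaultdict, deque

queues = defaultdict(deque)     # category -> queue of job strings
outstanding = {}                # worker id -> (category, job) currently running

def hand_out_job(worker_id, preferred_category):
    category = preferred_category if queues[preferred_category] else next(
        (c for c, q in queues.items() if q), None)
    if category is None:
        return None                         # nothing to hand out right now
    job = queues[category].popleft()
    outstanding[worker_id] = (category, job)
    return job

def worker_crashed(worker_id):
    category, job = outstanding.pop(worker_id)
    queues[category].appendleft(job)        # put the job back on its queue

def worker_finished(worker_id, new_jobs):
    outstanding.pop(worker_id, None)
    for category, job in new_jobs:          # assumed: (category, job) pairs
        queues[category].append(job)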
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
Ultimately, it all boils down to how you define the efficiency needed on the part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
Here's how I'd evaluate each of your options:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory pressure, some form of persistent storage seems a reasonable choice. Caching comes into play only if there is a repeatable access pattern, say the same queue being accessed over and over again -- otherwise caching is unlikely to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
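For instance, a minimal sketch of mapping the hash-of-queues onto MongoDB with pymongo, one document per category with the jobs array acting as the queue (database and collection names are made up):

from pymongo import MongoClient, ReturnDocument

client = MongoClient("mongodb://localhost:27017")
queues = client.controller_db.queues        # hypothetical database/collection

def push_job(category, job):
    queues.update_one({"_id": category}, {"$push": {"jobs": job}}, upsert=True)

def pop_job(category):
    # Atomically remove the first job; the pre-update document tells us what it was.
    doc = queues.find_one_and_update(
        {"_id": category, "jobs.0": {"$exists": True}},
        {"$pop": {"jobs": -1}},
        return_document=ReturnDocument.BEFORE,
    )
    return doc["jobs"][0] if doc else None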
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory-hog problem unless you set up a cluster of machines that each hold a part of the overall data. That makes sense only if keeping all data in memory is required by low response-time requirements; yet, given that your jobs take tens of seconds to complete, response times towards the workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially, using either client-side consistent hashing or the cluster level, and thus also supports scalability scenarios.
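If you do go that way, the mapping is indeed trivial -- a minimal sketch with redis-py, one Redis list per category (key naming is an assumption):

import redis

r = redis.Redis(host="localhost", port=6379)

def push_job(category, job):
    r.rpush(f"jobs:{category}", job)        # append to the category's list

def pop_job(category):
    job = r.lpop(f"jobs:{category}")        # returns None if the list is empty
    return job.decode() if job else None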
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
Requirements:
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/
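A minimal sketch of publishing a persistent job message with pika (queue name and message body are placeholders):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="jobs", durable=True)      # queue survives broker restarts

channel.basic_publish(
    exchange="",
    routing_key="jobs",
    body="job-description-string",
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
connection.close()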

What distributed message queues support millions of queues?

I'm looking for a distributed message queue that will support millions of queues, with each queue handling tens of messages per second.
The messages will be small (tens of bytes), and I don't expect the queues to get very long--on the order of tens of messages per queue at maximum, but when the system is humming along, the queues should stay fairly empty.
I'm not sure how many nodes to expect in the cluster--probably depends on the specific solution, but if I had to guess, I would say ten nodes. I would prefer that queues were relatively resilient to individual node failures within the cluster, but a few lost messages here and there won't make me lose sleep.
Does such a message queue exist? Seems like most of the field is optimized toward handling hundreds of queues with high throughput. But what is SQS built on? Surely not magic.
Update:
By request, it may indeed help to shed light on my problem domain. (I'd left details out before so as not to muddy the waters.) I'm experimenting with distributed cellular automata, with an initial target of a million cells in simulation. In some CA models, it's useful to add an event model, so that a cell can send events to its neighbors. Hence, a million queues, each with one consumer and 8 or so producers.
Costs are a concern for now, as I'm funding the experiments myself. (Thus Amazon's SQS is probably out of reach.)
From your description, it looks like OMG's Data Distribution Service could be a good fit. It is related to message queueing technologies, but I would rather call it a distributed data management infrastructure. It is completely distributed and supports advanced features that give you a lot of control over how the data is distributed, by means of a rich set of Quality of Service settings.
Not knowing much about your problem, I could guess what an approach might be. DDS is about distributing the state of strongly-typed data-items, as structures with typed attributes. You could create a data-type describing the state of an automaton. One of its attributes could be an ID uniquely identifying the automaton in the system. If possible, that would be assigned according to a scheme such that every automaton knows what the IDs of its neighbors are (if they are present). Each automaton would publish its state as needed, resulting in a distributed data-space containing the current state of all automatons. DDS supports so-called partitioning of that data-space. If you took advantage of that, then each of the nodes in your cluster would be responsible for a well-defined subset of all automatons. Communication over the wire would only happen for those automatons neighboring a different partition. Since automatons know the IDs of their neighbors, they would be able to query the data-space for the states of the automatons they are interested in.
It is a bit hard to explain without a whiteboard, but the end result would be a single instance (instances are a sort of very lightweight message queue) for most automatons, and two or three instances for those automatons at the border of a partition. If you had ten nodes and one million automata, then each node would have to hold administration for approximately one hundred thousand automata. I have seen systems of that scale, and larger, built with DDS, with tens of updates per second for each instance. The nice thing is that this technology scales well with the number of nodes, so you could bring down the resource load per node by adding more nodes.
If this is a research project, then you might even be able to use a commercial product without charge. Just google for "dds research license".
